Journal of Animal Reproduction and Biotechnology 2024; 39(4): 267-277
Published online December 31, 2024
https://doi.org/10.12750/JARB.39.4.267
Copyright © The Korean Society of Animal Reproduction and Biotechnology.
Vincent Jaehyun Shim1, Hosup Shim2 and Sangho Roh1,*
1Cellular Reprogramming and Embryo Biotechnology Laboratory, Dental Research Institute, Seoul National University School of Dentistry, Seoul 08826, Korea
2Department of Nanobiomedical Science, Dankook University, Cheonan 31116, Korea
Correspondence to: Sangho Roh
E-mail: sangho@snu.ac.kr
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: Evaluating embryo quality is crucial for the success of in vitro fertilization procedures. Traditional methods, such as the Gardner grading system, rely on subjective human assessment of morphological features, leading to potential inconsistencies and errors. Artificial intelligence-powered grading systems offer a more objective and consistent approach by reducing human biases and enhancing accuracy and reliability.
Methods: We evaluated the performance of five convolutional neural network architectures (EfficientNet-B0, InceptionV3, ResNet18, ResNet50, and VGG16) in grading blastocysts into five quality classes using only embryo images, without incorporating clinical or patient data. Transfer learning was applied to adapt pretrained models to our dataset, and data augmentation techniques were employed to improve model generalizability and address class imbalance.
Results: EfficientNet-B0 outperformed the other architectures, achieving the highest accuracy, area under the receiver operating characteristic curve, and F1-score. Gradient-weighted Class Activation Mapping was used to interpret the models’ decision-making processes, revealing that the most successful models predominantly focused on the inner cell mass, a critical determinant of embryo quality.
Conclusions: Convolutional neural networks, particularly EfficientNet-B0, can significantly enhance the reliability and consistency of embryo grading in in vitro fertilization procedures by providing objective assessments based solely on embryo images. This approach offers a promising alternative to traditional subjective morphological evaluations.
Keywords: blastocyst, convolutional neural networks, deep learning, embryo, in vitro fertilization
Conventionally, embryo evaluation is performed by embryologists using morphological criteria, with the Gardner grading system being one of the most widely adopted methods (Alpha Scientists in Reproductive Medicine and ESHRE Special Interest Group of Embryology, 2011). This system assesses blastocyst quality based on parameters such as blastocyst expansion, inner cell mass (ICM) quality, and trophectoderm (TE) appearance, assigning grades that correlate with implantation potential (Gardner et al., 2000). While this system provides a standardized framework, the manual evaluation process is inherently subjective and prone to inter-observer variability, which can affect clinical outcomes (Baxter Bendus et al., 2006; Paternot et al., 2011).
Advancements in AI and deep learning offer promising avenues to enhance the objectivity and consistency of embryo grading (LeCun, Bengio, and Hinton, 2015). Recent studies have demonstrated that deep learning models can enable robust assessment and selection of human blastocysts after IVF, providing more objective and consistent evaluations compared to traditional morphological analysis (Khosravi et al., 2019). Convolutional neural networks (CNNs), a subset of deep learning models particularly effective in image recognition tasks, have demonstrated superior performance in various medical imaging applications (Esteva et al., 2017). In fields such as radiology, pathology, and dermatology, CNNs have achieved expert-level accuracy in image classification, segmentation, and detection tasks (Gulshan et al., 2016).
In the context of embryology, the application of CNNs for automated embryo assessment is an emerging area of research (VerMilyea et al., 2020). Previous research has predominantly focused on predicting embryo viability rather than capturing the finer distinctions in embryo quality (Manna et al., 2013; Tran et al., 2019; Zaninovic and Rosenwaks, 2020). Although CNN-based morphological classification has been demonstrated using relatively limited datasets (Thirumalaraju et al., 2021), and comprehensive reviews have highlighted a variety of deep learning methods for embryo selection, encompassing both clinical and image-based analyses (Salih et al., 2023), there remains a need for more detailed grading strategies. In response, our study introduces a five-class grading system derived solely from embryo images, providing a more granular representation of quality. By achieving greater accuracy than previously reported image-based approaches, this work may yield more clinically actionable insights and better reflect the nuanced conditions encountered in actual practice.
To achieve this, we performed a comparative analysis of five widely recognized CNN architectures: VGG16 (Simonyan and Zisserman, 2014), ResNet18 and ResNet50 (He et al., 2015), InceptionV3 (Szegedy et al., 2015), and EfficientNet-B0 (Tan and Le, 2019). These models differ in depth, width, and complexity, affecting their capacity to learn hierarchical representations from image data (Rawat and Wang, 2017).
Furthermore, to interpret and visualize the features learned by CNNs, techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) have been employed in medical imaging to highlight important regions contributing to model predictions (Selvaraju et al., 2017). In embryo assessment, applying Grad-CAM can help identify specific morphological features that are most influential in grading decisions, thereby enhancing the transparency and interpretability of deep learning models.
In this study, we focused solely on embryo image data without incorporating patient-specific clinical information. While this approach may be seen as a limitation, it allows us to isolate the performance of the CNN architectures in capturing morphological cues directly from embryo images. By leveraging transfer learning and data augmentation to address class imbalance, our analysis identifies a model that not only outperforms others in this challenging five-class grading task but also surpasses accuracy levels reported in previous studies. This work provides a clearer understanding of the relative strengths and weaknesses of different CNN architectures in embryo assessment, offering guidance for future research and potential clinical implementation.
The dataset utilized in this study was obtained from the publicly available Embryo Dataset on Kaggle, accessible at https://www.kaggle.com/datasets/bitanasiri/embryo-dataset. This dataset comprises 14,640 high-resolution images of human embryos, categorized into five classes corresponding to different embryo quality grades based on morphological assessment. The dataset contains only embryo images and does not include any clinical or patient data.
To systematically evaluate model performance across various architectural strategies, we included five different CNN architectures: VGG16, ResNet18, ResNet50, InceptionV3, and EfficientNet-B0. These models represent a spectrum of design philosophies, layer configurations, and complexity levels. VGG16 provides a straightforward, deep convolutional baseline; ResNet18 and ResNet50 incorporate residual connections to facilitate the training of deeper networks; InceptionV3 leverages inception modules for efficiently capturing multi-scale features; and EfficientNet-B0 employs compound scaling to balance depth, width, and resolution for enhanced computational efficiency. By encompassing this range of architectures, we aimed to determine which structural characteristics would yield the most reliable and accurate results in the context of five-class embryo grading.
To enhance the generalization capabilities of the models and prevent overfitting, extensive data augmentation techniques were applied to the training dataset. Images were resized to the input size required by each model: 244 × 244 pixels, or 299 × 299 pixels for architectures such as InceptionV3. The augmentation process included random horizontal flipping with a probability of 0.5, random rotations within a range of ±10 degrees, and random resized cropping with a scale range of 80% to 100% of the original size. Color jitter adjustments were applied to modify brightness, contrast, saturation, and hue, introducing variability in color channels. After augmentation, images were converted to tensors and normalized using the mean and standard deviation values of the ImageNet dataset ([0.485, 0.456, 0.406] for mean and [0.229, 0.224, 0.225] for standard deviation) to align with the expectations of pre-trained models.
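The normalization step can be illustrated with a minimal pure-Python sketch. It assumes pixel channels are already scaled to [0, 1] (as they are after tensor conversion) and uses the ImageNet statistics quoted above; the function names are hypothetical and are not taken from the study's code.

```python
# ImageNet channel statistics quoted in the text (RGB order).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((value - m) / s for value, m, s in zip(rgb, mean, std))


def denormalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Invert normalize_pixel, e.g. before displaying an image or heatmap overlay."""
    return tuple(value * s + m for value, m, s in zip(rgb, mean, std))


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel(IMAGENET_MEAN))
```

Applying the same statistics to the training and validation images keeps the inputs in the range the pre-trained weights were fitted on.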
The validation dataset underwent a consistent set of transformations without augmentation to provide an unbiased evaluation. Images were resized and center-cropped to 299 × 299 pixels, converted to tensors, and normalized using the same parameters as the training set.
The dataset was split into training and validation sets using an 80:20 ratio. The training set comprised 80% of the total images, while the validation set consisted of the remaining 20%. This split ensured that models were evaluated on unseen data, providing an accurate assessment of their generalization capabilities.
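As a rough sketch of the 80:20 split, the following shuffles image indices and cuts them at the 80% mark. The fixed seed and the plain (non-stratified) random split are assumptions made for illustration; the text does not state how the split was randomized.

```python
import random


def split_indices(n_items, train_fraction=0.8, seed=0):
    """Shuffle item indices and split them into train and validation lists."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = int(round(n_items * train_fraction))
    return indices[:n_train], indices[n_train:]


# 14,640 images is the dataset size reported in the text.
train_idx, val_idx = split_indices(14640)
print(len(train_idx), len(val_idx))  # prints: 11712 2928
```

With class imbalance, a stratified split that preserves per-class proportions may be preferable; that variant is not shown here.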
All models were trained using the PyTorch framework, leveraging its flexibility and support for dynamic computational graphs. The Adam optimizer was employed for all models to adaptively adjust learning rates during training, with a consistent learning rate of 0.0001 set for all models to ensure stable convergence. This learning rate was chosen after preliminary experiments with values in the range of 1e-3 to 1e-5, and 1e-4 provided the best balance between convergence speed and stability. The Cross-Entropy Loss function was used, suitable for multi-class classification tasks. A batch size of 32 was used to balance computational efficiency with sufficient gradient diversity. This batch size was selected based on commonly used practices in image-based classification tasks, as well as GPU memory constraints noted during initial trial runs. Each model was trained for 10 epochs. While a larger number of epochs was tested (e.g., 20 and 30), early performance plateaus suggested that 10 epochs were sufficient to achieve stable model performance without overfitting. During training, models were set to training mode, enabling layers such as dropout and batch normalization. For InceptionV3, due to the presence of auxiliary outputs, the total loss was calculated as a weighted sum of the main loss and the auxiliary loss, with weights of 1.0 and 0.4, respectively. Gradient backpropagation was performed during the training phase to update model weights. Models were set to evaluation mode during validation to disable layers that could alter the data.
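The weighted InceptionV3 loss described above can be sketched without a framework. The per-sample cross-entropy below stands in for the framework's loss function, and the function names are hypothetical; only the weights 1.0 and 0.4 come from the text.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample from raw logits (log-sum-exp form)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def inception_loss(main_logits, aux_logits, target, main_weight=1.0, aux_weight=0.4):
    """Weighted sum of the main and auxiliary classifier losses."""
    main = cross_entropy(main_logits, target)
    aux = cross_entropy(aux_logits, target)
    return main_weight * main + aux_weight * aux


# Five-class example with hypothetical logits for a sample of true class 2.
print(inception_loss([0.1, 0.3, 2.0, -0.5, 0.0], [0.0, 0.2, 1.0, 0.1, -0.2], target=2))
```

The auxiliary term only applies during training; at validation time only the main output is scored.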
Models were evaluated based on their performance on the validation dataset using several metrics. Accuracy was calculated as the proportion of correctly classified samples over the total number of samples. The Cross-Entropy Loss was computed over the validation dataset. A classification report was generated, including precision, recall, F1-score, and support for each class, providing insights into the models’ performance on individual classes. A confusion matrix was constructed to illustrate the models’ ability to correctly predict each class and to identify where misclassifications occurred. Receiver operating characteristic (ROC) curves were plotted for each class by binarizing the labels, and the area under the curve (AUC) scores were calculated to quantify the models’ ability to distinguish between classes.
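To make the reported metrics concrete, the sketch below derives accuracy and support-weighted precision, recall, and F1 from a confusion matrix. The three-class matrix is a toy example, not data from the study, and the function name is hypothetical.

```python
def weighted_metrics(confusion):
    """Return accuracy and support-weighted precision, recall, and F1.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n_classes))
    precision = recall = f1 = 0.0
    for c in range(n_classes):
        support = sum(confusion[c])  # number of true samples of class c
        predicted = sum(confusion[r][c] for r in range(n_classes))
        p = confusion[c][c] / predicted if predicted else 0.0
        r = confusion[c][c] / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


toy = [[8, 2, 0], [1, 7, 2], [0, 1, 9]]  # hypothetical 3-class confusion matrix
print(weighted_metrics(toy))
```

Support-weighted recall is algebraically equal to overall accuracy, which is consistent with the matching accuracy and recall values reported for each model.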
To interpret the models and understand the regions of the images that contributed to the predictions, Grad-CAM was employed. For each model, five random images from the validation set were selected. Hooks were registered on the last convolutional layers of the models to capture gradients and activations. Grad-CAM heatmaps were computed to visualize the important regions influencing the models’ decisions.
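The core Grad-CAM computation can be sketched in plain Python, assuming the hooks have already captured one layer's activations and the gradients of the target class score with respect to them. Nested lists stand in for tensors here, and the function name is hypothetical.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from one layer's activations and gradients.

    Both arguments are nested lists shaped [channels][height][width]; the
    gradients are those of the target class score w.r.t. the activations.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = global average of that channel's gradient map.
    weights = [sum(map(sum, grad)) / (height * width) for grad in gradients]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        [
            max(0.0, sum(w * act[y][x] for w, act in zip(weights, activations)))
            for x in range(width)
        ]
        for y in range(height)
    ]
    peak = max(max(row) for row in cam)
    # Scale to [0, 1] for display; an all-zero map is returned unchanged.
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


# Two hypothetical 2x2 feature maps and their gradients.
acts = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
print(grad_cam(acts, grads))  # prints: [[0.5, 0.0], [0.0, 1.0]]
```

In practice the low-resolution map is upsampled to the input image size and overlaid as a color heatmap.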
We evaluated five CNN architectures–EfficientNet-B0, InceptionV3, ResNet18, ResNet50, and VGG16–for their effectiveness in classifying embryo images into five distinct quality grades. Key performance metrics such as accuracy, AUC, precision, recall, and F1-score were computed for each model (Table 1).
Table 1. Embryo classification performance metrics for different CNN models
Model | Accuracy (%) | AUC | Precision | Recall | F1-score |
---|---|---|---|---|---|
EfficientNet-B0 | 88.90 | 0.98 | 0.89 | 0.89 | 0.89 |
InceptionV3 | 85.52 | 0.97 | 0.86 | 0.86 | 0.86 |
ResNet18 | 78.24 | 0.95 | 0.79 | 0.78 | 0.78 |
ResNet50 | 84.63 | 0.97 | 0.85 | 0.85 | 0.84 |
VGG16 | 75.44 | 0.96 | 0.80 | 0.75 | 0.75 |
Comparing the accuracy across the five models, EfficientNet-B0 achieved the highest accuracy of 88.9%, followed by InceptionV3 at 85.52%, and ResNet50 at 84.63%. ResNet18 and VGG16 showed lower accuracies of 78.24% and 75.44%, respectively.
In terms of AUC scores, EfficientNet-B0 again led with an AUC of 0.98, while InceptionV3 and ResNet50 both achieved AUCs of 0.97. VGG16 and ResNet18 had AUC values of 0.96 and 0.95, respectively.
For the weighted average precision, EfficientNet-B0 attained the highest score at 0.89, followed by InceptionV3 at 0.86, and ResNet50 at 0.85. VGG16 and ResNet18 had lower precision scores of 0.80 and 0.79, respectively.
Regarding weighted average recall, EfficientNet-B0 again had the highest score at 0.89, with InceptionV3 and ResNet50 following at 0.86 and 0.85. ResNet18 and VGG16 exhibited lower recall values of 0.78 and 0.75, respectively.
Comparing the weighted average F1-scores, EfficientNet-B0 achieved the highest value of 0.89, indicating balanced precision and recall. InceptionV3 and ResNet50 had F1-scores of 0.86 and 0.84, respectively, while ResNet18 and VGG16 had lower F1-scores of 0.78 and 0.75.
These results indicate that EfficientNet-B0 outperformed all other models across all evaluation metrics. It achieved the highest accuracy and AUC, as well as the highest weighted average precision, recall, and F1-score, demonstrating superior performance in classifying embryo quality. InceptionV3 and ResNet50 also showed strong but slightly lower performance, whereas ResNet18 and VGG16 were less effective in this classification task.
CNNs generally require substantial amounts of annotated image data to accurately learn features and differentiate between categories in complex classification tasks. Due to the scarcity of high-quality medical imaging datasets, we employed transfer learning by initializing our networks with pre-trained ImageNet weights. Five established CNN architectures–EfficientNet-B0, InceptionV3, ResNet50, ResNet18, and VGG16–were retrained using our dataset of 14,640 embryo images. The dataset was divided into training and testing sets using an 80:20 ratio, resulting in 11,712 images for training and 2,928 images for testing. All models were trained over 10 epochs with early stopping rules based on the lowest validation loss to minimize overfitting.
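One common way to implement early stopping on the lowest validation loss is sketched below. The `patience` parameter and the loss values are assumptions for illustration; the text does not specify the exact stopping rule.

```python
def best_epoch_with_patience(val_losses, patience=3):
    """Return (best_epoch, stop_epoch) using the lowest validation loss.

    Epochs are 1-indexed. Training stops once `patience` consecutive
    epochs pass without a new lowest validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        stop_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch  # a checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, stop_epoch


# Hypothetical per-epoch validation losses.
losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.66, 0.50]
print(best_epoch_with_patience(losses))  # prints: (3, 6)
```

The weights saved at `best_epoch` would then be the ones used for evaluation.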
After training, we compared the validation losses and accuracies achieved by each network. EfficientNet-B0 achieved the lowest mean validation loss of 0.46 ± 0.083 and the highest validation accuracy of 84.99% ± 5.207%, indicating superior performance in classifying embryo quality grades (Table 2). InceptionV3 and ResNet50 also demonstrated strong performance, with validation losses of 0.60 ± 0.075 and 0.65 ± 0.078, and validation accuracies of 82.31% ± 3.653% and 81.59% ± 3.333%, respectively. In contrast, ResNet18 and VGG16 exhibited higher validation losses of 0.85 ± 0.218 and 0.96 ± 0.258, and lower validation accuracies of 77.61% ± 3.664% and 75.21% ± 5.922%, respectively.
The training and validation accuracy and loss curves illustrate the learning behavior of each model over the training epochs (Fig. 1). In the validation accuracy curves (Fig. 1A), EfficientNet-B0 consistently achieved higher validation accuracy throughout the 10 epochs, steadily increasing to approximately 86%. This indicates effective learning and better generalization to unseen data. In contrast, other models such as InceptionV3, ResNet18, ResNet50, and VGG16 showed fluctuations or decreases in validation accuracy during the training process. InceptionV3 initially increased in accuracy but exhibited variability after epoch 4, reaching a maximum of about 83%. ResNet18 and VGG16 reached lower accuracies around 80%, with more pronounced fluctuations, suggesting less stable learning. The training and validation loss curves of each model (Fig. 1B) further support these observations. EfficientNet-B0 demonstrated a rapid decrease in training loss and maintained a relatively low validation loss throughout the epochs, indicating efficient learning without overfitting. In contrast, models like InceptionV3, ResNet50, and VGG16 showed increasing validation loss after the initial epochs despite decreasing training loss, suggesting potential overfitting where the model becomes too tailored to the training data, leading to reduced performance on validation data. ResNet18 exhibited higher validation loss with significant fluctuations, indicating challenges in capturing the complex features necessary for accurate embryo quality classification. These results highlight that EfficientNet-B0 not only achieves higher validation accuracy but also maintains lower validation loss compared to the other models, confirming its superior performance and generalization capability for the embryo quality classification task. 
The consistent improvement in validation accuracy and the stable decrease in validation loss suggest that EfficientNet-B0 effectively learns relevant features without overfitting.
To evaluate the discriminative ability of each model across the five embryo quality classes, we generated class-specific ROC curves and calculated the AUC for each model. The ROC curves plot the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings, providing insight into the models’ ability to distinguish between classes (Fig. 2).
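A class-specific AUC can be computed without plotting the curve, using the equivalence between the area under the ROC curve and the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The sketch below applies this in a one-vs-rest fashion; the function names and the example scores are hypothetical.

```python
def binary_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def one_vs_rest_auc(probabilities, true_classes, n_classes):
    """Binarize labels per class and compute one AUC per class."""
    aucs = []
    for c in range(n_classes):
        scores = [row[c] for row in probabilities]
        labels = [1 if y == c else 0 for y in true_classes]
        aucs.append(binary_auc(scores, labels))
    return aucs


print(binary_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # prints: 0.75
```

The pairwise form is quadratic in the number of samples, which is acceptable for a sketch; rank-based implementations are preferable for large validation sets.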
EfficientNet-B0 demonstrated exceptional performance, achieving AUC values of 0.99 for Classes 1, 2, and 5, and AUC values of 0.98 for the remaining classes (Table 2). This indicates that EfficientNet-B0 has a high discriminative capacity across all embryo quality grades, consistently distinguishing between different classes with great accuracy.
Table 2. Validation accuracies and losses of CNNs
Architectures | Validation accuracies (%) | Validation losses |
---|---|---|
EfficientNet-B0 | 84.99 ± 5.207 | 0.46 ± 0.083 |
InceptionV3 | 82.31 ± 3.653 | 0.60 ± 0.075 |
ResNet18 | 77.61 ± 3.664 | 0.85 ± 0.218 |
ResNet50 | 81.59 ± 3.333 | 0.65 ± 0.078 |
VGG16 | 75.21 ± 5.922 | 0.96 ± 0.258 |
InceptionV3 also exhibited strong performance, with AUC values of 0.99 for Class 1 and ranging from 0.96 to 0.98 for the other classes. While slightly lower than EfficientNet-B0 in some classes, InceptionV3 still maintains high discriminative ability across the board.
In contrast, ResNet18 showed lower AUC values, particularly for Class 4, where it achieved an AUC of 0.88. Its AUC values across all classes ranged from 0.88 to 0.98, indicating less consistent performance and reduced ability to accurately classify certain embryo quality grades.
Similarly, ResNet50 achieved AUC values of 0.95 or higher for most classes but recorded a lower AUC of 0.92 for Class 4. VGG16 displayed a comparable pattern, with an AUC of 0.91 for Class 4 and AUC values between 0.94 and 0.98 for the other classes.
These results highlight that EfficientNet-B0 outperforms the other models in terms of discriminative ability, consistently achieving higher AUC values across all classes. The lower AUC values for Class 4 in ResNet18 and VGG16 suggest that these models struggle to differentiate embryos of this quality grade, potentially due to insufficient feature extraction or model complexity limitations.
To gain deeper insights into the classification performance and identify patterns of misclassification, we analyzed the confusion matrices for the five CNN architectures (Fig. 3). The confusion matrices display correct predictions along the diagonal and misclassifications as off-diagonal elements, providing detailed information on how each model predicts the embryo quality classes. These matrices were generated using the 2,928 images from the testing set, obtained by splitting the dataset into training and testing sets with an 80:20 ratio. By evaluating how each model assigned grades to the actual images in the testing set, we were able to assess their classification behaviors and gain insights into their classification performance.
EfficientNet-B0 and InceptionV3 demonstrated exceptional classification performance. Their confusion matrices exhibited strong diagonals with minimal off-diagonal entries, indicating accurate classification across all embryo quality grades. Misclassifications were rare and primarily occurred between adjacent classes, such as misclassifying Class 3 embryos as Class 4. This suggests that these models effectively capture the subtle morphological differences between embryo quality grades, leading to high precision and reliability in embryo assessment.
In contrast, ResNet18, ResNet50, and VGG16 showed higher rates of misclassification. Their confusion matrices revealed more off-diagonal entries, indicating frequent misclassifications not only between adjacent classes but also between non-adjacent classes. For instance, these models often confused Class 4 embryos with Class 2 or Class 5, suggesting difficulty in distinguishing embryos with subtle or less pronounced morphological features. The widespread misclassifications imply that these models may struggle with feature extraction and interpretation necessary for accurate embryo quality classification.
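The distinction between adjacent and non-adjacent misclassifications can be quantified directly from a confusion matrix. The sketch below assumes that class indices are ordered by quality grade, so that classes differing by one index count as adjacent; the labels are hypothetical and the function names are not from the study.

```python
def confusion_matrix(true_classes, predicted_classes, n_classes=5):
    """Count samples by (true class, predicted class); classes are 0-indexed."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_classes, predicted_classes):
        matrix[t][p] += 1
    return matrix


def error_breakdown(matrix):
    """Split off-diagonal counts into adjacent (|i - j| == 1) and distant errors."""
    adjacent = distant = 0
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i == j:
                continue  # diagonal entries are correct predictions
            if abs(i - j) == 1:
                adjacent += count
            else:
                distant += count
    return adjacent, distant


# Hypothetical labels: one adjacent error (2 -> 3) and one distant error (3 -> 1).
true = [0, 1, 2, 2, 3, 3, 4]
pred = [0, 1, 2, 3, 3, 1, 4]
print(error_breakdown(confusion_matrix(true, pred)))  # prints: (1, 1)
```

A high share of distant errors would correspond to the pattern described for the weaker models.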
Overall, the confusion matrix analysis highlights that EfficientNet-B0 and InceptionV3 outperform the other models, exhibiting more accurate and consistent classifications. Their superior performance underscores the importance of selecting appropriate CNN architectures that can effectively generalize and capture critical features in medical image classification tasks.
The analysis of Grad-CAM heatmaps revealed a correlation between the regions of focus in the models and their predictive performance (Fig. 4). Models with higher accuracy–EfficientNet-B0, InceptionV3, and ResNet50–primarily concentrated on the ICM, as indicated by the red to yellow regions in the heatmaps. The ICM is a critical structure in embryo development, and its morphology is a key determinant of embryo quality. The models’ emphasis on the ICM suggests that they effectively learned to identify and prioritize biologically relevant features important for embryo viability.
In contrast, models with lower predictive performance, such as ResNet18 and VGG16, showed greater focus on the TE or other less critical regions instead of the ICM. Their Grad-CAM heatmaps displayed red to yellow activations in areas outside the ICM, indicating that these models may not be effectively capturing the essential features necessary for accurate embryo quality assessment. This misdirected attention could contribute to their reduced classification accuracy.
Furthermore, all models exhibited minimal focus on the overall blastocyst size, as evidenced by the green to blue areas in the heatmaps, which represent lower activation levels. This suggests that blastocyst size was not a significant factor in the models’ decision-making processes. While blastocyst size is a morphological characteristic considered during manual assessments, the models prioritized structural features of the ICM and TE over size metrics. This aligns with the understanding that the quality and viability of an embryo are more closely associated with the integrity and development of specific cellular structures rather than overall size alone.
By highlighting the correlation between model focus and predictive performance, these findings underscore the importance of the ICM in embryo quality classification. The superior performance of models concentrating on the ICM reinforces the relevance of this region in assessing embryo viability and supports the potential utility of these models in clinical applications where accurate and interpretable predictions are essential.
This study demonstrates that CNN-based deep learning models can objectively grade embryos using only embryo images, without incorporating patient or clinical data. Among the models compared, EfficientNet-B0 (Tan and Le, 2019) outperformed other architectures in terms of accuracy, AUC, and F1-score, indicating its robustness for embryo classification tasks (Fig. 1 and 2). Significantly, the ability to accurately classify embryos into five grades, rather than a simple good or bad assessment, adds valuable granularity to embryo evaluation. Grad-CAM visualizations provided insight into the morphological features prioritized by the models, with higher-performing models effectively focusing on the inner cell mass (ICM), a key determinant of embryo quality (Gardner et al., 2000) (Fig. 4). These findings suggest that the use of CNNs, particularly EfficientNet-B0, can enhance the reliability and consistency of embryo selection in IVF by minimizing human biases inherent in manual evaluation.
While EfficientNet-B0 clearly outperformed the other models, it is important to consider the reasons behind the relatively lower performance of architectures such as VGG16 and ResNet18. VGG16, although historically influential, is a relatively shallow and parameter-heavy network that may not efficiently capture subtle morphological nuances in embryo images. Its reliance on uniformly stacked convolutional layers and lack of advanced architectural elements could limit its capacity to differentiate closely related classes. ResNet18, on the other hand, is a shallower variant of the residual network family. Although residual connections help in training deeper networks by mitigating the vanishing gradient problem, the limited depth and complexity of ResNet18 may have restricted its feature extraction capabilities. Consequently, these models may focus on less discriminative features, as evidenced by their Grad-CAM maps that highlighted non-ICM regions, thus reducing their effectiveness in fine-grained embryo quality classification.
In contrast, models like InceptionV3 and ResNet50 incorporate design strategies–such as inception modules and deeper residual connections–that allow for more diverse and hierarchical feature representations. Although these models performed well, they still fell slightly short of EfficientNet-B0’s performance. EfficientNet-B0’s compound scaling approach, which balances network depth, width, and resolution, likely enhanced its ability to learn from the available data with improved parameter efficiency. This balanced architecture may be particularly well-suited to capturing subtle morphological traits characteristic of intermediate embryo quality classes, thus improving both accuracy and generalizability.
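Compound scaling can be illustrated with a short sketch based on the formulation in Tan and Le (2019), where depth, width, and resolution are scaled together by a single coefficient. The base values 1.2, 1.1, and 1.15 are those reported in that paper; EfficientNet-B0 is the unscaled baseline (coefficient 0). The function names are hypothetical.

```python
# Base coefficients reported by Tan and Le (2019) for the EfficientNet family.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution


def compound_scaling(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi


def flops_factor(phi):
    """Approximate FLOPs growth: depth * width^2 * resolution^2."""
    depth, width, resolution = compound_scaling(phi)
    return depth * width ** 2 * resolution ** 2


print(compound_scaling(0))        # baseline B0: all multipliers equal 1.0
print(round(flops_factor(1), 3))  # one scaling step costs roughly twice the FLOPs
```

The constraint that one step roughly doubles the compute is what ties the three dimensions together instead of scaling any one of them in isolation.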
These architectural differences highlight that not all CNNs are equally effective for complex medical imaging tasks like embryo grading. The success of EfficientNet-B0 underscores the importance of selecting architectures that not only have sufficient representational capacity but also efficiently utilize parameters to capture subtle variations in biological structures. Choosing the right model is not merely a matter of picking the latest or most well-known architecture; rather, it involves aligning the model’s design principles with the specific characteristics of the target data and classification task.
In summary, our findings not only reaffirm the promise of CNN-based embryo grading in improving objectivity and consistency (Khosravi et al., 2019; VerMilyea et al., 2020) but also emphasize the importance of architectural selection. By identifying the structural attributes that lead to superior performance, researchers and clinicians can make more informed decisions when integrating AI models into clinical workflows, thereby enhancing the reliability and interpretability of embryo assessments and ultimately contributing to improved outcomes in reproductive medicine and IVF practices.
We thank Jihye Park for her valuable assistance in creating and refining the figures presented in this study.
Conceptualization, S.R. and H.S.; project administration and resources, V.J.S.; methodology and investigation, V.J.S., H.S. and S.R.; data curation and validation, H.S. and S.R.; writing-original draft preparation, H.S. and S.R.; writing-review and editing, V.J.S. and S.R.
None.
Not applicable.
Not applicable.
Not applicable.
Not applicable.
No potential conflict of interest relevant to this article was reported.
Journal of Animal Reproduction and Biotechnology 2024; 39(4): 267-277
Published online December 31, 2024 https://doi.org/10.12750/JARB.39.4.267
Copyright © The Korean Society of Animal Reproduction and Biotechnology.
Vincent Jaehyun Shim1 , Hosup Shim2 and Sangho Roh1,*
1Cellular Reprogramming and Embryo Biotechnology Laboratory, Dental Research Institute, Seoul National University School of Dentistry, Seoul 08826, Korea
2Department of Nanobiomedical Science, Dankook University, Cheonan 31116, Korea
Correspondence to:Sangho Roh
E-mail: sangho@snu.ac.kr
This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Background: Evaluating embryo quality is crucial for the success of in vitro fertilization procedures. Traditional methods, such as the Gardner grading system, rely on subjective human assessment of morphological features, leading to potential inconsistencies and errors. Artificial intelligence-powered grading systems offer a more objective and consistent approach by reducing human biases and enhancing accuracy and reliability.
Methods: We evaluated the performance of five convolutional neural network architectures—EfficientNet-B0, InceptionV3, ResNet18, ResNet50, and VGG16— in grading blastocysts into five quality classes using only embryo images, without incorporating clinical or patient data. Transfer learning was applied to adapt pretrained models to our dataset, and data augmentation techniques were employed to improve model generalizability and address class imbalance.
Results: EfficientNet-B0 outperformed the other architectures, achieving the highest accuracy, area under the receiver operating characteristic curve, and F1-score across all evaluation metrics. Gradient-weighted Class Activation Mapping was used to interpret the models’ decision-making processes, revealing that the most successful models predominantly focused on the inner cell mass, a critical determinant of embryo quality.
Conclusions: Convolutional neural networks, particularly EfficientNet-B0, can significantly enhance the reliability and consistency of embryo grading in in vitro fertilization procedures by providing objective assessments based solely on embryo images. This approach offers a promising alternative to traditional subjective morphological evaluations.
Keywords: blastocyst, convolutional neural networks, deep learning, embryo, in vitro fertilization
Conventionally, embryo evaluation is performed by embryologists using morphological criteria, with the Gardner grading system being one of the most widely adopted methods (Alpha Scientists in Reproductive Medicine and ESHRE Special Interest Group of Embryology, 2011). This system assesses blastocyst quality based on parameters such as blastocyst expansion, inner cell mass (ICM) quality, and trophectoderm (TE) appearance, assigning grades that correlate with implantation potential (Gardner et al., 2000). While this system provides a standardized framework, the manual evaluation process is inherently subjective and prone to inter-observer variability, which can affect clinical outcomes (Baxter Bendus et al., 2006; Paternot et al., 2011).
Advancements in AI and deep learning offer promising avenues to enhance the objectivity and consistency of embryo grading (LeCun, Bengio, and Hinton, 2015). Recent studies have demonstrated that deep learning models can enable robust assessment and selection of human blastocysts after IVF, providing more objective and consistent evaluations compared to traditional morphological analysis (Khosravi et al., 2019). Convolutional neural networks (CNNs), a class of deep learning models particularly effective in image recognition tasks, have demonstrated superior performance in various medical imaging applications (Esteva et al., 2017). In fields such as radiology, pathology, and dermatology, CNNs have achieved expert-level accuracy in image classification, segmentation, and detection tasks (Gulshan et al., 2016).
In the context of embryology, the application of CNNs for automated embryo assessment is an emerging area of research (VerMilyea et al., 2020). Previous research has predominantly focused on predicting embryo viability rather than capturing the finer distinctions in embryo quality (Manna et al., 2013; Tran et al., 2019; Zaninovic and Rosenwaks, 2020). CNN-based morphological classification has been demonstrated using relatively limited datasets (Thirumalaraju et al., 2021), and comprehensive reviews have highlighted a variety of deep learning methods, encompassing both clinical and image-based analyses, in embryo selection (Salih et al., 2023); nevertheless, there remains a need for more detailed grading strategies. In response, our study introduces a five-class grading system derived solely from embryo images, providing a more granular representation of quality. By achieving greater accuracy than previously reported image-based approaches, this work may yield more clinically actionable insights and better reflect the nuanced conditions encountered in actual practice.
To achieve this, we performed a comparative analysis of five widely recognized CNN architectures: VGG16 (Simonyan and Zisserman, 2014), ResNet18 and ResNet50 (He et al., 2015), InceptionV3 (Szegedy et al., 2015), and EfficientNet-B0 (Tan and Le, 2019). These models differ in depth, width, and complexity, affecting their capacity to learn hierarchical representations from image data (Rawat and Wang, 2017).
Furthermore, to interpret and visualize the features learned by CNNs, techniques such as Gradient-weighted Class Activation Mapping (Grad-CAM) have been employed in medical imaging to highlight important regions contributing to model predictions (Selvaraju et al., 2017). In embryo assessment, applying Grad-CAM can help identify specific morphological features that are most influential in grading decisions, thereby enhancing the transparency and interpretability of deep learning models.
In this study, we focused solely on embryo image data without incorporating patient-specific clinical information. While this approach may be seen as a limitation, it allows us to isolate the performance of the CNN architectures in capturing morphological cues directly from embryo images. By leveraging transfer learning and data augmentation to address class imbalance, our analysis identifies a model that not only outperforms others in this challenging five-class grading task but also surpasses accuracy levels reported in previous studies. This work provides a clearer understanding of the relative strengths and weaknesses of different CNN architectures in embryo assessment, offering guidance for future research and potential clinical implementation.
The dataset utilized in this study was obtained from the publicly available Embryo Dataset on Kaggle, accessible at https://www.kaggle.com/datasets/bitanasiri/embryo-dataset. This dataset comprises 14,640 high-resolution images of human embryos, categorized into five classes corresponding to different embryo quality grades based on morphological assessment. The dataset contains only embryo images and does not include any clinical or patient data.
To systematically evaluate model performance across various architectural strategies, we included five different CNN architectures: VGG16, ResNet18, ResNet50, InceptionV3, and EfficientNet-B0. These models represent a spectrum of design philosophies, layer configurations, and complexity levels. VGG16 provides a straightforward, deep convolutional baseline; ResNet18 and ResNet50 incorporate residual connections to facilitate the training of deeper networks; InceptionV3 leverages inception modules for efficiently capturing multi-scale features; and EfficientNet-B0 employs compound scaling to balance depth, width, and resolution for enhanced computational efficiency. By encompassing this range of architectures, we aimed to determine which structural characteristics would yield the most reliable and accurate results in the context of five-class embryo grading.
To enhance the generalization capabilities of the models and prevent overfitting, extensive data augmentation techniques were applied to the training dataset. Images were resized to the input size required by each model: 244 × 244 pixels, or 299 × 299 pixels for architectures such as InceptionV3. The augmentation process included random horizontal flipping with a probability of 0.5, random rotations within a range of ±10 degrees, and random resized cropping with a scale range of 80% to 100% of the original size. Color jitter adjustments were applied to modify brightness, contrast, saturation, and hue, introducing variability in color channels. After augmentation, images were converted to tensors and normalized using the mean and standard deviation values of the ImageNet dataset ([0.485, 0.456, 0.406] for mean and [0.229, 0.224, 0.225] for standard deviation) to align with the expectations of pre-trained models.
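The normalization and flipping steps described above can be illustrated with a minimal, framework-free sketch. It is not the pipeline used in the study (which was built in PyTorch); the list-of-lists image layout and the function names `normalize_pixel`, `horizontal_flip`, and `maybe_flip` are assumptions made for illustration only.

```python
import random

# ImageNet channel statistics quoted in the text (RGB order).
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return [(v - m) / s for v, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)]


def horizontal_flip(image):
    """Mirror an image stored as rows of pixels (hypothetical layout)."""
    return [list(reversed(row)) for row in image]


def maybe_flip(image, rng, p=0.5):
    """Apply the horizontal flip with probability p, as in the augmentation step."""
    return horizontal_flip(image) if rng.random() < p else image


# A pixel equal to the ImageNet mean maps to zero in every channel.
centered = normalize_pixel(IMAGENET_MEAN)

# Seeded generator so the illustrative augmentation is reproducible.
rng = random.Random(0)
augmented = maybe_flip([[1, 2, 3], [4, 5, 6]], rng)
```

Because each channel is shifted and scaled independently, the normalized values are centered near zero for typical natural images, which is the input distribution the pre-trained weights were fitted to.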
The validation dataset underwent a consistent set of transformations without augmentation to provide an unbiased evaluation. Images were resized and center-cropped to 299 × 299 pixels, converted to tensors, and normalized using the same parameters as the training set.
The dataset was split into training and validation sets using an 80:20 ratio. The training set comprised 80% of the total images, while the validation set consisted of the remaining 20%. This split ensured that models were evaluated on unseen data, providing an accurate assessment of their generalization capabilities.
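As a rough illustration of the 80:20 split, the sketch below shuffles image indices and partitions them. The text does not state whether the split was random, stratified, or seeded, so the shuffling strategy, the seed, and the helper name `split_indices` are assumptions.

```python
import random


def split_indices(n_images, train_fraction=0.8, seed=0):
    """Shuffle image indices and split them into training and validation lists.

    The seed and the plain (non-stratified) shuffle are illustrative choices,
    not details reported in the study.
    """
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)
    n_train = round(n_images * train_fraction)
    return indices[:n_train], indices[n_train:]


# 14,640 is the dataset size reported in the text.
train_idx, val_idx = split_indices(14640)
print(len(train_idx), len(val_idx))
```

With 14,640 images, this yields 11,712 training indices and 2,928 validation indices, and no index appears in both lists.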
All models were trained using the PyTorch framework, leveraging its flexibility and support for dynamic computational graphs. The Adam optimizer was employed for all models to adaptively adjust learning rates during training, with a consistent learning rate of 0.0001 set for all models to ensure stable convergence. This learning rate was chosen after preliminary experiments with values in the range of 1e-3 to 1e-5, and 1e-4 provided the best balance between convergence speed and stability. The Cross-Entropy Loss function was used, suitable for multi-class classification tasks. A batch size of 32 was used to balance computational efficiency with sufficient gradient diversity. This batch size was selected based on commonly used practices in image-based classification tasks, as well as GPU memory constraints noted during initial trial runs. Each model was trained for 10 epochs. While a larger number of epochs was tested (e.g., 20 and 30), early performance plateaus suggested that 10 epochs were sufficient to achieve stable model performance without overfitting. During training, models were set to training mode, enabling layers such as dropout and batch normalization. For InceptionV3, due to the presence of auxiliary outputs, the total loss was calculated as a weighted sum of the main loss and the auxiliary loss, with weights of 1.0 and 0.4, respectively. Gradient backpropagation was performed during the training phase to update model weights. Models were set to evaluation mode during validation to disable layers that could alter the data.
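The loss computation for InceptionV3 can be sketched in plain Python for a single sample. This is an illustration of the arithmetic only: in PyTorch the loss would normally be averaged over a batch, and the function names `cross_entropy` and `inception_loss` are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[target]


def inception_loss(main_logits, aux_logits, target, aux_weight=0.4):
    """Weighted sum of the main loss (weight 1.0) and the auxiliary loss (weight 0.4)."""
    main = cross_entropy(main_logits, target)
    aux = cross_entropy(aux_logits, target)
    return 1.0 * main + aux_weight * aux


# Five-class example with uninformative (all-zero) logits from both heads.
loss = inception_loss([0.0] * 5, [0.0] * 5, target=2)
```

For uniform logits over five classes, each head contributes ln 5, so the combined loss is 1.4 × ln 5; the auxiliary head therefore acts as a down-weighted regularizing signal rather than an equal partner.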
Models were evaluated based on their performance on the validation dataset using several metrics. Accuracy was calculated as the proportion of correctly classified samples over the total number of samples. The Cross-Entropy Loss was computed over the validation dataset. A classification report was generated, including precision, recall, F1-score, and support for each class, providing insights into the models’ performance on individual classes. A confusion matrix was constructed to illustrate the models’ ability to correctly predict each class and to identify where misclassifications occurred. Receiver Operating Characteristic (ROC) curves were plotted for each class by binarizing the labels, and the Area under the curve (AUC) scores were calculated to quantify the models’ ability to distinguish between classes.
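The support-weighted averages reported in a classification report can be reproduced with a short sketch. It assumes integer class labels starting at 0, and the function name `classification_summary` is hypothetical; the study itself does not state which library produced its report.

```python
def classification_summary(y_true, y_pred, n_classes=5):
    """Return accuracy and support-weighted precision, recall, and F1.

    Labels are assumed to be integers in range(n_classes).
    """
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / total
    weighted = {"precision": 0.0, "recall": 0.0, "f1": 0.0}
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in pairs)
        predicted = sum(p == c for p in y_pred)  # samples assigned to class c
        support = sum(t == c for t in y_true)    # samples truly in class c
        precision = tp / predicted if predicted else 0.0
        recall = tp / support if support else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        weight = support / total
        weighted["precision"] += weight * precision
        weighted["recall"] += weight * recall
        weighted["f1"] += weight * f1
    return accuracy, weighted


# Small three-class example with made-up labels.
acc, avg = classification_summary([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], n_classes=3)
```

One property worth noting is that support-weighted recall is algebraically identical to overall accuracy, which is consistent with the Accuracy and Recall columns of Table 1 agreeing after rounding.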
To interpret the models and understand the regions of the images that contributed to the predictions, Grad-CAM was employed. For each model, five random images from the validation set were selected. Hooks were registered on the last convolutional layers of the models to capture gradients and activations. Grad-CAM heatmaps were computed to visualize the important regions influencing the models’ decisions.
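The core Grad-CAM computation (Selvaraju et al., 2017) can be sketched without a deep learning framework once the hooked activations and gradients are available. The sketch below assumes both are provided as nested lists shaped [channels][height][width]; the function name `grad_cam` and the final rescaling to [0, 1] for display are illustrative assumptions rather than details reported in the study.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from one layer's activations and gradients.

    Both inputs are nested lists shaped [channels][height][width]; the
    gradients are those of the target class score with respect to the
    activations.
    """
    n_channels = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])
    # Channel weights: global average of the gradients over spatial positions.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # Weighted sum of activation maps, followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]
    # Rescale to [0, 1] for display (skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Two channels on a 2 x 2 grid: the first has a positive weight, the second negative.
heatmap = grad_cam(
    activations=[[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]],
    gradients=[[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]],
)
```

In this toy case only the top-left position survives the ReLU, because the channel that fires at the bottom-right has a negative weight for the target class. In practice the low-resolution map is upsampled to the input size and overlaid on the image.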
We evaluated five CNN architectures–EfficientNet-B0, InceptionV3, ResNet18, ResNet50, and VGG16–for their effectiveness in classifying embryo images into five distinct quality grades. Key performance metrics such as accuracy, AUC, precision, recall, and F1-score were computed for each model (Table 1).
Table 1. Embryo classification performance metrics for different CNN models.
Model | Accuracy (%) | AUC | Precision | Recall | F1-score |
---|---|---|---|---|---|
EfficientNet-B0 | 88.90 | 0.98 | 0.89 | 0.89 | 0.89 |
InceptionV3 | 85.52 | 0.97 | 0.86 | 0.86 | 0.86 |
ResNet18 | 78.24 | 0.95 | 0.79 | 0.78 | 0.78 |
ResNet50 | 84.63 | 0.97 | 0.85 | 0.85 | 0.84 |
VGG16 | 75.44 | 0.96 | 0.80 | 0.75 | 0.75 |
Comparing the accuracy across the five models, EfficientNet-B0 achieved the highest accuracy of 88.90%, followed by InceptionV3 at 85.52% and ResNet50 at 84.63%. ResNet18 and VGG16 showed lower accuracies of 78.24% and 75.44%, respectively.
In terms of AUC scores, EfficientNet-B0 again led with an AUC of 0.98, while InceptionV3 and ResNet50 both achieved AUCs of 0.97. VGG16 and ResNet18 had AUC values of 0.96 and 0.95, respectively.
For the weighted average precision, EfficientNet-B0 attained the highest score at 0.89, followed by InceptionV3 at 0.86, and ResNet50 at 0.85. VGG16 and ResNet18 had lower precision scores of 0.80 and 0.79, respectively.
Regarding weighted average recall, EfficientNet-B0 again had the highest score at 0.89, with InceptionV3 and ResNet50 following at 0.86 and 0.85. ResNet18 and VGG16 exhibited lower recall values of 0.78 and 0.75, respectively.
Comparing the weighted average F1-scores, EfficientNet-B0 achieved the highest value of 0.89, indicating balanced precision and recall. InceptionV3 and ResNet50 had F1-scores of 0.86 and 0.84, respectively, while ResNet18 and VGG16 had lower F1-scores of 0.78 and 0.75.
These results indicate that EfficientNet-B0 outperformed all other models across all evaluation metrics. It achieved the highest accuracy and AUC, as well as the highest weighted average precision, recall, and F1-score, demonstrating superior performance in classifying embryo quality. InceptionV3 and ResNet50 also showed strong but slightly lower performance, whereas ResNet18 and VGG16 were less effective in this classification task.
CNNs generally require substantial amounts of annotated image data to accurately learn features and differentiate between categories in complex classification tasks. Due to the scarcity of high-quality medical imaging datasets, we employed transfer learning by initializing our networks with pre-trained ImageNet weights. Five established CNN architectures–EfficientNet-B0, InceptionV3, ResNet50, ResNet18, and VGG16–were retrained using our dataset of 14,640 embryo images. The dataset was divided into training and testing sets using an 80:20 ratio, resulting in 11,712 images for training and 2,928 images for testing. All models were trained over 10 epochs with early stopping rules based on the lowest validation loss to minimize overfitting.
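Selecting a model by lowest validation loss can be sketched as follows. The text does not specify the exact early stopping rule, so the patience parameter, the function name `select_checkpoint`, and the loss values in the example are all hypothetical.

```python
def select_checkpoint(val_losses, patience=3):
    """Track the epoch with the lowest validation loss and stop early.

    Returns (best_epoch, best_loss, last_epoch_run), with epochs 1-indexed.
    Training stops once `patience` epochs pass without improvement; the
    patience value is an illustrative assumption.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, best_loss, epoch
    return best_epoch, best_loss, len(val_losses)


# Made-up per-epoch validation losses, for illustration only.
example_losses = [0.90, 0.60, 0.50, 0.55, 0.58, 0.60, 0.40]
stopped = select_checkpoint(example_losses, patience=3)
full_run = select_checkpoint(example_losses, patience=10)
```

The example also shows the usual trade-off: with a short patience the run halts at epoch 6 and keeps the epoch-3 weights, whereas a longer patience would have reached the lower loss at epoch 7.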
After training, we compared the validation losses and accuracies achieved by each network. EfficientNet-B0 achieved the lowest mean validation loss of 0.46 ± 0.083 and the highest validation accuracy of 84.99% ± 5.207%, indicating superior performance in classifying embryo quality grades (Table 2). InceptionV3 and ResNet50 also demonstrated strong performance, with validation losses of 0.60 ± 0.075 and 0.65 ± 0.078, and validation accuracies of 82.31% ± 3.653% and 81.59% ± 3.333%, respectively. In contrast, ResNet18 and VGG16 exhibited higher validation losses of 0.85 ± 0.218 and 0.96 ± 0.258, and lower validation accuracies of 77.61% ± 3.664% and 75.21% ± 5.922%, respectively.
The training and validation accuracy and loss curves illustrate the learning behavior of each model over the training epochs (Fig. 1). In the validation accuracy curves (Fig. 1A), EfficientNet-B0 consistently achieved higher validation accuracy throughout the 10 epochs, steadily increasing to approximately 86%. This indicates effective learning and better generalization to unseen data. In contrast, other models such as InceptionV3, ResNet18, ResNet50, and VGG16 showed fluctuations or decreases in validation accuracy during the training process. InceptionV3 initially increased in accuracy but exhibited variability after epoch 4, reaching a maximum of about 83%. ResNet18 and VGG16 reached lower accuracies around 80%, with more pronounced fluctuations, suggesting less stable learning.
The training and validation loss curves of each model (Fig. 1B) further support these observations. EfficientNet-B0 demonstrated a rapid decrease in training loss and maintained a relatively low validation loss throughout the epochs, indicating efficient learning without overfitting. In contrast, models like InceptionV3, ResNet50, and VGG16 showed increasing validation loss after the initial epochs despite decreasing training loss, suggesting potential overfitting where the model becomes too tailored to the training data, leading to reduced performance on validation data. ResNet18 exhibited higher validation loss with significant fluctuations, indicating challenges in capturing the complex features necessary for accurate embryo quality classification.
These results highlight that EfficientNet-B0 not only achieves higher validation accuracy but also maintains lower validation loss compared to the other models, confirming its superior performance and generalization capability for the embryo quality classification task. The consistent improvement in validation accuracy and the stable decrease in validation loss suggest that EfficientNet-B0 effectively learns relevant features without overfitting.
To evaluate the discriminative ability of each model across the five embryo quality classes, we generated class-specific ROC curves and calculated the AUC for each model. The ROC curves plot the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings, providing insight into the models’ ability to distinguish between classes (Fig. 2).
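A class-specific AUC can be computed directly from its rank interpretation: it is the probability that a randomly chosen sample of the class receives a higher score than a randomly chosen sample from any other class. The sketch below uses that definition; the function name `one_vs_rest_auc` is hypothetical, and the pairwise loop is chosen for clarity rather than speed.

```python
def one_vs_rest_auc(y_true, scores, positive_class):
    """One-vs-rest AUC for a single class.

    `scores` holds the model's predicted score for `positive_class` on each
    sample. Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive_class]
    neg = [s for t, s in zip(y_true, scores) if t != positive_class]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up scores for class 1 on four samples.
auc = one_vs_rest_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], positive_class=1)
```

Here three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75. Repeating the calculation with each class in turn as the positive class yields the per-class values that a set of class-specific ROC curves summarizes.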
EfficientNet-B0 demonstrated exceptional performance, achieving AUC values of 0.99 for Classes 1, 2, and 5, and AUC values of 0.98 for the remaining classes (Fig. 2). This indicates that EfficientNet-B0 has a high discriminative capacity across all embryo quality grades, consistently distinguishing between different classes with great accuracy.
Table 2. Validation accuracies and losses of CNNs.
Architectures | Validation accuracies (%) | Validation losses |
---|---|---|
EfficientNet-B0 | 84.99 ± 5.207 | 0.46 ± 0.083 |
InceptionV3 | 82.31 ± 3.653 | 0.60 ± 0.075 |
ResNet18 | 77.61 ± 3.664 | 0.85 ± 0.218 |
ResNet50 | 81.59 ± 3.333 | 0.65 ± 0.078 |
VGG16 | 75.21 ± 5.922 | 0.96 ± 0.258 |
InceptionV3 also exhibited strong performance, with AUC values of 0.99 for Class 1 and ranging from 0.96 to 0.98 for the other classes. While slightly lower than EfficientNet-B0 in some classes, InceptionV3 still maintains high discriminative ability across the board.
In contrast, ResNet18 showed lower AUC values, particularly for Class 4, where it achieved an AUC of 0.88. Its AUC values across all classes ranged from 0.88 to 0.98, indicating less consistent performance and reduced ability to accurately classify certain embryo quality grades.
Similarly, ResNet50 achieved AUC values of 0.95 or higher for most classes but recorded a lower AUC of 0.92 for Class 4. VGG16 displayed a comparable pattern, with an AUC of 0.91 for Class 4 and AUC values between 0.94 and 0.98 for the other classes.
These results highlight that EfficientNet-B0 outperforms the other models in terms of discriminative ability, consistently achieving higher AUC values across all classes. The lower AUC values for Class 4 in ResNet18 and VGG16 suggest that these models struggle to differentiate embryos of this quality grade, potentially due to insufficient feature extraction or model complexity limitations.
To gain deeper insights into the classification performance and identify patterns of misclassification, we analyzed the confusion matrices for the five CNN architectures (Fig. 3). The confusion matrices display correct predictions along the diagonal and misclassifications as off-diagonal elements, providing detailed information on how each model predicts the embryo quality classes. These matrices were generated using the 2,928 images from the testing set, obtained by splitting the dataset into training and testing sets with an 80:20 ratio. By evaluating how each model assigned grades to the actual images in the testing set, we were able to assess their classification behaviors and gain insights into their classification performance.
EfficientNet-B0 and InceptionV3 demonstrated exceptional classification performance. Their confusion matrices exhibited strong diagonals with minimal off-diagonal entries, indicating accurate classification across all embryo quality grades. Misclassifications were rare and primarily occurred between adjacent classes, such as misclassifying Class 3 embryos as Class 4. This suggests that these models effectively capture the subtle morphological differences between embryo quality grades, leading to high precision and reliability in embryo assessment.
In contrast, ResNet18, ResNet50, and VGG16 showed higher rates of misclassification. Their confusion matrices revealed more off-diagonal entries, indicating frequent misclassifications not only between adjacent classes but also between non-adjacent classes. For instance, these models often confused Class 4 embryos with Class 2 or Class 5, suggesting difficulty in distinguishing embryos with subtle or less pronounced morphological features. The widespread misclassifications imply that these models may struggle with feature extraction and interpretation necessary for accurate embryo quality classification.
Overall, the confusion matrix analysis highlights that EfficientNet-B0 and InceptionV3 outperform the other models, exhibiting more accurate and consistent classifications. Their superior performance underscores the importance of selecting appropriate CNN architectures that can effectively generalize and capture critical features in medical image classification tasks.
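The distinction between adjacent-class and non-adjacent-class errors can be quantified from a confusion matrix. The sketch below assumes that classes are encoded as integers 0 to 4 and that the index order reflects the ordering of quality grades; both the encoding and the function names `confusion_matrix` and `error_breakdown` are assumptions for illustration.

```python
def confusion_matrix(y_true, y_pred, n_classes=5):
    """Build a matrix whose entry [t][p] counts samples of true class t predicted as p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def error_breakdown(matrix):
    """Count misclassifications between adjacent and non-adjacent classes.

    Assumes class indices are ordered by quality grade, so that |t - p| == 1
    means the prediction missed by exactly one grade.
    """
    adjacent = non_adjacent = 0
    for t, row in enumerate(matrix):
        for p, count in enumerate(row):
            if abs(t - p) == 1:
                adjacent += count
            elif abs(t - p) > 1:
                non_adjacent += count
    return adjacent, non_adjacent


# Made-up labels: one adjacent error (2 -> 3) and one non-adjacent error (3 -> 1).
cm = confusion_matrix([0, 1, 2, 3, 3, 4], [0, 1, 3, 3, 1, 4])
adjacent_errors, non_adjacent_errors = error_breakdown(cm)
```

A model whose errors fall mostly in the adjacent count is missing by one grade at a time, whereas a large non-adjacent count points to the kind of widespread confusion described for the weaker models.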
The analysis of Grad-CAM heatmaps revealed a correlation between the regions of focus in the models and their predictive performance (Fig. 4). Models with higher accuracy–EfficientNet-B0, InceptionV3, and ResNet50–primarily concentrated on the ICM, as indicated by the red to yellow regions in the heatmaps. The ICM is a critical structure in embryo development, and its morphology is a key determinant of embryo quality. The models’ emphasis on the ICM suggests that they effectively learned to identify and prioritize biologically relevant features important for embryo viability.
In contrast, models with lower predictive performance, such as ResNet18 and VGG16, showed greater focus on the TE or other less critical regions instead of the ICM. Their Grad-CAM heatmaps displayed red to yellow activations in areas outside the ICM, indicating that these models may not be effectively capturing the essential features necessary for accurate embryo quality assessment. This misdirected attention could contribute to their reduced classification accuracy.
Furthermore, all models exhibited minimal focus on the overall blastocyst size, as evidenced by the green to blue areas in the heatmaps, which represent lower activation levels. This suggests that blastocyst size was not a significant factor in the models’ decision-making processes. While blastocyst size is a morphological characteristic considered during manual assessments, the models prioritized structural features of the ICM and TE over size metrics. This aligns with the understanding that the quality and viability of an embryo are more closely associated with the integrity and development of specific cellular structures rather than overall size alone.
By highlighting the correlation between model focus and predictive performance, these findings underscore the importance of the ICM in embryo quality classification. The superior performance of models concentrating on the ICM reinforces the relevance of this region in assessing embryo viability and supports the potential utility of these models in clinical applications where accurate and interpretable predictions are essential.
This study demonstrates that CNN-based deep learning models can objectively grade embryos using only embryo images, without incorporating patient or clinical data. Among the models compared, EfficientNet-B0 (Tan and Le, 2019) outperformed other architectures in terms of accuracy, AUC, and F1-score, indicating its robustness for embryo classification tasks (Fig. 1 and 2). Notably, the ability to accurately classify embryos into five grades, rather than a simple good or bad assessment, adds valuable granularity to embryo evaluation. Grad-CAM visualizations provided insight into the morphological features prioritized by the models, with higher-performing models effectively focusing on the ICM, a key determinant of embryo quality (Gardner et al., 2000) (Fig. 4). These findings suggest that the use of CNNs, particularly EfficientNet-B0, can enhance the reliability and consistency of embryo selection in IVF by minimizing human biases inherent in manual evaluation.
While EfficientNet-B0 clearly outperformed the other models, it is important to consider the reasons behind the relatively lower performance of architectures such as VGG16 and ResNet18. VGG16, although historically influential, is a relatively shallow and parameter-heavy network that may not efficiently capture subtle morphological nuances in embryo images. Its reliance on uniformly stacked convolutional layers and lack of advanced architectural elements could limit its capacity to differentiate closely related classes. ResNet18, on the other hand, is a shallower variant of the residual network family. Although residual connections help in training deeper networks by mitigating the vanishing gradient problem, the limited depth and complexity of ResNet18 may have restricted its feature extraction capabilities. Consequently, these models may focus on less discriminative features, as evidenced by their Grad-CAM maps that highlighted non-ICM regions, thus reducing their effectiveness in fine-grained embryo quality classification.
In contrast, models like InceptionV3 and ResNet50 incorporate design strategies–such as inception modules and deeper residual connections–that allow for more diverse and hierarchical feature representations. Although these models performed well, they still fell slightly short of EfficientNet-B0’s performance. EfficientNet-B0’s compound scaling approach, which balances network depth, width, and resolution, likely enhanced its ability to learn from the available data with improved parameter efficiency. This balanced architecture may be particularly well-suited to capturing subtle morphological traits characteristic of intermediate embryo quality classes, thus improving both accuracy and generalizability.
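For reference, the compound scaling rule mentioned above can be written as follows, using the formulation of Tan and Le (2019). Here d, w, and r are the multipliers for network depth, width, and input resolution, φ is a user-chosen compound coefficient, and α, β, γ are constants found by a small grid search on the baseline network.

```latex
\begin{aligned}
d &= \alpha^{\phi}, \qquad w = \beta^{\phi}, \qquad r = \gamma^{\phi}, \\
\text{subject to}\quad & \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2,
\qquad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1 .
\end{aligned}
```

EfficientNet-B0 is the baseline network of the family, and the larger variants are obtained by increasing φ. The constraint ties each unit increase in φ to roughly a doubling of computational cost, since cost grows linearly with depth but quadratically with width and resolution.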
These architectural differences highlight that not all CNNs are equally effective for complex medical imaging tasks like embryo grading. The success of EfficientNet-B0 underscores the importance of selecting architectures that not only have sufficient representational capacity but also efficiently utilize parameters to capture subtle variations in biological structures. Choosing the right model is not merely a matter of picking the latest or most well-known architecture; rather, it involves aligning the model’s design principles with the specific characteristics of the target data and classification task.
In summary, our findings not only reaffirm the promise of CNN-based embryo grading in improving objectivity and consistency (Khosravi et al., 2019; VerMilyea et al., 2020) but also emphasize the importance of architectural selection. By identifying the structural attributes that lead to superior performance, researchers and clinicians can make more informed decisions when integrating AI models into clinical workflows, thereby enhancing the reliability and interpretability of embryo assessments and ultimately contributing to improved outcomes in reproductive medicine and IVF practices.
We thank Jihye Park for her valuable assistance in creating and refining the figures presented in this study.
Conceptualization, S.R. and H.S.; project administration and resources, V.J.S.; methodology and investigation, V.J.S., H.S. and S.R.; data curation and validation, H.S. and S.R.; writing-original draft preparation, H.S. and S.R.; writing-review and editing, V.J.S. and S.R.
None.
Not applicable.
Not applicable.
Not applicable.
Not applicable.
No potential conflict of interest relevant to this article was reported.
pISSN: 2671-4639
eISSN: 2671-4663