Article

STAVOS: A Medaka Larval Cardiac Video Segmentation Method Based on Deep Learning

Key Laboratory of Fisheries Information, Ministry of Agriculture and Rural Affairs, Shanghai Ocean University, Hucheng Ring Road 999, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(3), 1239; https://doi.org/10.3390/app14031239
Submission received: 23 December 2023 / Revised: 24 January 2024 / Accepted: 25 January 2024 / Published: 2 February 2024

Abstract
Medaka (Oryzias latipes), as a crucial model organism in biomedical research, holds significant importance in fields such as cardiovascular diseases. Currently, the analysis of the medaka ventricle relies primarily on visual observation under a microscope, involving labor-intensive manual operations and visual assessments that are cumbersome and inefficient for biologists. Despite attempts by some scholars to employ machine learning methods, limited datasets and challenges posed by the blurred edges of the medaka ventricle have constrained research to relatively simple tasks such as ventricle localization and heart rate statistics, lacking precise segmentation of the medaka ventricle edges. To address these issues, we initially constructed a video object segmentation dataset comprising over 7000 microscopic images of medaka ventricles. Subsequently, we proposed a semi-supervised video object segmentation model named STAVOS, incorporating a spatial-temporal attention mechanism. Additionally, we developed an automated system capable of calculating various parameters and visualizing results for a medaka ventricle using the provided video. The experimental results demonstrate that STAVOS has successfully achieved precise segmentation of medaka ventricle contours. Compared with the conventional U-Net model, STAVOS improves mean accuracy by 0.392; compared with the state-of-the-art Tackling Background Distraction (TBD) model, it achieves a further improvement of 0.038.

1. Introduction

Medaka (Oryzias latipes) is a commonly used model organism with significant importance in cardiovascular disease research [1], widely applied in the fields of genetic modification and drug development. This is attributed to the high genetic similarity of medaka to humans, low cultivation costs, and a fast growth cycle. Additionally, during the early larval stages, the bodies of medaka larvae are transparent. Therefore, research primarily focuses on the ventricular development of medaka larvae, with tasks mainly involving heart rate detection and ventricular parameter assessment. The current methods for analyzing heart images of medaka larvae primarily involve manual visual inspections using microscopy, supplemented by visual scoring of embryos with the naked eye. However, the medaka larva is very small and is influenced by factors such as pigmentation and blood vessels. This constitutes a highly demanding and expensive task for biologists in this field [2].
To address this issue, researchers have successively developed various digital image analysis methods to replace observation with the naked eye. For instance, Elodie Puybareau et al. [3] proposed a mathematical morphology-based image analysis method for the automated detection of cardiac arrest in fish embryos. Chia-Pin Kang et al. [4] utilized the periodic bright intensity variations in video frames associated with heartbeats and employed the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to cluster pixels based on “pixel density”, calculating the heart rate. De Luca et al. [5] designed a platform based on a resonant laser scanning confocal microscope for analyzing the heart rate of fish embryos. However, this equipment is quite expensive for some laboratories.
Traditional computer vision methods can only reduce a small number of manual operations. Therefore, some scholars have begun to use deep learning methods for research. However, the research is mainly limited to heart rate estimation [6] or the segmentation of the entire fish body [7], and there has been little progress in the precise segmentation of internal organs such as the heart or ventricle. This is mainly due to the small size of medaka fish larvae, which require video capture through microscopy. After multiple layers of refraction, the video data exhibits low pixel density, resulting in severe dynamic blurring and prominent artifacts. Additionally, the medaka fish body is not entirely transparent, featuring various degrees of pigmentation that cause different levels of occlusion in the collected ventricle videos. Moreover, the heart atrium and ventricle share a high degree of morphological similarity, lacking clear boundaries, making the contour edges almost indistinguishable. These factors make the precise segmentation of the ventricle a significant challenge.
In response to these challenges, there are currently two main solutions. The first is to use fluorescence labeling to enhance the visibility of the ventricle. For example, P. Krämer et al. [8] improved the accuracy of heart segmentation through fluorescence microscopy techniques, and Isaac Bakis et al. [9] achieved whole-heart imaging using fluorescence labeling. The second is to enhance the morphological features of the heart through transgenic cultivation. For instance, Bohan Zhang et al. [10] implemented automatic segmentation of the heart based on U-Net, and Akerberg et al. [11] proposed a SegNet-based method named Cardiac Functional Imaging Network (CFIN), capable of automatically assessing cardiac volume. Both transgenic cultivation and fluorescence labeling significantly improve ventricular segmentation. However, these approaches still require a substantial number of biological experiments, deviating from the original intention of achieving automation. Therefore, researchers in this field are eager to find an approach that demands as few additional manual operations as possible, truly realizing automation.
Although the ventricular image data of medaka fish has various limitations, cardiac pulsation is a continuous process, and the ventricle undergoes significant changes in volume and shape during pulsation, exhibiting considerable consistency between consecutive frames. To leverage visual information between adjacent frames, this study adopts the Video Object Segmentation (VOS) method [12]. The purpose of VOS is to maintain pixel-level tracking of a specific target object throughout the entire given video, and there is a significant demand for datasets in this field. However, there is no open and authoritative dataset for medaka larval ventricular beating. Therefore, we utilized pixel-level annotation tools to annotate the video data of medaka larval heartbeats, creating a novel dataset specifically designed for VOS tasks. Due to the issues of occlusion and severe background interference in medaka ventricular video images, our model is based on the state-of-the-art video segmentation model TBD (Tackling Background Distraction) [13]. Additionally, to address the problem of edge blurring, we propose a spatiotemporal attention module for video object segmentation (STAVOS), which better captures the dynamic features of heartbeats in the video, achieving precise segmentation of the medaka ventricle. Moreover, we have designed an automated system: given a video capturing the heartbeat of medaka larvae and a mask for one frame, the system performs ventricular segmentation. Based on the precise segmentation results from STAVOS, the system automates the computation of various cardiac parameters of the medaka ventricle, such as heart rate (HR), stroke volume (SV), fractional shortening (FS), ejection fraction (EF), etc. Some of these parameters are visualized to facilitate further medaka research by biologists.
The main contributions of the work in this paper are as follows:
  • For the first time, a medaka ventricular video image dataset was established, which was created in the format of a VOS task dataset with over 7000 labeled images. It can be used for precise segmentation of the medaka ventricle and provides an effective basis for calculating ventricular parameters.
  • We propose a spatiotemporal attention module for VOS, aiming to address the issue of edge blurring in the segmentation of moving targets in videos and thereby enhance the performance of semi-supervised VOS models.
  • We developed a deep learning-based automated system that achieves precise segmentation of the medaka ventricle. This system is capable of computing various parameters for the medaka heart and producing visualizations for some of these parameters.

2. Dataset

At present, there is a lack of accurate ventricular segmentation datasets in the field of medaka research. To address this gap, we chose to create a new medaka ventricular VOS task dataset.

2.1. Data Acquisition

Unlike the scenes in publicly available VOS datasets, medaka larvae are only about 3 mm long, and acquiring their videos involves a series of complex and rigorous operations. A plate is prepared on a microscope equipped with a white LED array for bright-field imaging, an LED fluorescence excitation light source, a fixed plate bracket, movable optical devices, and a temperature-controlled incubation cover. One hour before imaging, medaka embryos at 32–36 h post-fertilization (hpf) were transferred from a methylene blue-containing medium (maintained at a constant 28 °C) into plain embryo medium (ERM). After equilibrating for 15 min in the microscope’s incubation chamber under bright-field illumination, video images of the embryos were captured with a Nikos CMOS camera (2048 × 2048 pixels).
A total of 63 medaka heartbeat videos in AVI format were collected, each approximately 11 to 13 s long. The videos were recorded at a frame rate of 15 frames per second (fps) and a resolution of 640 × 480. Medaka hearts normally beat at around 2 beats per second [14], well below the video’s frame rate of 15 fps, so the frame rate is sufficient for accurate heart rate estimation. Although the resolution is limited to 640 × 480, the recordings are focused on the cardiac region, with the ventricle occupying approximately 160 × 128 pixels; the resolution is therefore adequate for precise ventricular contour segmentation. Higher frame rates and resolutions would, of course, further improve the precision of the dataset and the model, and we hope that more scholars will openly share their data. We are very grateful to Dr. Jakob Gierten [15] from the German Centre for Organismal Studies at Heidelberg University for providing this original dataset. The video data are freely available from the Open Science Framework (OSF), a platform that supports scientific research in fields such as social science, medicine, and engineering. The download address is https://osf.io/6svkf (accessed on 24 March 2019).
Due to different angles during video capture, there is a significant difference in the direction of the heart presentation. Therefore, we divide videos into two categories:
  • The video captured on the ventral side of a medaka is named with the prefix “R”. This perspective allows for simultaneous observation of the atrium and ventricle of the medaka, as shown in Figure 1a–d.
  • The video captured on the lateral-right side of a medaka is named with the prefix “N”. From this perspective, the atrium is obstructed by the ventricle, and only the ventricle can be observed, as shown in Figure 1e–h.

2.2. Dataset Construction

The ventricles in the original videos exhibit varying degrees of occlusion, as illustrated in the subplots of Figure 1, where the visibility of the ventricles differs. Based on the clarity of the ventricles, we categorized the 63 videos into three levels:
  • In the first level, the ventricles are essentially unobstructed, and their edges are clear and visible, as shown in Figure 1b,f. The quantity of videos in this level is 21.
  • In the second level, the ventricles experience some degree of obstruction, or their edges are blurred and challenging to distinguish, as depicted in Figure 1c,g. The quantity of videos in this level is 17.
  • In the third level, the ventricles have a significant amount of occlusion, and the edges are indistinguishable, as seen in Figure 1d,h. The quantity of videos in this level is 25.
The videos in the third level were discarded because their extremely poor quality prevented accurate labeling. The videos in the second level also exhibited suboptimal quality; however, considering the prevalence of ventricular occlusion and blurred edges, coupled with the limited amount of data, we chose to use the first- and second-level videos for dataset construction. For differentiation, we labeled the video data from the first level as “better” and the second level as “lower”.
Our process for creating the dataset is as follows: Firstly, we extracted frames from the medaka heartbeat videos, with approximately 207 frames per N-direction video and 165 frames per R-direction video. Next, we used the pixel-level image annotation tool LabelMe (version 5.0.1), developed by the Massachusetts Institute of Technology, to label the medaka ventricle frame by frame, generating JSON label files containing the ventricular edge coordinates and the corresponding ventricular mask PNG images. Finally, we organized the files according to the folder structure of the DAVIS dataset [16].
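For illustration, the JSON-to-mask conversion step can be sketched as follows; the function and the folder-layout comment are our assumptions rather than the authors’ exact tooling, and they assume LabelMe JSON files whose “shapes” entries store the ventricular edge polygon.

```python
import json

import cv2
import numpy as np

def labelme_json_to_mask(json_path, out_png, height=480, width=640):
    """Rasterize the ventricle polygon stored in a LabelMe JSON file into a binary PNG mask."""
    with open(json_path, "r") as f:
        annotation = json.load(f)
    mask = np.zeros((height, width), dtype=np.uint8)
    for shape in annotation.get("shapes", []):
        points = np.array(shape["points"], dtype=np.int32)
        cv2.fillPoly(mask, [points], 255)  # ventricle = 255, background = 0
    cv2.imwrite(out_png, mask)

# DAVIS-style folder layout (illustrative):
#   JPEGImages/480p/<video_name>/00000.jpg, 00001.jpg, ...
#   Annotations/480p/<video_name>/00000.png, 00001.png, ...
```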
We created a total of 7225 labeled images, including 4569 from the lateral-right view of the medaka heart and 2656 from the ventral view. The specific quantity information of the dataset is shown in Table 1.

3. Methods

3.1. Spatiotemporal Attention Mechanism

Depending on the guidance provided for the target object, VOS can be classified into subclasses such as unsupervised VOS, reference VOS, and semi-supervised VOS. Since the cardiac pulsation of medaka fish causes surrounding areas to pulsate collectively, resulting in the non-uniqueness of moving objects in the video, we can only employ semi-supervised VOS. The primary distinction between semi-supervised VOS and unsupervised VOS lies in the fact that semi-supervised VOS requires additional dense annotation information for one frame in the video sequence. In this paper, we define the frame requiring this additional annotation as the guiding frame. Traditional semi-supervised VOS approaches commonly designate the first frame as the guiding frame. Some studies [17] have demonstrated that choosing the first frame as the guiding frame may not necessarily be the optimal choice. To address the question of which frame is the optimal choice for the guiding frame, we propose a method based on spatiotemporal attention mechanisms.
A video varies dynamically over time, and each frame carries a different amount of information, influencing the segmentation performance of the entire video sequence to a different extent. The objective of VOS is to segment specific dynamic objects within the video, and the primary challenge lies in the precise segmentation of target edges. Guided by the concept of a spatiotemporal attention mechanism, within the spatial extent of each image we concentrate solely on the target edge region and define the mean gradient of that region as the frame’s spatial feature value. The larger the mean gradient, the more pronounced the edges of the image. Across the temporal sequence of the entire video, our attention is directed towards the frame with the highest spatial feature value. The frame selected by this spatiotemporal attention mechanism is termed the spatiotemporal attention frame (STA frame). Because it carries the highest spatiotemporal feature value, we designate the STA frame as the guiding frame for the semi-supervised VOS proposed in this paper. This choice strengthens the model’s comprehension of edge features, thereby enhancing its segmentation performance.

3.2. Deriving STA Frame

The key to deriving the STA frame lies in how to calculate the spatial feature value of an image. The most intuitive approach is to dilate and erode the mask and obtain the segmented target’s edge region by differencing. Applying Laplacian filtering or Sobel operator filtering [18] to this region then produces a gradient matrix, and the mean of the gradient matrix yields the spatial feature value of the image. However, the edges of segmented targets are typically irregular regions that cannot be directly subjected to filtering operations.
To address this issue, a novel method for calculating the spatial feature values of irregular regions is proposed in this paper. Firstly, the image is converted to a grayscale image. Then, the Sobel operator’s two convolution kernels are applied to the grayscale image to calculate the horizontal gradient G x and the vertical gradient G y . Subsequently, the square root of the sum of squares of G x and G y is computed, yielding the overall gradient magnitude G at each pixel position, thereby obtaining the gradient matrix for the entire image. The formulas are as follows:
$$G_x = \sum_{i=-1}^{1}\sum_{j=-1}^{1} I(x+i,\, y+j)\, G_x(i,j)$$
$$G_y = \sum_{i=-1}^{1}\sum_{j=-1}^{1} I(x+i,\, y+j)\, G_y(i,j)$$
$$G = \sqrt{G_x^2 + G_y^2}$$
Here, $I(x, y)$ represents the pixel value at position $(x, y)$ in the image, and $G_x(i, j)$ and $G_y(i, j)$ are the two convolution kernels of the Sobel operator. Next, a dilation operation is applied to the original mask image, producing an outward-expanded dilation mask, and an erosion operation is applied to the original mask image, producing an inward-shrunk erosion mask. The Region of Interest (ROI) is obtained by subtracting the erosion mask from the dilation mask. The values of the image’s gradient matrix within the ROI are accumulated to give the total gradient of the image’s edge region, and their mean represents the spatial feature value of the image. By executing these operations on the entire video sequence, a spatial feature value is computed for each frame, and the frame with the highest spatial feature value is identified as the STA frame.
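A minimal sketch of this computation with OpenCV, assuming a per-frame mask is available (as in the annotated training data); the Sobel kernel size and the dilation/erosion radius are illustrative choices, not the paper’s exact settings.

```python
import cv2
import numpy as np

def spatial_feature_value(frame_bgr, mask, ksize=3, edge_width=5):
    """Mean Sobel gradient magnitude inside the edge ROI (dilated mask minus eroded mask)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=ksize)  # horizontal gradient G_x
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=ksize)  # vertical gradient G_y
    grad = np.sqrt(gx ** 2 + gy ** 2)                    # gradient magnitude G

    kernel = np.ones((edge_width, edge_width), np.uint8)
    roi = (cv2.dilate(mask, kernel) > 0) & (cv2.erode(mask, kernel) == 0)  # ring-shaped edge region
    return grad[roi].mean() if roi.any() else 0.0

def select_sta_frame(frames, masks):
    """Index of the frame with the highest spatial feature value, i.e., the STA frame."""
    scores = [spatial_feature_value(f, m) for f, m in zip(frames, masks)]
    return int(np.argmax(scores))
```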

3.3. STAVOS

As illustrated in Figure 2, our framework consists of two main modules: the STA module and the TBD module.
In contrast to traditional semi-supervised VOS approaches that use the first frame of a video as the guiding frame, we use the STA frame as our initial frame. The function of the STA module is to process the complete video sequence and identify the STA frame contained therein. The STA frame is usually neither the first nor the last frame of the video, so the video sequence is split into two parts, denoted as “pre_frames” and “sub_frames” in Figure 2. Since the STA frame is situated after the “pre_frames”, the order of the “pre_frames” must be reversed before they are input into the TBD module.
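A small sketch of this splitting step (variable names are ours):

```python
def split_at_sta(frames, sta_index):
    """Split the sequence at the STA frame; the part before it is reversed so that
    propagation always starts from the STA frame and moves outward in both directions."""
    pre_frames = frames[:sta_index][::-1]  # sta_index-1, sta_index-2, ..., 0
    sub_frames = frames[sta_index:]        # sta_index, sta_index+1, ..., last frame
    return pre_frames, sub_frames
```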
The TBD module is designed to encode, match, and decode the reversed “pre_frames” and “sub_frames”.
We use the DenseNet-201 [19] architecture as our encoder. To facilitate training, we employed the version pre-trained on ImageNet [20], using [0.485, 0.456, 0.406] as the mean and [0.229, 0.224, 0.225] as the standard deviation for input normalization. Considering the low resolution of the highest-level features in the original DenseNet-201, we use the feature map produced by the third dense block to obtain a higher feature resolution.
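A sketch of this encoder configuration using the torchvision DenseNet-201 weights, truncated after the third dense block; the exact truncation point and wrapper structure in the authors’ code may differ.

```python
import torch.nn as nn
from torchvision import models, transforms

# ImageNet statistics used to normalize input frames.
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])

class DenseNetEncoder(nn.Module):
    """DenseNet-201 backbone truncated after the third dense block for a higher-resolution feature map."""
    def __init__(self):
        super().__init__()
        feats = models.densenet201(weights=models.DenseNet201_Weights.IMAGENET1K_V1).features
        # Keep everything up to and including denseblock3 (drops transition3, denseblock4, norm5).
        self.stem = nn.Sequential(feats.conv0, feats.norm0, feats.relu0, feats.pool0)
        self.stage1 = nn.Sequential(feats.denseblock1, feats.transition1)
        self.stage2 = nn.Sequential(feats.denseblock2, feats.transition2)
        self.stage3 = feats.denseblock3

    def forward(self, x):
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        return self.stage3(x)  # stride-16 feature map with 1792 channels
```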
In the TBD module, we first encode the STA frame to obtain its feature map. Passing the feature map through a convolutional layer followed by L2 normalization yields a vector representation for each pixel, commonly referred to as a “key”. Subsequently, an average pooling operation with a window size of 16 × 16 is applied to the true mask of the STA frame. The feature map and “key” of the STA frame are used to initialize the templates for fine-grained and coarse-grained matching, serving as the initial state of the state dictionary. Following that, the reversed “pre_frames” and the “sub_frames” sequences are encoded to obtain the feature map and “key” of the current frame. Drawing on methods such as FEELVOS and BMVOS [21], we devised matching templates for spatial and temporal diversity (S&T Diversity) to enrich the semantic information of the embedded features. The feature map and “key” of the current frame undergo spatial matching within the image as well as temporal matching against the state dictionary across the sequence, producing a comprehensive matching score that accounts for different temporal and spatial scales.
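A minimal sketch of the “key” embedding and mask downsampling described above; the channel sizes and layer names are illustrative assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

key_projection = nn.Conv2d(in_channels=1792, out_channels=64, kernel_size=1)  # 1792 = DenseNet-201 block-3 channels

def compute_key(feature_map):
    """Project backbone features to per-pixel embedding vectors ('keys') and L2-normalize them."""
    key = key_projection(feature_map)
    return F.normalize(key, p=2, dim=1)

def downsample_mask(mask, window=16):
    """Average-pool the guiding-frame mask with a 16 x 16 window to match the coarse feature grid."""
    return F.avg_pool2d(mask, kernel_size=window, stride=window)
```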
After the matching module, the decoder decodes the feature map and matching scores of the current frame, along with the predictions from the previous frame. To acquire more comprehensive feature representations, we introduce CBAM (Convolutional Block Attention Module) [22] after each deconvolutional layer. Upon completion of decoding, the softmax function is employed to calculate the foreground probability for each pixel in the current frame. The predictions of the current frame are then fed back into the encoder for encoding, facilitating the update of the state dictionary, thereby forming a recurrent loop.
By repeating this process, the “pre_frames” generate “pre_masks”, while the “sub_frames” yield “sub_masks”. Finally, we re-invert the “pre_masks” and merge them with the “sub_masks” to obtain the predicted masks for the entire video sequence.
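Schematically, the recurrent loop over one of the two sub-sequences might look as follows; `encoder`, `matcher`, `decoder`, and the state-dictionary fields are placeholders for the corresponding TBD components, not their actual interfaces.

```python
import torch

def propagate(frames, sta_mask, encoder, matcher, decoder, state):
    """Schematic propagation loop for one direction (reversed pre_frames or sub_frames):
    encode the current frame, match it against the state dictionary, decode together with
    the previous prediction, then feed the prediction back to update the state."""
    prev_mask = sta_mask
    predicted_masks = []
    for frame in frames:
        feat = encoder(frame)                     # backbone features of the current frame
        score = matcher(feat, state)              # fine- and coarse-grained S&T matching
        logits = decoder(feat, score, prev_mask)  # decoder (with CBAM after each deconv layer)
        prob = torch.softmax(logits, dim=1)       # per-pixel foreground probability
        prev_mask = prob.argmax(dim=1, keepdim=True).float()
        state.update(frame_feat=feat, frame_mask=prev_mask)  # update the state dictionary
        predicted_masks.append(prev_mask)
    return predicted_masks
```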

3.4. Metrics

This study aims to achieve precise segmentation of target edges, and to better assess the model’s segmentation performance, we adopted two evaluation metrics specified in the DAVIS dataset challenge [23], namely, “Jaccard (J) Per-Sequence” and “Boundary (F) Per-Sequence”. These metrics are variations of the Jaccard Index and F-Measure, specifically designed to focus on the performance of each video sequence rather than the entire dataset.
“Jaccard (J) Per-Sequence” calculates the Jaccard Index at the sequence level. The Jaccard Index (also known as Intersection over Union, or IoU) is used to measure the similarity between two sets, calculated by the ratio of the intersection to the union of the predicted segmentation region and the true segmentation region. The formula is as follows:
$$J(\mathrm{pre}, \mathrm{mask}) = \frac{|\mathrm{pre} \cap \mathrm{mask}|}{|\mathrm{pre} \cup \mathrm{mask}|}$$
where “pre” represents the predicted segmentation region, and “mask” represents the true mask segmentation region. The Jaccard Index ranges from 0 to 1, with a higher value indicating higher overlap, 0 indicating no match, and 1 indicating a perfect match. For the entire VOS dataset, we calculate the Jaccard Index for each frame in each video sequence, and the mean Jaccard Index for that video sequence represents “Jaccard (J) Per-Sequence”. This process evaluates the model’s segmentation accuracy in each video sequence, helping quantify the model’s performance for each object segmentation task. Finally, the mean “Jaccard (J) Per-Sequence” for the entire dataset represents the overall performance of the model.
“Boundary (F) Per-Sequence” is a video sequence-level segmentation performance evaluation metric, focusing on the measurement of various parameters related to the model-generated segmentation contours and the true segmentation contours, such as F-Measure, Precision, and IoU. Since F-Measure is an effective metric for measuring precision and recall, emphasizing high-quality generation of segmentation contours, this study selects F-Measure as the metric for “Boundary (F) Per-Sequence”. The formula is as follows:
$$\mathrm{Precision} = \frac{|\mathrm{pre} \cap \mathrm{mask}|}{|\mathrm{pre}|}, \qquad \mathrm{Recall} = \frac{|\mathrm{pre} \cap \mathrm{mask}|}{|\mathrm{mask}|}$$
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The value of F ranges from 0 to 1, with higher values indicating better contour segmentation: 0 means that no positive examples were correctly identified or recalled, and 1 indicates perfect precision and recall. For each video sequence, we calculate the F-Measure between the model’s segmentation contours and the true segmentation contours, evaluating the model’s ability to maintain accurate object boundaries in each video sequence.
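Both per-sequence metrics follow directly from the set definitions above. A sketch over binary NumPy masks, using the region-based precision/recall formulation quoted here (the official DAVIS boundary measure additionally matches contours with a small tolerance):

```python
import numpy as np

def jaccard(pred, gt):
    """Jaccard index (IoU) between a predicted and a ground-truth binary mask."""
    intersection = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return intersection / union if union > 0 else 1.0

def f_measure(pred, gt):
    """F-Measure from the region-based precision/recall definition given in the text."""
    intersection = np.logical_and(pred, gt).sum()
    precision = intersection / pred.sum() if pred.sum() > 0 else 0.0
    recall = intersection / gt.sum() if gt.sum() > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def per_sequence_scores(pred_masks, gt_masks):
    """Mean J and mean F over all frames of a single video sequence."""
    js = [jaccard(p, g) for p, g in zip(pred_masks, gt_masks)]
    fs = [f_measure(p, g) for p, g in zip(pred_masks, gt_masks)]
    return float(np.mean(js)), float(np.mean(fs))
```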
The selection of “Jaccard (J) Per-Sequence” and “Boundary (F) Per-Sequence” emphasizes a detailed assessment of each video sequence. The Jaccard Index focuses on the ratio of the intersection to the union between the segmentation results and the true segmentation, while F-Measure emphasizes the accuracy of segmentation boundaries. In this way, we can comprehensively evaluate the model’s performance when dealing with dynamic scenes and changes between consecutive frames. This evaluation method helps capture the model’s ability to handle dynamic scenarios and provides more targeted feedback for fine-tuning the model. Therefore, the choice of “Jaccard (J) Per-Sequence” and “Boundary (F) Per-Sequence” makes the evaluation more accurate and practically meaningful.

3.5. Automation System

In order to further reduce manual operations, we have developed a deep learning-based system for the precise segmentation of medaka ventricles. This system can automatically compute relevant parameters of the medaka heart and provide visualizations of changes in ventricular status.
The overall workflow of the system is illustrated in Figure 3. To operate the system, users only need to provide a video recording of medaka heartbeats and mark the true mask of the ventricle in the STA frame, which is automatically computed by the system. The system then automatically loads the medaka heartbeat video and the STA frame’s mask. Using the optimal weights obtained by training the STAVOS model, the system infers the ventricular region for every frame in the video. In the images, the medaka ventricle is represented as a closed region. Based on the precise segmentation of the medaka ventricle, we can accurately calculate quantities such as its diameter and area. The size of the closed region changes periodically with the heartbeat: the process from large to small is defined as systole (S) and from small to large as diastole (D). The moment when the closed region is largest is defined as end-diastole (ED), and the moment when it is smallest is defined as end-systole (ES). A complete cardiac cycle consists of one systole and the adjacent diastole; thus, the duration of one heartbeat is the time between two adjacent EDs or two adjacent ESs, with time measured in video frames. The heart rate is therefore given by:
$$HR = \frac{fps}{ESF_c - ESF_p}$$
Here, $ESF_c$ represents the current end-systolic frame, and $ESF_p$ represents the previous end-systolic frame.
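Given the per-frame ventricular areas produced by the segmentation, end-systolic frames can be detected as local minima of the area curve and the heart rate computed from the formula above; the peak-detection settings and the conversion to beats per minute are our assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def heart_rate_per_cycle(areas, fps=15):
    """Detect end-systolic frames as local minima of the ventricular area curve and apply
    HR = fps / (ESF_c - ESF_p) for each pair of adjacent ES frames; the factor of 60
    converts beats per second to beats per minute (matching the units of Table 4)."""
    areas = np.asarray(areas, dtype=float)
    es_frames, _ = find_peaks(-areas, distance=int(fps / 4))  # minimum spacing is a loose, assumed guard
    rates_bpm = [fps / (cur - prev) * 60 for prev, cur in zip(es_frames[:-1], es_frames[1:])]
    return es_frames, rates_bpm
```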
Fractional shortening (FS) is a metric used to evaluate cardiac contractile function, reflecting the heart’s ability to contract by measuring the change in ventricular diameter between diastole and systole. The calculation formula is expressed as follows:
$$FS = \frac{D_d - D_s}{D_d} \times 100\%$$
Here, $D_d$ represents the short-axis diameter at end-diastole, and $D_s$ represents the short-axis diameter at end-systole.
By approximating the ventricle as a prolate spheroid, the following volume formula can be used:
$$V = \frac{4}{3}\pi \times \frac{D_L}{2} \times \left(\frac{D_S}{2}\right)^2$$
where $D_L$ and $D_S$ are the long- and short-axis diameters of the closed region, respectively.
Ejection Fraction (EF) is a critical indicator for assessing cardiac pumping efficiency, with its value reflecting the strength of cardiac function. The calculation formula is as follows:
$$EF = \frac{EDV - ESV}{EDV} \times 100\%$$
where EDV is the volume at ED, and ESV is the volume at ES. Due to the impracticality of measuring ventricular volume through a single-angle 2D video, researchers commonly employ the aforementioned methods to calculate ventricular parameters.
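The remaining parameters follow directly from the formulas above; a sketch, taking the long- and short-axis diameters of the closed region as inputs (how the axes are extracted from the mask, e.g., by ellipse fitting, is left open here):

```python
import math

def fractional_shortening(d_diastole, d_systole):
    """FS = (D_d - D_s) / D_d x 100, from the short-axis diameters at ED and ES."""
    return (d_diastole - d_systole) / d_diastole * 100.0

def ventricular_volume(d_long, d_short):
    """Prolate-spheroid volume (px^3) from the long- and short-axis diameters."""
    return 4.0 / 3.0 * math.pi * (d_long / 2.0) * (d_short / 2.0) ** 2

def ejection_fraction(edv, esv):
    """EF = (EDV - ESV) / EDV x 100."""
    return (edv - esv) / edv * 100.0
```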
In addition to generating a table of relevant parameters for the medaka ventricle, the system also produces a GIF illustrating the shape of the medaka ventricle, an electrocardiogram (ECG) GIF, and a 3D dynamic simulation GIF of the heartbeat. This allows users to gain a more visual understanding of the changes in ventricular status.

4. Experiment

4.1. Experimental Environment

The training was conducted in an environment running the Windows 10 operating system, using an NVIDIA RTX 2080 Ti GPU, an Intel Core (TM) i5-6500 CPU, and 40 GB of RAM. The deep learning environment for this experiment includes PyTorch 1.12.0, CUDA 11.3, and Python 3.8.

4.2. Experimental Setup

Following the widely adopted approach [24], we chose to pretrain our model on DAVIS2016, a Densely Annotated Video Segmentation dataset. The intention is to equip the model with the capability to extract features of prominent moving objects in videos before training on the medaka dataset.
The reason for selecting DAVIS2016 is its annotation approach, which distinguishes only between foreground and background through binary labeling, aligning closely with the nature of our task. Moreover, DAVIS is known for its high-quality, high-resolution annotations, making it a mainstream benchmark dataset for VOS tasks; this choice is expected to enhance the credibility of our model. The dataset has a video frame rate of 24 fps, a resolution of 1080p, and short video sequences lasting 2–4 s. It comprises a total of 50 video sequences (3455 frames), with 30 videos (2079 frames) in the training set and 20 videos (1376 frames) in the validation set. Detailed information about the DAVIS dataset can be found at https://davischallenge.org (accessed on 1 July 2017).
Before conducting neural network training, a series of image augmentation operations were applied to enhance the model’s generalization capability. Initially, we randomly selected a video from the dataset and extracted a consecutive sequence of 10 frames. For these 10 frames, we applied video inversion with a probability of 40%, horizontal and vertical flipping with a probability of 20%, and additional swapping with a probability of 20% for supplementary enhancement. Prior to inputting into the model, we computed the STA frame for these 10 frames, followed by random joint cropping to a size of 384 × 384. In cases where the cropping excluded any objects, we repeated the process until the foreground object was included. Additionally, random affine transformations and random balanced cropping were introduced. Similar to CFBI (Collaborative Video Object Segmentation by Foreground-Background Integration) [25], sequences with insufficient foreground pixel counts after cropping were excluded to prevent the model from biasing towards background objects.
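A sketch of this clip-level augmentation, written with torchvision functional transforms; the probabilities mirror the text, while the crop-retry logic is simplified and the “additional swapping” step is omitted because its exact definition is not given.

```python
import random

import torchvision.transforms.functional as TF

def augment_clip(frames, masks, crop=384):
    """Joint augmentation of a 10-frame clip: frames and masks are lists of tensors
    shaped (C, H, W) and (1, H, W), transformed identically."""
    if random.random() < 0.4:                       # temporal inversion, p = 0.4
        frames, masks = frames[::-1], masks[::-1]
    if random.random() < 0.2:                       # horizontal flip, p = 0.2
        frames = [TF.hflip(f) for f in frames]
        masks = [TF.hflip(m) for m in masks]
    if random.random() < 0.2:                       # vertical flip, p = 0.2
        frames = [TF.vflip(f) for f in frames]
        masks = [TF.vflip(m) for m in masks]

    # Joint random crop to 384 x 384, retried until the foreground object is kept.
    _, h, w = frames[0].shape
    for _ in range(10):
        top, left = random.randint(0, h - crop), random.randint(0, w - crop)
        cropped_masks = [TF.crop(m, top, left, crop, crop) for m in masks]
        if all(m.sum() > 0 for m in cropped_masks):
            cropped_frames = [TF.crop(f, top, left, crop, crop) for f in frames]
            return cropped_frames, cropped_masks
    return frames, masks  # fall back to the uncropped clip if no valid crop was found
```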
During the training phase, the batch size was set to 8, the initial learning rate was set to 1 × 10−4, and cross-entropy loss was utilized with the Adam optimizer [26]. Furthermore, all main encoders were frozen, and all batch normalization layers [27] were disabled.
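A sketch of the corresponding optimizer setup; `model` and its `encoder` attribute are placeholders for the STAVOS implementation.

```python
import torch
import torch.nn as nn

def configure_training(model, lr=1e-4):
    """Adam with cross-entropy loss; the main encoder is frozen and all
    batch normalization layers are disabled (kept in eval mode, no gradient)."""
    for p in model.encoder.parameters():            # 'encoder' attribute name is an assumption
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.eval()                                # freeze running statistics
            for p in m.parameters():
                p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    criterion = nn.CrossEntropyLoss()
    return optimizer, criterion
```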

4.3. Training Results

We trained the TBD model for 400 epochs with experimental settings identical to those of STAVOS. The training results are illustrated in Figure 4. As depicted, STAVOS and TBD show comparable performance during the early epochs; from around the 80th epoch onward, STAVOS consistently outperformed TBD, gradually establishing a stable lead. TBD stabilizes around 0.901, while STAVOS stabilizes around 0.952. Notably, TBD’s loss decreases faster than that of STAVOS in the initial 100 epochs, and after 100 epochs both models converge to a stable loss value of approximately 0.020.
From these training results, however, the advantage of STAVOS is less pronounced than expected. This is because, when the batch size is greater than 1, multiple batches of images are concatenated into a five-dimensional tensor of shape (B, L, H, W, C), where B is the batch size. In that case, the computed STA frame is the STA frame of the concatenated image sequence from multiple batches rather than the STA frame of each individual video sequence, which contradicts the original intention of the STA frame. Therefore, reducing the batch size is expected to improve the performance of STAVOS.

4.4. Ablation Experiments

To investigate the impact of batch size on STAVOS, we conducted 400 training epochs with different batch sizes, performing validation every 10 epochs. The results of the validation set are shown in Figure 5.
From Figure 5, we observe that the batch size affects the training performance of STAVOS, with performance decreasing as the batch size increases. Training performance is optimal with a batch size of 1, whose final J&F is approximately 0.28 higher than that with a batch size of 8. This validates the inference from the previous section and provides additional evidence for the effectiveness of the STA module in model training. Consequently, for subsequent training, we set the batch size to 1.

4.5. Comparative Experiments

Due to the limited size of the medaka dataset, both the test and validation sets consist of only 1–2 videos. Consequently, the dataset lacks diversity, resulting in experimental outcomes being significantly dependent on the specific videos used in the dataset. Therefore, we chose to assess the framework’s performance using the DAVIS2016 dataset instead of the medaka dataset. However, the test set for DAVIS2016 is not publicly available and is only provided for online testing during challenge competitions. Therefore, in this experiment, we can only use the DAVIS2016 validation set as our evaluation benchmark. Due to DAVIS2016 being a dataset for VOS and U-Net being a static image segmentation network, U-Net cannot participate in the current comparison.
We conducted 400 rounds of training on the DAVIS2016 training set for both the TBD and STAVOS models. Subsequently, we tested them on the DAVIS validation set, consisting of 20 video sequences. The results are presented in Table 2, where STAVOS outperforms TBD in 15 video sequences, leading to an overall improvement in J&F by 0.032. Especially in video sequences where TBD exhibits poorer segmentation performance, such as “breakdance” and “scooter-black”, STAVOS demonstrates a significantly superior segmentation effect. This is because, for more complex video objects, the STA module can exert greater advantages. It enables more precise segmentation of edges, which aligns with the original design purpose of STAVOS.

5. Results

5.1. Qualitative Results

Based on the pre-trained weights of TBD and STAVOS, we continued training for an additional 400 epochs on the Data-medaka-Ventral and Data-medaka-Lateral-right training sets, with the learning rate adjusted to 1 × 10−5. Similarly, we conducted 400 epochs of training using U-Net [28] on the entire medaka dataset.
We conducted a qualitative comparison of our method with U-Net and TBD. As shown in Figure 6, it can be observed that U-Net can only identify the central position of the medaka ventricle. This limitation arises from U-Net being a static image segmentation network incapable of utilizing information features from consecutive frames in the video, resulting in poor segmentation performance. TBD shows a significant improvement in segmentation compared to U-Net, but the accuracy of segmenting the ventricular edge is not high enough. In contrast, STAVOS achieves further improvements in segmentation. It not only accurately identifies the position of the ventricle but also achieves precise contour segmentation. This is attributed to the STA module’s ability to effectively extract edge features of the target.

5.2. Quantitative Results

We conducted a quantitative comparison of our method with U-Net and TBD. As shown in Table 3, it can be observed that U-Net has the poorest segmentation performance, with a mean accuracy of approximately 0.521. TBD demonstrates relatively good segmentation results, achieving a mean accuracy of around 0.875. In contrast, STAVOS achieves a mean accuracy of approximately 0.913. This indicates that our STAVOS outperforms existing methods.

5.3. Automated System Results

We applied the optimal weights trained by STAVOS to our automated system. For inputting a medaka video, the system produces a visual segmentation result GIF, a ventricular parameter table, an ECG GIF, and a 3D dynamic simulation GIF of heartbeats. We selected outputs from two videos (R0039, N0080) for presentation, as shown in Table 4 and Figure 7.
We manually measured the heartbeat counts of medaka in the medaka dataset and compared them with the counts calculated by the automated system, as shown in Figure 8. Among the 21 videos, our system achieved complete accuracy in 15 videos, and for the incorrect predictions, the errors were consistently within 1 count. This demonstrates that our automated system can accurately predict the heartbeat of medaka.

6. Discussion

Our method achieves precise segmentation of the medaka ventricle and calculates its relevant parameters. It should be noted that parameters such as EF and FS are merely estimates, as the volume of an irregular 3D object cannot be calculated accurately from 2D images taken from a single perspective. Although EF and FS are estimates, these parameters enable monitoring of the developmental state of medaka. Importantly, our method does not require expensive instruments [5], fluorescent labeling [8,9], or transgenic cultivation [10,11], significantly reducing experimental costs and manual effort. However, the limited quantity and quality of the source videos result in insufficient diversity in the dataset. We therefore hope that more biologists will share videos of medaka, as this would significantly advance computer vision research on medaka.

7. Conclusions

Firstly, we created a video object segmentation task dataset with over 7000 annotated images, filling a significant gap and providing a robust foundation for calculating medaka ventricle parameters.
Secondly, to address the issue of edge blurring in videos, we proposed a spatiotemporal attention module for VOS that enhances the performance of semi-supervised VOS models. Our model achieved a J&F improvement of 0.032 over the state-of-the-art TBD model on the DAVIS2016 validation set. On the medaka test set, it demonstrated a mean accuracy improvement of 0.392 over the traditional U-Net model and a further improvement of 0.038 over TBD.
Finally, we designed an automated system capable of automatically segmenting the medaka ventricle and providing relevant ventricular parameters, including heart rate (HR), stroke volume (SV), ejection fraction (EF), etc. Additionally, the system outputs dynamic segmentation result GIFs, a 3D dynamic simulation GIF of heartbeats, and an ECG GIF. This provides researchers with convenience, enabling them to visually observe the ventricular status of medaka.
To foster academic exchange and further research, we have open-sourced all the resources used in this paper, including the dataset, model code, pre-trained weights, and the relevant code for the automated system. Interested researchers can access these resources at https://github.com/KuiZeng/STAVOS (accessed on 1 December 2023).

Author Contributions

Conceptualization, S.X. and M.C.; methodology, K.Z.; software, K.Z. and D.S.; validation, K.Z., S.X. and D.S.; formal analysis, K.Z. and S.X.; data curation, S.X. and K.Z.; writing—original draft preparation, K.Z.; writing—review and editing, K.Z., S.X. and M.C.; supervision, M.C.; project administration, S.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Research and Development Planning in Key Areas of Guangdong Province (No. 2021B0202070001) and The Bioinformatics Research and Database Construction of Antifreeze Genes in Fish (No. A2-2006-21-200208).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset and code related to this article are available at https://github.com/kuizeng/stavos (accessed on 1 December 2023) and can also be requested from the author via email: [email protected].

Acknowledgments

The authors are very grateful to Jakob Gierten of the German Centre for Organismal Studies at Heidelberg University for publicly releasing the recorded medaka heartbeat video data.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, J.; Cao, H. Zebrafish and Medaka: Important Animal Models for Human Neurodegenerative Diseases. Int. J. Mol. Sci. 2021, 22, 10766. [Google Scholar] [CrossRef]
  2. Cui, M.; Su, L.; Zhang, P.; Zhang, H.; Wei, H.; Zhang, Z.; Zhang, X. Zebrafish Larva Heart Localization Using a Video Magnification Algorithm. In Proceedings of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA), Beijing, China, 13–16 October 2020; pp. 1071–1076. [Google Scholar] [CrossRef]
  3. Puybareau, E.; Genest, D.; Barbeau, E.; Léonard, M.; Talbot, H. An automated assay for the assessment of cardiac arrest in fish embryo. Comput. Biol. Med. 2017, 81, 32–44. [Google Scholar] [CrossRef] [PubMed]
  4. Kang, C.-P.; Tu, H.-C.; Fu, T.-F.; Wu, J.-M.; Chu, P.-H.; Chang, D.T.-H. An automatic method to calculate heart rate from zebrafish larval cardiac videos. BMC Bioinform. 2018, 19, 169. [Google Scholar] [CrossRef] [PubMed]
  5. De Luca, E.; Zaccaria, G.M.; Hadhoud, M.; Rizzo, G.; Ponzini, R.; Morbiducci, U.; Santoro, M.M. ZebraBeat: A flexible platform for the analysis of the cardiac rate in zebrafish embryos. Sci. Rep. 2014, 4, 4898. [Google Scholar] [CrossRef]
  6. Krishna, S.; Chatti, K.; Galigekere, R.R. Automatic and Robust Estimation of Heart Rate in Zebrafish Larvae. IEEE Trans. Autom. Sci. Eng. 2018, 15, 1041–1052. [Google Scholar] [CrossRef]
  7. Guo, Y.; Xiong, Z.; Verbeek, F.J. An efficient and robust hybrid method for segmentation of zebrafish objects from bright-field microscope images. Mach. Vis. Appl. 2018, 29, 1211–1225. [Google Scholar] [CrossRef] [PubMed]
  8. Krämer, P.; Boto, F.; Wald, D.; Bessy, F.; Paloc, C.; Callol, C.; Letamendia, A.; Ibarbia, I.; Holgado, O.; Virto, J.M. Comparison of Segmentation Algorithms for the Zebrafish Heart in Fluorescent Microscopy Images. In Advances in Visual Computing; Lecture Notes in Computer Science; Bebis, G., Boyle, R., Parvin, B., Koracin, D., Kuno, Y., Wang, J., Pajarola, R., Lindstrom, P., Hinkenjann, A., Encarnação, M.L., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5876, pp. 1041–1050. [Google Scholar] [CrossRef]
  9. Bakis, I.; Sun, Y.; Elmagid, L.A.; Feng, X.; Garibyan, M.; Yip, J.; Yu, F.Z.; Chowdhary, S.; Fernandez, E.; Cao, J.; et al. Methods for dynamic and whole volume imaging of the zebrafish heart. Dev. Biol. 2023, 504, 75–85. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, B.; Pas, K.E.; Ijaseun, T.; Cao, H.; Fei, P.; Lee, J. Automatic Segmentation and Cardiac Mechanics Analysis of Evolving Zebrafish Using Deep Learning. Front. Cardiovasc. Med. 2021, 8, 675291. [Google Scholar] [CrossRef] [PubMed]
  11. Akerberg, A.A.; Burns, C.G.; Nguyen, C. Deep learning enables automated volumetric assessments of cardiac function in zebrafish. Dis. Model. Mech. 2019, 12, dmm.040188. [Google Scholar] [CrossRef] [PubMed]
  12. Gao, M.; Zheng, F.; Yu, J.J.Q.; Shan, C.; Ding, G.; Han, J. Deep learning for video object segmentation: A review. Artif. Intell. Rev. 2023, 56, 457–531. [Google Scholar] [CrossRef]
  13. Cho, S.; Lee, H.; Lee, M.; Park, C.; Jang, S.; Kim, M.; Lee, S. Tackling Background Distraction in Video Object Segmentation. In Computer Vision—ECCV 2022; Lecture Notes in Computer Science; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer Nature: Cham, Switzerland, 2022; Volume 13682, pp. 446–462. [Google Scholar] [CrossRef]
  14. Puybareau, E.; Talbot, H.; Leonard, M. Automated heart rate estimation in fish embryo. In Proceedings of the 2015 International Conference on Image Processing Theory, Tools and Applications (IPTA), Orleans, France, 10–13 November 2015; pp. 379–384. [Google Scholar] [CrossRef]
  15. Gierten, J.; Pylatiuk, C.; Hammouda, O.; Schock, C.; Stegmaier, J.; Wittbrodt, J.; Gehrig, J.; Loosli, F. Automated high-throughput heart rate measurement in medaka and zebrafish embryos under physiological conditions. Dev. Biol. 2019, preprint. [Google Scholar] [CrossRef]
  16. Perazzi, F.; Pont-Tuset, J.; McWilliams, B.; Van Gool, L.; Gross, M.; Sorkine-Hornung, A. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 724–732. [Google Scholar] [CrossRef]
  17. Yin, Z.; Zheng, J.; Luo, W.; Qian, S.; Zhang, H.; Gao, S. Learning to Recommend Frame for Interactive Video Object Segmentation in the Wild. arXiv 2021, arXiv:2103.10391. [Google Scholar]
  18. Han, L.; Tian, Y.; Qi, Q. Research on edge detection algorithm based on improved sobel operator. MATEC Web Conf. 2020, 309, 03031. [Google Scholar] [CrossRef]
  19. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  20. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  21. Cho, S.; Lee, H.; Kim, M.; Jang, S.; Lee, S. Pixel-Level Bijective Matching for Video Object Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 1453–1462. [Google Scholar] [CrossRef]
  22. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Lecture Notes in Computer Science; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. [Google Scholar] [CrossRef]
  23. Pont-Tuset, J.; Perazzi, F.; Caelles, S.; Arbeláez, P.; Sorkine-Hornung, A.; Van Gool, L. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv 2018, arXiv:1704.00675. [Google Scholar]
  24. Hu, L.; Zhang, P.; Zhang, B.; Pan, P.; Xu, Y.; Jin, R. Learning Position and Target Consistency for Memory-based Video Object Segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 4142–4152. [Google Scholar] [CrossRef]
  25. Yang, Z.; Wei, Y.; Yang, Y. Collaborative Video Object Segmentation by Foreground-Background Integration. arXiv 2020, arXiv:2003.08333. [Google Scholar]
  26. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  27. Ioffe, S.; Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv 2015, arXiv:1502.03167. [Google Scholar]
  28. Schutera, M.; Just, S.; Gierten, J.; Mikut, R.; Reischl, M.; Pylatiuk, C. Machine Learning Methods for Automated Quantification of Ventricular Dimensions. Zebrafish 2019, 16, 542–545. [Google Scholar] [CrossRef]
Figure 1. Schematic diagram and video sample display of the medaka heart. (a) Ventral view of the medaka heart; (b–d) video captures of the ventral view of the medaka heart. (e) Lateral-right view of the medaka heart; (f–h) video captures of the lateral-right view of the medaka heart.
Figure 2. Architecture of our proposed method. The STA module is our proposed spatiotemporal attention mechanism module, and the TBD module is the core module of the TBD model. “pre_frames” and “sub_frames” represent, respectively, the video sequences before and after the STA frame. During the training phase, the model outputs “pre_scores” and “sub_scores”. During the validation or testing phase, the model produces “pre_masks” and “sub_masks”. “S&T diversity” represents matching templates for spatial and temporal diversity.
Figure 3. The flowchart of our automated system. The input is a medaka video with the ventricular contour marked in one frame. The system outputs three GIFs and a table. “seg” is an abbreviation for segmentation.
Figure 4. Comparison of Training Loss and IoU Curves between TBD and STAVOS Models.
Figure 5. Validation set J&F curves during training with different batch sizes. “J&F” represents the average values of Jaccard (J) Per-Sequence and Boundary (F) Per-Sequence for the entire validation set.
Figure 6. Qualitative Comparison of STAVOS with U-Net and TBD. The blue curve represents the true contour of the ventricle, while the red curve represents the predicted ventricular contour. ED stands for end-diastole, and ES stands for end-systole. N denotes the Data-medaka-Lateral-right Dataset, and R denotes the Data-medaka-Ventral Dataset.
Figure 7. Screenshot of ECG and 3D Heartbeat Simulation. (a) Screenshots of the ECG for N0080; (b) Screenshots of the ECG for R0039; (c) Screenshots of the 3D dynamic simulation at ED for R0039; (d) Screenshots of the 3D dynamic simulation at ES for R0039.
Figure 8. Comparison Line Chart of Actual and Predicted Heartbeat Counts.
Table 1. Dataset quantity information. “better”: better quality data; “lower”: lower quality data. “12 (2493)”: 12 videos with 2493 frames of images.

              Data-Medaka-Lateral-Right (N)      Data-Medaka-Ventral (R)
              Better       Lower                 Better       Lower
Train set     9 (1870)     7 (1455)              6 (996)      5 (830)
Val set       1 (208)      1 (207)               1 (166)      1 (166)
Test set      2 (415)      2 (514)               2 (332)      1 (166)
Total         12 (2493)    10 (2076)             9 (1494)     7 (1162)
Table 2. Segmentation performance of TBD and STAVOS on video sequences from the DAVIS2016 validation set.

Video Sequences        J&F of TBD    J&F of STAVOS
blackswan              0.954         0.955
bmx-trees              0.622         0.698
breakdance             0.617         0.741
camel                  0.838         0.839
car-roundabout         0.801         0.829
car-shadow             0.938         0.920
cows                   0.949         0.951
dance-twirl            0.717         0.780
dog                    0.943         0.946
drift-chicane          0.850         0.821
drift-straight         0.784         0.867
goat                   0.875         0.897
horsejump-high         0.840         0.767
kite-surf              0.711         0.790
libby                  0.887         0.878
motocross-jump         0.760         0.757
paragliding-launch     0.656         0.735
parkour                0.917         0.918
scooter-black          0.640         0.836
soapbox                0.817         0.827
mean of all videos     0.806         0.838
Table 3. The accuracy of U-Net, TBD, and STAVOS on the medaka test set.

                 U-Net     TBD       STAVOS
IoU of N         0.490     0.789     0.863
mIoU of N        0.534     0.891     0.930
IoU of R         0.496     0.883     0.908
mIoU of R        0.565     0.937     0.951
mean accuracy    0.521     0.875     0.913
Table 4. Ventricular parameters for R0039 output by the automated system. “ED”: end-diastole; “ES”: end-systole; “FS”: fractional shortening; “EDV”: the volume at ED; “ESV”: the volume at ES; “SV”: stroke volume; “EF”: ejection fraction; “HR”: heart rate. Due to the original table’s extensive length, the presented table is a truncated version, including only the first 10 rows.

ES Frame  ED Frame  Minor FS [%]  Major FS [%]  EDV [px3]    ESV [px3]    SV [px3]     EF [%]  HR [bpm]
2         4         17.85142      11.90518      1,375,722    2,314,089    938,367.3    40.5    106.25
10        11        26.07969      9.845666      1,341,991    2,724,177    1,382,186    50.7    93.75
17        22        9.965191      15.01318      1,414,451    2,053,125    638,673.9    31.1    114.2857
25        28        16.94796      8.436051      1,362,875    2,157,898    795,022.8    36.8    106.25
33        35        20.17185      9.791008      1,439,276    2,503,698    1,064,423    42.5    106.25
40        43        19.40398      15.18253      1,353,724    2,457,071    1,103,347    44.9    106.25
48        50        22.30603      1.42044       1,536,526    2,582,133    1,045,606    40.4    100
56        58        18.81942      8.56368       1,480,583    2,457,025    976,442.4    39.7    100
64        66        10.00727      6.485503      1,715,452    2,265,086    549,634.3    24.2    95
72        75        14.61971      16.09441      1,401,323    2,291,037    889,713.9    38.8    112.5