1 Introduction

In a four-dimensional (4D) light-field (LF) system, 4D information about the angular and spatial characteristics of light can be obtained simultaneously [12, 23]. In LF cameras, an optical system with a microlens array (MLA) is placed in front of the image sensor [31, 32]. LF cameras can acquire 4D information for a three-dimensional (3D) space using the MLA arranged between an objective lens and image sensor. The 4D data acquired through an LF camera can be reconstructed into an orthographic-view image (OVI) or multi-focus image, such as in a multi-camera system, for 3D visualization [10, 20, 30, 43]. Moreover, as 3D depth and shape information can be obtained through post-processing, many studies on computer vision have used LF imaging systems [7, 15, 25, 47, 60].

As a method for obtaining 3D information on a specimen, an LF microscope (LFM) was designed and developed to acquire LF images by applying an MLA to an optical microscope [3, 24, 42]. In a general optical microscope system, although an enlarged image of a specimen can be obtained, only two-dimensional (2D) information can be obtained. To obtain multi-focus or multi-view images for analyzing 3D information, several images must be captured while changing the settings according to the intended purpose. Scan-based systems, such as confocal microscopy, are widely used to obtain 3D image information of an object from the microscopic field; however, data acquisition in such systems is time consuming [29]. Conversely, LFM can obtain an elemental image array (EIA) containing spatial and angular information simultaneously through a single capture. This information can be analyzed and visualized by converting it into high-level information (multi-view, multi-focus, and depth) about the specimen [43].

However, in LF imaging systems that place an MLA in front of the image sensor, such as in LFM, there is an inevitable trade-off between the spatial resolution and the number of viewpoints [22]. This is a critical disadvantage of LF cameras based on MLAs. Multi-view images are simultaneously projected onto a limited image sensor, so the spatial resolution per viewpoint decreases as the number of viewpoints increases, i.e., as the angular resolution increases. To address this problem, LF image super-resolution (LFSR), which improves the spatial resolution while maintaining the number of viewpoints, is being extensively studied in the field of LF imaging [1, 16, 19, 21, 38, 56, 58].

Nonlearning-based LFSR methods enhance the resolution of LF images by analyzing the geometric and mathematical modeling information of 4D LF structures or utilizing overlapping information from different viewpoint images [1, 19, 38]. Although these methods are generally applicable and stable, the improvement in the resolution is limited and the image quality is poor. With recent advancements in deep learning, which is a field of brain-inspired artificial intelligence, these methods have been studied in various fields to realize high-level intelligence, high accuracy, high robustness, and low power consumption [44,45,46, 48,49,50,51,52,53]. In particular, increased attention has been paid to LFSR methods applying the convolutional neural network (CNN) model, and these methods have yielded better results compared with the conventional ones [16, 21, 56, 58].

In general, these learning-based image super-resolution (SR) methods are implemented by training an appropriately designed CNN model using external training datasets for image enhancement [2]. However, deep learning-based methods trained on these external datasets cannot avoid domain gap (shifting) problems [5, 39]. If the domain gap (i.e., the difference between real and training data) is large, the performance of these learning-based methods may be poorer than that of the existing methods. Furthermore, LF datasets (for training) are difficult to obtain compared with general image datasets. To address these problems, LF zero-shot learning SR (ZSSR) has been proposed [6]. The zero-shot learning method resolves domain gap problems by learning only the input image without any external dataset [39]. However, the training datasets used in most LFSR studies are images captured using general LF cameras, such as Lytro or Raytrix, and not LFM [35, 37].

Fig. 1 Conceptual diagram of LFM

The image resolution obtained with LFM in previous studies has been relatively low [19, 21]. Moreover, owing to the optical characteristics of the microscope system, LFM typically acquires low-quality LF images with vignetting effects caused by a poor illumination environment, distortion caused by the MLA, and deterioration caused by lens contamination. These LF images have a large domain gap compared with general LF images; thus, LFSR models trained with external LF datasets exhibit performance limitations. Similarly, it is not effective to apply zero-shot models that learn the internal features of the input image, such as ZSSR [39], to LFM data, because the LF images captured by LFM have a low resolution and exhibit severe deterioration. Therefore, an appropriate reference image should be used to enhance the resolution.

An LFM system using an existing optical microscope can simultaneously acquire high-resolution 2D images and LF images of a sample depending on the presence or absence of MLAs. Herein, we propose an LF one-shot learning method to simultaneously improve the resolution and quality of LFM images by referring to high-quality 2D images obtained without MLAs. A CNN model was trained on a dataset with a high-resolution 2D image as the ground truth (GT) and a central view LF image as the input data. Several novel learning techniques were proposed for the one-shot LF learning process to prevent model overfitting based only on a central LF image. When all the sub-images of the LF image are input into the trained CNN model, they are converted into a high-quality image, such as a 2D image, while retaining the content of the input image and improving the resolution. Consequently, the details of the specimens lost during the EIA capture process can be restored, noise can be removed, and thus high-quality LF images can be obtained.

2 Background of LF microscopy

Figure 1 shows the general structure of LFM using an infinity-corrected optical system [18, 34]. The specimen is enlarged via a 4f-type infinity-corrected optical system, including the objective and tube lenses, onto an intermediate image plane. As shown in the path of optical axis 1 in Fig. 1, the EIA is captured through an MLA from the enlarged visualization of the specimen. Notably, object points can be imaged through multiple-element lenses, and different viewpoints and depth-of-field information can be encoded in corresponding multiple-element images [18]. From the captured EIA, two types of 3D visualizations can be reconstructed: an OVI (or LF image), which is a multi-view image that expresses the parallax information of the specimen [33], and depth slices reconstructed through the computational integral imaging reconstruction method, where each depth slice represents different depth plane information [14, 55]. To analyze the characteristics of the LFM, two main factors should be considered: lateral and axial resolutions (\(R_{LFM}\) and \(D_{LFM}\)), both of which depend on the numerical aperture of the MLA (\(NA_{MLA}\)) as follows:

$$\begin{aligned} R_{LFM} &= \frac{0.47\lambda }{NA_{MLA}} \times \frac{g}{z}\\ D_{LFM} &= \frac{1}{M^{2}}\left( \frac{\lambda }{{NA_{MLA}}^{2}} + \frac{z \times P_{S}}{g \times NA_{MLA}}\right) \end{aligned}$$
(1)

where \(\lambda \) is the wavelength of illumination, g is the gap between the sensor and MLA, z is the distance between the MLA and intermediate plane, M is the magnification, and \(P_{S}\) is the pixel pitch of the image sensor.
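For concreteness, the following Python sketch evaluates Eq. (1) for an illustrative set of parameters; the numerical values are placeholders and do not correspond to the prototype specifications listed later in Table 1.

```python
def lfm_resolutions(wavelength, na_mla, g, z, magnification, pixel_pitch):
    """Evaluate Eq. (1): lateral (R_LFM) and axial (D_LFM) resolution of the LFM."""
    r_lfm = (0.47 * wavelength / na_mla) * (g / z)
    d_lfm = (1.0 / magnification**2) * (
        wavelength / na_mla**2 + (z * pixel_pitch) / (g * na_mla)
    )
    return r_lfm, d_lfm

# Illustrative values only (not the specifications of the actual prototype).
r, d = lfm_resolutions(
    wavelength=550e-9,   # green illumination, 550 nm
    na_mla=0.05,         # numerical aperture of the MLA
    g=2.5e-3,            # gap between sensor and MLA, 2.5 mm
    z=2.5e-3,            # distance between MLA and intermediate plane, 2.5 mm
    magnification=10,    # magnification M
    pixel_pitch=4.0e-6,  # image-sensor pixel pitch, 4 um
)
print(f"R_LFM = {r * 1e6:.2f} um, D_LFM = {d * 1e6:.2f} um")
```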

Fig. 2 Training data acquisition for LF one-shot learning

Fig. 3 Model architecture and initialization process

The LFM resolution is fundamentally determined by the spatial density of the element lenses in the lens array [33]. One pixel is extracted from each elemental image when the EIA obtained through LFM is converted into an LF image through OVI conversion. Therefore, the number of element lenses directly determines the number of pixels in each reconstructed sub-image. As this number of pixels is very small, the resulting quality and resolution are unsatisfactory. Thus, a method for enhancing the resolution of reconstructed images without degrading their quality is required for comfortable viewing. Deep learning-based LFSR methods can mitigate these problems, but a model trained with an external dataset still exhibits performance limitations: such methods require large training datasets and cannot avoid performance degradation caused by domain gap problems or by the target resolution size. Unlike other complex LF systems [27, 36], our LF optical system has a simple configuration; thus, it can acquire both LF and 2D images depending on whether an MLA is present. In the absence of an MLA, 2D images can be acquired, as shown in the path of optical axis 2 in Fig. 1. Therefore, we developed an LF one-shot learning method that addresses the aforementioned performance limitations by exploiting the characteristics of the LFM system. Unlike conventional methods, the proposed method is trained with only two images captured from a sample, so no external datasets are required.

3 Proposed LF one-shot learning method

One-shot learning for SR is a technique for training a deep learning model using a single pair of training data [4, 54]. Herein, we adopted this technique to enhance the quality and resolution of LFM images and propose an LF one-shot learning algorithm in which an LF image and a 2D image of each sample are used as the training data. It is difficult to avoid overfitting in a model trained via one-shot learning, which can cause problems such as performance degradation and content loss when there is a difference between the input and training data. These issues pose considerable difficulties when applying one-shot learning to LF images, where each sub-image captures distinct viewpoint information. However, the proposed method resolves these training issues and is suitable for LF images. The core idea of the proposed algorithm is to use only a pair of central-viewpoint images to train the model to perform only a 2D image-like style transformation while retaining the viewpoint information of the other sub-images. This solves the overfitting problem and improves the image quality.

The proposed method involves three major steps. First, high-resolution 2D and LF images were acquired in pairs to set up the training dataset. This training dataset was used to train the proposed model via one-shot learning. Second, the proposed learning techniques and loss functions were used for LF one-shot learning. Thus, the model was trained such that it can be applied to other sub-images without overfitting the central LF image. Third, all the sub-images of the LF image were input into the trained model; detailed restoration and noise removal were performed simultaneously, as in the 2D image; and an output image with an improved resolution was obtained.

3.1 Training data acquisition

A training data pair for one-shot learning was configured using the optical characteristics of the proposed LFM system. First, the EIA was obtained by placing the MLA in the frame and capturing an image, and another image was captured without the MLA to obtain a high-resolution 2D image. The EIA was then converted into a multi-view LF image. The high-resolution 2D image and central LF image (the central viewpoint image among the sub-images) were the same as the front-view image; therefore, it was assumed that the fields of view of these two images matched, and they were used as a training data pair. In the proposed method, the input image was resized to the target size in advance, as shown in Fig. 2. Each sub-image was resized to the target resolution through bicubic interpolation, and the central LF image was resized further to utilize image crop–based data augmentation [40]. Similarly, the high-resolution 2D image was resized to be used as the GT for the training dataset. The change in the image size in this process could be adjusted according to the target resolution.
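The snippet below sketches this preparation step in PyTorch under the assumption that both images are tensors of shape (C, H, W) with values in [0, 1]; the target size, margin, and function names are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def prepare_training_pair(central_lf, hr_2d, target=512, margin=64):
    """Bicubic-resize the central LF sub-image and the high-resolution 2D image
    so that matching crops of size `target` can be drawn from both."""
    size = (target + margin, target + margin)   # resize beyond the target size
    lr = F.interpolate(central_lf.unsqueeze(0), size=size,
                       mode="bicubic", align_corners=False).squeeze(0)
    gt = F.interpolate(hr_2d.unsqueeze(0), size=size,
                       mode="bicubic", align_corners=False).squeeze(0)
    return lr, gt

def random_crop_pair(lr, gt, target=512):
    """Draw an identical random crop from the input and the GT (crop-based augmentation)."""
    _, h, w = lr.shape
    top = torch.randint(0, h - target + 1, (1,)).item()
    left = torch.randint(0, w - target + 1, (1,)).item()
    return (lr[:, top:top + target, left:left + target],
            gt[:, top:top + target, left:left + target])
```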

Fig. 4 Overlapping sub-images. a Original LF image, b result of the overfitting model, and c result of the proposed model

Fig. 5 Proposed LF one-shot learning framework for LF image enhancement

Fig. 6 Intermediate result obtained using a L1 loss, b L2 loss, c smooth L1 loss, and d MS-SSIM loss as style loss

3.2 Model architecture and initialization

In the proposed method, fully convolutional architectures based on a deep image prior [41] were used in the CNN model. As shown in Fig. 3, the proposed model comprised an encoder–decoder (hourglass) architecture with skip connections. The encoder model \(\phi (\cdot )\) comprised six blocks, with a convolution layer, batch normalization layer, and leaky ReLU function as one block. The decoder model \(\varphi (\cdot )\) also comprised six blocks, with a pixel shuffle layer, concat function, convolution layer, batch normalization layer, and leaky ReLU function as one block. The entire model was trained to extract latent data [26] in which the unique features of the input image and viewpoint information were preserved through the encoder model \(\phi (\cdot )\). Further, the latent data were restructured as a 2D high-resolution image through the decoder model \(\varphi (\cdot )\). This is generally considered as a suitable method for converting only the style of an input image while maintaining its unique information, as in the style transfer model [11, 17].
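A minimal PyTorch sketch of such an hourglass model is given below. Only the overall structure follows the description above (six stride-2 encoder blocks, six pixel-shuffle decoder blocks with skip connections, and separate paths for \(\phi(\cdot)\) and \(\varphi(\cdot)\)); the channel width, kernel sizes, activation slope, and output activation are assumptions.

```python
import torch
import torch.nn as nn

def enc_block(cin, cout):
    # One encoder block: stride-2 convolution -> batch norm -> leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

class DecBlock(nn.Module):
    # One decoder block: pixel shuffle (x2) -> concat skip -> conv -> batch norm -> leaky ReLU.
    def __init__(self, cin, skip_ch, cout):
        super().__init__()
        self.shuffle = nn.PixelShuffle(2)  # cin channels -> cin // 4 channels, 2x spatial
        self.body = nn.Sequential(
            nn.Conv2d(cin // 4 + skip_ch, cout, kernel_size=3, padding=1),
            nn.BatchNorm2d(cout),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x, skip=None):
        x = self.shuffle(x)
        if skip is not None:
            x = torch.cat([x, skip], dim=1)
        return self.body(x)

class HourglassNet(nn.Module):
    """Encoder phi(.) / decoder varphi(.) with skip connections (simplified sketch).
    Input spatial size should be divisible by 64 (e.g., 512 x 512)."""
    def __init__(self, ch=3, width=64):
        super().__init__()
        cins = [ch] + [width] * 5
        self.encoder = nn.ModuleList([enc_block(c, width) for c in cins])
        # The first five decoder blocks receive skip features; the last one does not.
        self.decoder = nn.ModuleList(
            [DecBlock(width, width, width) for _ in range(5)] + [DecBlock(width, 0, width)]
        )
        self.to_rgb = nn.Conv2d(width, ch, kernel_size=1)

    def encode(self, x):                      # phi(.)
        feats = []
        for block in self.encoder:
            x = block(x)
            feats.append(x)
        return x, feats                       # latent data Z and skip features

    def decode(self, z, feats):               # varphi(.)
        x = z
        skips = feats[-2::-1]                 # encoder features, deepest-but-one first
        for i, block in enumerate(self.decoder):
            x = block(x, skips[i] if i < len(skips) else None)
        return torch.sigmoid(self.to_rgb(x))  # output image in [0, 1] (assumption)

    def forward(self, x):
        z, feats = self.encode(x)
        return self.decode(z, feats)
```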

Fig. 7 a Process of cut loss, b result of applying cut loss

To implement the proposed LF one-shot learning, a dedicated procedure was used in the model initialization process. As shown in Fig. 3, one image was randomly selected from all the sub-images as the input image for model initialization. Then, as in autoencoder training, the model was trained using this input image as the GT. This training process minimized the reconstruction error between the original input and the reconstructed output, allowing the model to learn efficient and meaningful representations of the input image. For weight initialization, learning was performed for one-tenth of the total training iterations, and the model was thereby trained on the features of various sub-images. This initialization process induced the model to extract features that naturally preserve the information of various viewpoints before it was trained with the central LF image and the high-resolution 2D image.
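The initialization can be sketched as the warm-up loop below, assuming the HourglassNet sketch above and sub-images that have already been pre-upsampled to the target size. Sampling a different random sub-image at each step is one reading of the description; the Adam optimizer and MSE reconstruction error are assumptions consistent with, but not specified by, the text.

```python
import random
import torch
import torch.nn.functional as F

def initialize_model(model, sub_images, total_iters=3000, lr=1e-3):
    """Autoencoder-style warm-up: a randomly chosen sub-image serves as both
    input and target for one-tenth of the total training iterations."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(total_iters // 10):
        x = random.choice(sub_images).unsqueeze(0)   # pick one (C, H, W) view each step
        out = model(x)
        loss = F.mse_loss(out, x)                    # reconstruct the input itself
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```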

3.3 Loss function design for optimization

Figure 4(a) shows two overlapping diagonal sub-images from the original LF image, used as inputs of the proposed model, to illustrate the parallax. In the original LF image, there is a diagonal parallax, as indicated by the blue arrows. Figure 4(b) shows the difference between the two images output by a model trained in the general manner with only the central LF and 2D high-resolution images. The unique viewpoint information of the input sub-images was lost because of the overfitting induced by the central viewpoint image in the learning process. To avoid the overfitting problem in one-shot learning and retain the viewpoint information of the input sub-images while maintaining the style transformation performance, the proposed method sets a relatively large crop size for the image data augmentation and utilizes the enhancement loss. As shown in Fig. 5, data augmentation was applied to prevent overfitting during the training process. Both the input and GT images were randomly cropped to the target size from the resized images and used as training data covering parts of the image. Additionally, the model was trained on the entire image using the original training image.

Fig. 8 Display processing of converted high-quality LF image

Latent loss. Relatively large crop images were used for model training. Therefore, the latent loss \(L_{la}\), an image feature map–based loss function, was also used. In the proposed \(L_{la}\), the difference between the specific feature map extracted by the CNN model (an intermediate result and not the final feedforward result of the model) was reflected in the model training. \(L_{la}\) in Eq. (3) represents the mean square error between the latent data \(Z_{c}^{lr}\) extracted from the input image \(LF_{c}^{lr}\) and the latent data \(Z_{c}^{hr}\) extracted from the output image \(LF_{c}^{hr}\) using the encoder model \(\phi (\cdot )\). C, H, and W refer to the channel, height, and width of the latent data, respectively.

$$\begin{aligned} Z_{c}^{lr} = \phi (LF_{c}^{lr}); \quad LF_{c}^{hr} = \varphi (Z_{c}^{lr}); \quad Z_{c}^{hr} = \phi (LF_{c}^{hr}) \end{aligned}$$
(2)
$$\begin{aligned} L_{la} = \frac{1}{CHW}\left\Vert Z_{c}^{lr} - Z_{c}^{hr}\right\Vert _{2}^{2} \end{aligned}$$
(3)

\(Z_{c}^{lr}\) contains unique characteristics of the input image, including the viewpoint information. Therefore, as shown in Fig. 4(b), \(Z_{c}^{hr}\) loses viewpoint information owing to overfitting and differs considerably from the \(Z_{c}^{lr}\) of the input image, which retains the viewpoint information. \(L_{la}\) induces model training to minimize the feature-map error between the input and output images. Therefore, it performs a complementary role such that the unique viewpoint information of the input image can be maintained in the output image. Figure 4(c) shows the results of the proposed method. It can be observed that the output image is converted into a high-resolution 2D image and that the difference information is retained.
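Assuming the encode/decode interface of the model sketch above, the latent loss of Eqs. (2) and (3) can be written as follows; returning the intermediate output \(LF_{c}^{hr}\) alongside the loss is a convenience choice so that it can be reused by the style loss.

```python
import torch

def latent_loss(model, lf_lr):
    """Eqs. (2)-(3): mean squared error between the latent data of the input
    and the latent data re-extracted from the reconstructed output."""
    z_lr, feats = model.encode(lf_lr)              # Z_c^lr  = phi(LF_c^lr)
    lf_hr = model.decode(z_lr, feats)              # LF_c^hr = varphi(Z_c^lr)
    z_hr, _ = model.encode(lf_hr)                  # Z_c^hr  = phi(LF_c^hr)
    l_la = torch.mean((z_lr - z_hr) ** 2)          # 1/(CHW) * ||Z_c^lr - Z_c^hr||_2^2
    return l_la, lf_hr
```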

Fig. 9 A prototype of the proposed LFM

Style loss. Unlike the feature map–based \(L_{la}\), the style loss \(L_{st}\) is an error function between the final output and the GT. Therefore, \(L_{st}\) induces the model to convert an input image and produce an output resembling the high-resolution 2D image, i.e., the GT. General loss functions, such as the L1/L2 loss, are per-pixel loss functions that represent the error between the corresponding pixels of an output and the GT. As these per-pixel loss functions estimate errors at the pixel level, they are not suitable for the proposed system, whose training data structurally match the input image and GT but exhibit some pixel-level discrepancies owing to the MLA. Therefore, the proposed system requires a loss function that estimates the error in local-region units. Accordingly, the multiscale structural similarity index measure (MS-SSIM) loss, which is based on the image structure, was used herein as the style loss. It is defined as follows [59]:

$$\begin{aligned} \begin{aligned} SSIM(x,y)&= \frac{2\mu _{x}\mu _{y}+C_{1}}{\mu _{x}^{2}+\mu _{y}^{2}+C_{1}} \cdot \frac{2\sigma _{xy}+C_{2}}{\sigma _{x}^{2}+\sigma _{y}^{2}+C_{2}}\\&= l(x,y) \cdot cs(x,y) \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} MS\text {-}SSIM(x,y) = l_{M}^{\alpha }(x,y) \cdot \prod _{j=1}^{M}cs_{j}^{\beta _{j}}(x,y) \end{aligned}$$
(5)
$$\begin{aligned} L_{st} = 1 - MS\text {-}SSIM(LF_{c}^{hr},I^{hr}) \end{aligned}$$
(6)

where \(\mu _{x}\) and \(\mu _{y}\) are the means of x and y, respectively, \(\sigma _{x}^{2}\) and \(\sigma _{y}^{2}\) are the variances of x and y, respectively, \(\sigma _{xy}\) is the covariance between x and y, and \(C_{1}\) and \(C_{2}\) are variables that stabilize the division. \(L_{st}\) trains the model through the structural differences between the images rather than errors between the corresponding pixels. Figure 6 shows the intermediate results output by the model trained with each candidate loss function. As shown in Fig. 6, when the MS-SSIM loss is used in the proposed method, the learning convergence is faster and the visual style conversion performance is better than with the other per-pixel loss functions. Moreover, the proposed loss function is effective in terms of visual expression and the restoration of fine details.
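For illustration, the sketch below implements the single-scale SSIM of Eq. (4) with a uniform local window and uses it as a style loss. The actual method uses the multiscale variant of Eq. (5), which repeats the contrast-structure term over a dyadic image pyramid; the window type and stabilizing constants here are common defaults rather than values from the original implementation.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window_size=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM of Eq. (4) with a uniform local window; images in [0, 1]."""
    pad = window_size // 2
    ch = x.shape[1]
    win = torch.ones(ch, 1, window_size, window_size, device=x.device) / window_size ** 2
    mu_x = F.conv2d(x, win, padding=pad, groups=ch)
    mu_y = F.conv2d(y, win, padding=pad, groups=ch)
    var_x = F.conv2d(x * x, win, padding=pad, groups=ch) - mu_x ** 2
    var_y = F.conv2d(y * y, win, padding=pad, groups=ch) - mu_y ** 2
    cov_xy = F.conv2d(x * y, win, padding=pad, groups=ch) - mu_x * mu_y
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)    # luminance term l(x, y)
    cs = (2 * cov_xy + c2) / (var_x + var_y + c2)                # contrast-structure term cs(x, y)
    return (l * cs).mean()

def style_loss(lf_hr, gt_2d):
    # Eq. (6): L_st = 1 - (MS-)SSIM(LF_c^hr, I^hr); single-scale stand-in here.
    return 1.0 - ssim(lf_hr, gt_2d)
```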

Table 1 Device specifications of the proposed LFM

Enhancement loss. The enhancement loss \(L_{en}\) used for model training is a combination of \(L_{st}\) and \(L_{la}\) with a specific weight (\(\alpha _{st}\), \(\alpha _{la}\)), as shown in Eq. (7). Through \(L_{en}\), the model can be trained to transform an input image into an image similar to the reference image while preserving unique information in the input image.

$$\begin{aligned} L_{en} = \alpha _{st}L_{st} + \alpha _{la}L_{la} \end{aligned}$$
(7)
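In code, Eq. (7) is simply a weighted sum of the two losses; the default weights follow the hyperparameters reported in Section 3.4.

```python
def enhancement_loss(l_st, l_la, alpha_st=0.6, alpha_la=0.4):
    # Eq. (7): weighted combination of the style loss and the latent loss.
    return alpha_st * l_st + alpha_la * l_la
```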

Cut loss. When the model is trained with only \(L_{en}\), saturation may occur in the image area during the learning process, as shown in the left image of Fig. 7(b). If the saturation area is small compared with the image used for training, the effect on \(L_{st}\) is insignificant; therefore, there is a limit to the convergence, regardless of the number of training iterations. To solve this problem, a cut loss \(L_{cut}\) was proposed herein. As shown in Fig. 7(a), the images used for training were divided into 8 \(\times \) 8 array patches, the L1 loss between the corresponding patches in a total of 64 regions was obtained, and the value with the highest error became \(L_{cut}\). Further, the L1 loss value between the output and the n-th patch of GT is \(CL_{pn}\), and pn is the patch number. \(CL_{pn}\) is obtained from all patches (pn = 1–64), as shown in Eq. (8), and its highest value affects the learning process. \(L_{cut}\) is the same as in Eq. (9).

$$\begin{aligned} CL_{pn} = \frac{1}{CHW}\left\Vert P_{pn}^{out} - P_{pn}^{gt}\right\Vert _{1}, (pn = 1\text {--}64) \end{aligned}$$
(8)
$$\begin{aligned} L_{cut} = Max(CL_{pn}), (pn = 1\text {--}64) \end{aligned}$$
(9)

In general, if \(L_{cut}\) is utilized, even a small saturation area can have a considerable effect on model training. In this way, saturation in a certain area can be removed during the learning process. However, if the model is trained separately with only the patch-based \(L_{cut}\), blurring may occur in the patch boundary areas. Therefore, herein, \(L_{en}\) (for image restoration) and \(L_{cut}\) (for saturation removal) were used alternately during training. Adding \(L_{cut}\) to the learning process removed the saturation, as shown in Fig. 7(b).
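A direct sketch of Eqs. (8) and (9) is shown below, assuming output and GT tensors of shape (B, C, H, W) whose spatial size is divisible by 8.

```python
import torch
import torch.nn.functional as F

def cut_loss(output, gt, grid=8):
    """Eqs. (8)-(9): split output and GT into an 8 x 8 patch grid, compute the L1
    error of each of the 64 corresponding patch pairs, and keep the largest one."""
    _, _, h, w = output.shape
    ph, pw = h // grid, w // grid
    worst = output.new_tensor(0.0)
    for i in range(grid):
        for j in range(grid):
            po = output[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            pg = gt[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            worst = torch.maximum(worst, F.l1_loss(po, pg))   # CL_pn; keep the maximum
    return worst
```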

Fig. 10 Specimens (a mini gear, seedpod, chip resistor, bud, grass seed 1, and grass seed 2) of the a 2D image, b EIA, c LF image, and d selected sub-images for the experiment

3.4 Training process and hyperparameters

The following hyperparameters were used for training: learning rate = 1e-3, weight of style loss \(\alpha _{st}\) = 0.6, weight of latent loss \(\alpha _{la}\) = 0.4, initialization iteration = 300, total iteration = 3000, image crop ratio = 9:1, and cut loss ratio = 1:8. The overall model training process is depicted in Algorithm 1.

Algorithm 1 Proposed LF one-shot learning process
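Combining the sketches above (data preparation, model, and losses), the main training loop might look as follows. Interpreting the 9:1 crop ratio as nine cropped iterations per full-image iteration and the 1:8 cut-loss ratio as one \(L_{cut}\) step per eight steps is an assumption, as is the use of the Adam optimizer.

```python
import torch

def train_one_shot(model, lf_central, gt_2d, iters=3000, lr=1e-3,
                   alpha_st=0.6, alpha_la=0.4, full_every=10, cut_every=8):
    """Sketch of the main one-shot loop. lf_central and gt_2d are (C, H, W) tensors
    produced by prepare_training_pair(); the schedules are one plausible reading
    of the reported crop and cut-loss ratios."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for it in range(iters):
        if it % full_every == 0:
            x, y = lf_central, gt_2d                     # occasional full-image step
        else:
            x, y = random_crop_pair(lf_central, gt_2d)   # crop-based augmentation
        x, y = x.unsqueeze(0), y.unsqueeze(0)
        l_la, out = latent_loss(model, x)
        if it % cut_every == 0:
            loss = cut_loss(out, y)                      # saturation-removal step
        else:
            loss = enhancement_loss(style_loss(out, y), l_la, alpha_st, alpha_la)  # Eq. (7)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```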

3.5 Model test and display

Fig. 11 a Input sub-image; image-enhancement results of b bicubic, c SRMD, d bm_pca_rr_LFSR, e ZSSR, f one-shot ZSSR, and g proposed method

When model training was completed with the central LF image and high-resolution 2D image, all the pre-upsampled sub-images of the LF image were input into the trained model to be transformed into high-quality images, as shown in Fig. 8. The converted images were rearranged for each viewpoint, similar to the original LF image, and the process proposed for enhancing the low-quality LF image using one-shot learning was completed. Through this process, resolution enhancement, noise removal, and specimen detail restoration based on a reference image were simultaneously performed, resulting in an LF image quality similar to that of a high-resolution 2D image used as the reference image. The original LF image, pre-upsampled sub-images for each viewpoint, and converted sub-images were output in the display program.
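The test stage can be sketched as a simple loop over the pre-upsampled sub-images, assuming the model sketch above; the bicubic pre-upsampling and [0, 1] clamping mirror the pipeline described here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def enhance_light_field(model, sub_images, target=512):
    """Feed every pre-upsampled sub-image through the trained model and
    return the enhanced views in their original viewpoint order."""
    model.eval()
    enhanced = []
    for view in sub_images:                                   # list of (C, h, w) views
        x = F.interpolate(view.unsqueeze(0), size=(target, target),
                          mode="bicubic", align_corners=False)
        enhanced.append(model(x).squeeze(0).clamp(0, 1).cpu())
    return enhanced
```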

4 Experiment results and analysis

The proposed LFM comprises an infinity-corrected optical system, an MLA, a digital single-lens reflex camera, and a PC for processing and displaying the acquired microscopic images, as shown in Fig. 9. The detailed device specifications of the proposed LFM are listed in Table 1.

Six different objects were utilized as experimental specimens: a mini gear, seedpod, chip resistor, bud, grass seed 1, and grass seed 2 (Fig. 10). For each specimen, 2D images and EIAs (Fig. 10(a) and (b)) were acquired with a resolution of 4000 \(\times \) 4000 pixels. The LF image (Fig. 10(c)) was generated by OVI reconstruction from the EIA and comprised 53 \(\times \) 53 directional views. The size of each sub-image was 76 \(\times \) 76 pixels. In this experiment, the region of interest was set as shown in Fig. 10(c), and 9 \(\times \) 9 (i.e., 81) sub-images were selected at regular intervals around the central LF image. The selected images were reconstructed as shown in Fig. 10(d) and used for the experiment. One of the reasons for using sampled images (rather than all sub-images) in the experiments was to filter out images unsuitable for display owing to vignetting. Furthermore, the core of this experiment was to verify whether the proposed method is properly applied both to the central LF image, which is mainly used for learning, and to other viewpoint sub-images with large disparities.

Fig. 12 Value graphs of a PSNR, b NIQE, and c DISTS

Unlike traditional learning-based methods, the proposed model is trained with only one dataset comprising two images acquired from a specific specimen. Using a 2D image (Fig. 10(a)) and an LF image (Fig. 10(d)) of each specimen, each model was trained using the proposed one-shot learning method, and the default target resolution was set to 512 \(\times \) 512 pixels. Because the model was trained with only one dataset for a specific specimen without any external dataset, the trained model exhibited specialized performance for that specimen. Therefore, to apply the proposed method to another specimen, a new model should be trained with a dataset of the corresponding specimen. When all the sub-images were input into the trained model, the proposed LFSR was performed. The degraded input images were converted into high-quality images, resulting in various image improvements, such as resolution enhancement, detail restoration, and noise removal. Moreover, image quality evaluations, comparisons, and additional resolution improvement experiments were performed to verify the usefulness of the proposed method more systematically.

4.1 Comparative analysis

To demonstrate the superiority of the proposed method, we compared it with previous methods for enhancing image resolution and assessed the image quality. Figure 11(b) shows the result of improving the resolution using bicubic interpolation, which is generally used for image upsampling; these bicubic-upsampled images are also the input images used in the proposed method. Figure 11(c) shows the output of the SR network for multiple degradations (SRMD) [57], an external image SR model trained with a large external dataset. Unlike other CNN-based SR models, SRMD improves resolution while considering degradations, such as noise, in the input low-resolution images. SRMD is therefore suitable for the proposed LFM system, in which a low-quality EIA, including distortions and lens contamination, is obtained owing to the poor illumination environment and the use of the MLA; hence, it was used in our previous studies [21]. Figure 11(d) shows the result of LFSR using block matching, principal component analysis, and ridge regression (bm_pca_rr_LFSR) [9]. Similar to typical LFSR methods, bm_pca_rr_LFSR enhances LF images by utilizing and synthesizing surrounding sub-image information. Figure 11(e) shows the results from ZSSR [39] as a representative internal image SR model that utilizes only input images rather than an external dataset for model training. Methods that do not use reference images, such as those in Fig. 11(b)–(e), cannot improve image quality when the input image is extremely deteriorated and has very low resolution, as is the case for the LF image of the proposed system. Figure 11(f) shows the results from the initial model of the proposed method, in which the CNN model of ZSSR was replaced with a deeper model (ResNet-12) and the training method was replaced with one-shot learning that uses the high-resolution 2D image as the GT, similar to the proposed method. The results from this initial model are better than those of the existing methods that do not utilize reference images; however, compared with the reference image, the visual quality remains insufficient. Meanwhile, as shown in Fig. 11(g), the results from the proposed method are similar to the high-resolution 2D image used as the reference image. It can be observed that resolution enhancement and image conversion are performed properly.

Table 2 PSNR, Mean NIQE, and Mean DISTS
Fig. 13 Image-quality evaluation results of outputs obtained using a L1 loss, b L2 loss, c smooth L1 loss, and d MS-SSIM loss as style loss

Figure 12 and Table 2 present the results of the image-quality assessment for each SR method. The reference-based peak signal-to-noise ratio (PSNR), deep image structure and texture similarity (DISTS), and nonreference-based natural image quality evaluator (NIQE) were used [8, 13, 28]. The PSNR compares a reference image with an input image, measuring and quantifying the noise or distortion, and it is calculated on the basis of the mean squared error between the reference image and the input image. Here, the PSNR was measured by setting the high-resolution 2D image as the reference and comparing it with the output for the central LF image; the larger the PSNR value, the better the quality. The results from the one-shot learning methods (Fig. 11(f) and (g)), which use high-resolution 2D images as the GT, have considerably higher PSNR values than the results of the other methods (Fig. 11(b)–(e)). In particular, the proposed method has the highest evaluation value for all samples; thus, the resultant image has been restored to be as similar as possible to the 2D image. The NIQE is a quality evaluation indicator that does not require a reference image and relies on statistical regularities observed in natural (high-quality) images; the smaller the NIQE value, the better the perceptual quality. This metric utilizes a model of natural scene statistics to capture the characteristics of undistorted images. It computes a set of features that describe the spatial and spectral characteristics of an input image and then compares these features with those extracted from a database of high-quality natural images. The NIQE is computed as a scalar value that quantifies the quality of the input image according to this comparison. The NIQE was used to evaluate the quality of the central LF image and the other viewpoint sub-images. Unlike the PSNR, which uses only the central LF image, the NIQE is obtained for all 81 sub-images. The average values are shown in Fig. 12(b) and Table 2. It can be observed that the NIQE values of the proposed method are lower than those of the other methods for all samples. The DISTS is an image-quality assessment method that leverages deep learning to evaluate the perceptual similarity between a reference image and an input image. The DISTS employs a pre-trained CNN to extract deep features from both the reference and input images, and the similarity between the extracted features is computed to obtain the metric score. This method was used in this study because it can measure the similarity and quality of image textures. Using the high-resolution 2D image as a reference image, the DISTS was computed for all sub-images. This score shows whether the texture quality of the converted sub-images is similar to that of the 2D image, with lower scores indicating better perceptual quality. As shown in Fig. 12(c) and Table 2, the proposed method has the lowest mean values for all specimens; thus, the results of the proposed method have high-quality textures most similar to those of the 2D images.
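As a minimal example, the reference-based PSNR of the enhanced central view can be computed with scikit-image as follows (images assumed to be float arrays in [0, 1]); NIQE and DISTS require dedicated implementations and are not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

def psnr_against_reference(reference_2d: np.ndarray, output_central: np.ndarray) -> float:
    """PSNR of the enhanced central LF view against the high-resolution 2D image."""
    return peak_signal_noise_ratio(reference_2d, output_central, data_range=1.0)
```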

4.2 Ablation experiment

Fig. 14 Output of the proposed model a without cut loss and b with cut loss

Ablation experiments were conducted to verify the effectiveness and performance of the model trained using the proposed loss functions. The style loss is an important loss function that allows the trained model to transform the input images into ones similar to the 2D image used as the GT and affects the perceptual quality of the output images. Figure 13 shows the results of the style-loss replacement experiment: the output images for some sub-images of the model trained using the L1 loss (Fig. 13(a)), L2 loss (Fig. 13(b)), smooth L1 loss (Fig. 13(c)), and MS-SSIM loss (Fig. 13(d)) as the style loss. The style loss of the proposed method (Fig. 13(d)) showed faster learning convergence than the other per-pixel loss functions (Fig. 13(a)–(c)): the proposed method completed training in only 3000 iterations, whereas the other loss functions required at least 5000 iterations. Furthermore, the per-pixel losses were inferior to the proposed loss in terms of visual expression and the restoration of fine details. Moreover, even when quantitative image metrics are used for comparison, the proposed method shows the highest overall performance, as shown in Fig. 13. This result proves that training the model through structural differences between images rather than errors between corresponding pixels is more suitable for LF one-shot learning.

Fig. 15 Overlapping sub-images of a a seedpod and b a bud

Fig. 16 Results of the additional resolution enhancement experiment

Cut loss suppresses the saturation that may occur during model training. In particular, the cut loss based on local patches can effectively eliminate even saturation occurring in a small area of the output image. Figure 14(a) shows the output images of the model trained without cut loss, and Fig. 14(b) shows the output images of the proposed model with cut loss. Without cut loss, the model could not effectively remove the saturation occurring in a certain area, even when the total iteration for training the model was increased. In contrast, the model using cut loss could effectively avoid image saturation because the cut loss can select patches with large errors owing to saturation to train the model. This proved to be a useful loss function for the proposed method, which has a relatively large target resolution to be improved.

Fig. 17 Output of the proposed model a without initialization and b with initialization

In LF one-shot learning, the model is trained to improve the quality and resolution of all sub-images; however, only one dataset based on the central-viewpoint sub-image is used for the main learning process. Therefore, the latent loss is used to suppress overfitting during one-shot learning and to preserve unique content information, such as the viewpoint information of the input sub-images. The overfitting that occurred when training the model without the latent loss caused distortion of the output images. Figure 15 shows parallax information obtained by overlapping sub-images with vertical and horizontal parallax. Magenta and green regions in the composite image show where the intensities differ between the two images, and gray regions show where the two images have the same intensities. The output sub-images of the proposed method show the shifting information caused by parallax, like the input sub-images. Meanwhile, the model trained without the latent loss lost the unique content information of all the input sub-images and converted them to the viewpoint of the image used for one-shot learning. Therefore, unlike for the input images, the overlapping images do not show parallax changes properly. Furthermore, in some cases, the input sub-image information and the image information used for training were synthesized in the output. This experimental result indicates that the latent loss is an essential loss function that preserves the unique information of the input sub-images.

4.3 Additional experiment

To verify that the proposed algorithm performs properly even when the target image resolution is higher than 512 \(\times \) 512, an additional experiment was performed by setting the target resolution to 1024 \(\times \) 1024. This corresponds to approximately a 13-fold improvement in resolution compared with the original 76 \(\times \) 76 sub-image. The framework of the entire algorithm remained the same, although certain parameters were doubled. Figure 16 shows the results from the additional experiments. The top row shows the LF image used in the experiment, and the second row shows the original sub-image for one viewpoint in the upper left of the LF image with a resolution of 76 \(\times \) 76. The image on the left of the third row of Fig. 16 was obtained by resizing this sub-image to 512 \(\times \) 512 via bicubic interpolation and was used as the input to the model. The image on the right in the third row of Fig. 16 shows the output of the model, indicating that the input image is converted to one with quality similar to that of the high-resolution 2D image. The last row shows the results for the 1024 \(\times \) 1024 target resolution, where the quality of the converted image is similar to that of the high-resolution 2D image at the 1024 \(\times \) 1024 output. Visually, the results of the 1024 \(\times \) 1024 model are not considerably degraded compared with the results of the previous 512 \(\times \) 512 model. Furthermore, the quantitative values (such as the NIQE) are similar, indicating that the image-enhancement performance remains constant regardless of the output resolution. This is an advantage of the proposed method over existing CNN-based SR models, whose results show an inverse relationship between the target resolution and the image quality. This additional experiment confirms that using a deeper model and a GPU with larger memory can improve the resolution of the LF image to a size similar to that of the 2D image.

An additional experiment verified the utility of the parameter initialization process performed on all sub-images, as in autoencoder training, before conducting model learning with the central LF and high-resolution 2D images. Notably, the model can be trained immediately through one-shot learning without the initialization process. However, the model trained with the initialization process tends to perform a more natural transformation for the sub-images of various views. In particular, the larger the parallax from the central LF image, the greater the difference between training with and without the initialization process, as shown in Fig. 17. In Fig. 17, the upper-left sub-image is used as the input, as in Fig. 16. Figure 17(a) shows the results of the model trained without the initialization process, and Fig. 17(b) shows the results from the proposed method, where both models were trained with the same number of iterations. Comparing Fig. 17(a) and (b), we can observe that the converted image in Fig. 17(b), which underwent the initialization process, has a more visually natural quality. These differences become more noticeable as the disparity and target resolution increase.

5 Conclusion

The primary disadvantage of an MLA-based LFM imaging system is that it yields low-resolution and low-quality LF images. This problem can be addressed via learning with high-resolution 2D images of the same scene. Herein, an LF one-shot learning algorithm was developed to improve LF image resolution and quality. The proposed algorithm used both high-resolution 2D images and LF images captured with an LFM system. Various learning techniques were applied to address the problems arising when only one LF sub-image is used as training data. Various LF sub-images could be converted into high-quality images, comparable to high-resolution 2D images, while retaining their content. The results indicated that the developed system offered superior performance compared with existing SR methods, which are considerably affected by the domain gap between the input data and the training data. In addition, the image quality was maintained even when the LF sub-image was enlarged by up to 13 times; performance did not deteriorate with the target resolution size. Finally, it was confirmed that, if both 2D and LF images can be obtained, as with the LFM in this study, a superior LF imaging system can be developed that overcomes the limitations of LF images. However, unlike other learning-based methods that utilize pre-trained models, the proposed model is suited to post-processing and has limitations in real-time processing because a new model must be trained for each sample. In future research, we will develop a reference-based LFSR with few-shot learning or meta-learning techniques to utilize a pre-trained model for real-time performance and versatility. Moreover, a stereomicroscope system will be developed for capturing LF and 2D images simultaneously using a beam splitter, and imaging systems that use combined 2D and LF images will be examined.