1 Introduction

In a four-dimensional (4D) light-field (LF) system, 4D information about the angular and spatial characteristics of light can be obtained simultaneously [12, 23]. In LF cameras, an optical system with a microlens array (MLA) is placed in front of the image sensor [31, 32]. LF cameras can acquire 4D information for a three-dimensional (3D) space using the MLA arranged between an objective lens and image sensor. The 4D data acquired through an LF camera can be reconstructed into an orthographic-view image (OVI) or multi-focus image, such as in a multi-camera system, for 3D visualization [10, 20, 30, 43]. Moreover, as 3D depth and shape information can be obtained through post-processing, many studies on computer vision have used LF imaging systems [7, 15, 25, 47, 60].

As a method for obtaining 3D information on a specimen, an LF microscope (LFM) was designed and developed to acquire LF images by applying an MLA to an optical microscope [3, 24, 42]. In a general optical microscope system, although an enlarged image of a specimen can be obtained, only two-dimensional (2D) information can be obtained. To obtain multi-focus or multi-view images for analyzing 3D information, several images must be captured while changing the settings according to the intended purpose. Scan-based systems, such as confocal microscopy, are widely used to obtain 3D image information of an object from the microscopic field; however, data acquisition in such systems is time consuming [29]. Conversely, LFM can obtain an elemental image array (EIA) containing spatial and angular information simultaneously through a single capture. This information can be analyzed and visualized by converting it into high-level information (multi-view, multi-focus, and depth) about the specimen [43].

However, in LF imaging systems that place an MLA in front of the image sensor, such as in LFM, there is an inevitable trade-off between the spatial resolution and the number of viewpoints [22]. This is a critical disadvantage of LF cameras based on MLAs. Multi-view images are simultaneously projected onto a limited image sensor, so the spatial resolution per viewpoint decreases as the number of viewpoints increases, i.e., as the angular resolution increases. To address this problem, LF image super-resolution (LFSR), which improves the spatial resolution while maintaining the number of viewpoints, is being extensively studied in the field of LF imaging [1, 16, 19, 21, 38, 56, 58].

Nonlearning-based LFSR methods enhance the resolution of LF images by analyzing the geometric and mathematical modeling information of 4D LF structures or utilizing overlapping information from different viewpoint images [1, 19, 38]. Although these methods are generally applicable and stable, the improvement in the resolution is limited and the image quality is poor. With recent advancements in deep learning, which is a field of brain-inspired artificial intelligence, these methods have been studied in various fields to realize high-level intelligence, high accuracy, high robustness, and low power consumption [44,45,46, 48,49,50,51,52,53]. In particular, increased attention has been paid to LFSR methods applying the convolutional neural network (CNN) model, and these methods have yielded better results compared with the conventional ones [16, 21, 56, 58].

In general, these learning-based image super-resolution (SR) methods are implemented by training an appropriately designed CNN model using external training datasets for image enhancement [2]. However, deep learning-based methods trained on these external datasets cannot avoid domain gap (shifting) problems [5, 39]. If the domain gap (i.e., the difference between real and training data) is large, the performance of these learning-based methods may be poorer than that of the existing methods. Furthermore, LF datasets (for training) are difficult to obtain compared with general image datasets. To address these problems, LF zero-shot learning SR (ZSSR) has been proposed [6]. The zero-shot learning method resolves domain gap problems by learning only the input image without any external dataset [39]. However, the training datasets used in most LFSR studies are images captured using general LF cameras, such as Lytro or Raytrix, and not LFM [35, 37].

Fig. 1 Conceptual diagram of LFM

The image resolution obtained with LFM in previous studies has been relatively low [19, 21]. Moreover, owing to the optical characteristics of the microscope system, LFM typically acquires low-quality LF images with vignetting effects caused by a poor illumination environment, distortion caused by the MLA, and deterioration caused by lens contamination. These LF images have a large domain gap compared with general LF images; thus, LFSR models trained with external LF datasets exhibit performance limitations. Similarly, it is not effective to apply zero-shot models that learn the internal features of the input image, such as ZSSR [39], to LFM data, because the LF images captured by LFM have a low resolution and exhibit severe deterioration. Therefore, an appropriate reference image should be used to enhance the resolution.

An LFM system using an existing optical microscope can simultaneously acquire high-resolution 2D images and LF images of a sample depending on the presence or absence of MLAs. Herein, we propose an LF one-shot learning method to simultaneously improve the resolution and quality of LFM images by referring to high-quality 2D images obtained without MLAs. A CNN model was trained on a dataset with a high-resolution 2D image as the ground truth (GT) and a central view LF image as the input data. Several novel learning techniques were proposed for the one-shot LF learning process to prevent model overfitting based only on a central LF image. When all the sub-images of the LF image are input into the trained CNN model, they are converted into a high-quality image, such as a 2D image, while retaining the content of the input image and improving the resolution. Consequently, the details of the specimens lost during the EIA capture process can be restored, noise can be removed, and thus high-quality LF images can be obtained.

2 Background of LF microscopy

Figure 1 shows the general structure of LFM using an infinity-corrected optical system [18, 34]. The specimen is enlarged via a 4f-type infinity-corrected optical system, including the objective and tube lenses, onto an intermediate image plane. As shown in the path of optical axis 1 in Fig. 1, the EIA is captured through an MLA from the enlarged visualization of the specimen. Notably, object points can be imaged through multiple-element lenses, and different viewpoints and depth-of-field information can be encoded in corresponding multiple-element images [18]. From the captured EIA, two types of 3D visualizations can be reconstructed: an OVI (or LF image), which is a multi-view image that expresses the parallax information of the specimen [33], and depth slices reconstructed through the computational integral imaging reconstruction method, where each depth slice represents different depth plane information [14, 55]. To analyze the characteristics of the LFM, two main factors should be considered: lateral and axial resolutions (\(R_{LFM}\) and \(D_{LFM}\)), both of which depend on the numerical aperture of the MLA (\(NA_{MLA}\)) as follows:

$$\begin{aligned} R_{LFM} &= \frac{0.47\lambda }{NA_{MLA}} \times \frac{g}{z}\\ D_{LFM} &= \frac{1}{M^{2}}\left( \frac{\lambda }{{NA_{MLA}}^{2}} + \frac{z \times P_{S}}{g \times NA_{MLA}}\right) \end{aligned}$$
(1)

where \(\lambda \) is the wavelength of illumination, g is the gap between the sensor and MLA, z is the distance between the MLA and intermediate plane, M is the magnification, and \(P_{S}\) is the pixel pitch of the image sensor.
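For concreteness, the following Python sketch evaluates Eq. (1) for an illustrative set of parameters; the numerical values are placeholders and do not correspond to the prototype specifications listed later in Table 1.

```python
def lfm_resolutions(wavelength, na_mla, g, z, magnification, pixel_pitch):
    """Evaluate Eq. (1): lateral (R_LFM) and axial (D_LFM) resolution of the LFM."""
    r_lfm = (0.47 * wavelength / na_mla) * (g / z)
    d_lfm = (1.0 / magnification**2) * (
        wavelength / na_mla**2 + (z * pixel_pitch) / (g * na_mla)
    )
    return r_lfm, d_lfm

# Illustrative values only (not the specifications of the actual prototype).
r, d = lfm_resolutions(
    wavelength=550e-9,   # green illumination, 550 nm
    na_mla=0.05,         # numerical aperture of the MLA
    g=2.5e-3,            # gap between sensor and MLA, 2.5 mm
    z=2.5e-3,            # distance between MLA and intermediate plane, 2.5 mm
    magnification=10,    # magnification M
    pixel_pitch=4.0e-6,  # image-sensor pixel pitch, 4 um
)
print(f"R_LFM = {r * 1e6:.2f} um, D_LFM = {d * 1e6:.2f} um")
```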

Fig. 2 Training data acquisition for LF one-shot learning

Fig. 3 Model architecture and initialization process

The LFM resolution is fundamentally determined by the spatial density of the element lenses in the lens array [33]. One pixel is extracted from each elemental image when the EIA obtained through LFM is converted into an LF image through OVI conversion. Therefore, the number of element lenses directly determines the number of pixels in each reconstructed sub-image. As this number of pixels is very small, the resulting quality and resolution are unsatisfactory. Thus, a method for enhancing the resolution of reconstructed images without degrading their quality is required for comfortable viewing. Deep learning-based LFSR methods can mitigate these problems, but a model trained with an external dataset still exhibits performance limitations: such methods require large training datasets and cannot avoid performance degradation caused by domain gap problems or by the target resolution size. Unlike other complex LF systems [27, 36], our LF optical system has a simple configuration; thus, it can acquire both LF and 2D images depending on whether an MLA is present. In the absence of an MLA, 2D images can be acquired, as shown in the path of optical axis 2 in Fig. 1. Therefore, we developed an LF one-shot learning method that addresses the aforementioned performance limitations by exploiting the characteristics of the LFM system. Unlike conventional methods, the proposed method is trained with only two images captured from a sample, so no external datasets are required.

3 Proposed LF one-shot learning method

One-shot learning for SR is a technique for training a deep learning model using a single pair of training data [4, 54]. Herein, we adopted this technique to enhance the quality and resolution of LFM images and propose an LF one-shot learning algorithm in which an LF image and a 2D image of each sample are used as the training data. It is difficult to avoid overfitting in a model trained via one-shot learning, which can cause problems such as performance degradation and content loss when there is a difference between the input and training data. These issues pose considerable difficulties when applying one-shot learning to LF images, where each sub-image captures distinct viewpoint information. However, the proposed method resolves these training issues and is suitable for LF images. The core idea of the proposed algorithm is to use only a pair of central-viewpoint images to train the model to perform only a 2D image-like style transformation while retaining the viewpoint information of the other sub-images. This solves the overfitting problem and improves the image quality.

The proposed method involves three major steps. First, high-resolution 2D and LF images were acquired in pairs to set up the training dataset. This training dataset was used to train the proposed model via one-shot learning. Second, the proposed learning techniques and loss functions were used for LF one-shot learning. Thus, the model was trained such that it can be applied to other sub-images without overfitting the central LF image. Third, all the sub-images of the LF image were input into the trained model; detailed restoration and noise removal were performed simultaneously, as in the 2D image; and an output image with an improved resolution was obtained.

3.1 Training data acquisition

A training data pair for one-shot learning was configured using the optical characteristics of the proposed LFM system. First, the EIA was obtained by placing the MLA in the frame and capturing an image, and another image was captured without the MLA to obtain a high-resolution 2D image. The EIA was then converted into a multi-view LF image. The high-resolution 2D image and central LF image (the central viewpoint image among the sub-images) were the same as the front-view image; therefore, it was assumed that the fields of view of these two images matched, and they were used as a training data pair. In the proposed method, the input image was resized to the target size in advance, as shown in Fig. 2. Each sub-image was resized to the target resolution through bicubic interpolation, and the central LF image was resized further to utilize image crop–based data augmentation [40]. Similarly, the high-resolution 2D image was resized to be used as the GT for the training dataset. The change in the image size in this process could be adjusted according to the target resolution.
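The snippet below sketches this preparation step in PyTorch under the assumption that both images are tensors of shape (C, H, W) with values in [0, 1]; the target size, margin, and function names are illustrative rather than taken from the original implementation.

```python
import torch
import torch.nn.functional as F

def prepare_training_pair(central_lf, hr_2d, target=512, margin=64):
    """Bicubic-resize the central LF sub-image and the high-resolution 2D image
    so that matching crops of size `target` can be drawn from both."""
    size = (target + margin, target + margin)   # resize beyond the target size
    lr = F.interpolate(central_lf.unsqueeze(0), size=size,
                       mode="bicubic", align_corners=False).squeeze(0)
    gt = F.interpolate(hr_2d.unsqueeze(0), size=size,
                       mode="bicubic", align_corners=False).squeeze(0)
    return lr, gt

def random_crop_pair(lr, gt, target=512):
    """Draw an identical random crop from the input and the GT (crop-based augmentation)."""
    _, h, w = lr.shape
    top = torch.randint(0, h - target + 1, (1,)).item()
    left = torch.randint(0, w - target + 1, (1,)).item()
    return (lr[:, top:top + target, left:left + target],
            gt[:, top:top + target, left:left + target])
```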

Fig. 4 Overlapping sub-images. a Original LF image, b result of the overfitting model, and c result of the proposed model

Fig. 5 Proposed LF one-shot learning framework for LF image enhancement

Fig. 6 Intermediate result obtained using a L1 loss, b L2 loss, c smooth L1 loss, and d MS-SSIM loss as style loss

3.2 Model architecture and initialization

In the proposed method, fully convolutional architectures based on a deep image prior [41] were used in the CNN model. As shown in Fig. 3, the proposed model comprised an encoder–decoder (hourglass) architecture with skip connections. The encoder model \(\phi (\cdot )\) comprised six blocks, with a convolution layer, batch normalization layer, and leaky ReLU function as one block. The decoder model \(\varphi (\cdot )\) also comprised six blocks, with a pixel shuffle layer, concat function, convolution layer, batch normalization layer, and leaky ReLU function as one block. The entire model was trained to extract latent data [26] in which the unique features of the input image and viewpoint information were preserved through the encoder model \(\phi (\cdot )\). Further, the latent data were restructured as a 2D high-resolution image through the decoder model \(\varphi (\cdot )\). This is generally considered as a suitable method for converting only the style of an input image while maintaining its unique information, as in the style transfer model [11, 17].
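A minimal PyTorch sketch of such an hourglass model is given below. Only the overall structure follows the description above (six stride-2 encoder blocks, six pixel-shuffle decoder blocks with skip connections, and separate paths for \(\phi(\cdot)\) and \(\varphi(\cdot)\)); the channel width, kernel sizes, activation slope, and output activation are assumptions.

```python
import torch
import torch.nn as nn

def enc_block(cin, cout):
    # One encoder block: stride-2 convolution -> batch norm -> leaky ReLU.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, stride=2, padding=1),
        nn.BatchNorm2d(cout),
        nn.LeakyReLU(0.2, inplace=True),
    )

class DecBlock(nn.Module):
    # One decoder block: pixel shuffle (x2) -> concat skip -> conv -> batch norm -> leaky ReLU.
    def __init__(self, cin, skip_ch, cout):
        super().__init__()
        self.shuffle = nn.PixelShuffle(2)  # cin channels -> cin // 4 channels, 2x spatial
        self.body = nn.Sequential(
            nn.Conv2d(cin // 4 + skip_ch, cout, kernel_size=3, padding=1),
            nn.BatchNorm2d(cout),
            nn.LeakyReLU(0.2, inplace=True),
        )

    def forward(self, x, skip=None):
        x = self.shuffle(x)
        if skip is not None:
            x = torch.cat([x, skip], dim=1)
        return self.body(x)

class HourglassNet(nn.Module):
    """Encoder phi(.) / decoder varphi(.) with skip connections (simplified sketch).
    Input spatial size should be divisible by 64 (e.g., 512 x 512)."""
    def __init__(self, ch=3, width=64):
        super().__init__()
        cins = [ch] + [width] * 5
        self.encoder = nn.ModuleList([enc_block(c, width) for c in cins])
        # The first five decoder blocks receive skip features; the last one does not.
        self.decoder = nn.ModuleList(
            [DecBlock(width, width, width) for _ in range(5)] + [DecBlock(width, 0, width)]
        )
        self.to_rgb = nn.Conv2d(width, ch, kernel_size=1)

    def encode(self, x):                      # phi(.)
        feats = []
        for block in self.encoder:
            x = block(x)
            feats.append(x)
        return x, feats                       # latent data Z and skip features

    def decode(self, z, feats):               # varphi(.)
        x = z
        skips = feats[-2::-1]                 # encoder features, deepest-but-one first
        for i, block in enumerate(self.decoder):
            x = block(x, skips[i] if i < len(skips) else None)
        return torch.sigmoid(self.to_rgb(x))  # output image in [0, 1] (assumption)

    def forward(self, x):
        z, feats = self.encode(x)
        return self.decode(z, feats)
```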

Fig. 7 a Process of cut loss, b result of applying cut loss

To implement the proposed LF one-shot learning, a dedicated procedure was used in the model initialization process. As shown in Fig. 3, one image was randomly selected from all the sub-images as the input image for model initialization. Then, as in autoencoder training, the model was trained using this input image as the GT. This training process minimized the reconstruction error between the original input and the reconstructed output, allowing the model to learn efficient and meaningful representations of the input image. For weight initialization, learning was performed for one-tenth of the total training iterations, and the model was thereby trained on the features of various sub-images. This initialization process induced the model to extract features that naturally preserve the information of various viewpoints before it was trained with the central LF image and the high-resolution 2D image.
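The initialization can be sketched as the warm-up loop below, assuming the HourglassNet sketch above and sub-images that have already been pre-upsampled to the target size. Sampling a different random sub-image at each step is one reading of the description; the Adam optimizer and MSE reconstruction error are assumptions consistent with, but not specified by, the text.

```python
import random
import torch
import torch.nn.functional as F

def initialize_model(model, sub_images, total_iters=3000, lr=1e-3):
    """Autoencoder-style warm-up: a randomly chosen sub-image serves as both
    input and target for one-tenth of the total training iterations."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(total_iters // 10):
        x = random.choice(sub_images).unsqueeze(0)   # pick one (C, H, W) view each step
        out = model(x)
        loss = F.mse_loss(out, x)                    # reconstruct the input itself
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```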

3.3 Loss function design for optimization

Figure 4(a) shows two overlapping diagonal sub-images from the original LF image, used as inputs of the proposed model, to illustrate the parallax. In the original LF image, there is a diagonal parallax, as indicated by the blue arrows. Figure 4(b) shows the difference between the two images output by a model trained in the general manner with only the central LF and 2D high-resolution images. The unique viewpoint information of the input sub-images was lost because of the overfitting induced by the central viewpoint image in the learning process. To avoid the overfitting problem in one-shot learning and retain the viewpoint information of the input sub-images while maintaining the style transformation performance, the proposed method sets a relatively large crop size for the image data augmentation and utilizes the enhancement loss. As shown in Fig. 5, data augmentation was applied to prevent overfitting during the training process. Both the input and GT images were randomly cropped to the target size from the resized images and used as training data covering parts of the image. Additionally, the model was trained on the entire image using the original training image.

Fig. 8 Display processing of converted high-quality LF image

Latent loss. Relatively large crop images were used for model training. Therefore, the latent loss \(L_{la}\), an image feature map–based loss function, was also used. In the proposed \(L_{la}\), the difference between the specific feature map extracted by the CNN model (an intermediate result and not the final feedforward result of the model) was reflected in the model training. \(L_{la}\) in Eq. (3) represents the mean square error between the latent data \(Z_{c}^{lr}\) extracted from the input image \(LF_{c}^{lr}\) and the latent data \(Z_{c}^{hr}\) extracted from the output image \(LF_{c}^{hr}\) using the encoder model \(\phi (\cdot )\). C, H, and W refer to the channel, height, and width of the latent data, respectively.

$$\begin{aligned} Z_{c}^{lr} = \phi (LF_{c}^{lr}); \quad LF_{c}^{hr} = \varphi (Z_{c}^{lr}); \quad Z_{c}^{hr} = \phi (LF_{c}^{hr}) \end{aligned}$$
(2)
$$\begin{aligned} L_{la} = \frac{1}{CHW}\left\Vert Z_{c}^{lr} - Z_{c}^{hr}\right\Vert _{2}^{2} \end{aligned}$$
(3)

\(Z_{c}^{lr}\) contains unique characteristics of the input image, including the viewpoint information. Therefore, as shown in Fig. 4(b), \(Z_{c}^{hr}\) loses viewpoint information owing to overfitting and differs considerably from the \(Z_{c}^{lr}\) of the input image, which retains the viewpoint information. \(L_{la}\) induces model training to minimize the feature-map error between the input and output images. Therefore, it performs a complementary role such that the unique viewpoint information of the input image can be maintained in the output image. Figure 4(c) shows the results of the proposed method. It can be observed that the output image is converted into a high-resolution 2D image and that the difference information is retained.
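Assuming the encode/decode interface of the model sketch above, the latent loss of Eqs. (2) and (3) can be written as follows; returning the intermediate output \(LF_{c}^{hr}\) alongside the loss is a convenience choice so that it can be reused by the style loss.

```python
import torch

def latent_loss(model, lf_lr):
    """Eqs. (2)-(3): mean squared error between the latent data of the input
    and the latent data re-extracted from the reconstructed output."""
    z_lr, feats = model.encode(lf_lr)              # Z_c^lr  = phi(LF_c^lr)
    lf_hr = model.decode(z_lr, feats)              # LF_c^hr = varphi(Z_c^lr)
    z_hr, _ = model.encode(lf_hr)                  # Z_c^hr  = phi(LF_c^hr)
    l_la = torch.mean((z_lr - z_hr) ** 2)          # 1/(CHW) * ||Z_c^lr - Z_c^hr||_2^2
    return l_la, lf_hr
```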

Fig. 9 A prototype of the proposed LFM

Style loss. Unlike the feature map–based \(L_{la}\), the style loss \(L_{st}\) is an error function between the final output and the GT. Therefore, \(L_{st}\) induces the model to convert an input image and produce an output resembling the high-resolution 2D image, i.e., the GT. General loss functions, such as the L1/L2 loss, are per-pixel loss functions that represent the error between the corresponding pixels of an output and the GT. As these per-pixel loss functions estimate errors at the pixel level, they are not suitable for the proposed system, whose training data structurally match the input image and GT but exhibit some pixel-level discrepancies owing to the MLA. Therefore, the proposed system requires a loss function that estimates the error in local-region units. Accordingly, the multiscale structural similarity index measure (MS-SSIM) loss, which is based on the image structure, was used herein as the style loss. It is defined as follows [59]:

$$\begin{aligned} \begin{aligned} SSIM(x,y)&= \frac{2\mu _{x}\mu _{y}+C_{1}}{\mu _{x}^{2}+\mu _{y}^{2}+C_{1}} \cdot \frac{2\sigma _{xy}+C_{2}}{\sigma _{x}^{2}+\sigma _{y}^{2}+C_{2}}\\&= l(x,y) \cdot cs(x,y) \end{aligned} \end{aligned}$$
(4)
$$\begin{aligned} MS\text {-}SSIM(x,y) = l_{M}^{\alpha }(x,y) \cdot \prod _{j=1}^{M}cs_{j}^{\beta _{j}}(x,y) \end{aligned}$$
(5)
$$\begin{aligned} L_{st} = 1 - MS\text {-}SSIM(LF_{c}^{hr},I^{hr}) \end{aligned}$$
(6)

where \(\mu _{x}\) and \(\mu _{y}\) are the means of x and y, respectively, \(\sigma _{x}^{2}\) and \(\sigma _{y}^{2}\) are the variances of x and y, respectively, \(\sigma _{xy}\) is the covariance between x and y, and \(C_{1}\) and \(C_{2}\) are variables that stabilize the division. \(L_{st}\) trains the model through the structural differences between the images rather than errors between the corresponding pixels. Figure 6 shows the intermediate results output by the model trained with each candidate loss function. As shown in Fig. 6, when the MS-SSIM loss is used in the proposed method, the learning convergence is faster and the visual style conversion performance is better than with the other per-pixel loss functions. Moreover, the proposed loss function is effective in terms of visual expression and the restoration of fine details.
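For illustration, the sketch below implements the single-scale SSIM of Eq. (4) with a uniform local window and uses it as a style loss. The actual method uses the multiscale variant of Eq. (5), which repeats the contrast-structure term over a dyadic image pyramid; the window type and stabilizing constants here are common defaults rather than values from the original implementation.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, window_size=11, c1=0.01 ** 2, c2=0.03 ** 2):
    """Single-scale SSIM of Eq. (4) with a uniform local window; images in [0, 1]."""
    pad = window_size // 2
    ch = x.shape[1]
    win = torch.ones(ch, 1, window_size, window_size, device=x.device) / window_size ** 2
    mu_x = F.conv2d(x, win, padding=pad, groups=ch)
    mu_y = F.conv2d(y, win, padding=pad, groups=ch)
    var_x = F.conv2d(x * x, win, padding=pad, groups=ch) - mu_x ** 2
    var_y = F.conv2d(y * y, win, padding=pad, groups=ch) - mu_y ** 2
    cov_xy = F.conv2d(x * y, win, padding=pad, groups=ch) - mu_x * mu_y
    l = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)    # luminance term l(x, y)
    cs = (2 * cov_xy + c2) / (var_x + var_y + c2)                # contrast-structure term cs(x, y)
    return (l * cs).mean()

def style_loss(lf_hr, gt_2d):
    # Eq. (6): L_st = 1 - (MS-)SSIM(LF_c^hr, I^hr); single-scale stand-in here.
    return 1.0 - ssim(lf_hr, gt_2d)
```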

Table 1 Device specifications of the proposed LFM

Enhancement loss. The enhancement loss \(L_{en}\) used for model training is a combination of \(L_{st}\) and \(L_{la}\) with a specific weight (\(\alpha _{st}\), \(\alpha _{la}\)), as shown in Eq. (7). Through \(L_{en}\), the model can be trained to transform an input image into an image similar to the reference image while preserving unique information in the input image.

$$\begin{aligned} L_{en} = \alpha _{st}L_{st} + \alpha _{la}L_{la} \end{aligned}$$
(7)
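In code, Eq. (7) is simply a weighted sum of the two losses; the default weights follow the hyperparameters reported in Section 3.4.

```python
def enhancement_loss(l_st, l_la, alpha_st=0.6, alpha_la=0.4):
    # Eq. (7): weighted combination of the style loss and the latent loss.
    return alpha_st * l_st + alpha_la * l_la
```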

Cut loss. When the model is trained with only \(L_{en}\), saturation may occur in the image area during the learning process, as shown in the left image of Fig. 7(b). If the saturation area is small compared with the image used for training, the effect on \(L_{st}\) is insignificant; therefore, there is a limit to the convergence, regardless of the number of training iterations. To solve this problem, a cut loss \(L_{cut}\) was proposed herein. As shown in Fig. 7(a), the images used for training were divided into 8 \(\times \) 8 array patches, the L1 loss between the corresponding patches in a total of 64 regions was obtained, and the value with the highest error became \(L_{cut}\). Further, the L1 loss value between the output and the n-th patch of GT is \(CL_{pn}\), and pn is the patch number. \(CL_{pn}\) is obtained from all patches (pn = 1–64), as shown in Eq. (8), and its highest value affects the learning process. \(L_{cut}\) is the same as in Eq. (9).

$$\begin{aligned} CL_{pn} = \frac{1}{CHW}\left\Vert P_{pn}^{out} - P_{pn}^{gt}\right\Vert _{1}, (pn = 1\text {--}64) \end{aligned}$$
(8)
$$\begin{aligned} L_{cut} = Max(CL_{pn}), (pn = 1\text {--}64) \end{aligned}$$
(9)

In general, if \(L_{cut}\) is utilized, even a small saturation area can have a considerable effect on model training. In this way, saturation in a certain area can be removed during the learning process. However, if the model is trained separately with only the patch-based \(L_{cut}\), blurring may occur in the patch boundary areas. Therefore, herein, \(L_{en}\) (for image restoration) and \(L_{cut}\) (for saturation removal) were used alternately during training. Adding \(L_{cut}\) to the learning process removed the saturation, as shown in Fig. 7(b).
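A direct sketch of Eqs. (8) and (9) is shown below, assuming output and GT tensors of shape (B, C, H, W) whose spatial size is divisible by 8.

```python
import torch
import torch.nn.functional as F

def cut_loss(output, gt, grid=8):
    """Eqs. (8)-(9): split output and GT into an 8 x 8 patch grid, compute the L1
    error of each of the 64 corresponding patch pairs, and keep the largest one."""
    _, _, h, w = output.shape
    ph, pw = h // grid, w // grid
    worst = output.new_tensor(0.0)
    for i in range(grid):
        for j in range(grid):
            po = output[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            pg = gt[:, :, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            worst = torch.maximum(worst, F.l1_loss(po, pg))   # CL_pn; keep the maximum
    return worst
```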

Fig. 10 Specimens (a mini gear, seedpod, chip resistor, bud, grass seed 1, and grass seed 2) of the a 2D image, b EIA, c LF image, and d selected sub-images for the experiment

3.4 Training process and hyperparameters

The following hyperparameters were used for training: learning rate = 1e-3, weight of style loss \(\alpha _{st}\) = 0.6, weight of latent loss \(\alpha _{la}\) = 0.4, initialization iteration = 300, total iteration = 3000, image crop ratio = 9:1, and cut loss ratio = 1:8. The overall model training process is depicted in Algorithm 1.

Algorithm 1 Proposed LF one-shot learning process
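Combining the sketches above (data preparation, model, and losses), the main training loop might look as follows. Interpreting the 9:1 crop ratio as nine cropped iterations per full-image iteration and the 1:8 cut-loss ratio as one \(L_{cut}\) step per eight steps is an assumption, as is the use of the Adam optimizer.

```python
import torch

def train_one_shot(model, lf_central, gt_2d, iters=3000, lr=1e-3,
                   alpha_st=0.6, alpha_la=0.4, full_every=10, cut_every=8):
    """Sketch of the main one-shot loop. lf_central and gt_2d are (C, H, W) tensors
    produced by prepare_training_pair(); the schedules are one plausible reading
    of the reported crop and cut-loss ratios."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for it in range(iters):
        if it % full_every == 0:
            x, y = lf_central, gt_2d                     # occasional full-image step
        else:
            x, y = random_crop_pair(lf_central, gt_2d)   # crop-based augmentation
        x, y = x.unsqueeze(0), y.unsqueeze(0)
        l_la, out = latent_loss(model, x)
        if it % cut_every == 0:
            loss = cut_loss(out, y)                      # saturation-removal step
        else:
            loss = enhancement_loss(style_loss(out, y), l_la, alpha_st, alpha_la)  # Eq. (7)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```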

3.5 Model test and display

Fig. 11 a Input sub-image; image-enhancement results of b bicubic, c SRMD, d bm_pca_rr_LFSR, e ZSSR, f one-shot ZSSR, and g proposed method

When model training was completed with the central LF image and high-resolution 2D image, all the pre-upsampled sub-images of the LF image were input into the trained model to be transformed into high-quality images, as shown in Fig. 8. The converted images were rearranged for each viewpoint, similar to the original LF image, and the process proposed for enhancing the low-quality LF image using one-shot learning was completed. Through this process, resolution enhancement, noise removal, and specimen detail restoration based on a reference image were simultaneously performed, resulting in an LF image quality similar to that of a high-resolution 2D image used as the reference image. The original LF image, pre-upsampled sub-images for each viewpoint, and converted sub-images were output in the display program.
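The test stage can be sketched as a simple loop over the pre-upsampled sub-images, assuming the model sketch above; the bicubic pre-upsampling and [0, 1] clamping mirror the pipeline described here.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def enhance_light_field(model, sub_images, target=512):
    """Feed every pre-upsampled sub-image through the trained model and
    return the enhanced views in their original viewpoint order."""
    model.eval()
    enhanced = []
    for view in sub_images:                                   # list of (C, h, w) views
        x = F.interpolate(view.unsqueeze(0), size=(target, target),
                          mode="bicubic", align_corners=False)
        enhanced.append(model(x).squeeze(0).clamp(0, 1).cpu())
    return enhanced
```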

4 Experiment results and analysis

The proposed LFM comprises an infinity-corrected optical system, an MLA, a digital single-lens reflex camera, and a PC for processing and displaying the acquired microscopic images, as shown in Fig. 9. The detailed device specifications of the proposed LFM are listed in Table 1.

Six different objects were utilized as experimental specimens: a mini gear, seedpod, chip resistor, bud, grass seed 1, and grass seed 2 (Fig. 10). For each specimen, 2D images and EIAs (Fig. 10(a) and (b)) were acquired with a resolution of 4000 \(\times \) 4000 pixels. The LF image (Fig. 10(c)) was generated by OVI reconstruction from the EIA and comprised 53 \(\times \) 53 directional views. The size of each sub-image was 76 \(\times \) 76 pixels. In this experiment, the region of interest was set as shown in Fig. 10(c), and 9 \(\times \) 9 (i.e., 81) sub-images were selected at regular intervals around the central LF image. The selected images were reconstructed as shown in Fig. 10(d) and used for the experiment. One of the reasons for using sampled images (rather than all sub-images) in the experiments was to filter out images unsuitable for display owing to vignetting. Furthermore, the core of this experiment was to verify whether the proposed method is properly applied both to the central LF image, which is mainly used for learning, and to other viewpoint sub-images with large disparities.

Fig. 12 Value graphs of a PSNR, b NIQE, and c DISTS

Unlike traditional learning-based methods, the proposed model is trained with only one dataset comprising two images acquired from a specific specimen. Using a 2D image (Fig. 10(a)) and an LF image (Fig. 10(d)) of each specimen, each model was trained using the proposed one-shot learning method, and the default target resolution was set to 512 \(\times \) 512 pixels. Because the model was trained with only one dataset for a specific specimen without any external dataset, the trained model exhibited specialized performance for that specimen. Therefore, to apply the proposed method to another specimen, a new model should be trained with a dataset of the corresponding specimen. When all the sub-images were input into the trained model, the proposed LFSR was performed. The degraded input images were converted into high-quality images, resulting in various image improvements, such as resolution enhancement, detail restoration, and noise removal. Moreover, image quality evaluations, comparisons, and additional resolution improvement experiments were performed to verify the usefulness of the proposed method more systematically.

4.1 Comparative analysis

To demonstrate the superiority of the proposed method, we compared it with previous methods for enhancing image resolution and assessed the image quality. Figure 11(b) shows the result of improving the resolution using bicubic interpolation, which is generally used for image upsampling; these bicubic-upsampled images are also the input images used in the proposed method. Figure 11(c) shows the output of the SR network for multiple degradations (SRMD) [57], an external image SR model trained with a large external dataset. Unlike other CNN-based SR models, SRMD improves resolution while considering degradations, such as noise, in the input low-resolution images. SRMD is therefore suitable for the proposed LFM system, in which a low-quality EIA, including distortions and lens contamination, is obtained owing to the poor illumination environment and the use of the MLA; hence, it was used in our previous studies [21]. Figure 11(d) shows the result of LFSR using block matching, principal component analysis, and ridge regression (bm_pca_rr_LFSR) [9]. Similar to typical LFSR methods, bm_pca_rr_LFSR enhances LF images by utilizing and synthesizing surrounding sub-image information. Figure 11(e) shows the results from ZSSR [39] as a representative internal image SR model that utilizes only input images rather than an external dataset for model training. Methods that do not use reference images, such as those in Fig. 11(b)–(e), cannot improve image quality when the input image is extremely deteriorated and has very low resolution, as is the case for the LF image of the proposed system. Figure 11(f) shows the results from the initial model of the proposed method, in which the CNN model of ZSSR was replaced with a deeper model (ResNet-12) and the training method was replaced with one-shot learning that uses the high-resolution 2D image as the GT, similar to the proposed method. The results from this initial model are better than those of the existing methods that do not utilize reference images; however, compared with the reference image, the visual quality remains insufficient. Meanwhile, as shown in Fig. 11(g), the results from the proposed method are similar to the high-resolution 2D image used as the reference image. It can be observed that resolution enhancement and image conversion are performed properly.

Table 2 PSNR, Mean NIQE, and Mean DISTS
Fig. 13 Image-quality evaluation results of outputs obtained using a L1 loss, b L2 loss, c smooth L1 loss, and d MS-SSIM loss as style loss

Figure 12 and Table 2 present the results of the image-quality assessment for each SR method. The reference-based peak signal-to-noise ratio (PSNR), deep image structure and texture similarity (DISTS), and nonreference-based natural image quality evaluator (NIQE) were used [8, 13, 28]. The PSNR compares a reference image with an input image, measuring and quantifying the noise or distortion, and it is calculated on the basis of the mean squared error between the reference image and the input image. Here, the PSNR was measured by setting the high-resolution 2D image as the reference and comparing it with the output for the central LF image; the larger the PSNR value, the better the quality. The results from the one-shot learning methods (Fig. 11(f) and (g)), which use high-resolution 2D images as the GT, have considerably higher PSNR values than the results of the other methods (Fig. 11(b)–(e)). In particular, the proposed method has the highest evaluation value for all samples; thus, the resultant image has been restored to be as similar as possible to the 2D image. The NIQE is a quality evaluation indicator that does not require a reference image and relies on statistical regularities observed in natural (high-quality) images; the smaller the NIQE value, the better the perceptual quality. This metric utilizes a model of natural scene statistics to capture the characteristics of undistorted images. It computes a set of features that describe the spatial and spectral characteristics of an input image and then compares these features with those extracted from a database of high-quality natural images. The NIQE is computed as a scalar value that quantifies the quality of the input image according to this comparison. The NIQE was used to evaluate the quality of the central LF image and the other viewpoint sub-images. Unlike the PSNR, which uses only the central LF image, the NIQE is obtained for all 81 sub-images. The average values are shown in Fig. 12(b) and Table 2. It can be observed that the NIQE values of the proposed method are lower than those of the other methods for all samples. The DISTS is an image-quality assessment method that leverages deep learning to evaluate the perceptual similarity between a reference image and an input image. The DISTS employs a pre-trained CNN to extract deep features from both the reference and input images, and the similarity between the extracted features is computed to obtain the metric score. This method was used in this study because it can measure the similarity and quality of image textures. Using the high-resolution 2D image as a reference image, the DISTS was computed for all sub-images. This score shows whether the texture quality of the converted sub-images is similar to that of the 2D image, with lower scores indicating better perceptual quality. As shown in Fig. 12(c) and Table 2, the proposed method has the lowest mean values for all specimens; thus, the results of the proposed method have high-quality textures most similar to those of the 2D images.
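As a minimal example, the reference-based PSNR of the enhanced central view can be computed with scikit-image as follows (images assumed to be float arrays in [0, 1]); NIQE and DISTS require dedicated implementations and are not reproduced here.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio

def psnr_against_reference(reference_2d: np.ndarray, output_central: np.ndarray) -> float:
    """PSNR of the enhanced central LF view against the high-resolution 2D image."""
    return peak_signal_noise_ratio(reference_2d, output_central, data_range=1.0)
```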

4.2 Ablation experiment

Fig. 14 Output of the proposed model a without cut loss and b with cut loss

Ablation experiments were conducted to verify the effectiveness and performance of the model trained using the proposed loss functions. The style loss is an important loss function that allows the trained model to transform the input images into ones similar to the 2D image used as the GT and affects the perceptual quality of the output images. Figure 13 shows the results of the style-loss replacement experiment: the output images for some sub-images of the model trained using the L1 loss (Fig. 13(a)), L2 loss (Fig. 13(b)), smooth L1 loss (Fig. 13(c)), and MS-SSIM loss (Fig. 13(d)) as the style loss. The style loss of the proposed method (Fig. 13(d)) showed faster learning convergence than the other per-pixel loss functions (Fig. 13(a)–(c)): the proposed method completed training in only 3000 iterations, whereas the other loss functions required at least 5000 iterations. Furthermore, the per-pixel losses were inferior to the proposed loss in terms of visual expression and the restoration of fine details. Moreover, even when quantitative image metrics are used for comparison, the proposed method shows the highest overall performance, as shown in Fig. 13. This result proves that training the model through structural differences between images rather than errors between corresponding pixels is more suitable for LF one-shot learning.

Fig. 15 Overlapping sub-images of a a seedpod and b a bud

Fig. 16 Results of the additional resolution enhancement experiment

Cut loss suppresses the saturation that may occur during model training. In particular, the cut loss based on local patches can effectively eliminate even saturation occurring in a small area of the output image. Figure 14(a) shows the output images of the model trained without cut loss, and Fig. 14(b) shows the output images of the proposed model with cut loss. Without cut loss, the model could not effectively remove the saturation occurring in a certain area, even when the total iteration for training the model was increased. In contrast, the model using cut loss could effectively avoid image saturation because the cut loss can select patches with large errors owing to saturation to train the model. This proved to be a useful loss function for the proposed method, which has a relatively large target resolution to be improved.

Fig. 17 Output of the proposed model a without initialization and b with initialization

In LF one-shot learning, the model is trained to improve the quality and resolution of all sub-images; however, only one dataset based on the central-viewpoint sub-image is used for the main learning process. Therefore, the latent loss is used to suppress overfitting during one-shot learning and to preserve unique content information, such as the viewpoint information of the input sub-images. The overfitting that occurred when training the model without the latent loss caused distortion of the output images. Figure 15 shows parallax information obtained by overlapping sub-images with vertical and horizontal parallax. Magenta and green regions in the composite image show where the intensities differ between the two images, and gray regions show where the two images have the same intensities. The output sub-images of the proposed method show the shifting information caused by parallax, like the input sub-images. Meanwhile, the model trained without the latent loss lost the unique content information of all the input sub-images and converted them to the viewpoint of the image used for one-shot learning. Therefore, unlike for the input images, the overlapping images do not show parallax changes properly. Furthermore, in some cases, the input sub-image information and the image information used for training were synthesized in the output. This experimental result indicates that the latent loss is an essential loss function that preserves the unique information of the input sub-images.

4.3 Additional experiment

To verify that the proposed algorithm performs properly even when the target image resolution is higher than 512 \(\times \) 512, an additional experiment was performed by setting the target resolution to 1024 \(\times \) 1024. This corresponds to approximately a 13-fold improvement in resolution compared with the original 76 \(\times \) 76 sub-image. The framework of the entire algorithm remained the same, although certain parameters were doubled. Figure 16 shows the results from the additional experiments. The top row shows the LF image used in the experiment, and the second row shows the original sub-image for one viewpoint in the upper left of the LF image with a resolution of 76 \(\times \) 76. The image on the left of the third row of Fig. 16 was obtained by resizing this sub-image to 512 \(\times \) 512 via bicubic interpolation and was used as the input to the model. The image on the right in the third row of Fig. 16 shows the output of the model, indicating that the input image is converted to one with quality similar to that of the high-resolution 2D image. The last row shows the results for the 1024 \(\times \) 1024 target resolution, where the quality of the converted image is similar to that of the high-resolution 2D image at the 1024 \(\times \) 1024 output. Visually, the results of the 1024 \(\times \) 1024 model are not considerably degraded compared with the results of the previous 512 \(\times \) 512 model. Furthermore, the quantitative values (such as the NIQE) are similar, indicating that the image-enhancement performance remains constant regardless of the output resolution. This is an advantage of the proposed method over existing CNN-based SR models, whose results show an inverse relationship between the target resolution and the image quality. This additional experiment confirms that using a deeper model and a GPU with larger memory can improve the resolution of the LF image to a size similar to that of the 2D image.

An additional experiment verified the utility of the parameter initialization process performed on all sub-images, as in autoencoder training, before conducting model learning with the central LF and high-resolution 2D images. Notably, the model can be trained immediately through one-shot learning without the initialization process. However, the model trained with the initialization process tends to perform a more natural transformation for the sub-images of various views. In particular, the larger the parallax from the central LF image, the greater the difference between training with and without the initialization process, as shown in Fig. 17. In Fig. 17, the upper-left sub-image is used as the input, as in Fig. 16. Figure 17(a) shows the results of the model trained without the initialization process, and Fig. 17(b) shows the results from the proposed method, where both models were trained with the same number of iterations. Comparing Fig. 17(a) and (b), we can observe that the converted image in Fig. 17(b), which underwent the initialization process, has a more visually natural quality. These differences become more noticeable as the disparity and target resolution increase.

5 Conclusion

The primary disadvantage of an MLA-based LFM imaging system is that it yields low-resolution and low-quality LF images. This problem can be addressed via learning with high-resolution 2D images of the same scene. Herein, an LF one-shot learning algorithm was developed to improve LF image resolution and quality. The proposed algorithm used both high-resolution 2D images and LF images captured with an LFM system. Various learning techniques were applied to address the problems arising when only one LF sub-image is used as training data. Various LF sub-images could be converted into high-quality images, comparable to high-resolution 2D images, while retaining their content. The results indicated that the developed system offered superior performance compared with existing SR methods, which are considerably affected by the domain gap between the input data and the training data. In addition, the image quality was maintained even when the LF sub-image was enlarged by up to 13 times; performance did not deteriorate with the target resolution size. Finally, it was confirmed that, if both 2D and LF images can be obtained, as with the LFM in this study, a superior LF imaging system can be developed that overcomes the limitations of LF images. However, unlike other learning-based methods that utilize pre-trained models, the proposed model is suited to post-processing and has limitations in real-time processing because a new model must be trained for each sample. In future research, we will develop a reference-based LFSR with few-shot learning or meta-learning techniques to utilize a pre-trained model for real-time performance and versatility. Moreover, a stereomicroscope system will be developed for capturing LF and 2D images simultaneously using a beam splitter, and imaging systems that use combined 2D and LF images will be examined.