U-Net Paper: Original Text

Updated: 2024-10-26 11:22:52

U-Net: Convolutional Networks for Biomedical Image Segmentation

Abstract. There is broad consensus that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net.

1 Introduction

In the last two years, deep convolutional networks have outperformed the state of the art in many visual recognition tasks, e.g. [7,3]. While convolutional networks have already existed for a long time [8], their success was limited due to the size of the available training sets and the size of the considered networks. The breakthrough by Krizhevsky et al. [7] was due to supervised training of a large network with 8 layers and millions of parameters on the ImageNet dataset with 1 million training images. Since then, even larger and deeper networks have been trained [12].

The typical use of convolutional networks is on classification tasks, where the output to an image is a single class label. However, in many visual tasks, especially in biomedical image processing, the desired output should include localization, i.e., a class label is supposed to be assigned to each pixel. Moreover, thousands of training images are usually beyond reach in biomedical tasks. Hence, Ciresan et al. [1] trained a network in a sliding-window setup to predict the class label of each pixel by providing a local region (patch) around that pixel as input. First, this network can localize. Secondly, the training data in terms of patches is much larger than the number of training images. The resulting network won the EM segmentation challenge at ISBI 2012 by a large margin.

Obviously, the strategy in Ciresan et al. [1] has two drawbacks. First, it is quite slow because the network must be run separately for each patch, and there is a lot of redundancy due to overlapping patches. Secondly, there is a trade-off between localization accuracy and the use of context. Larger patches require more max-pooling layers that reduce the localization accuracy, while small patches allow the network to see only little context. More recent approaches [11,4] proposed a classifier output that takes into account the features from multiple layers. Good localization and the use of context are possible at the same time.

In this paper, we build upon a more elegant architecture, the so-called “fully convolutional network” [9]. We modify and extend this architecture such that it works with very few training images and yields more precise segmentations; see Figure 1. The main idea in [9] is to supplement a usual contracting network by successive layers, where pooling operators are replaced by upsampling operators. Hence, these layers increase the resolution of the output. In order to localize, high resolution features from the contracting path are combined with the upsampled output. A successive convolution layer can then learn to assemble a more precise output based on this information.

One important modification in our architecture is that in the upsampling part we have also a large number of feature channels, which allow the network to propagate context information to higher resolution layers. As a consequence, the expansive path is more or less symmetric to the contracting path, and yields a u-shaped architecture. The network does not have any fully connected layers and only uses the valid part of each convolution, i.e., the segmentation map only contains the pixels, for which the full context is available in the input image. This strategy allows the seamless segmentation of arbitrarily large images by an overlap-tile strategy (see Figure 2). To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images, since otherwise the resolution would be limited by the GPU memory.
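The mirroring step of the overlap-tile strategy can be sketched with NumPy. This is a minimal illustration, not the authors' Caffe implementation: the function names are hypothetical, and `mode="reflect"` (mirroring without repeating the edge pixel) is an assumption, since the text does not say which mirroring variant is used.

```python
import numpy as np

def mirror_pad(image, border):
    """Extrapolate missing context by mirroring the image at its borders.

    Assumption: "reflect" mirrors around the edge pixel without repeating it.
    """
    return np.pad(image, border, mode="reflect")

def extract_input_tile(padded, row, col, out_size, border):
    """Cut the input tile needed to predict an out_size x out_size output tile.

    (row, col) is the top-left corner of the output tile in the coordinates of
    the original, unpadded image. The input tile is larger than the output tile
    by `border` pixels on every side.
    """
    size = out_size + 2 * border
    return padded[row:row + size, col:col + size]
```

Each output tile is then predicted from its own, larger input tile, so neighbouring output tiles can be placed side by side without seams.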

As for our tasks there is very little training data available, we use excessive data augmentation by applying elastic deformations to the available training images. This allows the network to learn invariance to such deformations, without the need to see these transformations in the annotated image corpus. This is particularly important in biomedical segmentation, since deformation used to be the most common variation in tissue and realistic deformations can be simulated efficiently. The value of data augmentation for learning invariance has been shown in Dosovitskiy et al. [2] in the scope of unsupervised feature learning.

Another challenge in many cell segmentation tasks is the separation of touching objects of the same class; see Figure 3. To this end, we propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.

The resulting network is applicable to various biomedical segmentation problems. In this paper, we show results on the segmentation of neuronal structures in EM stacks (an ongoing competition started at ISBI 2012), where we outperformed the network of Ciresan et al. [1]. Furthermore, we show results for cell segmentation in light microscopy images from the ISBI cell tracking challenge 2015. Here we won with a large margin on the two most challenging 2D transmitted light datasets.

2 Network Architecture

The network architecture is illustrated in Figure 1. It consists of a contracting path (left side) and an expansive path (right side). The contracting path follows the typical architecture of a convolutional network. It consists of the repeated application of two 3x3 convolutions (unpadded convolutions), each followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the expansive path consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that halves the number of feature channels, a concatenation with the correspondingly cropped feature map from the contracting path, and two 3x3 convolutions, each followed by a ReLU. The cropping is necessary due to the loss of border pixels in every convolution. At the final layer a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes. In total the network has 23 convolutional layers.
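The crop-and-concatenate step in the expansive path can be illustrated as follows. This is a sketch under assumptions: feature maps are stored as `(channels, height, width)` arrays, the crop is taken symmetrically around the centre, and the function names are hypothetical. The text only states that the contracting-path feature map is "correspondingly cropped".

```python
import numpy as np

def center_crop(feature_map, target_h, target_w):
    """Crop a (channels, H, W) feature map symmetrically to (target_h, target_w)."""
    _, h, w = feature_map.shape
    top = (h - target_h) // 2
    left = (w - target_w) // 2
    return feature_map[:, top:top + target_h, left:left + target_w]

def crop_and_concat(skip, upsampled):
    """Concatenate the cropped contracting-path map with the upsampled map.

    The contracting-path map is larger because unpadded convolutions on the
    way down and up have removed border pixels from the upsampled branch.
    """
    cropped = center_crop(skip, upsampled.shape[1], upsampled.shape[2])
    return np.concatenate([cropped, upsampled], axis=0)
```

Concatenating along the channel axis doubles the channel count again after the up-convolution halved it, which is why the following 3x3 convolutions see both high-resolution and context features.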

To allow a seamless tiling of the output segmentation map (see Figure 2), it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size.
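The tile-size constraint can be checked with a few lines of arithmetic. The sketch below assumes four pooling steps (consistent with the 23 convolutional layers stated above) and that every 3x3 unpadded convolution removes two pixels per dimension; the function name is hypothetical.

```python
def unet_output_size(input_size, depth=4):
    """Return the output tile size for a square input tile.

    Raises ValueError if a 2x2 max pooling would see an odd-sized layer.
    """
    size = input_size
    for _ in range(depth):
        size -= 4  # two unpadded 3x3 convolutions, 2 pixels lost each
        if size <= 0 or size % 2 != 0:
            raise ValueError(f"input size {input_size} is not a valid tile size")
        size //= 2  # 2x2 max pooling with stride 2
    size -= 4  # two convolutions at the lowest resolution
    for _ in range(depth):
        size = 2 * size - 4  # upsampling by 2, then two 3x3 convolutions
    if size <= 0:
        raise ValueError(f"input size {input_size} is too small")
    return size

def valid_tile_sizes(low, high, depth=4):
    """List all input sizes in [low, high) that satisfy the even-size rule."""
    sizes = []
    for n in range(low, high):
        try:
            unet_output_size(n, depth)
        except ValueError:
            continue
        sizes.append(n)
    return sizes
```

Under these assumptions a 572x572 input tile yields a 388x388 output tile, so the output is smaller than the input by a constant 184 pixels per dimension.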

3 Training

The input images and their corresponding segmentation maps are used to train the network with the stochastic gradient descent implementation of Caffe [6]. Due to the unpadded convolutions, the output image is smaller than the input by a constant border width. To minimize the overhead and make maximum use of the GPU memory, we favor large input tiles over a large batch size and hence reduce the batch to a single image. Accordingly we use a high momentum (0.99) such that a large number of the previously seen training samples determine the update in the current optimization step.
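To see why a momentum of 0.99 compensates for a batch size of one, the update can be written out explicitly. The sketch below uses the standard momentum update (velocity = momentum * velocity - lr * gradient); the learning rate is a hypothetical parameter, since the text does not state one.

```python
def momentum_step(weight, velocity, grad, lr, momentum=0.99):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity

def recent_gradient_share(momentum, steps):
    """Fraction of the total geometric weight carried by the last `steps` gradients.

    A gradient seen j steps ago contributes with weight momentum**j, so the
    most recent `steps` gradients hold 1 - momentum**steps of the total.
    """
    return 1.0 - momentum ** steps
```

With momentum 0.99, the last 100 samples account for roughly 63% of the accumulated update, so each step effectively averages over on the order of a hundred previously seen images.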

The energy function is computed by a pixel-wise soft-max over the final feature map combined with the cross entropy loss function. The soft-max is defined as

$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$$

where $a_k(x)$ denotes the activation in feature channel $k$ at the pixel position $x \in \Omega$ with $\Omega \subset \mathbb{Z}^2$. $K$ is the number of classes and $p_k(x)$ is the approximated maximum-function, i.e. $p_k(x) \approx 1$ for the $k$ that has the maximum activation $a_k(x)$ and $p_k(x) \approx 0$ for all other $k$. The cross entropy then penalizes at each position the deviation of $p_{\ell(x)}(x)$ from 1 using

$$E = \sum_{x \in \Omega} w(x) \log(p_{\ell(x)}(x)) \qquad (1)$$

where $\ell : \Omega \to \{1, \dots, K\}$ is the true label of each pixel and $w : \Omega \to \mathbb{R}$ is a weight map that we introduced to give some pixels more importance in the training.
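The pixel-wise soft-max and the weighted cross entropy described above can be sketched in NumPy. Two assumptions are made here and are not taken from the text: labels are 0-indexed (0 to K-1 instead of 1 to K), and the weighted log-likelihood sum is negated so that the returned value is a loss to be minimized. The function names are hypothetical.

```python
import numpy as np

def pixelwise_softmax(activations):
    """Soft-max over the channel axis of a (K, H, W) activation array."""
    # Subtracting the per-pixel maximum leaves the result unchanged but
    # avoids overflow in exp().
    shifted = activations - activations.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def weighted_cross_entropy(activations, labels, weights):
    """Weighted cross entropy over all pixels.

    activations: (K, H, W) floats, labels: (H, W) ints in [0, K),
    weights: (H, W) floats (the weight map w).
    Assumption: the sum of w(x) * log p_l(x)(x) is negated to give a loss.
    """
    probs = pixelwise_softmax(activations)
    rows, cols = np.indices(labels.shape)
    p_true = probs[labels, rows, cols]  # probability of the true class per pixel
    return -np.sum(weights * np.log(p_true))
```

A pixel whose true class receives probability close to 1 contributes almost nothing, while a confidently wrong pixel with a large weight dominates the sum.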

We pre-compute the weight map for each ground truth segmentation to compensate the different frequency of pixels from a certain class in the training data set, and to force the network to learn the small separation borders that we introduce between touching cells (see Figure 3c and d).

The separation border is computed using morphological operations. The weight map is then computed as

$$w(x) = w_c(x) + w_0 \cdot \exp\left(-\frac{(d_1(x) + d_2(x))^2}{2\sigma^2}\right) \qquad (2)$$

where $w_c : \Omega \to \mathbb{R}$ is the weight map to balance the class frequencies, $d_1 : \Omega \to \mathbb{R}$ denotes the distance to the border of the nearest cell and $d_2 : \Omega \to \mathbb{R}$ the distance to the border of the second nearest cell. In our experiments we set $w_0 = 10$ and $\sigma \approx 5$ pixels.
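A weight map of this form can be sketched with SciPy's Euclidean distance transform. Several assumptions apply: the ground truth is given as an instance label image with 0 for background and a distinct positive integer per cell, the distance to a cell's border is approximated by the distance to its nearest pixel, and the function name is hypothetical. This is an illustration of the formula, not the authors' code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def separation_weight_map(instance_labels, class_weights, w0=10.0, sigma=5.0):
    """Add a border term w0 * exp(-(d1 + d2)^2 / (2 sigma^2)) to class_weights.

    instance_labels: (H, W) ints, 0 = background, 1..n = individual cells.
    class_weights:   (H, W) floats, the class-balancing map w_c.
    """
    cell_ids = [i for i in np.unique(instance_labels) if i != 0]
    if len(cell_ids) < 2:
        # With fewer than two cells there is no second nearest cell.
        return class_weights.astype(float)
    # For each cell, the distance from every pixel to that cell's nearest pixel.
    distances = np.stack(
        [distance_transform_edt(instance_labels != i) for i in cell_ids]
    )
    distances.sort(axis=0)
    d1, d2 = distances[0], distances[1]  # nearest and second nearest cell
    border_term = w0 * np.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))
    return class_weights + border_term
```

The border term is largest where two cells are close to each other, since both distances are small there, and it decays quickly once the gap between cells exceeds a few multiples of sigma.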

In deep networks with many convolutional layers and different paths through the network, a good initialization of the weights is extremely important. Otherwise, parts of the network might give excessive activations, while other parts never contribute. Ideally the initial weights should be adapted such that each feature map in the network has approximately unit variance. For a network with our architecture (alternating convolution and ReLU layers) this can be achieved by drawing the initial weights from a Gaussian distribution with a standard deviation of $\sqrt{2/N}$, where $N$ denotes the number of incoming nodes of one neuron [5]. E.g. for a 3x3 convolution and 64 feature channels in the previous layer $N = 9 \cdot 64 = 576$.
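The initialization rule can be written out directly. The sketch below assumes a weight layout of `(out_channels, in_channels, kernel, kernel)` and a fixed random seed for reproducibility; both, like the function names, are illustrative choices rather than details from the text.

```python
import numpy as np

def init_std(kernel_h, kernel_w, in_channels):
    """Standard deviation sqrt(2 / N), with N the number of incoming nodes."""
    n_incoming = kernel_h * kernel_w * in_channels
    return np.sqrt(2.0 / n_incoming)

def init_conv_weights(out_channels, in_channels, kernel_size=3, seed=0):
    """Draw convolution weights from a zero-mean Gaussian with std sqrt(2 / N)."""
    rng = np.random.default_rng(seed)
    std = init_std(kernel_size, kernel_size, in_channels)
    shape = (out_channels, in_channels, kernel_size, kernel_size)
    return rng.normal(0.0, std, size=shape)
```

For the example in the text, a 3x3 convolution with 64 input channels has N = 576, which gives a standard deviation of about 0.059.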

3.1 Data Augmentation

Data augmentation is essential to teach the network the desired invariance and robustness properties, when only few training samples are available. In case of microscopical images we primarily need shift and rotation invariance as well as robustness to deformations and gray value variations. Especially random elastic deformations of the training samples seem to be the key concept to train a segmentation network with very few annotated images. We generate smooth deformations using random displacement vectors on a coarse 3 by 3 grid. The displacements are sampled from a Gaussian distribution with 10 pixels standard deviation. Per-pixel displacements are then computed using bicubic interpolation. Drop-out layers at the end of the contracting path perform further implicit data augmentation.
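The elastic deformation described above can be approximated with SciPy. This is a sketch under assumptions: `scipy.ndimage.zoom` with `order=3` performs cubic spline interpolation, used here as a stand-in for the bicubic interpolation named in the text; the image is resampled bilinearly with mirrored borders; and the function name and seed handling are hypothetical.

```python
import numpy as np
from scipy.ndimage import map_coordinates, zoom

def elastic_deform(image, grid=3, sigma=10.0, order=1, seed=0):
    """Deform a 2D image with a smooth random displacement field.

    Displacement vectors are drawn on a coarse grid x grid lattice from a
    Gaussian with standard deviation `sigma` (in pixels) and upsampled to one
    displacement per pixel. Use order=0 with the same seed for label maps so
    that image and segmentation stay aligned.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    coarse = rng.normal(0.0, sigma, size=(2, grid, grid))
    # Cubic spline upsampling of the coarse field to full resolution.
    dy = zoom(coarse[0], (h / grid, w / grid), order=3)
    dx = zoom(coarse[1], (h / grid, w / grid), order=3)
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.array([rows + dy, cols + dx])
    # Sample the image at the displaced positions, mirroring at the borders.
    return map_coordinates(image, coords, order=order, mode="reflect")
```

Because only nine displacement vectors are drawn per image, the resulting field is smooth, so neighbouring pixels move together and cell shapes are bent rather than torn.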

4 Experiments

We demonstrate the application of the u-net to three different segmentation tasks. The first task is the segmentation of neuronal structures in electron microscopic recordings. An example of the data set and our obtained segmentation is displayed in Figure 2. We provide the full result as Supplementary Material. The data set is provided by the EM segmentation challenge [14] that was started at ISBI 2012 and is still open for new contributions. The training data is a set of 30 images (512x512 pixels) from serial section transmission electron microscopy of the Drosophila first instar larva ventral nerve cord (VNC). Each image comes with a corresponding fully annotated ground truth segmentation map for cells (white) and membranes (black). The test set is publicly available, but its segmentation maps are kept secret. An evaluation can be obtained by sending the predicted membrane probability map to the organizers. The evaluation is done by thresholding the map at 10 different levels and computation of the “warping error”, the “Rand error” and the “pixel error” [14].

The u-net (averaged over 7 rotated versions of the input data) achieves without any further pre- or postprocessing a warping error of 0.0003529 (the new best score, see Table 1) and a rand-error of 0.0382.

This is significantly better than the sliding-window convolutional network result by Ciresan et al. [1], whose best submission had a warping error of 0.000420 and a rand error of 0.0504. In terms of rand error the only better performing algorithms on this data set use highly data set specific post-processing methods applied to the probability map of Ciresan et al. [1].

We also applied the u-net to a cell segmentation task in light microscopic images. This segmentation task is part of the ISBI cell tracking challenge 2014 and 2015 [10,13]. The first data set “PhC-U373” contains Glioblastoma-astrocytoma U373 cells on a polyacrylamide substrate recorded by phase contrast microscopy (see Figure 4a,b and Supp. Material). It contains 35 partially annotated training images. Here we achieve an average IOU (“intersection over union”) of 92%, which is significantly better than the second best algorithm with 83% (see Table 2). The second data set “DIC-HeLa” consists of HeLa cells on a flat glass recorded by differential interference contrast (DIC) microscopy (see Figure 3, Figure 4c,d and Supp. Material). It contains 20 partially annotated training images. Here we achieve an average IOU of 77.5% which is significantly better than the second best algorithm with 46%.
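For reference, the basic intersection-over-union measure on two binary masks can be computed as follows. This is only the generic definition; the challenge's own evaluation procedure may match and score objects differently, and the function name and the convention for two empty masks are assumptions.

```python
import numpy as np

def intersection_over_union(pred, truth):
    """IOU of two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        # Assumption: two empty masks are treated as a perfect match.
        return 1.0
    return np.logical_and(pred, truth).sum() / union
```

An IOU of 92% therefore means that the overlap between predicted and annotated cell area covers 92% of their combined area.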

5 Conclusion

The u-net architecture achieves very good performance on very different biomedical segmentation applications. Thanks to data augmentation with elastic deformations, it only needs very few annotated images and has a very reasonable training time of only 10 hours on an NVidia Titan GPU (6 GB). We provide the full Caffe[6]-based implementation and the trained networks. We are sure that the u-net architecture can be applied easily to many more tasks.



Published: 2023-06-10 20:30:00

Article link: https://www.elefans.com/category/jswz/34/1341931.html