
CONTENTS

  • Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels.
  • Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
  • In this chapter, we will first describe what convolution is. Next, we will explain the motivation behind using convolution in a neural network. We will then describe an operation called pooling, which almost all convolutional networks employ. Usually, the operation used in a convolutional neural network does not correspond precisely to the definition of convolution as used in other fields such as engineering or pure mathematics. We will describe several variants on the convolution function that are widely used in practice for neural networks. We will also show how convolution may be applied to many kinds of data, with different numbers of dimensions. We then discuss means of making convolution more efficient. Convolutional networks stand out as an example of neuroscientific principles influencing deep learning. We will discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning. One topic this chapter does not address is how to choose the architecture of your convolutional network. The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while chapter 11 describes general guidelines for choosing which tools to use in which circumstances. Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print. However, the best architectures have consistently been composed of the building blocks described here.

The Convolution Operation

  • In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use.

  • Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.

  • Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:
    s(t) = \int x(a) w(t-a) da
    This operation is called convolution. The convolution operation is typically denoted with an asterisk:
    s(t) = (x * w)(t)

  • In our example, w needs to be a valid probability density function, or the output is not a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations are particular to our example though. In general, convolution is defined for any functions for which the above integral is defined, and may be used for other purposes besides taking weighted averages.
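
Below is a minimal NumPy sketch of this smoothing example. The sensor signal and the exponential weighting function are illustrative choices, not taken from the text; the code applies a discrete, causal version of the weighted average defined above.

```python
import numpy as np

# Illustrative noisy sensor readings of a spaceship moving at constant speed.
rng = np.random.default_rng(0)
t = np.arange(100)
true_position = 0.5 * t
x = true_position + rng.normal(0.0, 2.0, t.size)   # noisy laser readings x(t)

ages = np.arange(10)                 # only the 10 most recent readings get weight
w = np.exp(-0.5 * ages)              # more weight for more recent measurements
w /= w.sum()                         # normalize so the output is a weighted average

# Discrete, causal version of s(t) = integral of x(a) w(t - a) da,
# i.e. s(t) = sum_a w(a) x(t - a) with w(a) = 0 for a < 0 and a >= 10.
# (The earliest few positions simply use fewer than 10 readings.)
s = np.array([
    sum(w[a] * x[i - a] for a in ages if i - a >= 0)
    for i in range(t.size)
])
print(np.round(s[:5], 2))
```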

  • In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input and the second argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.

  • If we now assume that x and w are defined only on integer t, we can define the discrete convolution:
    s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a) w(t-a)
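
A direct sketch of this formula in plain Python/NumPy, assuming x and w are zero outside the stored entries (the arrays below are illustrative); the loop bounds make the "infinite" sum finite in practice, and the result is checked against NumPy's own np.convolve.

```python
import numpy as np

def conv1d(x, w):
    # s(t) = sum_a x(a) w(t - a), for t = 0 .. len(x) + len(w) - 2
    out = np.zeros(len(x) + len(w) - 1)
    for t in range(len(out)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):          # w(t - a) is zero outside this range
                out[t] += x[a] * w[t - a]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25, 0.25])
print(conv1d(x, w))
print(np.convolve(x, w))   # NumPy's full discrete convolution agrees
```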

  • In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements.

  • Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
    S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n) K(i-m, j-n)
    Convolution is commutative, meaning we can equivalently write:
    S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i-m, j-n) K(m, n)
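
The following sketch implements both formulas literally, with an illustrative image and kernel, and checks that they give the same output under the usual assumption that I and K are zero outside their stored entries.

```python
import numpy as np

def conv2d_form1(I, K):
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(H):              # sum over input indices m, n
                for n in range(W):
                    if 0 <= i - m < kh and 0 <= j - n < kw:
                        S[i, j] += I[m, n] * K[i - m, j - n]
    return S

def conv2d_form2(I, K):
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(kh):             # sum over kernel indices: a smaller range
                for n in range(kw):
                    if 0 <= i - m < H and 0 <= j - n < W:
                        S[i, j] += I[i - m, j - n] * K[m, n]
    return S

I = np.arange(9.0).reshape(3, 3)
K = np.array([[1.0, -1.0], [0.5, 0.25]])
print(np.allclose(conv2d_form1(I, K), conv2d_form2(I, K)))   # True
```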

  • Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.

  • The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:
    S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) K(m, n)
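
A sketch of this relationship on illustrative arrays: cross-correlating with a kernel flipped along both axes reproduces the flipped convolution defined earlier, which is why libraries can drop the flip without losing anything.

```python
import numpy as np

def cross_correlate2d(I, K):
    # "Valid" positions only, so the kernel always lies fully inside the image.
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # S(i, j) = sum_{m,n} I(i + m, j + n) K(m, n)
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolve2d(I, K):
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # The index into the input decreases as the index into the kernel grows.
            S[i, j] = sum(
                I[i + kh - 1 - m, j + kw - 1 - n] * K[m, n]
                for m in range(kh) for n in range(kw)
            )
    return S

I = np.arange(16.0).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.allclose(convolve2d(I, K), cross_correlate2d(I, K[::-1, ::-1])))   # True
```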

  • Many machine learning libraries implement cross-correlation but call it convolution.
    In this text we will follow this convention of calling both operations convolution, and specify whether we mean to flip the kernel or not in contexts where kernel flipping is relevant. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping. It is also rare for convolution to be used alone in machine learning; instead convolution is used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel or not.

  • See figure 9.1 for an example of convolution (without kernel flipping) applied to a 2-D tensor.

  • Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to convolution. In addition to these constraints that several elements be equal to each other, convolution usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This is because the kernel is usually much smaller than the input image. Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network. Typical convolutional neural networks do make use of further specializations in order to deal with large inputs efficiently, but these are not strictly necessary from a theoretical perspective.
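
A toy illustration of this matrix view for univariate discrete convolution, with a made-up signal and kernel: each row of the constructed matrix is the kernel shifted one position to the right of the row above, and everything else is zero.

```python
import numpy as np

def conv_matrix(w, n_in):
    k = len(w)
    n_out = n_in - k + 1               # number of positions where the kernel fits
    M = np.zeros((n_out, n_in))
    for i in range(n_out):
        M[i, i:i + k] = w              # row i: the kernel, shifted right by i
    return M

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25, 0.25])

M = conv_matrix(w, len(x))
print(M)                               # mostly zeros: the kernel is smaller than the input
print(M @ x)                           # sliding the (unflipped) kernel over x
print(np.convolve(x, w[::-1], mode='valid'))   # same values; np.convolve flips, so we pre-flip
```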

Motivation

  • Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

  • Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights).

  • This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.

  • These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m.
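
A rough back-of-the-envelope comparison; the sizes below are made up for illustration (the single-kernel count previews the parameter sharing discussed below).

```python
m = 1_000_000      # number of inputs, e.g. pixels in a one-megapixel image
n = 1_000_000      # number of outputs of the layer
k = 9              # connections per output, e.g. a 3x3 kernel

dense_params = m * n        # fully connected: one parameter per input-output pair
sparse_params = k * n       # sparsely connected: k parameters per output
shared_params = k           # convolutional: one kernel of k parameters reused everywhere

print(f"{dense_params:,} vs {sparse_params:,} vs {shared_params:,}")
```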

  • For graphical demonstrations of sparse connectivity, see figure 9.2 and figure 9.3. In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as shown in figure 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

  • Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.

  • In a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This does not affect the runtime of forward propagation (it is still O(k × n)), but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude less than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing works, see figure 9.5.

  • As an example of both of these first two principles in action, figure 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

  • In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g.
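
A small numerical check of this property on an illustrative 1-D signal and kernel, using a circular shift as a stand-in for translation (the few positions that wrap around the boundary are excluded from the comparison).

```python
import numpy as np

def cross_correlate1d(x, w):
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=20)
w = np.array([1.0, -2.0, 1.0])      # a small edge-like kernel

shift = 3
shifted_then_convolved = cross_correlate1d(np.roll(x, shift), w)
convolved_then_shifted = np.roll(cross_correlate1d(x, w), shift)

# Shifting then convolving equals convolving then shifting, away from the wrap-around.
print(np.allclose(shifted_then_convolved[shift:], convolved_then_shifted[shift:]))   # True
```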

  • Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image. Other mechanisms are necessary for handling these kinds of transformations.

  • Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data. We discuss this further in section 9.7.

Pooling

  • A typical layer of a convolutional network consists of three stages (see figure 9.7):
    In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations.
    In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
    In the third stage, we use a pooling function to modify the output of the layer further.

  • A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988) operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L^2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
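
A minimal sketch of max pooling over non-overlapping rectangular neighborhoods, applied to an illustrative 4×4 detector-stage output.

```python
import numpy as np

def max_pool2d(x, pool_h, pool_w):
    H, W = x.shape
    out = np.zeros((H // pool_h, W // pool_w))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * pool_h:(i + 1) * pool_h, j * pool_w:(j + 1) * pool_w]
            out[i, j] = patch.max()    # summarize the neighborhood by its maximum
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x, 2, 2))
# [[ 5.  7.]
#  [13. 15.]]
```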

  • In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See figure 9.8 for an example of how this works. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

  • The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
    Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see figure 9.9).

  • Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See figure 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication), this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.
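
A sketch of this downsampling effect in 1-D, with illustrative values: reporting the max of each width-3 window every 2 positions instead of every position roughly halves the number of inputs the next layer sees.

```python
import numpy as np

def max_pool1d(x, width, stride):
    positions = range(0, len(x) - width + 1, stride)
    return np.array([x[p:p + width].max() for p in positions])

detector_output = np.random.default_rng(0).normal(size=12)

dense = max_pool1d(detector_output, width=3, stride=1)
strided = max_pool1d(detector_output, width=3, stride=2)
print(len(dense), len(strided))   # 10 vs 5 pooled outputs
```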

  • For many tasks, pooling is essential for handling inputs of varying size. For example, if we want to classify images of variable size, the input to the classification layer must have a fixed size. This is usually accomplished by varying the size of an offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For example, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image size.
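
A sketch of the quadrant example: pool each quadrant of a feature map down to its maximum, so the classification layer always receives the same fixed number of summary statistics. The feature maps below are random and purely illustrative.

```python
import numpy as np

def quadrant_max_pool(x):
    H, W = x.shape
    h, w = H // 2, W // 2
    return np.array([
        [x[:h, :w].max(), x[:h, w:].max()],
        [x[h:, :w].max(), x[h:, w:].max()],
    ])

rng = np.random.default_rng(0)
for shape in [(6, 6), (10, 14), (32, 20)]:
    features = rng.normal(size=shape)
    print(shape, "->", quadrant_max_pool(features).shape)   # always (2, 2)
```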

  • Some theoretical work gives guidance as to which kinds of pooling one should use in various situations (Boureau et al., 2010). It is also possible to dynamically pool features together, for example, by running a clustering algorithm on the locations of interesting features (Boureau et al., 2011). This approach yields a different set of pooling regions for each image. Another approach is to learn a single pooling structure that is then applied to all images (Jia et al., 2012).

  • Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. These issues will be discussed further when we present these types of networks in part III. Pooling in convolutional Boltzmann machines is presented in section 20.6. The inverse-like operations on pooling units needed in some differentiable networks will be covered in section 20.10.6.

  • Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in figure 9.11.

Convolution and Pooling as an Infinitely Strong Prior

  • Recall the concept of a prior probability distribution from section 5.2. This is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.

  • Priors can be considered weak or strong depending on how concentrated the probability density in the prior is.
    A weak prior is a prior distribution with high entropy, such as a Gaussian distribution with high variance. Such a prior allows the data to move the parameters more or less freely.
    A strong prior has very low entropy, such as a Gaussian distribution with low variance. Such a prior plays a more active role in determining where the parameters end up.

  • An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.

  • We can imagine a convolutional net as being similar to a fully connected net, but with an infinitely strong prior over its weights.
    This infinitely strong prior says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space.
    The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.

  • Overall, we can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is equivariant to translation. Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.

  • One key insight is that convolution and pooling can cause underfitting. Like any prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error.

  • Some convolutional network architectures (Szegedy et al., 2014a) are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect. When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.

  • Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution would be able to learn even if we permuted all of the pixels in the image. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover the concept of topology via learning, and models that have the knowledge of spatial relationships hard-coded into them by their designer.

Exercises

  1. Draw schematic diagrams of the following network architectures: AlexNet, VGGNet, Inception, ResNet
  2. 1×1 convolution kernels in convolutional neural networks
