
CONTENTS

  • Convolutional networks (LeCun, 1989), also known as convolutional neural networks or CNNs, are a specialized kind of neural network for processing data that has a known, grid-like topology. Examples include time-series data, which can be thought of as a 1D grid taking samples at regular time intervals, and image data, which can be thought of as a 2D grid of pixels.
  • Convolutional networks have been tremendously successful in practical applications. The name “convolutional neural network” indicates that the network employs a mathematical operation called convolution. Convolution is a specialized kind of linear operation. Convolutional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers.
  • In this chapter, we will first describe what convolution is. Next, we will explain the motivation behind using convolution in a neural network. We will then describe an operation called pooling, which almost all convolutional networks employ. Usually, the operation used in a convolutional neural network does not correspond precisely to the definition of convolution as used in other fields such as engineering or pure mathematics. We will describe several variants on the convolution function that are widely used in practice for neural networks. We will also show how convolution may be applied to many kinds of data, with different numbers of dimensions. We then discuss means of making convolution more efficient. Convolutional networks stand out as an example of neuroscientific principles influencing deep learning. We will discuss these neuroscientific principles, then conclude with comments about the role convolutional networks have played in the history of deep learning. One topic this chapter does not address is how to choose the architecture of your convolutional network. The goal of this chapter is to describe the kinds of tools that convolutional networks provide, while chapter 11 describes general guidelines for choosing which tools to use in which circumstances. Research into convolutional network architectures proceeds so rapidly that a new best architecture for a given benchmark is announced every few weeks to months, rendering it impractical to describe the best architecture in print. However, the best architectures have consistently been composed of the building blocks described here.

The Convolution Operation

  • In its most general form, convolution is an operation on two functions of a real-valued argument. To motivate the definition of convolution, we start with examples of two functions we might use.

  • Suppose we are tracking the location of a spaceship with a laser sensor. Our laser sensor provides a single output x(t), the position of the spaceship at time t. Both x and t are real-valued, i.e., we can get a different reading from the laser sensor at any instant in time.

  • Now suppose that our laser sensor is somewhat noisy. To obtain a less noisy estimate of the spaceship’s position, we would like to average together several measurements. Of course, more recent measurements are more relevant, so we will want this to be a weighted average that gives more weight to recent measurements. We can do this with a weighting function w(a), where a is the age of a measurement. If we apply such a weighted average operation at every moment, we obtain a new function s providing a smoothed estimate of the position of the spaceship:
    s(t) = \int x(a) w(t-a) da
    This operation is called convolution. The convolution operation is typically denoted with an asterisk:
    s(t) = (x * w)(t)

  • In our example, w needs to be a valid probability density function, or the output is not a weighted average. Also, w needs to be 0 for all negative arguments, or it will look into the future, which is presumably beyond our capabilities. These limitations are particular to our example though. In general, convolution is defined for any functions for which the above integral is defined, and may be used for other purposes besides taking weighted averages.
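
Below is a minimal NumPy sketch of this smoothing example. The sensor signal and the exponential weighting function are illustrative choices, not taken from the text; the code applies a discrete, causal version of the weighted average defined above.

```python
import numpy as np

# Illustrative noisy sensor readings of a spaceship moving at constant speed.
rng = np.random.default_rng(0)
t = np.arange(100)
true_position = 0.5 * t
x = true_position + rng.normal(0.0, 2.0, t.size)   # noisy laser readings x(t)

ages = np.arange(10)                 # only the 10 most recent readings get weight
w = np.exp(-0.5 * ages)              # more weight for more recent measurements
w /= w.sum()                         # normalize so the output is a weighted average

# Discrete, causal version of s(t) = integral of x(a) w(t - a) da,
# i.e. s(t) = sum_a w(a) x(t - a) with w(a) = 0 for a < 0 and a >= 10.
# (The earliest few positions simply use fewer than 10 readings.)
s = np.array([
    sum(w[a] * x[i - a] for a in ages if i - a >= 0)
    for i in range(t.size)
])
print(np.round(s[:5], 2))
```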

  • In convolutional network terminology, the first argument (in this example, the function x) to the convolution is often referred to as the input and the second argument (in this example, the function w) as the kernel. The output is sometimes referred to as the feature map.

  • If we now assume that x and w are defined only on integer t, we can define the discrete convolution:
    s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a) w(t-a)
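
A direct sketch of this formula in plain Python/NumPy, assuming x and w are zero outside the stored entries (the arrays below are illustrative); the loop bounds make the "infinite" sum finite in practice, and the result is checked against NumPy's own np.convolve.

```python
import numpy as np

def conv1d(x, w):
    # s(t) = sum_a x(a) w(t - a), for t = 0 .. len(x) + len(w) - 2
    out = np.zeros(len(x) + len(w) - 1)
    for t in range(len(out)):
        for a in range(len(x)):
            if 0 <= t - a < len(w):          # w(t - a) is zero outside this range
                out[t] += x[a] * w[t - a]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25, 0.25])
print(conv1d(x, w))
print(np.convolve(x, w))   # NumPy's full discrete convolution agrees
```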

  • In machine learning applications, the input is usually a multidimensional array of data and the kernel is usually a multidimensional array of parameters that are adapted by the learning algorithm. We will refer to these multidimensional arrays as tensors. Because each element of the input and kernel must be explicitly stored separately, we usually assume that these functions are zero everywhere but the finite set of points for which we store the values. This means that in practice we can implement the infinite summation as a summation over a finite number of array elements.

  • Finally, we often use convolutions over more than one axis at a time. For example, if we use a two-dimensional image I as our input, we probably also want to use a two-dimensional kernel K:
    S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(m, n) K(i-m, j-n)
    Convolution is commutative, meaning we can equivalently write:
    S(i, j) = (K * I)(i, j) = \sum_m \sum_n I(i-m, j-n) K(m, n)
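
The following sketch implements both formulas literally, with an illustrative image and kernel, and checks that they give the same output under the usual assumption that I and K are zero outside their stored entries.

```python
import numpy as np

def conv2d_form1(I, K):
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(H):              # sum over input indices m, n
                for n in range(W):
                    if 0 <= i - m < kh and 0 <= j - n < kw:
                        S[i, j] += I[m, n] * K[i - m, j - n]
    return S

def conv2d_form2(I, K):
    H, W = I.shape
    kh, kw = K.shape
    S = np.zeros((H + kh - 1, W + kw - 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            for m in range(kh):             # sum over kernel indices: a smaller range
                for n in range(kw):
                    if 0 <= i - m < H and 0 <= j - n < W:
                        S[i, j] += I[i - m, j - n] * K[m, n]
    return S

I = np.arange(9.0).reshape(3, 3)
K = np.array([[1.0, -1.0], [0.5, 0.25]])
print(np.allclose(conv2d_form1(I, K), conv2d_form2(I, K)))   # True
```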

  • Usually the latter formula is more straightforward to implement in a machine learning library, because there is less variation in the range of valid values of m and n.

  • The commutative property of convolution arises because we have flipped the kernel relative to the input, in the sense that as m increases, the index into the input increases, but the index into the kernel decreases. The only reason to flip the kernel is to obtain the commutative property. While the commutative property is useful for writing proofs, it is not usually an important property of a neural network implementation. Instead, many neural network libraries implement a related function called the cross-correlation, which is the same as convolution but without flipping the kernel:
    S(i, j) = (I * K)(i, j) = \sum_m \sum_n I(i+m, j+n) K(m, n)
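
A sketch of this relationship on illustrative arrays: cross-correlating with a kernel flipped along both axes reproduces the flipped convolution defined earlier, which is why libraries can drop the flip without losing anything.

```python
import numpy as np

def cross_correlate2d(I, K):
    # "Valid" positions only, so the kernel always lies fully inside the image.
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # S(i, j) = sum_{m,n} I(i + m, j + n) K(m, n)
            S[i, j] = np.sum(I[i:i + kh, j:j + kw] * K)
    return S

def convolve2d(I, K):
    kh, kw = K.shape
    H, W = I.shape
    S = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            # The index into the input decreases as the index into the kernel grows.
            S[i, j] = sum(
                I[i + kh - 1 - m, j + kw - 1 - n] * K[m, n]
                for m in range(kh) for n in range(kw)
            )
    return S

I = np.arange(16.0).reshape(4, 4)
K = np.array([[1.0, 2.0], [3.0, 4.0]])
print(np.allclose(convolve2d(I, K), cross_correlate2d(I, K[::-1, ::-1])))   # True
```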

  • Many machine learning libraries implement cross-correlation but call it convolution.
    In this text we will follow this convention of calling both operations convolution, and specify whether we mean to flip the kernel or not in contexts where kernel flipping is relevant. In the context of machine learning, the learning algorithm will learn the appropriate values of the kernel in the appropriate place, so an algorithm based on convolution with kernel flipping will learn a kernel that is flipped relative to the kernel learned by an algorithm without the flipping. It is also rare for convolution to be used alone in machine learning; instead convolution is used simultaneously with other functions, and the combination of these functions does not commute regardless of whether the convolution operation flips its kernel or not.

  • See figure 9.1 for an example of convolution (without kernel flipping) applied to a 2-D tensor.

  • Discrete convolution can be viewed as multiplication by a matrix. However, the matrix has several entries constrained to be equal to other entries. For example, for univariate discrete convolution, each row of the matrix is constrained to be equal to the row above shifted by one element. This is known as a Toeplitz matrix. In two dimensions, a doubly block circulant matrix corresponds to convolution. In addition to these constraints that several elements be equal to each other, convolution usually corresponds to a very sparse matrix (a matrix whose entries are mostly equal to zero). This is because the kernel is usually much smaller than the input image. Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network. Typical convolutional neural networks do make use of further specializations in order to deal with large inputs efficiently, but these are not strictly necessary from a theoretical perspective.
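
A toy illustration of this matrix view for univariate discrete convolution, with a made-up signal and kernel: each row of the constructed matrix is the kernel shifted one position to the right of the row above, and everything else is zero.

```python
import numpy as np

def conv_matrix(w, n_in):
    k = len(w)
    n_out = n_in - k + 1               # number of positions where the kernel fits
    M = np.zeros((n_out, n_in))
    for i in range(n_out):
        M[i, i:i + k] = w              # row i: the kernel, shifted right by i
    return M

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25, 0.25])

M = conv_matrix(w, len(x))
print(M)                               # mostly zeros: the kernel is smaller than the input
print(M @ x)                           # sliding the (unflipped) kernel over x
print(np.convolve(x, w[::-1], mode='valid'))   # same values; np.convolve flips, so we pre-flip
```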

Motivation

  • Convolution leverages three important ideas that can help improve a machine learning system: sparse interactions, parameter sharing and equivariant representations. Moreover, convolution provides a means for working with inputs of variable size. We now describe each of these ideas in turn.

  • Traditional neural network layers use matrix multiplication by a matrix of parameters with a separate parameter describing the interaction between each input unit and each output unit. This means every output unit interacts with every input unit. Convolutional networks, however, typically have sparse interactions (also referred to as sparse connectivity or sparse weights).

  • This is accomplished by making the kernel smaller than the input. For example, when processing an image, the input image might have thousands or millions of pixels, but we can detect small, meaningful features such as edges with kernels that occupy only tens or hundreds of pixels. This means that we need to store fewer parameters, which both reduces the memory requirements of the model and improves its statistical efficiency. It also means that computing the output requires fewer operations.

  • These improvements in efficiency are usually quite large. If there are m inputs and n outputs, then matrix multiplication requires m × n parameters and the algorithms used in practice have O(m × n) runtime (per example). If we limit the number of connections each output may have to k, then the sparsely connected approach requires only k × n parameters and O(k × n) runtime. For many practical applications, it is possible to obtain good performance on the machine learning task while keeping k several orders of magnitude smaller than m.
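
A rough back-of-the-envelope comparison; the sizes below are made up for illustration (the single-kernel count previews the parameter sharing discussed below).

```python
m = 1_000_000      # number of inputs, e.g. pixels in a one-megapixel image
n = 1_000_000      # number of outputs of the layer
k = 9              # connections per output, e.g. a 3x3 kernel

dense_params = m * n        # fully connected: one parameter per input-output pair
sparse_params = k * n       # sparsely connected: k parameters per output
shared_params = k           # convolutional: one kernel of k parameters reused everywhere

print(f"{dense_params:,} vs {sparse_params:,} vs {shared_params:,}")
```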

  • For graphical demonstrations of sparse connectivity, see figure 9.2 and figure 9.3. In a deep convolutional network, units in the deeper layers may indirectly interact with a larger portion of the input, as shown in figure 9.4. This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.

  • Parameter sharing refers to using the same parameter for more than one function in a model. In a traditional neural net, each element of the weight matrix is used exactly once when computing the output of a layer. It is multiplied by one element of the input and then never revisited. As a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.

  • In a convolutional neural net, each member of the kernel is used at every position of the input (except perhaps some of the boundary pixels, depending on the design decisions regarding the boundary). The parameter sharing used by the convolution operation means that rather than learning a separate set of parameters for every location, we learn only one set. This does not affect the runtime of forward propagation (it is still O(k × n)), but it does further reduce the storage requirements of the model to k parameters. Recall that k is usually several orders of magnitude less than m. Since m and n are usually roughly the same size, k is practically insignificant compared to m × n. Convolution is thus dramatically more efficient than dense matrix multiplication in terms of the memory requirements and statistical efficiency. For a graphical depiction of how parameter sharing works, see figure 9.5.

  • As an example of both of these first two principles in action, figure 9.6 shows how sparse connectivity and parameter sharing can dramatically improve the efficiency of a linear function for detecting edges in an image.

  • In the case of convolution, the particular form of parameter sharing causes the layer to have a property called equivariance to translation. To say a function is equivariant means that if the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to a function g if f(g(x)) = g(f(x)). In the case of convolution, if we let g be any function that translates the input, i.e., shifts it, then the convolution function is equivariant to g.
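
A small numerical check of this property on an illustrative 1-D signal and kernel, using a circular shift as a stand-in for translation (the few positions that wrap around the boundary are excluded from the comparison).

```python
import numpy as np

def cross_correlate1d(x, w):
    k = len(w)
    return np.array([np.dot(x[i:i + k], w) for i in range(len(x) - k + 1)])

rng = np.random.default_rng(0)
x = rng.normal(size=20)
w = np.array([1.0, -2.0, 1.0])      # a small edge-like kernel

shift = 3
shifted_then_convolved = cross_correlate1d(np.roll(x, shift), w)
convolved_then_shifted = np.roll(cross_correlate1d(x, w), shift)

# Shifting then convolving equals convolving then shifting, away from the wrap-around.
print(np.allclose(shifted_then_convolved[shift:], convolved_then_shifted[shift:]))   # True
```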

  • Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image. Other mechanisms are necessary for handling these kinds of transformations.

  • Finally, some kinds of data cannot be processed by neural networks defined by matrix multiplication with a fixed-shape matrix. Convolution enables processing of some of these kinds of data. We discuss this further in section 9.7.

Pooling

  • A typical layer of a convolutional network consists of three stages (see figure 9.7):
    In the first stage, the layer performs several convolutions in parallel to produce a set of linear activations.
    In the second stage, each linear activation is run through a nonlinear activation function, such as the rectified linear activation function. This stage is sometimes called the detector stage.
    In the third stage, we use a pooling function to modify the output of the layer further.

  • A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. For example, the max pooling (Zhou and Chellappa, 1988) operation reports the maximum output within a rectangular neighborhood. Other popular pooling functions include the average of a rectangular neighborhood, the L^2 norm of a rectangular neighborhood, or a weighted average based on the distance from the central pixel.
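
A minimal sketch of max pooling over non-overlapping rectangular neighborhoods, applied to an illustrative 4×4 detector-stage output.

```python
import numpy as np

def max_pool2d(x, pool_h, pool_w):
    H, W = x.shape
    out = np.zeros((H // pool_h, W // pool_w))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * pool_h:(i + 1) * pool_h, j * pool_w:(j + 1) * pool_w]
            out[i, j] = patch.max()    # summarize the neighborhood by its maximum
    return out

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x, 2, 2))
# [[ 5.  7.]
#  [13. 15.]]
```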

  • In all cases, pooling helps to make the representation become approximately invariant to small translations of the input. Invariance to translation means that if we translate the input by a small amount, the values of most of the pooled outputs do not change. See figure 9.8 for an example of how this works. Invariance to local translation can be a very useful property if we care more about whether some feature is present than exactly where it is. For example, when determining whether an image contains a face, we need not know the location of the eyes with pixel-perfect accuracy; we just need to know that there is an eye on the left side of the face and an eye on the right side of the face. In other contexts, it is more important to preserve the location of a feature. For example, if we want to find a corner defined by two edges meeting at a specific orientation, we need to preserve the location of the edges well enough to test whether they meet.

  • The use of pooling can be viewed as adding an infinitely strong prior that the function the layer learns must be invariant to small translations. When this assumption is correct, it can greatly improve the statistical efficiency of the network.
    Pooling over spatial regions produces invariance to translation, but if we pool over the outputs of separately parametrized convolutions, the features can learn which transformations to become invariant to (see figure 9.9).

  • Because pooling summarizes the responses over a whole neighborhood, it is possible to use fewer pooling units than detector units, by reporting summary statistics for pooling regions spaced k pixels apart rather than 1 pixel apart. See figure 9.10 for an example. This improves the computational efficiency of the network because the next layer has roughly k times fewer inputs to process. When the number of parameters in the next layer is a function of its input size (such as when the next layer is fully connected and based on matrix multiplication), this reduction in the input size can also result in improved statistical efficiency and reduced memory requirements for storing the parameters.
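
A sketch of this downsampling effect in 1-D, with illustrative values: reporting the max of each width-3 window every 2 positions instead of every position roughly halves the number of inputs the next layer sees.

```python
import numpy as np

def max_pool1d(x, width, stride):
    positions = range(0, len(x) - width + 1, stride)
    return np.array([x[p:p + width].max() for p in positions])

detector_output = np.random.default_rng(0).normal(size=12)

dense = max_pool1d(detector_output, width=3, stride=1)
strided = max_pool1d(detector_output, width=3, stride=2)
print(len(dense), len(strided))   # 10 vs 5 pooled outputs
```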

  • For many tasks, pooling is essential for handling inputs of varying size. For example, if we want to classify images of variable size, the input to the classification layer must have a fixed size. This is usually accomplished by varying the size of an offset between pooling regions so that the classification layer always receives the same number of summary statistics regardless of the input size. For example, the final pooling layer of the network may be defined to output four sets of summary statistics, one for each quadrant of an image, regardless of the image size.
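
A sketch of the quadrant example: pool each quadrant of a feature map down to its maximum, so the classification layer always receives the same fixed number of summary statistics. The feature maps below are random and purely illustrative.

```python
import numpy as np

def quadrant_max_pool(x):
    H, W = x.shape
    h, w = H // 2, W // 2
    return np.array([
        [x[:h, :w].max(), x[:h, w:].max()],
        [x[h:, :w].max(), x[h:, w:].max()],
    ])

rng = np.random.default_rng(0)
for shape in [(6, 6), (10, 14), (32, 20)]:
    features = rng.normal(size=shape)
    print(shape, "->", quadrant_max_pool(features).shape)   # always (2, 2)
```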

  • Some theoretical work gives guidance as to which kinds of pooling one should use in various situations (Boureau et al., 2010). It is also possible to dynamically pool features together, for example, by running a clustering algorithm on the locations of interesting features (Boureau et al., 2011). This approach yields a different set of pooling regions for each image. Another approach is to learn a single pooling structure that is then applied to all images (Jia et al., 2012).

  • Pooling can complicate some kinds of neural network architectures that use top-down information, such as Boltzmann machines and autoencoders. These issues will be discussed further when we present these types of networks in part III. Pooling in convolutional Boltzmann machines is presented in section 20.6. The inverse-like operations on pooling units needed in some differentiable networks will be covered in section 20.10.6.

  • Some examples of complete convolutional network architectures for classification using convolution and pooling are shown in figure 9.11.

Convolution and Pooling as an Infinitely Strong Prior

  • Recall the concept of a prior probability distribution from section 5.2. This is a probability distribution over the parameters of a model that encodes our beliefs about what models are reasonable, before we have seen any data.

  • Priors can be considered weak or strong depending on how concentrated the probability density in the prior is.
    A weak prior is a prior distribution with high entropy, such as a Gaussian distribution with high variance. Such a prior allows the data to move the parameters more or less freely.
    A strong prior has very low entropy, such as a Gaussian distribution with low variance. Such a prior plays a more active role in determining where the parameters end up.

  • An infinitely strong prior places zero probability on some parameters and says that these parameter values are completely forbidden, regardless of how much support the data gives to those values.

  • We can imagine a convolutional net as being similar to a fully connected net, but with an infinitely strong prior over its weights.
    This infinitely strong prior says that the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space.
    The prior also says that the weights must be zero, except for in the small, spatially contiguous receptive field assigned to that hidden unit.

  • Overall, we can think of the use of convolution as introducing an infinitely strong prior probability distribution over the parameters of a layer. This prior says that the function the layer should learn contains only local interactions and is equivariant to translation. Likewise, the use of pooling is an infinitely strong prior that each unit should be invariant to small translations.

  • One key insight is that convolution and pooling can cause underfitting. Like any prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate. If a task relies on preserving precise spatial information, then using pooling on all features can increase the training error.

  • Some convolutional network architectures (Szegedy et al., 2014a) are designed to use pooling on some channels but not on other channels, in order to get both highly invariant features and features that will not underfit when the translation invariance prior is incorrect. When a task involves incorporating information from very distant locations in the input, then the prior imposed by convolution may be inappropriate.

  • Another key insight from this view is that we should only compare convolutional models to other convolutional models in benchmarks of statistical learning performance. Models that do not use convolution would be able to learn even if we permuted all of the pixels in the image. For many image datasets, there are separate benchmarks for models that are permutation invariant and must discover the concept of topology via learning, and models that have the knowledge of spatial relationships hard-coded into them by their designer.

Exercises

  1. Draw schematic diagrams of the following network architectures: AlexNet, VGGNet, Inception, ResNet
  2. 1×1 convolution kernels in convolutional neural networks
