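This series is aimed at deep learning researchers and engineers. It uses Image Caption Generation, an interesting and concrete task, to introduce deep learning in an accessible way. The series covers many popular deep learning models, such as CNN, RNN/LSTM, and Attention. This is the sixth article.
Li Li: Understanding Deep Learning through Image Caption Generation (Part I)
Li Li: Understanding Deep Learning through Image Caption Generation (Part II)
Li Li: Understanding Deep Learning through Image Caption Generation (Part III)
Li Li: Automatic Gradient Computation, Another View of the Backpropagation Algorithm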
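A framework or tool obviously cannot rely on manual derivation, and numerical differentiation is too slow, so it is generally only used for gradient checks. That leaves symbolic differentiation and automatic differentiation, and current frameworks all use automatic differentiation.
[Note: theano describes itself as doing Symbolic Differentiation, but the meaning differs from symbolic differentiation in the mathematical sense. Interested readers can consult the linked reference.]
1. Users can define the computation graph from basic functions (also called operations, or ops);
2. Users can only use higher-level functions.
Other frameworks or tools may be less flexible, but they are all essentially alike: we define a computation graph in some way (code or a configuration file), declare which values are variables [trainable] and which are [constants] (or values supplied per batch, such as a placeholder in tensorflow), and define the loss function. The framework then automatically computes the derivative of the loss with respect to every trainable parameter. Most frameworks also package the common gradient descent methods, so we only need to specify a few parameters such as the batch size and the learning rate.
Some frameworks, such as theano, do not do this. theano only computes gradients for us. Such tools are more "low-level": they demand more from the user but are also more flexible, which suits users who care about the details and those who need to "invent" their own network structures. Many people in academia like theano. Tools such as caffe, torch, and keras are more "high-level": with them I only need to define the layers of a CNN or DNN one by one (how many hidden units a layer has, which activation function it uses, whether it uses dropout, which loss function to use), and the rest is taken care of.
Please read the list of required software carefully. Below are some installation commands for my environment (Ubuntu 14.04 LTS).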
cd assignment2
sudo pip install virtualenv      # This may already be installed
virtualenv .env                  # Create a virtual environment
source .env/bin/activate         # Activate the virtual environment
pip install -r requirements.txt  # Install dependencies
# Work on the assignment for a while ...
deactivate                       # Exit the virtual environment
virtualenv can be thought of as a virtual python environment that is isolated from the system environment, and installing packages into it does not require root privileges. Remember to run source .env/bin/activate before using it!
cd cs231n/datasets
./get_datasets.sh
python setup.py build_ext --inplace
5. Start ipython notebook
(.env) lili@lili-desktop:~/cs231n/assignment2$ ipython notebook
If you have never used ipython notebook, please read the linked reference first. Make sure you understand the basic operations, such as how to execute a cell, before reading on.
def layer_forward(x, w):
    """ Receive inputs x and weights w """
    # Do some computations ...
    z = ...  # some intermediate value
    # Do some more computations ...
    out = ...  # the output
    cache = (x, w, z, out)  # Values we need to compute gradients
    return out, cache
def layer_backward(dout, cache):
    """
    Receive derivative of loss with respect to outputs and cache,
    and compute derivative with respect to inputs.
    """
    # Unpack cache values
    x, w, z, out = cache
    # Use values in cache to compute derivatives
    dx = ...  # Derivative of loss with respect to x
    dw = ...  # Derivative of loss with respect to w
    return dx, dw
At the start we read the inputs, outputs, and intermediate values from the cache. We then compute the local gradient for each variable and multiply it by the dout passed back from the next layer, which gives the final dLoss/dx and dLoss/dw. These are then returned.
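import numpy as np

def rel_error(x, y):
    """ returns relative error """
    return np.max(np.abs(x - y) / (np.maximum(1e-8, np.abs(x) + np.abs(y))))
This function computes the relative error between two ndarrays [two numbers, two vectors, or two matrices]. It is mainly used for gradient checks, that is, for comparing the numerical gradient with the gradient we compute. If the relative error is small, our gradient is probably correct. [A large error means the gradient is definitely wrong, but a small error does not guarantee it is right. It is like a unit test: passing does not prove there are no bugs, but failing proves there is one.]
Installation is the same as for assignment 2. However, pip install -r requirements.txt may report that pillow-3.0 already exists. Open requirements.txt; it lists two versions of pillow, and deleting one of them fixes the problem.
The cifar10 data consists of images in 10 classes: 'plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck'. There are 60,000 labeled images in total: 50,000 for training and 10,000 for testing. Here we use 49,000 of the 50,000 training images for the actual training and 1,000 for validation.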
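To make the idea of a gradient check concrete, here is a minimal sketch. The helper `numerical_gradient` and the test function `f` are hypothetical names chosen for illustration (they are not the course's own utilities); the sketch only assumes the `rel_error` definition given above.

```python
import numpy as np

def rel_error(x, y):
    """Relative error, same definition as in the text."""
    return np.max(np.abs(x - y) / np.maximum(1e-8, np.abs(x) + np.abs(y)))

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient of a scalar function f at x (hypothetical helper)."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old  # restore the original value
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad

f = lambda a: np.sum(a ** 2)             # toy loss with known gradient 2 * a
x = np.array([[1.0, -2.0], [3.0, 0.5]])

numeric = numerical_gradient(f, x)
err = rel_error(2 * x, numeric)          # correct analytic gradient
wrong_err = rel_error(3 * x, numeric)    # deliberately wrong gradient
```

With the correct gradient the relative error is tiny, while the deliberately wrong gradient gives an error of about 0.2, which is the kind of gap a gradient check is meant to expose.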
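def affine_forward(x, w, b):
    """
    Computes the forward pass for an affine (fully-connected) layer.

    The input x has shape (N, d_1, ..., d_k) and contains a minibatch of N
    examples, where each example x[i] has shape (d_1, ..., d_k). We will
    reshape each input into a vector of dimension D = d_1 * ... * d_k, and
    then transform it to an output vector of dimension M.

    Inputs:
    - x: A numpy array containing input data, of shape (N, d_1, ..., d_k)
    - w: A numpy array of weights, of shape (D, M)
    - b: A numpy array of biases, of shape (M,)

    Returns a tuple of:
    - out: output, of shape (N, M)
    - cache: (x, w, b)
    """
    out = None
    ###########################################################################
    # TODO: Implement the affine forward pass. Store the result in out. You   #
    # will need to reshape the input into rows.                               #
    ###########################################################################
    pass
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    cache = (x, w, b)
    return out, cache
This function computes the forward pass of an affine (fully connected) layer. "Affine transformation" sounds very mathematical, but we only need to know that it is a simple linear transformation plus a bias. For the details, see the wiki.
N is the batch size. For efficiency we usually compute the forward and backward pass for a whole batch at once. Why is the number of remaining dimensions variable? It is a matter of convenience, because the filter sizes of a CNN are not fixed. We can simply flatten this multi-dimensional tensor into a one-dimensional vector, since a fully connected layer ignores the spatial position of its inputs [a CNN does take spatial relationships into account, which is why it works better for image processing; we cover this in detail later when we introduce CNNs]. We denote the dimension of the flattened vector by D, where D = d_1 × ... × d_k.
w is a numpy ndarray of shape (D, M). This is easy to understand: for a fully connected layer with D input neurons and M output neurons, the parameter w is a (D, M) matrix. [Note: readers who remember the earlier code will recall that we had it the other way round, with w an M*D matrix. Either convention works; it is only a habit, although some computations then need a transpose and others do not. The one thing to remember is that the shapes must satisfy the rules of matrix multiplication!]
Output out
The output out is an ndarray of shape (N, M).
cache stores this layer's inputs and intermediate variables. Here cache = (x, w, b) is a tuple holding x, w, and b, which are used in the backward pass.
Implementing the function affine_forward
N = x.shape[0]
x_temp = x.reshape(N,-1)
$ ipython
import numpy as np
np.reshape?
out has shape N×M, w has shape D×M, and x_temp has shape N×D, so the only valid product is x_temp·w. out is therefore computed as follows: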
out = x_temp.dot(w) + b
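If we processed a single example instead of a batch, N would be 1 and the addition would be straightforward. Here we compute x·w for N training examples at once, while b is the same for all of them (w and b do not change across the N computations). Without broadcasting we would have to copy b into an N×M matrix, which wastes space; with broadcasting, numpy adds the (M,) vector b to every row of the (N, M) product.
dout: dLoss/dout passed back from the upper (later) layer, with the same shape as out, (N, M).
x: input, of shape (N,d1,…,dk)
w: weight matrix, of shape (D, M)
b: bias, of shape (M,)
dx: dLoss/dx, of shape (N,d1,…,dk)
dw: dLoss/dw, of shape (D, M)
db: dLoss/db, of shape (M,)
def affine_backward(dout, cache):
    x, w, b = cache
    dx, dw, db = None, None, None
    ###########################################################################
    # TODO: Implement the affine backward pass.                               #
    ###########################################################################
    db = np.sum(dout, axis = 0)
    x_temp = x.reshape(x.shape[0],-1)
    dw = x_temp.T.dot(dout)
    dx = dout.dot(w.T).reshape(x.shape)
    ###########################################################################
    #                             END OF YOUR CODE                            #
    ###########################################################################
    return dx, dw, db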
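The reshape and broadcasting steps can be checked on a tiny example. The sizes and values below are arbitrary and chosen only for illustration; `np.tile` is used to build the explicit N×M copy of b that broadcasting lets us avoid.

```python
import numpy as np

N, D, M = 2, 12, 3
# A toy batch with shape (N, d_1, d_2, d_3) = (2, 3, 2, 2), so D = 3 * 2 * 2 = 12
x = np.arange(N * D, dtype=np.float64).reshape(N, 3, 2, 2)
w = np.ones((D, M))
b = np.array([0.0, 1.0, 2.0])

x_temp = x.reshape(N, -1)            # flatten each example: shape (2, 12)
out = x_temp.dot(w) + b              # b of shape (M,) is broadcast over the N rows

# The same result with b copied explicitly into an (N, M) matrix
b_tiled = np.tile(b, (N, 1))
out_tiled = x_temp.dot(w) + b_tiled
```

Both versions give the same (N, M) output; broadcasting simply avoids materializing the repeated bias matrix.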
dw = x_temp.T.dot(dout)
dx = dout.dot(w.T).reshape(x.shape)
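Finally there is db. If the batch size were 1 this would simply be db=dout, but now dout holds the gradients for N training examples, so they must be summed. We use the np.sum function for this [a for loop would also work, but it is slower and more verbose]: db = np.sum(dout, axis = 0)
In [10]: dout=np.array([[1,2,3],[4,5,6]])
In [11]: dout
Out[11]:
array([[1, 2, 3],
       [4, 5, 6]])
In [12]: dout.sum(axis=0)
Out[12]: array([5, 7, 9])
This cell implements the ReLU forward pass.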
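As a sanity check, the three gradient formulas can be compared against numerical gradients. This is a sketch under assumptions: `numeric_grad` is a hypothetical helper written for this example, and `affine_forward` is filled in with the implementation described above.

```python
import numpy as np

def affine_forward(x, w, b):
    out = x.reshape(x.shape[0], -1).dot(w) + b
    return out, (x, w, b)

def affine_backward(dout, cache):
    x, w, b = cache
    db = np.sum(dout, axis=0)
    dw = x.reshape(x.shape[0], -1).T.dot(dout)
    dx = dout.dot(w.T).reshape(x.shape)
    return dx, dw, db

def numeric_grad(f, a, dout, h=1e-5):
    """Centered-difference estimate of d(sum(f() * dout)) / da (hypothetical helper)."""
    grad = np.zeros_like(a)
    for idx in np.ndindex(a.shape):
        old = a[idx]
        a[idx] = old + h
        pos = f()
        a[idx] = old - h
        neg = f()
        a[idx] = old  # restore
        grad[idx] = np.sum((pos - neg) * dout) / (2 * h)
    return grad

rng = np.random.RandomState(0)
x = rng.randn(4, 2, 3)       # N = 4, D = 2 * 3 = 6
w = rng.randn(6, 5)          # M = 5
b = rng.randn(5)
dout = rng.randn(4, 5)

f = lambda: affine_forward(x, w, b)[0]
dx, dw, db = affine_backward(dout, affine_forward(x, w, b)[1])

max_diff = max(np.max(np.abs(dx - numeric_grad(f, x, dout))),
               np.max(np.abs(dw - numeric_grad(f, w, dout))),
               np.max(np.abs(db - numeric_grad(f, b, dout))))
```

The analytic and numerical gradients should agree to within floating point noise, and each gradient has the same shape as the quantity it belongs to.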
out = np.maximum(0, x)
This cell implements the ReLU backward pass.
dx = (x >= 0) * dout
In [15]: x=np.array([1,-1])
In [16]: x>=0
Out[16]: array([ True, False], dtype=bool)
In [15]: x=np.array([1,-1])
In [16]: x>=0
Out[16]: array([ True, False], dtype=bool)
In [17]: y=np.array([2.0,3.0])
In [18]: (x>=0)*y
Out[18]: array([ 2., 0.])
So numpy's x>=0 is exactly the mathematical indicator function $\mathbb{1}[x \ge 0]$, which is 1 when $x \ge 0$ and 0 otherwise.
The code is in cs231n/layer_utils.py
def affine_relu_forward(x, w, b):
    """
    Convenience layer that performs an affine transform followed by a ReLU

    Inputs:
    - x: Input to the affine layer
    - w, b: Weights for the affine layer

    Returns a tuple of:
    - out: Output from the ReLU
    - cache: Object to give to the backward pass
    """
    a, fc_cache = affine_forward(x, w, b)
    out, relu_cache = relu_forward(a)
    cache = (fc_cache, relu_cache)
    return out, cache

def affine_relu_backward(dout, cache):
    """
    Backward pass for the affine-relu convenience layer
    """
    fc_cache, relu_cache = cache
    da = relu_backward(dout, relu_cache)
    dx, dw, db = affine_backward(da, fc_cache)
    return dx, dw, db
The course code already provides this, because the function was meant to be completed in the first assignment. Since we skipped assignment 1, we still need to understand the code. We will not cover the svm loss in detail; interested readers can follow the linked reference. We will briefly discuss the softmax loss, because it is very common in neural networks. For a detailed introduction, please read the linked reference.
First, a concept needs clarifying: strictly speaking, there is no loss function called softmax loss. The term refers to adding a softmax function at the output layer and then using the cross entropy loss.
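Putting the two one-line implementations together gives the following small sketch. The function names follow the course's naming convention, but the bodies and the sample values here are only an illustration of the two lines shown above.

```python
import numpy as np

def relu_forward(x):
    out = np.maximum(0, x)       # elementwise max(0, x)
    cache = x                    # the backward pass only needs the input
    return out, cache

def relu_backward(dout, cache):
    x = cache
    return (x >= 0) * dout       # the boolean mask acts as the indicator function

x = np.array([[1.0, -1.0], [-0.5, 2.0]])
dout = np.array([[10.0, 20.0], [30.0, 40.0]])

out, cache = relu_forward(x)
dx = relu_backward(dout, cache)
```

The upstream gradient passes through unchanged wherever the input was non-negative and is zeroed elsewhere.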
The cross entropy loss function
The actual class is one of 1, 2, …, K, which we can represent in one-hot form. For example, if the class is 2, we write it as [0, 1, 0, …, 0].
We can then use cross entropy to measure the "distance" between the true distribution p=[0,1,0,…,0] and the distribution q output by the model. For details, see the linked reference.
For example, suppose K=5 and the true class is 2. If the classifier outputs [0.1, 0.7, 0.1, 0.1, 0], the loss is -log0.7. If the classifier outputs [0.3, 0.7, 0, 0, 0], the loss is still -log0.7. The loss only looks at the value assigned to the true class, which is a very reasonable property. The larger the second element, the higher the probability the classifier assigns to class 2, so the log value is larger [at most log1=0, meaning no loss], -log is smaller, and the loss is smaller!
Code for the softmax loss
def softmax_loss(x, y):
    """
    Computes the loss and gradient for softmax classification.

    Inputs:
    - x: Input data, of shape (N, C) where x[i, j] is the score for the jth
      class for the ith input.
    - y: Vector of labels, of shape (N,) where y[i] is the label for x[i] and
      0 <= y[i] < C

    Returns a tuple of:
    - loss: Scalar giving the loss
    - dx: Gradient of the loss with respect to x
    """
    probs = np.exp(x - np.max(x, axis=1, keepdims=True))
    probs /= np.sum(probs, axis=1, keepdims=True)
    N = x.shape[0]
    loss = -np.sum(np.log(probs[np.arange(N), y])) / N
    dx = probs.copy()
    dx[np.arange(N), y] -= 1
    dx /= N
    return loss, dx
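The numbers in this example are easy to verify. The helper `cross_entropy` below is a hypothetical function written for this illustration; it skips terms where the true probability is 0, so a predicted probability of 0 for a wrong class causes no trouble.

```python
import numpy as np

def cross_entropy(p, q):
    """-sum_k p_k * log(q_k), skipping terms where p_k == 0 (hypothetical helper)."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

p = [0, 1, 0, 0, 0]                                   # true class is 2 (one-hot)
loss_a = cross_entropy(p, [0.1, 0.7, 0.1, 0.1, 0.0])
loss_b = cross_entropy(p, [0.3, 0.7, 0.0, 0.0, 0.0])
loss_c = cross_entropy(p, [0.05, 0.9, 0.05, 0.0, 0.0])  # more confident, smaller loss
```

`loss_a` and `loss_b` are both -log 0.7, about 0.357, and `loss_c` is smaller because more probability is placed on the true class.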
probs = np.exp(x - np.max(x, axis=1, keepdims=True))
For example, with input x=[1000000, 1000000], exp(1000000) overflows. What can we do?
In [21]: x=np.array([[1,3,5],[2,1,1]])
In [22]: np.max(x)
Out[22]: 5
In [23]: np.max(x, axis=1)
Out[23]: array([5, 2])
In [24]: np.max(x, axis=1, keepdims=True)
Out[24]:
array([[5],
       [2]])
In [25]: np.max(x, axis=1).shape
Out[25]: (2,)
In [26]: np.max(x, axis=1, keepdims=True).shape
Out[26]: (2, 1)
Finally, np.exp computes the exp of every element of this (N, C) matrix (it is a universal function).
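The effect of subtracting the row maximum can be seen in a short sketch. The function name `softmax` and the inputs are chosen for illustration; the body is the same two lines used in softmax_loss.

```python
import numpy as np

def softmax(x):
    # Subtracting the per-row max leaves the result unchanged mathematically,
    # but keeps every exponent <= 0 so np.exp cannot overflow.
    shifted = x - np.max(x, axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=1, keepdims=True)

small = np.array([[1.0, 3.0, 5.0]])
big = small + 1000000.0          # same differences between scores, huge magnitudes

p_small = softmax(small)
p_big = softmax(big)
```

Both inputs yield the same probabilities, because softmax depends only on the differences between scores. Calling np.exp on the unshifted large scores directly would overflow to inf.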
probs /= np.sum(probs, axis=1, keepdims=True)
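N = x.shape[0]
loss = -np.sum(np.log(probs[np.arange(N), y])) / N
First we need to understand probs[np.arange(N), y]
In [39]: x=np.array([[1,3,5],[2,1,1]])
In [40]: probs = np.exp(x - np.max(x, axis=1, keepdims=True))
In [41]: probs /= np.sum(probs, axis=1, keepdims=True)
In [42]: N = x.shape[0]
In [43]: y=np.array([1,0])
In [44]: probs[np.arange(N), y]
Out[44]: array([ 0.11731043, 0.57611688])
As discussed above, we need the log of the probability at the index of the "true" class. The "true" class indices are y. In the example above x holds two training examples and y holds the corresponding correct class indices, 1 and 0. We therefore need the entry at row 0, column 1 and the entry at row 1, column 0; we take -log of each and add them up. We could write a for loop for this, but numpy's ndarray offers a convenient way to pick out part of an array by indexing with integer arrays. np.arange is similar to python's standard range but returns an ndarray, here the N numbers (0,1,…,N-1). probs[np.arange(N), y] then indexes with these two one-dimensional arrays and returns a one-dimensional array, equivalent to [probs[0,1], probs[1,0]]
We split into two cases. The first case is i=j. First recall:
The second case is i≠j
The summation index k splits into two cases, k=i and k≠i. Substituting each into the formula above gives:
The last step uses $\sum_k y_k = 1$.
The derivation is somewhat involved, but the result is easy to remember: the gradient of softmax + cross entropy is the model's prediction p minus the label y.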
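The equivalence between the integer-array indexing and the explicit loop can be sketched as follows. The probabilities here are made-up values for illustration.

```python
import numpy as np

probs = np.array([[0.1, 0.2, 0.7],
                  [0.5, 0.25, 0.25]])
y = np.array([1, 0])             # true class index for each row
N = probs.shape[0]

# Integer-array indexing pairs up row i with column y[i]
picked = probs[np.arange(N), y]

# The equivalent explicit loop
picked_loop = np.array([probs[i, y[i]] for i in range(N)])

loss = -np.sum(np.log(picked)) / N
```

Here `picked` contains probs[0, 1] and probs[1, 0], that is 0.2 and 0.5, and the loss is the average of their negative logs.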
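For reference, the standard derivation can be written out as below. The notation is an assumption made for this sketch (x is the score vector, p = softmax(x), y is the one-hot label, and L is the cross entropy loss) and may differ from the symbols used in the original figures.

```latex
% Definitions (assumed notation)
p_i = \frac{e^{x_i}}{\sum_k e^{x_k}}, \qquad
L = -\sum_k y_k \log p_k

% Case i = j
\frac{\partial p_i}{\partial x_i} = p_i (1 - p_i)

% Case i \neq j
\frac{\partial p_i}{\partial x_j} = -p_i p_j

% Split the sum over k into k = i and k \neq i, then use \sum_k y_k = 1
\frac{\partial L}{\partial x_i}
  = -\sum_k \frac{y_k}{p_k} \frac{\partial p_k}{\partial x_i}
  = -y_i (1 - p_i) + \sum_{k \neq i} y_k p_i
  = p_i \sum_k y_k - y_i
  = p_i - y_i
```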
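dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N
We need to implement probs - y, but the y in the formula is a one-hot vector, whereas our y here holds class indices [one index per example in the batch]. So we first copy probs into a new variable dx [personally I think modifying probs in place would also be fine]. Since y is 1 only at the index of the correct label, we apply dx[np.arange(N), y] -= 1, and then divide by N to obtain the averaged dx.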
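To see that the in-place subtraction really is probs - y with a one-hot y, the two can be compared directly. The probabilities and labels below are made-up values for illustration.

```python
import numpy as np

probs = np.array([[0.2, 0.3, 0.5],
                  [0.6, 0.1, 0.3]])
y = np.array([2, 0])             # class indices, not one-hot
N, C = probs.shape

# The index-based version used in softmax_loss
dx = probs.copy()
dx[np.arange(N), y] -= 1
dx /= N

# The explicit one-hot version from the formula
one_hot = np.zeros((N, C))
one_hot[np.arange(N), y] = 1
dx_one_hot = (probs - one_hot) / N
```

Both give the same gradient, and each row sums to zero because the probabilities and the one-hot label both sum to one.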
class TwoLayerNet(object):
    """
    A two-layer fully-connected neural network with ReLU nonlinearity and
    softmax loss. We assume an input dimension of D, a hidden dimension of H,
    and an output of C classes.

    The architecture is affine - relu - affine - softmax.

    Note that this class does not implement gradient descent; instead, it
    will interact with a separate Solver object that is responsible for
    optimizing the parameters.

    The learnable parameters of the model are stored in the dictionary
    self.params, which maps parameter names to numpy ndarrays.
    """

    def __init__(self, input_dim=3*32*32, hidden_dim=100, num_classes=10,
                 weight_scale=1e-3, reg=0.0):
        """
        Initialize a new network.

        Inputs:
        - input_dim: An integer giving the size of the input vector; default
          3*32*32 [the cifar-10 data].
        - hidden_dim: An integer giving the number of hidden units; default 100.
        - num_classes: An integer giving the number of classes; default 10
          [the number of cifar-10 classes].
        - dropout: Scalar between 0 and 1 giving the dropout probability
          (listed in the docstring, but not a parameter of this constructor).
        - weight_scale: Scalar giving the standard deviation used for random
          initialization of the weights; default 1e-3.
        - reg: Scalar giving the L2 regularization strength.
        """
        self.params = {}
        self.reg = reg

        ########################################################################
        # TODO: Initialize the weights and biases of the two-layer net.        #
        # Weights are drawn from a Gaussian with mean 0 and standard deviation #
        # weight_scale; biases are initialized to 0. All weights and biases    #
        # are stored in self.params: the first layer uses keys 'W1' and 'b1',  #
        # the second layer uses 'W2' and 'b2'.                                 #
        ########################################################################
        # np.random.normal generates a matrix of the given shape with standard
        # deviation weight_scale
        self.params['W1'] = np.random.normal(0, weight_scale, (input_dim, hidden_dim))
        self.params['b1'] = np.zeros(hidden_dim)
        self.params['W2'] = np.random.normal(0, weight_scale, (hidden_dim, num_classes))
        self.params['b2'] = np.zeros(num_classes)
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

    def loss(self, X, y=None):
        """
        Compute loss and gradient for a minibatch of data.

        Inputs:
        - X: ndarray of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,). y[i] is the label for X[i] and
          takes values in {0, 1, ..., C-1}

        Returns:
        If y is None, run a test-time forward pass [note: at test time there
        is no need to compute the final softmax, because we only want to pick
        a class and softmax is monotonic:
        argmax softmax[x1,x2] = argmax [x1,x2]] and return:
        - scores: Array of shape (N, C) giving classification scores, where
          scores[i, c] is the score of class c for X[i].

        If y is not None, run a training-time forward and backward pass and
        return a tuple of:
        - loss: Scalar giving the loss
        - grads: Dictionary with the same keys as self.params, mapping
          parameter names to the corresponding gradients.
        """
        scores = None
        ########################################################################
        # TODO: Implement the forward pass for the two-layer net, computing    #
        # the class scores.                                                    #
        ########################################################################
        affine_relu_out, affine_relu_cache = affine_relu_forward(
            X, self.params['W1'], self.params['b1'])
        affine2_out, affine2_cache = affine_forward(
            affine_relu_out, self.params['W2'], self.params['b2'])
        scores = affine2_out
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

        # If y is None we are in test mode and only need to return scores
        if y is None:
            return scores

        loss, grads = 0, {}
        ########################################################################
        # TODO: Implement the backward pass for the two-layer net. Store the   #
        # loss in the loss variable and gradients in the grads dictionary.     #
        # Compute data loss using softmax, and make sure that grads[k] holds   #
        # the gradients for self.params[k]. Don't forget to add L2             #
        # regularization!                                                      #
        #                                                                      #
        # NOTE: To ensure that your implementation matches ours and you pass   #
        # the automated tests, make sure that your L2 regularization includes  #
        # a factor of 0.5 to simplify the expression for the gradient.         #
        ########################################################################
        loss, dscores = softmax_loss(scores, y)
        loss += 0.5 * self.reg * (np.sum(self.params['W1'] * self.params['W1'])
                                  + np.sum(self.params['W2'] * self.params['W2']))
        affine2_dx, affine2_dw, affine2_db = affine_backward(dscores, affine2_cache)
        grads['W2'] = affine2_dw + self.reg * self.params['W2']
        grads['b2'] = affine2_db
        affine1_dx, affine1_dw, affine1_db = affine_relu_backward(affine2_dx, affine_relu_cache)
        grads['W1'] = affine1_dw + self.reg * self.params['W1']
        grads['b1'] = affine1_db
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

        return loss, grads
Please read the code carefully; it is quite simple. We have not yet introduced L2 regularization, so here is a brief introduction. For details, see the linked reference.
Its purpose is to prevent overfitting, so we add the following term to the loss function: $\frac{1}{2}\lambda \sum_w w^2$, summed over all weights.
In the code, λ is the self.reg parameter, so there is this line:
loss += 0.5 * self.reg * (np.sum(self.params['W1'] * self.params['W1']) + np.sum(self.params['W2'] * self.params['W2']))
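The factor of 0.5 is there so that differentiating the regularization term gives exactly reg * W. The sketch below checks this numerically on a small made-up weight matrix; `reg_term` is a hypothetical helper used only here.

```python
import numpy as np

reg = 0.1
W = np.array([[1.0, -2.0], [0.5, 3.0]])

def reg_term(weights):
    """L2 regularization term with the 0.5 factor (hypothetical helper)."""
    return 0.5 * reg * np.sum(weights * weights)

reg_loss = reg_term(W)
reg_grad = reg * W                       # analytic gradient of the term

# Centered-difference check of the gradient
h = 1e-5
numeric = np.zeros_like(W)
for idx in np.ndindex(W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += h
    W_minus[idx] -= h
    numeric[idx] = (reg_term(W_plus) - reg_term(W_minus)) / (2 * h)
```

For this W the sum of squares is 14.25, so the term equals 0.7125, and the numerical gradient matches reg * W.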
Likewise, when computing the gradient of each weight we must also add λw
grads['W2'] = affine2_dw + self.reg * self.params['W2']
class Solver(object):
    """
    A Solver encapsulates all the logic necessary for training classification
    models. The Solver performs stochastic gradient descent using different
    update rules defined in optim.py.

    The solver accepts both training and validation data and labels so it can
    periodically check classification accuracy on both training and validation
    data to watch out for overfitting.

    To train a model, you will first construct a Solver instance, passing the
    model, dataset, and various options (learning rate, batch size, etc) to the
    constructor. You will then call the train() method to run the optimization
    procedure and train the model.

    After the train() method returns, model.params will contain the parameters
    that performed best on the validation set over the course of training.
    In addition, the instance variable solver.loss_history will contain a list
    of all losses encountered during training and the instance variables
    solver.train_acc_history and solver.val_acc_history will be lists containing
    the accuracies of the model on the training and validation set at each epoch.

    Example usage might look something like this:

    data = {
      'X_train': # training data
      'y_train': # training labels
      'X_val': # validation data
      'y_val': # validation labels
    }
    model = MyAwesomeModel(hidden_size=100, reg=10)
    solver = Solver(model, data,
                    update_rule='sgd',
                    optim_config={
                      'learning_rate': 1e-3,
                    },
                    lr_decay=0.95,
                    num_epochs=10, batch_size=100,
                    print_every=100)
    solver.train()

    A Solver works on a model object that must conform to the following API:

    - model.params must be a dictionary mapping string parameter names to numpy
      arrays containing parameter values.

    - model.loss(X, y) must be a function that computes training-time loss and
      gradients [y is not None], and test-time classification scores
      [y is None], with the following inputs and outputs:

      Inputs:
      - X: Array giving a minibatch of input data of shape (N, d_1, ..., d_k)
      - y: Array of labels, of shape (N,) giving labels for X where y[i] is the
        label for X[i].

      Returns:
      If y is None, run a test-time forward pass and return:
      - scores: Array of shape (N, C) giving classification scores for X where
        scores[i, c] gives the score of class c for X[i].

      If y is not None, run a training time forward and backward pass and return
      a tuple of:
      - loss: Scalar giving the loss
      - grads: Dictionary with the same keys as self.params mapping parameter
        names to gradients of the loss with respect to those parameters.
    """

    def __init__(self, model, data, **kwargs):
        """
        Construct a new Solver instance.

        Required arguments:
        - model: A model object conforming to the API described above
        - data: A dictionary of training and validation data with the following:
          'X_train': Array of shape (N_train, d_1, ..., d_k) giving training images
          'X_val': Array of shape (N_val, d_1, ..., d_k) giving validation images
          'y_train': Array of shape (N_train,) giving labels for training images
          'y_val': Array of shape (N_val,) giving labels for validation images

        Optional arguments:
        - update_rule: A string giving the name of an update rule in optim.py.
          Default is 'sgd'.
        - optim_config: A dictionary containing hyperparameters that will be
          passed to the chosen update rule. Each update rule requires different
          hyperparameters (see optim.py) but all update rules require a
          'learning_rate' parameter so that should always be present.
        - lr_decay: A scalar for learning rate decay; after each epoch the
          learning rate is multiplied by this value, so it keeps shrinking.
        - batch_size: Size of minibatches used to compute loss and gradient
          during training.
        - num_epochs: The number of epochs to run for during training.
        - print_every: Integer; training losses will be printed every
          print_every iterations.
        - verbose: Boolean; if set to false then no output will be printed
          during training.
        """
        self.model = model
        self.X_train = data['X_train']
        self.y_train = data['y_train']
        self.X_val = data['X_val']
        self.y_val = data['y_val']

        # Unpack keyword arguments
        self.update_rule = kwargs.pop('update_rule', 'sgd')
        self.optim_config = kwargs.pop('optim_config', {})
        self.lr_decay = kwargs.pop('lr_decay', 1.0)
        self.batch_size = kwargs.pop('batch_size', 100)
        self.num_epochs = kwargs.pop('num_epochs', 10)

        self.print_every = kwargs.pop('print_every', 10)
        self.verbose = kwargs.pop('verbose', True)

        # Throw an error if there are extra keyword arguments
        if len(kwargs) > 0:
            extra = ', '.join('"%s"' % k for k in kwargs.keys())
            raise ValueError('Unrecognized arguments %s' % extra)

        # Make sure the update rule exists, then replace the string
        # name with the actual function
        if not hasattr(optim, self.update_rule):
            raise ValueError('Invalid update_rule "%s"' % self.update_rule)
        self.update_rule = getattr(optim, self.update_rule)

        self._reset()

    def _reset(self):
        """
        Set up some book-keeping variables for optimization. Don't call this
        manually.
        """
        # Set up some variables for book-keeping
        self.epoch = 0
        self.best_val_acc = 0
        self.best_params = {}
        self.loss_history = []
        self.train_acc_history = []
        self.val_acc_history = []

        # Make a deep copy of the optim_config for each parameter
        self.optim_configs = {}
        for p in self.model.params:
            d = {k: v for k, v in self.optim_config.iteritems()}
            self.optim_configs[p] = d

    def _step(self):
        """
        Make a single gradient update. This is called by train() and should not
        be called manually.
        """
        # Make a minibatch of training data
        num_train = self.X_train.shape[0]
        batch_mask = np.random.choice(num_train, self.batch_size)
        X_batch = self.X_train[batch_mask]
        y_batch = self.y_train[batch_mask]

        # Compute loss and gradient
        loss, grads = self.model.loss(X_batch, y_batch)
        self.loss_history.append(loss)

        # Perform a parameter update
        for p, w in self.model.params.iteritems():
            dw = grads[p]
            config = self.optim_configs[p]
            next_w, next_config = self.update_rule(w, dw, config)
            self.model.params[p] = next_w
            self.optim_configs[p] = next_config

    def check_accuracy(self, X, y, num_samples=None, batch_size=100):
        """
        Check accuracy of the model on the provided data.

        Inputs:
        - X: Array of data, of shape (N, d_1, ..., d_k)
        - y: Array of labels, of shape (N,)
        - num_samples: If not None, subsample the data and only test the model
          on num_samples datapoints.
        - batch_size: Split X and y into batches of this size to avoid using too
          much memory.

        Returns:
        - acc: Scalar giving the fraction of instances that were correctly
          classified by the model.
        """
        # Maybe subsample the data
        N = X.shape[0]
        if num_samples is not None and N > num_samples:
            mask = np.random.choice(N, num_samples)
            N = num_samples
            X = X[mask]
            y = y[mask]

        # Compute predictions in batches
        num_batches = N / batch_size
        if N % batch_size != 0:
            num_batches += 1
        y_pred = []
        for i in xrange(num_batches):
            start = i * batch_size
            end = (i + 1) * batch_size
            scores = self.model.loss(X[start:end])
            y_pred.append(np.argmax(scores, axis=1))
        y_pred = np.hstack(y_pred)
        acc = np.mean(y_pred == y)

        return acc

    def train(self):
        """
        Run optimization to train the model.
        """
        num_train = self.X_train.shape[0]
        iterations_per_epoch = max(num_train / self.batch_size, 1)
        num_iterations = self.num_epochs * iterations_per_epoch

        for t in xrange(num_iterations):
            self._step()

            # Maybe print training loss
            if self.verbose and t % self.print_every == 0:
                print '(Iteration %d / %d) loss: %f' % (
                    t + 1, num_iterations, self.loss_history[-1])

            # At the end of every epoch, increment the epoch counter and decay
            # the learning rate.
            epoch_end = (t + 1) % iterations_per_epoch == 0
            if epoch_end:
                self.epoch += 1
                for k in self.optim_configs:
                    self.optim_configs[k]['learning_rate'] *= self.lr_decay

            # Check train and val accuracy on the first iteration, the last
            # iteration, and at the end of each epoch.
            first_it = (t == 0)
            last_it = (t == num_iterations - 1)  # t runs from 0 to num_iterations - 1
            if first_it or last_it or epoch_end:
                train_acc = self.check_accuracy(self.X_train, self.y_train,
                                                num_samples=1000)
                val_acc = self.check_accuracy(self.X_val, self.y_val)
                self.train_acc_history.append(train_acc)
                self.val_acc_history.append(val_acc)

                if self.verbose:
                    print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                        self.epoch, self.num_epochs, train_acc, val_acc)

                # Keep track of the best model
                if val_acc > self.best_val_acc:
                    self.best_val_acc = val_acc
                    self.best_params = {}
                    for k, v in self.model.params.iteritems():
                        self.best_params[k] = v.copy()

        # At the end of training swap the best params into the model
        self.model.params = self.best_params
In addition, parameter updates are encapsulated in optim.py. The protocol between Solver and optim.py is: next_w, next_config = self.update_rule(w, dw, config). Both points can be seen in use in the _step function:
def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
        dw = grads[p]
        config = self.optim_configs[p]
        next_w, next_config = self.update_rule(w, dw, config)
        self.model.params[p] = next_w
        self.optim_configs[p] = next_config
def sgd(w, dw, config=None):
    """
    Performs vanilla stochastic gradient descent.

    config format:
    - learning_rate: Scalar learning rate.
    """
    if config is None: config = {}
    config.setdefault('learning_rate', 1e-2)

    w -= config['learning_rate'] * dw
    return w, config
The core of the code is a single line: w -= config['learning_rate'] * dw
model = TwoLayerNet()
solver = None

##############################################################################
# TODO: Use a Solver instance to train a TwoLayerNet that achieves at least  #
# 50% accuracy on the validation set.                                        #
##############################################################################
for k, v in data.iteritems():
    print '%s: ' % k, v.shape

model = TwoLayerNet(hidden_dim=100, reg=1e-02)
solver = Solver(model, data,
                update_rule='sgd',
                optim_config={
                  'learning_rate': 1e-03,
                },
                lr_decay=0.95,
                num_epochs=10, batch_size=100,
                print_every=49000)
solver.train()
##############################################################################
#                             END OF YOUR CODE                               #
##############################################################################
Readers may ask why we use learning_rate=1e-3 and reg=1e-2, since other values do not seem to reach 50% validation accuracy. This is one of the tricks of training neural networks. Nobody can know good values in advance; we can only find reasonable hyperparameters through repeated experiments. Interested readers can follow the two linked references. There are also many tricks for training neural networks available online, which readers can search for and study. Later parts of the assignment also cover optimization algorithms that converge faster than sgd, such as RMSProp and Adam; interested readers can consult the linked reference. From here on I will only give the code and leave the details for readers to work through.
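To illustrate the update-rule protocol, the sketch below drives the same sgd function by hand on a toy problem. The quadratic loss, the starting point, and the learning rate are assumptions chosen for illustration; in the assignment the Solver makes these calls for each parameter.

```python
import numpy as np

def sgd(w, dw, config=None):
    """Vanilla SGD, following the (w, dw, config) -> (next_w, config) protocol."""
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-2)
    w -= config['learning_rate'] * dw
    return w, config

# Toy problem: minimize f(w) = sum(w ** 2), whose gradient is 2 * w
w = np.array([4.0, -2.0])
config = {'learning_rate': 0.1}

for _ in range(100):
    dw = 2 * w
    w, config = sgd(w, dw, config)
```

Each step multiplies w by 0.8, so after 100 steps w is very close to the minimum at zero. Stateful rules such as momentum or Adam use the returned config to carry their running averages from one call to the next.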
class FullyConnectedNet(object):
    """
    A fully-connected neural network with an arbitrary number of hidden layers,
    ReLU nonlinearities, and a softmax loss function. This will also implement
    dropout and batch normalization as options. For a network with L layers,
    the architecture will be

    {affine - [batch norm] - relu - [dropout]} x (L - 1) - affine - softmax

    where batch normalization and dropout are optional, and the {...} block is
    repeated L - 1 times.

    Similar to the TwoLayerNet above, learnable parameters are stored in the
    self.params dictionary and will be learned using the Solver class.
    """

    def __init__(self, hidden_dims, input_dim=3*32*32, num_classes=10,
                 dropout=0, use_batchnorm=False, reg=0.0,
                 weight_scale=1e-2, dtype=np.float32, seed=None):
        """
        Initialize a new FullyConnectedNet.

        Inputs:
        - hidden_dims: A list of integers giving the size of each hidden layer.
        - input_dim: An integer giving the size of the input.
        - num_classes: An integer giving the number of classes to classify.
        - dropout: Scalar between 0 and 1 giving dropout strength. If dropout=0
          then the network should not use dropout at all.
        - use_batchnorm: Whether or not the network should use batch
          normalization.
        - reg: Scalar giving L2 regularization strength.
        - weight_scale: Scalar giving the standard deviation for random
          initialization of the weights.
        - dtype: A numpy datatype object; all computations will be performed
          using this datatype. float32 is faster but less accurate, so you
          should use float64 for numeric gradient checking.
        - seed: If not None, then pass this random seed to the dropout layers.
          This will make the dropout layers deterministic so we can gradient
          check the model.
        """
        self.use_batchnorm = use_batchnorm
        self.use_dropout = dropout > 0
        self.reg = reg
        self.num_layers = 1 + len(hidden_dims)
        self.dtype = dtype
        self.params = {}

        ########################################################################
        # TODO: Initialize the parameters of the network, storing all values   #
        # in the self.params dictionary. Store weights and biases for the      #
        # first layer in W1 and b1; for the second layer use W2 and b2, etc.   #
        # Weights should be initialized from a normal distribution with        #
        # standard deviation equal to weight_scale and biases should be        #
        # initialized to zero.                                                 #
        #                                                                      #
        # When using batch normalization, store scale and shift parameters     #
        # for the first layer in gamma1 and beta1; for the second layer use    #
        # gamma2 and beta2, etc. Scale parameters should be initialized to one #
        # and shift parameters should be initialized to zero.                  #
        ########################################################################
        pass
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

        # When using dropout we need to pass a dropout_param dictionary to each
        # dropout layer so that the layer knows the dropout probability and the
        # mode (train / test). You can pass the same dropout_param to each
        # dropout layer.
        self.dropout_param = {}
        if self.use_dropout:
            self.dropout_param = {'mode': 'train', 'p': dropout}
            if seed is not None:
                self.dropout_param['seed'] = seed

        # With batch normalization we need to keep track of running means and
        # variances, so we need to pass a special bn_param object to each batch
        # normalization layer. You should pass self.bn_params[0] to the forward
        # pass of the first batch normalization layer, self.bn_params[1] to the
        # forward pass of the second batch normalization layer, etc.
        self.bn_params = []
        if self.use_batchnorm:
            self.bn_params = [{'mode': 'train'} for i in xrange(self.num_layers - 1)]

        # Cast all parameters to the correct datatype
        for k, v in self.params.iteritems():
            self.params[k] = v.astype(dtype)

    def loss(self, X, y=None):
        """
        Compute loss and gradient for the fully-connected net.

        Input / output: Same as TwoLayerNet above.
        """
        X = X.astype(self.dtype)
        mode = 'test' if y is None else 'train'

        # Set train/test mode for batchnorm params and dropout param since they
        # behave differently during training and testing.
        if self.dropout_param is not None:
            self.dropout_param['mode'] = mode
        if self.use_batchnorm:
            for bn_param in self.bn_params:
                bn_param['mode'] = mode  # key is the string 'mode', not the variable

        scores = None
        ########################################################################
        # TODO: Implement the forward pass for the fully-connected net,        #
        # computing the class scores for X and storing them in the scores      #
        # variable.                                                            #
        #                                                                      #
        # When using dropout, you'll need to pass self.dropout_param to each   #
        # dropout forward pass.                                                #
        #                                                                      #
        # When using batch normalization, you'll need to pass                  #
        # self.bn_params[0] to the forward pass for the first batch            #
        # normalization layer, pass self.bn_params[1] to the forward pass for  #
        # the second batch normalization layer, etc.                           #
        ########################################################################
        pass
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

        # If test mode return early
        if mode == 'test':
            return scores

        loss, grads = 0.0, {}
        ########################################################################
        # TODO: Implement the backward pass for the fully-connected net. Store #
        # the loss in the loss variable and gradients in the grads dictionary. #
        # Compute data loss using softmax, and make sure that grads[k] holds   #
        # the gradients for self.params[k]. Don't forget to add L2             #
        # regularization!                                                      #
        #                                                                      #
        # When using batch normalization, you don't need to regularize the     #
        # scale and shift parameters.                                          #
        #                                                                      #
        # NOTE: To ensure that your implementation matches ours and you pass   #
        # the automated tests, make sure that your L2 regularization includes  #
        # a factor of 0.5 to simplify the expression for the gradient.         #
        ########################################################################
        pass
        ########################################################################
        #                           END OF YOUR CODE                           #
        ########################################################################

        return loss, grads
weight_scale = 5e-2
learning_rate = 1e-3
weight_scale = 5e-2
learning_rate = 5e-3
next_w = w
v = config['momentum'] * v - config['learning_rate'] * dw
next_w += v
#rmsprop
next_x = x
config['cache'] = config['decay_rate'] * config['cache'] + (1 - config['decay_rate']) * (dx * dx)
x += -config['learning_rate'] * dx / (np.sqrt(config['cache']) + config['epsilon'])
#adam
config['t'] += 1
config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)
mb = config['m'] / (1 - config['beta1'] ** config['t'])
vb = config['v'] / (1 - config['beta2'] ** config['t'])
next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
X_val = data['X_val']
y_val = data['y_val']
X_test = data['X_test']
y_test = data['y_test']

lr = 1e-03  # hyperparameter to tune
ws = 1e-02  # hyperparameter to tune

model = FullyConnectedNet([100, 100, 100, 100],
                          weight_scale=ws, dtype=np.float64,
                          use_batchnorm=False, reg=1e-2)
solver = Solver(model, data,
                print_every=100, num_epochs=10, batch_size=25,
                update_rule='adam',
                optim_config={
                  'learning_rate': lr,
                },
                lr_decay=0.95,  # hyperparameter to tune
                verbose=True)
solver.train()

plt.subplot(2, 1, 1)
plt.plot(solver.loss_history)
plt.title('Loss history')
plt.xlabel('Iteration')
plt.ylabel('Loss')

plt.subplot(2, 1, 2)
plt.plot(solver.train_acc_history, label='train')
plt.plot(solver.val_acc_history, label='val')
plt.title('Classification accuracy history')
plt.xlabel('Epoch')
plt.ylabel('Classification accuracy')
plt.show()

best_model = model
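The adam fragment above only shows the update itself. Below is a self-contained sketch that wraps it in the same (x, dx, config) protocol used by sgd. The default values (learning_rate 1e-3, beta1 0.9, beta2 0.999, epsilon 1e-8) and the toy quadratic problem are assumptions made for this illustration, not values confirmed by the text.

```python
import numpy as np

def adam(x, dx, config=None):
    """Adam update; the defaults below are assumed for illustration."""
    if config is None:
        config = {}
    config.setdefault('learning_rate', 1e-3)
    config.setdefault('beta1', 0.9)
    config.setdefault('beta2', 0.999)
    config.setdefault('epsilon', 1e-8)
    config.setdefault('m', np.zeros_like(x))   # running mean of gradients
    config.setdefault('v', np.zeros_like(x))   # running mean of squared gradients
    config.setdefault('t', 0)                  # step counter for bias correction

    config['t'] += 1
    config['m'] = config['beta1'] * config['m'] + (1 - config['beta1']) * dx
    config['v'] = config['beta2'] * config['v'] + (1 - config['beta2']) * (dx ** 2)
    mb = config['m'] / (1 - config['beta1'] ** config['t'])
    vb = config['v'] / (1 - config['beta2'] ** config['t'])
    next_x = x - config['learning_rate'] * mb / (np.sqrt(vb) + config['epsilon'])
    return next_x, config

# Toy problem: minimize f(x) = sum(x ** 2), whose gradient is 2 * x
x = np.array([1.0, -1.0])
config = {'learning_rate': 0.1}
f_initial = np.sum(x ** 2)

x, config = adam(x, 2 * x, config)
x_after_one_step = x.copy()
for _ in range(199):
    x, config = adam(x, 2 * x, config)
f_final = np.sum(x ** 2)
```

Because of the bias correction, the first step has a size of almost exactly learning_rate in each coordinate regardless of the gradient's scale, which is one reason Adam is less sensitive to the raw magnitude of the gradients.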