Abstract:
In practical applications of neural networks, the number of parameters typically far exceeds the number of training samples, yet such networks still generalize well. It was traditionally believed that these over-parameterized, non-convex models easily fall into poor local minima while searching for the optimal solution and therefore generalize poorly, but in practice this is not the case. Although the generalization error can be controlled effectively under certain regularization conditions, explaining the generalization behavior of large networks remains difficult. In this work, we distinguish the overfitting phase from the feature-learning phase by quantifying how a gradient-descent update on a single sample affects the rest of the training process, revealing that during the overfitting phase an update on one sample generally has little impact on the others. In addition, we use the Fisher information matrix to mask the gradients produced by backpropagation, thereby slowing the network's overfitting behavior and improving its generalization performance.
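The Fisher-masking idea mentioned above can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the paper's actual algorithm: it estimates a diagonal Fisher information from per-sample gradients (mean squared gradient per parameter), builds a binary mask from it, and applies the mask to the averaged gradient before the update. The names `fisher_masked_step` and `keep_ratio`, and the choice to keep high-Fisher coordinates, are assumptions for illustration; the paper's exact masking rule may differ.

```python
import numpy as np

def fisher_masked_step(params, per_sample_grads, lr=0.1, keep_ratio=0.5):
    """One gradient step with a Fisher-information mask (illustrative sketch).

    per_sample_grads: array of shape (n_samples, n_params), the gradient of
    each sample's loss w.r.t. the parameters.
    """
    # Diagonal empirical Fisher estimate: mean squared per-sample gradient.
    fisher = np.mean(per_sample_grads ** 2, axis=0)
    # Keep only the top-k parameters by Fisher information (assumed rule).
    k = max(1, int(keep_ratio * fisher.size))
    threshold = np.sort(fisher)[-k]
    mask = (fisher >= threshold).astype(params.dtype)
    # Masked full-batch gradient update.
    grad = per_sample_grads.mean(axis=0)
    return params - lr * mask * grad

# Toy usage: 4 samples, 3 parameters.
rng = np.random.default_rng(0)
params = np.zeros(3)
grads = rng.normal(size=(4, 3))
new_params = fisher_masked_step(params, grads, lr=0.1, keep_ratio=0.5)
```

In this sketch the mask simply zeroes out updates to low-Fisher parameters; a soft (continuous) weighting of the gradient by Fisher information would be an equally plausible variant.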