Hello, this is my webpage

Deep learning is typically trained with stochastic gradient descent (SGD). The mini-batch gradient can be modeled as the deterministic (full-batch) gradient with Gaussian noise added, and the variance of that noise decreases as the batch size increases.
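As a quick sanity check of the variance claim, here is a minimal numpy sketch. It assumes a toy setup where each per-example gradient is just the true gradient plus unit-variance Gaussian noise (all numbers here are made up for illustration), and it shows the mini-batch gradient variance shrinking roughly like 1/B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic per-example gradients: true (full-batch) gradient plus unit-variance noise.
true_grad = 2.0
per_example_grads = true_grad + rng.normal(0.0, 1.0, size=100_000)

for batch_size in [1, 10, 100, 1000]:
    # Mini-batch gradient = mean of B randomly sampled per-example gradients.
    batches = rng.choice(per_example_grads, size=(5000, batch_size))
    minibatch_grads = batches.mean(axis=1)
    # The empirical variance should shrink roughly like 1/B.
    print(f"B={batch_size:5d}  var(minibatch grad) = {minibatch_grads.var():.5f}"
          f"  (1/B = {1.0/batch_size:.5f})")
```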
This formula computes the generalization gap of stochastic gradient descent. Here η (eta) is the learning rate, t is the number of steps, n is the dataset size, and B is the batch size. Increasing the batch size increases learning accuracy, assuming there is no overfitting.
For deterministic (full-batch) gradient descent the convergence rate is O(1/t), where t is the number of gradient descent steps. Basically it means the suboptimality of the loss decreases at rate 1/t.
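For reference, the textbook form of this bound for an L-smooth convex loss with step size 1/L (stated here as a standard result, not something derived on this page) is:

```latex
% Gradient descent on an L-smooth convex f with step size 1/L:
% the suboptimality after t steps decays at rate O(1/t).
f(x_t) - f(x^\ast) \;\le\; \frac{L \,\lVert x_0 - x^\ast \rVert^2}{2t}
```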
If the loss surface is not skewed, e.g. a quadratic loss whose condition number (largest eigenvalue divided by smallest eigenvalue) is small, the loss looks like a nearly circular bowl, gradient descent does not zig-zag, and it converges at an exponential (linear) rate.
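Here is a minimal numpy sketch of that contrast: gradient descent on two toy quadratics, one nearly circular and one elongated (the eigenvalues, step size and iteration counts are made up for illustration).

```python
import numpy as np

def gd_on_quadratic(eigenvalues, steps=200):
    """Gradient descent on f(x) = 0.5 * x^T diag(eigenvalues) x, starting from all-ones."""
    h = np.array(eigenvalues, dtype=float)
    x = np.ones_like(h)
    lr = 1.0 / h.max()           # classic 1/L step size
    losses = []
    for _ in range(steps):
        x -= lr * h * x          # gradient of f is diag(h) @ x
        losses.append(0.5 * np.sum(h * x**2))
    return losses

well = gd_on_quadratic([1.0, 2.0])      # condition number 2: nearly circular bowl
ill  = gd_on_quadratic([1.0, 100.0])    # condition number 100: elongated valley

for t in [10, 50, 200]:
    print(f"t={t:3d}  well-conditioned loss={well[t-1]:.2e}  ill-conditioned loss={ill[t-1]:.2e}")
```

The well-conditioned bowl collapses geometrically in a handful of steps, while the elongated valley is still slowly creeping along its flat direction after hundreds of steps.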
For stochastic gradient descent, the loss decreases at rate 1/sqrt(t), which is much slower than ordinary gradient descent.
The error bound of stochastic gradient descent can be summarized by two terms: a deterministic term decaying at rate 1/t and a stochastic term whose size is determined by the batch size B.
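Schematically (for a smooth convex loss, batch size B, per-sample gradient noise level σ, initial distance D to the optimum, and averaged iterates; the exact constants and conditions vary from paper to paper, so treat this as a sketch of the shape of the bound rather than a precise theorem):

```latex
% Schematic mini-batch SGD bound: a deterministic O(1/T) term
% plus a noise term that shrinks as the batch size B grows.
\mathbb{E}\big[f(\bar{x}_T)\big] - f(x^\ast)
  \;\lesssim\; \frac{L\,D^2}{T} \;+\; \frac{\sigma\,D}{\sqrt{B\,T}},
\qquad D = \lVert x_0 - x^\ast \rVert .
```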
The VC dimension of a linear SVM in d dimensions is d + 1, because a hyperplane with a bias term can shatter d + 1 points in general position, but not d + 2. The same holds for a single-layer perceptron in deep learning.
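To make the shattering claim concrete, here is a small sketch (the point sets and the LP-feasibility check are my own illustration): it asks, for every labeling of a point set, whether some hyperplane w·x + b can realize it. Three points in general position in the plane (d = 2, so d + 1 = 3) can be shattered, while four points in an XOR layout cannot.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    """Check whether a linear classifier w.x + b (labels in {-1,+1}) can realize `labels`.
    Feasibility LP: find (w, b) with y_i * (w . x_i + b) >= 1 for all i."""
    X = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    n, d = X.shape
    # Variables are (w_1..w_d, b); constraint rows are -y_i*[x_i, 1] <= -1.
    A_ub = -y[:, None] * np.hstack([X, np.ones((n, 1))])
    b_ub = -np.ones(n)
    res = linprog(c=np.zeros(d + 1), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1))
    return res.success

def shattered(points):
    n = len(points)
    return all(separable(points, labels)
               for labels in itertools.product([-1, 1], repeat=n))

# d = 2: any 3 points in general position are shattered (VC dim >= 3 = d + 1) ...
print(shattered([(0, 0), (1, 0), (0, 1)]))          # True
# ... but 4 points in an XOR layout cannot be shattered (VC dim < 4).
print(shattered([(0, 0), (1, 1), (1, 0), (0, 1)]))  # False
```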
For a soft-margin SVM, the effective VC dimension increases with the slack penalty. The slack penalty controls the regularization of learning: the larger the penalty on slack (i.e. the less slack tolerated), the weaker the regularization. With high-dimensional inputs we usually add regularization to avoid overfitting. In deep learning, by contrast, we usually build a network large enough to overfit the data and then add regularization to improve accuracy on the validation/test dataset. The variable inside the f(·) function is the slack variable.
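A quick scikit-learn sketch of the slack-penalty effect (the synthetic dataset and the particular C values are arbitrary, chosen only for illustration): in scikit-learn's SVC, C is the penalty on slack, so small C means strong regularization and large C means the model is pushed to fit the training points.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Noisy, slightly overlapping 2-class data so that slack variables matter.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, flip_y=0.1, class_sep=1.0,
                           random_state=0)

# Small C = lots of slack allowed (strong regularization),
# large C = slack heavily penalized (weak regularization, more complex fit).
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C={C:7.2f}  train accuracy={clf.score(X, y):.3f}  "
          f"support vectors={clf.n_support_.sum()}")
```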
This is the VC dimension of a decision tree. Basically it means a decision tree of depth D can shatter the input space into up to 2^D subregions, represented by its leaves, with each leaf assigned its own class.
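A tiny scikit-learn sketch of the 2^D leaf count (the dataset is synthetic and the depths are arbitrary): the number of leaves a depth-limited tree actually grows is always at most 2^D.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

# A binary tree of depth D has at most 2**D leaves, i.e. at most 2**D regions of input space.
for depth in [1, 2, 4, 6]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    print(f"depth={depth}  leaves={tree.get_n_leaves():4d}  bound 2^D={2**depth}")
```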
Note that bagging and boosting aggregate their base learners by (weighted) majority vote.
There are similarities between the VC dimension of a multilayer perceptron and the VC dimension of ensembles. Is there a relationship between them? Can we convert an MLP into ensemble learning?
The VC dimension of a Transformer is of the same kind as that of an MLP and is related to the size of the attention matrix, which is n^2 for sequence length n. Transformers have a better scaling law than convolutional networks, therefore I think Transformers are the future.
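A tiny numpy sketch of scaled dot-product attention (toy sizes and random weights, just to show where the n x n matrix comes from):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Plain numpy attention; the score matrix is n x n for sequence length n."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # shape (n, n): quadratic in sequence length
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V, weights

n, d_model = 8, 16                                    # toy sequence length and embedding size
rng = np.random.default_rng(0)
x = rng.normal(size=(n, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.shape)   # (8, 8): the n^2 attention matrix mentioned above
```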
The restricted isometry property (RIP) is when the matrix A behaves approximately like an orthonormal matrix on sparse vectors, which is the setting of machine learning with sparsity constraints. Sparsity greatly reduces the number of training samples needed to learn the weights.
Sparsity improves the accuracy bound by a factor of the square root of s, where s is the sparsity level.
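Written out (schematically; the exact constants depend on which recovery theorem is used), RIP of order s says A preserves the norm of every s-sparse vector up to a constant δ_s, and the standard recovery guarantees under RIP carry a 1/sqrt(s) factor on the error term:

```latex
% Restricted isometry property of order s (A acts like a near-isometry on s-sparse x):
(1 - \delta_s)\,\lVert x \rVert_2^2 \;\le\; \lVert A x \rVert_2^2 \;\le\; (1 + \delta_s)\,\lVert x \rVert_2^2
\quad \text{for all } s\text{-sparse } x .

% Schematic recovery guarantee (e.g. for l1 minimization with noise level epsilon):
\lVert \hat{x} - x \rVert_2 \;\lesssim\; \frac{\sigma_s(x)_1}{\sqrt{s}} \;+\; \varepsilon ,
\qquad \sigma_s(x)_1 = \min_{\lVert z \rVert_0 \le s} \lVert x - z \rVert_1 .
```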
Sparsity in the network reduces the per-step cost of stochastic gradient descent training from W (the total number of weights) to s (the number of non-zero weights).
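A small scipy sketch of that cost difference (the sizes and density are made up): a dense matrix-vector product touches every weight, a sparse one touches only the non-zeros.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n = 2000
density = 0.01                                   # ~1% of the weights are non-zero

W_dense = rng.normal(size=(n, n))
W_sparse = sparse.random(n, n, density=density, format="csr", random_state=0)
x = rng.normal(size=n)

# Per-step cost: all W weights for the dense layer vs only the s non-zeros for the sparse one.
print("dense multiply-adds :", W_dense.size)     # n*n = 4,000,000
print("sparse multiply-adds:", W_sparse.nnz)     # ~ density * n*n = 40,000
_ = W_dense @ x
_ = W_sparse @ x
```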
These metaheuristic algorithms have poorer accuracy and convergence-rate bounds than deep learning algorithms. The noise added in diffusion-based image generation networks is analogous to the simulated annealing algorithm. Genetic algorithms are used in deep learning architecture search. Reinforcement learning is generally better than genetic algorithms in neural architecture search.
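For readers unfamiliar with simulated annealing, here is a minimal sketch of the idea behind the analogy (the objective, cooling schedule and step size are all made up for illustration): random perturbations are accepted even when they make things worse, with a probability that shrinks as the "temperature" is lowered.

```python
import math
import random

def simulated_annealing(f, x0, steps=5000, step_size=0.5, t0=1.0):
    """Minimal simulated annealing on a 1-D objective f."""
    random.seed(0)
    x, fx = x0, f(x0)
    for k in range(1, steps + 1):
        temperature = t0 / k                       # cooling schedule: the noise shrinks over time
        candidate = x + random.gauss(0.0, step_size)
        fc = f(candidate)
        # Always accept improvements; accept worse moves with a temperature-dependent probability.
        if fc < fx or random.random() < math.exp(-(fc - fx) / temperature):
            x, fx = candidate, fc
    return x, fx

# Toy multimodal objective with several local minima.
f = lambda x: 0.1 * x**2 + math.sin(3 * x)
print(simulated_annealing(f, x0=3.0))
```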
Generally, a cycle-free graph gives more stable weight updates.
Asynchronous weight updating is generally more stable than synchronous weight updating. That is also why stochastic gradient descent is better than conventional gradient descent.
A tree-like graph approximation of a multi-branch deep learning network can be used to estimate certain accuracy bounds of the network.
Similar bias-variance bounds are used in Neural Tangent Kernel (NTK) theory to predict the performance and learning behavior of deep learning networks.
Often, to derive a bound for a problem, the original algebraic formulation is too rigid to be transformed into aggregate properties of the problem's loss function.
So the problem is relaxed: it is approximated by another set of algebraic equations, and those equations are then transformed to produce bounds on the aggregate properties.
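A toy example of this relaxation idea (a tiny 0/1 knapsack whose values, weights and budget are made up): relaxing the rigid constraint x in {0, 1} to the looser 0 <= x <= 1 gives a linear program whose optimum is an upper bound on the true integer optimum.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

# Tiny 0/1 knapsack: maximize value subject to a weight budget.
values  = np.array([10.0, 13.0, 7.0, 8.0])
weights = np.array([ 4.0,  6.0, 3.0, 5.0])
budget  = 9.0

# Exact integer optimum by brute force (only 2^4 candidate subsets).
best_int = max(values @ np.array(x)
               for x in itertools.product([0, 1], repeat=4)
               if weights @ np.array(x) <= budget)

# LP relaxation: replace x in {0,1} by 0 <= x <= 1.  linprog minimizes, so negate the values.
res = linprog(c=-values, A_ub=weights[None, :], b_ub=[budget], bounds=[(0, 1)] * 4)
lp_bound = -res.fun

print(f"integer optimum = {best_int:.1f}")     # the true answer
print(f"LP relaxation   = {lp_bound:.1f}")     # an upper bound on the integer optimum
```

The relaxed problem is easy to solve exactly, and its value bounds the hard problem from above; the same pattern shows up in bounds for the travelling salesman problem mentioned below.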
Later, I will talk about travelling salesman bounds.
Certain families of orthogonal polynomials, such as the Legendre polynomials, can be used to design activation functions, which is similar in spirit to the Kolmogorov-Arnold Network (KAN). Padé polynomial approximation, similar to a Taylor series, can be used to approximate ReLU and other activation functions, and such approximations can be used to study the mathematical properties of deep learning networks.
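A small sketch of both ideas (the Legendre coefficients below are hand-picked rather than learned, and softplus stands in for ReLU because ReLU itself is not smooth at 0 and so has no Taylor series there):

```python
import numpy as np
from numpy.polynomial import legendre
from scipy.interpolate import pade

# 1) A KAN-style 1-D activation built from a Legendre basis.
#    In a KAN these coefficients would be learned; here they are arbitrary.
def legendre_activation(x, coeffs):
    z = np.tanh(x)                     # squash into [-1, 1], the natural Legendre domain
    return legendre.legval(z, coeffs)

x = np.linspace(-2.0, 2.0, 5)
print(legendre_activation(x, [0.0, 0.8, 0.0, 0.3]))   # mix of P1 and P3

# 2) A Pade (rational) approximation of softplus(x) = log(1 + e^x), a smooth relative of ReLU.
#    Taylor series around 0: log 2 + x/2 + x^2/8 - x^4/192 + x^6/2880 + ...
taylor = [np.log(2.0), 1/2, 1/8, 0.0, -1/192, 0.0, 1/2880]
p, q = pade(taylor, 3)                 # [3/3] Pade approximant
print(p(x) / q(x))                     # rational approximation
print(np.log1p(np.exp(x)))             # reference softplus values
```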