Tan Sing Kuang (The Bolshevik)

Hello, this is my webpage


Machine Learning Bounds

I gathered all this information from ChatGPT. It is a fast way to find relevant material, since it is difficult to find a specific webpage on each topic. You should try to derive the bounds yourself. Even if you cannot derive new bounds, deriving the old ones helps you understand the techniques and algorithms better.

Batch Size

Deep learning uses stochastic gradient descent, so the gradient is modeled as a deterministic gradient with Gaussian noise added. The variance of the gradient noise decreases as the batch size increases. The generalization gap bound for stochastic gradient descent is expressed in terms of η (the learning rate), t (the number of steps), n (the dataset size), and B (the batch size). Increasing the batch size improves learning accuracy, assuming there is no overfitting.
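
To make the noise-variance claim concrete, here is a minimal sketch (my own toy example, not the formula from the figure): it estimates the per-coordinate variance of a mini-batch gradient on a simple least-squares loss for several batch sizes B. The variance should shrink roughly like 1/B.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset and a linear least-squares loss: L(w) = mean_i 0.5 * (x_i @ w - y_i)^2
n, d = 10_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + rng.normal(scale=0.5, size=n)
w = np.zeros(d)  # evaluate gradients at a fixed point

def minibatch_grad(batch_size):
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

for B in [1, 8, 64, 512]:
    grads = np.stack([minibatch_grad(B) for _ in range(2000)])
    print(f"B={B:4d}  mean gradient-noise variance per coordinate: "
          f"{grads.var(axis=0).mean():.4f}")
```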

Gradient Descent

The convergence rate of deterministic gradient descent is O(1/t), where t is the number of gradient descent steps; in other words, the loss decreases at a rate of about 1/t. If the loss function is not badly skewed, for example a quadratic loss whose condition number (largest eigenvalue divided by smallest eigenvalue) is small, the loss surface looks like a circular bowl, gradient descent does not zig-zag, and it converges at an exponential rate. For stochastic gradient descent, the loss decreases at a rate of 1/sqrt(t), which is much slower than ordinary gradient descent. The error bound of stochastic gradient descent can be summarized by two terms: a deterministic term that decays like 1/t and a stochastic term determined by the batch size B.
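
A minimal sketch of the condition-number point (my own example, not the bound from the figures): run plain gradient descent on two quadratic losses, one well-conditioned and one ill-conditioned, and see how much faster the well-conditioned one converges.

```python
import numpy as np

def gd_on_quadratic(eigs, steps=200):
    """Gradient descent on f(x) = 0.5 * x^T diag(eigs) x with the standard 1/L step size."""
    eigs = np.asarray(eigs, dtype=float)
    lr = 1.0 / eigs.max()            # 1/L step size, L = largest eigenvalue
    x = np.ones_like(eigs)
    for _ in range(steps):
        x = x - lr * eigs * x        # gradient of the quadratic is diag(eigs) @ x
    return 0.5 * np.sum(eigs * x**2) # final loss value

print("condition number   1:", gd_on_quadratic([1.0, 1.0]))
print("condition number 100:", gd_on_quadratic([1.0, 100.0]))
```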

Stochastic Gradient Descent, Newton Method, Fixed and Decreasing Learning Rate, Conjugate Accelerated Nesterov Gradient Descent, Stability, Sensitivity, Differential Equations Bounds

VC Dimension

The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity of a machine learning model or algorithm, i.e. how much it can learn. A higher VC dimension is usually better, but the model may overfit and learn the noise in the data. We can mitigate this with regularization, which I will talk about more in a later part. Some models, such as the k-nearest-neighbors classifier, have infinite VC dimension: the VC dimension keeps increasing as the number of data points increases. Non-parametric probability distribution estimation is another example.
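
As a concrete illustration of the infinite VC dimension of nearest-neighbor classifiers (my own sketch, not from the original figures): a 1-nearest-neighbor classifier fits every possible labeling of any set of distinct points perfectly, so no finite point set limits its capacity.

```python
import numpy as np
from itertools import product
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # 6 distinct points in the plane

# Try every non-trivial labeling of the 6 points: 1-NN memorizes all of them.
shattered = True
for labels in product([0, 1], repeat=len(X)):
    y = np.array(labels)
    if y.min() == y.max():
        continue  # sklearn needs two classes; single-class labelings are trivially fit
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    if clf.score(X, y) < 1.0:
        shattered = False
print("1-NN shatters all 6 points:", shattered)  # expected: True
```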

Decision Tree, Support Vector Machine, Quadratic Kernels

The VC dimension of a linear SVM in d dimensions is d + 1, because a hyperplane can shatter at most d + 1 points in general position. The same holds for a single-layer perceptron in deep learning. For a soft-margin SVM, the VC dimension increases with the slack variable. The slack variable controls the regularization of learning: the larger the slack, the weaker the regularization. With high-dimensional inputs we usually add regularization to avoid overfitting. In deep learning, we usually create a network large enough to overfit the data and then add regularization to increase accuracy on the validation or test set. In the bound, the slack variable is the argument of the f(·) term. For decision trees, the VC dimension grows with the tree: a tree of depth D shatters the input space into up to 2^D subspaces, represented by its leaves, with classes assigned at the leaves.
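
A small check of the d + 1 claim in two dimensions (my own sketch): a linear SVM can realize every labeling of 3 points in general position, but fails on the XOR labeling of 4 points, consistent with a VC dimension of 2 + 1 = 3.

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC

def shatters(X):
    """Return True if a (nearly) hard-margin linear SVM can fit every labeling of X."""
    for labels in product([0, 1], repeat=len(X)):
        y = np.array(labels)
        if y.min() == y.max():
            continue  # single-class labelings are trivially realizable
        clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C approximates a hard margin
        if clf.score(X, y) < 1.0:
            return False
    return True

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])              # 3 points in general position
four  = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # XOR configuration
print("3 points shattered:", shatters(three))  # expected: True
print("4 points shattered:", shatters(four))   # expected: False (the XOR labeling fails)
```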

Multilayer Perceptron, Ensembles Max and Mean Outputs, Compare Similarity

Note that bagging and boosting use a majority vote. There are similarities between the VC dimension of a multilayer perceptron and the VC dimension of an ensemble. Is there a relationship between them? Can we convert an MLP into an ensemble learner?
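
For reference, the majority vote mentioned above is just the per-sample vote count across the members' predictions. A minimal sketch with made-up predictions:

```python
import numpy as np

# Predictions of 5 ensemble members on 4 samples (rows = members, columns = samples).
member_preds = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

# Majority vote per sample: count votes for class 1 and compare against half the members.
votes_for_one = member_preds.sum(axis=0)
majority = (votes_for_one > len(member_preds) / 2).astype(int)
print(majority)  # [0 1 1 0]
```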

Transformer model, Attention layer

The VC dimension of the Transformer model is similar to that of an MLP and is related to the size of the attention matrix, which grows as n^2 with the sequence length n. Transformers have a better scaling law than convolutional networks, so I think the Transformer is the future.
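
A minimal scaled dot-product attention sketch (my own, not a claim about any specific Transformer implementation) that makes the n x n attention matrix explicit:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (n, d) arrays. Returns the output and the n x n attention matrix."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                      # (n, n) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d = 8, 16                                           # sequence length and head dimension
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)                           # (8, 16) (8, 8): cost grows as n^2
```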

Group Convolution, AlexNet Double-Branch Bound

Chebyshev Bound


Variance Reduction


Convexity

Super Convergence, L-smooth Learning Rate, Cross Entropy Loss L-smooth Factor, Quadratic Loss with Large Condition Number, PL Condition


Sparsity Learning

The restricted isometry property (RIP) says that the matrix A behaves approximately like an orthonormal matrix on sparse vectors, which is the setting for machine learning under sparsity constraints. Sparsity greatly reduces the number of training samples needed to learn the weights. The accuracy bound improves with the square root of s, where s is the sparsity level. Sparsity in the network also reduces the per-step cost of stochastic gradient descent training from W, the total number of weights, to s.
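
To illustrate the "fewer samples" point, here is a minimal sketch (my own example, using L1 regularization as a stand-in for the sparsity constraint, not the RIP bound itself): recover a sparse weight vector from fewer measurements than dimensions.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s, n = 200, 5, 60                       # dimension, sparsity level, number of samples (n << d)

w_true = np.zeros(d)
support = rng.choice(d, size=s, replace=False)
w_true[support] = rng.normal(size=s)       # an s-sparse ground-truth weight vector

X = rng.normal(size=(n, d))                # Gaussian designs satisfy RIP with high probability
y = X @ w_true                             # only n = 60 noiseless measurements of a 200-dim vector

w_hat = Lasso(alpha=0.01, max_iter=50_000).fit(X, y).coef_
rel_err = np.linalg.norm(w_hat - w_true) / np.linalg.norm(w_true)
print(f"relative recovery error with n={n} < d={d}: {rel_err:.3f}")
```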

Support Vector Machine, Number of training samples


L2 Norm Regularization Bound


Principal Component Analysis, Linear Discriminant Analysis, Independent Component Analysis Bounds


Geometric Complexity Entropy Bounds


Matrix Singular Value Decomposition Bounds


Genetic Algorithm, Simulated Annealing, Ant Colony Bounds

These metaheuristic algorithms have poorer accuracy and convergence-rate bounds than deep learning algorithms. The noise added in diffusion image generation networks is reminiscent of the simulated annealing algorithm. Genetic algorithms are used in deep learning architecture search, although reinforcement learning is generally better than genetic algorithms for neural architecture search.
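
A minimal simulated annealing sketch (my own illustration of the general algorithm, not any bound from the figures), minimizing a simple one-dimensional function with many local minima:

```python
import math
import random

random.seed(0)

def f(x):
    # A bumpy objective with many local minima; global minimum near x = -0.5.
    return x * x + 10 * math.sin(3 * x)

def simulated_annealing(steps=20_000, t0=10.0, t_end=1e-3):
    x = random.uniform(-10, 10)
    best_x, best_f = x, f(x)
    for i in range(steps):
        t = t0 * (t_end / t0) ** (i / steps)        # geometric cooling schedule
        candidate = x + random.gauss(0, 1)
        delta = f(candidate) - f(x)
        # Accept improvements always; accept worse moves with probability exp(-delta / t).
        if delta < 0 or random.random() < math.exp(-delta / t):
            x = candidate
            if f(x) < best_f:
                best_x, best_f = x, f(x)
    return best_x, best_f

print(simulated_annealing())  # expected to land near the global minimum around x = -0.5
```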

Markov Chain Bounds


Graph Neural Network Convergence Requirements

Generally, a cycle-free graph gives more stable weight updates. Asynchronous weight updating is generally more stable than synchronous weight updating, which is also why stochastic gradient descent does better than conventional gradient descent.
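
As a loose analogy for synchronous versus asynchronous (in-place) updates (my own sketch, unrelated to any specific GNN in the figures): Jacobi iteration updates all values simultaneously, Gauss-Seidel updates them one at a time using the freshest values, and on this diagonally dominant system the in-place scheme converges faster.

```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])      # diagonally dominant, so both iterations converge
b = np.array([6.0, 8.0, 9.0])
x_exact = np.linalg.solve(A, b)

def jacobi(iters=25):
    x = np.zeros(3)
    for _ in range(iters):
        x_new = np.empty(3)
        for i in range(3):           # synchronous: every update uses the old vector
            x_new[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
        x = x_new
    return x

def gauss_seidel(iters=25):
    x = np.zeros(3)
    for _ in range(iters):
        for i in range(3):           # asynchronous/in-place: uses the freshest values
            x[i] = (b[i] - A[i] @ x + A[i, i] * x[i]) / A[i, i]
    return x

print("Jacobi error      :", np.linalg.norm(jacobi() - x_exact))
print("Gauss-Seidel error:", np.linalg.norm(gauss_seidel() - x_exact))
```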

Markov Random Field Bounds

Tree-like graph approximations of a multi-branch deep learning network can be used to estimate certain accuracy bounds of the network.

K Nearest Neighbors Bounds, Variance-Bias Bounds

Some similar variance-bias bounds are used in Neural Tangent Kernel (NTK) theory to predict the performance and learning behavior of deep learning networks.
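
A minimal sketch of the k-NN variance-bias trade-off that these bounds describe (my own example, not the NTK results): small k gives low bias but high variance, large k the opposite, and test error is usually best somewhere in between.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n = 400
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)     # noisy samples of sin(x)

X_train, y_train = X[:300], y[:300]
X_test,  y_test  = X[300:], y[300:]

for k in [1, 5, 25, 100]:
    model = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    train_mse = np.mean((model.predict(X_train) - y_train) ** 2)
    test_mse  = np.mean((model.predict(X_test)  - y_test)  ** 2)
    print(f"k={k:3d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```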

Convex Relaxation Bounds

These are some relaxed optimization problems used to find approximate solutions to the three-coloring problem, with provable bounds relative to the optimal solution. Often, when deriving a bound, the original algebraic formulation of a problem is too rigid to be transformed into aggregate properties of its loss function. So the problem is relaxed: it is approximated by another set of algebraic equations, and those equations are then transformed to obtain bounds on the aggregate properties. Later, I will talk about travelling salesman bounds.
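
As a simple, generic illustration of the relaxation idea (my own example, using a knapsack LP relaxation rather than the three-coloring relaxations in the figures): dropping the integrality constraint gives an easy convex problem whose optimum bounds the original combinatorial optimum.

```python
import numpy as np
from itertools import product
from scipy.optimize import linprog

values  = np.array([10.0, 13.0, 7.0, 8.0, 6.0])
weights = np.array([ 5.0,  7.0, 4.0, 4.0, 3.0])
capacity = 10.0

# Exact 0/1 knapsack optimum by brute force (the "rigid" combinatorial problem).
best = max(values @ np.array(x) for x in product([0, 1], repeat=5)
           if weights @ np.array(x) <= capacity)

# LP relaxation: allow fractional x in [0, 1]; its optimum upper-bounds the integer optimum.
res = linprog(-values, A_ub=[weights], b_ub=[capacity], bounds=[(0, 1)] * 5, method="highs")
print("integer optimum :", best)
print("LP relaxation   :", -res.fun, "(an upper bound on the integer optimum)")
```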

Clustering Algorithm Bounds


Mathematical Series Bounds

The Fourier series has many uses in signal processing, computer vision, and deep learning. For example, in computer vision it can be used to align (register) one image with another. Fourier features are used in machine learning and deep learning. Certain families of orthogonal polynomials, such as the Legendre polynomials, can be used to design activation functions, similar in spirit to the Kolmogorov-Arnold Network (KAN). Padé polynomial approximations, similar to Taylor series, can be used to approximate ReLU and other activation functions, which is useful for studying the mathematical properties of deep learning networks.
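
A minimal sketch of the orthogonal-polynomial idea (my own example, using numpy's Legendre utilities): fit a low-degree Legendre expansion to a smooth activation-like function on [-1, 1] and check the approximation error.

```python
import numpy as np
from numpy.polynomial import legendre

x = np.linspace(-1.0, 1.0, 1000)
target = np.tanh(3 * x)                           # a smooth activation-like function on [-1, 1]

for degree in [3, 7, 11]:
    coeffs = legendre.legfit(x, target, degree)   # least-squares fit in the Legendre basis
    approx = legendre.legval(x, coeffs)
    print(f"degree {degree:2d}  max abs error = {np.max(np.abs(approx - target)):.4f}")
```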

Finite Automata Bounds

Automata, regular expressions, and state machines are often used for grammar and spelling correction. They can represent algorithm flowcharts and be used to study the convergence (or termination) properties of an algorithm. A large language model is a Transformer deep learning model, which is memoryless and can be represented by an automaton, so some of these mathematical bounds can be used to infer properties of large language models. If you reverse the order of the characters in a text and train a finite automaton to recognize it, the number of states can sometimes grow exponentially with the number of characters, whereas the un-reversed text needs only a polynomial number of states.
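
A minimal sketch of that kind of reversal blowup (my own example, using the classic language "the k-th symbol from the end is 'a'"): the NFA for this language has only k + 1 states, but determinizing it by subset construction yields about 2^k reachable DFA states, whereas "the k-th symbol from the start is 'a'" needs only about k states.

```python
from collections import deque

def dfa_states_for_kth_from_end(k):
    """Count reachable DFA states (via subset construction) for the language
    over {a, b} whose k-th symbol from the end is 'a'."""
    # NFA states 0..k: state 0 loops on both symbols and guesses the start of the
    # final k-symbol suffix on 'a'; states 1..k-1 advance on any symbol; state k accepts.
    def step(states, sym):
        nxt = set()
        for s in states:
            if s == 0:
                nxt.add(0)
                if sym == "a":
                    nxt.add(1)
            elif s < k:
                nxt.add(s + 1)
        return frozenset(nxt)

    start = frozenset({0})
    seen, queue = {start}, deque([start])
    while queue:                      # breadth-first search over reachable state subsets
        cur = queue.popleft()
        for sym in "ab":
            nxt = step(cur, sym)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

for k in [2, 4, 8, 12]:
    print(f"k={k:2d}  reachable DFA states = {dfa_states_for_kth_from_end(k)}  (about 2^{k})")
```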

The End
