Structural analogy between the Galton board and neural network architectures
What structural analogies can be drawn between the workings of a 19th-century mechanical device, the Galton board, and the neural networks underpinning contemporary language models? Both systems process distributions and conditional probabilities, aggregate simple elementary effects, and produce emergent complexity through massification. This article explores these parallels and their limits, asking whether emergent complexity is an intrinsic driver of intelligence.
The Galton board is a mechanical device that illustrates the binomial distribution, and its normal limit, through the movement of balls falling onto pegs and settling into bins. Mathematically, it is modelled by the probability tree of the binomial distribution: it realises a succession of Bernoulli trials, i.e. N independent draws of a random variable with two possible values.
When a ball is released at the top, the probability of landing in a given column depends on the number of paths from the top to that column. At each peg, the ball goes left with probability 0.5 and right with probability 0.5, so any single path through N levels has probability 0.5^N; the number of routes to a given position, however, varies. That number is a binomial coefficient, read off Pascal's triangle, so the probability of ending in bin k is P(k) = C(N, k) · 0.5^N.
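A minimal Monte-Carlo sketch of this board, comparing simulated bin frequencies with the exact binomial law (the sizes are illustrative):

```python
import math
import random

# Monte-Carlo Galton board: N rows of pegs, each bounce left/right with p = 0.5
N, balls = 10, 100_000
random.seed(0)
bins = [0] * (N + 1)
for _ in range(balls):
    k = sum(random.random() < 0.5 for _ in range(N))   # number of right turns
    bins[k] += 1

# Compare with the exact binomial law P(k) = C(N, k) * 0.5**N
for k, count in enumerate(bins):
    exact = math.comb(N, k) * 0.5 ** N
    print(f"{k:2d}  simulated {count / balls:.4f}  exact {exact:.4f}")
```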
The binomial distribution is a discrete probability law. As the number of draws N tends to infinity, the binomial probability mass function converges to the normal density (de Moivre-Laplace theorem, a special case of the central limit theorem).
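The convergence can be checked numerically by evaluating both laws side by side (N and the sample points k are arbitrary choices):

```python
import math

# Compare binomial P(k) for N fair draws with the approximating normal density
N = 100
mu, sigma = N * 0.5, math.sqrt(N * 0.25)
for k in (40, 50, 60):
    binom = math.comb(N, k) * 0.5 ** N
    normal = math.exp(-(k - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
    print(k, round(binom, 5), round(normal, 5))
```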
A digital neuron (the perceptron model) receives an upstream numerical signal, processes it through an internal mathematical function and, depending on the result, emits a downstream signal. Three signals X1, X2, X3 with synaptic weights w1, w2, w3 are fed to the perceptron, which computes their weighted sum. This sum may be adjusted by a bias, then passed to the activation function: if it does not exceed a threshold S, no signal is transmitted; otherwise the signal propagates to the next layer.
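A minimal sketch of this step-activation perceptron (all numeric values are illustrative):

```python
import numpy as np

def perceptron(x, w, bias, threshold):
    """Weighted sum, then a step activation: fire only above the threshold S."""
    s = np.dot(w, x) + bias
    return 1.0 if s > threshold else 0.0

# three input signals X1..X3 with synaptic weights w1..w3
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, 0.4])
print(perceptron(x, w, bias=0.1, threshold=0.5))   # -> 1.0
```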
Perceptrons are organised in successive layers: an input layer (N perceptrons ingesting the input vector), k hidden layers, and an output layer (J perceptrons emitting the output vector). In a full mesh, between two adjacent layers of K perceptrons there are K² synapses, so a stack of L such layers carries (L − 1) · K² synapses, each one a model parameter.
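The parameter count follows directly (sizes illustrative):

```python
# Each pair of adjacent fully connected layers of K units contributes K*K
# weights, so L successive layers carry (L - 1) * K**2 synapses/parameters.
K, L = 1000, 5
print((L - 1) * K ** 2)   # -> 4000000 weight parameters (biases would add more)
```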
In LLMs based on transformers, two components structure the architecture:
The encoder processes and represents the input information for the decoder. The multi-head attention mechanism captures contextual dependencies between words at different positions in the input sequence. It contextualises the prompt: the weight of the sub-word “go” differs depending on whether the encoder encodes “today it will go well” or “go do your homework”. The position-wise feed-forward network then applies a nonlinear transformation independently at each position.
The decoder generates the output sequence from the encoder’s continuous representations, via three sub-modules: masked self-attention, encoder-decoder cross-attention, and a position-wise feed-forward network. The output is produced by a linear layer followed by a softmax function assigning probabilities to vocabulary words.
Each attention head uses three matrices: the query matrix (Q) to interrogate contextual relationships, the key matrix (K) to evaluate the relative importance of words, and the value matrix (V) containing the contextual information. The mechanism computes the dot products Q·Kᵀ, scales them by the square root of the key dimension, applies a softmax to obtain the attention weights, and uses those weights to combine V. This runs in parallel for each head.
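A minimal sketch of one such head, scaled dot-product attention over a toy sequence (matrix sizes and contents are illustrative):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # query/key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of values

# toy sequence of 4 tokens with embedding dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)       # -> (4, 8)
```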
GPT thus uses conditional probabilities to “guess” the sub-word most likely to continue the input prompt, and its attention mechanism lets it build the next sentence token by token: each generated token is fed back into the input to predict the one after it.
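The autoregressive loop itself is simple; in the sketch below, next_token_logits is a hypothetical stand-in for the whole transformer, and greedy decoding replaces sampling:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical stand-in for the transformer: any function mapping a token
# sequence to one logit per vocabulary entry would do here.
def next_token_logits(tokens, vocab_size=10, rng=np.random.default_rng(0)):
    return rng.normal(size=vocab_size)

tokens = [1, 4, 2]                       # encoded prompt (illustrative ids)
for _ in range(5):                       # autoregressive decoding loop
    probs = softmax(next_token_logits(tokens))
    tokens.append(int(probs.argmax()))   # greedy: pick the most probable token
print(tokens)
```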
Both systems process distributions and conditional probabilities. The Galton board assigns a probability to each final position according to the paths that lead to it. An LLM uses its probabilistic weight matrices, fixed during training, to guess the most probable word or sentence completing the prompt.
To overcome three objections (the pyramidal topology, the equiprobability of left and right, the scalar input), a thought experiment modifies the Galton board: a rectangular topology with staggered pegs (layer 2n has K pegs, layer 2n+1 has K+1 pegs), a vectorial input distribution (N columns pre-filled with varying quantities of balls), and pegs of varied size and geometry (round, triangular, potato-shaped) to break equiprobability. Physics plays the role of the softmax here, since at each peg the probabilities necessarily normalise: p(A) + p(Ā) = 1.
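A minimal sketch of such a biased board, propagating a vector input distribution exactly through rows of pegs with individually tuned left/right probabilities (the sizes and probabilities are illustrative, and the widening triangular layout simplifies the staggered rectangle described above):

```python
def propagate(p_left, input_dist):
    """Exactly propagate an input distribution through rows of biased pegs."""
    dist = list(input_dist)
    for row in p_left:                         # row[j] = P(go left) at peg j
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * row[j]            # mass deflected left
            nxt[j + 1] += mass * (1 - row[j])  # mass deflected right
        dist = nxt
    return dist

# 3 input columns, 4 rows of pegs with non-equiprobable bounces
p_left = [[0.7, 0.5, 0.3],
          [0.6, 0.5, 0.5, 0.4],
          [0.5, 0.8, 0.2, 0.5, 0.5],
          [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]]
print(propagate(p_left, [0.2, 0.6, 0.2]))      # 7-bin output distribution
```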
Both systems aggregate effects to produce results. In the Galton board, each ball undergoes a series of collisions that determine its final position. In a neural network, input signals are transformed and weighted through the layers to produce a response. Training the “lasagnified” Galton board would consist of modifying peg size, shape and position until the desired output distribution is obtained: rough and very slow, but requiring only gravitational energy once trained.
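Such training could be sketched as a crude random search over the peg biases, reusing propagate() from the previous sketch (the target distribution and step size are arbitrary):

```python
import random

def loss(dist, target):
    return sum((d - t) ** 2 for d, t in zip(dist, target))

random.seed(1)
input_dist = [0.2, 0.6, 0.2]
target = [0.0, 0.05, 0.2, 0.5, 0.2, 0.05, 0.0]   # arbitrary desired output
p_left = [[0.5] * (3 + r) for r in range(4)]     # start from fair pegs

best = loss(propagate(p_left, input_dist), target)
for _ in range(5000):                            # perturb one peg at a time
    r = random.randrange(len(p_left))
    j = random.randrange(len(p_left[r]))
    old = p_left[r][j]
    p_left[r][j] = min(1.0, max(0.0, old + random.uniform(-0.1, 0.1)))
    new = loss(propagate(p_left, input_dist), target)
    if new < best:
        best = new                               # keep the change
    else:
        p_left[r][j] = old                       # revert
print(best, propagate(p_left, input_dist))
```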
Both systems exhibit complex behaviours emerging from simple unit-level interactions. The Galton board generates a complex distribution from elementary mechanical collisions; neural networks model complex functions from simple interconnected neurons.
Strange attractors in chaos theory are characterised by: nonlinearity (variables interacting in complex ways), sensitivity to initial conditions (butterfly effect), fractal structures (similar patterns at different scales — colouring Pascal’s triangle by parity yields the Sierpiński triangle), and chaotic behaviour (nearby trajectories diverging exponentially).
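A few lines suffice to see that fractal appear from the parity of the binomial coefficients (depth N is arbitrary):

```python
# Pascal's triangle modulo 2: the odd entries draw the Sierpinski triangle
N = 16
row = [1]
for _ in range(N):
    print("".join("#" if v % 2 else " " for v in row).center(N))
    row = [1] + [a + b for a, b in zip(row, row[1:])] + [1]
```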
The fundamental difference between the two systems: for the LLM, emergence rests on probabilistic weight matrices and the local minima reached by gradient descent; for the Galton board, it belongs to chaos theory, where initial position, momentum, peg interaction, shape and texture are all parameters of a nonlinear dynamical system.
In both cases, the system architecture and the laws governing the unitary mechanisms are “simple”. Complexity emerges from the mass effect: it is the number of parameters, synapses or pegs that supplies the degrees of freedom needed to adapt to different input/output pairs. With 175 billion parameters for GPT-3.5 and a reportedly far larger, undisclosed count for GPT-4, probabilistic inferences emerge that suggest an embryo of artificial general intelligence.
Yet the brain’s architecture bears no resemblance to a lasagna dish. It is a fractal architecture interconnecting specialised computing units (visual cortex, language areas, motor areas) via large axonal bundles. Synaptic massification makes it illusory to understand the system’s detailed workings, and therefore to explain a result or “decision” a posteriori. Large language models will struggle to meet ethical-AI requirements: training datasets that are hard to describe with metadata, robustness potentially compromised by strange attractors, and the impossibility of predicting a probabilistic model’s behaviour at these parameter scales.