Neural Networks, Information Theory and Knowledge Representation

Julian Smith

Ph.D.

The University of Edinburgh

1995

Contents

Introduction 3
Chapter 1: Some common unsupervised neural network
learning algorithms
9
Chapter 2: The problems of implementing neural networks
that use lateral connections
23
Chapter 3: Using unsupervised feature-detectors in a model
of reading aloud
31
Information theory, and an introduction to chapters 4-6 45
Chapter 4: The Maximum-Entropy-Filter (M.E.F.) principle 49
Chapter 5: Applying the M.E.F. principle to vision 64
Chapter 6: A Neural network implementation of the
M.E.F. principle
78
Chapter 7: Conclusion 89
References 94

Introduction

This thesis is concerned with investigating the modelling of cognition using neural networks, with emphasis on making the network models biologically plausible.

It is impossible to build neural network models of the brain from precise mathematical models of individual neurons, for two reasons: first, the exact behaviour of real neurons is not known and, second, even if we knew the precise behaviour of real neurons, present day computers would not be powerful enough to model large numbers of them.

Despite these problems, using networks of units as cognitive models is still attractive because it is known that the brain is constructed in this way. For example, one hope is that the precise behaviour of neurons is not important to brain function, and that the fact that the brain is made up of many relatively simple units connected together is important. Hence we can attempt to model cognition using networks of idealised units which approximately correspond to real neurons.

Although knowledge of the details of neuron behaviour is incomplete, there are a number of details which are relatively certain (Kuffler et al [1984], Hodgkin & Huxley [1952], Douglas & Martin [1990]):

The simplifications which lie behind most neural network models are that the signal frequency can convey a continuous signal, and that the neurons alter their behaviour by adjusting their synapse strengths, so changing how much or how little a particular input affects the neuron.

Much of the enormous amount of neural network research done in the last few years has been prompted by the invention of the back propagation learning algorithm. This enables neural network models to be made which can learn to solve extremely complicated problems, sometimes in ways similar to real cognition. Unfortunately, a back propagation neural network is a rather extreme distortion of the skeleton neural network described above, as it postulates the existence of error signals which travel in the opposite direction to the main neuron output signals, so as to inform neurons about whether they are having a good or bad influence on neurons further down the network.

While such additional features certainly add to the power of neural networks, it is preferable to make as few assumptions which go beyond known physiological evidence as possible. Otherwise, we could add all sorts of features to make a particular modelling task easier, but which might make the task completely different from the real task which the brain has to cope with.

A common criticism of back propagation is that it is a supervised algorithm. I don't consider this to be the main problem though; at some stage there has to be feedback involved in the brain, which is similar to the idea of supervision - for example semi-autonomous networks supervising each other.

However, back propagation's use of separate error signals which are provided for each unit, is a much more important issue. The main criticism of these individual error signals is that there is no biological evidence that they exist in the brain. They have been invented because they provide a particularly simple and effective way of making networks learn to perform arbitrary tasks: with error signals back propagating from unit to unit, one can effectively optimise the performance of each unit individually, by using the chain rule from differential calculus to find the effect that each unit is indirectly having on the output of the whole network.

Apart from there being little biological evidence for the existence of error signals to individual neurons in the brain, it is clear that much useful processing of data can take place without such error signals. A particularly persuasive example of this is the work in Finch & Chater [1992] and Finch et al [1995], where statistical analysis of a large corpus of written text resulted in the creation of word-categories which were similar to those used by linguists, and also seem to have some semantic validity. For examples of more general statistical analysis of data by unsupervised neural networks, see the the work on principal components analysis in Plumbley [1991], Földiák [1991], Földiák [1992], and some of the work described later in this thesis.

If we restrict ourselves to non-error correcting network learning algorithms, we can't perform arbitrary data transformations - a detailed error signal would be required for this. Instead, we must rely on there being some statistical structure present in the input data set - for example, if you put completely random data into a principal components network, the weight vectors will move around aimlessly and never converge to a stable state. However, real data does indeed have lots of structure, so simple statistical methods are worth considering for at least the early stages of sense-data processing.

For this reason, the work in this thesis is an investigation into how the brain might extract and use statistical structure in input data.

Thesis summary

A very brief description of each of the chapters in the thesis is included below. The central idea in the thesis is presented in chapter 4. Its consequences are explored in chapters 5 and 6, and ideas for further work based on this idea are explained in chapter 7.

Chapters 1 and 2 discuss unsupervised neural network learning, and the idea that significant statistical information is present in environmental data. Chapter 3 described a (non-network) model of reading aloud.

Chapter 1

A description of some common unsupervised neural network learning algorithms.

Chapter 2

A discussion of problems with unsupervised network models which have recurrent connections. One can make this sort of network perform Principal Components Analysis on the data, e.g. Plumbley [1991], Földiák [1992]. Unfortunately, I have found significant problems with the practical behaviour of this sort of network, caused by the use of recurrent connections. The lateral connections mean that the network is susceptible to oscillations, and to overcome this, one has to use what I call a 'differential state update' method, in which the state of each unit is increased by an amount proportional to the units inputs, rather than being set directly to a value proportional to the inputs. This method is implicitly used by Földiák [1992]. I have found that even with differential state update, the network can take many iterations to stabilise, which is at variance with known response times in everyday psychological experiments, where there is virtually no time for real neural networks to iteratively approach a stable set of activations.

Chapter 3

This is on a slightly different subject matter from the rest of this thesis. A model of reading aloud is described, which was developed using the ideas presented earlier about how the brain could take advantage of structure contained in the statistics of sense data, although the statistical analysis is not done with neural networks.

Chapter 4

This chapter presents the central idea in this thesis. It explains how information theory can be used to develop a general definition of a high-level representation. This is a very important idea, as it provides a mathematical framework for research into how the brain transforms sense data into a useful high-level form, and suggests ways in which this could be done using unsupervised neural networks.

High-level cognitive functions like walking, talking, abstract thinking etc., are clearly only possible because we deal with the world at a 'higher level' than simple sense data. Hence the importance of the general definition of a high-level representation presented in this chapter.

Chapter 5

This presents an application of the ideas developed in chapter 4, to low-level vision. A very simple statistical model of visual data is described, and the filter which would convert this data into a high-level representation is derived. The filter shows some similarities to the early visual centres in the brain, and provides a striking example of how the rather abstract information-theoretic ideas from chapter 4 can have very interesting and realistic consequences.

Chapter 6

A very simple neural network learning rule is described which attempts to make the output of the network have maximum entropy, in accordance with the ideas developed in chapter 4. The rule uses one extra signal in learning, representing the average activation of all units, to provide a crude form of de-correlation. The rule is applied to input patterns which are bit-images of the digits 0-9, and units are shown to approximately develop weights which make them respond to different input patterns.

Chapter 7

A summary of the thesis, and some ideas for future work which could extend the work on maximum-entropy representations and neural networks.

Acknowledgements

I am very grateful to my supervisor, Nick Chater, who has helped in the development of many of the ideas in this thesis. Of particular influence were many interesting discussions about information theory and maximum-entropy representations, and their relevance to unsupervised neural network models.

I would also like to thank Chris Huckle for reading through and commenting on later versions of the thesis.

Declaration (University of Edinburgh Postgraduate Study Regulation 3.4.7)

  1. This thesis has been composed by myself.
  2. The work described in this thesis is my own.

Chapter 1 Some common unsupervised neural network learning algorithms

This chapter is a description of some common local and non-error-correcting network learning algorithms.

Some notes on error-correction

A very powerful way of training a neural network is to use a form of back propagation (Werbos [1974], Rumelhart et al [1986], Parker [1985]), where error signals are calculated and sent to individual units. Unfortunately there is no evidence that such error-correcting signals at the level of individual neurons occur in real brains. Instead, a unit's weights have to be changed using the much less specific (from a weight-changing point of view) information contained just in the inputs to that unit. This means that a layer of units in a realistic (i.e. biologically plausible) network cannot be expected to organise themselves in order to perform any given function unless that function can be realised by the action of learning rules which are in some sense localised to each unit's inputs and output.

This is a serious restriction on a neural network. For example, if one wants the following network to solve the Exclusive-OR problem, and the output unit ends up at some stage with weights that approximately OR the two inputs, then it would be very useful for a hidden unit to AND the two inputs, as this would provide the output unit with just the extra information that it requires in order to solve the task.

A simple network which could solve the XOR problem if the hidden unit is given an error signal.

However, with no single-unit error signals, there is no particular reason why the hidden unit should suddenly decide to AND its inputs. Thus it would seem that not being able to have detailed error signals is a serious restriction on network learning ability.

Reinforcement learning (Barto and Anandan [1985], Barto and Jordan [1987], Grossman [1990]) is a sort of halfway house between this seemingly completely unsupervised type of network and the back-propagation algorithm. With reinforcement learning, there is a single error, which is known to the entire system, and which condenses the performance of the whole network into a single number. In the Exclusive OR problem mentioned above, any random fluctuation of the hidden unit's weights which made it perform more like an AND gate (and assuming by chance that the weight from hidden to output was negative) would result in the network performance improving. This could be detected, and the hidden unit's weights could be made to continue changing in this direction.

However, this example illustrates the low power which any reinforcement algorithm has - it has only one number to guide the evolution of hundreds of weights in a reasonably sized network. One could increase the number of reinforcement signals available, but it still seems that individual units in the brain manage to change their weights in useful ways without using many external learning signals.

If a unit is not given any explicit information like "in future, respond more strongly to this input pattern", how is it to learn? The most common solution is to use a variant of the Hebbian learning rule. Often, a non-local controller is also included, for example in competitive learning, in order to make a network's weights evolve in a particular way. This has obvious biological problems, although inhibitory connections can be used to partly mimic the effect of such global controllers. Unfortunately, the use of inhibitory connections often results in the units in a network taking a long time to reach stable activations, and even in oscillatory behaviour. This means that the network is slow, a serious drawback considering the speed with which visual stimuli are recognised by people (Sejnowski [1986], Marr [1982], Morell [1972]).

To sum up, any unsupervised network learning model which doesn't use individual unit error signals can only process input data in a specific built-in way determined by the architecture of the model and the learning rule used by the units.

Neural networks function by being repeatedly presented with data, so the evolution of weights will be determined by the frequency with which the various input patterns occur. Thus one can look at the network as responding to particular statistics of the input dataset, as determined by the architecture and learning rules. The statistics that the network works with may not always be of a clean mathematical type, such as calculation of the principal components of the input distribution, but I think that regarding the processing of a neural network as being statistical ensures that one has a realistic expectation of its capabilities.

Some well-known unsupervised non error-correcting neural networks

Some of the algorithms mentioned below use a global supervisor to control some of the training. This is obviously a problem, but it is sometimes possible to use lateral inhibitory connections to implement the effect of the global controller, so the behaviour of these algorithms is interesting in the present context.

Most of the following algorithms can be viewed as trying to make units decorrelated and/or maximise the mutual information between the input and output patterns. Chapter 4 of this thesis will give a general theoretical principle which explains why these two processes are useful.

This section doesn't include any original work, except for some of the comments on Linsker's network and its implications for purely statistical methods of data-transformation.

Competitive learning

In competitive learning (Rosenblatt [1962], von der Malsburg [1973], Fukushima [1975][1980], Kohonen [1982], Rumelhart and Zipser [1985], the unit whose weight vector best matches the input vector is chosen as the winner, and is allowed to fire, and its weight vector is moved closer to the actual input.

A problem with this is that some units can get ignored initially because their initial weights don't match the input pattern sufficiently well, and so the other units' weights continuously get adjusted to fit the input patterns better. This results in the group of units keeping the same weights, and so never winning the competition.

One way of reducing this problem is to give units an adjustable threshold parameter which is subtracted from the weighted input. If a unit has not won the competition to be the best matched input for a long time, the threshold is lowered, making the unit more likely to win a competition in the future. This mechanism also works as a conscience (Rumelhart and Zipser [1985]) in units which happen to win too many competitions, by raising their threshold and so making them less likely to win in the future.

Each unit of the network can be looked on as responding to inputs that are in a certain category, so the network performs a form of cluster analysis.

One can attempt to implement competitive learning with inhibitory connections. The feature-detecting networks described in chapter 2 can be viewed like this.

Clearly, the competitive learning algorithm tries to force units to behave differently from each other. This has important parallels with the information theoretic analysis of unsupervised learning described in chapter 4 of this thesis.

Kohonen networks

Kohonen networks are similar to competitive learning networks, except that the units are treated as if they are in a topographic array in (for instance) two dimensions (see Kohonen [1982] and Kohonen [1989] for a more detailed description of Kohonen networks). For each input pattern presentation, the unit with its weight vector closest to the input vector is nominated as the winner, and the weight vector of this winning unit is moved closer to that particular input, as in normal Hebbian learning.

Those units which are close to the winning unit in the topographic array also have their weights updated in the same way, though with a smaller learning rate determined by their distance form the winning unit. This means that units that are close together in the topographic map will eventually have similar weight vectors, so that after training the network will map the high dimensional input onto the much smaller dimensional space that the units are in, in the sense that inputs that are close together in the high dimensional input space will be mapped to neighbouring units in the unit's space.

Unfortunately, Kohonen networks can take many thousands of iterations to reach a stable set of weights, which is not really surprising considering the difficulty of mapping a high dimensional space to a lower dimensional one while attempting to preserve topographic information.

Linsker's self-organising network (Linsker [1986][1988])

Linsker's network is completely unsupervised, but when fed with random noise, units develop which perform visually important functions such as centre-surround and orientation-sensitive detectors. This seemingly miraculous behaviour is a result of the units having limited receptive fields.

The network uses linear units. It is well known that any multi-layer linear network is equivalent to a single-layer linear network, because each layer just performs a matrix multiplication, and the transformation given by successive multiplication by any number of matrices can always be performed by just one matrix multiplication. However, the multi-layer aspect of Linsker's network is important for learning.

The units in a particular layer of the network receive inputs only from those units within their receptive fields in the previous layer. These receptive fields overlap, so the outputs from a layer will each go to more than one unit in the next layer. The learning rule for the weights is a generalised Hebbian rule:

equation

where is the activation of unit , the -th input to the layer, and is the weight for the -th input to the -th unit. are constants, which determine the particular patterns of weights that develop.

The weights are also constrained to lie in a certain range .

The four parameters in the learning rule have to be carefully selected so that the network would develop the feature detectors described earlier. The addition of locally excitatory connections in the layers that developed orientation-sensitive units made the units in the layer have a smoothly changing orientation preference, similar to those found biologically.

Linsker shows how the network maximises the variance of the output unit activations, subject to constraints on weight size. When the inputs are normally distributed, and there is gaussian noise on all lines, maximising the output variance also maximises the Shannon information transmission from inputs to outputs. Linsker uses this as a general principal (Infomax): units and architectures should be designed to maximise the Shannon information they transmit to subsequent layers. The assumption that inputs have a gaussian distribution is hardly realistic however, so Linsker's actual network learning rule is not a generally useful one. The Infomax principle has links with the Maximum Entropy Filter principle developed in chapter 4.

Networks which perform Principal Components analysis

Most of the results presented here are already well-known, but are included here because chapter 2 presents the results of simulations of principal component networks, and the mathematical derivations below can aid the understanding of the results presented there. See Hertz et al [1991] for a similar treatment, and also Sanger [1989] and Oja [1989].

Given many data points in an -dimensional space, one can represent these points as linear combinations of basis vectors. The basis vectors can be chosen at random, with the proviso that they are linearly independent. Intuitively, the "First Principal Component" of a set of data points is the direction in which the data points vary most widely. If this direction is chosen as a basis vector, then the best way of representing the data using just one parameter is to take components of the data points along this Principal Component. Here, "best way of representing the data" means the method whose reconstruction of the input has the least mean-square error. The principal component is also the eigenvector of the correlation matrix of the data which has largest eigenvalue.


A set of data points in 2 dimensions, with the arrow representing the principal component of the data

Adding a second basis vector enables one to represent the data more accurately. Choosing the second basis vector to be along the direction which shows the highest variation, after subtracting the component along the first P.C., defines the second principal component. For the data in the diagram, there can only be 2 principal components because the data itself is only two-dimensional.

It is possible to make an unsupervised neural network find the principal components of input data. In fact the simple Hebbian algorithm with a linear unit will tend to align its weight vector with the first principal component of the input data. Mathematically, the first principal component is the direction of the maximal eigenvector of the correlation matrix of the input data:

The proof of this is straightforward, but rather long-winded. We will use a matrix notation for the various quantities as this simplifies things: We will consider a single linear unit with inputs, and let the activation of this unit be . The inputs will be represented as an -row column matrix, , and the weights of the unit will be represented as , a row matrix with columns. Hence the state of the linear unit will be .

We are interested in the development of the weight vector when the unit is exposed to a set of different input patterns, so for each different input pattern enumerated by m, we will set .

Our unit has a Hebbian learning rule which in terms of components of the weight and input matrices is , where is the learnrate. In matrix notation this is , because is a row matrix while is a column matrix.

To see what happens to the weight vector when the unit is trained with many input patterns, we need to know what the average weight change is, by averaging over all input patterns. We will also substitute in the expression for the state of the unit in terms of the input pattern and the weight vector, , and ignore the learnrate:

equation

equation

equation

where C is the covariance matrix of the input patterns, , assuming the input patterns have zero mean.

To show that the weight vector will tend to align itself along the first principal component of the input patterns, we will express it as a linear combination of all the eigenvectors of , and look at how the coefficients of this linear combination change, instead of how the whole weight vector changes.

We will let the eigenvectors of be , and the coefficients of the linear expansion be , where n runs from to in both cases.

Then the weight vector can be written . Hence we get . Also, . Because is an eigenvector of , we must have , where is the eigenvalue of the eigenvector , so .

Hence becomes . Because the are orthogonal, we can equate their coefficients to get .

Clearly, if all the a's are initially of roughly equal size, then the a for the maximal eigenvector will increase the fastest, because its will be the largest. If some sort of weight limitation procedure is used, then the weight will not increase in length indefinitely, but its direction will still tend towards that of the maximal eigenvector.

There are many ways of keeping the components of the weight vector bounded, such as dividing all components by the length of the vector after each training step which will keep , though this is clearly non-local with respect to individual synapses.

A different way of generalising the Hebb rule to keep the weights bounded is the Oja rule (Oja [1982]), which keeps for linear units, and has the attractive property that it doesn't involve any summing over all the weights on a neuron: each synapse update is dependent on only that synapse and the activation of the unit, and not dependent on any other synapse strength. This rule is: , or in matrix notation, .

This Oja rule with a single unit will always find the first principal component of the input distribution, regardless of the initial weight values, and will ensure that :

As before, the state of the unit is . The learning rule is , so to find the average weight change for all input patterns, , we need to know and .

For , we substitute to get , where is the correlation matrix for the set of input patterns.

Similarly, for we get:
.

Hence the average weight change is

For the weight vector to be stable, we must have , so at equilibrium. The R.H.S. is a scalar times the weight vector , so this equation says that any stable weight vector must be an eigenvector of , the covariance matrix of the input distribution.

Given that is an eigenvector of , with some eigenvalue , we must have and . If we substitute these back into the above expression that says that, at equilibrium, , we get , so (assuming ). Hence the Oja rule makes the length of the weight vector be unity at equilibrium.

We can also investigate how the weight vector changes in terms of its components along the eigenvectors of , and show that the only stable equilibrium is when the weight vector is the maximal eigenvector: As before, we shall express as a linear combination of all the unit eigenvectors of , , where the eigenvector has eigenvalue . The expression for the average change in the weight vector becomes:

equation

equation

equation

equation

equation

equation

equation

The second term on the R.H.S. is the same for all , so the component of along the eigenvector with greatest will increase fastest.

If the weight vector has already reached an equilibrium, then only one of the a's will be 1, say and the rest zero because the weight vector will be an eigenvector and will have unit length. Hence the above expression will be . This means that any perturbation of the weight vector along an eigenvector which has a higher eigenvalue than will be unstable. This means that the only stable equilibrium is when the weight vector is along the maximal eigenvector with eigenvalue , as no other eigenvector has a higher eigenvalue. This also corresponds to the principal component of the input data.

It is possible to make an unsupervised net with two units which respond to the first and second eigenvectors of the input data set. This is done by using two linear units training with a Hebb rule. Normally these would each tend to the maximal eigenvector. To prevent this, we can arrange for a lateral connection from the output of unit 1 to the input of unit 2, which learns in an anti-Hebbian way, i.e. the strength of the weight will decrease if the two units are on together. Unit 1 will detect the first principal component as usual, as its inputs are identical to the normal principal component detecting case. To find what unit 2 will do, we just note that the lateral connection will change until the two units are uncorrelated. Because the normal weights will try to converge to the highest-eigenvalue eigenvector, the unit will eventually find the second principal components. Also, the lateral connection will end up at zero.

A fuller analysis is rather tedious, but is given here for completeness:

The states of the two units are given by: , and . We can expand the expression for to get .

If we assume for the moment that the lateral connection changes much more quickly than the normal weights, we find that as a stable equilibrium at : We have . This is . Because is positive, this equilibrium is stable.

Hence being at equilibrium means that , or .

We can now find the average change in : . We can substitute the expression found earlier, , to get:

The expression for found earlier was . Substituting this gives:

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

equation

The first two terms are the normal Oja learning rule, while the last is easier to understand if we substitute in the fact that will converge to the first principal component of , in which case , so and . This makes the third term: . This is a vector which is perpendicular to the current vector when has reached unit length, and is directed away from the first principal component. This third term is zero when and are perpendicular.

Also, when the network has stabilised, it turns out that the lateral connections decay to zero. This can be proved mathematically by solving the simultaneous equations for the two units. The lateral connection is still needed however, as it acts to reverse any small random change in the input weights which makes the units become slightly correlated again.

Another way of looking at the network is that the second unit receives an input equivalent to a simple unit with no lateral connections, but with the first principal component removed from its inputs. This is because after the first unit has found the first principal component, its weight vector will be , so the inhibitory weight from unit 1 to unit 2 will be , which is minus the component of along the first principal component. The 'input' to this lateral weight will be the activation of unit 1, which is just the magnitude of the input vector along the first principal component, so this will cancel exactly the component of the input vector along the first principal component, hence it will find the second principal component.

The anti-Hebbian lateral connection scheme can be extended to have any number of units, for example (Sanger [1989]):

A network which detects the first three Principal Components of its input distribution

From a biological point of view, the structure of these networks is rather contrived because of the unsymmetrical arrangement of the lateral connections. More plausible would be to have anti-Hebbian lateral connections between all pairs of units (Oja [1989]). This arrangement is described and modelled in chapter 2, and the -unit network learns to span the -dimensional sub-space of the original -dimensional inputs. Unfortunately, real modelling of units with this pattern of lateral connections is complicated because the network is recurrent, so that the network can take many iterations before the activations are stable. Moreover, there is evidence that the output axons from neurons do not form both excitatory and inhibitory connections, so the lateral connections in the principal component model would have to be mediated by extra neurons (e.g. Plumbley [1991]). See chapter 2 for the details of some simulations of this type of network.

Despite these implementational problems, P.C.A. is an attractive algorithm for a network, because it performs a form of data compression. The most variable parts of the input data will often be the most informative from an informational-theory point of view1, so the simple network can be thought of as extracting the most important parts of the input data. This idea will be further developed in chapter 4.


1  The precise definition of the most informationally useful part of the input data depends on the sort of noise present and the form of the input data. If both of these are gaussian, then representing the data in terms of the principal components is optimal.


Földiák's use of anti-Hebbian learning to provide de-correlation (Földiák [1990])

This is similar to principal component networks, in that it uses simple units with lateral connections. The main differences are that the units are non-linear, and the units have adjustable thresholds. The network is designed so that it learns to represent input data in a particular way:

Földiák describes two extreme ways in which a network can encode information:

The best way is somewhere between these two extremes - sparse coding. Földiák describes a network that uses non-linear units with anti-Hebbian lateral connections. The parameters of the model which control the threshold of the units are arranged so that most units will be off for any input pattern. This makes the network perform sparse coding. The particular compromise between grandmother and fully-distributed representation is determined by the average number of units which are allowed to fire for each pattern.

The network units develop into feature-detectors, and the network can develop units which respond to commonly occurring patterns in the input data. Of particular interest is that the output for frequently occurring input patterns is sparser than the output for infrequent input patterns. This would make for easier recognition of commonly occurring patterns by a subsequent layer, and also means that the output of the network is efficient from an information theory point of view.

Földiák uses a differential state update in order to model the network, although doesn't discuss this. Instead of the usual equation for finding the new state of unit with a non-linear response function :

equation


Földiák uses a differential equation of the form:

equation

Note the last term. This is designed so that when stability is reached (), the state of the unit will equal the total input, so that stability corresponds to a network with normal units (i.e. without differential state update) at stability, even though such a network would not always reach this state because of instability problems.

Networks with connectivities less than 100%

All the networks considered so far, except for the Linsker network, have had 100% connectivity, i.e. each unit receives input from every unit in the previous layer (or from every input to the network for a one-layer network). It is curious that very little work has been done on networks which are not fully connected, as the brain certainly hasn't got 100% connectivity.

It is intuitively clear, and also follows in a particularly direct way from information theory (see chapter 4), that in order to be useful, units should respond differently from each other, otherwise one might as well use fewer units. This is done in principal components networks by the use of lateral connections; back propagation networks solve the same problem by sending error signals to different units which tend to amplify any small initial variations in each unit's weight vectors.

However, if a network has limited connectivity this problem is, at least partially, solved automatically. One idea for further work following on from this thesis is to investigate what effect limited connectivity has on the problem of making networks transform data into useful representations.

From the point of view of the later work in this thesis about processing visual data, Linsker's generation of interesting receptive fields from completely random data is, at first glance, problematic, because the information theory approach only works when the input data has significant structure. However, it turns out that the first layer in Linsker's network is actually adding structure to the (structureless) input data because of the limited receptive fields, and this structure is very similar to that assumed in chapter 6 as a first approximation of real visual data.

This follows fairly clearly: First, the input units receive completely random data and so develop uniform weights. Secondly, the receptive fields of nearby second-layer units overlap. This means that the activations of nearby second-layer units will be correlated. Hence the third layer will be receiving structured data, and so can develop the interesting weight patterns described earlier.

Chapter 2 The problems of implementing neural networks that use lateral connections

Lateral connections are usually used in neural networks to inhibit units when other units are more strongly activated. Thus they encourage units to develop different weight vectors, and so ensure that units behave in different ways even if, as is common in neural network simulations, they are connected to identical inputs.

Here, I will illustrate some of the problems which arise when lateral inhibition is used in neural network simulations. I will first use principal components networks as examples, because they are one of the simplest unsupervised network types. Then I will give a description of problems which I encountered when trying to make a neural network develop 'feature detectors', of a kind used in my model of reading aloud (described in chapter 3).

Principal Components networks

Neural networks which find the principal components of their input data were briefly described in an earlier chapter. They are one of the simplest examples of unsupervised networks which use lateral connections. Here, I will give some examples of simulations of such networks in action, to demonstrate some of the problems involved in simulating such networks.

In all of the following diagrams the input patterns, which have varying numbers of dimensions, are represented as dots on a 2D graph. The weight vectors of the units are represented as vectors from the origin of the same graph, and the evolution of the weight vectors is also drawn so that one can see how the weights develop.

The inputs to the networks are identical in each case, and are generated from random numbers in such a way that the first principal component of the data is horizontal, the second vertical, and any others are naturally perpendicular to the first two, so cannot be easily shown. That the first principal component is along the -axis can be seen by noticing that the density of input patterns at the extremes of the -axis is high. The maximum length of the input vectors was 1; in the diagrams, the inputs are shown to a slightly smaller scale than the weight vectors for clarity.

This input method was chosen as it is easy to generate quickly, and has well defined principal components. In general, the principal component of the distribution is a vector whose components, expressed in terms of the axis, are all zero except for the component.

Because the inputs were generated using random numbers rather than choosing an input from a training set, each iteration was the presentation of just one pattern.

All units in the networks discussed in this chapter are linear, and their learning rule is the Oja learning rule, described in chapter 1, which is (ignoring the learnrate):

equation

Here, is input number , is the state of unit , and is the strength of the connection from input number to unit .

First, a single-unit network. The weight vector can be seen to quickly find the first principal component of the input data:



Figure 1
Single-unit P.C.A network.
2500 iterations
learnrate=0.01

The next diagram is for a two-unit network, which has just one lateral connection from unit 1 to unit 2, (i.e. unit 2 is inhibited by unit 1, but unit 1 is not affected by unit 2). This connection learns using an anti-Hebbian rule, as described earlier. This network looks like:

The weights quickly align themselves with the first two principal components, and it turns out that the lateral connection strength ends up at 0 for this network. Hence the two units end up uncorrelated, and the asymmetric connections mean that finding the state of the network doesn't require any iterating, if one calculates the state of unit 1 before that of unit 2.



Figure 2
Feed-forward weights of two-unit asymmetric P.C.A network.
1000 iterations
learnrate=0.1
lateral learnrate=0.2

Hence there is no problem with the time taken for the network activations to stabilise for each input pattern.

However, strictly one-way connections between units are not likely to occur in the brain, and the units above are activated very unequally. For example, the first unit, which responds to the first principal component, has a higher R.M.S. activation than any others, the second unit, which responds to the second principal component, has a higher R.M.S. activation than all but the first unit, etc.

A network with anti-Hebbian connections between all pairs of units can be constructed, which looks like:

The learning process for this network, when presented with the same input distribution, is illustrated next:



Figure 3
Two-unit symmetric P.C.A network.
3300 iterations
learnrate=0.01
lateral learnrate=0.02

The lateral connections in the above simulation were both . The variances of the two units were different though: 0.132 and 0.288 for the unit with the upper and lower vector respectively. These variances depend on the starting point of the vectors; in some circumstances it would be useful if the two units would share the variance equally, but this doesn't seem to happen.

N.B. the reason why the lateral connections keep the units uncorrelated is that the anti-Hebbian learning rule for the lateral connection from unit to unit is . Hence the average change in the strength of this lateral connection is . This means that the lateral connection will be at equilibrium when the two units at either end are uncorrelated. The minus sign makes this a stable equilibrium.

The fact that the lateral connections don't decay to zero in the symmetrical case is rather problematic, as it means that the network is always going to take some time to stabilise. In the last example, an average of 2.4 iterations were required to stabilise the network, where each iteration updated both units in a random order (The criteria for the network being stable was that the average percentage change in the states was less than 10%). While this actually means that only 0.4 of an extra iteration was needed, things get very much worse when there are more units.

For example, the next diagram is for three units:




Three-unit symmetric P.C.A network.
3800 iterations
learnrate=0.01
lateral learnrate=0.02

The weights for this network have significant components perpendicular to the page, but they are still very similar, and so the lateral connections needed to make the units uncorrelated are strong. This means that more iterations are required before the network stabilises - an average of 5.4 iterations.

For the record, and to give an idea of the magnitudes involved, the input weights for this network after training were:
Unit 1 Unit 2 Unit 3
Input 1 0.821 0.847 0.786
Input 2 0.417 -0.517 0.523
Input 3 0.38 -0.0507 -0.339
Input 4 -0.00126 0.0022 0.00698

The input data was constructed so that the principal components of the input data was that vector which has all components zero except for the . As can be seen, all units have a very small weight to the fourth input, so the network is only sensitive to the sub-space formed by the first three principal components.

The lateral connection strengths were:
Unit 1 Unit 2 Unit 3
Unit 1 -0.462 -0.748
Unit 2 -0.462 -0.42
Unit 3 -0.748 -0.42

and the correlations between units were:
Unit 1 Unit 2 Unit 3
Unit 1 0.13 -0.0037 -0.00046
Unit 2 -0.0037 0.22 -0.00071
Unit 3 -0.00046 -0.00071 0.114

In this last table, the diagonal elements show the variances of the units.

A five-unit network needed approximately 8.3 iterations before it gave a stable output for each input pattern.

Subjects in psychological experiments have reaction times of a few tenths of a second which, given the slow speed of neurons, doesn't allow much time for gradually approaching a stable set of neuron firing levels. This suggests that there is a very real problem with using networks, like these P.C.A. networks, which have strong lateral connections (Marr [1972], Sejnowski [1986]).

Feature-detecting networks

For the model of reading aloud described in chapter 3 (which, although it uses simple feature-detectors, doesn't make use of neural-network feature-detectors), I was interested in making a neural network in which each unit would develop a weight vector which was tuned to the presence or absence of particular groups of letters in the visual field. The networks tried never worked particularly well, so I ended up looking at the more general problem of how data should be processed, in particular using information theory, as described in chapters 4-6. However, there follows a description of the networks that I tried in order to make feature detectors:

I wanted the outputs of the network to be 'digital' - each unit either on or off depending on whether the unit's chosen group of letters was present. Because of this requirement, I used a variation of the differential state update rule used by Földiák [1990], .

The modification used was to omit the last term, resulting in a differential state update rule .

Also, the response function of the units was chosen so that the outputs were clipped between zero and one, otherwise the states could increase indefinitely. Like Földiák's network, each unit had a threshold which changed so that the unit's average activation was somewhere between a minimum and maximum allowed value (there has to be a range for the average activation so that the units can learn to detect features which have various frequencies).

The network wasn't particularly good at functioning as a feature detector, and certainly performed too poorly to be used in the reading model. Nevertheless, it exposed some of the pitfalls of simulating networks with lateral connections.

Various state update strategies were tried in an attempt to make the network reach a stable set of activations a quickly as possible, but even for a small network with 10 units and 104 inputs, iterating the state update equation took 10-20 iterations, with several 'intelligent' tweaks such as changing the time step to be as large as possible while keeping the state-changes reasonably small. The input data was the first four letters of a list of words, encoded by having 26 input units for each letter position, and turning on the four input units which correspond to the input word.

There were problems with using both batch state update and random state update. Batch update gave oscillations, while random state update, although it reached a stable set of activations, could sometimes make the network respond differently to identical input patterns, because the first unit to be chosen to be updated would inhibit other units, and so stand a better chance of being on after the network had stabilised.

It was hoped that using an adjustable threshold would make units learn to respond to a highly specific input pattern, and so develop a set of weights which could be described as 'digital', i.e. each weight either very strong or very weak. This happened only to a limited extent though. It was also found that even if the input weights were fixed (e.g. zero learnrate), the units could still decorrelate themselves and maintain a reasonable average activation by changing just the lateral connections and the thresholds.

Chapter 3 Using unsupervised feature-detectors in a model of reading aloud2


2  This chapter is based on my paper of the same title published in the Irish Journal of Psychology, Smith [1993].


Overview

In this chapter, a model of reading is described that uses simple unsupervised feature-detectors (not neural network-based) and a very rudimentary model of eye movements to learn how to convert input words into phonemes. The model will function as a table lookup where possible; otherwise it will piece together a pronunciation for the input word. The learning process has been deliberately chosen to be a statistical processing of the training database (much of which could be implemented as simple unsupervised neural networks) with no explicit rule-based learning. This does mean that the performance of the model is relatively poor compared to back-propagation neural network models of reading aloud (Rumelhart & McClelland [1986], Sejnowski & Rosenberg [1987], Seidenberg & McClelland [1989], Bullinaria [1994][1995][1996]), but it has the advantage that it can deal with any length of word, and it depends only on the simple statistics of the training set of words. Also, with specific teaching it is expected that the performance will improve markedly.

The model is similar to Analogy-based models such as Dedina & Nusbaum [1986][1991], Sullivan & Damper [1990][1991][1992]. Pronouncing by analogy is also discussed in Glushko [1979][1981], while Damper [1995] gives a general review of self-learning models of reading aloud.

The model can also learn to perform the reverse task of converting input sounds into words.

Introduction

The relationships between letters/words and their pronunciations (usually in terms of phonemes) has been called a 'quasi-regular system' (Seidenberg & McClelland [1989]) because although there are well defined rules, there are also many different exception words, the pronunciations of which are completely unrelated to any other words. An example of a rule which normally applies is the pronunciation of the letter 'a' in words such as 'apple' and 'plan'; this rule is sometimes overridden by the more complicated 'rule of "e" ' in words such as 'ate' and 'plane'. There are many more examples of these two rules in action (which is why they are given the status of rules in the first place). However, the pronunciation of the first syllable of 'colonel' does not follow rules that are applicable to any other words; similarly for 'yacht'. In between these extremes is a range of rules which are often applicable, but also frequently broken, with the result that reading is a very difficult task for a potential computational model.

Assuming that the brain is some sort of vast neural network, and ignoring the usual problems of not knowing exactly what a neuron actually does etc., the problem is how to build a trainable model of reading which uses neural-network units. It would obviously be desirable for a model of reading to learn how to convert text to phonemes by being presented with examples of correct mappings, as that is how people learn to read (albeit with some specialised instruction); this is just the sort of thing that one would expect a neural network to be capable of. However, the only way to get a neural network to do anything as complicated as reading is to use a training method like back propagation, which has well known (though disputed) problems concerning biological plausibility.

Neural networks training algorithms which don't use back propagation do exist, but are only capable of performing simple data transformations such as principal component analysis or feature detection. Also, Hebbian learning can be viewed as making units function as coincidence detectors. While such transformations on their own couldn't perform a complicated task such as reading, they can be used to make the reading task easier in the same way as having edge detectors makes visual recognition tasks easier. The model described here has feature detectors that are trained to respond to commonly occurring groups of letters such as 'ing' or 'th', and similarly for commonly occurring groups of phonemes. It is hypothesized that such feature detectors could be part of the visual and auditory/vocal systems as one would expect the visual system to automatically build up such a list of commonly seen patterns, and the auditory system to learn to recognise commonly heard sounds (this process could perhaps be the result of the information-theoretic generation of compressed representations, discussed in later chapters).

The input representation in neural network models

The way the input word is presented to a conventional neural network model is important. If the letters are presented all at once (e.g. by having a set of 26 units for each letter position), the network will have no indication that a particular letter represents the same symbol irrespective of where it is in the word. On the other hand, presenting the letters one by one (maybe in a moving window, as in NetTalk, Sejnowski & Rosenberg [1987]) is problematic because there are usually fewer phonemes than letters in words (an extreme example is 'eight' which has five letters but only 2 phonemes). This would mean that the network would have to learn to wait until it had seen enough letters before outputing a phoneme. NetTalk avoided this issue by having a pre-processor which added blanks to the sound part of the input, so that the text and sound were always aligned. Seidenberg & McClelland [1989] used a different approach, in which the whole word was presented to the network in one go, but using a distributed encoding of the presence or absence of Wickelfeatures (three adjacent letters / phonetic features) in order to encode the largely position-invariant nature of letter pronunciations. Unfortunately, the use of this scheme meant that the output of Seidenberg & McClelland's network was very difficult to convert into a list of phonemes, because it was in such a distributed representation. In fact, to find out what the output of their network actually was, Seidenberg & McClelland were forced to work backwards; they first calculated what the network output would be for all plausible output sounds, and then chose the one that was closest to the actual output.

Another criticism of the Seidenberg & McClelland model is that although it performs very well on the 2998-word data set it was trained on, it would not be easy to extend the model to multi-syllable words, as a Wickelfeature description of a long word is more likely to be ambiguous and difficult to decode.

Evidence from eye movements

The above models all assume some sort of fixed method of presenting the input to the model, either in a moving window or as Wickelfeatures. However, the input data about the text being read comes originally from the eyes, so it would make sense to look at how the eyes take in the text.

There is evidence that sometimes a word is fixated on more than once, and that two short words are sometimes read with just one fixation. For example R. E. Morrison [1983], in an investigation into fixation positions for different viewing distances, (where he found that fixation points in words were unaffected by viewing distance), gave the following figure with fixation points of a subject together with the text the subject was reading:


The crazed boy was locked into a cage
. . . . . . .

The unruly guy was heaved into a cell
. . . . . . .


(the two sentences were at different distances from the subject, but that is not relevant here.)

This shows that that the input to the text-to-phoneme part of the brain is sometimes already roughly segmented into usefully sized pieces that are smaller than whole words. Of course, the eye movements that cause this segmentation must be at least partially controlled by the text to phoneme processing module, but this doesn't mean that one should expect a neural network model to do this automatically - such a mechanism would be almost impossible to develop from scratch, so a model would have to develop a different way of solving the problem. The fact that eye movements are potentially capable of segmenting words in a useful way, and appear to be used sometimes within a single word, means that they are important to a model of reading. Hence ignoring the existence of eye movements constrains the reading model in an unsatisfactory way, for example Seidenberg & McClelland's use of distributed Wickelfeatures. NetTalk's moving window is also unsatisfactory, because people's eyes don't progress smoothly from letter to letter of the word they are reading.

If eye movements are used as the main method of splitting up words, this would be good news for neural network models in general because these models on their own are completely unsuited to problems that involve looking at different parts of input data in a serial manner. A system that has available the facility to control (via muscular control) the position of attention is more appropriate to neural networks.

Of course, eye movements don't perform segmentation completely, as the information from the eyes is not cut off at sharp end points. The important point is that they allow the use of position-specific feature detectors (e.g. simple unsupervised neural networks) instead of much more complicated position-independent feature detectors.

It should be pointed out that eye movement data is, of course, very complicated. As a result, the model described below doesn't claim to be a model of eye movements, but rather to incorporate, and make use of, the existence of eye movements in a model of reading.

General description of the model

Once some method of segmentation has been found for the model, the problem remains of what to do with the parts of words that form the input to the model. The method chosen here is probably the simplest possible: build up a list of known parts of words (referred to from now on as text features), and then find the best pronunciation for each of these text features by finding the string of phonemes (referred to from now on as sound features) that most often occurs in the same words. The finding of the best pronunciation for each text feature is done in a very simple statistical way; it is described in the next section.

When required to pronounce a word, the model simply parses the word into known text features, and then outputs the pronunciations of these text features in the same order as they are found in the word. The model attempts to parse the input word into the longest known text features possible, so that for instance 'th' would not be split up if it was recognised as a single text feature. Thus the model is effectively a table lookup, but at the sub-word level.

If an input word is matched by a single text feature, then the model will function as a normal table lookup, and so will be able to cope with exception words such as 'yacht'. Conversely, if a word as a whole is unknown, it will be split into known patterns of letters (text features) which the model would hopefully deal with correctly.

The model possesses similarities to that by Coltheart et al. [1992]; in their model, single letter to single phoneme matches are developed, but if a word contains more letters than phonemes, it attempts to map single letters to pairs of phonemes. Other more advanced rules are developed, for example to deal with cases in which a letter is often pronounced in more than one way. While their model certainly performs very well, it uses very high-level and complex decision processes in learning, which is a serious drawback.

Learning

This is a two-stage process:

The particular method used here to find the pronunciations of the text features (i.e. using a list of common sound features) was chosen partly because it is computationly efficient, but it also has the advantage that a lot of the work (the finding of text and sound features) is done without reference to the correlations between text and sounds in the data set, i.e. before explicit learning-to-read-aloud occurs. This would correspond, for example, to a child learning to recognise and pronounce certain sounds before he/she ever comes into contact with text, or to the child learning about text without any concurrent speech input.

Reading words

When presented with text to be converted into sound, the model looks for the presence of previously learnt text features, starting with the longest known features. The input word is scanned from its first letter for the presence of a recognised group of letters. When part of the word has been recognised as a feature, its position is noted, and then that part of the word is replaced by blanks, so that the letters can't be noted twice. The word is then re-scanned for the presence of other features, until all the letters have been blanked. (This will always happen, because single letters are recognised text features).

Essentially this means that the word is parsed into non-overlapping known text features. The important part is that longer features have precedence over shorter ones. This is so that words like 'eight' are not pronounced like 'e-i-gu-hu-t', the method provides for the automatic priority of exception pronunciations when needed. Also, priority is given to those text features that occurred frequently in the training set. This parsing method is reminiscent of eye movements in that it is not strictly left-to-right.

Having found the text features, each one is replaced by the sound feature with which it was most commonly associated in the training data, and these sound features are then output in the order that the text features occurred in the original word.

As an example, consider the pronunciation of the word 'training'. The first and longest text feature to be found might be 'rain'. The search is then continued on 't....ing', resulting in 'ing', and then a search in 't.......' resulting in 't'. The matching sounds for these text features are then found, and output in the same order: 't' 'rAn' 'iN'.

Results

The performance of the model depends on how many separate features are allowed to be recognised. The model will function as a lexicon if it is allowed to detect enough long features, otherwise the piecing-together procedure will be used. Using the Seidenberg & McClelland word list, it was possible to get 92% accuracy by allowing the model to generate so many text features that most words were dealt with in one piece (there were 3000 features allowed of each length from 1 to 6 letters/phonemes). Errors were still made because whole words sometimes occur inside other words (for example the word 'apt' also occurs in 'leapt' and 'rapt'), so the pronunciation chosen is not just dependent on the whole word. A possible way around this problem could be to have beginning-of-word and end-of-word symbols. Also, seven-letter words were not pronounced correctly because, in most cases, the word was split into a six-letter text feature whose corresponding pronunciation was the whole word. See this chapter's appendix for a complete list of the incorrectly-pronounced words.

When fewer features are allowed, forcing the model to build up the pronunciation of each word from a list of text features, the performance is very dependent on the quality of the mappings from text features to sound features. Most of the time, these mappings are sensible, but often the model makes mistakes, such as matching the text feature 'spe' to the sound feature 'sp' instead of 'spe'. This appears to occur because the sound of vowels is often much more variable than consonants, so that the longer sound feature 'spe' doesn't occur consistently enough with the text feature to have a higher score than the shorter sound feature 'sp'.

The mutual information given by the probability distributions of pairs of text/sound features can also be used as a measure to score the sound features3. This gives only slightly better performance.


3  I.e. the score for a particular pair of sound and text features is taken to be:. Here, and are the probabilities of a word containing the text feature, the probability of a word not containing the text feature and the probability of a word containing the text feature but not containing the sound feature etc. Note that this method doesn't require that the score is weighted by the length of the sound feature - the mutual information measure effectively corrects for rarely-occurring features automatically.


With 200 features of each length from 1 to 5 letters/phonemes, the model scored only 44% on the Seidenberg & McClelland data set. One shouldn't expect much more than this because the model has no way of coping with the large number of exception words when it has so few recognised features. Also, the only criterion that the model has to go on is how many words contain pairs of features; children are given the simpler matches explicitly when they are taught to read.

A typical error is to map the letters 'pt' to the phonemes 'ept' instead of to 'pt'. This occurs because out of all the words that have the letters 'pt' in them, many contain the letters 'ept' and hence the phonemes 'ept'. For example, the five highest scores for the text feature 'pt' in one run were:
Sound feature
Score
ept 338
pt 322
ep 292
ke 186
p 145

(These scores were obtained using the mutual-information method)

The reverse error can sometimes occur, for example mapping the letters 'arch' to the phonemes 'rtS' instead of 'ortS'. In general, when an incorrect match is made, the score for the correct sound feature is almost as high as the chosen best match. For example the scores for the text feature 'oll' were:
Sound feature
Score
rOl 90
Ol 80
rO 61
l 47
O 35

Problems also occur when the sound feature that is the best match for a particular text feature is not part of the sound features list. This can be avoided by scanning all words containing a particular text feature for sound features, and picking the most common one, but this takes too much computer time for regular testing, especially with large training lexicons. When tried, this method results in an improvement from 44% to 52% words correct when trained and tested on the 2998-word Seidenberg & McClelland database, and using 200 features of each length from 1 to 5 letters/phonemes.

The performance of the model on non-words was simulated by using the same text and sound features that gave 92% accuracy on the Seidenberg & McClelland set, but forcing the model to use at least two text features per word. The performance dropped to 54%, but it should be remembered that this is including all the irregular words in the data set.

The model has also been tested on the reverse task of converting input phonemes into text. The difference between the two operations is that there are usually more letters than phonemes in words, and there are often alternative spellings for the same sound (e.g. leek-leak). Performance was again measured using the Seidenberg & McClelland database, and with 200 features of each length from 1 to 5 letters/phonemes.

Performance was only 31%, but many of the incorrect spellings were phonetically correct, for example: zink (zinc), tril (trill), scool (school), kee (key), fackt (fact), croke (croak), kare (care), ile (aisle). Counting these cases as correct increased the score to 49%.

Psychological data and problems

Double dissociations between the reading of non-words and exception words have been observed in patients with reading problems (Humphreys & Evett [1985], Coltheart et al [1980]). These could occur in this model:

An inability to pronounce non-words could be caused by:

Inability to pronounce exception words could be caused by:

Thus the model can be thought of as being able to account for double dissociation, even though it doesn't possess separate modules for phonological and lexical reading.

There is evidence (e.g. in Humphreys & Evett [1985]) that the pronunciation of novel words is slowed if the word is homophonic with a real word, compared to a similar novel word that is nonhomophonic; e.g. BRANE takes longer to pronounce than BRAME. The model described here would not show this effect, or any of the other small variations in response time found for various types of novel word, because it is modelling only the most basic level of reading aloud. There may be some interactions between different ways of parsing words that could account for some variations in response time, and a more detailed model of the feature-detectors might show variations in response time for different categories of words.

A possible weakness of the model is that only one pronunciation of each text feature is ever considered when reading; some sort of mixing of sound features akin to the operating of some analogy models of reading (such as Sullivan & Damper [1992]) may be needed, but this would greatly complicate the model.

The simplest improvement to the model would be to amend the chosen pronunciations of the text features whose pronunciations (derived from the statistics of the training set) cause errors when the model is tested. While this is contrary to the spirit of an automatically trainable model, it has some justification in the fact that children are taught explicit pronunciations for many single letters, and they certainly receive error-correcting feedback when they make a mistake.

In defence of the apparently poor performance of this model on non-words, it should be pointed out that people don't always give the same pronunciations of novel words. In Coltheart et al. [1992], the criteria for deciding whether a model's output is correct is that it should match a pronunciation made by any one of 20 human subjects.

This general class of model has positive aspects; it is capable of coping with any length of word, and the simple detectors of commonly occurring features could almost be expected to exist in the brain. The algorithm also works in reverse -the model can learn to convert phonemes into text, if the training data is swapped around. Hence the feature detectors developed are useful for transcribing as well as reading aloud.

Possible improvements and further work

The model is extremely simple, and there are many aspects which could be improved. Ideally, the a feature detector which responds to (say) 'ing' would be influenced slightly by the letters outside of its three-letter window, so that detection of features would be context dependent.

Some more specific ideas are:

One experiment which has not yet been conducted is to look at people's eye movements when they are presented with novel words. The model described here depends very heavily on the use of eye movements to segment novel words, so results of such experiments would be very interesting, and could potentially suggest developments to this kind of model.

Appendix

There follows a list of the incorrectly-pronounced words when the model was allowed to use up to 3000 text and sound features of each length from 1 to 6.

Each incorrectly-pronounced word is shown as two lines in the following form:

Text Sound
Text-features Sound-features

Where Text and Sound are the text and sound of the input word, Text-features is a space-separated list of the text features into which the model split the input word, and Sound-features is the corresponding list of space-separated sound features.

ache Ak
ache kaS

apt apt
apt pt

arc ork
arc ortS

arch ortS
arch rtS

are or
are Ar

arm orm
arm rm

aunt ant
aunt nt

bas bo
bas bAs

bass bas
bass bAs

bear bAr
bear bErd

bing biN
bing bindZ

boot bUt
boot bU

bough bW
bough b*t

bow bW
bow bO

braille brAl
b raille b brAl

breadth bredT
b readth b bredT

breathe brED
breath e breT r

brig brig
brig br

butt b^t
butt byUt

cat kat
cat ka

clan klan
clan kla

clot klot
clot kl

cloth kl*T
cloth kl

con k*n
con kOn

cop kop
cop kOp

crib krib
crib skrIb

crow krO
crow kr

cry krI
cry kript

cub k^b
cub kyUb

cut k^t
cut kyUt

deal dEl
deal delt

dear dEr
dear derT

do dU
do d

doe dO
doe d^z

dove d^v
dove dOv

draught draft
d raught d draft

dream drEm
dream dremt

drought drWt
d rought d r*t

ear Er
ear r

earth erT
earth rT

ease Ez
ease Es

fan fan
fan fa

flu flU
flu fl

fort fOrt
fort fOr

freight frAt
f reight f frAt

gag gag
gag gAdZ

gal gal
gal g

gas gas
gas ga

glad glad
glad gl

glimpse glimps
g limpse g glimps

go gO
go g

grad grad
grad gr

grim grim
grim gr

grip grip
grip gr

hack hak
hack ak

hag hag
hag ag

hair hAr
hair Ar

ham ham
ham am

hang haN
hang CAndZ

hank hank
hank ank

hard hord
hard ord

hare hAr
hare Ar

hark hork
hark ork

harm horm
harm orm

harp horp
harp orp

hart hort
hart ort

has haz
has As

haste hAst
haste Ast

hat hat
hat at

haw h*
haw *

hay hA
hay A

heal hEl
heal helT

heap hEp
heap Ep

hear hEr
hear her

heart hort
heart hor

heat hEt
heat Et

heck hek
heck ek

heel hEl
heel El

hell hel
hell el

help help
help elp

hem hem
hem Em

hen hen
hen en

hence hens
hence ens

here hEr
here Ar

hey hA
hey A

hick hik
hick ik

hid hid
hid Id

hide hId
hide Id

high hI
high I

hill hil
hill il

him him
him im

hip hip
hip ip

his hiz
his is

hit hit
hit it

hive hIv
hive Iv

hoc hok
hoc ok

hoe hO
hoe SU

hone hOn
hone On

hook huk
hook uk

hoop hUp
hoop Up

hoot hUt
hoot Ut

hop hop
hop op

horn hOrn
horn Orn

hose hOz
hose Oz

host hOst
host Ost

hot hot
hot ot

house hWs
house hW

house hWz
house hW

huck h^k
huck ^k

hug h^g
hug hyUdZ

hum h^m
hum ^m

hump h^mp
hump ^mp

hunk h^nk
hunk ^nk

hunt h^nt
hunt ^nt

hush h^S
hush ^S

hut h^t
hut ^t

jut dZ^t
jut dZ

laid lAd
laid plad

lead led
lead lEd

leap lEp
leap lept

lease lEs
lease lE

lid lid
lid lId

limb lim
limb klIm

live liv
live lIv

lose lUz
lose klOs

me mE
me m

mean mEn
mean ment

moot mUt
moot mU

no nO
no n

now nW
now nO

once w^ns
once ns

one w^n
one On

our Wr
our Or

pad pad
pad spAd

pal pal
pal p

past past
past st

pear pAr
pear perl

pie pI
pie pE

plus pl^s
plus pl^

prim prim
prim pr

pro prO
pro pr

quit kwit
quit kw

rang raN
rang rAndZ

rat rat
rat rAt

read rEd
read red

real rEl
real relm

rib rib
rib rIb

rind rInd
rind grind

road rOd
road br*d

rob rob
rob rOb

roe rO
roe TrOz

rot rot
rot r*T

rough r^f
rough r*

rouse rWz
rouse rW

route rWt
route rUt

row rO
row rW

sag sag
sag sAdZ

say sA
say sez

scar skor
scar sk

schnook Snuk
s chnook s Snuk

scour skWr
scour skerdZ

scourge skerdZ
s courge s skerdZ

scrap skrap
scrap skr

scratch skratS
s cratch s skratS

screech skrEtS
s creech s skrEtS

sear sEr
sear sertS

sit sit
sit sIt

ski skE
ski ski

slat slat
slat sl

sleight slAt
s leight s slAt

slid slid
slid sl

slim slim
slim sl

slop slop
slop sl

soot sut
soot sU

sooth sUT
sooth sU

sour sWr
sour sOrs

sow sW
sow sO

spa spo
spa sp

spat spat
spat sp

spin spin
spin sp

spit spit
spit sp

splurge splerdZ
s plurge s splerd

squeeze skwEz
s queeze s skwEz

stag stag
stag stAdZ

staunch st*ntS
s taunch s st*ntS

steal stEl
steal stelT

stealth stelT
s tealth s stelT

straight strAt
st raight st strAt

strange strAndZ
s trange s strAnd

strength streNT
st rength st streNT

stretch stretS
s tretch s stretS

strip strip
strip str

sue sU
sue swAd

suit sUt
suit swEt

sun s^n
sun s^

swam swam
swam swomp

swan swon
swan swank

swat swot
swat swo

teak tEk
teak stAk

tear tEr
tear tAr

tent tent
tent ten

than Dan
than an

the D^
the D

thin Tin
thin Ti

thou DW
thou T*t

though DO
though T*t

thought T*t
though t T*t t

through TrU
t hrough t TrU

thru TrU
thru Tr

ton t^n
ton tOn

tong t*N
tong t^N

toot tUt
toot tU

trip trip
trip tr

try trI
try trist

twelfth twelfT
t welfth t twelfT

use yUs
use Ws

vie vI
vie vyU

wag wag
wag wAdZ

was w^z
was w

wind wInd
wind nd

wind wind
wind nd

wing wiN
wing twindZ

won w^n
won wOnt

word werd
word rd

wound wUnd
wound nd

wound wWnd
wound nd

writ rit
writ rI

wrought r*t
w rought w r*t

yea yA
yea y

year yEr
year yern

Information theory, and an introduction to chapters 4-6

Unsupervised neural networks can only respond to statistical structure in their input data. What is needed in order to make real progress in understanding how unsupervised networks can be used in models of cognition is a concrete mathematical way of talking about statistical structure. Luckily, there is a branch of mathematics which is devoted to the analysis of statistical structure - information theory. For an introduction to information theory, see Shannon [1948][1963] and Plumbley [1991].

Information theory deals solely in terms of statistical structure - there is no (explicit) concept of data meaning anything. As far as an information theoretic analysis is concerned, a set of data is characterised by the set of probabilities of each different pattern occurring (this is much more informative in the continuous case than the discrete one). From this set of values, one can calculate the entropy of the dataset, and the average amount of information it conveys about another dataset and so on. Because of this reliance only on the probabilities of different input patterns, the particular bit-patterns (if the data is encoded in binary form) of the data are irrelevant. For example, an important calculation in information theory is to find the entropy of a set of patterns. The entropy is a measure of how random a set of data is, and is a maximum when each data pattern has the same probability. This is just entropy, , where is the number of patterns and is the probability of pattern occurring.

For example. if our dataset consists of four input patterns, with probabilities 0.3, 0.1, 0.4, 0.2, we could represent these four patterns in any way we want, for example as , or , where the subscripts denote the probability of each pattern. One could even represent the four patterns as bitmaps of the four characters abcd, with the same probabilities as above, i.e. probability of (a)=0.3, probability of (b)=0.1, probability of (c)=0.4 and probability of (d)=0.2. Information theory would treat all of these representations identically, with the entropy of this set of data being .

The reason why information theory is applicable to neural networks, and knowledge representation in general, is that when there is noise present the simple definition of entropy described above is affected, indirectly, by the actual bit patterns used to describe the patterns - usually when the data is represented with continuous values rather then discrete ones. In this case, there is an infinite number of possible patterns, because the noise will distort each pattern by a random amount.

As an example of this, consider a set of 4 one-dimensional patterns, where each pattern is a real number between (say) 0 and 10. Let the set of patterns be . We can represent this set of patterns as:




Probability distribution for a set of four discrete one-dimensional patterns.4


4  The probability distribution for these discrete patterns is a set of four delta-functions, so there is no scaling on the -axis.


If these patterns are distorted by random noise, the probability distribution in the space of patterns will look like:





Probability distribution for four one-dimensional patterns, after distortion by gaussian noise of width 0.4.

As can be seen, the probability distribution has changed, and this affects (for example) calculations of the entropy of the data. In particular, if the particular values of the input patterns are changed, while their probabilities are unchanged and the size of the noise is also unchanged, the probability density curve would change and so the entropy of the data would be different.

The reason why I consider information theory to be important to the study of models of cognition is that there is a strong link between the information-theoretic concept of maximising the amount of information transmitted by the output of a black-box about the black-box's input, and the intuitive idea that sense data should be represented in terms of high-level concepts. This will be explained in chapter 4. These ideas also correspond closely with the Minimum Description Length (M.D.L.) principle (Rissanen [1989]).

A consequence of linking information theory with the idea of developing high-level categories is that it provides a concrete mathematical idea of what high-level representations are. It will be shown in chapter 4 that one can actually derive the precise function which should be applied to sense data in order to transform it into a representation which is in terms of high-level categories. This work can be regarded as a generalisation of the ideas of Barlow and Földiák (Barlow [1989]; Földiák [1992]) on reducing redundancy. An example of the technique is described in chapter 5, in which the characteristics of the visual filter best fitted to simple visual data is derived, and shown to consist of elements which find, amongst other things, Fourier transforms of the input data.

These ideas about information theory and the analysis of sense data are extremely general, and as such are expressed most naturally using mathematics rather than, for example, as a neural network learning algorithm. Naturally, the reasons for preferring neural network models of cognition still apply, so I have developed a very crude neural network algorithm which attempts to learn weights that make the network process data in accordance with the theory. This learning algorithm will be described in chapter 6

Chapter 4 The Maximum-Entropy-Filter (M.E.F.) principle

The raw sense-data that is available to the brain is not completely random, but contains an enormous amount of structure. For example, most visual images have large areas of one colour, and sounds usually have harmonics present.

It is because raw data contains such structure that the brain is able to extract high-level structure and form an internal representation based on high-level structure and work with this rather than the original low-level data (Barlow [1989]).

Hence it is interesting, for example, to view the early visual system5 as a black box or filter which transforms the retinal image into a high-level representation. However, a 'high-level' representation, though sounding intuitively reasonable, is hardly a precise concept; we need to somehow define what a high-level representation is more rigorously before we can attempt to develop a visual filter which transforms a retinal image into such a high-level representation.


5  Throughout this chapter, I will concentrate on visual sense data. However, the main ideas presented could, in principle, be applied to hearing, touch, taste etc.


One characteristic of high-level representations is that they are very efficient - instead of describing an image of a car by giving the millions of pixel colours which make up the image, we just say something like 'A red Ferrari'. This works for most common images but, for example, it doesn't work very accurately when we try to describe the picture on an untuned television screen. So it would seem that high-level descriptions will only work accurately for a subset of the set of all possible input pictures, such as those which are seen very often, like people's faces.

This suggests that our internal high-level representation of the images on our retina is determined to a large degree by the statistics of our visual environment, are most detailed for the most common visual images.

Going back to the idea of making the high-level representation as efficient as possible, this requires that it doesn't contain redundancies between the elements of the high-level representation, as these could be removed, resulting in an even more compact representation. For example, we think of a picture of a car in terms of the make of the car, its colour, its orientation on the page and so on. All of these properties are largely independent.

So far, it looks like high-level representations are characterised by being the most compact representation of the data (Rissanen [1989]). This works very well for discrete input patterns, and one could, for example, look at file-compression techniques used on computer systems to see how normal text could be represented. Wolff [1982] discusses how learning language is similar to data compression.

However, it is not obvious how to apply these ideas to the case where the input data is continuous, because in this case there is no way of making a more compact representation than the original input representation unless we are prepared to lose some detail.

The way of looking at the case of continuous input patterns developed here will be the following: we assume that our input data is presented with no distortion, and we have to transform it with some sort of filter into a representation which has limited bandwidth (or, equivalently, has some sort of output noise present) in such a way as to preserve as much of the original data as possible. The motivation for doing this is that, in the continuous case, trying to preserve as much data as possible in a limited bandwidth corresponds to representing the data as efficiently as possible in the discrete case. Hence it will be the continuous equivalent of the discrete case where we equated a compact representation with a high-level representation6. Also, we will assume that the output noise is constant additive noise, which is reasonable because it corresponds approximately to the case where the output has a constant finite resolution.


6  Other studies, for example Plumbley [1991], look at a similar problem, but also postulate input noise. We are interested here in developing high-level representations for data though, rather than how to best filter-out any input noise. Ignoring input noise will also allow a more general result to be derived.


There are two ways of seeing how this would work:

The first is to consider the earlier discussion about compact codes, and to remember that they look random. In the continuous case, this generalises to the internal representation having maximum-entropy.

The second is to use the idea, from information theory, of mutual information, and find an output representation which contains, on average, the most information about the input pattern, given that the output is distorted by noise. This calculation is carried out in appendix A at the end of this chapter, which concludes that the output should have maximum entropy.

Importantly, this result doesn't depend on the size of the output noise, as long as the output noise is constant.

Hence it seems that we can form high-level representations by trying to maximise the entropy of the representation. This will be referred to from now on as the maximum-entropy-filter (M.E.F.) principle.

A maximum-entropy representation might not sound like being equivalent to a high-level representation, and the arguments presented so far about high-level representations have very little concrete evidence to support them. However, the example presented in chapter 5, where the M.E.F. principle is applied to a very simplified model of early visual images gives a striking example of how such a representation could be formed, and of how the characteristics of the required filter seem very interesting and relevant to the study of real cognitive systems. Also, the next section shows how a system which generates a maximum-entropy representation would have to contain conventional constructs such as feature detectors.

Maximising entropy can be thought of as a generalisation of Barlow's principle (Barlow [1989]) that the sensory systems should minimise correlations within a representation. In particular, entropy is sensitive to 'hidden' structure within data which the correlation measure ignores. For example, if a three-bit representation is given the set of 'images' {000, 011, 101, 110}, one can calculate the correlation between any pair of bits, and it turns out to be zero. Hence such a scheme would contain no structure from a correlation point of view. However, the dataset is, in fact, highly redundant - each bit is the XOR of the other two. The entropy of the dataset exposes this redundancy very clearly, because it is less than the maximum entropy for a bit representation.

In general, the entropy measure will be sensitive to any redundancy, even a very complicated dependence between many parts of the representation. To see how this works, we need to look at what is involved when we try to form a maximum-entropy representation, and we can do this by looking at the probability distributions in the space of all input/output patterns

M.E.F. in terms of input/output probability distributions

There is a unique solution for making the output probability distribution have maximum entropy when there is uniform additive noise, which is to make the output probability density constant, i.e. make all output pattern patterns equally probable. This is a standard result, and appendix B at the end of this chapter contains a proof.

If the input patterns are discrete, and there is a finite number of them, there is no way of making such an output probability distribution, even with output noise, since the output probability distribution will be given by convolving the output patterns with the output noise, resulting (for gaussian output noise) in a a set of gaussians (one for each input pattern) centred at different places in output space. All we can do to increase the maximum entropy of this output distribution is to move the centres of the gaussians (i.e. the outputs with zero noise) around so that the output probability distribution is as flat as possible (where the measure of 'flatness' is the output entropy). The neural network learning rule described in chapter 6 uses this scheme because neural networks are most easily thought of as having discrete input patterns.

If our input patterns are infinite in number though, we are effectively dealing with an input probability distribution instead of individual input patterns. In this case, the M.E.F. principle says that the input patterns should be transformed in such a way that the output probability distribution is flat, and it is theoretically possible to do so.

Considering how the M.E.F. principle would work in terms of input/output probability distributions is particularly useful because the statistical structure in input data which we have been considering manifests itself in the input probability density being non-uniform.

For example, if the input pattern is a patterns of activations on a 100 by 100 array of pixels, giving 10,000 pixels in all, then input space would be 10,000-dimensional, and each point in this 10,000-dimensional space would represent a complete image. A set of input patterns would thus be represented by specifying the probabilities of any possible pattern occurring. If we are told that a set of input patterns for this 100 by 100 retina contains a disproportionate number of faces, this means that the probability of an input pattern being in that small region of the 10,000-dimensional space which corresponds to faces, is greater than normal. The goal of the M.E.F. principle is to find a representation for this input distribution which has a constant probability distribution.

At first sight, this seems almost trivial - a constant probability distribution is very easy to make. However, it is very important to remember that we have to make the flat output probability distribution using only the (very un-flat) input distribution. For example, we can't use a random number generator inside the neural network/black box which does the transformation (this intuitively obvious restriction is also formally required by the maximising-information-transmission calculation in appendix A, as the information conveyed by the outputs about the inputs would be seriously degraded if we used random number generators).

In fact, transforming the input data into a representation which has a flat probability distribution is an extremely complicated task. This reflects the difficulty of forming high-level representations.

To see what would happen for the case of faces having higher than normal probability of occurring, we could represent the input probability distribution by pretending it has only 2 dimensions, and using a third dimension to represent the probability distribution. This gives us a surface which represents the input probability distribution:



Figure 1
Input probability density

In order to make a uniform output probability distribution, the system would have to expand the region in input space, 'Faces', so that it occupies a large part of output space, while all other regions in input space, 'Other images', would have to be condensed into the remainder of the output space, making the probability distribution look like:



Figure 2
Output probability density

This is a very non-trivial operation, as the system would have to transform each input vector in one of two ways, according to whether the input vector was in input region 'Faces' or not. The simplest way of doing this would probably be to have a module which detects whether an input pattern is from within input region 'Faces' and causes the input vector to be transformed in one way if this is so, and a different way if not.

This illustrates how the seemingly abstract M.E.F. principle would lead to the development of conventional concepts such as feature detectors.

Note that the output representation automatically provides more output space for images of faces, and so can represent faces much more accurately than other images.

However, the M.E.F. principle goes further - any additional irregularities in the input probability distribution would need to be dealt with in a similar way. For example, it might be that within the 'Faces' region of input space, there is a certain area which has lower than normal probability (this could be the part which represents oriental faces, if the input data corresponds to the visual experience of a European person). In order to transform this more complicated input probability distribution into a flat probability distribution, the system would have to not only detect and selectively transform faces/non-faces but, within the space of faces, it would have to selectively detect and transform oriental faces. As before, this is a highly non-trivial task, and would have to involve the generation of feature detectors for oriental faces.

It is important to bear in mind that so far we have been considering a very over-simplified input probability distribution. In practise, the probability surface will be full of extremely complicated peaks and troughs, and of course this will all be in a space of many thousands of dimensions.

Hopefully it will be clear now that the M.E.F. principle requires that this sort of process be done for every level of detail in the input probability distribution and so the M.E.F. principle corresponds very closely to conventional ideas on how data should be represented in terms of many different high-level concepts. Very importantly, the M.E.F. principle allows this statistical and hierarchical decomposition to be viewed in a particularly simple information theoretic framework.

Incidentally, viewing the maximum-entropy-filter principle as meaning that the output probability should be flat, removes any considerations of how large the output noise is (as long as the noise is constant additive noise). This is because a flat probability distribution before noise will always give a flat probability distribution after constant additive noise7.


7  I don't know of a proof that a flat distribution before noise is the only solution, however.


In some studies of systems which transform input data into a different representation, (for example Atick and Redlich [1990]), varying the output noise-width governs how the system behaves - whether data is represented very accurately, or less accurately but with more resistance to noise. Because the M.E.F. principle can be viewed as making a uniform output probability distribution, it is unaffected by these considerations (except for the case of a finite number of input patterns, when the output probability distribution cannot be flat). This is very convenient as it removes the need to tune the noise width to the task being performed.

Uniqueness of a constant probability distribution

A slightly different way of looking at the M.E.F. process is to forget about information theory and high-level representations, and simply consider that, as has been mentioned before, all a black box such as an unsupervised neural network can do is to transform input data into a different representation.

If we look at this in terms of the input/output probability distribution, then transforming into a constant probability output distribution is, in some sense, a special transformation, because a uniform probability distribution is the simplest possible probability distribution.

Continuous mappings

Requiring a maximum entropy representation (or, equivalently for most realistic cases, constant probability representations) doesn't constrain the representation completely. For example, we could swap chunks of output space around, which would give a very different output representation while keeping the output entropy at the maximum. In particular, this means that a maximum-entropy output representation is possible whose topology is very different from the topology of the input representation - i.e. similar input patterns are not mapped to similar output patterns or, equivalently, the mapping from input to output is not a smooth mapping.

The reasons for requiring a maximum-entropy representation are not affected by this problem, but to conform to the overriding intuitive idea that internal representations should be usable high-level representations, we need to require that the mapping from input representation to output representation be as smooth as possible.

Neural networks will automatically generate reasonably smooth mappings so, practically, this isn't too serious a problem.

Also, it is possible that a purely information theoretic treatment would require a smooth mapping, when there is input noise: If we consider the case of there being discrete patterns with input and output noise, there is always a chance that any error caused by input noise will be cancelled by the random output noise. However, this beneficial effect will be strongest when nearby input patterns correspond to nearby output patterns, as this gives the greatest chance that output noise will be large enough to move the output back into the correct pattern. Hence I conjecture that smooth mappings would in fact follow from a more complete information-theoretic treatment, as well as being generated automatically by practical systems such as neural networks.

Input noise

Some of the work in Plumbley [1991] would seem to contradict the work presented above, as it deals with the same problem of deriving an ideal filter for particular input data, but ends up with a filter whose characteristics are affected by the signal-to-noise ratio of the input data at various frequencies. However, input noise is not considered in the M.E.F. principle, which is only about forming high-level categories from input data, rather than processing input data which has significant noise present in such a way as to reduce the effect of the input noise.

To optimally transform input data which has input noise into a high-level representation, one would have to first increase the signal-to-noise ratio in each input line using filters such as in Plumbley [1991], and then transform the multi-input data in accordance with the M.E.F. principle.

Psychological limitations of the M.E.F. principle

The M.E.F. principle assumes that the brain is interested in those high-level categories which are enshrined in the statistics of the input data. This works well in the case of pictures of people's faces, for example, because faces clearly form a large fraction of the visual experience of people (although only because people's eyes are directed towards other faces). Also, work such as Finch & Chater [1992], in which words are classified into very natural-looking categories purely by grouping together those words which usually occur in similar contexts, suggests that there is much useful information in the statistics of ordinary text.

However, it is probably the case that evolution has designed the brain to pay special attention to (and develop high-level representations for) some aspects of sense-data which are actually quite rare. For example, the instinctive reflex to duck when a fast-moving object approaches one's head is unlikely to be helped by the visual system learning about such objects just from experience, as this type of situation doesn't occur very often. Hence maybe there would have to be hard-wired detection of the trajectory and speed of such objects in the brain.

This idea also has relevance to the problem of dealing with input noise discussed above. How could a purely statistical system know what noise is present in input data? - such information is needed in order to determine the optimal filters in Plumbley [1991]. One might think that noise gives itself away by being a completely random element of the input, but this can apply equally to useful parts of the input data. For example, the colour of Smarties covers the whole spectrum fairly evenly, but it would be a poor visual system indeed which decides from this that the colour content of the retinal images of a pile Smarties is just random noise and so should be discarded.

Hence we cannot expect the purely statistical technique such as the M.E.F. principle to work on its own if there is significant input noise to a system. This noise would have to be dealt with using a hard-wired method such as evolving low-pass filters in early visual pathways, as described in Plumbley [1991] or Atick & Redlich [1993].

How to construct an ideal filter - the Filter as Inverse Generator (F.I.G.) theorem.

We have determined that the output of a filter should have maximum entropy, as this maximises the mutual information between the filter's output and the input and, more importantly, seems to be equivalent to representing the data in a high-level way.

There is a simple trick which can sometimes aid the theoretical calculation of a filter which converts some particular data into a maximum-entropy representation. This trick will be used in the next c