Copyright © Bill Wilson, 1998 - 2012
You should use The Machine Learning Dictionary to clarify or revise concepts that you have already met. The Machine Learning Dictionary is not a suitable way to begin to learn about Machine Learning.
Further information on Machine Learning can be found in the class web page lecture notes section. Other places to find out about machine learning would be the AAAI (American Association for Artificial Intelligence) Machine Learning page and their AI Reference Shelf for less specific information.
Other related dictionaries:
The Prolog Dictionary -
URL: http://www.cse.unsw.edu.au/~billw/prologdict.html
The Artificial Intelligence Dictionary -
URL: http://www.cse.unsw.edu.au/~billw/aidict.html
The NLP Dictionary (Natural Language Processing) -
URL: http://www.cse.unsw.edu.au/~billw/nlpdict.html
The URL of this Machine Learning Dictionary is http://www.cse.unsw.edu.au/~billw/dictionaries/mldict.html
This dictionary is limited to the ML concepts covered in COMP9414 Artificial Intelligence, at the University of New South Wales, Sydney:
activation level | activation function | Aq | asynchronous | attributes | axon | backed-up error estimate | backpropagation | backward pass in backpropagation | bias | biological neuron | C4.5 | C5 | cell body | clamping | classes in classification tasks | concept learning system (CLS) | conjunctive expressions | connectionism | covering algorithm | decision trees | delta rule | dendrite | entropy | epoch | error backpropagation | error surface | excitatory connection | expected error estimate | feature | feedforward networks | firing | forward pass in backpropagation | function approximation algorithms | generalization in backprop | generalized delta rule | gradient descent | hidden layer | hidden unit / node | hypothesis language | ID3 | inhibitory connection | input unit | instances | Laplace error estimate | layer in a neural network | learning program | learning rate | linear threshold unit | local minimum | logistic function | machine learning | momentum in backprop | multilayer perceptron (MLP) | neural network | neurode | neuron (artificial) | node | noisy data in machine learning | observation language | output unit | over-fitting | perceptron | perceptron learning | propositional learning systems | pruning decision trees | recurrent network | sequence prediction tasks | sigmoidal nonlinearity | simple recurrent network | splitting criterion in ID3 | squashing function | stopping criterion in backprop | supervised learning | symbolic learning algorithms | synapse | synchronous | target output | testing | threshold | training pattern | total net input | total sum-squared error | trainable weight | tree induction algorithm | unit | unsupervised learning | weight | weight space | windowing in ID3 | XOR problem
A
In summary, the activation level is the result of applying a squashing
function to the total net input.
In the asynchronous case, if the yellow node fires first, it uses
the then-current value of its input from the red node to determine its
output in time step 2; the red node, if it fires next, will use
the updated output from the yellow node to compute its new output in
time step 3.
In summary, the output values of the red and yellow nodes
in time step 3 depend on the outputs of the yellow and red nodes in time
steps 2 and 1, respectively.
In the synchronous case, each node obtains the current output of
the other node at the same time, and uses the value obtained to
compute its new output (in time step 2).
In summary, the output values of the red and yellow nodes
in time step 2 depend on the outputs of the yellow and red nodes in time
step 1.
This can produce a different result from the asynchronous method.
Some neural network algorithms are firmly tied to synchronous updates,
and some can be operated in either mode. Biological neurons normally
fire asynchronously.
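The contrast between the two update modes can be sketched as follows (a toy two-node network in which each node's new output is a step function of the other node's output; the step function and weight are illustrative assumptions, not from the original):

```python
def step(x, threshold=0.5):
    """A simple threshold (step) activation function."""
    return 1.0 if x >= threshold else 0.0

def synchronous_update(red, yellow, w=1.0):
    # Both nodes read the other's *old* output at the same time.
    new_red = step(w * yellow)
    new_yellow = step(w * red)
    return new_red, new_yellow

def asynchronous_update(red, yellow, w=1.0):
    # Yellow fires first using red's current output; red then fires
    # using yellow's *updated* output.
    new_yellow = step(w * red)
    new_red = step(w * new_yellow)
    return new_red, new_yellow

print(synchronous_update(1.0, 0.0))   # red sees old yellow = 0; yellow sees old red = 1
print(asynchronous_update(1.0, 0.0))  # red sees yellow's *new* output instead
```

Starting from red = 1, yellow = 0, the two modes give different results, which is exactly the point made above.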
Attributes are sometimes also called features.
B
See also expected error estimate.
The backward pass starts at the output
layer of the
feedforward network, and updates the incoming weights to
units in that layer using the delta rule. Then it works backward,
starting with the penultimate layer (last
hidden layer), updating the incoming weights to those layers.
Statistics collected during the forward pass
are used during the backward pass in updating the weights.
This has the effect of giving each hidden or output unit a trainable
threshold, equal to the value of
the weight from the bias unit to the unit.
See also propositional learning systems
and covering algorithm.
The algorithm - given a set of examples:
The local gradient, for an output node, is
the product of the derivative of the squashing
function evaluated at the total net input to node j, and
the error signal (i.e. the
difference between the target output and the actual output). In the
case of a hidden node, the local gradient is
the product of the derivative of the squashing function (as above) and
the weighted sum of the local gradients of the nodes to which node j
is connected in subsequent layers of the net. Got it?
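The two cases can be written out directly (a minimal sketch using the logistic squashing function, whose derivative at net input v is φ(v)(1 − φ(v)); the function names are illustrative):

```python
import math

def phi(v):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-v))

def phi_deriv(v):
    """Derivative of the logistic function at net input v."""
    y = phi(v)
    return y * (1.0 - y)

def local_gradient_output(net_j, target_j):
    # Output node: phi'(net_j) times the error signal (target - actual).
    return phi_deriv(net_j) * (target_j - phi(net_j))

def local_gradient_hidden(net_j, downstream):
    # Hidden node: downstream is a list of (delta_k, w_kj) pairs for the
    # nodes in subsequent layers that node j feeds into.
    return phi_deriv(net_j) * sum(d * w for d, w in downstream)
```

For example, an output node with zero net input and target 1 has local gradient 0.25 × (1 − 0.5) = 0.125.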
The rationale for this is as follows: –log2(p) is the amount of information in bits associated with an event of probability p - for example, with an event of probability ½, like flipping a fair coin, –log2(p) is –log2(½) = 1, so there is one bit of information. This should coincide with our intuition of what a bit means (if we have one). If there is a range of possible outcomes with associated probabilities, then to work out the average number of bits, we need to multiply the number of bits for each outcome (–log2(p)) by the probability p and sum over all the outcomes. This is where the formula comes from.
Entropy is used in the ID3 decision tree induction algorithm.
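The formula described above is small enough to state as code (log base 2 gives the answer in bits; zero-probability outcomes contribute nothing, since p log2(p) → 0 as p → 0):

```python
import math

def entropy(probs):
    """Average information in bits: -sum of p * log2(p) over all outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin: 1 bit
print(entropy([1.0]))        # a certain outcome: 0 bits
```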
Error backpropagation learning is often familiarly referred to just as
backprop.
The "point" defined by the current set of weights is termed
a point in weight space. Thus weight space is the
set of all possible values of the weights.
See also local minimum
and gradient descent.
See also forward pass,
backward pass,
delta rule, error surface,
local minimum,
gradient descent and momentum.
S is the set of instances in a node
k is the number of classes (e.g. 2 if instances are just being classified into 2 classes: say positive and negative)
N is the number of instances in S
C is the majority class in S
n out of N examples in S belong to C
In practice, the nodes of most feedforward nets are partitioned into
layers - that is, sets of nodes, and the layers may be numbered
in such a way that the
nodes in each layer are connected only to nodes in the
next layer - that is, the layer with the next higher number.
Commonly successive layers are totally interconnected - each
node in the earlier layer is connected to every node in the next
layer.
The first layer has no input connections, so consists of
input units and is termed the input
layer (yellow nodes in the diagram below).
The last layer has no output connections, so consists of
output units and is termed the output
layer (maroon nodes in the diagram below).
The layers in between the input and output layers are termed
hidden layers, and consist of
hidden units (light blue nodes and brown nodes in the diagram below).
When the net is operating, the activations of
non-input neurons are computed using each neuron's
activation function.
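Operating a layered net as described can be sketched as a forward pass (a minimal illustration, assuming fully interconnected successive layers and a logistic activation function; the weight layout — weights[l][j][i] from unit i of one layer to unit j of the next — is an assumption for the sketch):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward_pass(inputs, weights):
    """Propagate input activations layer by layer through the net."""
    activations = list(inputs)   # input layer: no activation function applied
    for layer in weights:
        # Each unit squashes its total net input (weighted sum of the
        # previous layer's activations).
        activations = [logistic(sum(w * a for w, a in zip(row, activations)))
                       for row in layer]
    return activations           # activations of the output layer
```

A single unit with zero weight, for instance, outputs logistic(0) = 0.5 regardless of its input.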
See also symbolic learning algorithms.
Feedforward network. All connections (arrows) are in one direction; there are
no cycles of activation flow (cyclic subgraphs). Each colour identifies
a different layer in the network. The layers 1 and 2 are fully interconnected,
and so are layers 3 and 4. Layers 2 and 3 are only partly interconnected.
Moreover, with large complex sets of training patterns, it is likely that some errors may occur, either in the inputs or in the outputs. In that case, and again particularly in the later parts of the learning process, it is likely that backprop will be contorting the weights so as to fit precisely around training patterns that are actually erroneous! This phenomenon is known as over-fitting.
This problem can to some extent be avoided by stopping learning early. How does one tell when to stop? One method is to partition the training patterns into two sets (assuming that there are enough of them). The larger part of the training patterns, say 80% of them, chosen at random, form the training set, and the remaining 20% are referred to as the test set. Every now and again during training, one measures the performance of the current set of weights on the test set. One normally finds that the error on the training set drops monotonically (that's what a gradient descent algorithm is supposed to do, after all). However, error on the test set (which will be larger, per pattern, than the error on the training set) will fall at first, then start to rise as the algorithm begins to overtrain. Best generalization performance is gained by stopping the algorithm at the point where error on the test set starts to rise.
Δwji(n) = α Δwji(n–1) + η δj(n) yi(n)
in the notation of Haykin's text (Neural networks - a comprehensive foundation). The constant α is termed the momentum constant and can be adjusted to achieve the best effect. The second summand corresponds to the standard delta rule, while the first summand says "add α × the previous change to this weight."
This new rule is called the generalized delta rule. The effect is that if the basic delta rule would be consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.
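The update Δwji(n) = α Δwji(n–1) + η δj(n) yi(n) translates directly into code (a one-line sketch; the default values of η and α are illustrative assumptions):

```python
def weight_change(prev_change, delta_j, y_i, eta=0.1, alpha=0.9):
    """Generalized delta rule: momentum term plus the standard delta rule term.

    prev_change: the previous change to this weight, delta-w(n-1)
    delta_j:     the local gradient of the downstream node j
    y_i:         the output of the upstream node i
    """
    return alpha * prev_change + eta * delta_j * y_i
```

With prev_change = 0 this reduces to the standard delta rule; with repeated pushes in the same direction the momentum term accumulates, as the text describes.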
When an artificial neural network learning algorithm causes the weights of the net to change, it will do so in such a way that the current point on the error surface will descend into a valley of the error surface, in a direction that corresponds to the steepest (downhill) gradient or slope at the current point on the error surface. For this reason, backprop is said to be a gradient descent method, and to perform gradient descent in weight space.
See also local minimum.
In layered nets, each neuron in a given layer is connected by trainable weights to each neuron in the next layer.
See also observation language.
They achieve this by being able to alter their internal
state, q. In effect, they are computing a function
of two arguments, P(x | q) = y.
When the program is in learning mode, the program computes
a new state q' as well as the output y, as
it executes.
In the case of
supervised learning,
in order to construct q', one needs a set
of inputs xi and corresponding target outputs zi
(i.e. you want P(xi | q) = zi
when learning is complete). The new state function L
is computed as:
L(P, q, ((x1,z1), ...,
(xn, zn))) = q'
See also unsupervised learning,
observation language, and
hypothesis language.
When an artificial neural network
learning algorithm causes the total error
of the net to descend into a valley of the error surface, that
valley may or may not lead to the lowest point on the entire
error surface. If it does not, the minimum into which the total
error will eventually fall is termed a local minimum. The
learning algorithm is sometimes referred to in this case as
"trapped in a local minimum."
In such cases, it usually helps to restart the algorithm with
a new, randomly chosen initial set of
weights - i.e. at a new random point in weight space.
As this means a new starting point on the
error surface, it is likely to lead into a different valley, and
hopefully this one will lead to the true (absolute) minimum
error, or at least a better minimum error.
A related function, also sometimes used in backprop-trained
networks, is 2φ(x)–1, which can also be expressed
as tanh(x/2). tanh(x/2) is, of course, a smoothed version
of the step function which jumps from –1 to 1 at x = 0, i.e.
the function which = –1 if x < 0, and = 1 if x ≥ 0.
It is used to transform the total net input
of an artificial neuron in some implementations
of backprop-trained networks.
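The identity claimed above, 2φ(x) – 1 = tanh(x/2) where φ is the logistic function, is easy to check numerically:

```python
import math

def phi(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# 2*phi(x) - 1 rescales the logistic's (0, 1) range to (-1, 1),
# and coincides with tanh(x/2) at every point.
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert abs(2 * phi(x) - 1 - math.tanh(x / 2)) < 1e-12
```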
Learning might or might not occur, depending on the type of neural network and the mode of operation of the network.
Also referred to as a neurode, node, or unit.
See also decision tree pruning and
generalization in backprop.
See graph for a node in a graph.
See also hypothesis language.
Perceptrons were originally used as pattern classifiers, where the term pattern is here used not in the sense of training pattern, but just in the sense of an input pattern that is to be put into one of several classes. Perceptual pattern classifiers of this sort (not based on perceptrons!) occur in simple animal visual systems, which can distinguish between prey, predators, and neutral environmental objects.
See also perceptron learning, and the XOR problem.
If for example, a node of the tree contains, say, 99 items
in class C1 and 1 in class C2, it is plausible that the 1 item
in class C2 is there because of an error either of classification
or of feature value. There can thus be an argument for regarding
this node as a leaf node of class C1. This is termed pruning
the decision tree.
The algorithm given in lectures for deciding when to prune is as
follows:
See Aq and covering
algorithm.
At a branch node that is a candidate for pruning:
A recurrent connection is one that is part of a directed cycle, although the term is sometimes reserved for a connection which is clearly going in the "wrong" direction in an otherwise feedforward network.
Recurrent networks include fully recurrent networks in which each neuron is connected to every other neuron, and partly recurrent networks in which greater or lesser numbers of recurrent connections exist. See also simple recurrent network.
This article is included for general interest - recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.
Simple recurrent nets can tackle tasks like this, because they do have a kind of memory for recording information derived from activation values from previous time steps.
This article is included for general interest - sequence prediction tasks are not part of the syllabus of COMP9414 Artificial Intelligence.
Simple recurrent networks can learn
sequence prediction tasks.
See also recurrent networks.
This article is included for general interest - simple recurrent
networks are not part of the syllabus of COMP9414 Artificial
Intelligence.
The algorithm, in outline, is as follows:
This leaves the question of how to calculate the information
gain associated with using a particular attribute A. Suppose
that there are k classes C1,
C2, ..., Ck,
and that of the N instances
classified to this node,
I1 belong to class C1,
I2 belong to class C2, ..., and
Ik belong to class Ck.
In terms of the example
used in the lecture notes, (see also
calculations
in lecture notes),
k = 2 since
there are just two classes, positive and negative.
I1 = 4 and I2 = 3, and N = 7,
and so p1 = 4/7 and p2 = 3/7, and E =
–p1log2(p1)
–p2log2(p2) = –(4/7)×log2(4/7) – (3/7)×log2(3/7).
In the example, the first attribute
A considered is size, and the first value of
size considered,
large, corresponds to a1,
in the example in the lecture notes, so J1,1 = 2 =
J1,2, and J1 = 4.
Thus q1,1 = J1,1/J1 = 2/4 = ½, and
q1,2 = J1,2/J1 = 2/4 = ½, and
E1 =
–q1,1log2(q1,1)
–q1,2log2(q1,2) = –½×log2(½) – ½×log2(½)
= 1.
Contrast unsupervised learning.
Such learning algorithms have the advantage that their
internal representations and their final output can be inspected
and relatively easily understood by humans.
See also function approximation
algorithms.
Let p1 = I1/N,
p2 = I2/N, ..., and
pk = Ik/N.
The initial entropy E at this node is:
–p1log2(p1)
–p2log2(p2) ...
–pklog2(pk).
Now split the instances on each value of the chosen attribute
A. Suppose that there are r attribute values
for A, namely a1, a2, ..., ar.
For a particular value aj, say, suppose that there are
Jj,1 instances in class C1,
Jj,2 instances in class C2, ..., and
Jj,k instances in class Ck,
for a total of Jj instances having attribute value
aj.
Let qj,1 = Jj,1/Jj,
qj,2 = Jj,2/Jj, ..., and
qj,k = Jj,k/Jj.
The entropy Ej associated with this attribute value
aj at this position is:
–qj,1log2(qj,1)
–qj,2log2(qj,2) ...
–qj,klog2(qj,k).
Now compute:
E – ((J1/N).E1
+ (J2/N).E2 + ... + (Jr/N).Er).
This is the information gain for attribute A.
Note that Jj/N is the estimated probability
that an instance classified to this node will have value
aj for attribute A. Thus we are weighting
the entropy estimates Ej by the estimated
probability that an instance has the associated attribute value.
Similarly E2 = 1 and J2 = 2
(size = small),
and E3 = 0 and J3 = 1
(size = medium)
so the final information gain is
E – ((J1/N).E1
+ (J2/N).E2 + ... + (Jr/N).Er)
= E – ((4/7)×E1 + (2/7)×E2 + (1/7)×E3),
which turned out to be about 0.13 in the example.
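The worked example for the size attribute can be checked numerically (a sketch; the entropy helper here takes class counts rather than probabilities):

```python
import math

def entropy(counts):
    """Entropy in bits of a node with the given per-class instance counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

E = entropy([4, 3])    # initial entropy: 4 positive, 3 negative instances
E1 = entropy([2, 2])   # size = large:  2 of each class, entropy 1.0
E2 = entropy([1, 1])   # size = small:  1 of each class, entropy 1.0
E3 = entropy([1])      # size = medium: a single class, entropy 0.0

# Weight each Ej by the estimated probability Jj/N of that attribute value.
gain = E - ((4/7) * E1 + (2/7) * E2 + (1/7) * E3)
print(round(gain, 2))  # about 0.13, as stated in the example
```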
Combinations of the two (e.g. whichever of the two occurs
first) and other stopping conditions are possible. See the reference
by Haykin (Neural networks: a comprehensive foundation
p. 153) for more details.
Compare bias.
See also activation function.
Given a training pattern,
its squared error is obtained by squaring the difference
between the target output of an output neuron and the actual output.
The sum-squared error, or pattern sum-squared error (PSS), is
obtained by adding up the sum-squared errors for each output neuron.
The total sum-squared error is obtained by adding up the PSS for each
training pattern.
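The three levels of error described above can be written out directly (a minimal sketch; each pattern is assumed to be a (targets, actuals) pair of equal-length sequences, one value per output neuron):

```python
def pattern_sum_squared_error(targets, actuals):
    """Sum, over output neurons, of the squared target-minus-actual error."""
    return sum((t - a) ** 2 for t, a in zip(targets, actuals))

def total_sum_squared_error(patterns):
    """Sum of the pattern sum-squared errors over all training patterns."""
    return sum(pattern_sum_squared_error(t, a) for t, a in patterns)
```

For example, a single output neuron with target 1 and actual output 0.5 contributes a squared error of 0.25 to its pattern's PSS.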
The other weights, the ones that are subject to change when
the net is learning, are referred to as trainable weights.
In toy problems like the XOR problem,
only a few training patterns (in the case of XOR, just 4) may be
used. Patterns used for testing the trained network are referred to
as test patterns. Compare instance.
This leaves the question of how to choose the best
attribute to split on at any branch node. This issue is handled
in the article on splitting criterion in ID3.
Without windowing, such an algorithm can be really slow, as it
needs to do its information gain calculations (see
tree induction algorithms) over huge amounts of data.
With windowing, training is done on a relatively small sample
of the data, and then checked against the full set of training data.
Here is the windowing algorithm:
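The algorithm listing itself does not survive in this extract. As a rough sketch of the idea described above (the step of adding misclassified instances back into the window is an assumption about standard windowing, not taken from the original; induce_tree and classify are placeholder names):

```python
import random

def windowing(instances, induce_tree, classify, window_size=20):
    """Train on a small random window, then grow it with the exceptions.

    Assumes induce_tree eventually produces a tree consistent with its
    window; each instance is a dict with a "class" key (an assumption).
    """
    window = random.sample(instances, min(window_size, len(instances)))
    while True:
        tree = induce_tree(window)
        # Check the tree against the *full* set of training data.
        misclassified = [x for x in instances
                         if classify(tree, x) != x["class"]]
        if not misclassified:
            return tree            # consistent with all the training data
        for x in misclassified:    # grow the window with the exceptions
            if x not in window:
                window.append(x)
```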