Copyright © Bill Wilson, 1998 - 2012
You should use The Machine Learning Dictionary to clarify or revise concepts that you have already met. The Machine Learning Dictionary is not a suitable way to begin to learn about Machine Learning.
Further information on Machine Learning can be found in the class web page lecture notes section. Other places to find out about machine learning would be the AAAI (American Association for Artificial Intelligence) Machine Learning page and their AI Reference Shelf for less specific information.
Other related dictionaries:
The Prolog Dictionary -
URL: http://www.cse.unsw.edu.au/~billw/prologdict.html
The Artificial Intelligence Dictionary -
URL: http://www.cse.unsw.edu.au/~billw/aidict.html
The NLP Dictionary (Natural Language Processing) -
URL: http://www.cse.unsw.edu.au/~billw/nlpdict.html
The URL of this Machine Learning Dictionary is http://www.cse.unsw.edu.au/~billw/dictionaries/mldict.html
This dictionary is limited to the ML concepts covered in COMP9414 Artificial Intelligence, at the University of New South Wales, Sydney:
activation level | activation function | Aq | asynchronous | attributes | axon | backed-up error estimate | backpropagation | backward pass in backpropagation | bias | biological neuron | C4.5 | C5 | cell body | clamping | classes in classification tasks | concept learning system (CLS) | conjunctive expressions | connectionism | covering algorithm | decision trees | delta rule | dendrite | entropy | epoch | error backpropagation | error surface | excitatory connection | expected error estimate | feature | feedforward networks | firing | forward pass in backpropagation | function approximation algorithms | generalization in backprop | generalized delta rule | gradient descent | hidden layer | hidden unit / node | hypothesis language | ID3 | inhibitory connection | input unit | instances | Laplace error estimate | layer in a neural network | learning program | learning rate | linear threshold unit | local minimum | logistic function | machine learning | momentum in backprop | multilayer perceptron (MLP) | neural network | neurode | neuron (artificial) | node | noisy data in machine learning | observation language | output unit | over-fitting | perceptron | perceptron learning | propositional learning systems | pruning decision trees | recurrent network | sequence prediction tasks | sigmoidal nonlinearity | simple recurrent network | splitting criterion in ID3 | squashing function | stopping criterion in backprop | supervised learning | symbolic learning algorithms | synapse | synchronous | target output | testing | threshold | training pattern | total net input | total sum-squared error | trainable weight | tree induction algorithm | unit | unsupervised learning | weight | weight space | windowing in ID3 | XOR problem
A
In summary, the activation level is the result of applying a squashing
function to the total net input.
In the asynchronous case, if the yellow node fires first, it uses
the then-current value of its input from the red node to determine its
output in time step 2; the red node, if it fires next, will use
the updated output from the yellow node to compute its new output in
time step 3.
In summary, the output values of the red and yellow nodes
in time step 3 depend on the outputs of the yellow and red nodes in time
steps 2 and 1, respectively.
In the synchronous case, each node obtains the current output of
the other node at the same time, and uses the value obtained to
compute its new output (in time step 2).
In summary, the output values of the red and yellow nodes
in time step 2 depend on the outputs of the yellow and red nodes in time
step 1.
This can produce a different result from the asynchronous method.
Some neural network algorithms are firmly tied to synchronous updates,
and some can be operated in either mode. Biological neurons normally
fire asynchronously.
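The contrast between the two update modes can be sketched as follows (a toy two-node network in which each node's new output is a step function of the other node's output; the step function and weight are illustrative assumptions, not from the original):

```python
def step(x, threshold=0.5):
    """A simple threshold (step) activation function."""
    return 1.0 if x >= threshold else 0.0

def synchronous_update(red, yellow, w=1.0):
    # Both nodes read the other's *old* output at the same time.
    new_red = step(w * yellow)
    new_yellow = step(w * red)
    return new_red, new_yellow

def asynchronous_update(red, yellow, w=1.0):
    # Yellow fires first using red's current output; red then fires
    # using yellow's *updated* output.
    new_yellow = step(w * red)
    new_red = step(w * new_yellow)
    return new_red, new_yellow

print(synchronous_update(1.0, 0.0))   # red sees old yellow = 0; yellow sees old red = 1
print(asynchronous_update(1.0, 0.0))  # red sees yellow's *new* output instead
```

Starting from red = 1, yellow = 0, the two modes give different results, which is exactly the point made above.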
Attributes are sometimes also called features.
B
See also expected error estimate.
The backward pass starts at the output
layer of the
feedforward network, and updates the incoming weights to
units in that layer using the delta rule. Then it works backward,
starting with the penultimate layer (last
hidden layer), updating the incoming weights to those layers.
Statistics collected during the forward pass
are used during the backward pass in updating the weights.
This has the effect of giving each hidden or output unit a trainable
threshold, equal to the value of
the weight from the bias unit to the unit.
See also propositional learning systems
and covering algorithm.
The algorithm - given a set of examples:
The local gradient, for an output node, is
the product of the derivative of the squashing
function evaluated at the total net input to node j, and
the error signal (i.e. the
difference between the target output and the actual output). In the
case of a hidden node, the local gradient is
the product of the derivative of the squashing function (as above) and
the weighted sum of the local gradients of the nodes to which node j
is connected in subsequent layers of the net. Got it?
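The two cases can be written out directly (a minimal sketch using the logistic squashing function, whose derivative at net input v is φ(v)(1 − φ(v)); the function names are illustrative):

```python
import math

def phi(v):
    """Logistic squashing function."""
    return 1.0 / (1.0 + math.exp(-v))

def phi_deriv(v):
    """Derivative of the logistic function at net input v."""
    y = phi(v)
    return y * (1.0 - y)

def local_gradient_output(net_j, target_j):
    # Output node: phi'(net_j) times the error signal (target - actual).
    return phi_deriv(net_j) * (target_j - phi(net_j))

def local_gradient_hidden(net_j, downstream):
    # Hidden node: downstream is a list of (delta_k, w_kj) pairs for the
    # nodes in subsequent layers that node j feeds into.
    return phi_deriv(net_j) * sum(d * w for d, w in downstream)
```

For example, an output node with zero net input and target 1 has local gradient 0.25 × (1 − 0.5) = 0.125.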
The rationale for this is as follows: –log2(p) is the amount of information in bits associated with an event of probability p - for example, with an event of probability ½, like flipping a fair coin, –log2(p) is –log2(½) = 1, so there is one bit of information. This should coincide with our intuition of what a bit means (if we have one). If there is a range of possible outcomes with associated probabilities, then to work out the average number of bits, we need to multiply the number of bits for each outcome (–log2(p)) by the probability p and sum over all the outcomes. This is where the formula comes from.
Entropy is used in the ID3 decision tree induction algorithm.
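The formula described above is small enough to state as code (log base 2 gives the answer in bits; zero-probability outcomes contribute nothing, since p log2(p) → 0 as p → 0):

```python
import math

def entropy(probs):
    """Average information in bits: -sum of p * log2(p) over all outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # a fair coin: 1 bit
print(entropy([1.0]))        # a certain outcome: 0 bits
```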
Error backpropagation learning is often familiarly referred to just as
backprop.
The "point" defined by the current set of weights is termed
a point in weight space. Thus weight space is the
set of all possible values of the weights.
See also local minimum
and gradient descent.
See also forward pass,
backward pass,
delta rule, error surface,
local minimum,
gradient descent and momentum.
S is the set of instances in a node
k is the number of classes (e.g. 2 if instances are just being classified into 2 classes: say positive and negative)
N is the number of instances in S
C is the majority class in S
n out of N examples in S belong to C
In practice, the nodes of most feedforward nets are partitioned into
layers - that is, sets of nodes, and the layers may be numbered
in such a way that the
nodes in each layer are connected only to nodes in the
next layer - that is, the layer with the next higher number.
Commonly successive layers are totally interconnected - each
node in the earlier layer is connected to every node in the next
layer.
The first layer has no input connections, so consists of
input units and is termed the input
layer (yellow nodes in the diagram below).
The last layer has no output connections, so consists of
output units and is termed the output
layer (maroon nodes in the diagram below).
The layers in between the input and output layers are termed
hidden layers, and consist of
hidden units (light blue nodes and brown nodes in the diagram below).
When the net is operating, the activations of
non-input neurons are computed using each neuron's
activation function.
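Operating a layered net as described can be sketched as a forward pass (a minimal illustration, assuming fully interconnected successive layers and a logistic activation function; the weight layout — weights[l][j][i] from unit i of one layer to unit j of the next — is an assumption for the sketch):

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward_pass(inputs, weights):
    """Propagate input activations layer by layer through the net."""
    activations = list(inputs)   # input layer: no activation function applied
    for layer in weights:
        # Each unit squashes its total net input (weighted sum of the
        # previous layer's activations).
        activations = [logistic(sum(w * a for w, a in zip(row, activations)))
                       for row in layer]
    return activations           # activations of the output layer
```

A single unit with zero weight, for instance, outputs logistic(0) = 0.5 regardless of its input.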
See also symbolic learning algorithms.
Feedforward network. All connections (arrows) are in one direction; there are
no cycles of activation flow (cyclic subgraphs). Each colour identifies
a different layer in the network. The layers 1 and 2 are fully interconnected,
and so are layers 3 and 4. Layers 2 and 3 are only partly interconnected.
Moreover, with large complex sets of training patterns, it is likely that some errors may occur, either in the inputs or in the outputs. In that case, and again particularly in the later parts of the learning process, it is likely that backprop will be contorting the weights so as to fit precisely around training patterns that are actually erroneous! This phenomenon is known as over-fitting.
This problem can to some extent be avoided by stopping learning early. How does one tell when to stop? One method is to partition the training patterns into two sets (assuming that there are enough of them). The larger part of the training patterns, say 80% of them, chosen at random, form the training set, and the remaining 20% are referred to as the test set. Every now and again during training, one measures the performance of the current set of weights on the test set. One normally finds that the error on the training set drops monotonically (that's what a gradient descent algorithm is supposed to do, after all). However, error on the test set (which will be larger, per pattern, than the error on the training set) will fall at first, then start to rise as the algorithm begins to overtrain. Best generalization performance is gained by stopping the algorithm at the point where error on the test set starts to rise.
Δwji(n) = α Δwji(n–1) + η δj(n) yi(n)
in the notation of Haykin's text (Neural networks - a comprehensive foundation). The constant α is termed the momentum constant and can be adjusted to achieve the best effect. The second summand corresponds to the standard delta rule, while the first summand says "add α × the previous change to this weight."
This new rule is called the generalized delta rule. The effect is that if the basic delta rule would be consistently pushing a weight in the same direction, then it gradually gathers "momentum" in that direction.
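The update Δwji(n) = α Δwji(n–1) + η δj(n) yi(n) translates directly into code (a one-line sketch; the default values of η and α are illustrative assumptions):

```python
def weight_change(prev_change, delta_j, y_i, eta=0.1, alpha=0.9):
    """Generalized delta rule: momentum term plus the standard delta rule term.

    prev_change: the previous change to this weight, delta-w(n-1)
    delta_j:     the local gradient of the downstream node j
    y_i:         the output of the upstream node i
    """
    return alpha * prev_change + eta * delta_j * y_i
```

With prev_change = 0 this reduces to the standard delta rule; with repeated pushes in the same direction the momentum term accumulates, as the text describes.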
When an artificial neural network learning algorithm causes the weights of the net to change, it will do so in such a way that the current point on the error surface will descend into a valley of the error surface, in a direction that corresponds to the steepest (downhill) gradient or slope at the current point on the error surface. For this reason, backprop is said to be a gradient descent method, and to perform gradient descent in weight space.
See also local minimum.
In layered nets, each neuron in a given layer is connected by trainable weights to each neuron in the next layer.
See also observation language.
They achieve this by being able to alter their internal
state, q. In effect, they are computing a function
of two arguments, P(x | q) = y.
When the program is in learning mode, the program computes
a new state q' as well as the output y, as
it executes.
In the case of
supervised learning,
in order to construct q', one needs a set
of inputs xi and corresponding target outputs zi
(i.e. you want P(xi | q) = zi
when learning is complete). The new state function L
is computed as:
L(P, q, ((x1,z1), ...,
(xn, zn))) = q'
See also unsupervised learning,
observation language, and
hypothesis language.
When an artificial neural network
learning algorithm causes the total error
of the net to descend into a valley of the error surface, that
valley may or may not lead to the lowest point on the entire
error surface. If it does not, the minimum into which the total
error will eventually fall is termed a local minimum. The
learning algorithm is sometimes referred to in this case as
"trapped in a local minimum."
In such cases, it usually helps to restart the algorithm with
a new, randomly chosen initial set of
weights - i.e. at a new random point in weight space.
As this means a new starting point on the
error surface, it is likely to lead into a different valley, and
hopefully this one will lead to the true (absolute) minimum
error, or at least a better minimum error.
A related function, also sometimes used in backprop-trained
networks, is 2φ(x)–1, which can also be expressed
as tanh(x/2). tanh(x/2) is, of course, a smoothed version
of the step function which jumps from –1 to 1 at x = 0, i.e.
the function which = –1 if x < 0, and = 1 if x ≥ 0.
It is used to transform the total net input
of an artificial neuron in some implementations
of backprop-trained networks.
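The identity claimed above, 2φ(x) – 1 = tanh(x/2) where φ is the logistic function, is easy to check numerically:

```python
import math

def phi(x):
    """Logistic function 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# 2*phi(x) - 1 rescales the logistic's (0, 1) range to (-1, 1),
# and coincides with tanh(x/2) at every point.
for x in (-3.0, -0.5, 0.0, 1.0, 4.0):
    assert abs(2 * phi(x) - 1 - math.tanh(x / 2)) < 1e-12
```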
Learning might or might not occur, depending on the type of neural network and the mode of operation of the network.
Also referred to as a neurode, node, or unit.
See also decision tree pruning and
generalization in backprop.
See graph for a node in a graph.
See also hypothesis language.
Perceptrons were originally used as pattern classifiers, where the term pattern is here used not in the sense of training pattern, but just in the sense of an input pattern that is to be put into one of several classes. Perceptual pattern classifiers of this sort (not based on perceptrons!) occur in simple animal visual systems, which can distinguish between prey, predators, and neutral environmental objects.
See also perceptron learning, and the XOR problem.
If for example, a node of the tree contains, say, 99 items
in class C1 and 1 in class C2, it is plausible that the 1 item
in class C2 is there because of an error either of classification
or of feature value. There can thus be an argument for regarding
this node as a leaf node of class C1. This is termed pruning
the decision tree.
The algorithm given in lectures for deciding when to prune is as
follows:
See Aq and covering
algorithm.
At a branch node that is a candidate for pruning:
A recurrent connection is one that is part of a directed cycle, although the term is sometimes reserved for a connection which is clearly going in the "wrong" direction in an otherwise feedforward network.
Recurrent networks include fully recurrent networks in which each neuron is connected to every other neuron, and partly recurrent networks in which greater or lesser numbers of recurrent connections exist. See also simple recurrent network.
This article is included for general interest - recurrent networks are not part of the syllabus of COMP9414 Artificial Intelligence.
Simple recurrent nets can tackle tasks like this, because they do have a kind of memory for recording information derived from activation values from previous time steps.
This article is included for general interest - sequence prediction tasks are not part of the syllabus of COMP9414 Artificial Intelligence.
Simple recurrent networks can learn
sequence prediction tasks.
See also recurrent networks.
This article is included for general interest - simple recurrent
networks are not part of the syllabus of COMP9414 Artificial
Intelligence.
The algorithm, in outline, is as follows:
This leaves the question of how to calculate the information
gain associated with using a particular attribute A. Suppose
that there are k classes C1,
C2, ..., Ck,
and that of the N instances
classified to this node,
I1 belong to class C1,
I2 belong to class C2, ..., and
Ik belong to class Ck.
In terms of the example
used in the lecture notes, (see also
calculations
in lecture notes),
k = 2 since
there are just two classes, positive and negative.
I1 = 4 and I2 = 3, and N = 7,
and so p1 = 4/7 and p2 = 3/7, and E =
–p1log2(p1)
–p2log2(p2) = –(4/7)×log2(4/7) – (3/7)×log2(3/7).
In the example, the first attribute
A considered is size, and the first value of
size considered,
large, corresponds to a1,
in the example in the lecture notes, so J1,1 = 2 =
J1,2, and J1 = 4.
Thus q1,1 = J1,1/J1 = 2/4 = ½, and
q1,2 = J1,2/J1 = 2/4 = ½, and
E1 =
–q1,1log2(q1,1)
–q1,2log2(q1,2) = –½×log2(½) – ½×log2(½)
= 1.
Contrast unsupervised learning.
Such learning algorithms have the advantage that their
internal representations and their final output can be inspected
and relatively easily understood by humans.
See also function approximation
algorithms.
Let p1 = I1/N,
p2 = I2/N, ..., and
pk = Ik/N.
The initial entropy E at this node is:
–p1log2(p1)
–p2log2(p2) ...
–pklog2(pk).
Now split the instances on each value of the chosen attribute
A. Suppose that there are r attribute values
for A, namely a1, a2, ..., ar.
For a particular value aj, say, suppose that there are
Jj,1 instances in class C1,
Jj,2 instances in class C2, ..., and
Jj,k instances in class Ck,
for a total of Jj instances having attribute value
aj.
Let qj,1 = Jj,1/Jj,
qj,2 = Jj,2/Jj, ..., and
qj,k = Jj,k/Jj.
The entropy Ej associated with this attribute value
aj at this position is:
–qj,1log2(qj,1)
–qj,2log2(qj,2) ...
–qj,klog2(qj,k).
Now compute:
E – ((J1/N).E1
+ (J2/N).E2 + ... + (Jr/N).Er).
This is the information gain for attribute A.
Note that Jj/N is the estimated probability
that an instance classified to this node will have value
aj for attribute A. Thus we are weighting
the entropy estimates Ej by the estimated
probability that an instance has the associated attribute value.
Similarly E2 = 1 and J2 = 2
(size = small),
and E3 = 0 and J3 = 1
(size = medium)
so the final information gain is
E – ((J1/N).E1
+ (J2/N).E2 + ... + (Jr/N).Er)
= E – ((4/7)×E1 + (2/7)×E2 + (1/7)×E3),
which turned out to be about 0.13 in the example.
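The worked example for the size attribute can be checked numerically (a sketch; the entropy helper here takes class counts rather than probabilities):

```python
import math

def entropy(counts):
    """Entropy in bits of a node with the given per-class instance counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

E = entropy([4, 3])    # initial entropy: 4 positive, 3 negative instances
E1 = entropy([2, 2])   # size = large:  2 of each class, entropy 1.0
E2 = entropy([1, 1])   # size = small:  1 of each class, entropy 1.0
E3 = entropy([1])      # size = medium: a single class, entropy 0.0

# Weight each Ej by the estimated probability Jj/N of that attribute value.
gain = E - ((4/7) * E1 + (2/7) * E2 + (1/7) * E3)
print(round(gain, 2))  # about 0.13, as stated in the example
```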
Combinations of the two (e.g. whichever of the two occurs
first) and other stopping conditions are possible. See the reference
by Haykin (Neural networks: a comprehensive foundation
p. 153) for more details.
Compare bias.
See also activation function.
Given a training pattern,
its squared error is obtained by squaring the difference
between the target output of an output neuron and the actual output.
The sum-squared error, or pattern sum-squared error (PSS), is
obtained by adding up the sum-squared errors for each output neuron.
The total sum-squared error is obtained by adding up the PSS for each
training pattern.
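The three levels of error described above can be written out directly (a minimal sketch; each pattern is assumed to be a (targets, actuals) pair of equal-length sequences, one value per output neuron):

```python
def pattern_sum_squared_error(targets, actuals):
    """Sum, over output neurons, of the squared target-minus-actual error."""
    return sum((t - a) ** 2 for t, a in zip(targets, actuals))

def total_sum_squared_error(patterns):
    """Sum of the pattern sum-squared errors over all training patterns."""
    return sum(pattern_sum_squared_error(t, a) for t, a in patterns)
```

For example, a single output neuron with target 1 and actual output 0.5 contributes a squared error of 0.25 to its pattern's PSS.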
The other weights, the ones that are subject to change when
the net is learning, are referred to as trainable weights.
In toy problems like the XOR problem,
only a few training patterns (in the case of XOR, just 4) may be
used. Patterns used for testing the trained network are referred to
as test patterns. Compare instance.
This leaves the question of how to choose the best
attribute to split on at any branch node. This issue is handled
in the article on splitting criterion in ID3.
Without windowing, such an algorithm can be really slow, as it
needs to do its information gain calculations (see
tree induction algorithms) over huge amounts of data.
With windowing, training is done on a relatively small sample
of the data, and then checked against the full set of training data.
Here is the windowing algorithm:
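The algorithm listing itself does not survive in this extract. As a rough sketch of the idea described above (the step of adding misclassified instances back into the window is an assumption about standard windowing, not taken from the original; induce_tree and classify are placeholder names):

```python
import random

def windowing(instances, induce_tree, classify, window_size=20):
    """Train on a small random window, then grow it with the exceptions.

    Assumes induce_tree eventually produces a tree consistent with its
    window; each instance is a dict with a "class" key (an assumption).
    """
    window = random.sample(instances, min(window_size, len(instances)))
    while True:
        tree = induce_tree(window)
        # Check the tree against the *full* set of training data.
        misclassified = [x for x in instances
                         if classify(tree, x) != x["class"]]
        if not misclassified:
            return tree            # consistent with all the training data
        for x in misclassified:    # grow the window with the exceptions
            if x not in window:
                window.append(x)
```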