11.8.1 Linear Fits
------------------
The ‘a F’ (‘calc-curve-fit’) [‘fit’] command attempts to fit a set of
data (‘x’ and ‘y’ vectors of numbers) to a straight line, polynomial, or
other function of ‘x’. For the moment we will consider only the case of
fitting to a line, and we will ignore the issue of whether or not the
model is in fact a good fit for the data.
In a standard linear least-squares fit, we have a set of ‘(x,y)’ data
points that we wish to fit to the model ‘y = m x + b’ by adjusting the
parameters ‘m’ and ‘b’ to make the ‘y’ values calculated from the
formula be as close as possible to the actual ‘y’ values in the data
set. (In a polynomial fit, the model is instead, say, ‘y = a x^3 + b
x^2 + c x + d’. In a multilinear fit, we have data points of the form
‘(x_1,x_2,x_3,y)’ and our model is ‘y = a x_1 + b x_2 + c x_3 + d’.
These will be discussed later.)
In the model formula, variables like ‘x’ and ‘x_2’ are called the
“independent variables”, and ‘y’ is the “dependent variable”. Variables
like ‘m’, ‘a’, and ‘b’ are called the “parameters” of the model.
The ‘a F’ command takes the data set to be fitted from the stack. By
default, it expects the data in the form of a matrix. For example, for
a linear or polynomial fit, this would be a 2xN matrix where the first
row is a list of ‘x’ values and the second row has the corresponding ‘y’
values. For the multilinear fit shown above, the matrix would have four
rows (‘x_1’, ‘x_2’, ‘x_3’, and ‘y’, respectively).
If you happen to have an Nx2 matrix instead of a 2xN matrix, just
press ‘v t’ first to transpose the matrix.
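For readers who prepare their data outside Calc, the two layouts can be
pictured with a short Python sketch. The sample points and the names
‘points’ and ‘rows’ are only illustrations and are not part of Calc; the
transposition shown is the same operation that ‘v t’ performs on a matrix.

```python
# One (x, y) pair per row: the Nx2 layout.
points = [(1, 5), (2, 7), (3, 9), (4, 11), (5, 13)]

# Transpose so that the first row holds the x values and the second row
# holds the corresponding y values: the 2xN layout.
rows = [list(row) for row in zip(*points)]

print(rows)  # [[1, 2, 3, 4, 5], [5, 7, 9, 11, 13]]
```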
After you type ‘a F’, Calc prompts you to select a model. For a
linear fit, press the digit ‘1’.
Calc then prompts you to name the variables. By default it chooses
letters from the end of the alphabet, like ‘x’ and ‘y’, for independent
variables and letters from the beginning, like ‘a’ and ‘b’, for
parameters. (The dependent variable doesn’t need a name.) The two
kinds of variables are separated by a semicolon. Since you generally
care more about the names of the independent variables than of the
parameters, Calc also allows you to name only those and let the
parameters use default names.
For example, suppose the data matrix
[ [ 1, 2, 3, 4, 5 ]
[ 5, 7, 9, 11, 13 ] ]
is on the stack and we wish to do a simple linear fit. Type ‘a F’, then
‘1’ for the model, then <RET> to use the default names. The result will
be the formula ‘3. + 2. x’ on the stack. Calc has created the model
expression ‘a + b x’, then found the optimal values of ‘a’ and ‘b’ to
fit the data. (In this case, it was able to find an exact fit.) Calc
then substituted those values for ‘a’ and ‘b’ in the model formula.
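The arithmetic behind this result can be checked outside Calc. The
following Python sketch uses the textbook closed-form formulas for a
straight-line least-squares fit. The function name ‘linear_fit’ is
hypothetical, and the sketch makes no claim about how Calc computes the
fit internally.

```python
def linear_fit(xs, ys):
    """Return (a, b) minimizing sum((y - (a + b*x))**2) over the data.

    Illustrative helper only; not part of Calc.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of x-y cross deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


# The data matrix from the example: first row x values, second row y values.
data = [[1, 2, 3, 4, 5],
        [5, 7, 9, 11, 13]]

a, b = linear_fit(data[0], data[1])
print(a, b)  # 3.0 2.0
```

The printed intercept and slope agree with the formula ‘3. + 2. x’ that
Calc leaves on the stack.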
The ‘a F’ command puts two entries in the trail. One is, as always,
a copy of the result that went to the stack; the other is a vector of
the actual parameter values, written as equations: ‘[a = 3, b = 2]’, in
case you’d rather read them in a list than pick them out of the formula.
(You can type ‘t y’ to move this vector to the stack; see Trail
Commands.)
Specifying a different independent variable name will affect the
resulting formula: ‘a F 1 k <RET>’ produces ‘3. + 2. k’. Changing the
parameter names (say, ‘a F 1 k;b,m <RET>’) will affect the equations
that go into the trail.
To see what happens when the fit is not exact, we could change the
number 13 in the data matrix to 14 and try the fit again. The result
is:
2.6 + 2.2 x
Evaluating this formula, say with ‘v x 5 <RET> <TAB> V M $ <RET>’,
shows a reasonably close match to the y-values in the data.
[4.8, 7., 9.2, 11.4, 13.6]
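The same evaluation can be mirrored in Python as a rough cross-check.
The list comprehension plays the role of mapping the fitted formula over
the ‘x’ values; the names ‘fitted’ and ‘residuals’ are illustrative, and
values are rounded only to hide floating-point noise.

```python
xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 14]   # the last y value changed from 13 to 14

# Evaluate the fitted line 2.6 + 2.2 x at every x value.
fitted = [round(2.6 + 2.2 * x, 10) for x in xs]

# Differences between the actual and the calculated y values.
residuals = [round(y - f, 10) for y, f in zip(ys, fitted)]

print(fitted)
print(residuals)
```

No residual exceeds 0.4 in magnitude, which is what "reasonably close"
amounts to for this data set.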
Since no single line passes through all N data points,
Calc has chosen a line that best approximates the data points using the
method of least squares. The idea is to define the “chi-square” error
measure
chi^2 = sum((y_i - (a + b x_i))^2, i, 1, N)
which is clearly zero if ‘a + b x’ exactly fits all data points, and
increases as various ‘a + b x_i’ values fail to match the corresponding
‘y_i’ values. There are several reasons why the summand is squared, one
of them being to ensure that ‘chi^2 >= 0’. Least-squares fitting simply
chooses the values of ‘a’ and ‘b’ for which the error ‘chi^2’ is as
small as possible.
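To make the error measure concrete, here is a small Python sketch that
evaluates ‘chi^2’ for the modified data set (with 14 as the last ‘y’
value). The function name ‘chi_square’ is hypothetical; the sketch only
evaluates the sum defined above and does not perform the minimization
itself.

```python
def chi_square(a, b, xs, ys):
    """Sum of squared differences between y_i and a + b x_i."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 14]

# Error of the least-squares line 2.6 + 2.2 x.
best = chi_square(2.6, 2.2, xs, ys)

# Error of the line 3 + 2 x, which fit the unmodified data exactly.
old = chi_square(3, 2, xs, ys)

# Errors after nudging both parameters by 0.1 in each direction.
nearby = [chi_square(2.6 + da, 2.2 + db, xs, ys)
          for da in (-0.1, 0.1) for db in (-0.1, 0.1)]

print(best, old, min(nearby))
```

The line ‘3 + 2 x’ misses only the last point, by 1, so its error is 1,
while the least-squares line has an error of about 0.4. Each of the
nudged parameter pairs gives a larger error than the least-squares
values.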
Other kinds of models do the same thing but with a different model
formula in place of ‘a + b x_i’.
A numeric prefix argument causes the ‘a F’ command to take the data
in a form other than one big matrix. A positive argument N will take
N items from the stack, corresponding to the N rows of a data matrix.
In the linear case, N must be 2 since there is always one independent
variable and one dependent variable.
A prefix of zero or plain ‘C-u’ is a compromise: Calc takes two items
from the stack, an N-row matrix of ‘x’ values and a vector of ‘y’
values. If there is only one independent variable, the ‘x’ values can
be either a one-row matrix or a plain vector, in which case the ‘C-u’
prefix is the same as a ‘C-u 2’ prefix.