 11.8.1 Linear Fits
 ------------------
 
 The ‘a F’ (‘calc-curve-fit’) [‘fit’] command attempts to fit a set of
 data (‘x’ and ‘y’ vectors of numbers) to a straight line, polynomial, or
 other function of ‘x’.  For the moment we will consider only the case of
 fitting to a line, and we will ignore the question of whether the model
 is in fact a good fit for the data.
 
    In a standard linear least-squares fit, we have a set of ‘(x,y)’ data
 points that we wish to fit to the model ‘y = m x + b’ by adjusting the
 parameters ‘m’ and ‘b’ to make the ‘y’ values calculated from the
 formula be as close as possible to the actual ‘y’ values in the data
 set.  (In a polynomial fit, the model is instead, say, ‘y = a x^3 + b
 x^2 + c x + d’.  In a multilinear fit, we have data points of the form
 ‘(x_1,x_2,x_3,y)’ and our model is ‘y = a x_1 + b x_2 + c x_3 + d’.
 These will be discussed later.)
 
    In the model formula, variables like ‘x’ and ‘x_2’ are called the
 “independent variables”, and ‘y’ is the “dependent variable”.  Variables
 like ‘m’, ‘a’, and ‘b’ are called the “parameters” of the model.
 
    The ‘a F’ command takes the data set to be fitted from the stack.  By
 default, it expects the data in the form of a matrix.  For example, for
 a linear or polynomial fit, this would be a 2xN matrix where the first
 row is a list of ‘x’ values and the second row has the corresponding ‘y’
 values.  For the multilinear fit shown above, the matrix would have four
 rows (‘x_1’, ‘x_2’, ‘x_3’, and ‘y’, respectively).
 
    If you happen to have an Nx2 matrix instead of a 2xN matrix, just
 press ‘v t’ first to transpose the matrix.
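
    As a rough illustration of the two layouts (written in Python rather
 than Calc, with variable names chosen only for this sketch), transposing
 a list of ‘(x,y)’ pairs gives the row-per-variable form described above:

```python
# One (x, y) pair per row: the Nx2 layout.
pairs = [[1, 5], [2, 7], [3, 9], [4, 11], [5, 13]]

# Transposing yields the 2xN layout: row 0 holds the x values,
# row 1 holds the corresponding y values.
rows = [list(column) for column in zip(*pairs)]
print(rows)
```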
 
    After you type ‘a F’, Calc prompts you to select a model.  For a
 linear fit, press the digit ‘1’.
 
    Calc then prompts you to name the variables.  By default it chooses
 letters from the end of the alphabet, like ‘x’ and ‘y’, for independent
 variables and letters from the beginning, like ‘a’ and ‘b’, for
 parameters.  (The dependent variable doesn’t need a name.)  The two
 kinds of variables are separated by a semicolon.  Since you generally
 care more about the names of the independent variables than of the
 parameters, Calc also allows you to name only those and let the
 parameters use default names.
 
    For example, suppose the data matrix
 
      [ [ 1, 2, 3, 4,  5  ]
        [ 5, 7, 9, 11, 13 ] ]
 
 is on the stack and we wish to do a simple linear fit.  Type ‘a F’, then
 ‘1’ for the model, then <RET> to use the default names.  The result will
 be the formula ‘3. + 2. x’ on the stack.  Calc has created the model
 expression ‘a + b x’, then found the optimal values of ‘a’ and ‘b’ to
 fit the data.  (In this case, it was able to find an exact fit.)  Calc
 then substituted those values for ‘a’ and ‘b’ in the model formula.
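
    The same numbers can be checked outside Calc.  The following Python
 sketch uses the textbook closed-form solution for a straight-line
 least-squares fit; the function name ‘linear_fit’ is an assumption made
 for this example, and the code is not a description of how Calc computes
 the fit internally.

```python
def linear_fit(xs, ys):
    """Return (a, b) minimizing the squared error of the model a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations in x and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b

a, b = linear_fit([1, 2, 3, 4, 5], [5, 7, 9, 11, 13])
print(a, b)  # 3.0 2.0
```

    For this data set the fit is exact, so the computed parameters agree
 with the ‘3. + 2. x’ result shown above.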
 
    The ‘a F’ command puts two entries in the trail.  One is, as always,
 a copy of the result that went to the stack; the other is a vector of
 the actual parameter values, written as equations: ‘[a = 3, b = 2]’, in
 case you’d rather read them in a list than pick them out of the formula.
 (You can type ‘t y’ to move this vector to the stack; see Trail
 Commands.)
 
    Specifying a different independent variable name will affect the
 resulting formula: ‘a F 1 k <RET>’ produces ‘3 + 2 k’.  Changing the
 parameter names (say, ‘a F 1 k;b,m <RET>’) will affect the equations
 that go into the trail.
 
    To see what happens when the fit is not exact, we could change the
 number 13 in the data matrix to 14 and try the fit again.  The result
 is:
 
      2.6 + 2.2 x
 
    Evaluating this formula, say with ‘v x 5 <RET> <TAB> V M $ <RET>’,
 shows a reasonably close match to the y-values in the data.
 
      [4.8, 7., 9.2, 11.4, 13.6]
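
    As a cross-check, the same evaluation can be sketched in Python: the
 fitted formula is applied to each of the x values 1 through 5.  The
 rounding is only there to hide floating-point noise, and the variable
 names are illustrative.

```python
xs = [1, 2, 3, 4, 5]    # the index vector produced by 'v x 5'
ys = [5, 7, 9, 11, 14]  # the modified y values from the data matrix

# Apply the fitted formula 2.6 + 2.2 x to every x value.
fitted = [2.6 + 2.2 * x for x in xs]
print([round(v, 10) for v in fitted])

# Differences between the data and the fitted line.
residuals = [round(y - f, 10) for y, f in zip(ys, fitted)]
print(residuals)
```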
 
    Since no single line passes through all N data points, Calc has
 chosen the line that best approximates them using the method of least
 squares.  The idea is to define the “chi-square” error measure
 
      chi^2 = sum((y_i - (a + b x_i))^2, i, 1, N)
 
 which is clearly zero if ‘a + b x’ exactly fits all data points, and
 increases as various ‘a + b x_i’ values fail to match the corresponding
 ‘y_i’ values.  There are several reasons why the summand is squared, one
 of them being to ensure that ‘chi^2 >= 0’.  Least-squares fitting simply
 chooses the values of ‘a’ and ‘b’ for which the error ‘chi^2’ is as
 small as possible.
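
    To make the error measure concrete, here is a small Python sketch
 that computes ‘chi^2’ directly from the definition above and compares
 the fitted parameters against a few nearby alternatives.  The function
 name ‘chi_square’ and the step size 0.1 are choices made for this
 example only.

```python
def chi_square(a, b, xs, ys):
    """Sum of squared differences between y_i and the model a + b*x_i."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [5, 7, 9, 11, 14]

# Error at the parameters found by the fit.
best = chi_square(2.6, 2.2, xs, ys)

# Error at every neighbouring (a, b) pair one step of 0.1 away.
steps = (-0.1, 0.0, 0.1)
nearby = [chi_square(2.6 + da, 2.2 + db, xs, ys)
          for da in steps for db in steps
          if (da, db) != (0.0, 0.0)]

print(best, min(nearby))
```

    Every neighbouring pair gives a larger error than ‘a = 2.6’,
 ‘b = 2.2’, which is what it means for those values to minimize ‘chi^2’.
 For the original data, ‘chi_square(3, 2, xs, [5, 7, 9, 11, 13])’ is
 exactly zero.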
 
    Other kinds of models do the same thing but with a different model
 formula in place of ‘a + b x_i’.
 
    A numeric prefix argument causes the ‘a F’ command to take the data
 in some other form than one big matrix.  A positive argument N will take
 N items from the stack, corresponding to the N rows of a data matrix.
 In the linear case, N must be 2 since there is always one independent
 variable and one dependent variable.
 
    A prefix of zero or plain ‘C-u’ is a compromise: Calc takes two items
 from the stack, an N-row matrix of ‘x’ values and a vector of ‘y’
 values.  If there is only one independent variable, the ‘x’ values can
 be either a one-row matrix or a plain vector; in the latter case the
 ‘C-u’ prefix is the same as a ‘C-u 2’ prefix.
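
    The relationship between these input forms and the single data matrix
 can be pictured with a short Python sketch.  The helper
 ‘assemble_rows’ is hypothetical and only models the data layout; it
 says nothing about how Calc itself reads the stack.

```python
def assemble_rows(x_part, y_values):
    """Combine x data and a y vector into one row-per-variable matrix.

    x_part may be a plain vector of numbers or a list of row vectors.
    """
    # Treat a plain vector as a matrix with a single row.
    x_rows = x_part if isinstance(x_part[0], list) else [x_part]
    return [list(row) for row in x_rows] + [list(y_values)]

# A plain x vector and a one-row x matrix give the same 2xN data matrix.
from_vector = assemble_rows([1, 2, 3, 4, 5], [5, 7, 9, 11, 13])
from_matrix = assemble_rows([[1, 2, 3, 4, 5]], [5, 7, 9, 11, 13])
print(from_vector == from_matrix)
```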