11.3 Presentation of data
The two formats available for
presenting data on paper are tabular and graphical, and their relative merits
are compared below. In some circumstances, it is clearly best to use
only one of these two alternatives. However, in many data
collection exercises, part of the measurements and calculations are expressed
in tabular form and part graphically, so making best use of the merits of each
technique. Very similar arguments apply to the relative merits of graphical and
tabular presentations if a computer screen is used for the presentation instead
of paper.
11.3.1 Tabular data presentation
A tabular presentation allows data
values to be recorded in a precise way that exactly maintains the accuracy to
which the data values were measured. In other words, the data values are written
down exactly as measured. Besides recording the raw data values as measured,
tables often also contain further values calculated from the raw data. An
example of a tabular data presentation is given in Table 11.1. This records the
results of an experiment to determine the strain induced in a bar of material
that is subjected to a range of stresses. Data were obtained by applying a
sequence of forces to the end of the bar and using an extensometer to measure
the change in length. Values of the stress and strain in the bar are calculated
from these measurements and are also included in the table. The final row,
which is of crucial importance in any tabular presentation, is the estimate of
possible error in each calculated result.
A table of measurements and
calculations should conform to several rules as illustrated in Table 11.1:
(i) The table should have a title
that explains what data are being presented within the table.
(ii) Each column of figures in the
table should refer to the measurements or calculations associated with one
quantity only.
(iii) Each column of figures should
be headed by a title that identifies the data values contained in the column.
(iv) The units in which quantities in
each column are measured should be stated at the top of the column.
(v) All headings and columns should
be separated by bold horizontal (and sometimes vertical) lines.
(vi) The errors associated with each
data value quoted in the table should be given. The form shown in Table 11.1 is
a suitable way to do this when the error level is the same for all data values
in a particular column. However, if error levels vary, then it is preferable to
write the error boundaries alongside each entry in the table.
11.3.2 Graphical presentation of data
Presentation of data in graphical
form involves some compromise in the accuracy to which the data are recorded,
as the exact values of measurements are lost. However, graphical presentation
has important advantages over tabular presentation.
(i) Graphs provide a pictorial
representation of results that is more readily comprehended than a set of
tabular results.
(ii) Graphs are particularly useful
for expressing the quantitative significance of results and showing whether a
linear relationship exists between two variables. Figure 11.12 shows a graph
drawn from the stress and strain values given in Table 11.1. Construction
of the graph involves first of all marking the points corresponding to the
stress and strain values. The next step is to draw a line through these
data points that best represents the relationship between the two variables.
This line will normally be either a straight one or a smooth curve. The data
points will not usually lie exactly on this line but instead will lie on either
side of it. The magnitude of the excursions of the data points from the line
drawn will depend on the magnitude of the random measurement errors associated
with the data.
(iii) Graphs can sometimes show up a
data point that is clearly outside the straight line or curve that seems to fit
the rest of the data points. Such a data point is probably due either to a
human mistake in reading an instrument or else to a momentary malfunction in
the measuring instrument itself. If the graph shows such a data point where a
human mistake or instrument malfunction is suspected, the proper course of
action is to repeat that particular measurement and then discard the original
data point if the mistake or malfunction is confirmed.
As with tables, the proper
representation of data in graphical form has to conform to certain rules:
(i) The graph should have a title or
caption that explains what data are being presented in the graph.
(ii) Both axes of the graph should be
labelled to express clearly what variable is associated with each axis and to
define the units in which the variables are expressed.
(iii) The number of points marked
along each axis should be kept reasonably small – about five divisions is often
a suitable number.
(iv) No attempt should be made to
draw the graph outside the boundaries corresponding to the maximum and minimum
data values measured, e.g. in Figure 11.12, the graph stops at a point
corresponding to the highest measured stress value of 108.5.
Fitting curves to data points on a graph
The procedure of drawing a straight
line or smooth curve as appropriate that passes close to all data points on a
graph, rather than joining the data points by a jagged line that passes through
each data point, is justified on account of the random errors that are known to
affect measurements. Any line between the data points is mathematically
acceptable as a graphical representation of the data if the maximum deviation
of any data point from the line is within the boundaries of the identified
level of possible measurement errors. However, within the range of possible
lines that could be drawn, only one will be the optimum one. This optimum line
is where the sum of negative errors in data points on one side of the line is
balanced by the sum of positive errors in data points on the other side of the
line. The nature of the data points is often such that a perfectly acceptable
approximation to the optimum can be obtained by drawing a line through the data
points by eye. In other cases, however, it is necessary to fit a line mathematically,
using regression techniques.
Regression techniques
Regression techniques consist of
finding a mathematical relationship between measurements of two variables y and
x, such that the value of variable y can be predicted from a measurement of the
other variable x. However, regression techniques should not be regarded as a
magic formula that can fit a good relationship to measurement data in all
circumstances, as the characteristics of the data must satisfy certain
conditions. In determining the suitability of measurement data for the
application of regression techniques, it is recommended practice to draw an
approximate graph of the measured data points, as this is often the best means
of detecting aspects of the data that make it unsuitable for regression
analysis. Drawing a graph of the data will indicate, for example, whether there
are any data points that appear to be erroneous. This may indicate that human
mistakes or instrument malfunctions have affected the erroneous data points,
and it is assumed that any such data points will be checked for correctness.
Regression techniques cannot be
successfully applied if the deviation of any particular data point from the
line to be fitted is greater than the maximum possible error that is calculated
for the measured variable (i.e. the predicted sum of all systematic and random
errors). The nature of some measurement data sets is such that this criterion
cannot be satisfied, and any attempt to apply regression techniques is doomed
to failure. In that event, the only valid course of action is to express the
measurements in tabular form. This can then be used as an x-y look-up table,
from which values of the variable y corresponding to particular values of x can
be read off. In many cases, this problem of large errors in some data points
only becomes apparent during the process of attempting to fit a relationship by
regression.
A further check that must be made
before attempting to fit a line or curve to measurements of two variables x and
y is to examine the data and look for any evidence that both variables are
subject to random errors. It is a clear condition for the validity of
regression techniques that only one of the measured variables is subject to
random errors, with no error in the other variable. If random errors do exist
in both measured variables, regression techniques cannot be applied and
recourse must be made instead to correlation analysis (covered later in this
chapter). A simple example of a situation where both variables in a measurement
data set are subject to random errors is the measurement of human height and
weight, and no attempt should be made to fit a relationship between them by regression.
Having determined that the technique
is valid, the regression procedure is simplest if a straight-line relationship
exists between the variables, which allows a relationship of the form y = a +
bx to be estimated by linear least squares regression. Unfortunately, in many
cases, a straight-line relationship between the points does not exist, which is
readily shown by plotting the raw data points on a graph. However, knowledge of
physical laws governing the data can often suggest a suitable alternative form
of relationship between the two sets of variable measurements, such as a
quadratic relationship or a higher order polynomial relationship. Also, in some
cases, the measured variables can be transformed into a form where a linear
relationship exists. For example, suppose that two variables y and x are
related according to $y = ax^c$. A linear relationship can be derived from this
by a logarithmic transformation: $\log(y) = \log(a) + c \log(x)$.
Thus, if a graph is constructed of
$\log(y)$ plotted against $\log(x)$, the parameters of a straight-line relationship
can be estimated by linear least squares regression.
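The logarithmic transformation can be sketched in a few lines of Python. This is only an illustration: the function names `fit_line` and `fit_power_law` are hypothetical, the data are synthetic and noise-free, and all x and y values are assumed to be positive so that their logarithms exist.

```python
import math

def fit_line(xs, ys):
    """Least squares estimates (a, b) for the straight line y = a + b*x."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    b = (sum(x * y for x, y in zip(xs, ys)) - n * xm * ym) / (
        sum(x * x for x in xs) - n * xm * xm)
    a = ym - b * xm
    return a, b

def fit_power_law(xs, ys):
    """Estimate a and c in y = a * x**c by fitting log(y) against log(x)."""
    log_a, c = fit_line([math.log(x) for x in xs],
                        [math.log(y) for y in ys])
    return math.exp(log_a), c

# Synthetic data generated from y = 2 * x**1.5 (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0 * x ** 1.5 for x in xs]
a, c = fit_power_law(xs, ys)
print(f"a = {a:.4f}, c = {c:.4f}")
```

Note that the fit minimises squared deviations in log(y), not in y itself, so with noisy data the result can differ slightly from a direct fit of the power law.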
All quadratic and higher order
relationships relating one variable y to another variable x can be represented
by a power series of the form:
$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_p x^p$
Estimation of the parameters $a_0 \ldots a_p$
is very difficult if p has a large value. Fortunately, a
relationship where p only has a small value can be fitted to most data sets.
Quadratic least squares regression is used to estimate parameters where p has a
value of two, and for larger values of p, polynomial least squares regression
is used for parameter estimation.
Where the appropriate form of
relationship between variables in measurement data sets is not obvious either
from visual inspection or from consideration of physical laws, a method that is
effectively a trial and error one has to be applied. This consists of
estimating the parameters of successively higher order relationships between y
and x until a curve is found that fits the data sufficiently closely. What
level of closeness is acceptable is considered in the later section on
confidence tests.
Linear least squares regression
If a linear relationship between y
and x exists for a set of n measurements $y_1 \ldots y_n$, $x_1 \ldots x_n$,
then this relationship can be expressed as y = a + bx, where
the coefficients a and b are constants. The purpose of least squares regression
is to select the optimum values for a and b such that the line gives the best
fit to the measurement data.
The deviation of each point $(x_i, y_i)$
from the line can be expressed as $d_i$, where
$d_i = y_i - (a + bx_i)$.
The best-fit line is obtained when
the sum of the squared deviations, S, is a minimum, i.e. when
$S = \sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (y_i - a - bx_i)^2$
is a minimum.
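The minimisation can be carried through in closed form. The following is a sketch of the standard derivation, in which $x_m$ and $y_m$ denote the mean values of x and y (notation assumed to match the worked example): setting the two partial derivatives of S to zero gives a pair of simultaneous equations, and solving them yields the estimates for b and a.

```latex
\begin{align}
\frac{\partial S}{\partial a} &= -2\sum_{i=1}^{n}\left(y_i - a - b x_i\right) = 0 \\
\frac{\partial S}{\partial b} &= -2\sum_{i=1}^{n} x_i\left(y_i - a - b x_i\right) = 0 \\
b &= \frac{\sum_{i=1}^{n} x_i y_i - n\,x_m y_m}{\sum_{i=1}^{n} x_i^2 - n\,x_m^2},
\qquad a = y_m - b\,x_m
\end{align}
```

The first condition implies that the fitted line passes through the point $(x_m, y_m)$, which is why a follows directly once b is known.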
Example 11.1
In an experiment to determine the
characteristics of a displacement sensor with a voltage output, the following
output voltage values were recorded when a set of standard displacements was
measured:
Fit a straight line to this set of
data using least squares regression and estimate the output voltage when a
displacement of 4.5 cm is measured.
Solution
Let y represent the output voltage
and x represent the displacement. Then a suitable straight line is given by y =
a + bx. We can now proceed to calculate estimates for the coefficients a and b
using equations (11.8) and (11.9) above. The first step is to calculate the
mean values of x and y. These are found to be $x_m = 5.5$ and $y_m = 11.47$.
Next, we need to tabulate $x_i y_i$ and $x_i^2$
for each pair of data values:
Hence, for x = 4.5, y = 0.1033 + (2.067
× 4.5) = 9.40 volts. Note that in this solution, we have only specified the
answer to an accuracy of three figures, which is the same accuracy as the
measurements. Any greater number of figures in the answer would be meaningless.
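The calculation in this example can be sketched in Python. The function name `linear_least_squares` is hypothetical, and the data below are synthetic points generated to lie exactly on the line y = 0.1033 + 2.067x quoted in the solution; they are not the measured voltages from the experiment.

```python
def linear_least_squares(xs, ys):
    """Return (a, b) for y = a + b*x, using the means xm and ym."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    # b = (sum(x_i * y_i) - n*xm*ym) / (sum(x_i^2) - n*xm^2)
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * xm * ym
    sxx = sum(x * x for x in xs) - n * xm * xm
    b = sxy / sxx
    a = ym - b * xm
    return a, b

# Synthetic displacements (cm) and voltages (V); assumed values, not the
# measurements from the example.
xs = [float(i) for i in range(1, 11)]
ys = [0.1033 + 2.067 * x for x in xs]

a, b = linear_least_squares(xs, ys)
y_at_4_5 = a + b * 4.5   # predicted output voltage at 4.5 cm
print(f"a = {a:.4f}, b = {b:.3f}, y(4.5) = {y_at_4_5:.2f} volts")
```

With real measurements the points would scatter about the line, and the answer should still be quoted only to the number of figures justified by the measurements.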
Least squares regression is often
appropriate for situations where a straight-line relationship is not
immediately obvious, for example where $y \propto x^2$ or $y \propto \exp(x)$.
Example 11.2
From theoretical considerations, it
is known that the voltage (V) across a charged capacitor decays with time (t)
according to the relationship: V = K exp(t/
The minimum can be found by setting
the partial derivatives ∂S/∂a, ∂S/∂b and ∂S/∂c to zero and solving the
resulting simultaneous equations, as for the linear least squares regression
case above. Standard computer programs to estimate the parameters a, b and c by
numerical methods are widely available and therefore a detailed solution is not
presented here.
Polynomial least squares regression
Polynomial least squares regression
is used to estimate the parameters of the pth order relationship
$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_p x^p$
between two sets of measurements $y_1 \ldots y_n$, $x_1 \ldots x_n$.
The deviation of each point $(x_i, y_i)$
from the fitted curve can be expressed as $d_i$, where:
$d_i = y_i - (a_0 + a_1 x_i + a_2 x_i^2 + \cdots + a_p x_i^p)$
The best-fit curve is obtained when
the sum of the squared deviations given by
$S = \sum_{i=1}^{n} d_i^2$
is a minimum.
The minimum can be found as before by
setting the (p + 1) partial derivatives $\partial S/\partial a_0 \ldots \partial S/\partial a_p$ to zero and solving the
resulting simultaneous equations. Again, as for the quadratic least squares
regression case, standard computer programs to estimate the parameters
$a_0 \ldots a_p$ by numerical methods are widely available and therefore a
detailed solution is not presented here.
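As a rough illustration of what such a program does, the sketch below builds the simultaneous equations obtained from the partial derivatives and solves them by Gaussian elimination. The function names `solve_linear_system` and `polyfit_least_squares` are hypothetical, the data are synthetic, and the code assumes more than p distinct x values. For large p these equations become badly conditioned, so production software generally uses more robust numerical methods.

```python
def solve_linear_system(A, rhs):
    """Solve A z = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):                 # back substitution
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def polyfit_least_squares(xs, ys, p):
    """Return [a0, ..., ap] minimising S = sum_i (y_i - sum_j a_j x_i^j)^2."""
    # Setting dS/da_j = 0 for j = 0..p gives
    #   sum_k a_k * sum_i x_i^(j+k) = sum_i y_i * x_i^j
    A = [[sum(x ** (j + k) for x in xs) for k in range(p + 1)]
         for j in range(p + 1)]
    rhs = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(p + 1)]
    return solve_linear_system(A, rhs)

# Synthetic data lying exactly on y = 1 + 2x + 0.5x^2 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x + 0.5 * x ** 2 for x in xs]
coeffs = polyfit_least_squares(xs, ys, 2)
print([round(c, 6) for c in coeffs])
```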
Confidence tests in curve fitting by least squares regression
Once data has been collected and a
mathematical relationship that fits the data points has been determined by
regression, the level of confidence that the mathematical relationship fitted
is correct must be expressed in some way. The first check that must be made is
whether the fundamental requirement for the validity of regression techniques
is satisfied, i.e. whether the deviations of data points from the fitted line
are all less than the maximum error level predicted for the measured variable.
If this condition is violated by any data point that a line or curve has been
fitted to, then use of the fitted relationship is unsafe and recourse must be
made to tabular data presentation, as described earlier.
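This first check is straightforward to automate. The sketch below assumes a fitted straight line y = a + bx and a single maximum error level for the measured variable y; the function name `points_exceeding_error` and all numerical values are hypothetical.

```python
def points_exceeding_error(xs, ys, a, b, max_error):
    """List (x, y, deviation) for points whose deviation from the line
    y = a + b*x is larger in magnitude than max_error."""
    flagged = []
    for x, y in zip(xs, ys):
        d = y - (a + b * x)
        if abs(d) > max_error:
            flagged.append((x, y, d))
    return flagged

# Illustrative values only: a fitted line y = 2x and a maximum error of 0.5.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 7.0, 8.1]
flagged = points_exceeding_error(xs, ys, a=0.0, b=2.0, max_error=0.5)
print(flagged)
```

An empty list means that the condition is met for every point. Any entry in the list identifies a measurement that should be rechecked before the fitted relationship is used.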
The second check concerns whether or
not random errors affect both measured variables. If attempts are made to fit
relationships by regression to data where both measured variables contain
random errors, any relationship fitted will only be approximate and it is
likely that one or more data points will have a deviation from the fitted line
or curve that is greater than the maximum error level predicted for the
measured variable. This will show up when the appropriate checks are made.
Having carried out the above checks
to show that there are no aspects of the data which suggest that regression
analysis is not appropriate, the next step is to apply least squares regression
to estimate the parameters of the chosen relationship (linear, quadratic etc.).
After this, some form of follow-up procedure is clearly required to assess how
well the estimated relationship fits the data points. A simple curve-fitting
confidence test is to calculate the sum of squared deviations S for the chosen
y/x relationship and compare it with the value of S calculated for the next
higher order regression curve that could be fitted to the data. Thus if a
straight-line relationship is chosen, the value of S calculated should be of a
similar magnitude to that obtained by fitting a quadratic relationship. If the
value of S were substantially lower for a quadratic relationship, this would
indicate that a quadratic relationship was a better fit to the data than a
straight-line one and further tests would be needed to examine whether a cubic
or higher order relationship was a better fit still.
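The comparison of S between successive orders can be sketched as follows. The helper names are hypothetical, the data are invented values that roughly follow y = x squared, and the fit solves the simultaneous equations directly, which is adequate only for small p.

```python
def solve(A, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def polyfit(xs, ys, p):
    """Least squares coefficients [a0, ..., ap] of a pth order polynomial."""
    A = [[sum(x ** (j + k) for x in xs) for k in range(p + 1)]
         for j in range(p + 1)]
    rhs = [sum(y * x ** j for x, y in zip(xs, ys)) for j in range(p + 1)]
    return solve(A, rhs)

def sum_squared_deviations(xs, ys, coeffs):
    """S = sum of squared deviations of the data from the fitted curve."""
    return sum((y - sum(a * x ** j for j, a in enumerate(coeffs))) ** 2
               for x, y in zip(xs, ys))

# Invented data, roughly quadratic (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 4.2, 8.8, 16.1, 25.0]

s_linear = sum_squared_deviations(xs, ys, polyfit(xs, ys, 1))
s_quadratic = sum_squared_deviations(xs, ys, polyfit(xs, ys, 2))
print(f"S (linear) = {s_linear:.3f}, S (quadratic) = {s_quadratic:.3f}")
```

Here S falls sharply when moving from the straight line to the quadratic, which suggests that the straight line is not an adequate description of these data. S can never increase when the order is raised, so only a substantial reduction is meaningful.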
Other more sophisticated confidence
tests exist such as the F-ratio test. However, these are outside the scope of
this book.
Correlation tests
Where both variables in a measurement
data set are subject to random fluctuations, correlation analysis is applied to
determine the degree of association between the variables. For example, in the
case already quoted of a data set containing measurements of human height and
weight, we certainly expect some relationship between the variables of height
and weight because a tall person is heavier on average than a short person.
Correlation tests determine the strength of the relationship (or interdependence)
between the measured variables, which is expressed in the form of a correlation
coefficient.
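For two sets of measurements $y_1 \ldots y_n$, $x_1 \ldots x_n$
with means $x_m$ and $y_m$,
the correlation coefficient $r$ is given by:
$r = \dfrac{\sum (x_i - x_m)(y_i - y_m)}{\sqrt{\left[\sum (x_i - x_m)^2\right]\left[\sum (y_i - y_m)^2\right]}}$
The value of $|r|$ always lies between 0 and 1, with 0 indicating no linear association between the variables and 1 indicating a perfect linear relationship. For intermediate values, linear least squares regression can be applied in both directions, giving two separate regression lines of the form:
y = a + bx and x = c + dy
These two lines are not normally
coincident, as shown in Figure 11.13. Both lines pass through the centroid of
the data points but their slopes are different.
As $|r| \to 1$, the two lines tend towards coincidence.
As $|r| \to 0$, the two lines tend towards being orthogonal, parallel to the x and y axes.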
For the general case, the best fit to
the data is the line that bisects the angle between the lines on Figure 11.13.
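The correlation coefficient and the two regression lines can be computed together, as in the sketch below. It uses the standard product-moment definition of the correlation coefficient, denoted r here as an assumption about notation; the function name `correlation_and_regression_lines` and the height and weight values are hypothetical.

```python
import math

def correlation_and_regression_lines(xs, ys):
    """Return r, (a, b) for y = a + b*x, and (c, d) for x = c + d*y."""
    n = len(xs)
    xm = sum(xs) / n
    ym = sum(ys) / n
    sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))
    sxx = sum((x - xm) ** 2 for x in xs)
    syy = sum((y - ym) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    b = sxy / sxx            # slope of y regressed on x
    a = ym - b * xm
    d = sxy / syy            # slope of x regressed on y
    c = xm - d * ym
    return r, (a, b), (c, d)

# Hypothetical heights (cm) and weights (kg), for illustration only.
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
weights = [52.0, 61.0, 66.0, 79.0, 83.0]

r, (a, b), (c, d) = correlation_and_regression_lines(heights, weights)
print(f"r = {r:.3f}")
print(f"y = {a:.2f} + {b:.3f}x   and   x = {c:.2f} + {d:.3f}y")
```

Both lines pass through the centroid of the data, and the product of the two slopes b and d equals the square of r, so the lines coincide only when the magnitude of r is 1.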