Distribution Plotting & Analysis


Go to the DataPlus Console and open the file "smallseries". This is an artificial data set generated by a program that creates random numbers with a Gaussian distribution. Choose "Population" for the X-field and leave the others as "None". The DataPlus Console now looks like this:


Then, click on the distribution analysis tool icon at the far left side of the toolbar. The distribution analysis window should open as follows:

The default view is a distribution histogram, which for a Gaussian (or 'Normal') distribution should show a bell-shaped curve. With few points the histogram appears 'lumpy'; as the number of points increases, it assumes a smoother bell-like shape. A lumpy histogram does not mean the data are non-Gaussian: the set can still be described by the statistics that appear in the box on the left side of the window. The data are fitted to an ideal Gaussian curve, and the fitted curve is shown.


The following parametric results are given:

Points: The number of data points in the set.

Range: The smallest and largest values in the data set.

Mean: The mean (average) of all values in the data set.

S.D.: The standard deviation of the data.

+/- 2SD: The values two standard deviations below and above the mean. For a Gaussian distribution, about 95% of the points fall within this range.
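
The parametric results above can be reproduced outside the application. The following is a minimal Python sketch, not the DataPlus implementation. The data are a hypothetical stand-in for the "Population" field (the mean of 100 and SD of 10 are assumptions), and the sample standard deviation (n - 1 denominator) is used because the tutorial does not say which form DataPlus reports.

```python
import random
import statistics

# Hypothetical stand-in for the "Population" field of "smallseries":
# 100 pseudo-random Gaussian values (mean 100 and SD 10 are assumed).
rng = random.Random(42)
values = [rng.gauss(100.0, 10.0) for _ in range(100)]


def parametric_summary(data):
    """Return the parametric results listed above for a list of numbers."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample SD; an assumption about DataPlus
    return {
        "points": len(data),
        "range": (min(data), max(data)),
        "mean": mean,
        "sd": sd,
        "two_sd": (mean - 2 * sd, mean + 2 * sd),
    }


summary = parametric_summary(values)
low, high = summary["two_sd"]

# Count how many points actually fall inside the +/- 2SD interval.
inside = sum(low <= v <= high for v in values)

print(summary)
print(f"{inside} of {summary['points']} points lie within +/- 2SD")
```

For Gaussian data the final count should come out close to 95 of 100, although any single random sample will vary a little.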

The non-parametric results (Percentiles) are also shown:

1/99 (5/95, etc): The 1st and 99th (5th and 95th, etc) percentiles, that is, the values below which 1% (5%, etc) and 99% (95%, etc) of the data points are found.

Median value: The value above and below which half of the data points lie (50th percentile).

Mode Value(Gp): The most common value. Note that this will depend upon how the points are grouped (Gp).
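
As a rough illustration of the non-parametric results, the sketch below computes percentiles by linear interpolation between ranks and a grouped mode. Both choices are assumptions: the tutorial does not state which percentile convention DataPlus uses or how it labels the modal group. The function names and the sample data are hypothetical.

```python
import math
import random
import statistics
from collections import Counter

# Hypothetical Gaussian sample standing in for the tutorial data.
rng = random.Random(42)
values = [rng.gauss(100.0, 10.0) for _ in range(100)]


def percentile(data, p):
    """Value at the p-th percentile, interpolating linearly between ranks."""
    ordered = sorted(data)
    pos = (len(ordered) - 1) * p / 100.0
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


def grouped_mode(data, group_size):
    """Lower edge of the group (of width group_size) holding the most points."""
    counts = Counter(math.floor(v / group_size) for v in data)
    index, _ = counts.most_common(1)[0]
    return index * group_size


for lo_p, hi_p in [(1, 99), (5, 95)]:
    print(f"{lo_p}/{hi_p}: {percentile(values, lo_p):.1f} / {percentile(values, hi_p):.1f}")
print(f"Median: {percentile(values, 50):.1f}")

# The mode changes with the group size, which is why it is labelled (Gp).
for size in (10, 5, 1):
    print(f"Mode with group size {size}: {grouped_mode(values, size)}")
```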


The distribution histogram is formed by taking the data and grouping it. The smaller the group size, the more 'fine-grained' the distribution will be, but it will also appear more uneven. The group size does not affect the statistics or the fitted curve; it simply controls how the data is shown. Using the "Group Size" edit box on the "Distribution Statistics" panel, set the value to 5 and click the "Apply" button. The distribution should now look like this:

This recalculates the histogram by grouping all the results into sets whose values differ by no more than 5.0. For example, all data points between 90.0 and 95.0 are counted, and a bar is created whose height reflects the number of data points in that range. Note that the distribution is more ragged than before, since each bar represents fewer data points.
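
The grouping step can be sketched in a few lines of Python. This is an illustration under one stated assumption: each group is treated as half-open, so a value of exactly 95.0 is counted in the 95.0 to 100.0 group. The tutorial does not say how DataPlus handles values on a boundary. The function name and sample values are hypothetical.

```python
import math
from collections import Counter


def group_counts(data, group_size):
    """Count the points in each group, keyed by the group's lower edge.

    Groups are half-open intervals [edge, edge + group_size); this
    boundary rule is an assumption, not documented DataPlus behaviour.
    """
    counts = Counter(math.floor(v / group_size) * group_size for v in data)
    return dict(sorted(counts.items()))


sample = [90.0, 91.5, 94.9, 95.0, 99.2, 101.0]
bars = group_counts(sample, 5.0)

# Print a simple text histogram: one '#' per data point in the group.
for edge, count in bars.items():
    print(f"{edge:6.1f} to {edge + 5.0:6.1f} | {'#' * count}")
```

Changing the group size changes only the bar heights; the total number of points counted stays the same, which is why the statistics and the fitted curve are unaffected.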



The data can be plotted in any of three formats: a histogram, a percentile plot, or a plot representing the data values as standard deviations from the mean. To change the plot type, select the desired output from the "Graph" menu as shown:

 

If you choose "Percentile" as shown here, the graph is redrawn as shown below:

 

Here, the data points have been sorted, and the value of a point is plotted against its percentile ranking in the data set. For example, this set contains 100 points, so in a low -> high sorted list of the points, point #5 would represent the 5th percentile, the 20th point would be the 20th percentile, and so on. Note that, although rough, the curve has the expected "S" shape of a Gaussian, or "Normal" distribution.
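
The ranking described above can be written out as a short Python sketch. It follows the convention in the text, where the k-th of n sorted points is assigned percentile 100 * k / n. Other conventions exist, and whether DataPlus uses exactly this one is an assumption. The function name is hypothetical.

```python
def percentile_points(data):
    """Return (percentile, value) pairs for a percentile plot.

    The k-th smallest of n points is assigned the percentile 100 * k / n,
    matching the "point #5 of 100 is the 5th percentile" example.
    """
    ordered = sorted(data)
    n = len(ordered)
    return [(100.0 * rank / n, value) for rank, value in enumerate(ordered, start=1)]


# A deliberately unsorted set of 100 distinct values: 100, 99, ..., 1.
points = percentile_points(list(range(100, 0, -1)))

print(points[4])   # fifth point in sorted order
print(points[19])  # twentieth point in sorted order
```

Plotting the pairs with percentile on one axis and value on the other gives the curve described above.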

Another way of looking at the distribution is by plotting the deviation of the value from the mean (similar to a "Probit" or "Z-Score"), as shown below.

 

This has the advantage that, for a "Normal" distribution, the plot should be a straight line, which allows data fitting to be done. The data fitting algorithm used in this application takes advantage of this by performing a least-squares regression on the data points lying within +/- 2SD of the mean. This is not a very rigorous method, but it works well if the distribution is reasonably close to Gaussian.
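
One way to picture this kind of fit is sketched below in Python. It is not the DataPlus algorithm, whose details the tutorial does not give. The sketch assumes that each sorted point is assigned the standard-normal score of its plotting position (i - 0.5) / n, that only points with a score between -2 and +2 are kept, and that an ordinary least-squares line is fitted to value against score. Under those assumptions the intercept estimates the mean and the slope estimates the standard deviation. The function name `probit_fit` is hypothetical.

```python
from statistics import NormalDist, mean


def probit_fit(data, z_limit=2.0):
    """Fit value = intercept + slope * z by least squares for |z| <= z_limit.

    z is the standard-normal score of each point's plotting position
    (i - 0.5) / n; this position formula is an assumption.
    """
    ordered = sorted(data)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, SD 1

    pairs = []
    for i, value in enumerate(ordered, start=1):
        z = std_normal.inv_cdf((i - 0.5) / n)
        if abs(z) <= z_limit:
            pairs.append((z, value))

    z_mean = mean(z for z, _ in pairs)
    v_mean = mean(v for _, v in pairs)
    sxx = sum((z - z_mean) ** 2 for z, _ in pairs)
    sxy = sum((z - z_mean) * (v - v_mean) for z, v in pairs)

    slope = sxy / sxx
    intercept = v_mean - slope * z_mean
    return intercept, slope


# Idealised Gaussian data (mean 100, SD 10) built from exact normal scores.
ideal = [100.0 + 10.0 * NormalDist().inv_cdf((i - 0.5) / 200) for i in range(1, 201)]
fit_mean, fit_sd = probit_fit(ideal)
print(f"fitted mean = {fit_mean:.3f}, fitted SD = {fit_sd:.3f}")
```

Restricting the fit to the central points keeps the extreme tails, where a few outlying values can bend the line, from dominating the regression.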


Now, go back to the DataPlus Console and choose "Append Data..." from the "File" menu. Select the file "tinyseries". The distribution icon should be dimmed, so select the "Tiny Series" tab and set "Value_1" as the X-field.



Clicking the distribution icon should then give you a distribution histogram with two distributions. If you go to the "Distribution Statistics" panel and change the color of the "Values_2" set to red, you will have a graph like this:

This shows the sets are essentially identical except for the number of points they contain.

