45. descriptive

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

45.1 Introduction to descriptive

Package descriptive contains a set of functions for making descriptive statistical computations and graphing. Together with the source code there are three data sets in your Maxima tree: pidigits.data, wind.data and biomed.data.

Any statistics manual can be used as a reference to the functions in package descriptive.

For comments, bugs or suggestions, please contact me at 'mario AT edu DOT xunta DOT es'.

Here is a simple example on how the descriptive functions in descriptive do they work, depending on the nature of their arguments, lists or matrices,

(%i1) load (descriptive)$
(%i2) /* univariate sample */   mean ([a, b, c]);
                            c + b + a
(%o2)                       ---------
                                3
(%i3) matrix ([a, b], [c, d], [e, f]);
                            [ a  b ]
                            [      ]
(%o3)                       [ c  d ]
                            [      ]
                            [ e  f ]
(%i4) /* multivariate sample */ mean (%);
                      e + c + a  f + d + b
(%o4)                [---------, ---------]
                          3          3

Note that in multivariate samples the mean is calculated for each column.

In case of several samples with possible different sizes, the Maxima function map can be used to get the desired results for each sample,

(%i1) load (descriptive)$
(%i2) map (mean, [[a, b, c], [d, e]]);
                        c + b + a  e + d
(%o2)                  [---------, -----]
                            3        2

In this case, two samples of sizes 3 and 2 were stored into a list.

Univariate samples must be stored in lists like

(%i1) s1 : [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5];
(%o1)           [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5]

and multivariate samples in matrices as in

(%i1) s2 : matrix ([13.17, 9.29], [14.71, 16.88], [18.50, 16.88],
             [10.58, 6.63], [13.33, 13.25], [13.21,  8.12]);
                        [ 13.17  9.29  ]
                        [              ]
                        [ 14.71  16.88 ]
                        [              ]
                        [ 18.5   16.88 ]
(%o1)                   [              ]
                        [ 10.58  6.63  ]
                        [              ]
                        [ 13.33  13.25 ]
                        [              ]
                        [ 13.21  8.12  ]

In this case, the number of columns equals the random variable dimension and the number of rows is the sample size.

Data can be introduced by hand, but big samples are usually stored in plain text files. For example, file pidigits.data contains the first 100 digits of number %pi:

In order to load these digits in Maxima,

(%i1) s1 : read_list (file_search ("pidigits.data"))$
(%i2) length (s1);
(%o2)                          100

On the other hand, file wind.data contains daily average wind speeds at 5 meteorological stations in the Republic of Ireland (This is part of a data set taken at 12 meteorological stations. The original file is freely downloadable from the StatLib Data Repository and its analysis is discused in Haslett, J., Raftery, A. E. (1989) Space-time Modelling with Long-memory Dependence: Assessing Ireland's Wind Power Resource, with Discussion. Applied Statistics 38, 1-50). This loads the data:

(%i1) s2 : read_matrix (file_search ("wind.data"))$
(%i2) length (s2);
(%o2)                          100
(%i3) s2 [%]; /* last record */
(%o3)            [3.58, 6.0, 4.58, 7.62, 11.25]

Some samples contain non numeric data. As an example, file biomed.data (which is part of another bigger one downloaded from the StatLib Data Repository) contains four blood measures taken from two groups of patients, A and B, of different ages,

(%i1) s3 : read_matrix (file_search ("biomed.data"))$
(%i2) length (s3);
(%o2)                          100
(%i3) s3 [1]; /* first record */
(%o3)            [A, 30, 167.0, 89.0, 25.6, 364]

The first individual belongs to group A, is 30 years old and his/her blood measures were 167.0, 89.0, 25.6 and 364.

One must take care when working with categorical data. In the next example, symbol a is asigned a value in some previous moment and then a sample with categorical value a is taken,

(%i1) a : 1$
(%i2) matrix ([a, 3], [b, 5]);
                            [ 1  3 ]
(%o2)                       [      ]
                            [ b  5 ]

Categories: Descriptive statistics · Share packages · Package descriptive

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

45.2 Functions and Variables for data manipulation

Function: continuous_freq (list)

Function: continuous_freq (list, m)

The argument of continuous_freq must be a list of numbers, which will be then grouped in intervals and counted how many of them belong to each group. Optionally, function continuous_freq admits a second argument indicating the number of classes, 10 is default,

(%i1) load (descriptive)$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) continuous_freq (s1, 5);
(%o3) [[0, 1.8, 3.6, 5.4, 7.2, 9.0], [16, 24, 18, 17, 25]]

The first list contains the interval limits and the second the corresponding counts: there are 16 digits inside the interval [0, 1.8], that is 0's and 1's, 24 digits in (1.8, 3.6], that is 2's and 3's, and so on.

Categories: Package descriptive

Function: discrete_freq (list)

Counts absolute frequencies in discrete samples, both numeric and categorical. Its unique argument is a list,

(%i1) load (descriptive)$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) discrete_freq (s1);
(%o3) [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 
                             [8, 8, 12, 12, 10, 8, 9, 8, 12, 13]]

The first list gives the sample values and the second their absolute frequencies. Commands ? col and ? transpose should help you to understand the last input.

Categories: Package descriptive

Function: subsample (data_matrix, predicate_function)

Function: subsample (data_matrix, predicate_function, col_num1, col_num2, ...)

This is a sort of variant of the Maxima submatrix function. The first argument is the data matrix, the second is a predicate function and optional additional arguments are the numbers of the columns to be taken. Its behaviour is better understood with examples.

These are multivariate records in which the wind speed in the first meteorological station were greater than 18. See that in the lambda expression the i-th component is refered to as v[i].

(%i1) load (descriptive)$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) subsample (s2, lambda([v], v[1] > 18));
              [ 19.38  15.37  15.12  23.09  25.25 ]
              [                                   ]
              [ 18.29  18.66  19.08  26.08  27.63 ]
(%o3)         [                                   ]
              [ 20.25  21.46  19.95  27.71  23.38 ]
              [                                   ]
              [ 18.79  18.96  14.46  26.38  21.84 ]

In the following example, we request only the first, second and fifth components of those records with wind speeds greater or equal than 16 in station number 1 and less than 25 knots in station number 4. The sample contains only data from stations 1, 2 and 5. In this case, the predicate function is defined as an ordinary Maxima function.

(%i1) load (descriptive)$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) g(x):= x[1] >= 16 and x[4] < 25$
(%i4) subsample (s2, g, 1, 2, 5);
                     [ 19.38  15.37  25.25 ]
                     [                     ]
                     [ 17.33  14.67  19.58 ]
(%o4)                [                     ]
                     [ 16.92  13.21  21.21 ]
                     [                     ]
                     [ 17.25  18.46  23.87 ]

Here is an example with the categorical variables of biomed.data. We want the records corresponding to those patients in group B who are older than 38 years.

(%i1) load (descriptive)$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) h(u):= u[1] = B and u[2] > 38 $
(%i4) subsample (s3, h);
                [ B  39  28.0  102.3  17.1  146 ]
                [                               ]
                [ B  39  21.0  92.4   10.3  197 ]
                [                               ]
                [ B  39  23.0  111.5  10.0  133 ]
                [                               ]
                [ B  39  26.0  92.6   12.3  196 ]
(%o4)           [                               ]
                [ B  39  25.0  98.7   10.0  174 ]
                [                               ]
                [ B  39  21.0  93.2   5.9   181 ]
                [                               ]
                [ B  39  18.0  95.0   11.3  66  ]
                [                               ]
                [ B  39  39.0  88.5   7.6   168 ]

Probably, the statistical analysis will involve only the blood measures,

(%i1) load (descriptive)$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) subsample (s3, lambda([v], v[1] = B and v[2] > 38),
                 3, 4, 5, 6);
                   [ 28.0  102.3  17.1  146 ]
                   [                        ]
                   [ 21.0  92.4   10.3  197 ]
                   [                        ]
                   [ 23.0  111.5  10.0  133 ]
                   [                        ]
                   [ 26.0  92.6   12.3  196 ]
(%o3)              [                        ]
                   [ 25.0  98.7   10.0  174 ]
                   [                        ]
                   [ 21.0  93.2   5.9   181 ]
                   [                        ]
                   [ 18.0  95.0   11.3  66  ]
                   [                        ]
                   [ 39.0  88.5   7.6   168 ]

This is the multivariate mean of s3,

(%i1) load (descriptive)$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) mean (s3);
       65 B + 35 A  317          6 NA + 8144.999999999999
(%o3) [-----------, ---, 87.178, ------------------------, 
           100      10                     100
                                                    3 NA + 19587
                                            18.123, ------------]
                                                        100

Here, the first component is meaningless, since A and B are categorical, the second component is the mean age of individuals in rational form, and the fourth and last values exhibit some strange behaviour. This is because symbol NA is used here to indicate non available data, and the two means are nonsense. A possible solution would be to take out from the matrix those rows with NA symbols, although this deserves some loss of information.

(%i1) load (descriptive)$
(%i2) s3 : read_matrix (file_search ("biomed.data"))$
(%i3) g(v):= v[4] # NA and v[6] # NA $
(%i4) mean (subsample (s3, g, 3, 4, 5, 6));
(%o4) [79.4923076923077, 86.2032967032967, 16.93186813186813, 
                                                            2514
                                                            ----]
                                                             13

Categories: Package descriptive

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

45.3 Functions and Variables for descriptive statistics

Function: mean (list)

Function: mean (matrix)

This is the sample mean, defined as

                       n
                     ====
             _   1   \
             x = -    >    x
                 n   /      i
                     ====
                     i = 1

Example:

(%i1) load (descriptive)$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) mean (s1);
                               471
(%o3)                          ---
                               100
(%i4) %, numer;
(%o4)                         4.71
(%i5) s2 : read_matrix (file_search ("wind.data"))$
(%i6) mean (s2);
(%o6)     [9.9485, 10.1607, 10.8685, 15.7166, 14.8441]

Categories: Package descriptive

Function: var (list)

Function: var (matrix)

This is the sample variance, defined as

                     n
                   ====
           2   1   \          _ 2
          s  = -    >    (x - x)
               n   /       i
                   ====
                   i = 1

Example:

(%i1) load (descriptive)$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) var (s1), numer;
(%o3)                   8.425899999999999

45.4 Functions and Variables for specific multivariate descriptive statistics

Function: cov (matrix)

The covariance matrix of the multivariate sample, defined as

              n
             ====
          1  \           _        _
      S = -   >    (X  - X) (X  - X)'
          n  /       j        j
             ====
             j = 1

where X_j is the j-th row of the sample matrix.

Example:

(%i1) load (descriptive)$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) fpprintprec : 7$  /* change precision for pretty output */
(%i4) cov (s2);
      [ 17.22191  13.61811  14.37217  19.39624  15.42162 ]
      [                                                  ]
      [ 13.61811  14.98774  13.30448  15.15834  14.9711  ]
      [                                                  ]
(%o4) [ 14.37217  13.30448  15.47573  17.32544  16.18171 ]
      [                                                  ]
      [ 19.39624  15.15834  17.32544  32.17651  20.44685 ]
      [                                                  ]
      [ 15.42162  14.9711   16.18171  20.44685  24.42308 ]

45.5 Functions and Variables for statistical graphs

Function: histogram (list)

Function: histogram (list, option_1, option_2, ...)

Function: histogram (one_column_matrix)

Function: histogram (one_column_matrix, option_1, option_2, ...)

Function: histogram (one_row_matrix)

Function: histogram (one_row_matrix, option_1, option_2, ...)

This function plots an histogram from a continuous sample. Sample data must be stored in a list of numbers or a one dimensional matrix.

Available options are:

Those defined in the draw package. See also bars and barsplot.
nclasses: number of classes of the histogram (10 by default).

See also discrete_freq and continuous_freq to count data, and bars and barsplot to display bar graphs.

Examples:

A simple histogram with eight classes.

(%i1) load (descriptive)$
(%i2) s1 : read_list (file_search ("pidigits.data"))$
(%i3) histogram (
           s1,
           nclasses     = 8,
           title        = "pi digits",
           xlabel       = "digits",
           ylabel       = "Absolute frequency",
           fill_color   = grey,
           fill_density = 0.6)$

Categories: Package descriptive · Plotting

Function: scatterplot (list)

Function: scatterplot (list, option_1, option_2, ...)

Function: scatterplot (matrix)

Function: scatterplot (matrix, option_1, option_2, ...)

Plots scatter diagrams both for univariate (list) and multivariate (matrix) samples.

Available options are:

Those defined in the draw package.
nclasses: number of classes of the histogram (10 by default).

Examples:

Univariate scatter diagram from a simulated Gaussian sample.

(%i1) load (descriptive)$
(%i2) load (distrib)$
(%i3) scatterplot(
        random_normal(0,1,200),
        xaxis      = true,
        point_size = 2,
        terminal   = eps,
        eps_width  = 10,
        eps_height = 2)$

Two dimensional scatter plot.

(%i1) load (descriptive)$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) scatterplot(
       submatrix(s2, 1,2,3),
       title      = "Data from stations #4 and #5",
       point_type = diamant,
       point_size = 2,
       color      = blue)$

Three dimensional scatter plot.

(%i1) load (descriptive)$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) scatterplot(submatrix (s2, 1,2))$

Five dimensional scatter plot, with five classes histograms.

(%i1) load (descriptive)$
(%i2) s2 : read_matrix (file_search ("wind.data"))$
(%i3) scatterplot(
        s2,
        nclasses     = 5,
        fill_color   = blue,
        fill_density = 0.3,
        xtics        = 5)$

For plotting isolated or line-joined points in two and three dimensions, see points. For histogram related options, see bars.