\chapter{Discrete variables}
\label{chap:discrete}

When a variable can take only a finite, typically small, number of
values, then the variable is said to be \emph{discrete}. Some
\app{gretl} commands act in a slightly different way when applied to
discrete variables; moreover, \app{gretl} provides a few commands that
only apply to discrete variables.  Specifically, the \texttt{dummify}
and \texttt{xtab} commands (see below) are available only for discrete
variables, while the \texttt{freq} (frequency distribution) command
produces different output for discrete variables.


\section{Declaring variables as discrete}
\label{discr-declare}

\app{Gretl} uses a simple heuristic to judge whether a given variable
should be treated as discrete, but you also have the option of
explicitly marking a variable as discrete, in which case the heuristic
check is bypassed.

The heuristic is as follows: First, are all the values of the variable
``reasonably round'', where this is taken to mean that they are all
integer multiples of 0.25?  If this criterion is met, we then ask
whether the variable takes on a ``fairly small'' set of distinct
values, where ``fairly small'' is defined as less than or equal to 8.
If both conditions are satisfied, the variable is automatically
considered discrete.
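
The two conditions can be sketched in a short script.  This is only an
illustration of the rule as stated above, not \app{gretl}'s internal
code; the function name \texttt{looks\_discrete} is hypothetical.
\begin{code}
# hypothetical helper: mimics the heuristic described in the text
function scalar looks_discrete (series x)
    matrix v = values(x)       # distinct values of x
    matrix v4 = 4 * v          # integer if v is a multiple of 0.25
    scalar frac = maxc(abs(v4 - floor(v4)))
    return (frac == 0) && (rows(v) <= 8)
end function

nulldata 20
series a = round(4 * uniform())   # integers between 0 and 4
series b = normal()               # continuous draws
printf "a: %d, b: %d\n", looks_discrete(a), looks_discrete(b)
\end{code}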

To mark a variable as discrete you have two options.
\begin{enumerate}
\item From the graphical interface, select ``Variable, Edit
  Attributes'' from the menu. A dialog box will appear and, if the
  variable seems suitable, you will see a tick box labeled ``Treat
  this variable as discrete''.  This dialog box can also be invoked
  via the context menu (right-click on a variable) or by pressing the
  F2 key.
\item From the command-line interface, via the \texttt{discrete}
  command. The command takes one or more arguments, which can be
  either variables or lists of variables. For example:
\begin{code}
list xlist = x1 x2 x3
discrete z1 xlist z2
\end{code}
This syntax makes it possible to declare many variables as discrete
at once, which cannot presently be done via the graphical
interface. The switch \option{reverse} reverses the declaration of a
variable as discrete, or in other words marks it as continuous.
For example:
\begin{code}
discrete foo
# now foo is discrete
discrete foo --reverse
# now foo is continuous
\end{code}
\end{enumerate}

The command-line variant is more powerful, in that you can mark a
variable as discrete even if it does not seem to be suitable for
this treatment.

Marking a variable as discrete does not affect its content; it is
the user's responsibility to make sure that doing so is a sensible
thing to do.  If you want to recode a continuous variable into
classes, you can use the \texttt{genr} command and its arithmetic
functions, as in the following example:
\begin{code}
nulldata 100
# generate a variable with mean 2 and variance 1
genr x = normal() + 2
# split into 4 classes
genr z = (x>0) + (x>2) + (x>4)
# now declare z as discrete
discrete z
\end{code}

Once a variable is marked as discrete, this setting is remembered when
you save the file.

\section{Commands for discrete variables}
\label{discr-commands}

\subsection{The \texttt{dummify} command}
\label{discr-dummify}

The \texttt{dummify} command takes as argument a series $x$ and creates
dummy variables for each distinct value present in $x$, which must
have already been declared as discrete.  Example:
\begin{code}
open greene22_2
discrete Z5 # mark Z5 as discrete
dummify Z5
\end{code}

The effect of the above command is to generate 5 new dummy variables,
labeled \texttt{DZ5\_1} through \texttt{DZ5\_5}, which correspond to
the different values in \texttt{Z5}. Hence, the variable
\texttt{DZ5\_4} is 1 if \texttt{Z5} equals 4 and 0 otherwise. This
functionality is also available through the graphical interface by
selecting the menu item ``Add, Dummies for selected discrete variables''.
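
To make the meaning of the generated dummies concrete, the following
sketch rebuilds one of them by hand and compares it with the series
produced by \texttt{dummify}.  The names \texttt{d4\_manual} and
\texttt{d4\_gap} are illustrative only; the sketch assumes the
generated dummy is named \texttt{DZ5\_4}, as described above.
\begin{code}
open greene22_2
discrete Z5
dummify Z5
# hand-made counterpart of DZ5_4: 1 where Z5 equals 4, 0 otherwise
series d4_manual = (Z5 == 4)
# if the two series coincide, this difference is zero throughout
series d4_gap = DZ5_4 - d4_manual
summary d4_gap
\end{code}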

The \texttt{dummify} command can also be used with the following
syntax:
\begin{code}
list dlist = dummify(x)
\end{code}
This not only creates the dummy variables, but also a named list (see
section~\ref{named-lists}) that can be used afterwards. The
following example computes summary statistics for the variable \texttt{Y} for
each value of \texttt{Z5}:
\begin{code}
open greene22_2
discrete Z5 # mark Z5 as discrete
list foo = dummify(Z5)
loop foreach i foo
  smpl $i --restrict --replace
  summary Y
endloop
smpl --full
\end{code}
% $

Since \texttt{dummify} generates a list, it can be used directly
in commands that call for a list as input, such as \texttt{ols}.  For
example:
\begin{code}
open greene22_2
discrete Z5 # mark Z5 as discrete
ols Y 0 dummify(Z5)
\end{code}

\subsection{The \texttt{freq} command}
\label{discr-freq}

The \texttt{freq} command displays absolute and relative frequencies
for a given variable. The way frequencies are counted depends on
whether the variable is continuous or discrete.  This command is also
available via the graphical interface by selecting the ``Variable,
Frequency distribution'' menu entry.

For discrete variables, frequencies are counted for each distinct
value that the variable takes. For continuous variables, values are
grouped into ``bins'' and then the frequencies are counted for each
bin. The number of bins, by default, is computed as a function of the
number of valid observations in the currently selected sample via the
rule shown in Table~\ref{tab:bins}. However, when the command is
invoked through the menu item ``Variable, Frequency Plot'', this
default can be overridden by the user.

\begin{table}[htbp]
  \centering
  \begin{tabular}{cc}
\hline
  Observations & Bins \\
\hline
  $8 \le n < 16$ & 5 \\
  $16 \le n < 50 $ & 7 \\
  $50 \le n \le 850 $ & $\lceil \sqrt{n} \rceil$  \\
  $n > 850 $ & 29 \\
\hline
\end{tabular}
\caption{Number of bins for various sample sizes}
\label{tab:bins}
\end{table}
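
The rule in Table~\ref{tab:bins} can be written out as a small
function.  This is a sketch of the table only, not \app{gretl}'s own
code, and \texttt{default\_bins} is a hypothetical name.  Sample sizes
below 8 are not covered by the table, so the sketch returns
\texttt{NA} for them.
\begin{code}
# hypothetical helper: number of bins implied by the table
function scalar default_bins (scalar n)
    scalar k = NA
    if n >= 8 && n < 16
        k = 5
    elif n >= 16 && n < 50
        k = 7
    elif n >= 50 && n <= 850
        k = ceil(sqrt(n))
    elif n > 850
        k = 29
    endif
    return k
end function

# 32 observations fall in the second row of the table
printf "bins for n = 32: %d\n", default_bins(32)
\end{code}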

For example, the following code
%
\begin{code}
open greene19_1
freq TUCE
discrete TUCE # mark TUCE as discrete
freq TUCE
\end{code}
%
yields
%
\begin{code}
Read datafile /usr/local/share/gretl/data/greene/greene19_1.gdt
periodicity: 1, maxobs: 32,
observations range: 1-32

Listing 5 variables:
  0) const    1) GPA      2) TUCE     3) PSI      4) GRADE  

? freq TUCE

Frequency distribution for TUCE, obs 1-32
number of bins = 7, mean = 21.9375, sd = 3.90151

       interval          midpt   frequency    rel.     cum.

          <  13.417     12.000        1      3.12%    3.12% *
    13.417 - 16.250     14.833        1      3.12%    6.25% *
    16.250 - 19.083     17.667        6     18.75%   25.00% ******
    19.083 - 21.917     20.500        6     18.75%   43.75% ******
    21.917 - 24.750     23.333        9     28.12%   71.88% **********
    24.750 - 27.583     26.167        7     21.88%   93.75% *******
          >= 27.583     29.000        2      6.25%  100.00% **

Test for null hypothesis of normal distribution:
Chi-square(2) = 1.872 with p-value 0.39211
? discrete TUCE # mark TUCE as discrete
? freq TUCE

Frequency distribution for TUCE, obs 1-32

          frequency    rel.     cum.

  12           1      3.12%    3.12% *
  14           1      3.12%    6.25% *
  17           3      9.38%   15.62% ***
  19           3      9.38%   25.00% ***
  20           2      6.25%   31.25% **
  21           4     12.50%   43.75% ****
  22           2      6.25%   50.00% **
  23           4     12.50%   62.50% ****
  24           3      9.38%   71.88% ***
  25           4     12.50%   84.38% ****
  26           2      6.25%   90.62% **
  27           1      3.12%   93.75% *
  28           1      3.12%   96.88% *
  29           1      3.12%  100.00% *

Test for null hypothesis of normal distribution:
Chi-square(2) = 1.872 with p-value 0.39211
\end{code}
%
As can be seen from the sample output, a Doornik-Hansen test for
normality is computed automatically.  This test is suppressed for
discrete variables where the number of distinct values is less than
10.

This command accepts two options: \option{quiet}, which suppresses
the histogram when the command is invoked from the command line, and
\option{gamma}, which replaces the normality test with Locke's
nonparametric test, whose null hypothesis is that the data follow a
Gamma distribution.
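
A minimal illustration of the two options, using the same data file
as in the example above:
\begin{code}
open greene19_1
freq TUCE --quiet    # frequency table, no histogram
freq TUCE --gamma    # test against the Gamma distribution instead
\end{code}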

If the distinct values of a discrete variable need to be saved, the
\texttt{values()} matrix construct can be used (see
chapter~\ref{chap:matrices}).
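
As a brief sketch, the distinct values of \texttt{TUCE} could be
stored as follows; the matrix name \texttt{v} is arbitrary.  Judging
by the frequency table shown earlier, the count printed should be 14,
provided the series has no missing values.
\begin{code}
open greene19_1
matrix v = values(TUCE)   # column vector of distinct values, sorted
print v
printf "%d distinct values\n", rows(v)
\end{code}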

\subsection{The \texttt{xtab} command}
\label{discr-xtab}

The \texttt{xtab} command can be invoked in either of the following
ways.  First,
%
\begin{code}
xtab ylist ; xlist
\end{code}
%
where \texttt{ylist} and \texttt{xlist} are lists of discrete
variables.  This produces cross-tabulations (two-way frequencies) of
each of the variables in \texttt{ylist} (by row) against each of the
variables in \texttt{xlist} (by column).  Or second,
%
\begin{code}
xtab xlist
\end{code}
%
In the second case a full set of cross-tabulations is generated; that
is, each variable in \texttt{xlist} is tabulated against each other
variable in the list.  In the graphical interface, this command is
represented by the ``Cross Tabulation'' item under the View menu,
which is active if at least two variables are selected.

Here is an example of use:
%
\begin{code}
open greene22_2
discrete Z* # mark Z1-Z8 as discrete
xtab Z1 Z4 ; Z5 Z6
\end{code}
which produces
\begin{code}
Cross-tabulation of Z1 (rows) against Z5 (columns)

       [   1][   2][   3][   4][   5]  TOT.
  
[   0]    20    91    75    93    36    315
[   1]    28    73    54    97    34    286

TOTAL     48   164   129   190    70    601

Pearson chi-square test = 5.48233 (4 df, p-value = 0.241287)

Cross-tabulation of Z1 (rows) against Z6 (columns)

       [   9][  12][  14][  16][  17][  18][  20]  TOT.
  
[   0]     4    36   106    70    52    45     2    315
[   1]     3     8    48    45    37    67    78    286

TOTAL      7    44   154   115    89   112    80    601

Pearson chi-square test = 123.177 (6 df, p-value = 3.50375e-24)

Cross-tabulation of Z4 (rows) against Z5 (columns)

       [   1][   2][   3][   4][   5]  TOT.
  
[   0]    17    60    35    45    14    171
[   1]    31   104    94   145    56    430

TOTAL     48   164   129   190    70    601

Pearson chi-square test = 11.1615 (4 df, p-value = 0.0248074)

Cross-tabulation of Z4 (rows) against Z6 (columns)

       [   9][  12][  14][  16][  17][  18][  20]  TOT.
  
[   0]     1     8    39    47    30    32    14    171
[   1]     6    36   115    68    59    80    66    430

TOTAL      7    44   154   115    89   112    80    601

Pearson chi-square test = 18.3426 (6 df, p-value = 0.0054306)
\end{code}

Pearson's $\chi^2$ test for independence is automatically displayed,
provided that all cells have expected frequencies under independence
greater than $10^{-7}$.  However, a common rule of thumb states that
this statistic is valid only if the expected frequency is 5 or
greater for at least 80 percent of the cells.  If this condition is not
met, a warning is printed.
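
For reference, with observed counts $O_{ij}$ and expected counts
$E_{ij} = n_{i\cdot} n_{\cdot j} / n$, the statistic is
$\sum_{i,j} (O_{ij} - E_{ij})^2 / E_{ij}$, with $(r-1)(c-1)$ degrees
of freedom for a table with $r$ rows and $c$ columns.  The following
sketch recomputes it by hand from the counts in the first table of the
output above (\texttt{Z1} against \texttt{Z5}); it is meant only as a
check on the arithmetic, not as a description of how \texttt{xtab}
works internally, and all matrix and scalar names are arbitrary.  The
result should be close to the 5.48233 with 4 degrees of freedom
reported by \texttt{xtab}.
\begin{code}
# observed counts copied from the Z1-by-Z5 table
matrix O = {20, 91, 75, 93, 36; 28, 73, 54, 97, 34}
matrix rtot = sumr(O)          # row totals (2 x 1)
matrix ctot = sumc(O)          # column totals (1 x 5)
scalar n = sumc(rtot)          # grand total
matrix E = rtot * ctot / n     # expected counts under independence
scalar X2 = sumc(sumr((O - E).^2 ./ E))
scalar df = (rows(O) - 1) * (cols(O) - 1)
scalar pv = pvalue(X, df, X2)
printf "chi-square(%d) = %g, p-value = %g\n", df, X2, pv
# rule of thumb: share of cells with expected count of at least 5
scalar share = meanc(vec(E .>= 5))
printf "share of cells with E >= 5: %g\n", share
\end{code}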

Additionally, the \option{row} or \option{column} options can be
given: in this case, the output displays row or column percentages,
respectively.

If you want to cut and paste the output of \texttt{xtab} to some other
program, e.g.\ a spreadsheet, you may want to use the \option{zeros}
option; this option causes cells with zero frequency to display the
number 0 instead of being empty.
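
The options described above might be used as follows; this is a
minimal sketch based on the same data file as the earlier example.
\begin{code}
open greene22_2
discrete Z1 Z5
xtab Z1 ; Z5 --row      # row percentages
xtab Z1 ; Z5 --column   # column percentages
xtab Z1 ; Z5 --zeros    # print 0 in cells with zero frequency
\end{code}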

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: