~ubuntu-branches/ubuntu/trusty/gretl/trusty-proposed

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
\chapter{Discrete variables}
\label{chap:discrete}

When a variable can take only a finite, typically small, number of
values, then the variable is said to be \emph{discrete}. Some
\app{gretl} commands act in a slightly different way when applied to
discrete variables; moreover, \app{gretl} provides a few commands that
only apply to discrete variables.  Specifically, the \texttt{dummify}
and \texttt{xtab} commands (see below) are available only for discrete
variables, while the \texttt{freq} (frequency distribution) command
produces different output for discrete variables.


\section{Declaring variables as discrete}
\label{discr-declare}

\app{Gretl} uses a simple heuristic to judge whether a given variable
should be treated as discrete, but you also have the option of
explicitly marking a variable as discrete, in which case the heuristic
check is bypassed.

The heuristic is as follows: First, are all the values of the variable
``reasonably round'', where this is taken to mean that they are all
integer multiples of 0.25?  If this criterion is met, we then ask
whether the variable takes on a ``fairly small'' set of distinct
values, where ``fairly small'' is defined as less than or equal to 8.
If both conditions are satisfied, the variable is automatically
considered discrete.

To mark a variable as discrete you have two options.
\begin{enumerate}
\item From the graphical interface, select ``Variable, Edit
  Attributes'' from the menu. A dialog box will appear and, if the
  variable seems suitable, you will see a tick box labeled ``Treat
  this variable as discrete''.  This dialog box can also be invoked
  via the context menu (right-click on a variable) or by pressing the
  F2 key.
\item From the command-line interface, via the \texttt{discrete}
  command. The command takes one or more arguments, which can be
  either variables or list of variables. For example:
\begin{code}
list xlist = x1 x2 x3
discrete z1 xlist z2
\end{code}
This syntax makes it possible to declare as discrete many
variables at once, which cannot presently be done via the graphical
interface. The switch \option{reverse} reverses the declaration of a
variable as discrete, or in other words marks it as continuous.
For example:
\begin{code}
discrete foo
# now foo is discrete
discrete foo --reverse
# now foo is continuous
\end{code}
\end{enumerate}

The command-line variant is more powerful, in that you can mark a
variable as discrete even if it does not seem to be suitable for
this treatment.

Note that marking a variable as discrete does not affect its content.
It is the user's responsibility to make sure that marking a variable
as discrete is a sensible thing to do.  Note that if you want to
recode a continuous variable into classes, you can use the
\texttt{genr} command and its arithmetic functions, as in the
following example:
\begin{code}
nulldata 100
# generate a variable with mean 2 and variance 1
genr x = normal() + 2
# split into 4 classes
genr z = (x>0) + (x>2) + (x>4)
# now declare z as discrete
discrete z
\end{code}

Once a variable is marked as discrete, this setting is remembered when
you save the file.

\section{Commands for discrete variables}
\label{discr-commands}

\subsection{The \texttt{dummify} command}
\label{discr-dummify}

The \texttt{dummify} command takes as argument a series $x$ and creates
dummy variables for each distinct value present in $x$, which must
have already been declared as discrete.  Example:
\begin{code}
open greene22_2
discrete Z5 # mark Z5 as discrete
dummify Z5
\end{code}

The effect of the above command is to generate 5 new dummy variables,
labeled \texttt{DZ5\_1} through \texttt{DZ5\_5}, which correspond to
the different values in \texttt{Z5}. Hence, the variable
\texttt{DZ5\_4} is 1 if \texttt{Z5} equals 4 and 0 otherwise. This
functionality is also available through the graphical interface by
selecting the menu item ``Add, Dummies for selected discrete variables''.

The \texttt{dummify} command can also be used with the following
syntax:
\begin{code}
list dlist = dummify(x)
\end{code}
This not only creates the dummy variables, but also a named list (see
section~\ref{named-lists}) that can be used afterwards. The
following example computes summary statistics for the variable \texttt{Y} for
each value of \texttt{Z5}:
\begin{code}
open greene22_2
discrete Z5 # mark Z5 as discrete
list foo = dummify(Z5)
loop foreach i foo
  smpl $i --restrict --replace
  summary Y
endloop
smpl --full
\end{code}
% $

Since \texttt{dummify} generates a list, it can be used directly
in commands that call for a list as input, such as \texttt{ols}.  For
example:
\begin{code}
open greene22_2
discrete Z5 # mark Z5 as discrete
ols Y 0 dummify(Z5)
\end{code}

\subsection{The \texttt{freq} command}
\label{discr-freq}

The \texttt{freq} command displays absolute and relative frequencies
for a given variable. The way frequencies are counted depends on
whether the variable is continuous or discrete.  This command is also
available via the graphical interface by selecting the ``Variable,
Frequency distribution'' menu entry.

For discrete variables, frequencies are counted for each distinct
value that the variable takes. For continuous variables, values are
grouped into ``bins'' and then the frequencies are counted for each
bin. The number of bins, by default, is computed as a function of the
number of valid observations in the currently selected sample via the
rule shown in Table~\ref{tab:bins}. However, when the command is
invoked through the menu item ``Variable, Frequency Plot'', this
default can be overridden by the user.

\begin{table}[htbp]
  \centering
  \begin{tabular}{cc}
\hline
  Observations & Bins \\
\hline
  $8 \le n < 16$ & 5 \\
  $16 \le n < 50 $ & 7 \\
  $50 \le n \le 850 $ & $\lceil \sqrt{n} \rceil$  \\
  $n > 850 $ & 29 \\
\hline
\end{tabular}
\caption{Number of bins for various sample sizes}
\label{tab:bins}
\end{table}

For example, the following code
%
\begin{code}
open greene19_1
freq TUCE
discrete TUCE # mark TUCE as discrete
freq TUCE
\end{code}
%
yields
%
\begin{code}
Read datafile /usr/local/share/gretl/data/greene/greene19_1.gdt
periodicity: 1, maxobs: 32,
observations range: 1-32

Listing 5 variables:
  0) const    1) GPA      2) TUCE     3) PSI      4) GRADE  

? freq TUCE

Frequency distribution for TUCE, obs 1-32
number of bins = 7, mean = 21.9375, sd = 3.90151

       interval          midpt   frequency    rel.     cum.

          <  13.417     12.000        1      3.12%    3.12% *
    13.417 - 16.250     14.833        1      3.12%    6.25% *
    16.250 - 19.083     17.667        6     18.75%   25.00% ******
    19.083 - 21.917     20.500        6     18.75%   43.75% ******
    21.917 - 24.750     23.333        9     28.12%   71.88% **********
    24.750 - 27.583     26.167        7     21.88%   93.75% *******
          >= 27.583     29.000        2      6.25%  100.00% **

Test for null hypothesis of normal distribution:
Chi-square(2) = 1.872 with p-value 0.39211
? discrete TUCE # mark TUCE as discrete
? freq TUCE

Frequency distribution for TUCE, obs 1-32

          frequency    rel.     cum.

  12           1      3.12%    3.12% *
  14           1      3.12%    6.25% *
  17           3      9.38%   15.62% ***
  19           3      9.38%   25.00% ***
  20           2      6.25%   31.25% **
  21           4     12.50%   43.75% ****
  22           2      6.25%   50.00% **
  23           4     12.50%   62.50% ****
  24           3      9.38%   71.88% ***
  25           4     12.50%   84.38% ****
  26           2      6.25%   90.62% **
  27           1      3.12%   93.75% *
  28           1      3.12%   96.88% *
  29           1      3.12%  100.00% *

Test for null hypothesis of normal distribution:
Chi-square(2) = 1.872 with p-value 0.39211
\end{code}
%
As can be seen from the sample output, a Doornik-Hansen test for
normality is computed automatically.  This test is suppressed for
discrete variables where the number of distinct values is less than
10.

This command accepts two options: \option{quiet}, to avoid
generation of the histogram when invoked from the command line and
\option{gamma}, for replacing the normality test with Locke's
nonparametric test, whose null hypothesis is that the data follow a
Gamma distribution.

If the distinct values of a discrete variable need to be saved, the
\texttt{values()} matrix construct can be used (see chapter
\ref{chap:matrices}).

\subsection{The \texttt{xtab} command}
\label{discr-xtab}

The \texttt{xtab} command cab be invoked in either of the following
ways.  First,
%
\begin{code}
xtab ylist ; xlist
\end{code}
%
where \texttt{ylist} and \texttt{xlist} are lists of discrete
variables.  This produces cross-tabulations (two-way frequencies) of
each of the variables in \texttt{ylist} (by row) against each of the
variables in \texttt{xlist} (by column).  Or second,
%
\begin{code}
xtab xlist
\end{code}
%
In the second case a full set of cross-tabulations is generated; that
is, each variable in \texttt{xlist} is tabulated against each other
variable in the list.  In the graphical interface, this command is
represented by the ``Cross Tabulation'' item under the View menu,
which is active if at least two variables are selected.

Here is an example of use:
%
\begin{code}
open greene22_2
discrete Z* # mark Z1-Z8 as discrete
xtab Z1 Z4 ; Z5 Z6
\end{code}
which produces
\begin{code}
Cross-tabulation of Z1 (rows) against Z5 (columns)

       [   1][   2][   3][   4][   5]  TOT.
  
[   0]    20    91    75    93    36    315
[   1]    28    73    54    97    34    286

TOTAL     48   164   129   190    70    601

Pearson chi-square test = 5.48233 (4 df, p-value = 0.241287)

Cross-tabulation of Z1 (rows) against Z6 (columns)

       [   9][  12][  14][  16][  17][  18][  20]  TOT.
  
[   0]     4    36   106    70    52    45     2    315
[   1]     3     8    48    45    37    67    78    286

TOTAL      7    44   154   115    89   112    80    601

Pearson chi-square test = 123.177 (6 df, p-value = 3.50375e-24)

Cross-tabulation of Z4 (rows) against Z5 (columns)

       [   1][   2][   3][   4][   5]  TOT.
  
[   0]    17    60    35    45    14    171
[   1]    31   104    94   145    56    430

TOTAL     48   164   129   190    70    601

Pearson chi-square test = 11.1615 (4 df, p-value = 0.0248074)

Cross-tabulation of Z4 (rows) against Z6 (columns)

       [   9][  12][  14][  16][  17][  18][  20]  TOT.
  
[   0]     1     8    39    47    30    32    14    171
[   1]     6    36   115    68    59    80    66    430

TOTAL      7    44   154   115    89   112    80    601

Pearson chi-square test = 18.3426 (6 df, p-value = 0.0054306)
\end{code}

Pearson's $\chi^2$ test for independence is automatically displayed,
provided that all cells have expected frequencies under independence
greater than $10^{-7}$.  However, a common rule of thumb states that
this statistic is valid only if the expected frequency is 5 or
greater for at least 80 percent of the cells.  If this condition is not
met a warning is printed.

Additionally, the \option{row} or \option{column} options can be
given: in this case, the output displays row or column percentages,
respectively.

If you want to cut and paste the output of \texttt{xtab} to some other
program, e.g.\ a spreadsheet, you may want to use the \option{zeros}
option; this option causes cells with zero frequency to display the
number 0 instead of being empty.

%%% Local Variables: 
%%% mode: latex
%%% TeX-master: "gretl-guide"
%%% End: