
1

/* Apophenia's narrative documentation

2

3

4

/** \mainpage Apophenia--the intro

5

6

Apophenia is an open statistical library for working with data sets and statistical

7

models. It provides functions on the same level as those of the typical stats package

8

(such as OLS, probit, or singular value decomposition) but gives the user more

9

flexibility to be creative in model-building. The core functions are written in C,

10

but experience has shown them to be easy to bind to in Python/Julia/Perl/Ruby/&c.

11

12

It is written to scale well, to comfortably work with gigabyte data sets, million-step simulations, or

13

computationally-intensive agent-based models. If you have tried using other open source

14

tools for computationally demanding work and found that those tools weren't up to the

15

task, then Apophenia is the library for you.

16

17

<h5>The goods</h5>

18

19

The library has been growing and improving since 2005, and has been downloaded over 10,000 times. To date, it has over two hundred functions to facilitate statistical computing, such as:

20

21

\li OLS and family, discrete choice models like probit and logit, kernel density estimators, and other common models

22

\li database querying and maintenance utilities

23

\li moments, percentiles, and other basic stats utilities

24

\li t-tests, F-tests, et cetera

25

\li Several optimization methods available for your own new models

26

\li It does <em>not</em> re-implement basic matrix operations or build yet another database

27

engine. Instead, it builds upon the excellent <a href="http://www.gnu.org/software/gsl/">GNU

28

Scientific</a> and <a href="http://www.sqlite.org/">SQLite</a> libraries. MySQL/mariaDB is also supported.

29

30

For the full list, click the <a href="globals.html">index</a> link from the header.

31

32

<h5><a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.</h5>

33

34

Most users will just want to download the latest packaged version linked from the <a

35

href="https://github.com/b-k/apophenia/archive/pkg.zip">Download

36

Apophenia here</a> header.

37

38

Those who would like to work on a cutting-edge copy of the source code

39

can get the latest version by cutting and pasting the following onto

40

the command line. If you follow this route, be sure to read the development README in the

41

<tt>apophenia</tt> directory this command will create.

42

43

\code
git clone https://github.com/b-k/apophenia.git
\endcode

46

47

<!--git clone git://apophenia.git.sourceforge.net/gitroot/apophenia/apophenia

48

cvs -z3 -d:ext:<i>(your sourceforge login)</i>@cvs.sourceforge.net:/cvsroot/apophenia co -P apophenia

49

cvs -z3 -d:pserver:anonymous@cvs.sf.net:/cvsroot/apophenia checkout -P apophenia

50

svn co https://apophenia.svn.sourceforge.net/svnroot/apophenia/trunk/apophenia -->

51

52

<h5>The documentation</h5>

53

54

To start off, have a look at this \ref gentle "Gentle Introduction" to the library.

55

56

<a href="outline.html">The outline</a> gives a more detailed narrative.

57

58

The <a href="globals.html">index</a> lists every function in the

59

library, with detailed reference

60

information. Notice that the header to every page has a link to the outline and the index.

61

62

To really go in depth, download or pick up a copy of <a

63

href="http://modelingwithdata.org">Modeling with Data</a>, which discusses general

64

methods for doing statistics in C with the GSL and SQLite, as well as Apophenia

65

itself. <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful

66

Algebraic System of Statistical Models</em></a> (PDF) discusses some of the theoretical

67

structures underlying the library.

68

69

There is a <a href="https://github.com/b-k/apophenia/wiki">wiki</a> with some convenience

70

functions, tips, and so on.

71

72

<h5>Notable features</h5>

73

Much of what Apophenia does can be done in any typical statistics package. The \ref

74

apop_data element is much like an R data frame, for example, and there is nothing special

75

about being able to invert a matrix or take the product of two matrices with a single

76

function call (\ref apop_matrix_inverse and \ref apop_dot, respectively).

77

Even more advanced features like Loess smoothing (\ref apop_loess) and the Fisher Exact

78

Test (\ref apop_test_fisher_exact) are not especially Apophenia-specific. But here are

79

some things that are noteworthy.

80

81

\li The text file parser is flexible and effective. Such data files are typically

82

called `CSV files', meaning <em>comma-separated values</em>, but the delimiter can be

83

anything (or even some mix of things), and there is no requirement that text have

84

"special delimiters". Missing data can be specified by a simple blank or a marker

85

of your choosing (e.g., <tt>apop_opts.nan_string = "N/A";</tt>). Or there can be

86

no delimiters, as in the case of fixed-width files. If you are a heavy SQLite user,

87

Apophenia may be useful to you simply for its \ref apop_text_to_db function.

88

89

\li The maximum likelihood system combines a lot of different subsystems into one

90

form: it will do a few flavors of conjugate gradient search, Nelder-Mead Simplex,

91

Newton's Method, or Simulated Annealing. You pick the method by a setting attached to

92

your model. If you want to use a method that requires derivatives and you don't have

93

a closed-form derivative, the ML subsystem will estimate a numerical gradient for

94

you. If you would like to do EM-style maximization (all but the first parameter are

95

fixed, that parameter is optimized, then all but the second parameter are fixed, that

96

parameter is optimized, ..., looping through dimensions until the change in objective

97

across cycles is less than <tt>eps</tt>), just add a settings group specifying the

98

tolerance at which the cycle should stop: <tt>Apop_settings_add_group(your_model,

99

apop_mle, .dim_cycle_tolerance=eps)</tt>.

100

101

\li The Iterative Proportional Fitting algorithm, \ref apop_rake, is best-in-breed,

102

designed to handle large, sparse matrices.

103

104

\li As well as the \ref apop_data structure, Apophenia is built around a model object,

105

the \ref apop_model. This allows for consistent treatment of distributions, regressions,

106

simulations, machine learning models, and who knows what other sorts of models you can

107

dream up. By transforming and combining existing models, it is easy to build complex

108

models from simple sub-models.

109

110

\li For example, the \ref apop_update function does Bayesian updating on any two

111

well-formed models. If they are on the table of conjugates, that is correctly

112

handled, and if they are not, an appropriate variant of MCMC produces an empirical

113

distribution. The output is yet another model, from which you can make random draws,

114

or which you can use as a prior for another round of Bayesian updating. Outside of

115

Bayesian updating, the \ref apop_model_metropolis function is good for approximating

116

other complex models.

117

118

\li Of course, it's a C library, meaning that you can build applications using Apophenia

119

for the data-processing back-end of your program. For example, it is currently used

120

in production for certain aspects of processing for the U.S. Census Bureau's American

121

Community Survey.
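
To make the dimension-cycling maximization mentioned in the maximum likelihood item above concrete, here is a minimal sketch in plain C. It is not Apophenia's implementation: the names \c dim_cycle_maximize, \c maximize_one_dim, and \c example_objective are hypothetical, and the one-dimensional step is a simple ternary search that assumes the objective is unimodal in each coordinate over a fixed bracket. The library itself would use whichever optimization method you attached to your model.

```c
#include <stddef.h>

typedef double (*objective_fn)(const double *params, size_t dim);

// A hypothetical concave objective, maximized at (1.6, -2.4).
double example_objective(const double *p, size_t dim){
    (void) dim;  // this toy objective is always two-dimensional
    return -(p[0] - 1)*(p[0] - 1) - (p[1] + 2)*(p[1] + 2) - 0.5*p[0]*p[1];
}

// Hold every parameter but params[i] fixed, and maximize over params[i]
// by ternary search on [lo, hi]. Assumes f is unimodal in that coordinate.
static void maximize_one_dim(objective_fn f, double *params, size_t dim,
                             size_t i, double lo, double hi){
    for (int step = 0; step < 100; step++){
        double left = lo + (hi - lo)/3, right = hi - (hi - lo)/3;
        params[i] = left;  double f_left  = f(params, dim);
        params[i] = right; double f_right = f(params, dim);
        if (f_left < f_right) lo = left; else hi = right;
    }
    params[i] = (lo + hi)/2;
}

// Loop through the dimensions until a full cycle improves the objective
// by less than eps. Returns the final objective value.
double dim_cycle_maximize(objective_fn f, double *params, size_t dim,
                          double lo, double hi, double eps){
    double last = f(params, dim);
    for (int cycle = 0; cycle < 1000; cycle++){
        for (size_t i = 0; i < dim; i++)
            maximize_one_dim(f, params, dim, i, lo, hi);
        double now = f(params, dim);
        if (now - last < eps) return now;
        last = now;
    }
    return last;
}
```

Calling <tt>dim_cycle_maximize(example_objective, p, 2, -10, 10, 1e-12)</tt> on a starting point of <tt>{0, 0}</tt> walks \c p toward the joint maximum one coordinate at a time. The tolerance plays the same role as <tt>.dim_cycle_tolerance</tt> in the settings group above.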

122

123

124

<h5>Contribute!</h5>

125

126

\li Develop a new model object.

127

\li Contribute your favorite statistical routine.

128

\li Package Apophenia as an RPM, apt, portage, or Cygwin package.

129

\li Report bugs or suggest features.

130

\li Write bindings for your preferred language. For example, here are early versions of <a

131

href="http://modelingwithdata.org/arch/00000173.htm"> a Julia

132

wrapper</a> and <a href="https://r-forge.r-project.org/projects/rapophenia/">an R

133

wrapper</a> which you could expand upon.

134

135

If you're interested, <a href="mailto:fluffmail@f-m.fm">write to the maintainer</a> (Ben Klemens), or join the

136

<a href="https://github.com/b-k/apophenia">GitHub</a> project.

137

*/

138

139

/** \page eg Some examples

140

Here are a few pieces of sample code, gathered from elsewhere in the documentation, for testing your installation or to give you a sense of what code with Apophenia's tools looks like. If you'd like more context or explanation, please click through to the page from which the example was taken.

141

142

In the documentation for the \ref apop_ols model, a program to read in data and run a regression. You'll need to go to that page for the sample data and further discussion.

143

144

From the \ref setup page, an example of gathering data from two processes, saving the input to a database, then doing a later analysis:

145

146

\include draw_to_db.c

147

148

In the \ref outline section on map/apply, a new \f$t\f$-test on every row, with all operations acting on entire rows rather than individual data points:

149

150

\include t_test_by_rows.c

151

152

In the documentation for \ref apop_query_to_text, a program to list all the tables in an SQLite database.

153

\include ls_tables.c

154

155

A demonstration of fixing parameters to create a marginal distribution, via \ref apop_model_fix_params

156

\include fix_params.c

157

158

Several uses of the \ref apop_dot function

159

\include dot_products.c

160

*/

161

162

163

164

/** \page setup Setting up

165

\section cast The supporting cast

166

To use Apophenia, you will need to have a working C compiler, the GSL (v1.7 or higher) and SQLite installed.

167

168

\li Some readers are unfamiliar with modern package managers and common methods for setting up a C development environment; see

169

<a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a> of <em> Modeling with Data</em> for an introduction.

170

171

\li Other pages on this site have a few more notes for \ref windows "Windows" users or \ref mingw users.

172

173

\li Install the basics using your package manager. E.g., try

174

175

\code
sudo apt-get install make gcc libgsl0-dev libsqlite3-dev
\endcode

178

179

or

180

181

\code
sudo yum install make gcc gsl-devel libsqlite3x-devel
\endcode

184

185

\li <a href="https://github.com/b-k/apophenia/archive/pkg.zip">Download Apophenia here</a>.

186

187

\li Once you have the library downloaded, compile it using

188

189

\code
tar xvzf apop*tgz && cd apophenia-0.999
./configure && make && sudo make install && make check
\endcode

193

194

If you decide not to keep the library on your system, run <tt>sudo make uninstall</tt>

195

from the source directory to remove it.

196

197

\li A \ref makefile will help immensely when you want to compile your program.

198

199

200

\subsection sample_program Sample programs

201

Here is a sample program so you can test your setup. There is another short,

202

complete program in the \ref apop_ols entry which runs a simple OLS regression on a

203

data file. Follow the instructions there to compile and run.

204

205

The sample program here is intended to show how one would integrate Apophenia into an existing program. For example, say that you are running a simulation of two different treatments, or say that two sensors are posting data at regular intervals. You need to gather the data in an organized form, and then ask questions of the resulting data set. Below, a thousand draws are made from the two processes and put into a database. Then, the data is pulled out, some simple statistics are compiled, and the data is written to a text file for inspection outside of the program. This program will compile cleanly with the sample \ref makefile.

206

207

\include draw_to_db.c

208

209

*/

210

211

/** \page windows The Windows page

212

213

\ref mingw users, see that page.

214

215

If you have a choice, <a href="http://www.cygwin.com">Cygwin</a> is strongly recommended. The setup program is

216

very self-explanatory. As a warning, it will probably take up >300MB on

217

your system. You should install at least the following programs:

218

\li autoconf/automake

219

\li binutils

220

\li gcc

221

\li gdb

222

\li gnuplot -- for plotting data

223

\li groff -- needed for the man program, below

224

\li gsl -- the engine that powers Apophenia

225

\li less -- to read text files

226

\li libtool -- needed for compiling programs

227

\li make

228

\li man -- for reading help files

229

\li more -- not as good as less but still good to have

230

If you are missing anything else, the program will probably tell you.

231

The following are not necessary but are good to have on hand as long as you are going to be using Unix and programming.

232

\li svn -- to partake in the versioning system

233

\li emacs -- steep learning curve, but people love it

234

\li ghostscript (for reading .ps/.pdf files)

235

\li openssh -- needed for cvs

236

\li perl, python, ruby -- these are other languages that you might also be interested in

237

\li tetex -- write up your documentation using the nicest-looking formatter around

238

\li X11 -- a windowing system

239

X-Window will give you a nicer environment in which to work. After you start Cygwin, just type <tt>startx</tt> to bring up a more usable, nice-looking terminal (and the ability to do a few thousand other things which are beyond the scope of this documentation). Once you have Cygwin installed and a good terminal running, you can follow along with the remainder of the discussion without modification.

240

241

First, SQLite3 is difficult to build from scratch, but you can get a packaged version by pointing Cygwin's install program to the Cygwin Ports site: http://cygwinports.dotsrc.org/ .

242

243

Second, some older (but still pretty recent) versions of Cygwin have a search.h file which doesn't include the function lsearch(). If this is the case on your system, you will have to update your Cygwin installation.

244

245

Finally, Windows compilers often spit out lines like:

246

\code
Info: resolving _gsl_rng_taus by linking to __imp__gsl_rng_taus (auto-import)
\endcode

249

These lines are indeed just information, and not errors. Feel free to ignore them.

250

251

[Thanks to Andrew Felton and Derrick Higgins for their Cygwin debugging efforts.]

252

*/

253

254

/** \page notroot Not root?

255

If you aren't root, then you will need to create a subdirectory in your home directory in which to install packages. The GSL and SQLite installations will go like this. The key is the <tt>--prefix</tt> addition to the <tt>./configure</tt> command.

256

\code
export MY_LIBS=src #choose a directory name to be created in your home directory.
tar xvzf pkg.tgz #change pkg.tgz to the appropriate name
cd package_dir #same here.
mkdir $HOME/$MY_LIBS
./configure --prefix $HOME/$MY_LIBS
make
make install #Now you don't have to be root.
echo "export LD_LIBRARY_PATH=$HOME/$MY_LIBS/lib:\$LD_LIBRARY_PATH" >> ~/.bashrc
\endcode

266

*/

267

268

269

/** \page makefile Makefile

270

271

Instead of giving lengthy compiler commands at the command prompt, you can use a Makefile to do most of the work. How to:

272

\li Copy and paste the following into a file named \c makefile.

273

\li Change the first line to the name of your program (e.g., if you have written <tt>sample.c</tt>, then the first line will read <tt>PROGNAME=sample</tt>).

274

\li If your program has multiple <tt>.c</tt> files, add a corresponding <tt>.o</tt> to the currently blank <tt>objects</tt> variable, e.g. <tt>objects=sample2.o sample3.o</tt>

275

\li Once you have a Makefile in the directory, simply type <tt>make</tt> at the command prompt to generate the executable.

276

277

\code
PROGNAME = your_program_name_here
objects =
CFLAGS = -g -Wall -O3
LDLIBS = -lapophenia -lgsl -lgslcblas -lsqlite3

$(PROGNAME): $(objects)
\endcode

285

286

\li If your system has \c pkg-config, then you can use it for a slightly more robust and readable makefile. Replace the above C and link flags with:

287

\code
CFLAGS = -g -Wall `pkg-config --cflags apophenia` -O3
LDLIBS = `pkg-config --libs apophenia`
\endcode

291

The \c pkg-config program will then fill in the appropriate directories and libraries. Pkg-config knows Apophenia depends on the GSL and database libraries, so you need only list the most-dependent library.

292

293

\li The -O3 flag is optional, asking the compiler to run its highest level of optimization (for speed).

294

295

\li GCC users may need the <tt>--std=gnu99</tt> or <tt>--std=gnu11</tt> flag to use post-1989 C standards.

296

297

\li Order matters in the linking list: the files a package depends on should be listed after the package. E.g., since sample.c depends on Apophenia, <tt>gcc sample.c -lapophenia</tt> will work, while <tt>gcc -lapophenia sample.c</tt> is likely to give you errors. Similarly, list <tt>-lapophenia</tt> before <tt>-lgsl</tt>, which comes before <tt>-lgslcblas</tt>.

298

299

*/

300

301

/** \page designated Designated initializers

302

303

Functions so marked in this documentation use standard C designated initializers and compound literals to allow you to omit, call by name, or change the order of inputs. The following examples are all equivalent.

304

305

The standard format:

306

\code
apop_text_to_db("infile.txt", "intable", 0, 1, NULL);
\endcode

309

310

Omitted arguments are left at their default values:

311

\code
apop_text_to_db("infile.txt", "intable");
\endcode

314

315

You can use the variable's name, if you forget its ordering:

316

\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, .has_row_name=0);
\endcode

319

320

If an un-named element follows a named element, then that value is given to the next variable in the standard ordering:

321

\code
apop_text_to_db("infile.txt", "intable", .has_col_name=1, NULL);
\endcode

324

325

\li There may be cases where you can not use this form (it relies on a macro, which

326

may not be available). You can always call the underlying function directly, by adding

327

\c _base to the name and giving all arguments:

328

329

\code
apop_text_to_db_base("infile.txt", "intable", 0, 1, NULL);
\endcode

332

333

\li If one of the optional elements is an RNG and you do not provide one, I use one

334

from \ref apop_rng_get_thread.

335

336

\li For exhaustive details on implementation of the above (should you wish to write

337

new functions that behave like this) see the \ref optionaldetails page.
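
For a quick sense of how a macro, a designated initializer, and a compound literal can combine to give this calling form, here is a minimal self-contained sketch. It is an illustration under my own assumptions, not the library's actual machinery: \c count_fields, \c count_fields_args, and \c count_fields_wrapper are hypothetical names, and treating a zero-valued member as "not given" is one simple way to fill in defaults.

```c
#include <stdio.h>

// Hypothetical argument struct: one member per input, in the standard order.
typedef struct {
    const char *line;   // the text to split
    char delimiter;     // zero means "not given"; the default is ','
    char skip_empty;    // 'y' to ignore empty fields; the default is 'n'
} count_fields_args;

// The _base form: the real function, with every argument spelled out.
int count_fields_base(const char *line, char delimiter, char skip_empty){
    int count = 0, length = 0;
    for (const char *c = line; ; c++){
        if (*c == delimiter || *c == '\0'){
            if (length > 0 || skip_empty != 'y') count++;
            length = 0;
            if (*c == '\0') break;
        } else length++;
    }
    return count;
}

// Fill in defaults for any member the caller left at zero, then call the base.
int count_fields_wrapper(count_fields_args in){
    char delimiter  = in.delimiter  ? in.delimiter  : ',';
    char skip_empty = in.skip_empty ? in.skip_empty : 'n';
    return count_fields_base(in.line, delimiter, skip_empty);
}

// The macro packs whatever the caller wrote into a compound literal, so
// positional, named, reordered, and omitted arguments all work.
#define count_fields(...) count_fields_wrapper((count_fields_args){__VA_ARGS__})
```

With these definitions, <tt>count_fields("a,b,,c")</tt>, <tt>count_fields("a,b,,c", .skip_empty='y')</tt>, and <tt>count_fields(.delimiter='|', .line="a|b|c")</tt> are all valid calls. The weakness of the zero-means-default rule is that a caller cannot explicitly ask for a zero value, which is one reason the real implementation is more involved.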

338

*/

339

340

/** \defgroup global_vars The global variables */

341

/** \defgroup mle Maximum likelihood estimation */

342

/** \defgroup command_line "Command line programs" */

343

344

345

/** \page admin The admin page

346

347

Just a few page links:

348

\li <a href="todo.html">The to-do list</a>
\li <a href="bug.html">The known bug list</a>

350

\li <a href="modules.html">The documentation page list</a>

351

*/

352

353

/** \page outline An outline of the library

354

355

ALLBUTTON

356

357

Outlineheader preliminaries Getting started

358

359

If you are entirely new to Apophenia, \ref gentle "have a look at the Gentle Introduction here".

360

361

As well as the information in this outline, there is a separate page covering the details of

362

\ref setup "setting up a computing environment" and another page with \ref eg "some sample code" for your perusal.

363

364

For another concrete example of folding Apophenia into a project, have a look at this \ref sample_program "sample program".

365

366

367

368

Outlineheader c Some notes on C and Apophenia's use of C utilities.

369

370

Outlineheader learning Learning C

371

372

<a href="http://modelingwithdata.org">Modeling with Data</a> has a full tutorial for C, oriented at users of standard stats packages. More nuts-and-bolts tutorials are <a href="http://www.google.com/search?hl=es&c2coff=1&q=c+tutorial">in abundance</a>. Some people find pointers to be especially difficult; fortunately, there's a <a href="http://www.youtube.com/watch?v=6pmWojisM_E">claymation cartoon</a> which clarifies everything.

373

374

Coding often relies on gathering together many libraries; there is a section at the bottom of this outline linking to references for some libraries upon which Apophenia builds.

375

376

endofdiv

377

378

Outlineheader usagenotes Usage notes

379

380

Here are some notes about the technical details of using the Apophenia library in your development environment.

381

382

<b> Header aggregation </b>

383

384

There is only one header. Put

385

\code
#include <apop.h>
\endcode

388

at the top of your file, and you're done. Everything declared in that file starts with \c apop_ (or \c Apop_).

389

390

<b>Linking</b>

391

392

You will need to link to the Apophenia library, which involves adding the <tt>-lapophenia</tt> flag to your compiler. Apophenia depends on SQLite3 and the GNU Scientific Library (which depends on a BLAS), so you will probably need something like:

393

394

\code
gcc sample.c -lapophenia -lsqlite3 -lgsl -lgslcblas -o run_me -g -Wall -O3
\endcode

397

398

Your best bet is to encapsulate this mess in a \ref makefile "Makefile". Even if you are using an IDE and its command-line management tools, see the Makefile page for notes on useful flags.

399

400

<b>Debugging</b>

401

The end of <a href="http://modelingwithdata.org/appendix_o.html">Appendix O</a>

402

of <em>Modeling with Data</em> offers some GDB macros which can make dealing with

403

Apophenia from the GDB command line much more pleasant. As per the next section, it

404

also helps to set <tt>apop_opts.stop_on_warning='v'</tt> or <tt>'w'</tt> when running

405

under the debugger.

406

407

<b>Standards compliance</b>

408

409

To the best of our abilities, Apophenia complies with the C standard (ISO/IEC 9899:2011).

410

411

<b> Easier calling syntax</b>

412

413

Many functions accept optional named arguments. For example:

415

\code
apop_vector_distance(v1, v2); //assumes Euclidean distance
apop_vector_distance(v1, .metric='M'); //assumes v2=0, uses Manhattan metric.
\endcode

419

420

See the \ref designated page for details of this syntax.

421

422

endofdiv

423

424

425

endofdiv

426

427

Outlineheader debugging Errors, logging, debugging and stopping

428

429

<h5>The \c error element</h5>

430

431

The \ref apop_data set and the \ref apop_model both include an element named \c error. It is normally \c 0, indicating no (known) error.

432

433

For example, \ref apop_data_copy detects allocation errors and some circular links

434

(when <tt>Data->more == Data</tt>) and fails in those cases. You could thus use the

435

function with a form like

436

437

\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error) {printf("Couldn't copy the input data; failing.\n"); return 1;}
\endcode

442

443

There is sometimes (but not always) benefit to handling specific error codes, which are listed in the documentation of those functions that set the \c error element. E.g.,

444

445

\code
apop_data *d = apop_text_to_data("indata");
apop_data *cp = apop_data_copy(d);
if (cp->error == 'a') {printf("Couldn't allocate space for the copy; failing.\n"); return 1;}
if (cp->error == 'c') {printf("Circular link in the data set; failing.\n"); return 2;}
\endcode

451

452

453

<h5>Verbosity level and logging</h5>

454

455

The global variable <tt>apop_opts.verbose</tt> determines how many notifications and warnings get printed by Apophenia's warning mechanism:

456

457

-1: turn off logging, print nothing (ill-advised) <br>

458

0: notify only of failures and clear danger <br>

459

1: warn of technically correct but odd situations that might indicate, e.g., numeric instability <br>

460

2: debugging-type information; print queries <br>

461

3: give me everything, such as the state of the data at each iteration of a loop.

462

463

These levels are of course subjective, but should give you some idea of where to place the

464

verbosity level. The default is 1.

465

466

The messages are printed to the \c FILE handle at <tt>apop_opts.log_file</tt>. If

467

this is blank (which happens at startup), then this is set to \c stderr. This is the

468

typical behavior for a console program. Use

469

470

\code
apop_opts.log_file = fopen("mylog", "w");
\endcode

473

474

to write to the \c mylog file instead of \c stderr.

475

476

As well as the error and warning messages, some functions can also print diagnostics,

477

using the \ref Apop_notify macro. For example, \ref apop_query and friends will print the

478

query sent to the database engine iff <tt>apop_opts.verbose >=2</tt> (which is useful

479

when building complex queries). The diagnostics attempt to follow

480

the same verbosity scale as the warning messages.
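
The filtering rule described above is simple enough to sketch in a few lines of plain C. This is an illustration, not the library's code: \c notify and the \c opts struct are hypothetical stand-ins for the \ref Apop_notify macro and for <tt>apop_opts.verbose</tt> and <tt>apop_opts.log_file</tt>, and the assumption that a message prints when its level is at or below the verbosity setting is my reading of the scale.

```c
#include <stdio.h>

// Hypothetical stand-ins for apop_opts.verbose and apop_opts.log_file.
typedef struct {
    int verbose;      // -1 through 3, as in the scale above
    FILE *log_file;   // NULL until the first message is written
} log_opts;

static log_opts opts = {.verbose = 1};

// Write msg if its level is at or below the current verbosity.
// Returns 1 if the message was written, 0 if it was filtered out.
int notify(int level, const char *msg){
    if (level > opts.verbose) return 0;           // at -1, nothing gets through
    if (!opts.log_file) opts.log_file = stderr;   // lazy default, set on first use
    fprintf(opts.log_file, "%s\n", msg);
    return 1;
}
```

At the default verbosity of 1, <tt>notify(0, "failure")</tt> prints and <tt>notify(2, "select * from t")</tt> does not. Raising <tt>opts.verbose</tt> to 2 lets the query through.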

481

482

<h5>Stopping</h5>

483

484

Warnings and errors never halt processing. It is up to the calling function to decide

485

whether to stop.

486

487

When running the program under a debugger, this is an annoyance: we want to stop as

488

soon as a problem turns up.

489

490

The global variable <tt>apop_opts.stop_on_warning</tt> changes when the system halts:

491

492

\c 'n': never halt. If you were using Apophenia to support a user-friendly GUI, for example, you would use this mode.<br>

493

The default: if the variable is <tt>'\0'</tt> (the default), halt on severe errors, continue on all warnings.<br>

494

\c 'v': If the verbosity level of the warning is such that the warning would print to screen, then halt;

495

if the warning message would be filtered out by your verbosity level, continue.<br>

496

\c 'w': Halt on all errors or warnings, including those below your verbosity threshold.

497

498

See the documentation for individual functions for details on how each reports errors to the caller and the level at which warnings are posted.
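
The four modes above amount to a small decision table, sketched here in plain C. The function \c should_halt is hypothetical, not part of the library. One detail is my own assumption, since the text does not spell it out: under \c 'v', a severe error halts regardless of verbosity.

```c
// Decide whether to halt, given a stop_on_warning setting.
//   is_error: nonzero for a severe error, zero for a warning
//   level:    the verbosity level at which the message is posted
//   verbose:  the current verbosity setting
int should_halt(char stop_on_warning, int is_error, int level, int verbose){
    switch (stop_on_warning){
        case 'n': return 0;                              // never halt
        case 'w': return 1;                              // halt on everything
        case 'v': return is_error || level <= verbose;   // halt if it would print
        default:  return is_error != 0;                  // severe errors only
    }
}
```

For example, with the verbosity at 1, a level-2 warning halts under \c 'w' but not under \c 'v'.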

499

500

endofdiv

501

502

Outlineheader About SQL, the syntax for querying databases

503

504

For a reference, your best bet is the <a href="http://www.sqlite.org/lang.html">Structured Query Language reference</a> for SQLite. For a tutorial, there is an abundance of <a href="http://www.google.com/search?q=sql+tutorial">tutorials online</a>. Here is a nice blog <a href="http://fluff.info/blog/arch/00000118.htm">entry</a> about complementarities between SQL and matrix manipulation packages.

505

506

Apophenia currently supports two database engines: SQLite and mySQL/mariaDB. SQLite is the default, because it is simpler and generally more easygoing than mySQL, and supports in-memory databases.

507

508

The global <tt>apop_opts.db_engine</tt> is initially \c NUL, indicating no preference

509

for a database engine. You can explicitly set it:

510

511

\code
apop_opts.db_engine='s'; //use SQLite
apop_opts.db_engine='m'; //use mySQL/mariaDB
\endcode

515

516

If \c apop_opts.db_engine is still \c NUL on your first database operation, then I will check

517

for an environment variable <tt>APOP_DB_ENGINE</tt>, and set

518

<tt>apop_opts.db_engine='m'</tt> if it is found and matches (case insensitive) \c mariadb or \c mysql.

519

520

\code
export APOP_DB_ENGINE=mariadb
apop_text_to_db indata mtab db_for_maria

unset APOP_DB_ENGINE
apop_text_to_db indata stab db_for_sqlite.db
\endcode
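
The selection rule just described can be sketched in plain C as follows. The function names are hypothetical and this is not the library's code. Falling back to \c 's' when the environment variable is absent or unrecognized is my assumption, based on SQLite being the default engine.

```c
#include <ctype.h>
#include <stdlib.h>

// Case-insensitive string equality, written out to stay within standard C.
static int same_ignoring_case(const char *a, const char *b){
    for (; *a && *b; a++, b++)
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
    return *a == *b;   // equal only if both strings ended together
}

// current:   the existing engine setting, or zero if none has been chosen.
// env_value: the value of APOP_DB_ENGINE, or NULL if it is not set.
char choose_db_engine(char current, const char *env_value){
    if (current) return current;   // an explicit setting always wins
    if (env_value && (same_ignoring_case(env_value, "mariadb")
                   || same_ignoring_case(env_value, "mysql")))
        return 'm';
    return 's';                    // assumed fallback: SQLite
}

// Convenience form that reads the environment itself.
char choose_db_engine_from_env(char current){
    return choose_db_engine(current, getenv("APOP_DB_ENGINE"));
}
```

Separating the comparison from the \c getenv call keeps the rule easy to check by hand: <tt>choose_db_engine(0, "MariaDB")</tt> gives \c 'm', while an explicit \c 's' is left alone whatever the environment says.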

527

528

Finally, Apophenia provides a few nonstandard SQL functions to facilitate math via database; see \ref db_moments.

529

endofdiv

530

531

Outlineheader threads Threading

532

533

Apophenia uses OpenMP for threading. You generally do not need to know how OpenMP works

534

to use Apophenia, and many points of work will thread without your doing anything.

535

536

\li All functions strive to be thread-safe. Part of how this is achieved is that static

537

variables are marked as thread-local or atomic, as per the C standard. There still

538

exist compilers that can't implement thread-local or atomic variables, in which case

539

your safest bet is to set OMP's thread count to one as below (or get a new compiler).

540

541

\li Some functions modify their inputs. It is up to you to use those functions in

542

a thread-safe manner. The \ref apop_matrix_realloc handles states and global variables

543

correctly in a threaded environment, but if you have two threads resizing the same \c

544

gsl_matrix at the same time, you're going to have problems.

545

546

\li There are few compilers that don't support OpenMP. Clang on MacOS may be the only

547

current mainstream example as of this writing, and they are hard at work on implementing

548

it. In the meantime, when compiling on such a system, all work will be single-threaded.

549

550

\li Set the maximum number of threads to \c N with the environment variable

551

552

\code
export OMP_NUM_THREADS=N
\endcode

555

556

or the C function

557

558

\code
#include <omp.h>
omp_set_num_threads(N);
\endcode

562

563

Use one of these methods with <tt>N=1</tt> if you want a single-threaded program.

564

565

\li \ref apop_map and friends distribute their \c for loop over the input \ref apop_data

566

set across multiple threads. Therefore, be careful to send thread-unsafe functions to

567

it only after calling \c omp_set_num_threads(1).

568

569

\li There are a few functions, like \ref apop_model_draws, that rely on \ref apop_map, and

570

therefore also thread by default.

571

572

\li The function \ref apop_rng_get_thread retrieves a statically-stored RNG specific

573

to a given thread. Therefore, if you use that function in the place of a \c gsl_rng,

574

you can parallelize functions that make random draws.

575

576

\li \ref apop_rng_get_thread allocates its store of per-thread RNGs by seeding each new RNG with <tt>apop_opts.rng_seed</tt>,
then incrementing that seed by one. You thus probably have threads with seeds 479901,

578

479902, 479903, .... [If you have a better way to do it, please feel free to modify the

579

code to implement your improvement and submit a pull request on Github.]
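
As a small illustration of the points above about thread counts and thread-safe loops, here is a sketch in plain C that compiles with or without OpenMP. The function names are hypothetical. Without OpenMP, the pragma is ignored and the loop simply runs on one thread.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

// Sum of squares, with iterations split across threads when OpenMP is on.
// The reduction clause gives each thread a private subtotal, so the loop
// body touches no shared state and is safe to run in parallel.
double sum_of_squares(const double *x, int n){
    double total = 0;
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += x[i] * x[i];
    return total;
}

// Cap the thread count if OpenMP is available; otherwise do nothing.
// Returns the number of threads a parallel region may now use.
int limit_threads(int n){
#ifdef _OPENMP
    omp_set_num_threads(n);
    return omp_get_max_threads();
#else
    (void) n;
    return 1;
#endif
}
```

Calling <tt>limit_threads(1)</tt> before a loop like this one is the in-program equivalent of setting <tt>OMP_NUM_THREADS</tt> to one.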

580

581

See <a href="http://modelingwithdata.org/arch/00000175.htm">this tutorial on C

582

threading</a> if you would like to know more, or are unsure about whether your functions

583

are thread-safe or not.
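
The per-thread RNG store with incrementing seeds, described in the list above, can be pictured with the following toy sketch. Everything here is hypothetical: \c toy_rng stands in for a \c gsl_rng and records only its seed, the store has a fixed size, and the starting seed of 479901 is taken from the example above. A real implementation would also guard the seeding loop against two threads entering it at once.

```c
#include <stddef.h>

#define MAX_THREADS 64

// A stand-in for an RNG: it only remembers the seed it was given.
typedef struct { unsigned long seed; } toy_rng;

static toy_rng store[MAX_THREADS];
static unsigned long next_seed = 479901;
static int seeded_count;   // how many slots have been seeded so far

// Return the RNG belonging to the given thread number, seeding any
// not-yet-used slots in order so that thread k gets seed 479901 + k.
// Not itself thread-safe: a real version needs a critical section here.
toy_rng *rng_for_thread(int thread){
    if (thread < 0 || thread >= MAX_THREADS) return NULL;
    while (seeded_count <= thread){
        store[seeded_count].seed = next_seed;
        next_seed++;
        seeded_count++;
    }
    return &store[thread];
}
```

Because each thread always gets the same slot back, draws made on one thread never disturb the stream seen by another.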

584

585

endofdiv

586

587

Outlineheader mwd The book version

588

589

Apophenia co-evolved with <em>Modeling with Data: Tools and Techniques for Statistical Computing</em>. You can read about the book, or download a free PDF copy of the full text, at <a href="http://modelingwithdata.org">modelingwithdata.org</a>.

590

591

If you are at this site, there is probably something there for you, including a tutorial on C and general computing form, SQL for data-handling, several chapters of statistics from various perspectives, and more details on working with Apophenia.

592

593

As with many computer programs, the preferred manner of citing Apophenia is to cite its related book.

594

Here is a BibTeX-formatted entry giving the relevant information:

595

596

\code
@book{klemens:modeling,
title = "Modeling with Data: Tools and Techniques for Statistical Computing",
author="Ben Klemens",
year=2008,
publisher="Princeton University Press"
}
\endcode

604

605

The rationale for the \ref apop_model struct, based on an algebraic system of models, is detailed in a <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">U.S. Census Bureau research report</a>.

606

607

endofdiv

608

609

Outlineheader status What is the status of the code?

610

611

[This section last updated 3 August 2014.]

612

613

Apophenia was first posted to SourceForge in February 2005, which means that we've had

614

several years to develop and test the code in real-world applications.

615

616

The test suite, including the sample code and solution set for <em>Modeling with Data</em>,

617

is about 5,500 lines over 135 files. gprof reports that it covers over 90% of the

618

7,700 lines in Apophenia's code base. A broad rule of thumb for any code base is

619

that the well-worn parts, in this case functions like \ref apop_data_get and \ref

620

apop_normal's <tt>log_likelihood</tt>, are likely to be entirely reliable, while the

621

out-of-the-way functions (maybe the score for the Beta distribution) are worth a bit

622

of caution. Close to all of the code has been used in production, so all of it was at

623

least initially tested against real-world data.

624

625

It is currently at version 0.999, which is intended to indicate that it is substantially

626

complete. Of course, a library for scientific computing, or even for that small subset

627

that is statistics, will never cover all needs and all methods. But as it stands

628

Apophenia's framework, based on the \ref apop_data and \ref apop_model, is basically

629

internally consistent, has enough tools that you can get common work done quickly,

630

and is reasonably fleshed out with a good number of models out of the box.

631

632

The \ref apop_data structure is set, and there are enough functions there that you could

633

use it as a subpackage by itself (especially in tandem with the database functions)

634

for nontrivial dealings with data.

635

636

The \ref apop_model structure is much more ambitious---Apophenia is really intended

637

to be a novel system for developing models---and its internals can still be improved.

638

The promise underlying the structure is that you can provide just one item, such as

639

an RNG or a likelihood function, and the structure will do all of the work to fill in

640

computationally-intensive methods for everything else; see \ref settingswriting for

641

the details. Some directions aren't quite there yet (such as RNG -> most other things),

642

the PMF model needs an internal index for faster lookups, and so on. Readers are invited
to contribute better methods (such as an alternate means of estimating mixture models),
fill in more of the existing models (for example, write a dlog likelihood function for
a model that does not currently have one), or submit new standard models not yet included.

646

647

648

endofdiv

649

650

Outlineheader ext How do I write extensions?

651

652

It's not a package, so you don't need an API---write your code and <tt>include</tt>

653

it like any other C code. The system is written to not require a registration or

654

initialization step to add a new model or other such parts. A new \ref apop_model

655

has to conform to some rules if it is to play well with \ref apop_estimate,

656

\ref apop_draw, and so forth. See the notes at \ref modeldetails. Once your new

657

model or function is working, please post the code or a link to the code on the <a

658

href="https://github.com/b-k/apophenia/wiki">Apophenia wiki</a>.

659

660

endofdiv

661

662

endofdiv

663

664

Outlineheader dataoverview Data sets

665

666

The \ref apop_data structure represents a data set. It joins together a \c gsl_vector, a \c gsl_matrix, an \ref apop_name, and a table of strings. It tries to be lightweight, so you can use it everywhere you would use a \c gsl_matrix or a \c gsl_vector.

667

668

If you are viewing the HTML documentation, here is a diagram showing a sample data set with all of the elements in place. Together, they represent a data set where each row is an observation, which includes both numeric and text values, and where each row/column may be named.

669

670

\htmlinclude apop_data_fig.html

671

672

In a regression, the vector would be the dependent variable, and the other columns

673

(after factor-izing the text) the independent variables. Or think of the \ref apop_data

674

set as a partitioned matrix, where the vector is column -1, and the first column of

675

the matrix is column zero. Here is some code to print the entire matrix, starting at

676

column -1.

677

678

\code
for (int j = 0; j < (int)data->matrix->size1; j++){
    printf("%s\t", apop_name_get(data->names, j, 'r'));
    //Cast to int: comparing -1 to an unsigned size_t would end the loop before it starts.
    for (int i = -1; i < (int)data->matrix->size2; i++)
        printf("%g\t", apop_data_get(data, j, i));
    printf("\n");
}
\endcode
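
To make the "column -1" convention concrete, here is a minimal plain-C sketch of a partitioned data set. It does not use Apophenia or the GSL; the \c mini_data type and the \c mini_get and \c mini_row_sum functions are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical miniature of a partitioned data set: a vector plus a
// row-major matrix with the same number of rows.
typedef struct {
    const double *vector;  // one entry per row; addressed as column -1
    const double *matrix;  // rows*cols entries, row-major
    size_t rows, cols;
} mini_data;

// Column -1 reads from the vector; columns 0..cols-1 read from the matrix.
double mini_get(const mini_data *d, size_t row, int col){
    assert(d != NULL && row < d->rows && col >= -1 && col < (int)d->cols);
    return (col == -1) ? d->vector[row]
                       : d->matrix[row * d->cols + (size_t)col];
}

// Sum one row across the vector and the matrix. The cast to int matters:
// comparing a signed -1 against an unsigned size_t would convert -1 to a
// huge positive value, and the loop body would never run.
double mini_row_sum(const mini_data *d, size_t row){
    double total = 0;
    for (int col = -1; col < (int)d->cols; col++)
        total += mini_get(d, row, col);
    return total;
}
```

Treating the vector as one more column lets a single loop index cover both parts, which is the reason for the convention.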

686

687

Most functions assume that the data vector, data matrix, and text have the same row count: \c data->vector->size==data->matrix->size1 and \c data->vector->size==*data->textsize. This means that the \ref apop_name structure doesn't have separate \c vector_names, \c row_names, or \c text_row_names elements: the \c rownames are assumed to apply for all.

688

689

See below for notes on easily managing the \c text element and the row/column names.

690

691

The \ref apop_data set includes a \c more pointer, which will typically

692

be \c NULL, but may point to another \ref apop_data set. This is

693

intended for a main data set and a second or third page with auxiliary

694

information: estimated parameters on the front page and their

695

covariance matrix on page two, or predicted data on the front page and

696

a set of prediction intervals on page two. \c apop_data_copy and \c apop_data_free

697

will handle all the pages of information. The \c more pointer is not

698

intended as a linked list for millions of data points---you can probably

699

find a way to restructure your data to use a single table (perhaps via

700

\ref apop_data_pack and \ref apop_data_unpack).
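
The page mechanism is a short singly-linked chain. The following plain-C sketch shows the idea under simplifying assumptions: the \c page struct, \c find_page, and \c page_count are hypothetical names, and each page carries a single number in place of a full data set.

```c
#include <assert.h>
#include <string.h>

// Hypothetical miniature of a paged data set: each page has a name and a
// pointer to the next page, which is NULL on the last page.
typedef struct page {
    const char *name;
    double value;         // stand-in for the page's actual contents
    struct page *more;
} page;

// Walk the chain and return the first page with the given name, or NULL.
page *find_page(page *first, const char *name){
    assert(name != NULL);
    for (page *p = first; p; p = p->more)
        if (p->name && !strcmp(p->name, name))
            return p;
    return NULL;
}

// Count the pages in the chain, including the first.
size_t page_count(const page *first){
    size_t ct = 0;
    for (const page *p = first; p; p = p->more) ct++;
    return ct;
}
```

Every lookup walks the chain from the front, which is fine for two or three pages and is one reason the pointer is a poor fit for millions of entries.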

701

702

There are a great many functions to collate, copy, merge, sort, prune, and otherwise

703

manipulate the \ref apop_data structure and its components.

704

705

\li\ref apop_data_add_named_elmt()

706

\li\ref apop_data_copy()

707

\li\ref apop_data_fill()

708

\li\ref apop_data_memcpy()

709

\li\ref apop_data_pack()

710

\li\ref apop_data_rm_columns()

711

\li\ref apop_data_sort()

712

\li\ref apop_data_split()

713

\li\ref apop_data_stack()

714

\li\ref apop_data_transpose()

715

\li\ref apop_data_unpack()

716

\li\ref apop_matrix_copy()

717

\li\ref apop_matrix_realloc()

718

\li\ref apop_matrix_rm_columns()

719

\li\ref apop_matrix_stack()

720

\li\ref apop_text_add()

721

\li\ref apop_text_paste()

722

\li\ref apop_text_to_data()

723

\li\ref apop_vector_bounded()

724

\li\ref apop_vector_copy()

725

\li\ref apop_vector_fill()

726

\li\ref apop_vector_stack()

727

\li\ref apop_vector_realloc()

728

729

Apophenia builds upon the GSL, but it would be inappropriate to redundantly replicate

730

the GSL's documentation here. You will find a link to the full GSL documentation at

731

the end of this outline. Meanwhile, here are prototypes for a few common functions. The GSL's

732

naming scheme is very consistent, so a simple reminder of the function name may be

733

sufficient to indicate how they are used.

734

735

\li <tt>gsl_matrix_swap_rows (gsl_matrix * m, size_t i, size_t j)</tt>

736

\li <tt>gsl_matrix_swap_columns (gsl_matrix * m, size_t i, size_t j)</tt>

737

\li <tt>gsl_matrix_swap_rowcol (gsl_matrix * m, size_t i, size_t j)</tt>

738

\li <tt>gsl_matrix_transpose_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>

739

\li <tt>gsl_matrix_transpose (gsl_matrix * m) : square matrices only</tt>

740

\li <tt>gsl_matrix_set_all (gsl_matrix * m, double x)</tt>

741

\li <tt>gsl_matrix_set_zero (gsl_matrix * m)</tt>

742

\li <tt>gsl_matrix_set_identity (gsl_matrix * m)</tt>

743

\li <tt>void gsl_vector_set_all (gsl_vector * v, double x)</tt>

744

\li <tt>void gsl_vector_set_zero (gsl_vector * v)</tt>

745

\li <tt>int gsl_vector_set_basis (gsl_vector * v, size_t i)</tt>: set all elements to zero, but set item \f$i\f$ to one.

746

\li <tt>gsl_vector_reverse (gsl_vector * v)</tt>: reverse the order of your vector's elements

747

\li <tt>gsl_vector_ptr</tt> and <tt>gsl_matrix_ptr</tt>. To increment an element in a vector use, e.g., <tt>*gsl_vector_ptr(v, 7) += 3;</tt> or <tt>(*gsl_vector_ptr(v, 7))++</tt>.

748

749

Outlineheader readin Reading from text files

750

751

The \ref apop_text_to_data() function takes in the name of a text file with a grid of data in (comma|tab|pipe|whatever)-delimited format and reads it to a matrix. If there are names in the text file, they are copied in to the data set. See \ref text_format for the full range and details of what can be read in.

752

753

If you have any columns of text, then you will need to read in via the database: use

754

\ref apop_text_to_db() to convert your text file to a database table,

755

do any database-appropriate cleaning of the input data, then use \ref

756

apop_query_to_data() or \ref apop_query_to_mixed_data() to pull the data to an \ref apop_data set.

757

758

endofdiv

759

760

Outlineheader datalloc Alloc/free

761

762

\li\ref apop_data_alloc()

763

\li\ref apop_data_calloc()

764

\li\ref apop_data_free()

765

\li\ref apop_text_alloc() : allocate or resize the text part of an \ref apop_data set.

766

\li\ref apop_text_free()

767

768

See also:

769

770

\li <tt>gsl_matrix * gsl_matrix_alloc (size_t n1, size_t n2)</tt>

771

\li <tt>gsl_matrix * gsl_matrix_calloc (size_t n1, size_t n2)</tt>

772

\li <tt>void gsl_matrix_free (gsl_matrix * m)</tt>

773

\li <tt>gsl_matrix_memcpy (gsl_matrix * dest, const gsl_matrix * src)</tt>

774

\li <tt>gsl_vector * gsl_vector_alloc (size_t n)</tt>

775

\li <tt>gsl_vector * gsl_vector_calloc (size_t n)</tt>

776

\li <tt>void gsl_vector_free (gsl_vector * v)</tt>

777

\li <tt>gsl_vector_memcpy (gsl_vector * dest, const gsl_vector * src)</tt>

778

779

endofdiv

780

781

Outlineheader gslviews Using views

782

783

There are several macros for the common task of viewing a single row or column of a \ref

784

apop_data set.

785

786

\code
apop_data *d = apop_query_to_data("select obs1, obs2, obs3 from a_table");

//Get a column using its name
Apop_col_t(d, "obs1", ov);
double obs1_sum = apop_vector_sum(ov);

//Get row zero of the data set's matrix as a vector; get its sum
double first_row_sum = apop_vector_sum(Apop_rv(d, 0));

//Get a row or rows as a standalone one-row apop_data set
apop_data_print(Apop_r(d, 0));

//ten rows starting at row 3:
Apop_data_rows(d, 3, 10, d10);
apop_data_show(d10);

//First column's sum
double first_col_sum = apop_vector_sum(Apop_cv(d, 0));

//Pull a 10x5 submatrix, whose first element is the (2,3)rd
//element of the parent data set's matrix
double sub_sum = apop_matrix_sum(Apop_subm(d, 2,3, 10,5));
\endcode

811

812

To make it easier to use the result of these macros as an argument to a function, these macros have abbreviated names.

813

814

\li\ref Apop_c

815

\li\ref Apop_r

816

\li\ref Apop_cv

817

\li\ref Apop_rv

818

\li\ref Apop_cs

819

\li\ref Apop_rs

820

\li\ref Apop_subm

821

822

A second set of macros has a slightly different syntax, taking the name of the object to be declared as the last argument. These cannot be used as expressions such as function arguments.

823

824

\li\ref Apop_col_t

825

\li\ref Apop_row_t

826

\li\ref Apop_col_tv

827

\li\ref Apop_row_tv

828

\li\ref Apop_matrix_row

829

\li\ref Apop_matrix_col

830

831

The view is an automatic variable, not a pointer, and therefore disappears at the end

832

of the scope in which it is declared. If you want to retain the data after the function

833

exits, copy it to another vector:

834

835

\code
Apop_matrix_row(d->matrix, 2, rowtwo);
gsl_vector *outvector = apop_vector_copy(rowtwo);
\endcode

839

840

Curly braces always delimit scope, not just at the end of a function.

841

These macros work by generating a number of local variables, which you may be able to see in your

842

debugger. When program evaluation exits a given block, all variables in that block are

843

erased. Here is some sample code that won't work:

844

845

\code
apop_data *outdata;
if (get_odd){
    outdata = Apop_r(data, 1);
} else {
    outdata = Apop_r(data, 0);
}
apop_data_show(outdata); //breaks: outdata points to out-of-scope variables.
\endcode

854

855

For this if/then statement, there are two sets of local variables

856

generated: one for the \c if block, and one for the \c else block. By the last line,

857

neither exists. You can get around the problem here by making sure to not put the macro

858

declaring new variables in a block. E.g.:

859

860

\code
apop_data *outdata = Apop_r(data, get_odd ? 1 : 0);
apop_data_show(outdata);
\endcode

864

865

866

This is a general rule about how variables declared in blocks will behave, but because the

867

macros obscure the variable declarations, it is especially worth watching out for here.
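
The same scoping rule can be seen without any macros. In this plain-C sketch, \c row_view, \c view_row, and \c sum_chosen_row are hypothetical names; the view is a small automatic struct that points into storage owned elsewhere, much as the text describes.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical stand-in for a view: it owns nothing and only points into
// storage that belongs to someone else.
typedef struct {
    const double *start;
    size_t len;
} row_view;

// Build a view of one row of a row-major matrix.
row_view view_row(const double *matrix, size_t cols, size_t row){
    assert(matrix != NULL);
    return (row_view){.start = matrix + row * cols, .len = cols};
}

// The view is declared once, in the same scope that uses it; only the row
// index depends on the condition. Declaring it inside the if and else
// blocks and keeping a pointer to it afterward would leave that pointer
// dangling, because each block's automatic variables end with the block.
double sum_chosen_row(const double *matrix, size_t cols, int get_odd){
    row_view v = view_row(matrix, cols, get_odd ? 1 : 0);
    double total = 0;
    for (size_t i = 0; i < v.len; i++)
        total += v.start[i];
    return total;
}
```

Moving the condition into the argument, rather than wrapping the declaration in a block, keeps the view alive for as long as it is used.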

868

869

endofdiv

870

871

Outlineheader setgetsec Set/get

872

873

The set/get functions can act on either names or indices. Sample usages:

\code
double twothree = apop_data_get(data, 2, 3); //just indices
apop_data_set(data, .rowname="A row", .colname="this column", .val=13);
double AIC = apop_data_get(your_model->info, .rowname="AIC");
\endcode

881

882

\li\ref apop_data_get()

883

\li\ref apop_data_set()

884

\li\ref apop_data_ptr() : returns a pointer to the element.

885

886

See also:

887

888

\li <tt>double gsl_matrix_get (const gsl_matrix * m, size_t i, size_t j)</tt>

889

\li <tt>double gsl_vector_get (const gsl_vector * v, size_t i)</tt>

890

\li <tt>void gsl_matrix_set (gsl_matrix * m, size_t i, size_t j, double x)</tt>

891

\li <tt>void gsl_vector_set (gsl_vector * v, size_t i, double x)</tt>

892

\li <tt>double * gsl_matrix_ptr (gsl_matrix * m, size_t i, size_t j)</tt>

893

\li <tt>double * gsl_vector_ptr (gsl_vector * v, size_t i)</tt>

894

\li <tt>const double * gsl_matrix_const_ptr (const gsl_matrix * m, size_t i, size_t j)</tt>

895

\li <tt>const double * gsl_vector_const_ptr (const gsl_vector * v, size_t i)</tt>

896

\li <tt>gsl_matrix_get_row (gsl_vector * v, const gsl_matrix * m, size_t i)</tt>

897

\li <tt>gsl_matrix_get_col (gsl_vector * v, const gsl_matrix * m, size_t j)</tt>

898

\li <tt>gsl_matrix_set_row (gsl_matrix * m, size_t i, const gsl_vector * v)</tt>

899

\li <tt>gsl_matrix_set_col (gsl_matrix * m, size_t j, const gsl_vector * v)</tt>

900

901

endofdiv

902

903

Outlineheader mapplysec Map/apply

904

905

\anchor outline_mapply

906

These functions allow you to send each element of a vector or matrix to a function, either producing a new matrix (map) or transforming the original (apply). The \c ..._sum functions return the sum of the mapped output.
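
The distinction among map, apply, and map-sum can be sketched on a plain array. This is an illustration only, not Apophenia's implementation; \c array_map, \c array_apply, \c array_map_sum, and \c square are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

typedef double (*elmt_fn)(double);

// map: leave the input alone and return a newly allocated output array.
// The caller frees the result. Returns NULL on allocation failure.
double *array_map(const double *in, size_t n, elmt_fn fn){
    assert(in != NULL && fn != NULL);
    double *out = malloc(n * sizeof *out);
    if (!out) return NULL;
    for (size_t i = 0; i < n; i++) out[i] = fn(in[i]);
    return out;
}

// apply: overwrite each element with the function's output.
void array_apply(double *in, size_t n, elmt_fn fn){
    assert(in != NULL && fn != NULL);
    for (size_t i = 0; i < n; i++) in[i] = fn(in[i]);
}

// map_sum: return only the sum of the mapped values; nothing is stored.
double array_map_sum(const double *in, size_t n, elmt_fn fn){
    assert(in != NULL && fn != NULL);
    double total = 0;
    for (size_t i = 0; i < n; i++) total += fn(in[i]);
    return total;
}

double square(double x){ return x * x; }
```

With these, <tt>array_map_sum(v, n, square)</tt> gives a sum of squares without allocating anything, while \c array_map costs one allocation and \c array_apply destroys the original values.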

907

908

There is an older and a newer set of functions. The older versions, which act on
<tt>gsl_matrix</tt>es or <tt>gsl_vector</tt>s, have more verbose names; the newer
versions, which take in an \ref apop_data set, use the \ref designated syntax to add
a few options and a briefer syntax.

912

913

You can do many things quickly with these functions.

914

915

Get the sum of squares of a vector's elements:

916

917

\code
//given apop_data *dataset and gsl_vector *v:
double data_sum_of_squares = apop_map_sum(dataset, gsl_pow_2);
double vector_sum_of_squares = apop_vector_map_sum(v, gsl_pow_2);
\endcode

922

923

Here, we create an index vector [\f$0, 1, 2, ...\f$].

924

925

\code
double index(double in, int index){return index;}
apop_data *d = apop_data_alloc(100);
apop_map(d, .fn_di=index, .inplace='y');
\endcode

930

931

Given your log likelihood function and a data set where each row of the matrix is an observation, find the total log likelihood:

932

933

\code
//given apop_data *dataset:
static double your_log_likelihood_fn(const gsl_vector * in)
    {[your math goes here]}

double total_ll = apop_matrix_map_sum(dataset->matrix, your_log_likelihood_fn);
\endcode

939

940

How many missing elements are there in your data matrix?

941

942

\code
static double nan_check(const double in){ return isnan(in);}

int missing_ct = apop_map_sum(in, nan_check, .part='m');
\endcode

947

948

Get the mean of the not-NaN elements of a data set:

949

950

\code
static double no_nan_val(const double in){ return isnan(in)? 0 : in;}
static double not_nan_check(const double in){ return !isnan(in);}

static double apop_mean_no_nans(apop_data *in){
    return apop_map_sum(in, no_nan_val)/apop_map_sum(in, not_nan_check);
}
\endcode

958

959

The following program randomly generates a data set where each row is a list of numbers with a different mean. It then finds the \f$t\f$ statistic for each row, and the confidence with which we reject the claim that the statistic is less than or equal to zero.

960

961

Notice how the older \ref apop_vector_apply uses file-global variables to pass information into the functions, while the \ref apop_map uses a pointer to the constant parameters to input to the functions.

962

963

\include t_test_by_rows.c

964

965

One more toy example, demonstrating the use of \ref apop_map and \ref apop_map_sum :

966

967

\include apop_map_row.c

968

969

\li\ref apop_map()

970

\li\ref apop_map_sum()

971

\li\ref apop_matrix_apply()

972

\li\ref apop_matrix_map()

973

\li\ref apop_matrix_map_all_sum()

974

\li\ref apop_matrix_map_sum()

975

\li\ref apop_vector_apply()

976

\li\ref apop_vector_map()

977

\li\ref apop_vector_map_sum()

978

979

endofdiv

980

981

Outlineheader matrixmathtwo Basic Math

982

983

\li\ref apop_vector_exp : exponentiate every element of a vector

984

\li\ref apop_vector_log : take the log of every element of a vector

985

\li\ref apop_vector_log10 : take the log (base 10) of every element of a vector

986

\li\ref apop_vector_distance : find the distance between two vectors via various metrics

987

\li\ref apop_vector_normalize : scale/shift a vector to have mean zero, sum to one, et cetera

988

\li\ref apop_matrix_normalize : apply \ref apop_vector_normalize to every column or row of a matrix

989

990

See also:

991

992

\li <tt>int gsl_matrix_add (gsl_matrix * a, const gsl_matrix * b)</tt>

993

\li <tt>int gsl_matrix_sub (gsl_matrix * a, const gsl_matrix * b)</tt>

994

\li <tt>int gsl_matrix_mul_elements (gsl_matrix * a, const gsl_matrix * b)</tt>

995

\li <tt>int gsl_matrix_div_elements (gsl_matrix * a, const gsl_matrix * b)</tt>

996

\li <tt>int gsl_matrix_scale (gsl_matrix * a, const double x)</tt>

997

\li <tt>int gsl_matrix_add_constant (gsl_matrix * a, const double x)</tt>

998

\li <tt>gsl_vector_add (gsl_vector * a, const gsl_vector * b)</tt>

999

\li <tt>gsl_vector_sub (gsl_vector * a, const gsl_vector * b)</tt>

1000

\li <tt>gsl_vector_mul (gsl_vector * a, const gsl_vector * b)</tt>

1001

\li <tt>gsl_vector_div (gsl_vector * a, const gsl_vector * b)</tt>

1002

\li <tt>gsl_vector_scale (gsl_vector * a, const double x)</tt>

1003

\li <tt>gsl_vector_add_constant (gsl_vector * a, const double x)</tt>

1004

1005

endofdiv

1006

1007

Outlineheader matrixmath Matrix math

1008

1009

\li\ref apop_dot(): matrix \f$\cdot\f$ matrix, matrix \f$\cdot\f$ vector, or vector \f$\cdot\f$ matrix

1010

\li\ref apop_matrix_determinant

1011

\li\ref apop_matrix_inverse

1012

\li\ref apop_det_and_inv(): find determinant and inverse at the same time

1013

1014

See the GSL documentation for voluminous further options.

1015

1016

endofdiv

1017

1018

Outlineheader sumstats Summary stats

1019

1020

\li\ref apop_data_summarize ()

1021

\li\ref apop_vector_moving_average()

1022

\li\ref apop_vector_percentiles()

1023

1024

See also:

1025

1026

\li <tt>double gsl_matrix_max (const gsl_matrix * m)</tt>

1027

\li <tt>double gsl_matrix_min (const gsl_matrix * m)</tt>

1028

\li <tt>void gsl_matrix_minmax (const gsl_matrix * m, double * min_out, double * max_out)</tt>

1029

\li <tt>void gsl_matrix_max_index (const gsl_matrix * m, size_t * imax, size_t * jmax)</tt>

1030

\li <tt>void gsl_matrix_min_index (const gsl_matrix * m, size_t * imin, size_t * jmin)</tt>

1031

\li <tt>void gsl_matrix_minmax_index (const gsl_matrix * m, size_t * imin, size_t * jmin, size_t * imax, size_t * jmax)</tt>

1032

\li <tt>gsl_vector_max (const gsl_vector * v)</tt>

1033

\li <tt>gsl_vector_min (const gsl_vector * v)</tt>

1034

\li <tt>gsl_vector_minmax (const gsl_vector * v, double * min_out, double * max_out)</tt>

1035

\li <tt>gsl_vector_max_index (const gsl_vector * v)</tt>

1036

\li <tt>gsl_vector_min_index (const gsl_vector * v)</tt>

1037

\li <tt>gsl_vector_minmax_index (const gsl_vector * v, size_t * imin, size_t * imax)</tt>

1038

1039

endofdiv

1040

1041

Outlineheader moments Moments

1042

1043

For most of these, you can add a weights vector for weighted mean/var/cov/..., such as

1044

<tt>apop_vector_mean(d->vector, .weights=d->weights)</tt>
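
As a reminder of what a weights vector does to a mean, here is the usual textbook definition in plain C. It is a sketch, not the library's code, and \c weighted_mean is a hypothetical name; treating a \c NULL weights pointer as equal weights is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <math.h>

// Weighted mean: sum(w_i * x_i) / sum(w_i). A NULL weights pointer means
// every observation has weight one. Returns NAN if the weights sum to zero.
double weighted_mean(const double *x, const double *w, size_t n){
    assert(x != NULL);
    double wx_sum = 0, w_sum = 0;
    for (size_t i = 0; i < n; i++){
        double wi = w ? w[i] : 1;
        wx_sum += wi * x[i];
        w_sum += wi;
    }
    return w_sum ? wx_sum / w_sum : NAN;
}
```

An observation with weight two counts exactly as if it appeared twice in an unweighted data set.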

1045

1046

\li\ref apop_mean(): the first three with short names operate on a vector.

1047

\li\ref apop_sum()

1048

\li\ref apop_var()

1049

\li\ref apop_matrix_sum ()

1050

\li\ref apop_data_correlation ()

1051

\li\ref apop_data_covariance ()

1052

\li\ref apop_data_summarize ()

1053

\li\ref apop_matrix_mean ()

1054

\li\ref apop_matrix_mean_and_var ()

1055

\li\ref apop_vector_correlation ()

1056

\li\ref apop_vector_cov ()

1057

\li\ref apop_vector_kurtosis ()

1058

\li\ref apop_vector_kurtosis_pop ()

1059

\li\ref apop_vector_mean()

1060

\li\ref apop_vector_skew()

1061

\li\ref apop_vector_skew_pop()

1062

\li\ref apop_vector_sum()

1063

\li\ref apop_vector_var()

1064

\li\ref apop_vector_var_m ()

1065

1066

endofdiv

1067

1068

Outlineheader convsec Conversion among types

1069

1070

There are no functions provided to convert from \ref apop_data to the constituent

1071

elements, because you don't need a function.

1072

1073

If you need an individual element, you can just use its pointer to retrieve it:

1074

1075

\code
//One type character per column: vector, matrix, matrix, weights.
apop_data *d = apop_query_to_mixed_data("vmmw", "select result, age, "
                                    "income, sampleweight from data");
double avg_result = apop_vector_mean(d->vector, .weights=d->weights);
\endcode

1080

1081

In the other direction, you can use compound literals to wrap an \ref apop_data struct

1082

around a loose vector or matrix:

1083

1084

\code
//Given:
gsl_vector *v;
gsl_matrix *m;

// Then this form wraps the elements into apop_data structs. Note that
// these are not pointers: they're automatically allocated and therefore
// the extra memory use for the wrapper is cleaned up on exit from scope.

apop_data *dv = &(apop_data){.vector=v};
apop_data *dm = &(apop_data){.matrix=m};

apop_data *v_dot_m = apop_dot(dv, dm);

//Here is a macro to hide C's ugliness:
#define As_data(...) (&(apop_data){__VA_ARGS__})

apop_data *v_dot_m2 = apop_dot(As_data(.vector=v), As_data(.matrix=m));

//The wrapped object is an automatically-allocated structure pointing to the
//original data. If it needs to persist or be separate from the original,
//make a copy:
apop_data *dm_copy = apop_data_copy(As_data(.vector=v, .matrix=m));
\endcode
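
If you would like to try the compound-literal idiom without linking to anything, the following plain-C sketch uses the same macro form on a toy struct. The \c wrapper type, the \c As_wrapper macro, and \c wrapper_sum are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical wrapper type standing in for a larger data structure.
typedef struct {
    const double *vector;
    size_t vsize;
    const char *label;
} wrapper;

// Wrap designated initializers in a compound literal and take its address.
// Unnamed fields are zero-initialized; the literal lives until the end of
// the enclosing block.
#define As_wrapper(...) (&(wrapper){__VA_ARGS__})

double wrapper_sum(const wrapper *w){
    assert(w != NULL);
    double total = 0;
    for (size_t i = 0; i < w->vsize; i++) total += w->vector[i];
    return total;
}
```

A call such as <tt>wrapper_sum(As_wrapper(.vector = v, .vsize = 3))</tt> builds the struct on the spot, and fields that are not named are set to zero or \c NULL.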

1108

1109

\li\ref apop_array_to_vector() : <tt>double*</tt>\f$\to\f$ <tt>gsl_vector</tt>

1110

\li\ref apop_data_fill() : <tt>double*</tt>\f$\to\f$ \ref apop_data. See also \ref apop_data_falloc

1111

\li\ref apop_text_to_data() : delimited text file\f$\to\f$ \ref apop_data

1112

\li\ref apop_text_to_db() : delimited text file\f$\to\f$ database

1113

\li\ref apop_vector_to_matrix()

1114

1115

endofdiv

1116

1117

Outlineheader names Name handling

1118

1119

If you generate your data set from the database via \ref apop_query_to_data (or

1120

\ref apop_query_to_text or \ref apop_query_to_mixed_data) then column names appear

1121

as expected. Set <tt>apop_opts.db_name_column</tt> to the name of a column in your

1122

query result to use that column name for row names.

1123

1124

Sample uses, given \ref apop_data set <tt>d</tt>:

1125

1126

\code
int row_name_count = d->names->rowct;
int col_name_count = d->names->colct;
int text_name_count = d->names->textct;

//Manually add a name:
apop_name_add(d->names, "the vector", 'v');
apop_name_add(d->names, "row 0", 'r');
apop_name_add(d->names, "row 1", 'r');
apop_name_add(d->names, "row 2", 'r');
apop_name_add(d->names, "numeric column 0", 'c');
apop_name_add(d->names, "text column 0", 't');
apop_name_add(d->names, "The name of the data set.", 'h');

//or append several names at once
apop_data_add_names(d, 'c', "numeric column 1", "numeric column 2", "numeric column 3");

//point to element i from the row, column, or text name lists:
char *rowname_i = d->names->row[i];
char *colname_i = d->names->col[i];
char *textname_i = d->names->text[i];

//The vector also has a name:
char *vname = d->names->vector;
\endcode

1152

1153

\li\ref apop_name_add() : add one name

1154

\li\ref apop_data_add_names() : add a sequence of names at once

1155

\li\ref apop_name_stack() : copy the contents of one name list to another

1156

\li\ref apop_name_find() : find the row/col number for a given name.

1157

\li\ref apop_name_print() : print the \ref apop_name struct, for diagnostic purposes.
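
Finding the row or column number for a name, and then using it to fetch a value, can be sketched in plain C as follows. The functions \c find_name and \c get_by_name are hypothetical, and the exact, case-sensitive match is an assumption of this sketch rather than a description of how \ref apop_name_find compares names.

```c
#include <assert.h>
#include <string.h>

// Return the index of the first entry equal to name, or -1 if none matches.
// This sketch uses an exact, case-sensitive comparison.
int find_name(const char *const *names, size_t ct, const char *name){
    assert(name != NULL);
    for (size_t i = 0; i < ct; i++)
        if (names[i] && !strcmp(names[i], name))
            return (int)i;
    return -1;
}

// Look up a cell of a row-major matrix by row and column name.
// On success, store the value in *out and return 0; otherwise return -1.
int get_by_name(const double *matrix, size_t cols,
                const char *const *rownames, size_t rowct,
                const char *const *colnames,
                const char *row, const char *col, double *out){
    int r = find_name(rownames, rowct, row);
    int c = find_name(colnames, cols, col);
    if (r < 0 || c < 0) return -1;
    *out = matrix[(size_t)r * cols + (size_t)c];
    return 0;
}
```

Returning -1 for a missing name lets the caller tell "not found" apart from a legitimate index of zero.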

1158

1159

endofdiv

1160

1161

Outlineheader textsec Text data

1162

1163

The \ref apop_data set includes a grid of strings, <tt>text</tt>, for holding text data.

1164

1165

Text should be encoded in UTF-8. US ASCII is a subset of UTF-8, so that's OK too.

1166

1167

There are a few simple forms for handling the \c text element of an \c apop_data set, which handle the tedium of memory-handling for you.

1168

1169

\li Use \ref apop_text_alloc to allocate the block of text. It is actually a realloc function, which you can use to resize an existing block without leaks.

1170

\li Use \ref apop_text_add to add text elements. It replaces any existing text in the given slot without memory leaks.

1171

\li The number of rows of text data in <tt>tdata</tt> is

1172

<tt>tdata->textsize[0]</tt>;

1173

the number of columns is <tt>tdata->textsize[1]</tt>.

1174

\li Refer to individual elements using the usual 2-D array notation, <tt>tdata->text[row][col]</tt>.

1175

\li <tt>x[0]</tt> can always be written as <tt>*x</tt>, which may save some typing. The number of rows is <tt>*tdata->textsize</tt>. If you have a single column of text data (i.e., all data is in column zero), then item \c i is <tt>*tdata->text[i]</tt>. If you know you have exactly one cell of text, then its value is <tt>**tdata->text</tt>.

1176

\li After \ref apop_text_alloc, all elements are the empty string <tt>""</tt>, which

1177

you can check via <tt>if (!strlen(dataset->text[i][j])) printf("<blank>")</tt> or

1178

<tt>if (!*dataset->text[i][j]) printf("<blank>")</tt>. For the sake of efficiency

1179

when dealing with large, sparse data sets, all blank cells point to <em>the same</em>

1180

static empty string, meaning that freeing cells must be done with care. Your best bet

1181

is to rely on \ref apop_text_add, \ref apop_text_alloc, and \ref apop_text_free to do

1182

the memory management for you.
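
To see why freeing cells takes care when all blanks share one string, consider this plain-C sketch of a text grid. It is not Apophenia's code: \c text_grid, \c grid_alloc, \c grid_set, and \c grid_free are hypothetical names, and the only point is the check against the shared blank before each \c free.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Every blank cell points at this one static string, so it must never be freed.
static char blank[] = "";

typedef struct {
    char ***text;         // text[row][col]
    size_t rows, cols;
} text_grid;

// Release one cell's string unless it is the shared blank.
static void cell_clear(char **cell){
    if (*cell != blank) free(*cell);
    *cell = blank;
}

// Free every cell, every row, and the grid itself.
void grid_free(text_grid *g){
    for (size_t r = 0; r < g->rows; r++){
        if (!g->text[r]) continue;
        for (size_t c = 0; c < g->cols; c++) cell_clear(&g->text[r][c]);
        free(g->text[r]);
    }
    free(g->text);
    g->text = NULL;
    g->rows = g->cols = 0;
}

// Allocate a grid with every cell blank. Returns 0 on success, -1 on failure.
int grid_alloc(text_grid *g, size_t rows, size_t cols){
    assert(g != NULL && rows > 0 && cols > 0);
    g->text = calloc(rows, sizeof *g->text);
    if (!g->text) return -1;
    g->rows = rows; g->cols = cols;
    for (size_t r = 0; r < rows; r++){
        g->text[r] = malloc(cols * sizeof *g->text[r]);
        if (!g->text[r]){ grid_free(g); return -1; }
        for (size_t c = 0; c < cols; c++) g->text[r][c] = blank;
    }
    return 0;
}

// Replace a cell's contents with a copy of str. Returns 0 on success.
int grid_set(text_grid *g, size_t r, size_t c, const char *str){
    assert(r < g->rows && c < g->cols && str != NULL);
    char *copy = malloc(strlen(str) + 1);
    if (!copy) return -1;
    strcpy(copy, str);
    cell_clear(&g->text[r][c]);
    g->text[r][c] = copy;
    return 0;
}
```

A large, mostly empty grid then costs one pointer per blank cell and no string allocations, at the price of the pointer comparison in \c cell_clear.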

1183

1184

Here is a sample program that uses these forms, plus a few text-handling functions.

1185

1186

\include eg/text_demo.c

1187

1188

\li\ref apop_data_transpose() : also transposes the text data. Say that you use

1189

<tt>dataset = apop_query_to_text("select onecolumn from data");</tt> then you have a

1190

sequence of strings, <tt>d->text[0][0], d->text[1][0], </tt>.... After <tt>apop_data

1191

*dt = apop_data_transpose(dataset)</tt>, you will have a single list of strings,

1192

<tt>dt->text[0]</tt>, which is often useful as input to list-of-strings handling

1193

functions.

1194

1195

\li\ref apop_query_to_text()

1196

\li\ref apop_text_alloc() : allocate or resize the text part of an \ref apop_data set.

1197

\li\ref apop_text_add(): replace a single cell of the text grid with new text.

1198

\li\ref apop_text_paste() : convert a table of little strings into one long string.

1199

\li\ref apop_text_unique_elements() : get a sorted list of unique elements for one column of text.

1200

\li\ref apop_text_free() : you may never need this, because \ref apop_data_free calls it.

1201

\li\ref apop_regex() : friendlier front-end for POSIX-standard regular expression searching and pulling matches into a \ref apop_data set.

1202


1203

1204

endofdiv

1205

1206

Outlineheader fact Generating factors

1207

1208

\em Factor is jargon for a numbered category. Number-crunching programs prefer integers over text, so we need a function to produce a one-to-one mapping from text categories into numeric factors.

1209

1210

A \em dummy is a variable that is either one or zero, depending on membership in a given

1211

group. Some methods (typically when the variable is an input or independent variable

1212

in a regression) prefer dummies; some methods (typically for outcome or dependent

1213

variables) prefer factors. The functions that generate factors and dummies will add

1214

an informational page to your \ref apop_data set with a name like <tt>\<categories

1215

for your_column\></tt> listing the conversion from the artificial numeric factor to

1216

the original data. Use \ref apop_data_get_factor_names to get a pointer to that page.

1217

1218

You can use the factor table to translate from numeric categories back to text (though

1219

you probably have the original text column in your data anyway).

1220

1221

This is especially useful if you want text in two data sets to have the same

1222

categories. Generate factors in the first set, then copy the factor list to the second,

1223

then run \ref apop_data_to_factors on the second.

1224

1225

\code
apop_data_to_factors(d1);
d2->more = apop_data_copy(apop_data_get_factor_names(d1));
apop_data_to_factors(d2);
\endcode
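
The idea of building one category list and reusing it for a second data set can be sketched in plain C. The functions \c unique_sorted and \c to_factors are hypothetical, and numbering the categories in sorted order is an assumption made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

// Compare two (const char *) array elements for qsort.
static int cmp_str(const void *a, const void *b){
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

// Fill cats (capacity n) with the sorted unique strings in text and return
// how many there are. cats holds pointers into text, not copies.
size_t unique_sorted(const char *const *text, size_t n, const char **cats){
    assert(text != NULL && cats != NULL);
    for (size_t i = 0; i < n; i++) cats[i] = text[i];
    qsort(cats, n, sizeof *cats, cmp_str);
    size_t ct = 0;
    for (size_t i = 0; i < n; i++)
        if (ct == 0 || strcmp(cats[ct-1], cats[i]))
            cats[ct++] = cats[i];
    return ct;
}

// Write to out[i] the index of text[i] within the category list.
// Returns 0 on success, -1 if some text is not in the list.
int to_factors(const char *const *text, size_t n,
               const char *const *cats, size_t cat_ct, double *out){
    for (size_t i = 0; i < n; i++){
        size_t k = 0;
        while (k < cat_ct && strcmp(cats[k], text[i])) k++;
        if (k == cat_ct) return -1;
        out[i] = (double)k;
    }
    return 0;
}
```

Calling \c to_factors on a second text column with the first column's category list guarantees that, say, "red" maps to the same number in both sets, and flags any category the first set never saw.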

1230

1231

1232

\li\ref apop_data_to_dummies()

1233

\li\ref apop_data_to_factors()

1234

\li\ref apop_data_get_factor_names()

1235

\li\ref apop_text_unique_elements()

1236

\li\ref apop_vector_unique_elements()

1237

1238

endofdiv

1239

1240

endofdiv

1241

1242

Outlineheader dbs Databases

1243

1244

These are convenience functions to handle interaction with SQLite or mySQL/mariaDB. They open one and only one database, and handle most of the interaction therewith for you.

1245

1246

You will probably first use \ref apop_text_to_db to pull data into the database, then \ref apop_query to clean the data in the database, and finally \ref apop_query_to_data to pull some subset of the data out for analysis.

1247

1248

See the \ref db_moments page for not-SQL-standard math functions that you can

1249

use when sending queries from Apophenia, such as \c pow, \c stddev, or \c sqrt.

1250

1251

\li \ref apop_text_to_db : Read a text file on disk into the database. Most data analysis projects start with a call to this.

1252

\li \ref apop_data_print : If you include the argument <tt>.output_type='d'</tt>, this prints your \ref apop_data set to the database.

1253

\li \ref apop_query : Manipulate the database, return nothing (e.g., insert rows or create table).

1254

\li \ref apop_db_open : Optional, for when you want to use a database on disk.

1255

\li \ref apop_db_close : If you used \ref apop_db_open, you will need to use this too.

1256

\li \ref apop_table_exists : Check to make sure you aren't reinventing or destroying data. Also, a clean way to drop a table.

1257

1258

\li Apophenia reserves the right to insert temp tables into the opened database. They will all have names beginning with "apop_", so the reader is advised to not use tables with such names, and is free to ignore or delete any such tables that turn up.

1259

1260

\li If you need to deal with two databases, use SQL's <a

1261

href="https://sqlite.org/lang_attach.html"><tt>attach database</tt></a>. By default

1262

with SQLite, Apophenia opens an in-memory database handle. It is a sensible workflow to

1263

use the faster in-memory database as the primary, and then attach an on-disk database

1264

to read in data and write final output tables.

1265

1266

1267

<b>Extracting data from the database</b>

1268

1269

\li\ref apop_db_to_crosstab(): take three columns in the database (row, column, value) and produce a table of values.

1270

\li\ref apop_query_to_data()

1271

\li\ref apop_query_to_float()

1272

\li\ref apop_query_to_mixed_data()

1273

\li\ref apop_query_to_text()

1274

\li\ref apop_query_to_vector()

1275

1276

<b>Writing data to the database</b>

1277

1278

See the print functions below. E.g.

1279

1280

\code

1281

apop_data_print(yourdata, .output_type='d', .output_name="dbtab");

1282

\endcode

1283

1284

Outlineheader cmdline Command-line utilities

1285

1286

A few functions have proven to be useful enough to be worth breaking out into their own programs, for use in scripts or other data analysis from the command line:

1287

1288

\li The \c apop_text_to_db command line utility is a wrapper for the \ref apop_text_to_db command.

1289

\li The \c apop_db_to_crosstab command line utility is a wrapper for the \ref apop_db_to_crosstab function.

1290

\li For fans of Gnuplot, the \c apop_plot_query utility produces a plot from the database. It is especially useful for histograms, which are binned via \ref apop_data_to_bins before plotting.

1291

1292

endofdiv

1293

1294

endofdiv

1295

1296

Outlineheader Modesec Models

1297

1298

This segment discusses the use of existing \ref apop_model objects.

1299

If you need to write a new model, see \ref modeldetails.

1300

1301

Outlineheader introtomodels Introduction

1302

1303

Begin with the most common use:

1304

the \c estimate function will estimate the parameters of your model. Just prep the data, select a model, and produce an estimate:

1305

1306

\code

1307

apop_data *data = apop_query_to_data("select outcome, in1, in2, in3 from dataset");

1308

apop_model *the_estimate = apop_estimate(data, apop_probit);

1309

apop_model_print(the_estimate, NULL);

1310

\endcode

1311

1312

Along the way to estimating the parameters, most models also find covariance estimates for

1313

the parameters, calculate statistics like log likelihood, and so on, which the final print statement will show.

1314

1315

The <tt>apop_probit</tt> model that ships with Apophenia is unparameterized:

1316

<tt>apop_probit.parameters==NULL</tt>. The output from the estimation,

1317

<tt>the_estimate</tt>, has the same form as <tt>apop_probit</tt>, but

1318

<tt>the_estimate->parameters</tt> has a meaningful value.

1319

1320

Outlineheader covandstuff More estimation output

1321

1322

A call to \ref apop_estimate produces more than just the estimated parameters. Most will

1323

produce any of a covariance matrix, some hypothesis tests, a list of expected values, log

1324

likelihood, AIC, AIC_c, BIC, et cetera.

1325

1326

First, note that if you don't want all that,

1327

adding to your model an \ref apop_parts_wanted_settings group with its default values (see below on settings groups) signals to

1328

the model that you want only the parameters and to not waste CPU time on covariances,

1329

expected values, et cetera. See the \ref apop_parts_wanted_settings documentation for examples and

1330

further refinements.

1331

1332

\li The actual parameter estimates are in an \ref apop_data set at \c your_model->parameters.

1333

1334

\li The output model also holds a pointer to the \ref apop_data set used for estimation, named \c data.

1335

1336

\li Scalar statistics of the model are listed in the output model's \c info group, and can

1337

be retrieved via a form like

1338

1339

\code

1340

apop_data_get(your_model->info, .rowname="log likelihood");

1341

//or

1342

apop_data_get(your_model->info, .rowname="AIC");

1343

\endcode

1344

1345

\li Covariances of the parameters are a page appended to the parameters; retrieve via

1346

1347

\code

1348

apop_data *cov = apop_data_get_page(your_model->parameters, "<Covariance>");

1349

\endcode

1350

1351

\li If the model calculates it, the table of expected values (typically including

1352

expected value, actual value, and residual) is a page stapled to the main info

1353

page. This is mostly for regression-type models. Retrieve via:

1354

1355

\code

1356

apop_data *predict = apop_data_get_page(your_model->info, "<Predicted>");

1357

\endcode

1358

1359

endofdiv

1360

1361

But we expect much more from a model than just estimating parameters from data.

1362

1363

Continuing the above example where we got an estimated Probit model named \c the_estimate, we can interrogate the estimate in various familiar ways. In each of the following examples, the model object holds enough information that the generic function being called can do its work:

1364

1365

\code

1366

apop_data *expected_value = apop_predict(NULL, the_estimate);

1367

1368

double density_under = apop_cdf(expected_value, the_estimate);

1369

1370

apop_data *draws = apop_model_draws(the_estimate, .count=1000);

1371

\endcode

1372

1373

Apophenia ships with many well-known models for your immediate use, including

1374

probability distributions, such as the \ref apop_normal, \ref apop_poisson, or \ref apop_beta models. The data is assumed to have been drawn from a given distribution and the question is only what distributional parameters best fit; e.g., assume the data is Normally distributed and find the mean and variance: \ref apop_estimate(\c your_data, \ref apop_normal).

1375

1376

There are also linear models like \ref apop_ols, \ref apop_probit, and \ref apop_logit. As in the example, they are on equal footing with the distributions, so nothing keeps you from making random draws from an estimated linear model.

1377

1378

\li If you send a data set with the \c weights vector filled, \ref apop_ols estimates Weighted OLS.

1379

1380

\li If the dependent variable has more than two categories, the \ref apop_probit and \ref apop_logit models estimate a multinomial probit or logit.

1381

1382

\li There are separate \ref apop_normal and \ref apop_multivariate_normal functions because the parameter formats are slightly different: the univariate Normal keeps both μ and σ in the vector element of the parameters; the multivariate version uses the vector for the vector of means and the matrix for the Σ matrix. The univariate is so heavily used that it merits a special-case model.

1383

1384

1385

Simulation models seem to not fit this form, but you will see below that if you can write an objective function for the \c p method of the model, you can use the above tools. Notably, you can estimate parameters via maximum likelihood and then give confidence intervals around those parameters.

1386

1387

But some models have to break uniformity: a histogram, for instance, has a list of bins that makes no sense for a Normal distribution. Such model-specific details are held in <em>settings groups</em>, which you will occasionally need to tweak to modify how a model is handled or estimated. The most common example would be for maximum likelihood, e.g.:

1388

1389

\code

1390

//Probit uses MLE. Redo the estimation using Newton's Method

1391

Apop_settings_add_group(the_estimate, apop_mle, .verbose='y',

1392

.tolerance=1e-4, .method=APOP_RF_NEWTON);

1393

apop_model *re_est = apop_estimate(data, the_estimate);

1394

\endcode

1395

1396

See below for more details on using settings groups.

1397

1398

Outlineheader modelparameterization Parameterizing or initializing a model

1399

1400

The models that ship with Apophenia have the requisite procedures for estimation,

1401

making draws, and so on, but have <tt>parameters==NULL</tt> and <tt>settings==NULL</tt>. The

1402

model is thus, for many purposes, incomplete, and you will need to take some action to

1403

complete the model. As per the examples to follow, there are several possibilities:

1404

1405

\li Estimate it! Almost all models can be sent with a data set as an argument to the

1406

<tt>apop_estimate</tt> function. The input model is unchanged, but the output model

1407

has parameters and settings in place.

1408

1409

\li If your model has a fixed number of numeric parameters, then you can set them with

1410

\ref apop_model_set_parameters.

1411

1412

\li If your model has a variable number of parameters, you can directly set the \c

1413

parameters element via \c apop_data_falloc. For most purposes, you will also need to

1414

set the \c msize1, \c msize2, \c vsize, and \c dsize elements to the size you want. See

1415

the example below.

1416

1417

\li Some models have disparate, non-numeric settings rather than a simple matrix of

1418

parameters. For example, a kernel density estimate needs a model as a kernel and a

1419

base data set, which can be set via \ref apop_model_copy_set.

1420

1421

Here is an example that shows the options for parameterizing a model. After each

1422

parameterization, 20 draws are made and written to a file named draws-[modelname].

1423

1424

\include ../eg/parameterization.c

1425

1426

endofdiv

1427

1428

Where to from here? See the \ref models page for

1429

a list of basic functions that make use of them,

1430

along with a list of the canned models, including popular favorites like

1431

\ref apop_beta, \ref apop_binomial, \ref apop_iv (instrumental variables),

1432

\ref apop_kernel_density, \ref apop_loess, \ref apop_lognormal,

1433

\ref apop_pmf (see the histogram section below), and \ref apop_poisson.

1434

1435

If you need to write a new model, see \ref modeldetails.

1436

1437

endofdiv

1438

1439

1440

Outlineheader mathmethods Model methods

1441

1442

\li\ref apop_estimate() : estimate the parameters of the model with data.

1443

\li\ref apop_predict() : the expected value function.

1444

\li\ref apop_draw() : random draws from an estimated model.

1445

\li\ref apop_p() : the probability of a given data set given the model.

1446

\li\ref apop_log_likelihood() : the log of \ref apop_p

1447

\li\ref apop_score() : the derivative of \ref apop_log_likelihood

1448

\li\ref apop_model_print() : write to screen, file, or database

1449

1450

\li\ref apop_model_copy() : duplicate a model

1451

\li\ref apop_model_set_parameters() : Models ship with no parameters set. Use this to convert a Normal(μ, σ) with unknown μ and σ into a Normal(0, 1), for example.

1452

\li\ref apop_model_free()

1453

\li\ref apop_model_clear(), apop_prep() : remove the parameters from a parameterized model. Used infrequently.

1454

\li\ref apop_model_draws() : many random draws from an estimated model.

1455

1456

endofdiv

1457

1458

Outlineheader Update Filtering & updating

1459

1460

The model structure makes it

1461

easy to generate new models that are variants of prior models. Bayesian updating,

1462

for example, takes in one \ref apop_model that we call the prior, one \ref apop_model

1463

that we call a likelihood, and outputs an \ref apop_model that we call the

1464

posterior. One can produce complex models using simpler transformations as well. For example, to generate

1465

a one-parameter Normal(μ, 1) given the code for a Normal(μ, σ):

1466

1467

\code

1468

apop_model *N_sigma1 = apop_model_fix_params(apop_model_set_parameters(apop_normal, NAN, 1));

1469

\endcode

1470

1471

This can be used anywhere the original Normal distribution can be. If we need to truncate the distribution in the data space:

1472

1473

\code

1474

//The constraint function.

1475

double over_zero(apop_data *in, apop_model *m){

1476

return apop_data_get(in) > 0;

1477

}

1478

1479

apop_model *trunc = apop_model_dconstrain(.base_model=N_sigma1,

1480

.constraint=over_zero);

1481

\endcode

1482

1483

Chaining together simpler transformations is an easy method to produce

1484

models of arbitrary detail.

1485

1486

\li\ref apop_update() : Bayesian updating

1487

\li\ref apop_model_coordinate_transform() : apply an invertible transformation to the data space

1488

\li\ref apop_model_dconstrain() : constrain the data space of a model to a subspace. E.g., truncate a Normal distribution so \f$x>0\f$.

1489

\li\ref apop_model_fix_params() : hold some parameters constant

1490

\li\ref apop_model_mixture() : a linear combination of models

1491

\li\ref apop_model_stack() : If \f$(p_1, p_2)\f$ has a Normal distribution and \f$p_3\f$ has an independent Poisson distribution, then \f$(p_1, p_2, p_3)\f$ has an <tt>apop_model_stack(apop_normal, apop_poisson)</tt> distribution.

1492

\li\ref apop_model_dcompose() : use the output of one model as a data set for another
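
As an illustration of the prior-plus-likelihood-gives-posterior pattern, here is a plain-C sketch of the simplest conjugate case, a Beta prior with Binomial data. The \c beta_dist type and both function names are hypothetical, and nothing here is a claim about how \ref apop_update is implemented or which model pairs it handles in closed form.

```c
#include <assert.h>

// Hypothetical holder for the two parameters of a Beta distribution.
typedef struct { double alpha, beta; } beta_dist;

// Posterior for a Beta prior after observing binomial data:
// successes add to alpha, failures add to beta.
beta_dist beta_binomial_update(beta_dist prior, int successes, int failures){
    assert(successes >= 0 && failures >= 0);
    return (beta_dist){prior.alpha + successes, prior.beta + failures};
}

// Mean of a Beta(alpha, beta) distribution.
double beta_mean(beta_dist d){
    return d.alpha/(d.alpha + d.beta);
}
```

Starting from a flat Beta(1, 1) prior, seven successes and three failures give a Beta(8, 4) posterior, which can in turn serve as the prior for the next batch of data.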

1493

1494

endofdiv

1495

1496

Outlineheader modelsettings Settings groups

1497

1498

[For info on specific settings groups and their contents and use, see the \ref settings page.]

1499

1500

1501

Describing a statistical, agent-based, social, or physical model in a standardized form is difficult because every model has significantly different settings. E.g., an MLE requires a method of search (conjugate gradient, simplex, simulated annealing), and a histogram needs the number of bins to be filled with data.

1502

1503

So, the \ref apop_model includes a single list which can hold an arbitrary number of groups of settings, like the search specifications for finding the maximum likelihood, a histogram for making random draws, and options about the model type.
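
The idea of one list holding arbitrarily many differently-typed groups can be sketched in a few lines of plain C. The \c settings_entry type and \c find_settings_group function are hypothetical names for this illustration and are not Apophenia's internal representation.

```c
#include <assert.h>
#include <string.h>

// One entry in the list: a name, and an untyped pointer to that group's struct.
typedef struct {
    const char *name;
    void *group;
} settings_entry;

// Find a settings group by name in a list terminated by an entry with a NULL name.
// Returns NULL if the model has no group by that name.
void *find_settings_group(const settings_entry *list, const char *name){
    assert(list && name);
    for (; list->name; list++)
        if (!strcmp(list->name, name)) return list->group;
    return NULL;
}
```

The caller casts the returned pointer to the type that matches the name, which is the sort of bookkeeping that macros can hide from the user.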

1504

1505

Settings groups are automatically initialized with default values when

1506

needed. If the defaults do no harm, then you don't need to think about

1507

these settings groups at all.

1508

1509

Here is an example where a settings group is worth tweaking: the \ref apop_parts_wanted_settings group indicates which parts

1510

of the auxiliary data you want.

1511

1512

1513

\code

1514

1 apop_model *m = apop_model_copy(apop_ols);

1515

2 Apop_settings_add_group(m, apop_parts_wanted, .covariance='y');

1516

3 apop_model *est = apop_estimate(data, m);

1517

\endcode

1518

1519

1520

Line one establishes the baseline form of the model. Line two adds a settings group

1521

of type \ref apop_parts_wanted_settings to the model. By default other auxiliary items, like the expected values, are set to \c 'n' when using this group, so this specifies that we want covariance and only covariance. Having stated our preferences, line three does the estimation we want.

1522

1523

Notice that the \c _settings ending to the settings group's name isn't written---macros

1524

make it happen. The remaining arguments to \c Apop_settings_add_group (if any) follow

1525

the \ref designated syntax of the form <tt>.setting=value</tt>.

1526

1527

There is an \ref apop_model_copy_set macro that adds a settings group when it is first copied, joining up lines one and two above:

1528

1529

\code

1530

apop_model *m = apop_model_copy_set(apop_ols, apop_parts_wanted, .covariance='y');

1531

\endcode

1532

1533

Settings groups are copied with the model, which facilitates chaining

1534

estimations. Continuing the above example, you could re-estimate to get the predicted

1535

values and covariance via:

1536

1537

1538

\code

1539

Apop_settings_set(est, apop_parts_wanted, predicted, 'y');

1540

apop_model *est2 = apop_estimate(data, est);

1541

\endcode

1542

1543

1544

\li \ref Apop_settings_set, for modifying a single setting, doesn't use the designated initializers format.

1545

1546

\li The \ref settings page lists the settings structures included in Apophenia and their use.

1547

1548

\li Because the settings groups are buried within the model, debugging them can be a

1549

pain. Here is a documented macro for \c gdb that will help you pull a settings group out of a

1550

model for your inspection, to cut and paste into your \c .gdbinit. It shouldn't be too difficult to modify this macro for other debuggers.

1551

1552

\code

1553

define get_group

1554

set $group = ($arg1_settings *) apop_settings_get_grp( $arg0, "$arg1", 0 )

1555

p *$group

1556

end

1557

document get_group

1558

Gets a settings group from a model.

1559

Give the model name and the name of the group, like

1560

get_group my_model apop_mle

1561

and I will set a gdb variable named $group that points to that settings group,

1562

which you can use like any other pointer. For example, print the contents with

1563

p *$group

1564

The contents of $group are printed to the screen as visible output to this macro.

1565

end

1566

\endcode

1567

1568

For just using a model, that's all of what you need to know. For details on writing a new settings group, see \ref settingswriting .

1569

1570

\li\ref Apop_settings_add_group

1571

\li\ref Apop_settings_set

1572

\li\ref Apop_settings_get get a single element from a settings group.

1573

\li\ref Apop_settings_get_group get the whole settings group.

1574

1575

endofdiv

1576

1577

endofdiv

1578

1579

endofdiv

1580

1581

1582

Outlineheader Test Tests & diagnostics

1583

1584

Just about any hypothesis test consists of a few common steps:

1585

1586

\li Specify a statistic

1587

\li State the statistic's hypothesized distribution

1588

\li Find the odds that the statistic would lie within some given range, like <em>greater than zero</em> or <em>near 1.1</em>.

1589

1590

If the statistic is from a common form, like the parameters from an OLS regression, then the commonly-associated \f$t\f$ test is probably included as part of the estimation output, typically as a row in the \c info element of the \ref apop_data set.

1591

1592

Some tests, like ANOVA, produce a statistic using a specialized procedure, so Apophenia includes some functions, like \ref apop_test_anova_independence and \ref apop_test_kolmogorov, to produce the statistic and look up its significance level.

1593

1594

If you are producing a statistic that you know has a common form, as when a central limit theorem tells you that your statistic is Normally distributed, then the convenience function \ref apop_test will do the final lookup step of checking where your statistic lies on your chosen distribution.

1595

1596

\li\ref apop_test()

1597

\li\ref apop_paired_t_test()

1598

\li\ref apop_f_test()

1599

\li\ref apop_t_test()

1600

\li\ref apop_test_anova_independence()

1601

\li\ref apop_test_fisher_exact()

1602

\li\ref apop_test_kolmogorov()

1603

\li\ref apop_estimate_coefficient_of_determination()

1604

\li\ref apop_estimate_r_squared()

1605

\li\ref apop_estimate_parameter_tests()

1606

1607

See also the example at the end of \ref gentle.

1608

1609

Outlineheader Mont Monte Carlo methods

1610

1611

\li\ref apop_bootstrap_cov()

1612

\li\ref apop_jackknife_cov()

1613

1614

endofdiv

1615

1616

To give another example of testing, here is a function that used to be a part of Apophenia, but seemed a bit out of place. Here it is as a sample:

1617

1618

1619

\code

1620

// Input: any old vector. Output: 1 - the p-value for a chi-squared
// test to answer the question, "with what confidence can I reject the
// hypothesis that the variance of my data is zero?"

double apop_test_chi_squared_var_not_zero(const gsl_vector *in){
    Apop_stopif(!in, return NAN, 0, "input vector is NULL. Doing nothing.");
    double sum=0;
    gsl_vector *normed = apop_vector_normalize((gsl_vector *)in, .normalization_type='s');
    gsl_vector_mul(normed, normed);
    for(size_t i=0; i< normed->size; i++)
        sum += gsl_vector_get(normed, i);
    gsl_vector_free(normed);
    return gsl_cdf_chisq_P(sum, in->size);
}

1634

\endcode

1635

1636

Or, consider the Rao statistic,

1637

\f${\partial\over \partial\beta}\log L(\beta)'I^{-1}(\beta){\partial\over \partial\beta}\log L(\beta)\f$

1638

where \f$L\f$ is your model's likelihood function and \f$I\f$ its information matrix. In code:

1639

1640

\code

1641

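apop_data * infoinv = apop_model_numerical_covariance(data, your_model);
//Allocate a vector to hold the score, one element per parameter.
apop_data * score = apop_data_alloc(your_model->parameters->vector->size, 0, 0);
apop_score(data, score->vector, your_model);
apop_data * stat = apop_dot(apop_dot(score, infoinv), score);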

1645

\endcode

1646

1647

Given the correct assumptions, this is \f$\sim \chi^2_m\f$, where \f$m\f$ is the dimension of \f$\beta\f$, so the odds of a Type I error given the model is:

1648

1649

\code

1650

double p_value = apop_test(stat, "chi squared", beta->size);

1651

\endcode

1652

1653

endofdiv

1654

1655

Outlineheader Histosec Empirical distributions and PMFs (probability mass functions)

1656

1657

The \ref apop_pmf model wraps a \ref apop_data set so it can be read as an empirical

1658

model, with a likelihood function (equal to the associated weight for observed

1659

values and zero for unobserved values), a random number generator (which

1660

simply makes weighted random draws from the data), and so on. Setting it up is a

1661

model estimation from data like any other, done via \ref apop_estimate(\c your_data,

1662

\ref apop_pmf).

1663

1664

You have the option of cleaning up the data before turning it into a PMF. For example...

1665

1666

\code

1667

apop_data_pmf_compress(your_data); //remove duplicates

1668

apop_data_sort(your_data);

1669

apop_vector_normalize(your_data->weights); //weights sum to one

1670

apop_model *a_pmf = apop_estimate(your_data, apop_pmf);

1671

\endcode

1672

1673

These are largely optional.

1674

1675

\li The CDF is calculated based on the percent of the weights between the zeroth row of the PMF

1676

and the row specified. This generally makes more sense after \ref apop_data_sort.

1677

\li Compression produces a corresponding improvement in efficiency when calculating

1678

CDFs, but is otherwise not necessary.

1679

\li Sorting or normalizing is not necessary for making draws or getting a likelihood or log likelihood.
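
The CDF rule in the first point can be sketched in plain C as a running share of the weights. The function name is hypothetical, and treating the specified row as included in the sum is an assumption of this sketch rather than a statement about the library.

```c
#include <assert.h>
#include <stddef.h>

// Share of the total weight held by rows 0 through `row`, inclusive.
// A NULL weights array means every row has equal weight.
double pmf_cdf(const double *weights, size_t n, size_t row){
    assert(n > 0 && row < n);
    if (!weights) return (row + 1.0)/n;
    double upto = 0, total = 0;
    for (size_t i = 0; i < n; i++){
        total += weights[i];
        if (i <= row) upto += weights[i];
    }
    return upto/total;
}
```

Because the result depends on row order, the same data gives a different CDF before and after sorting, which is why sorting first generally makes more sense.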

1680

1681

It is the weights vector that holds the density represented by each row; the rest of the row represents the coordinates of that density. If the input data set has no \c weights segment, then I assume that all rows have equal weight.

1682

1683

Most models have a \c parameters \ref apop_data set that is filled when you call \ref

1684

apop_estimate. For a PMF model, the \c parameters are \c NULL, and the \c data itself

1685

is used for calculation. Therefore, modifying the data post-estimation can break some

1686

internal settings set during estimation. If you modify the data, throw away any existing

1687

PMFs (\c apop_model_free) and re-estimate a new one.

1688

1689

1690

Using \ref apop_data_pmf_compress puts the data into one bin for each unique value in

1691

the data set.

1692

You may instead want bins of fixed width, in the style of a histogram, which you can get via

1693

\ref apop_data_to_bins. It requires a bin specification. If you send a \c NULL binspec,

1694

then the offset is zero and the bin size is big enough to ensure that there are

1695

\f$\sqrt{N}\f$ bins from minimum to maximum. There are other preferred formulæ for

1696

bin widths that minimize MSE, which might be added as a future extension. The binspec

1697

will be added as a page to the data set, named <tt>"<binspec>"</tt>. See the \ref

1698

apop_data_to_bins documentation on how to write a custom bin spec.

1699

1700

There are a few ways of testing the claim that one distribution equals another, typically an empirical PMF versus a smooth theoretical distribution. In every case, you will need two distributions based on the same binspec.

1701

1702

For example, if you do not have a prior binspec in mind, then you can use the one generated by the first call to the histogram binning function to make sure that the second data set is in sync:

1703

1704

\code

1705

apop_data_to_bins(first_set, NULL);

1706

apop_data_to_bins(second_set, apop_data_get_page(first_set, "<binspec>"));

1707

\endcode

1708

1709

You can use \ref apop_test_kolmogorov or \ref apop_histograms_test_goodness_of_fit to generate the appropriate statistics from the pairs of bins.

1710

1711

Kernel density estimation will produce a smoothed PDF. See \ref apop_kernel_density for details.

1712

Or, use \ref apop_vector_moving_average for a simpler smoothing method.

1713

1714

1715

\li\ref apop_data_pmf_compress() : merge together redundant rows in a data set before calling

1716

\ref apop_estimate(\c your_data, \ref apop_pmf); optional.

1717

\li\ref apop_vector_moving_average() : smooth a vector (e.g., your_pmf->data->weights) via moving average.

1718

\li\ref apop_histograms_test_goodness_of_fit() : goodness-of-fit via \f$\chi^2\f$ statistic

1719

\li\ref apop_test_kolmogorov() : goodness-of-fit via Kolmogorov-Smirnov statistic

1720

1721

endofdiv

1722

1723

Outlineheader Maxi Maximum likelihood methods

1724

1725

This section includes some notes on the maximum likelihood routine. As in the section

1726

on writing models above, if a model has a \c p or \c log_likelihood method but no \c

1727

estimate method, then calling \c apop_estimate(your_data, your_model) executes the

1728

default estimation routine of maximum likelihood.

1729

1730

If you are not a statistician, then there are a few things you will need to keep in

1731

mind:

1732

1733

\li Physicists, pure mathematicians, and the GSL minimize; economists, statisticians, and

1734

Apophenia maximize. If you are doing a minimization, be sure that your function returns minus the objective

1735

function's value.

1736

1737

\li The overall setup is about estimating the parameters of a model with data. The user

1738

provides a data set and an unparameterized model, and the system tries parameterized

1739

models until one of them is found to be optimal. The data is fixed. The parameters make sense only in the

1740

context of a model, so the optimization tries a series of parameterized models,

1741

searching for the one that is most likely.

1742

In a non-stats setting, you will probably have \c NULL data, and will be optimizing the

1743

parameters internal to the model.

1744

1745

\li Because the unit of analysis is a parameterized model,

1746

not just parameters, you need to have an \ref apop_model

1747

wrapping your objective function. Here is a typical sort of function that one would

1748

maximize; it is Rosenbrock's banana function, \f$(1-x)^2+ s(y - x^2)^2\f$, where the

1749

scaling factor \f$s\f$ is fixed ahead of time, say at 100.

1750

1751

\code

1752

typedef struct {

1753

double scaling;

1754

} coeff_struct;

1755

1756

double banana (double *params, coeff_struct *in){

1757

return (gsl_pow_2(1-params[0]) +

1758

in->scaling*gsl_pow_2(params[1]-gsl_pow_2(params[0])));

1759

}

1760

\endcode

1761

1762

The function returns a single number to be minimized. You will need to write an

1763

\ref apop_model to send to the optimizer, which is a two step process: write a log

1764

likelihood function wrapping the real objective function, and a model that uses that

1765

log likelihood. Here they are:

1766

1767

\code

1768

double ll (apop_data *d, apop_model *in){

1769

return - banana(in->parameters->vector->data, in->more);

1770

}

1771

1772

int main(){

1773

coeff_struct co = {.scaling=100};

1774

apop_model b = {"Bananas!", .log_likelihood= ll, .vsize=2,

1775

.more = &co, .more_size=sizeof(coeff_struct)};

1776

\endcode

1777

1778

The <tt>vsize=2</tt> specifies that your parameters are a vector of size two, which

1779

means that <tt>in->parameters->vector->data</tt> is the list of <tt>double</tt>s that you

1780

should send to \c banana. The \c more element of the structure is designed to hold any

1781

arbitrary structure; if you use it, you will also need to use the \c more_size element, as

1782

above.

1783

1784

\li Statisticians want the covariance and basic tests about the parameters. If you just

1785

want the optimal value, then adding this line will shut off all auxiliary calculations:

1786

\code

1787

Apop_settings_add_group(your_model, apop_parts_wanted);

1788

\endcode

1789

See the documentation for \ref apop_parts_wanted_settings for details about how this

1790

works. It can also offer quite the speedup: especially for high-dimensional problems,

1791

finding the covariance matrix without any information can take dozens of evaluations

1792

of the objective function for each evaluation that is part of the search itself.

1793

1794

\li MLEs have an especially large number of parameter tweaks that could be made;

1795

see the section on MLE settings above or the \ref apop_mle_settings page.

1796

1797

\li Putting it all together, here is a full program to minimize Rosenbrock's banana

1798

function. There are some extras: it uses two methods (notice how easy it is to re-run an

1799

estimation with an alternate method, but the syntax for modifying a setting differs from

1800

the initialization syntax) and checks that the results are accurate.

1801

1802

\include ../eg/banana.c

1803

1804

\li If you would like to see what the optimizer did, add <tt>.want_path='y'</tt> to the settings group, then get the <tt>path</tt> element from the settings group:

1805

1806

\code

1807

Apop_settings_add_group(your_model, apop_mle, .want_path='y');

1808

apop_model *out = apop_estimate(your_data, your_model);

1809

apop_data_show(Apop_settings_get(out, apop_mle, path));

1810

\endcode

1811

1812

Outlineheader constr Setting Constraints

1813

1814

The problem is that the parameters of a function must not take on certain values, either because the function is undefined for those values or because parameters with certain values would not fit the real-world problem.

1815

1816

The solution is to rewrite the function being maximized such that the function is continuous at the constraint boundary but takes a steep downward slope. The unconstrained maximization routines will be able to search a continuous function but will never return a solution that falls beyond the parameter limits.

1817

1818

If you give it a likelihood function with no regard to constraints plus an array of constraints,

1819

\ref apop_maximum_likelihood will combine them to a function that fits the above description and search accordingly.

1820

1821

A constraint function must do three things:

1822

\li It must check the constraint, and if the constraint does not bind (i.e. the parameter values are OK), then it must return zero.

1823

\li If the constraint does bind, it must return a penalty that indicates how far off the parameter is from meeting the constraint.
\li If the constraint does bind, it must set a return vector that the likelihood function can take as a valid input. The penalty at this returned value must be zero.

1825

1826

The idea is that if the constraint returns zero, the log likelihood function will
return the log likelihood as usual, and if not, it will return the log likelihood at
the constraint's return vector minus the penalty. To give a concrete example, here
is a constraint function that will ensure that both parameters of a two-dimensional
input are greater than zero. As with the constraints for many of the models that
ship with Apophenia, it is a wrapper for \ref apop_linear_constraint.

1832

1833

\code
static long double beta_zero_greater_than_x_constraint(apop_data *data, apop_model *v){
    //constraint is 0 < beta_1 and 0 < beta_2
    static apop_data *constraint = NULL;
    if (!constraint) constraint = apop_data_falloc((2,2,2), 0, 1, 0,   //0 < 1*beta_1 + 0*beta_2
                                                           0, 0, 1);  //0 < 0*beta_1 + 1*beta_2
    return apop_linear_constraint(v->parameters->vector, constraint, 1e-3);
}
\endcode

1841

1842

\li\ref apop_linear_constraint()
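
To make the three requirements concrete without leaning on any Apophenia calls, here is a minimal sketch in plain C for a one-parameter problem. Everything in it (\c objective, \c positive_constraint, \c penalized_objective, and the margin of 1e-3) is a hypothetical stand-in chosen for illustration; it is not how \ref apop_linear_constraint is implemented.

```c
// Hypothetical objective standing in for a log likelihood. Its unconstrained
// maximum is at x = -1, which is outside the permitted region x > 0.
static double objective(double x){ return -(x + 1)*(x + 1); }

// Constraint 0 < x, enforced with a small margin.
// Returns zero if the constraint does not bind. Otherwise, moves *x to a
// valid point and returns the distance moved as the penalty.
static double positive_constraint(double *x, double margin){
    if (*x >= margin) return 0;
    double penalty = margin - *x;
    *x = margin;
    return penalty;
}

// The function an unconstrained optimizer would search: the objective at the
// (possibly corrected) point, minus the penalty.
double penalized_objective(double x){
    double penalty = positive_constraint(&x, 1e-3);
    return objective(x) - penalty;
}
```

Outside the permitted region the penalized function slopes downward as \c x moves further away, so a search that wanders out of bounds is pushed back toward valid values.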

1843

1844

endofdiv

1845

1846

\li\ref apop_estimate_restart(): Restarting an MLE with different settings can improve results.

1847

\li\ref apop_maximum_likelihood(): Rarely used. If a model has no \c estimate element, just call \ref apop_estimate to run an MLE.

1848

\li\ref apop_model_numerical_covariance()

1849

\li\ref apop_numerical_gradient()

1850

1851

endofdiv

1852

1853

1854

Outlineheader Miss Missing data

1855

1856

\li\ref apop_data_listwise_delete()

1857

\li\ref apop_ml_impute()

1858

1859

endofdiv

1860

1861

Outlineheader Legi Legible output

1862

1863

The output routines handle four sinks for your output. There is a global variable that

1864

you can use for small projects where all data will go to the same place.

1865

1866

\code
apop_opts.output_type = 's'; //Stdout
apop_opts.output_type = 'f'; //named file
apop_opts.output_type = 'p'; //a pipe or already-opened file
apop_opts.output_type = 'd'; //the database
\endcode

1872

1873

You can also set the output type, the name of the output file or table, and other options

1874

via arguments to individual calls to output functions. See \ref apop_prep_output for the list of options.

1875

1876

1877

C makes minimal distinction between pipes and files, so you can set a

1878

pipe or file as output and send all output there until further notice:

1879

1880

\code
apop_opts.output_type = 'p';
apop_opts.output_pipe = popen("gnuplot", "w");
apop_plot_lattice(...); //see https://github.com/b-k/Apophenia/wiki/gnuplot_snippets
pclose(apop_opts.output_pipe); //a stream opened with popen is closed with pclose
apop_opts.output_pipe = fopen("newfile", "w");
apop_data_print(set1);
fprintf(apop_opts.output_pipe, "\nNow set 2:\n");
apop_data_print(set2);
\endcode

1890

1891

Continuing the example, you can always override the global settings with
a specific request:

1893

\code
apop_vector_print(v, "vectorfile"); //put vectors in a separate file
apop_matrix_print(m, "matrix_table", .output_type = 'd'); //write to the db
apop_matrix_print(m, .output_pipe = stdout); //now show the same matrix on screen
\endcode

1898

1899

I will first look to the input file name, then the input pipe, then the

1900

global \c output_pipe, in that order, to determine to where I should

1901

write. Some combinations (like output type = \c 'd' and only a pipe) don't

1902

make sense, and I'll try to warn you about those.

1903

1904

What if you have too much output and would like to use a pager, like \c less or \c more?

1905

In C and POSIX terminology, you're asking to pipe your output to a paging program. Here is

1906

the form:

1907

\code
FILE *lesspipe = popen("less", "w");
assert(lesspipe);
apop_data_print(your_data_set, .output_pipe=lesspipe);
pclose(lesspipe);
\endcode

1913

\c popen will search your usual program path for \c less, so you don't have to give a full path.

1914

1915

\li\ref apop_data_print()

1916

\li\ref apop_matrix_print()

1917

\li\ref apop_vector_print()

1918

\li\ref apop_data_show() : alias for \ref apop_data_print limited to \c stdout.

1919

1920

The plot functions produce output for Gnuplot (so output type = \c 'd' again does not

1921

make sense). As above, you can pipe directly to Gnuplot or write to a file. Please

1922

consider these to be deprecated, as there is better graphics support in the works.

1923

1924

\li\ref apop_plot_histogram()

1925

1926

endofdiv

1927

1928

Outlineheader moreasst Assorted

1929

1930

A few more descriptive methods:

1931

1932

\li\ref apop_matrix_pca : Principal component analysis

1933

\li\ref apop_anova : One-way or two-way ANOVA tables

1934

\li\ref apop_rake : Iterative proportional fitting on large, sparse tables

1935

1936

1937

General utilities:

1938

1939

\li\ref Apop_stopif : Apophenia's error-handling and warning-printing macro.

1940

\li\ref apop_opts : the global options

1941

\li\ref apop_regex() : friendlier front-end for POSIX-standard regular expression searching and pulling matches into a \ref apop_data set.

1942

\li\ref apop_text_paste(): produce a single string from a grid of text

1943

\li\ref apop_system()

1944

1945

Math utilities:

1946

1947

\li\ref apop_matrix_is_positive_semidefinite()

1948

\li\ref apop_matrix_to_positive_semidefinite()

1949

\li\ref apop_generalized_harmonic()

1950

\li\ref apop_multivariate_gamma()

1951

\li\ref apop_multivariate_lngamma()

1952

\li\ref apop_rng_alloc()

1953

1954

endofdiv

1955

1956

Outlineheader links Further references

1957

1958

For your convenience, here are links to some other libraries you are probably using.

1959

1960

\li <a href="http://www.gnu.org/software/libc/manual/html_node/index.html">The standard C library</a>

1961

\li <a href="http://www.gnu.org/software/gsl/manual/html_node/index.html">The

1962

GSL documentation</a>, and <a href="http://www.gnu.org/software/gsl/manual/html_node/Function-Index.html">its index</a>

1963

\li <a href="http://sqlite.org/lang.html">SQL understood by SQLite</a>

1964

1965

*/

1966

1967

/** \page mingw MinGW

1968

1969

Minimalist GNU for Windows is indeed minimalist: it is not a full POSIX subsystem, and provides no package manager. Therefore, you will have to make some adjustments and install the dependencies yourself.

1970

1971

Matt P. Dziubinski successfully used Apophenia via MinGW; here are his instructions (with edits by BK):

1972

1973

\li get libregex (the ZIP file) from:

1974

http://sourceforge.net/project/showfiles.php?group_id=204414&package_id=306189

1975

\li get libintl (three ZIP files) from:

1976

http://gnuwin32.sourceforge.net/packages/libintl.htm .

1977

download "Binaries", "Dependencies", "Developer files"

1978

\li follow "libintl" steps from:

1979

http://kayalang.org/download/compiling/windows

1980

1981

\li Comment out "alloc.h" in apophenia-0.22/vasprintf/vasnprintf.c

1982

\li Modify \c Makefile, adding -lpthread to AM_CFLAGS (removing -pthread) and -lregex to AM_CFLAGS and LIBS

1983

1984

\li Now compile the main library:

1985

\code

1986

make

1987

\endcode

1988

1989

\li Finally, put one more expected directory in place and install:

1990

\code

1991

mkdir -p -- "/usr/local/Lib/site-packages"

1992

make install

1993

\endcode

1994

1995

\li You will get the usual warning about library paths, and may have to take the specified action:

1996

\code
----------------------------------------------------------------------
Libraries have been installed in:
   /usr/local/lib

If you ever happen to want to link against installed libraries
in a given directory, LIBDIR, you must either use libtool, and
specify the full pathname of the library, or use the `-LLIBDIR'
flag during linking and do at least one of the following:
   - add LIBDIR to the `PATH' environment variable
     during execution
   - add LIBDIR to the `LD_RUN_PATH' environment variable
     during linking
   - use the `-LLIBDIR' linker flag

See any operating system documentation about shared libraries for
more information, such as the ld(1) and ld.so(8) manual pages.
----------------------------------------------------------------------
\endcode

2015

*/

2016

2017

2018

/** \page optionaldetails Implementation of optional arguments

2019

Optional and named arguments are among the most commonly commented-on features of Apophenia, so this page goes into full detail about the implementation.

2020

2021

To use these features, see the all-you-really-need summary at the \ref designated

2022

page. For a background and rationale, see the blog entry at http://modelingwithdata.org/arch/00000022.htm .

2023

2024

I'll assume you've read both links before continuing.

2025

2026

OK, now that you've read the how-to-use and the discussion of how optional and named arguments can be constructed in C, this page will show how they are done in Apophenia. The level of detail should be sufficient to implement them in your own code if you so desire.

2027

2028

There are three components to the process of generating optional arguments as implemented here:

2029

\li Produce a \c struct whose elements match the arguments to the function.

2030

\li Write a wrapper function that takes in the struct, unpacks it, and calls the original function.

2031

\li Write a macro that makes the user think the wrapper function is the real thing.

2032

2033

None of these steps are really rocket science, but there is a huge amount of redundancy.

2034

Apophenia includes some macros that reduce the boilerplate redundancy significantly. There are two layers: the C-standard code, and the script that produces the C-standard code.

2035

2036

We'll begin with the C-standard header file:

2037

\code
#ifdef APOP_NO_VARIADIC
void apop_vector_increment(gsl_vector * v, int i, double amt);
#else
void apop_vector_increment_base(gsl_vector * v, int i, double amt);
apop_varad_declare(void, apop_vector_increment, gsl_vector * v; int i; double amt);
#define apop_vector_increment(...) apop_varad_link(apop_vector_increment, __VA_ARGS__)
#endif
\endcode

2046

2047

First, there is an if/else that allows the system to degrade gracefully

2048

if you are sending C code to a parser like swig, whose goals differ

2049

too much from straight C compilation for this to work. Just set \c

2050

APOP_NO_VARIADIC to produce a plain function with no variadic support.

2051

2052

Else, we begin the above steps. The \c apop_varad_declare line expands to the following:

2053

2054

\code
typedef struct {
    gsl_vector * v; int i; double amt ;
} variadic_type_apop_vector_increment;

void variadic_apop_vector_increment(variadic_type_apop_vector_increment varad_in);
\endcode

2061

2062

So there's the ad-hoc struct and the declaration for the wrapper

2063

function. Notice how the arguments to the macro had semicolons, like a

2064

struct declaration, rather than commas, because the macro does indeed

2065

wrap the arguments into a struct.

2066

2067

Here is what the \c apop_varad_link would expand to:

2068

\code
#define apop_vector_increment(...) variadic_apop_vector_increment((variadic_type_apop_vector_increment) {__VA_ARGS__})
\endcode

2071

That gives us part three: a macro that lets the user think that they are

2072

making a typical function call with a set of arguments, but wraps what

2073

they type into a struct.

2074

2075

Now for the code file where the function is defined. Again, there is an \c APOP_NO_VARIADIC wrapper. Inside the interesting part, we find the wrapper function to unpack the struct that comes in.

2076

2077

\code
#ifdef APOP_NO_VARIADIC
void apop_vector_increment(gsl_vector * v, int i, double amt){
#else
apop_varad_head( void , apop_vector_increment){
    gsl_vector * apop_varad_var(v, NULL);
    Apop_assert(v, "You sent me a NULL vector.");
    int apop_varad_var(i, 0);
    double apop_varad_var(amt, 1);
    apop_vector_increment_base(v, i, amt);
}

void apop_vector_increment_base(gsl_vector * v, int i, double amt){
#endif
    v->data[i * v->stride] += amt;
}
\endcode

2094

2095

The \c apop_varad_head macro just reduces redundancy, and will expand to
\code
void variadic_apop_vector_increment (variadic_type_apop_vector_increment varad_in)
\endcode

2100

2101

The function with this header thus takes in a single struct, and for every variable, there is a line like

2102

\code
double apop_varad_var(amt, 1);
\endcode
which simply expands to:
\code
double amt = varad_in.amt ? varad_in.amt : 1;
\endcode

2109

Thus, the macro declares each not-in-struct variable, and so there will need to be one such declaration line for each argument. Apart from requiring declarations, you can be creative: include sanity checks, post-vary the variables of the inputs, unpack without the macro, and so on. That is, this parent function does all of the bookkeeping, checking, and introductory shunting, so the base function can just do the math. Finally, the introductory section will call the base function.

2110

2111

The setup goes out of its way to leave the \c _base function in the public namespace, so that those who would prefer speed to bounds-checking can simply call that function directly, using standard notation. You could eliminate this feature by just merging the two functions.
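
If you want to see the three components in one place without any of Apophenia's helper macros, here is a self-contained sketch in standard C. The names (\c increment_args, \c increment_base, \c increment_wrapper, \c increment) are hypothetical and are not part of the library; the sketch acts on a plain array rather than a \c gsl_vector so that it compiles on its own.

```c
// Step 1: a struct whose elements match the arguments to the function.
typedef struct { double *v; int i; double amt; } increment_args;

// The base function, which just does the math.
void increment_base(double *v, int i, double amt){ v[i] += amt; }

// Step 2: a wrapper that takes in the struct, fills in defaults, and calls
// the base function. Elements the caller omitted arrive as zero.
void increment_wrapper(increment_args in){
    if (!in.v) return;                // a fuller version would report the error
    double amt = in.amt ? in.amt : 1; // default increment of one
    increment_base(in.v, in.i, amt);
}

// Step 3: a macro that wraps whatever the user types into the struct.
#define increment(...) increment_wrapper((increment_args){__VA_ARGS__})
```

With these in place, <tt>increment(d, 2, 5)</tt>, <tt>increment(d, .i=1)</tt>, and <tt>increment(.amt=3, .v=d)</tt> are all valid calls. One consequence of the zero test in the wrapper is that an explicit <tt>.amt=0</tt> is indistinguishable from an omitted argument, and so also gets the default.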

2112

2113

2114

<b>The m4 script</b>

2115

2116

The above is all you need to make this work: the varad.h file, and the above structures. But there is still a lot of redundancy, which can't be eliminated by the plain C preprocessor.

2117

2118

Thus, in Apophenia's code base (the one you'll get from checking out the git repository, not the gzipped distribution that has already been post-processed) you will find a pre-preprocessing script that converts a few markers to the above form. Here is the code that will expand to the above C-standard code:

2119

2120

\code
//header file
APOP_VAR_DECLARE void apop_vector_increment(gsl_vector * v, int i, double amt);

//code file
APOP_VAR_HEAD void apop_vector_increment(gsl_vector * v, int i, double amt){
    gsl_vector * apop_varad_var(v, NULL);
    Apop_assert(v, "You sent me a NULL vector.");
    int apop_varad_var(i, 0);
    double apop_varad_var(amt, 1);
APOP_VAR_END_HEAD
    v->data[i * v->stride] += amt;
}
\endcode

2134

2135

It is obviously much shorter. The declaration line is actually a C-standard declaration with the \c APOP_VAR_DECLARE preface, so you don't have to remember when to use semicolons. The function itself looks like a single function, but there is again a marker before the declaration line, and the introductory material is separated from the main matter by the \c APOP_VAR_END_HEAD line. Done right, drawing a line between the introductory checks or initializations and the main function can really improve readability.

2136

2137

The m4 script inserts a <tt>return function_base(...)</tt> at the end of the header
function, so you don't have to. If you want to call the function before the last line, you
can do so explicitly, as in the expansion above, and add a bare <tt>return;</tt> to
guarantee that the call to the base function that the m4 script will insert won't ever be
reached.

2142

2143

One final detail: it is valid to have types with commas in them---function arguments. Because commas get turned to semicolons, and m4 isn't a real parser, there is an exception built in: you will have to replace commas with exclamation marks in the header file (only). E.g.,

2144

2145

\code

2146

APOP_VAR_DECLARE apop_data * f_of_f(apop_data *in, void *param, int n, double (*fn_d)(double ! void * !int));

2147

\endcode

2148

2149

m4 is POSIX standard, so even if you can't read the script, you have the program needed to run it. For example, if you name it \c prep_variadics.m4, then run

2150

\code

2151

m4 prep_variadics.m4 myfile.m4.c > myfile.c

2152

\endcode

2153

*/

2154

2155

/** \page dataprep Data prep rules

2156

2157

There are a lot of ways your data can come in, and we would like to run estimations on a reasonably standardized form.

2158

2159

First, this page will give a little rationale, which you are welcome to skip, and then will present the set of rules.

2160

2161

\section dataones Dealing with the ones column

2162

Most standard regression-type estimations require or generally expect

2163

a constant column. That is, the 0th column of your data is a constant (one), so the first parameter

2164

\f$\beta_1\f$ is slightly special in corresponding to a constant rather than a variable.

2165

2166

However, there are some estimations that do not use the constant column.

2167

2168

\em "Why not implicitly assume the ones column?"

2169

Some stats packages implicitly assume a constant column, which the user never sees. This violates the principle of transparency

2170

upon which Apophenia is based, and is generally annoying. Given a data matrix \f$X\f$ with the estimated parameters \f$\beta\f$,

2171

if the model asserts that the product \f$X\beta\f$ has meaning, then you should be able to calculate that product. With a ones column, a dot product is one line, <tt>apop_dot(x, your_est->parameters, 0, 0)</tt>; without a ones column, the problem is left as an unpleasant exercise for the reader.

2172

2173

\subsection datashunt Shunting columns around.

2174

2175

Each regression-type estimation has one dependent variable and several independent variables. In the end, we want the dependent variable to be in the vector element. However, continuing the \em "laissez faire" tradition, doing major surgery on the data, such as removing a column and moving in all subsequent columns, is more invasive than an estimation should be.

2176

2177

\subsection datarules The rules

2178

So those are the two main considerations in prepping data. Here are the rules, intended to balance those considerations:

2179

2180

\subsubsection datamatic The automatic case

2181

There is one clever trick we can use to resolve both the need for a ones column and for having the dependent column in the vector: given a data set with no vector element and the dependent variable in the first column of the matrix, we can copy the dependent variable into the vector and then replace the first column of the matrix with ones. The result

2182

fits all of the above expectations.

2183

2184

You as a user merely have to send in an \c apop_data set with no vector and a dependent column in the first column.
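
The trick itself is only a few lines. Here is a sketch on a bare row-major array, assuming a hypothetical function name and layout; the library's own prep routines work on \c apop_data sets and do more checking than this.

```c
#include <stddef.h>

// matrix holds rows x cols values stored row by row, with the dependent
// variable in column zero; dependent must have room for rows elements.
void move_dependent_and_add_ones(double *matrix, size_t rows, size_t cols, double *dependent){
    for (size_t r = 0; r < rows; r++){
        dependent[r] = matrix[r*cols]; // copy the dependent variable out
        matrix[r*cols] = 1;            // leave a constant column in its place
    }
}
```

After the call, the separate array plays the role of the vector element, and the matrix has a ones column followed by the independent variables, untouched.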

2185

2186

\subsubsection dataprepped The already-prepped case

2187

If your data has a vector element, then the prep routines won't try to force something to be there. That is, they won't move anything, and won't turn anything into a constant column. If you don't want to use a constant column, or your data has already been prepped by an estimation, then this is what you want.

2188

2189

You as a user just have to send in an \c apop_data set with a filled vector element.

2190

*/

2191

2192

/**

2193

\page gentle A quick overview

2194

2195

This is a "gentle introduction" to the Apophenia library. It is intended

2196

to give you some initial bearings on the typical workflow and the concepts and tricks that

2197

the manual pages assume you have met.

2198

2199

This introduction assumes you already have some familiarity with C, how to compile a program, and how to use a debugger. If you want to install Apophenia now so you can try the samples on this page, see the \ref setup page.

2200

2201

An outline of this overview:

2202

2203

\li Apophenia fills a space between traditional C libraries and stats packages.

2204

\li The \ref apop_data structure represents a data set (of course). Data sets are inherently complex,

2205

but there are many functions that act on \ref apop_data sets to make life easier.

2206

\li The \ref apop_model encapsulates the sort of actions one would take with a model, like estimating model parameters or predicting values based on new inputs.

2207

\li Databases are great, and a perfect fit for the sort of paradigm here. Apophenia

2208

provides functions to make it easy to jump between database tables and \ref apop_data sets.

2209

2210

\par The opening example

2211

2212

Setting aside the more advanced applications and model-building tasks, let us begin with

2213

the workflow of a typical fitting-a-model project using Apophenia's tools:

2214

2215

\li Read the raw data into the database using \ref apop_text_to_db.

2216

\li Use SQL queries handled by \ref apop_query to massage the data as needed.

2217

\li Use \ref apop_query_to_data to pull some of the data into an in-memory \ref apop_data set.

2218

\li Call a model estimation such as \code apop_estimate (data_set, apop_ols)\endcode or \code apop_estimate (data_set, apop_probit)\endcode to fit parameters to the data. This will return an \ref apop_model with parameter estimates.

2219

\li Interrogate the returned estimate, by dumping it to the screen with \ref apop_model_print, sending its parameters and variance-covariance matrices to additional tests (the \c estimate step runs a few for you), or sending the model's output to be input to another model.

2220

2221

Here is a concrete example of most of the above steps, which you can compile and run, to be discussed in detail below.

2222

2223

The program:

2224

2225

\include ols.c

2226

2227

To run this, you will need a file named <tt>data</tt> in comma-separated form, so

2228

here is a set sufficient for the demonstration. The first column is the dependent variable; the

2229

remaining columns are the independent:

2230

2231

\code
Y, X_1, X_2, X_3
2,3,4,5
1,2,9,3
4,7,9,0
2,4,8,16
1,4,2,9
9,8,7,6
\endcode

2240

2241

If you saved the code to <tt>sample.c</tt> and don't have a \ref makefile or other

2242

build system, then you can compile it with

2243

2244

\code

2245

gcc sample.c -std=gnu99 -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me

2246

\endcode

2247

2248

or

2249

2250

\code

2251

clang sample.c -lapophenia -lgsl -lgslcblas -lsqlite3 -o run_me

2252

\endcode

2253

2254

and then run it with <tt>./run_me</tt>. This compile line will work on any system with all the requisite tools,

2255

but for full-time work with this or any other C library, you will probably want to write a \ref makefile .

2256

2257

The results are unremarkable---this is just an ordinary regression on too few data

2258

points---but it does give us some lines of code to dissect.

2259

2260

The first two lines in \c main() make use of a database.
I'll discuss the value of the database step more at the end of this page, but for
now, note that there are several functions (\ref apop_query and \ref apop_query_to_data
being the ones you will use most frequently) that allow you to talk to and pull data from either an SQLite or MySQL/mariaDB database.

2264

2265

\par Designated initializers

2266

2267

Like this line,

2268

2269

\code

2270

apop_text_to_db(.text_file="data", .tabname="d");

2271

\endcode

2272

2273

many Apophenia functions accept named, optional arguments. To give another example,

2274

the \ref apop_data set has the usual row and column numbers, but also row and column

2275

names. So you should be able to refer to a cell by any combination of name or number;

2276

for the data set you read in above, which has column names, all of the following work:

2277

2278

\code
x = apop_data_get(data, 2, 3); //x == 0
x = apop_data_get(data, .row=2, .colname="X_3"); // x == 0
apop_data_set(data, 2, 3, 18);
apop_data_set(data, .colname="X_3", .row=2, .val= 18);
\endcode

2284

2285

2286

Default values mean that the \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr functions handle matrices, vectors, and scalars sensibly:

2287

\code
//Let v be a hundred-element vector:
apop_data *v = apop_data_alloc(100);
double x1 = apop_data_get(v, 1);
apop_data_set(v, 2, .val=x1);

//A 100x1 matrix behaves like a vector
apop_data *m = apop_data_alloc(100, 1);
double m1 = apop_data_get(m, 1);

//let s be a scalar stored in a one-element apop_data set:
apop_data *s = apop_data_alloc(1);
double *scalar = apop_data_ptr(s);
\endcode

2301

2302

This form may be new to users of less user-friendly C libraries, but it fully
conforms to the C standard (ISO/IEC 9899:2011). See the \ref designated page for details.

2304

2305

2306

\section apop_data

2307

2308

A lot of real-world data processing is about quotidian annoyances, like text versus
numeric data or dealing with missing values, and the \ref apop_data set and its
many support functions are intended to make data processing in C easy. Some users of
Apophenia use the library only for its \ref apop_data set and associated functions. See
the "data sets" section of the outline page (linked from the header of this page)
for extensive notes on using the structure.

2314

2315

The structure includes seven parts:

2316

2317

\li a vector

2318

\li a matrix

2319

\li a grid of text elements

2320

\li a vector of weights

2321

\li names for everything: row names, a vector name, matrix column names, text names.

2322

\li a link to a second page of data

2323

\li an error marker

2324

2325

This is not a generic and abstract ideal, but is really the sort of mess that real data sets look like. For
example, here is some data for a weighted OLS regression. It includes an outcome
variable in the vector, independent variables in the matrix and text grid,
replicate weights, and column names in bold labeling the variables:

2329

2330

\htmlinclude apop_data_fig.html

2331

2332

Apophenia will generally assume that one row across all of these elements

2333

describes a single observation or data point.

2334

2335

See above for some examples of getting and setting individual elements.

2336

2337

Also, \ref apop_data_get, \ref apop_data_set, and \ref apop_data_ptr consider the vector to be the -1st column,

2338

so using the data set in the figure, <tt>apop_data_get(sample_set, .row=0, .col=-1) == 1</tt>.
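
To illustrate the convention, here is a sketch of a getter over a stripped-down structure. The \c tiny_data type and \c tiny_get function are hypothetical and much simpler than \ref apop_data and \ref apop_data_get; they only show how a column index of -1 can be routed to the vector.

```c
#include <stddef.h>

typedef struct {
    double *vector;     // one element per row
    double *matrix;     // rows x cols values, stored row by row
    size_t rows, cols;
} tiny_data;

// Column -1 refers to the vector; columns 0 and up refer to the matrix.
double tiny_get(const tiny_data *d, size_t row, int col){
    if (col == -1) return d->vector[row];
    return d->matrix[row*d->cols + (size_t)col];
}
```

The appeal of the convention is that a loop from column -1 to the last matrix column visits the whole numeric part of a row with one accessor.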

2339

2340

\par Reading in data

2341

2342

As per the example above, use \ref apop_text_to_data or \ref apop_text_to_db and then \ref apop_query_to_data.

2343

2344

\par Subsets

2345

2346

There are many macros to get subsets of the data. Each generates what is

2347

considered to be a disposable view: once the variable goes out of scope (by the usual C

2348

rules of scoping), it is no longer valid. However, these structures are all wrappers for pointers

2349

to the base data, so all operations on the data view affect the base data.

2350

2351

\include simple_subsets.c

2352

2353

All of these slicing routines are macros, because they generate several

2354

background variables in the current scope (something a function can't do). Traditional

2355

custom is to put macro names in all caps, like \c APOP_DATA_ROWS, which to modern

2356

sensibilities looks like yelling. The custom has a logic: there are ways to hang

2357

yourself with macros, so it is worth distinguishing them typographically.

2358

The documentation always uses a single capital.

2359

2360

Notice that all of the slicing macros return nothing, so there is nothing to do with one

2361

but put it on a line by itself. This limits the number of misuses.
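
For intuition about why a view is cheap and why writing to it changes the base data, here is a sketch of a column view over a row-major array. The \c column_view type and its helpers are hypothetical, written as ordinary functions for clarity; Apophenia's slicing macros produce GSL and \ref apop_data views rather than this struct.

```c
#include <stddef.h>

// A view owns no data: it is a pointer into the base array plus the
// bookkeeping needed to step from one element to the next.
typedef struct { double *data; size_t size, stride; } column_view;

column_view get_column(double *matrix, size_t rows, size_t cols, size_t col){
    return (column_view){ .data = matrix + col, .size = rows, .stride = cols };
}

double view_get(column_view v, size_t i){ return v.data[i*v.stride]; }

// Writes go straight through to the base array.
void view_set(column_view v, size_t i, double x){ v.data[i*v.stride] = x; }
```

Because the struct holds only a pointer, letting it go out of scope frees nothing, and using it after the base array is gone is an error.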

2362

2363

\par Basic manipulations

2364

2365

The outline page (which you can get to via a link next to the snowflake header at the top of every page on this site) lists a number of other manipulations of data sets, such as

2366

\ref apop_data_listwise_delete for quick-and-dirty removal of observations with <tt>NaN</tt>s,

2367

\ref apop_data_split / \ref apop_data_stack,

2368

or \ref apop_data_sort to sort all elements by a single column.

2369

2370

\par Apply and map

2371

2372

If you have an operation of the form <em>for each element of my data set, call this

2373

function</em>, then you can use \ref apop_map to do it. You could basically do everything you

2374

can do with an apply/map function via a \c for loop, but the apply/map approach is clearer

2375

and more fun. Also, if you set OpenMP's <tt>omp_set_num_threads(N)</tt> for any \c N

2376

greater than 1 (the default on most systems is the number of CPU cores), then the work

2377

of mapping will be split across multiple CPU threads. See the outline \> data sets \>

2378

map/apply section for a number of examples.
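
As a point of comparison, here is what the idea looks like stripped to its core in plain C. The \c map_in_place and \c square names are hypothetical, and the signature is deliberately simpler than that of \ref apop_map.

```c
#include <stddef.h>

// Replace each element of data with fn applied to that element.
void map_in_place(double *data, size_t n, double (*fn)(double)){
    for (size_t i = 0; i < n; i++)
        data[i] = fn(data[i]);
}

// A sample function to map.
double square(double x){ return x*x; }
```

Each element is handled independently of the others, which is the property that lets the work be divided across threads.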

2379

2380

\par Text

2381

2382

Text in C is annoying. C already treats strings as pointer-to-characters, so a grid

2383

of text data is a pointer-to-pointer-to-pointer-to-character. The text grid in the

2384

\ref apop_data structure actually takes this form, but functions are provided to do

2385

most or all the pointer work for you. The \ref apop_text_alloc function

2386

is really a realloc function: you can use it to resize the text grid as necessary. The

2387

\ref apop_text_add function will do the pointer work in copying a single string to the

2388

grid. Functions that act on entire data sets, like \ref apop_data_rm_rows, handle the

2389

text part as well.

2390

2391

You have <tt>your_data->textsize[0]</tt> rows and <tt>your_data->textsize[1]</tt> columns. If you are using only the functions to this point, then empty elements are a blank string (<tt>""</tt>), not \c NULL.

2392

For reading individual elements, refer to the \f$(i,j)\f$th text element via <tt>your_data->text[i][j]</tt>.
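
If you are curious what the pointer work looks like, here is a sketch of a text grid in standard C, following the same convention that empty elements are blank strings rather than \c NULL. The function names are hypothetical and this is not the library's implementation; in practice, use \ref apop_text_alloc and \ref apop_text_add.

```c
#include <stdlib.h>
#include <string.h>

// Free a rows x cols grid. Slots never allocated are expected to be NULL
// (as calloc leaves them on common platforms), and free(NULL) does nothing.
void text_grid_free(char ***grid, size_t rows, size_t cols){
    if (!grid) return;
    for (size_t i = 0; i < rows; i++){
        if (!grid[i]) continue;
        for (size_t j = 0; j < cols; j++) free(grid[i][j]);
        free(grid[i]);
    }
    free(grid);
}

// Allocate a grid in which every element is a blank string, not NULL.
char ***text_grid_alloc(size_t rows, size_t cols){
    char ***grid = calloc(rows, sizeof *grid);
    if (!grid) return NULL;
    for (size_t i = 0; i < rows; i++){
        grid[i] = calloc(cols, sizeof *grid[i]);
        if (!grid[i]) { text_grid_free(grid, rows, cols); return NULL; }
        for (size_t j = 0; j < cols; j++){
            grid[i][j] = calloc(1, 1); // a one-byte string: just the terminator
            if (!grid[i][j]) { text_grid_free(grid, rows, cols); return NULL; }
        }
    }
    return grid;
}

// Copy s into element (i, j), releasing whatever was there before.
// Returns 0 on success and -1 if the copy could not be allocated.
int text_grid_set(char ***grid, size_t i, size_t j, const char *s){
    char *copy = malloc(strlen(s) + 1);
    if (!copy) return -1;
    strcpy(copy, s);
    free(grid[i][j]);
    grid[i][j] = copy;
    return 0;
}
```

Reading is then the same as with the real structure: <tt>grid[i][j]</tt> is an ordinary C string.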

2393

2394

\par Errors

2395

2396

Many functions will set the <tt>error</tt> element of the \ref apop_data structure being operated on if anything goes wrong. You can use this to halt the program or take corrective action:

2397

2398

\code
apop_data *the_data = apop_query_to_data("select * from d");
if (the_data->error) exit(1);
\endcode

2402

2403

\par The whole structure

2404

2405

Here is a diagram of all of Apophenia's structures and how they

2406

relate. It is taken from this

2407

<a href="http://modelingwithdata.org/pdfs/cheatsheet.pdf">cheat sheet</a> (2 page PDF),

2408

which will be useful to you if only because it lists some of the functions that act on

2409

GSL vectors and matrices that are useful (in fact, essential) but out of the scope of the Apophenia documentation.

2410

2411

\image html http://apophenia.info/structs.png

2412

2413

2414

All of the elements of the \ref apop_data structure are laid out at middle-left. You have

2415

already met the vector, matrix, and weights, which are all a \c gsl_vector or \c gsl_matrix.

2416

2417

The diagram shows the \ref apop_name structure, which has received little mention so far because names

2418

basically take care of themselves. Just use \ref apop_data_add_names to add names to your data

2419

set and \ref apop_name_stack to copy from one data set to another.

2420

2421

The \ref apop_data structure has a \c more element, for when your data is best expressed

2422

in more than one page of data. Use \ref apop_data_add_page, \ref apop_data_rm_page,

2423

and \ref apop_data_get_page. Output routines will sometimes append an extra page of

2424

auxiliary information to a data set, such as pages named <tt>\<Covariance\></tt> or

2425

<tt>\<Factors\></tt>. The angle-brackets indicate a page that describes the data set

2426

but is not a part of it (so an MLE search would ignore that page, for example).

2427

2428

2429

Now let us move up the structure diagram to the \ref apop_model structure.

2430

2431

\section gentle_model apop_model

2432

2433

Even restricting ourselves to the most basic operations, there are a lot of things that

2434

we want to do with our models: estimating the parameters of a model (like the mean and

2435

variance of a Normal distribution) given data, or drawing random numbers, or showing the

2436

expected value, or showing the expected value of one part of the data given fixed values

2437

for the rest of it. The \ref apop_model is intended to encapsulate most of these desires

2438

into one object, so that models can easily be swapped around, modified to create new models,

2439

compared, and so on.

2440

2441

From the figure above, you can see that the \ref apop_model structure is pretty big,

2442

including a number of informational items, key being the \c parameters, \c data, and

2443

\c info elements, a list of settings to be discussed below, and a set of procedures for

2444

many operations. Its contents are not (entirely) arbitrary: the theoretical basis for

2445

what is and is not included in an \ref apop_model, as well as its overall intent, are

2446

described in this <a href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf">research

2447

report</a>.

2448

2449

There are helper functions that will allow you to avoid dealing with the model internals. For example, the \ref apop_estimate helper function means you never have to look at the model's \c estimate method (if it even has one); you simply pass the model to a function, as in the form used above:

2450

2451

\code
apop_model *est = apop_estimate(data, apop_ols);
\endcode

2454

2455

\li Apophenia ships with a broad set of models, like \ref apop_ols, \ref apop_dirichlet,

2456

\ref apop_loess, and \ref apop_pmf (probability mass function); see the full list on <a href="http://apophenia.info/group__models.html">the models documentation page</a>. You would estimate the

2457

parameters of any of them using the form above, with the appropriate model in the second

2458

slot of the \ref apop_estimate call.

2459

\li The models that ship with Apophenia, like \ref apop_ols, include the procedures and some metadata, but are of course not yet estimated using a data set (i.e., <tt>data == NULL</tt>, <tt>parameters == NULL</tt>). The line above generated a new

2460

model, \c est, which is identical to the base OLS model but has estimated parameters

2461

(and covariances, and basic hypothesis tests, a log likelihood, AIC_c, BIC, et cetera), and a \c data pointer to the \ref apop_data set used for estimation.

2462

\li You will mostly use the models by passing them as inputs to

2463

functions like \ref apop_estimate, \ref apop_draw, or \ref apop_predict; more examples below.

2464

Other than \ref apop_estimate, most require a parameterized model like \c est. After all, it doesn't make sense to

2465

draw from a Normal distribution until you've specified its mean and standard deviation.

2466

\li If you know what the parameters should be, use \ref apop_model_set_parameters. E.g.

2467

2468

\code
apop_model *std_normal = apop_model_set_parameters(apop_normal, 0, 1);
apop_data *a_thousand_normals = apop_model_draws(std_normal, 1000);

apop_model *poisson = apop_model_set_parameters(apop_poisson, 1.5);
apop_data *a_thousand_waits = apop_model_draws(poisson, 1000);
\endcode

2475

2476

\li You can use \ref apop_model_print to print the various elements to screen.

2477

\li You can combine and transform models with functions such as \ref apop_model_fix_params, \ref apop_model_coordinate_transform, or \ref apop_model_mixture. Each of these functions produces a new model, which can be estimated, re-combined, or otherwise used like any other model.

2478

\li Writing your own models won't be covered in this introduction, but it can be pretty easy to

2479

copy and modify the procedures of an existing model to fit your needs. When in doubt, delete a procedure, because any procedures that are missing will have

2480

defaults filled when used by functions like \ref apop_estimate (which uses \ref

2481

apop_maximum_likelihood) or \ref apop_cdf (which uses integration via random draws). See \ref modeldetails for details.

2482

\li There's a simple rule of thumb for remembering the order of the arguments to most of

2483

Apophenia's functions, including \ref apop_estimate : the data always comes first.

2484

2485

\par Settings

2486

2487

Every model, and every method one would apply to a model, is likely to have a list of

2488

settings: how many bins in the histogram, at what tolerance does the maximum likelihood

2489

search end, what are the models being combined in the mixture?

2490

2491

Apophenia organizes settings in <em>settings groups</em>, which are then attached

2492

to models. In the following snippet, we specify a Beta distribution prior. If the

2493

likelihood function were a Binomial distribution, \ref apop_update knows the closed-form

2494

posterior for a Beta-Binomial pair, but in this case it will have to run Markov chain

2495

Monte Carlo. Attach a \ref apop_mcmc_settings group to the prior to specify details

2496

of how the run should work.

2497

2498

For a likelihood, we generate an empirical distribution---a PMF---from an input

2499

data set; you will often use the \ref apop_pmf to turn a data set into a distribution.

2500

When we call \ref apop_update on the last line, it already has all of the above info

2501

on hand.

2502

2503

\code
apop_model *beta = apop_model_set_parameters(apop_beta, 0.5, 0.25);
Apop_settings_add_group(beta, apop_mcmc, .burnin = 0.2, .periods = 1e5);
apop_model *my_pmf = apop_estimate(your_data, apop_pmf);
apop_model *posterior = apop_update(.prior = beta, .likelihood = my_pmf);
\endcode

2509

2510

You will encounter model settings often when doing nontrivial work with models. All

2511

can be set using a form like above. See the <a href="http://apophenia.info/group__settings.html">settings documentation page</a>

2512

for the full list of options.

2513

There is a full discussion of using and writing settings groups in the <a href="http://apophenia.info/outline.html">outline page</a> under the Models heading.

2514

2515

2516

\par Databases and models

2517

2518

Returning to the introductory example, you saw that (1) the

2519

library expects you to keep your data in a database, pulling out the

2520

data as needed, and (2) that the workflow is built around

2521

\ref apop_model structures.

2522

2523

Starting with (2),

2524

if a stats package has something called a <em>model</em>, then it is

2525

probably of the form Y = [an additive function of <b>X</b>], such as \f$y = x_1 +

2526

\log(x_2) + x_3^2\f$. Trying new models means trying different

2527

functional forms for the right-hand side, such as including \f$x_1\f$ in

2528

some cases and excluding it in others. In contrast, Apophenia is designed

2529

to facilitate trying new models in the broader sense of switching out a

2530

linear model for a hierarchical, or a Bayesian model for a simulation.

2531

A formula syntax makes little sense over such a broad range of models.

2532

2533

As a result, the right-hand side is not part of

2534

the \ref apop_model. Instead, the data is assumed to be correctly formatted, scaled, or logged

2535

before being passed to the model. This is where part (1), the database,

2536

comes in, because it provides a proxy for the sort of formula specification language above:

2537

\code
apop_data *testme = apop_query_to_data("select y, x1, log(x2), pow(x3, 2) from data");
apop_model *est = apop_estimate(testme, apop_ols);
\endcode

2541

2542

Generating factors and dummies is also considered data prep, not model

2543

internals. See \ref apop_data_to_dummies and \ref apop_data_to_factors.

2544

2545

Now that you have \c est, an estimated model, you can interrogate it. This is really where Apophenia and its encapsulated

2546

model objects shine, because you can do more than just admire the parameter estimates on

2547

the screen: you can take your estimated model and fill in or generate new data, use it

2548

as an input to the parent distribution of a hierarchical model, et cetera. Some simple

2549

examples:

2550

2551

\code
//If you have a new data set with missing elements (represented by NaN), you can fill in predicted values:
apop_predict(new_data_set, est);
apop_data_show(new_data_set);

//Fill a matrix with random draws.
apop_data *d = apop_model_draws(est, .count=1000);

//How does the AIC_c for this model compare to that of est2?
printf("ΔAIC_c=%g\n", apop_data_get(est->info, .rowname="AIC_c")
                     - apop_data_get(est2->info, .rowname="AIC_c"));
\endcode

2563

2564

\par Testing

2565

2566

Here is the model for all testing within Apophenia:

2567

2568

\li Calculate a statistic.

2569

\li Describe the distribution of that statistic.

2570

\li Work out how much of the distribution is (above|below|closer to zero than) the statistic.

2571

2572

There are a handful of named tests that produce a known statistic and then compare to a

2573

known distribution, like \ref apop_test_kolmogorov or \ref apop_test_fisher_exact. For

2574

traditional distributions (Normal, \f$t\f$, \f$\chi^2\f$), use the \ref apop_test convenience

2575

function.

2576

2577

But if your model is not from the textbook, then you have the tools to apply the

2578

above three-step process directly. First I'll give an overview of the three steps,

2579

then another working example.

2580

2581

\li Model parameters are a statistic, and you know that <tt>apop_estimate(your_data,

2582

your_model)</tt> will output a model with a <tt>parameters</tt> element.

2583

\li The distribution of a parameter is also a model, so

2584

\ref apop_parameter_model will also return an \ref apop_model.

2585

\li \ref apop_cdf takes in a model and a data point, and returns the share of the distribution at or below that data point.

2587

2588

Defaults for the parameter models are filled in via bootstrapping or resampling, meaning

2589

that if your model's parameters are decidedly off the Normal path, you can still test

2590

claims about the parameters.
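
The three steps can be sketched without any library at all. The following is a minimal, self-contained illustration and not Apophenia's implementation: the statistic is a sample mean, its distribution is approximated by resampling the data with replacement, and the CDF is the share of resampled means at or below a given point. All names (\c toy_mean, \c toy_bootstrap_means, \c toy_ecdf) are hypothetical, and the standard \c rand() generator stands in for a real RNG.

```c
#include <stdlib.h>

// Step 1: the statistic. Here, a plain sample mean.
double toy_mean(const double *x, int n){
    double sum = 0;
    for (int i = 0; i < n; i++) sum += x[i];
    return sum / n;
}

// Step 2: describe the statistic's distribution by resampling.
// Fills out[0..draws-1] with means of resamples drawn with replacement.
// Returns 0 on success, -1 if the scratch space can't be allocated.
int toy_bootstrap_means(const double *x, int n, double *out, int draws, unsigned seed){
    double *resample = malloc(sizeof(double) * n);
    if (!resample) return -1;
    srand(seed);
    for (int d = 0; d < draws; d++){
        for (int i = 0; i < n; i++) resample[i] = x[rand() % n];
        out[d] = toy_mean(resample, n);
    }
    free(resample);
    return 0;
}

// Step 3: how much of the distribution is at or below a given point?
double toy_ecdf(const double *draws, int ct, double point){
    int below = 0;
    for (int i = 0; i < ct; i++) below += (draws[i] <= point);
    return (double) below / ct;
}
```

If every observation is above zero, then <tt>toy_ecdf(means, draws, 0)</tt> is zero: no resampled mean falls at or below the null value, which is the resampling analogue of rejecting a zero mean.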

2591

2592

The introductory example ran a standard OLS regression, whose output includes some

2593

standard hypothesis tests; to conclude, let us go the long way and replicate those results

2594

via the general \ref apop_parameter_model mechanism. The results here will of course be

2595

identical, but the more general mechanism can be used in situations where the standard

2596

models don't apply.

2597

2598

The first part of this program is identical to the program above. The second

2599

half executes the three steps, using many of the above tricks: one of the inputs to

2600

\ref apop_parameter_model (which row of the parameter set to use) is sent by adding a

2601

settings group, we pull that row into a separate data set using \ref Apop_r, and we

2602

set its vector value by referring to it as the -1st element.

2603

2604

\include ols2.c

2605

2606

Note that the procedure did not assume the model parameters had a certain form. It

2607

queried the model for the distribution of parameter \c x_1, and if the model didn't have

2608

a closed-form answer then a distribution via bootstrap would be provided. Then that model

2609

was queried for its CDF. [The procedure does assume a symmetric distribution. Fixing this

2610

is left as an exercise for the reader.] For a model like OLS, this is entirely overkill,

2611

which is why OLS provides the basic hypothesis tests automatically. But for models

2612

where the distribution of parameters is unknown or has no closed-form solution, this

2613

may be the only recourse.
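
As a hedged aside on the symmetry assumption: one common way to turn a CDF value into a two-sided p-value doubles the smaller tail, which is only valid when the distribution is symmetric about the null value. The function below is a hypothetical sketch of that conversion and is not taken from the example program.

```c
// Two-sided p-value from the CDF evaluated at the observed statistic,
// assuming a distribution symmetric about the null value.
double toy_two_sided_p(double cdf_at_stat){
    double tail = cdf_at_stat < 0.5 ? cdf_at_stat : 1 - cdf_at_stat;
    return 2 * tail;
}
```

For an asymmetric distribution, the two tails would have to be evaluated separately rather than doubled.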

2614

2615

2616

This introduction has shown you the \ref apop_data set and some of the functions

2617

associated, which might be useful even if you aren't formally doing statistical work but do have to deal with data with real-world elements like column names and mixed

2618

numeric/text values. You've seen how Apophenia encapsulates as many of a model's

2619

characteristics as possible into a single \ref apop_model object, which you can send with

2620

data to functions like \ref apop_estimate, \ref apop_predict, or \ref apop_draw. Once

2621

you've got your data in the right form, you can use this to simply estimate model

2622

parameters, or as an input to later analysis.

2623

*/

2624

2625

/** \page modeldetails Writing new models

2626

2627

The \ref apop_model is intended to provide a consistent expression of <em>any</em>

2628

model that (implicitly or explicitly) expresses a likelihood of data given parameters,

2629

including traditional linear models, textbook distributions, Bayesian hierarchies,

2630

microsimulations, and any combination of the above. The unifying feature is that

2631

all of the models act over some data space and some parameter space (in some cases

2632

one or both is the empty set), and can assign a likelihood for a fixed pair of

2633

parameters and data given the model. This is a very broad requirement, often used

2634

in the statistical literature. For discussion of the theoretical structures, see <a

2635

href="http://www.census.gov/srd/papers/pdf/rrs2014-06.pdf"><em>A Useful Algebraic System

2636

of Statistical Models</em></a> (PDF).

2637

2638

This page includes:

2639

2640

\li \ref write_likelihoods, giving a quick overview of how to write a new model from scratch.

2641

\li \ref settingswriting, covering the writing of <em>ad hoc</em> structures to hold model- or method-specific details, like the number of periods for burning in an MCMC run or the number of bins in a histogram.

2642

\li \ref vtables, covering the means of writing special-case routines for functions that are not part of the \ref apop_model itself, including the score.

2643

\li \ref modeldataparts, a detailed list of the requirements for the data (non-function) elements of an \ref apop_model.

2644

\li \ref methodsection, a detailed list of requirements for the method (function) elements of an \ref apop_model.

2645

2646

\section write_likelihoods A walkthrough

2647

2648

Users are encouraged to always use models via the helper functions, like

2649

\ref apop_estimate or \ref apop_cdf. The helper functions do some boilerplate error

2650

checking, and are where the defaults are called: if your model has a \c log_likelihood

2651

method but no \c p method, then \ref apop_p will use exp(\c log_likelihood). If you don't

2652

give an \c estimate method, then \c apop_estimate will call \ref apop_maximum_likelihood.

2653

2654

So the game in writing a new model is to write just enough internal methods to give the helper functions what they need.

2655

In the not-uncommon best case, all you need to do is write a log likelihood function.
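
The dispatch logic of a helper function can be sketched in plain C. This is a toy with hypothetical names (\c toy_model, \c toy_estimate), not Apophenia's code: the helper uses the model's own \c estimate method when there is one, and otherwise falls back on a default that maximizes the \c log_likelihood, here by a crude grid search standing in for a real optimizer.

```c
// A toy model with one scalar parameter and two optional methods.
typedef struct toy_model {
    double parameter;
    double (*log_likelihood)(const double *x, int n, const struct toy_model *m);
    void (*estimate)(const double *x, int n, struct toy_model *m);
} toy_model;

// Default estimator: try a grid of candidate parameters over [-10, 10]
// and keep the one with the highest log likelihood.
static void toy_default_estimate(const double *x, int n, toy_model *m){
    double best = 0, best_ll = 0;
    for (int i = 0; i <= 2000; i++){
        m->parameter = -10 + i * 0.01;
        double ll = m->log_likelihood(x, n, m);
        if (i == 0 || ll > best_ll) { best = m->parameter; best_ll = ll; }
    }
    m->parameter = best;
}

// The helper: returns 0 on success, -1 if the model can't be estimated.
int toy_estimate(const double *x, int n, toy_model *m){
    if (m->estimate) { m->estimate(x, n, m); return 0; }
    if (!m->log_likelihood) return -1;
    toy_default_estimate(x, n, m);
    return 0;
}

// Log likelihood of a Normal(mu, 1) model, dropping the additive constant.
double toy_normal_ll(const double *x, int n, const toy_model *m){
    double ll = 0;
    for (int i = 0; i < n; i++)
        ll -= (x[i] - m->parameter) * (x[i] - m->parameter) / 2;
    return ll;
}
```

A model declared as <tt>toy_model m = {.log_likelihood = toy_normal_ll};</tt> has no \c estimate method, so \c toy_estimate takes the fallback path; this mirrors the case where writing a log likelihood is all that is needed.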

2656

2657

Here is how one would set up a model that could be estimated using maximum likelihood:

2658

2659

\li Write a log likelihood function. Its header will look like this:

2660

2661

\code
long double new_log_likelihood(apop_data *data, apop_model *m);
\endcode

2664

2665

where \c data is the input data, and \c

2666

m is the parametrized model (i.e. your model with a \c parameters element set by the caller).

2667

This function will return the value of the log likelihood function at the given parameters.

2668

2669

\li Is this a constrained optimization? See the outline page under maximum likelihood methods \f$->\f$ Setting constraints on how to set them. Otherwise, no constraints will be assumed.

2670

\li Write the object:

2671

2672

\code
apop_model *your_new_model = &(apop_model){"The Me distribution",
            .vsize=n0, .msize1=n1, .msize2=n2, .dsize=nd,
            .log_likelihood = new_log_likelihood };
\endcode

2677

2678

\li The first element is the human-language name for your model.

2679

\li The \c vsize, \c msize1, and \c msize2 elements specify the shape of the parameter set. For example, if it's three numbers in the vector, then set <tt>.vsize=3</tt> and omit the matrix sizes. The default model prep routine will call

2680

<tt>new_est->parameters = apop_data_alloc(vsize, msize1, msize2)</tt>.

2681

\li The \c dsize element is the size of one random draw from your model.

2682

\li It's common to have [the number of columns in your data set] parameters; this

2683

count will be filled in if you specify \c -1 for \c vsize, <tt>msize(1|2)</tt>, or

2684

<tt>dsize</tt>. If the allocation is exceptional in a different way, then you will

2685

need to allocate parameters by writing a custom \c prep method for the model.

2686

\li If there are constraints, add an element for those too.

2687

2688

You already have more than enough that something like this will work (the \c dsize is used for random draws):

2689

\code
apop_model *estimated = apop_estimate(your_data, your_new_model);
\endcode

2692

2693

Once that baseline works, you can fill in other elements of the \ref apop_model as needed.

2694

2695

For example, if you are using a maximum likelihood method to estimate parameters, you can get much faster estimates and better covariance estimates by specifying the dlog likelihood function (aka the score):

2696

2697

\code
void apop_new_dlog_likelihood(apop_data *d, gsl_vector *gradient, apop_model *m){
    //some algebra here to find df/dp0, df/dp1, df/dp2....
    //d_0 and d_1 stand in for the derivatives that the algebra produces.
    gsl_vector_set(gradient, 0, d_0);
    gsl_vector_set(gradient, 1, d_1);
}
\endcode

2704

The score has to be registered (see below) using

2705

\code
apop_score_insert(apop_new_dlog_likelihood, your_new_model);
\endcode

2708

2709

\section settingswriting Writing new settings groups

2710

2711

Your model may need additional settings or auxiliary information to function, which would require associating a model-specific struct with the model.

2712

2713

Before getting into the detail of how to make model-specific groups of settings work, note that there's a lightweight method of storing sundry settings, so in many cases you can bypass all of the following.

2714

2715

The \ref apop_model structure has a \c void pointer named \c more which you can use to

2716

point to a model-specific struct. If \c more_size is larger than zero (i.e. you set

2717

it to <tt>your_model.more_size=sizeof(your_struct)</tt>), then it will be copied via \c

2718

memcpy by \ref apop_model_copy, and freed by \ref apop_model_free. Apophenia's

2719

estimation routines will never impinge on this item, so do what you wish with it.
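
To make the copying rule concrete, here is a minimal sketch of what a sized, \c memcpy-based copy implies. The types and functions are hypothetical stand-ins, not the library's own. The copy is bytewise, so it suits structs of plain values; any pointer inside the attached struct would be shared between the original and the copy. What the toy does when \c more_size is zero (sharing the pointer and never freeing it) is an assumption made for the sake of the illustration.

```c
#include <stdlib.h>
#include <string.h>

// Toy stand-in for a model with a lightweight attachment.
typedef struct {
    void *more;        // points to a model-specific struct
    size_t more_size;  // zero means "don't copy or free more"
} toy_model;

// Copy the model; if more_size is set, duplicate the attachment bytewise.
// Returns NULL on allocation failure.
toy_model *toy_model_copy(const toy_model *in){
    toy_model *out = malloc(sizeof(toy_model));
    if (!out) return NULL;
    *out = *in;
    if (in->more_size && in->more){
        out->more = malloc(in->more_size);
        if (!out->more) { free(out); return NULL; }
        memcpy(out->more, in->more, in->more_size);
    }
    return out;
}

// Free the model, and the attachment only if the model has a sized copy.
void toy_model_free(toy_model *m){
    if (!m) return;
    if (m->more_size) free(m->more);
    free(m);
}

// An example attachment: two plain numbers, so a bytewise copy is enough.
typedef struct { int bins; double tolerance; } toy_extras;
```

After <tt>toy_model_copy</tt>, changes to the copy's \c toy_extras leave the original untouched, which is the behavior one wants from a per-model scratch struct.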

2720

2721

The remainder of this subsection describes the information you'll have to provide to make

2722

use of the conveniences described to this point: initialization of defaults, smarter

2723

copying and freeing, and adding to an arbitrarily long list of settings groups attached

2724

to a model. You will need four items: a typedef for the structure itself, plus init, copy, and

2725

free functions. This is the sort of boilerplate that will be familiar to users of

2726

object oriented languages in the style of C++ or Java, but it's really a list of

2727

arbitrarily-typed elements, which makes this feel more like LISP. [And being a

2728

reimplementation of an existing feature of LISP, this section will be macro-heavy.]

2729

2730

\li The settings struct will likely go into a header file, so

2731

here is a sample header for a new settings group named \c ysg_settings, with a dataset, its two sizes, and an owner-of-data marker. <tt>ysg</tt> stands for Your Settings Group; replace that substring with your preferred name in every instance to follow.

2732

2733

\code
typedef struct {
    int size1, size2;
    int *refs;          //reference count, shared by all copies of the group
    apop_data *dataset;
} ysg_settings;

Apop_settings_declarations(ysg)
\endcode

2742

2743

The first item is a familiar structure definition. The last line is a macro that declares the

2744

three functions below. This is everything you would

2745

need in a header file, should you need one. These are just declarations; we'll write

2746

the actual init/copy/free functions below.

2747

2748

The structure itself gets the full name, \c ysg_settings. Everything else is a macro, and so you need only specify \c ysg, and the \c _settings part is filled in. Because of these macros, your \c struct name must end in \c _settings.

2749

2750

If you have an especially simple structure, then you can generate the three functions with these three macros in your <tt>.c</tt> file:

2751

2752

\code
Apop_settings_init(ysg, )
Apop_settings_copy(ysg, )
Apop_settings_free(ysg, )
\endcode

2757

2758

These macros generate appropriate functions to do what you'd expect: allocating the

2759

main structure, copying one struct to another, freeing the main structure.

2760

The spaces after the commas indicate that no special code gets added to

2761

the functions that these macros generate.

2762

2763

You'll never call these functions directly; they are called by \ref Apop_settings_add_group,

2764

\ref apop_model_free, and other model or settings-group handling functions.

2765

2766

Now that initializing/copying/freeing of

2767

the structure itself is handled, the remainder of this section will be about how to

2768

add instructions for the structure internals, like data that is pointed to by the structure elements.

2769

2770

\li For the init function, use the above form if every element of your structure should default to zero/\c NULL.

2771

In most cases, though, you will need a new line declaring a default for every element in your structure. There is a macro to help with this too.

2772

These macros will define for your use a structure named \c in, and an output pointer-to-struct named \c out.

2773

Continuing the above example:

2774

2775

\code
Apop_settings_init (ysg,
    Apop_assert(in.size1, "I need you to give me a value for size1. Stopping.");
    Apop_varad_set(size2, 10);
    Apop_varad_set(dataset, apop_data_alloc(out->size1, out->size2));
    Apop_varad_set(refs, malloc(sizeof(int)));
    *out->refs = 1;
)
\endcode

2784

2785

Now, <tt>Apop_settings_add_group(a_model, ysg, .size1=100)</tt> would set up a group with a 100-by-10 data set, and set the reference count to one.

2786

2787

\li Some functions do extensive internal copying, so you will need a copy function even if you don't do any explicit calls to \ref apop_model_copy. The default above simply copies every element in the structure. Pointers are copied, giving you two pointers pointing to the same data. We have to be careful to prevent double-freeing later.

2788

2789

\code
//Every element of the source struct is copied first; then we make one additional modification.
//The refs pointer was copied too, so out->refs and in->refs point to the same counter.
Apop_settings_copy (ysg,
    (*out->refs)++;
)
\endcode

2795

2796

\li The struct itself is freed by boilerplate code, but add code in the free function

2797

to free data pointed to by pointers in the main structure. The macro defines a

2798

pointer-to-struct named \c in for your use. Continuing the example:

2799

2800

\code
Apop_settings_free (ysg,
    if (!(--*in->refs)) {
        apop_data_free(in->dataset);
        free(in->refs);
    }
)
\endcode

2808

2809

With those three macros in place and the header as above, Apophenia will treat your

2810

settings group like any other, and users can use \ref Apop_settings_add_group to

2811

populate it and attach it to any model.
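
The reference-counting pattern behind the \c refs element can be seen in isolation in the following plain-C sketch. It only loosely mimics what the three macros expand to; the \c toy_ names are hypothetical, and a heap-allocated array of doubles stands in for the \ref apop_data set. Copies share the data and the counter, and the data is released when the last copy is freed.

```c
#include <stdlib.h>

typedef struct {
    int size;
    int *refs;      // shared count of settings groups pointing at data
    double *data;   // shared payload
} toy_settings;

// Allocate a group that owns fresh data; the count starts at one.
toy_settings *toy_settings_init(int size){
    toy_settings *out = malloc(sizeof(toy_settings));
    if (!out) return NULL;
    out->size = size;
    out->data = calloc(size, sizeof(double));
    out->refs = malloc(sizeof(int));
    if (!out->data || !out->refs){
        free(out->data); free(out->refs); free(out);
        return NULL;
    }
    *out->refs = 1;
    return out;
}

// Shallow copy: pointers are shared, so bump the shared count.
toy_settings *toy_settings_copy(const toy_settings *in){
    toy_settings *out = malloc(sizeof(toy_settings));
    if (!out) return NULL;
    *out = *in;
    (*out->refs)++;
    return out;
}

// Free one group; release the shared data only when no copies remain.
// Returns the number of groups still holding the data.
int toy_settings_free(toy_settings *in){
    int remaining = --*in->refs;
    if (!remaining){
        free(in->data);
        free(in->refs);
    }
    free(in);
    return remaining;
}
```

Freeing the original while a copy is still alive leaves the shared data intact, which is the double-free protection that the counter exists to provide.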

2812

2813

\section vtables Registering new methods in vtables

2814

2815

For any given function (e.g., entropy, the dlog likelihood, Bayesian updating), there is

2816

probably a special case for well-known models like the Normal distribution. Rather than

2817

adding every procedure that could have a special-case calculation to the \c apop_model struct,

2818

functions may maintain a registry of models and associated special-case procedures.

2819

2820

This subsection will discuss how to add

2821

a function to an existing vtable.

2822

2823

\li See \ref apop_update, \ref apop_score, \ref apop_predict, \ref apop_model_print, and \ref

2824

apop_parameter_model for examples and procedure-specific details.

2825

\li Write a function following the given type definition.

2826

\li Use the associated <tt>_vtable_add</tt> function to add the function and associate it

2827

with the given model. For example, to add a Beta-binomial routine named \c betabinom

2828

to the registry of Bayesian updating routines, use <tt>apop_update_vtable_add(betabinom,

2829

apop_beta, apop_binomial)</tt>.

2830

\li Lookups happen based on a hash that takes into account the elements of the model

2831

that will be used in the calculation. For example, the \c apop_update_hash takes in two

2832

models and calculates the hash based on the address of the prior's \c draw method and

2833

the likelihood's \c log_likelihood or \c p method. Thus, a vtable lookup for new models

2834

that re-use the same methods (at the same addresses in memory) will still find the

2835

same special-case function.

2836

\li If you need to deregister the function, use the associated deregister function,

2837

e.g. <tt>apop_update_vtable_drop(apop_beta, apop_binomial)</tt>. You can guarantee that a method will not be re-added by following up the <tt>_drop</tt> with, e.g., <tt>apop_update_vtable_add(NULL, apop_beta, apop_binomial)</tt>.

2838

\li Calls to <tt>..._vtable_add</tt> are typically placed in the \c prep method of the given model, thus ensuring that the auxiliary functions are registered after the first time the model is sent to \ref apop_estimate.
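
The registry-and-lookup idea in the list above can be sketched as a small table keyed on method addresses. This is a hypothetical toy, not the contents of vtables.h: it compares function pointers directly instead of hashing them, uses a fixed-size table, and all \c toy_ names are invented for illustration.

```c
#include <stddef.h>

typedef void (*toy_fn)(void);   // generic function pointer type

// One registry entry: the two method addresses that form the key,
// plus the special-case routine registered for that pair.
typedef struct { toy_fn prior_draw, like_ll, special; int used; } toy_entry;

#define TOY_VTABLE_MAX 16
static toy_entry toy_vtable[TOY_VTABLE_MAX];

static toy_entry *toy_find(toy_fn prior_draw, toy_fn like_ll){
    for (int i = 0; i < TOY_VTABLE_MAX; i++)
        if (toy_vtable[i].used && toy_vtable[i].prior_draw == prior_draw
                               && toy_vtable[i].like_ll == like_ll)
            return &toy_vtable[i];
    return NULL;
}

// Register (or overwrite) the routine for a pair of methods.
// Returns 0 on success, -1 if the table is full.
int toy_vtable_add(toy_fn special, toy_fn prior_draw, toy_fn like_ll){
    toy_entry *e = toy_find(prior_draw, like_ll);
    for (int i = 0; !e && i < TOY_VTABLE_MAX; i++)
        if (!toy_vtable[i].used) e = &toy_vtable[i];
    if (!e) return -1;
    *e = (toy_entry){prior_draw, like_ll, special, 1};
    return 0;
}

// Remove the entry for a pair, if any.
void toy_vtable_drop(toy_fn prior_draw, toy_fn like_ll){
    toy_entry *e = toy_find(prior_draw, like_ll);
    if (e) e->used = 0;
}

// Look up the special-case routine; NULL means "use the default".
toy_fn toy_vtable_get(toy_fn prior_draw, toy_fn like_ll){
    toy_entry *e = toy_find(prior_draw, like_ll);
    return e ? e->special : NULL;
}

// Stand-ins for two models' methods and one special-case routine.
void toy_beta_draw(void){}
void toy_binomial_ll(void){}
void toy_betabinom_update(void){}
```

Because the key is built from method addresses rather than model names, a renamed model that re-uses \c toy_beta_draw and \c toy_binomial_ll would find the same registered routine.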

2839

2840

This overview will not go into detail about setting up a new vtable. Briefly:

2841

2842

\li See the existing setups in vtables.h.

2843

\li Cut/paste one and do a search and replace to change the name to match your desired use.

2844

\li Set the typedef to describe the functions that get added to the vtable.

2845

\li Rewrite the hash function to check the part of the inputs that interest you. For

2846

example, the update vtable associates functions with the \c draw, \c log_likelihood,

2847

and \c p methods of the model. A model where these elements are identical but the name

2848

is changed will still match.

2849

2850

\section modeldataparts The data elements

2851

2852

The remainder of this page covers the detailed expectations regarding the elements

2853

of the \ref apop_model structure. I begin with the data (non-function) elements,

2854

and then cover the method (function) elements. Some of the following will be

2855

requirements for all models and some will be advice to authors; I use the accepted

2856

definitions of <a href="http://tools.ietf.org/html/rfc2119">"must", "shall", "may"</a>

2857

and related words.

2858

2859

\subsection datasubsec data

2860

2861

\li Each row should be a single observation.

2862

For example, \ref apop_bootstrap_cov depends on each row being an iid observation to function correctly.

2863

Calculating the Bayesian Information Criterion (BIC) requires knowing

2864

the number of observations in the data, and assumes that row count==observation count.

2865

For complex data, the \ref apop_data_pack and \ref apop_data_unpack functions can help with this.

2866

2867

\li Some functions (bootstrap again, or many uses of \ref apop_kl_divergence) use \ref

2868

apop_draw to use your model's RNG (or a default) to draw a

2869

value, write it to the matrix element of the data set, and then move on to an

2870

estimation or other step. In this case, the data sent in will be entirely in the \c

2871

->matrix element of the \ref apop_data set sent to model methods. Your \c likelihood, \c p, \c cdf, and \c estimate routines

2872

must accept data as a single row of the matrix of the \ref apop_data set for such functions to work.

2873

2874

\li Your routines may accept other data formats, as per contract with the user.

2875

For example, regression-type functions use a function named \c ols_shuffle

2876

to convert a matrix where the first column is the dependent variable to a data

2877

set with dependent variable in the vector and a column of ones in the first

2878

matrix column. By checking for a vector, the prep function knows whether to do

2879

the shuffling or not. Most univariate distributions take each scalar element as

2880

a separate data point; having one data point per row is a special case.

2881

2882

\subsection paramsubsec Parameters, vsize, msize1, msize2

2883

2884

\li The sizes will be used by the \c prep method of the model; see below. Given the model \c m and its elements \c m.vsize, \c m.msize1, \c m.msize2,

2885

functions that need to allocate a parameter set will do so via <tt>apop_data_alloc(m.vsize, m.msize1, m.msize2)</tt>.

2886

2887

\li As a special case, if you set any of \c .vsize, \c .msize1, or \c .msize2

2888

to \c -1, then the default prep method will set that size to the number of columns in

2889

the input data. This is what you want for regression methods, where there is one parameter per independent variable.
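
The <tt>-1</tt> convention is simple enough to sketch in a few lines. The struct and functions below are hypothetical illustrations of the rule as described, not the default prep method itself.

```c
// Resolve one size element: -1 means "use the number of data columns".
int toy_resolve_size(int requested, int data_columns){
    return requested == -1 ? data_columns : requested;
}

// Toy stand-in for the size elements of a model.
typedef struct { int vsize, msize1, msize2, dsize; } toy_sizes;

// Apply the rule to every size element, as a default prep step might.
toy_sizes toy_prep_sizes(toy_sizes in, int data_columns){
    return (toy_sizes){
        .vsize  = toy_resolve_size(in.vsize,  data_columns),
        .msize1 = toy_resolve_size(in.msize1, data_columns),
        .msize2 = toy_resolve_size(in.msize2, data_columns),
        .dsize  = toy_resolve_size(in.dsize,  data_columns),
    };
}
```

A regression-style model would declare <tt>.vsize = -1</tt> and receive one parameter slot per column of whatever data set it is handed.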

2890

2891

2892

\subsection infosubsec Info

2893

2894

\li The first page, named <tt>\<Info\></tt>, is typically a list of scalars. Nothing is guaranteed, but the elements may include:

2895

2896

\li AIC: <a href="https://en.wikipedia.org/wiki/Akaike's_Information_Criterion">Akaike Information Criterion</a>

2897

\li AIC_c: AIC with a finite sample correction. "<b>Generally, we advocate the use of AIC_c when the ratio \f$n/K\f$ is small (say \f$< 40\f$)</b>" [Kenneth P. Burnham, David R. Anderson: <em>Model Selection and Multi-Model Inference</em>, p 66, emphasis in original.]

2898

\li BIC: <a href="https://en.wikipedia.org/wiki/Bayesian_information_criterion">Bayesian Information Criterion</a>

2899

\li R squared

2900

\li R squared adj

2901

\li log likelihood

2902

\li status.
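
For reference, the standard textbook forms of the three information criteria are given below, where \f$K\f$ is the number of estimated parameters, \f$n\f$ the number of observations, and \f$\hat L\f$ the maximized likelihood. That a given model's info page uses exactly these forms is an assumption; check the model's documentation if the distinction matters.

\f[
\mathrm{AIC} = 2K - 2\ln\hat L, \qquad
\mathrm{AIC}_c = \mathrm{AIC} + \frac{2K(K+1)}{n-K-1}, \qquad
\mathrm{BIC} = K\ln n - 2\ln\hat L
\f]

The correction term in \f$\mathrm{AIC}_c\f$ vanishes as \f$n\f$ grows relative to \f$K\f$, which is why the correction matters mostly when \f$n/K\f$ is small.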

2903

2904

For those elements that require a count of input data, the calculations assume each row in the input \ref apop_data set is a single datum.

2905

2906

Get these via, e.g., <tt>apop_data_get(your_model->info, .rowname="log likelihood")</tt>.

2907

When writing a function that may be handed an arbitrary model, be prepared to handle \c NaN, which indicates that the element is not calculated or saved in the info page by the given model.
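
Defensive handling of a missing element might look like the following sketch. The \c toy_info_row table and both functions are hypothetical stand-ins for a page of named scalars; the point is only that a lookup can yield \c NaN and the caller should test for it before doing arithmetic.

```c
#include <math.h>
#include <string.h>

// Toy stand-in for a page of named scalars.
typedef struct { const char *rowname; double value; } toy_info_row;

// Return the value for the named row, or NaN if the row is absent.
double toy_info_get(const toy_info_row *info, int rows, const char *rowname){
    for (int i = 0; i < rows; i++)
        if (!strcmp(info[i].rowname, rowname)) return info[i].value;
    return NAN;
}

// Difference in a criterion between two models. Sets *ok to 0 and
// returns NaN if either model did not report the criterion.
double toy_info_diff(const toy_info_row *a, int a_rows,
                     const toy_info_row *b, int b_rows,
                     const char *rowname, int *ok){
    double va = toy_info_get(a, a_rows, rowname);
    double vb = toy_info_get(b, b_rows, rowname);
    *ok = !isnan(va) && !isnan(vb);
    return *ok ? va - vb : NAN;
}
```

Checking the \c ok flag before printing or comparing keeps a missing criterion from silently propagating through later calculations.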

2908

2909

\li Several routines will include a \c predict table. The table has these columns:

2910

\li row (optional)

2911

\li col (optional)

2912

\li observed

2913

\li predicted

2914

\li residual

2915

2916

For OLS-type estimations, each row corresponds to the row in the original data. For

2917

filling in of missing data, the elements may appear anywhere, so the row/col indices are

2918

essential.

2919

2920

\subsection settingsgroupmention settings, more

2921

2922

In object-oriented jargon, settings groups are the private elements of the model,

2923

to be pulled out in certain contexts, and ignored in all others. Therefore, there are

2924

no rules about internal use. The \c more element of the \ref apop_model provides a lightweight

2925

means of attaching an arbitrary struct to a model. See \ref settingswriting above for details.

2926

2927

\li As many settings groups of different types as desired can be added to a single \ref apop_model.

2928

\li One \ref apop_model cannot hold two settings groups of the same type. Re-adding a group removes the previous version of that group.

\section methodsection Methods

\subsection psubsection p, log_likelihood

\li Function headers look like <tt>long double your_p_or_ll(apop_data *d, apop_model *params)</tt>.
\li The inputs are an \ref apop_data set and an \ref apop_model, which should include a filled <tt>->parameters</tt> element.
\li We assume that the parameters have been set, either by users via \ref apop_estimate or \ref apop_model_set_parameters, or by the search algorithms of \ref apop_maximum_likelihood. If the parameters are necessary, the function shall check that the parameters are not \c NULL and set the model's \c error element to \c 'p' if they are.
\li Return \c NaN on errors. If an error in the input model is found, the function may set the input model's \c error element to an appropriate \c char value.
\li If observations are assumed to be iid, you can probably use \ref apop_map_sum to write the core of the log likelihood function.
\li If your model includes both \c log_likelihood and \c p methods, it must be the case that <tt>log(p(d, m))</tt> equals <tt>log_likelihood(d, m)</tt> for all \c d and \c m.

\subsection prepsubsection prep

\li Function header looks like <tt>void your_prep(apop_data *data, apop_model *params)</tt>.
\li If \c vsize, \c msize1, or \c msize2 are -1, then the prep function will set them to the width of the input data.
\li If \c dsize is -1, then the prep function shall set it to the width of the input data.
\li If the \c parameters element is not allocated, the function shall allocate it via <tt>apop_data_alloc(vsize, msize1, msize2)</tt> (or equivalent).
\li The model's <tt>data</tt> pointer shall be set to point to the input data.
\li The input data may be modified by the prep routine. For example, the OLS prep routine shuffles a single input matrix as described above under \c data.
\li The \c info element shall be allocated and its title set to "<Info>".
\li The default is \ref apop_model_clear. It does all of the above.
\li The prep routine may initialize any desired settings groups. Unless otherwise
stated, these should not be removed if they are already there, so that users can override defaults by adding a settings group before starting an estimation.
\li If any functions associated with the model need to be added to
a vtable (see above), the registration shall happen here. Registration may also happen elsewhere.
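
A minimal sketch of the size and allocation rules above, in stand-alone C. The structs are hypothetical stand-ins for \ref apop_data and \ref apop_model, only \c vsize and \c dsize are modeled, and plain \c malloc is used where a real prep routine would call \c apop_data_alloc. The info page, settings groups, and vtable registration are left out.

```c
#include <stdlib.h>
#include <stddef.h>

// Hypothetical stand-ins for apop_data and apop_model; not the real structs.
typedef struct { double *cells; size_t rows, cols; } sketch_data;
typedef struct {
    int vsize, dsize;      // -1 means "use the width of the input data"
    double *parameters;    // NULL until allocated
    sketch_data *data;
} sketch_model;

void sketch_prep(sketch_data *d, sketch_model *m){
    if (m->vsize == -1) m->vsize = (int)d->cols;
    if (m->dsize == -1) m->dsize = (int)d->cols;
    // Allocate parameters only if the caller has not already done so.
    // The contents are left uninitialized, as the estimate routine expects.
    if (!m->parameters) m->parameters = malloc((size_t)m->vsize * sizeof(double));
    m->data = d;           // the model keeps a pointer to the input data
}
```

Calling the sketch twice on the same model leaves an existing \c parameters allocation in place, which mirrors the rule that allocation happens only when the element is not yet allocated.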

\subsection estimatesubsection estimate

\li Function header looks like <tt>void your_estimate(apop_data *data, apop_model *params)</tt>.
\li Assume that the prep routine has already been run. Notably, this means that parameters have been allocated.
\li Assume that the \c parameters hold garbage (as in a \c malloc without a subsequent assignment to the <tt>malloc</tt>-ed space).
\li The function modifies the input model, and returns nothing. Note that this is different from the wrapper function, \ref apop_estimate, which makes a copy of its input model, preps it, and then calls the \c estimate function with the prepped copy.
\li The function shall set the \c parameters of the input model. For consistency with other models, the estimate should be the maximum likelihood estimate, unless otherwise documented.
\li Additional settings may be set.
\li The model's \c <Info> page may be filled with data. For scalars like log likelihood and AIC, use \ref apop_data_add_named_elmt.
\li Data should not be modified by the \c estimate routine; any changes to the data made by \c estimate must be documented.
\li The default called by \ref apop_estimate is \ref apop_maximum_likelihood.
\li If errors occur during processing, set the model's \c error element to a single character. Documentation should include the list of error characters and their meaning.

\subsection drawsubsection draw

\li Function header looks like <tt>void your_draw(double *out, gsl_rng *r, apop_model *params)</tt>.
\li Assume that model \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the draw method should check that \c parameters are not \c NULL and fill the output with NaNs if necessary parameters are not set.
\li The user passes a pointer to an array of \c dsize <tt>double</tt>s, and is expected to make sure that there is adequate space. The user also passes a \c gsl_rng, already allocated (probably via \ref apop_rng_alloc).
\li The function shall fill the space pointed to by the input pointer with a random draw from the data space, where the probability of drawing any given observation is proportional to its likelihood as given by the \c p method. Data shall be reduced to a single vector via \ref apop_data_pack if it is not already a single vector.

\subsection cdfsubsection cdf

\li Function header looks like <tt>long double your_cdf(apop_data *d, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate or \ref apop_model_set_parameters. The author of the CDF method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li The CDF method must accept data as a single row of data in the \c matrix of the input \ref apop_data set (as per a draw produced using the \c draw method). It may also accept other formats.
\li Returns the proportion of the likelihood function \f$\leq\f$ the first row of the input data. The definition of \f$\leq\f$ is chosen by the model author.
\li If one is not already present, an \c apop_cdf_settings group may be added to the model. See the \ref apop_cdf function for details of its use.

\subsection constraintsubsection constraint

\li Function header looks like <tt>long double your_constraint(apop_data *data, apop_model *params)</tt>.
\li Assume that \c parameters are set, via \ref apop_estimate, \ref apop_model_set_parameters, or the internals of an MLE search. The author of the constraint method should check that \c parameters are not \c NULL and return NaN if necessary parameters are not set.
\li See \ref apop_linear_constraint for a useful basis and/or example. Many constraints can be written as wrappers for this function.
\li If the constraint is met, then return zero.
\li If the constraint fails, then (1) move the \c parameters in the input model to a
constraint-satisfying value, and (2) return the distance between the input parameters and
what you've moved the parameters to. The choice of within-bounds parameters and distance function is left to the author of the constraint function.
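
The return-zero-or-move-and-report pattern can be sketched in stand-alone C as follows. This is an illustration only: \c sketch_model is a hypothetical stand-in for \ref apop_model, the data argument is an untyped pointer because this constraint ignores it, and the margin of 1e-3 and the absolute-difference distance are arbitrary choices made for the example. A real constraint would more likely wrap \ref apop_linear_constraint.

```c
#include <math.h>
#include <stddef.h>

// Hypothetical stand-in for apop_model; not the real struct.
typedef struct { double *parameters; } sketch_model;

// Constraint: parameters[1] (a standard deviation, say) must be at least margin.
long double sketch_positive_sigma(void *data, sketch_model *m){
    const double margin = 1e-3;        // arbitrary choice for this sketch
    (void)data;                        // unused by this constraint
    if (!m->parameters) return NAN;    // necessary parameters are not set
    double sigma = m->parameters[1];
    if (sigma >= margin) return 0;     // constraint met
    m->parameters[1] = margin;         // (1) move to a within-bounds value
    return margin - sigma;             // (2) distance between old and new values
}
```

After a violation the model already holds the corrected parameters, so a second call on the same model returns zero.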


*/