~mwshinn/+junk/notes3 : contents of notes3.n at revision 32

Title: Documentation for Notes3
By: Max Shinn
Opened: 7/22/15

Introduction
============

Notes3 is a powerful system for scientific note-taking.  It is
intended to be used for the following:

- Keeping lab notes
- Fast and efficient digital note-taking when dealing with lots of
  mathematical equations
- Drafting scientific papers
- Typesetting homework assignments, especially those in the
  quantitative sciences
- Programming with verbose comments, when one focuses on the comments
  over the code
- Documenting brief explorations of intriguing ideas

The current version of Notes3 is based on Pandoc, allowing notes to be
converted into LaTeX/pdf and html.  Pandoc supports many more formats,
but results may vary, as I haven't tested them.  That being said, I
try to maintain compatibility with these formats, and in theory they
should work equally well.

History
=======

Notes3 started as a lowly perl script (`latexify-notes.pl`) in Spring
2012 which used simple regex substitutions in order to allow me to
take scientific notes.  I wrote the script at the suggestion of
Clarence Lehman, who had been using a similar system for years to take
lab notes.  I started using it frequently, adding more features, such as
extended math mode.  Initially, I named my files based on a timestamp,
but I grew to dislike this due to the difficulty of locating files.
So, I changed the naming scheme.  I also found that I had a lot of
files associated with each note, so instead of just prefixing their
names with the note's filename, I gave each one its own directory.

Eventually, I realized I needed something more powerful than regex, so
in Fall 2012 I built Notes2, a much more powerful system based on
Perl's Regexp::Grammars.  It worked much better, and lasted me until
now (Summer 2015), but it still had its problems.  There were many
mistakes in my grammar, partially due to my ambiguously-defined format
and limitations in Regexp::Grammars, but also due in part to the fact
that I was mostly reinventing the wheel.

Both iterations of the program used exactly the same syntax, which was
a format vaguely reminiscent of markdown.  In fact, I used
markdown-mode on emacs as a guide to designing notes-mode and
notes2-mode.  I should mention that notes-mode and notes2-mode were
both emacs modes to support the same syntax... both were compatible
with both iterations of the notes program.  However, notes2-mode
actually worked, whereas notes-mode spent its life in a pre-alpha
stage due to the fact that I knew nothing about lisp at the time.

Eventually, Notes2 started to be less appealing to me for the
following reasons:

- There were some bugs in math mode that were impossible to fix
  without defining a grammar for math mode.  Math mode still used
  regex, even though the rest of the document used a grammar.
- I had always intended on creating an html export; however, with that
  codebase, it would have taken *a lot* of work.
- My syntax (an amalgamation of org-mode syntax and LaTeX syntax) was
  different enough (for trivial reasons) from markdown, org-mode,
  reStructuredText, etc. to require its own custom support for text
  editors.  If I were to adapt the syntax to look like a more typical
  format, I could just piggyback off existing text editor support.
- The code base grew very large very quickly, and many things got to
  be pretty hacky.  Matrices in particular were not implemented in an
  elegant fashion.
- Due to a bug in Regexp::Grammars, the grammar could not be housed in
  an external file.  Thus, when I wrote the script to extract code, I
  needed to duplicate all of the grammar code in that file too.  It
  got to be very messy very quickly.
- The error messages were very bad.  The errors only occurred when the
  LaTeX code was compiling.  Thus, the line numbers were all wrong,
  and a knowledge of the substitutions was necessary to track down many
  syntax errors.  Furthermore, many errors would just slide by with no
  error message.  For example, all unrecognized unicode characters
  were ignored.
- The compiler was dog slow and there was no way to speed it up under
  the Regexp::Grammars platform.

I discovered Pandoc in Summer 2014 and immediately knew that it would
be the future of my notes program.  However, I also wanted to use this
as an excuse to learn Haskell, so I spent the rest of the summer
learning Haskell.  It turns out I didn't have enough time afterwards
to actually write it.  I was also severely disappointed to discover
that almost the entire Pandoc codebase was written in procedural style
using monads and the `do` syntax.  I became discouraged and busy with my
Analysis I course.  Additionally, for the Analysis course, I needed a
graphical math typesetter to visualize my equations.  So I started
using Texmacs, and my immediate need for Notes2 decreased.

This summer (2015), I decided I should go back and refresh my Haskell
skills.  I thought I would do so by rewriting my notes program for
Pandoc.  However, I quickly realized that if I did so, I would not do
a good job.  So I wrote the filters as json filters in Python instead.
This made it much cleaner, and I hypothesize that there will never be
a Notes4.

-> Note that the syntax used in Notes3 is different than that used in
   previous notes versions.  Previous versions are not based on
   markdown.


Design principles
=================

The following are some design principles that I had in mind when
writing Notes3, and they should continue to be adhered to.

- It should be implemented on top of Pandoc with a focus on Markdown
  format, though most features should be able to run when coming from
  other starting formats as well.
- It should be implemented exclusively using Pandoc filters, without
  "preprocessors", or scripts that run before Pandoc.
- The features should be implemented as modularly as is practical, so
  any subset of features can be used without the others.  This should
  be done by using multiple Pandoc filters.
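
To make the filter-based design concrete, here is a minimal sketch of
what a Pandoc JSON filter looks like using only the Python standard
library.  It is not one of the actual Notes3 filters: the names `walk`,
`shout`, and `main` are made up for this illustration, and the toy
action (upper-casing text) only stands in for a real feature.

```python
import json
import sys


def walk(node, action):
    """Apply action to every {"t": ...} element of a Pandoc JSON tree.

    If action returns None the element is kept and its children are
    visited; otherwise the return value replaces the element.
    """
    if isinstance(node, list):
        return [walk(item, action) for item in node]
    if isinstance(node, dict):
        if "t" in node:
            replacement = action(node)
            if replacement is not None:
                return replacement
        return {key: walk(value, action) for key, value in node.items()}
    return node


def shout(element):
    """Toy action (hypothetical): upper-case every Str element."""
    if element["t"] == "Str":
        return {"t": "Str", "c": element["c"].upper()}
    return None


def main():
    """Read a Pandoc JSON document on stdin, write the result to stdout."""
    document = json.load(sys.stdin)
    json.dump(walk(document, shout), sys.stdout)
```

Calling `main()` at the bottom of such a script turns it into a filter
that can be passed to Pandoc's `--filter` option.  Since `--filter` can
be given several times, each feature can live in its own small script,
which is what makes it possible to use any subset of them.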


Format documentation
====================

All syntax supported by Pandoc's Markdown format is supported in
Notes3.  Functionality supported by Notes3 is modular.  Each of the
modules is described individually below.  (Normally, all modules
would be used, but individual modules can be used separately if
desired.)  For a description of the Pandoc Markdown format, see the
[Rmarkdown Pandoc documentation][rpandocdoc], which is cleaner for our
purposes than the [official Pandoc documentation][pandocdoc].  If you
already know markdown and want to know the differences between the
two, see the [MarkupBinder documentation][panvsmd].

[rpandocdoc]: http://rmarkdown.rstudio.com/authoring_pandoc_markdown.html
[pandocdoc]: http://pandoc.org/README.html#pandocs-markdown
[panvsmd]: http://www.speakon.org.uk/MarkupBinder/beta/docs/Markdown/Pandoc_Markdown_vs_standard_Markdown.html

Simplified Math
---------------

LaTeX math is very flexible, but it is cumbersome to type and
read.  Notes3 alleviates this problem by redefining some of the LaTeX
syntax to:

- Avoid the need for separate commands when unicode would suffice
- Limit the need for brackets in super/sub-scripts
- Allow Matlab-style matrices
- Guess intelligently about the proper alignment for the equations
- Support combining characters
- Typeset functions in roman by default

### Matlab-style matrices

Matrices can be specified in a format very similar to that of Matlab
or Octave.  They start with the text `mat[` and end with `]`.  The
rows are `;` delimited, and the columns are `,` delimited.  Other
commas contained within the matrix (for instance functions with
multiple arguments) will be handled properly.  These matrices are only
valid in math mode.  So for instance, the code:

~~~
mat[3x, max(2x+3, 4-2x); 14999, 3^2] mat[4; 2]
~~~

produces

$$
mat[3x, max(2x+3, 4-2x); 14999, 3^2] mat[4; 2]
$$

This can also be useful for producing vectors, for instance
`mat[1, 2, 3]` renders as $mat[1, 2, 3]$.
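
The splitting rule described above (rows on `;`, columns on `,`,
ignoring commas nested inside parentheses) can be sketched in Python
roughly as follows.  This is only an illustration and not the actual
filter code: the function names `split_top_level` and
`matrix_to_latex` are hypothetical, and the choice of the `bmatrix`
environment is an assumption about the output.

```python
def split_top_level(text, delimiter):
    """Split text on delimiter, ignoring delimiters nested in (), [] or {}."""
    parts, current, depth = [], [], 0
    for char in text:
        if char in "([{":
            depth += 1
        elif char in ")]}":
            depth -= 1
        if char == delimiter and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    return parts


def matrix_to_latex(body):
    """Convert the text between `mat[` and `]` into a LaTeX matrix.

    The bmatrix environment is an assumption; the real output may differ.
    """
    rows = [split_top_level(row, ",") for row in split_top_level(body, ";")]
    lines = [" & ".join(cells) for cells in rows]
    return "\\begin{bmatrix} " + " \\\\ ".join(lines) + " \\end{bmatrix}"


print(matrix_to_latex("3x, max(2x+3, 4-2x); 14999, 3^2"))
```

Tracking the nesting depth is what keeps the comma inside
`max(2x+3, 4-2x)` from being treated as a column separator.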

### Unicode

Most standard unicode characters are supported.  Just use a unicode
character inside math mode, and it will be converted to the
appropriate symbol.  Combining characters will be properly handled.
In addition to unicode, "smart" substitutions will be made, such as
changing `<<` to `\ll`.
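
As a rough sketch of the idea, a substitution pass over a math string
might look like the following.  The tables here are tiny and purely
illustrative (only the `<<` entry comes from this document), and the
names `SYMBOLS`, `COMBINING`, and `to_latex` are hypothetical rather
than taken from the actual filter.

```python
# Illustrative table only: the symbols Notes3 actually recognizes may differ.
SYMBOLS = {
    "\u03b1": "\\alpha",  # Greek small letter alpha
    "\u2264": "\\leq",    # less-than or equal to
    "\u221e": "\\infty",  # infinity
    "<<": "\\ll",         # "smart" ASCII substitution
}
# Combining marks modify the character that precedes them.
COMBINING = {
    "\u0302": "\\hat",    # combining circumflex accent
    "\u0304": "\\bar",    # combining macron
}


def to_latex(math):
    """Replace known unicode characters and digraphs with LaTeX commands."""
    out = []
    i = 0
    while i < len(math):
        if math[i:i + 2] in SYMBOLS:
            # Two-character substitution; the trailing space ends the command.
            out.append(SYMBOLS[math[i:i + 2]] + " ")
            i += 2
        elif math[i] in SYMBOLS:
            out.append(SYMBOLS[math[i]] + " ")
            i += 1
        elif math[i] in COMBINING and out:
            # Wrap whatever was emitted last in the accent command.
            out[-1] = COMBINING[math[i]] + "{" + out[-1] + "}"
            i += 1
        else:
            out.append(math[i])
            i += 1
    return "".join(out)
```

With these tables, `to_latex("a<<b")` gives `a\ll b`, and an `x`
followed by a combining circumflex becomes `\hat{x}`.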

### Super/sub-scripts

Superscripts and subscripts should not, in most cases, need to be
surrounded by brackets.  Notes3 will be "smart" about determining
whether something is continuous or not.  For instance, it detects
functions and open-parentheses, as well as numbers, strings, and raw
LaTeX commands.  If you are ever in doubt, you can always add a space
between the exponent and whatever comes next to ensure that it will
not be included in the exponent.

| Input           | Output          |
|-----------------+-----------------|
| `x^sin(y)`      | $x^sin(y)$      |
| `x^3y`          | $x^3y$          |
| `x^(3y)`        | $x^(3y)$        |
| `x^{3y}`        | $x^{3y}$        |
| `2^12`          | $2^12$          |
| `2^1+1`         | $2^1+1$         |
| `2^50_12`       | $2^50_12$       |
| `2_12^50`       | $2_12^50$       |
| `x_"shuffle"`   | $x_"shuffle"$   |
| `x^{y^3}`       | $x^{y^3}$       |
| `x^\frac{1}{2}` | $x^\frac{1}{2}$ |

### Equation alignment

LaTeX distinguishes between a single equation, grouped equations, and
aligned grouped equations.  Notes3 does not.  Multiple equations can
be added by just inserting new lines when in math mode.  To align
them, add an `&` at the location where you want them to be aligned.
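
One way to picture this decision is the sketch below, which picks a
LaTeX environment from the contents of a math block.  The specific
environments (`equation*`, `gather*`, `align*`) and the function names
are assumptions made for illustration; the actual filter may choose
differently.

```python
def choose_environment(math):
    """Pick a LaTeX display environment for the body of a math block."""
    lines = [line for line in math.strip().splitlines() if line.strip()]
    if any("&" in line for line in lines):
        return "align*"      # at least one alignment point was given
    if len(lines) > 1:
        return "gather*"     # several equations, no alignment requested
    return "equation*"       # a single equation


def wrap(math):
    """Join the equations with LaTeX line breaks inside the chosen environment."""
    env = choose_environment(math)
    lines = [line.strip() for line in math.strip().splitlines() if line.strip()]
    body = " \\\\\n".join(lines)
    return "\\begin{%s}\n%s\n\\end{%s}" % (env, body, env)


print(wrap("a &= b + c\nd &= e"))
```

The point is that the author only writes new lines and optional `&`
markers, and the choice of environment is inferred from them.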

### Functions

Notes3 will detect functions within equations and typeset them in
roman face.

### Raw LaTeX

If there is anything which does not work correctly or would be easier
to type in plain (math mode) LaTeX, just surround the code with double
curly brackets.  For example, `$a+x^10+20$` gives $a+x^10+20$, but
`$a+{{x^10}}+20$` gives $a+{{x^10}}+20$.  Similarly, `$mat[1, 2; 3, 4]$`
gives $mat[1, 2; 3, 4]$, but `${{mat[1, 2; 3, 4]}}$` gives
${{mat[1, 2; 3, 4]}}$.  Note that, since no processing is done on
this, it is not possible to use unicode characters within Raw LaTeX
regions.  Usually you should not have to use this, but it is included
as a workaround for yet unnoticed bugs.


Documented Code Support
-----------------------

First, it is important to specify that Notes3 neither supports nor
intends to support literate programming.  However, it can be used as a
markup syntax for comments in much the same style as literate
programming.  Furthermore, any code fragments within the document can
be extracted into their own source files.

Each code segment may optionally have a filename associated with it.
If a filename (with an appropriate extension) is listed as a class
attribute, the code segment will be extracted to that file when Notes3
is run.  This file will be located in the same directory as the
note file unless the note has a data directory (see the
[data directory section](#data-directory-support)).  In this case, it
will be extracted to the `generated-scripts` directory within the data
directory.  If the `generated-scripts` directory doesn't exist, it
will be created.  **Any modifications to files in this directory will
be overwritten whenever Notes3 is run.**

-> Modifications should always be made within the note file to make
   sure nothing will be overwritten.

The filename is not necessary; if one is not supplied, the code will
not be extracted.

Code segments will have syntax highlighting applied to them according
to their filename.  If no file name is supplied, syntax highlighting
will occur using the Pandoc defaults.

For example, the following code segment will extract to testfile.cpp
and will be highlighted as a C++ file:

~~~~
~~~testfile.cpp
#include <iostream>
using namespace std;
int main() { return 0; }
~~~
~~~~

It will show in the document as:

~~~ {.cpp .numberLines}
#include <iostream>
using namespace std;
int main() { return 0; }
~~~

Line numbering is included by default only if a file is specified.

If multiple code blocks are given with the same filename, they will be
concatenated to form the specified file in the order in which they
appear in the document.  The line numbering will reflect this.  For
instance, if the following appears in the document:

~~~~
~~~testfile.py
#!/usr/bin/python
import sys
~~~
~~~~

and the following appears later on in the document:

~~~~
~~~testfile.py
if 0==1:
    print >> sys.stderr, "Error"
~~~
~~~~

it will extract to form the file `testfile.py` containing the content:

~~~
#!/usr/bin/python
import sys
if 0==1:
    print >> sys.stderr, "Error"
~~~
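
The concatenation and line-numbering behavior can be sketched as
follows.  This is a simplified model rather than the actual filter: it
takes a list of `(filename, code)` pairs in document order instead of
a Pandoc document, and the function name `plan_extraction` is
hypothetical.

```python
def plan_extraction(blocks):
    """Group code blocks by filename, in document order.

    blocks is a list of (filename, code) pairs, where filename is None
    for blocks that should not be extracted.  Returns the contents of
    each output file and, for every block, the line number at which it
    starts in its file (None for blocks without a filename).
    """
    files = {}
    first_lines = []
    for filename, code in blocks:
        if filename is None:
            first_lines.append(None)  # displayed, but never extracted
            continue
        lines = files.setdefault(filename, [])
        first_lines.append(len(lines) + 1)
        lines.extend(code.splitlines())
    contents = {name: "\n".join(lines) + "\n" for name, lines in files.items()}
    return contents, first_lines
```

For the two `testfile.py` blocks above, the second block would start
at line 3, so its line numbers continue where the first block left
off.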


Important/Update filter
-----------------------

I have found through experience that two particular types of syntax
are very useful within scientific notes, so the following filter
implements these.

### Important

If a paragraph is prefixed with `->` followed by a space, the
paragraph will be typeset slightly differently to make it stand out.
This is intended to be used for tangents, side notes, or warnings
about potential mishaps.  It is for anything that does not follow the
main narrative of the text, but is too important to be contained
within a footnote.  It is okay (and recommended for aesthetics) to
indent the lines following the arrow/space, but not necessary.  So for
instance, the following:

~~~
-> When you are using the "Important" notation, make sure that the
   arrow is always followed by a space.  Otherwise, the parser will
   not format the output correctly.
~~~

will be formatted like:

-> When you are using the "Important" notation, make sure that the
   arrow is always followed by a space.  Otherwise, the parser will
   not format the output correctly.

### Updates

It is very useful to be able to "Close" notes in addition to opening
them.  However, it is sometimes necessary to update a note after it
has been closed.  The Update notation can be used to specify that a
particular piece of text was not included in the original.

Whenever a paragraph starts with `UPDATE ([date]):` followed by a
space, it will be typeset differently (similar to `->` notation).
Here, `[date]` is to be replaced by the date the update was made.
This date must not contain any spaces, and should contain only
numbers, slashes, and hyphens.

So the text:

~~~
UPDATE (4/1/2214): I come from the land of palindromes
~~~

will be typeset as:

UPDATE (4/1/2214): I come from the land of palindromes
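
The prefix rules for both notations are simple enough to express as
regular expressions.  The sketch below works on the plain text of a
paragraph and is only meant to pin down the rules stated above; the
names `IMPORTANT`, `UPDATE`, and `classify` are hypothetical, and the
actual filter may detect these prefixes differently.

```python
import re

# "->" followed by a space at the start of the paragraph.
IMPORTANT = re.compile(r"^-> ")
# "UPDATE (date): " where the date has only digits, slashes, and hyphens.
UPDATE = re.compile(r"^UPDATE \(([0-9/-]+)\): ")


def classify(paragraph):
    """Return (kind, date) for the text of a paragraph."""
    if IMPORTANT.match(paragraph):
        return ("important", None)
    match = UPDATE.match(paragraph)
    if match:
        return ("update", match.group(1))
    return ("normal", None)


print(classify("UPDATE (4/1/2214): I come from the land of palindromes"))
```

Note that a date containing spaces or letters, such as `April 1`, does
not match, so such a paragraph would be treated as ordinary text.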


Data directory support
----------------------

Each note file has the option of having its own special directory.  If
a directory exists with the name `[filename].d` (where `[filename]` is
the note's filename), all file links in the document will be scanned.
If the linked file (or directory tree) exists within the `.d`
directory, the link will be modified to point to the file within the
directory.

Consider the following directory tree as an example:

~~~
note.n
file1.txt
file2.csv
note.n.d/
--- file2.csv
--- file3.jpg
--- extradir/
--- --- file2.csv
--- --- file4.mp3
~~~

The links will be transformed as follows:

| Original               | Modified                        |
|------------------------+---------------------------------|
| `[text](file1.txt)`    | `[text](file1.txt)`             |
| `[text](file2.csv)`    | `[text](note.n.d/file2.csv)`    |
| `![alt](file3.jpg)`    | `![alt](note.n.d/file3.jpg)`    |
| `<extradir/file4.mp3>` | `<note.n.d/extradir/file4.mp3>` |
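
The lookup behind this table can be sketched in a few lines of Python.
This is an illustration under assumptions, not the actual filter: the
function name `resolve_link` is hypothetical, and the handling of
absolute paths is a guess.

```python
import os


def resolve_link(target, note_path):
    """Return the link target, redirected into the note's .d directory
    when a file or directory of that name exists there."""
    data_dir = note_path + ".d"
    if os.path.isabs(target):
        return target  # assumption: absolute paths are left alone
    candidate = os.path.join(data_dir, *target.split("/"))
    if os.path.isdir(data_dir) and os.path.exists(candidate):
        return os.path.basename(data_dir) + "/" + target
    return target
```

With the directory tree above, `resolve_link("file2.csv", "note.n")`
returns `note.n.d/file2.csv`, since the copy inside the `.d` directory
takes precedence over the one next to the note, while
`resolve_link("file1.txt", "note.n")` is returned unchanged.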


Extra metadata
--------------

Extra metadata is defined within notes files.  The following keys are
recognized:

`By`

: An alias for `Author`

`Opened`

: Specifies the date that the note was created.  This should always be
  set in a note file, though this is not enforced.

`Closed`

: Specifies the date that the note was finished.  By convention, once
  a file is closed, it is good practice to make all of the
  modifications after this point use the [UPDATE notation][updates].

`Due`

: Primarily for assignments/reports that are to be completed on a
  specified day.  The date is displayed as the Due date on the
  finished document, no matter what the Open/Closed/Date say.

It is important to note that the `Closed` key is the only one that can
also go at the end of a document.  If a paragraph begins with the
string `Closed:` followed by a space, then the Closed date is set
based on the text that follows this.


Easy citations
--------------

Parenthetical citations to books or scientific papers can be created
by simply linking to the identifier.  DOI, ISBN, and PMID (PubMed
ID)^[PubMed Central IDs, also called PMCIDs, are not supported.] codes
are accepted.  The text of the link will be replaced
by the first author and the year of the publication, surrounded by
parentheses.

The citation displayed is very crude, and is only provided for
convenience.  *It is not a proper parenthetical citation, and should
not be directly included in publications.* In the case of multiple
authors, "et al." is not added, and in the case of two authors, only
the first is listed.  Furthermore, due to some imperfect database
records (especially in the ISBN records), the author's full name may
be used, or the author's name may be absent.  In these cases, Notes3
will try its best to display some useful information.  If there is an
error (either due to the absence of an internet connection or some
other problem retrieving the information), the DOI, ISBN, or PMID
itself will be enclosed within parentheses.

If a retrieval is successful once, the result is cached permanently,
and will then always be available until the cache file is deleted.

As an example, the link `<doi:10.1016/j.cortex.2014.01.013>`
renders as
[(Liu 2014)](http://dx.doi.org/10.1016/j.cortex.2014.01.013).  Text is
always overwritten, so the link
`[this paper](doi:10.1016/j.cortex.2014.01.013)` also renders as
[(Liu 2014)](http://dx.doi.org/10.1016/j.cortex.2014.01.013).
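
The caching and fallback behavior can be summarized with the sketch
below.  It deliberately leaves out the network part: `lookup` is a
hypothetical function supplied by the caller that returns an
`(author, year)` pair or raises an exception, and a plain dictionary
stands in for the cache file.  The name `cite` is likewise made up for
this illustration.

```python
def cite(identifier, lookup, cache):
    """Return the link text for an identifier such as 'doi:10.1016/...'.

    lookup is a hypothetical callable returning (author, year) or
    raising an exception; cache is a dict standing in for the cache file.
    """
    if identifier in cache:
        return cache[identifier]  # a previous success is reused
    try:
        author, year = lookup(identifier)
    except Exception:
        # On failure, show the bare identifier and cache nothing, so
        # the lookup is tried again on the next run.
        return "(%s)" % identifier.split(":", 1)[-1]
    text = "(%s %s)" % (author, year)
    cache[identifier] = text
    return text
```

Because only successful lookups are stored, a note that was compiled
once with a working connection keeps its citations when compiled
offline later.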


Bracket links to files
----------------------

Normal markdown syntax does not allow links to files to use the angle
bracket notation.  With this filter, links such as `<filename.txt>`
are valid if the file `filename.txt` exists either in the same
directory as the note, or in the note's `.d` directory (see the
section on [data directories][data directory support]).  If the file
does not exist, the text `<filename.txt>` will be printed verbatim in
the final document, escaping the angle brackets if necessary (such as
in HTML).

Additionally, this allows images to be included with the angle bracket
notation, by using the syntax `!<image.jpg>`.
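
As a rough model of the rule, the sketch below rewrites angle bracket
links in a plain string into ordinary markdown links when the file
exists.  The real filter operates on Pandoc's document tree rather
than on raw text, so this is only an approximation; the names
`BRACKET` and `expand_bracket_links` are hypothetical, and using the
filename as the link text is an assumption.

```python
import os
import re

# An optional "!" followed by a name in angle brackets, without spaces.
BRACKET = re.compile(r"(!?)<([^<>\s]+)>")


def expand_bracket_links(text, note_path):
    """Turn <file> and !<file> into markdown links if the file exists
    next to the note or in the note's .d directory."""
    note_path = os.path.abspath(note_path)
    search_dirs = [os.path.dirname(note_path), note_path + ".d"]

    def replace(match):
        bang, name = match.groups()
        if any(os.path.isfile(os.path.join(d, name)) for d in search_dirs):
            return "%s[%s](%s)" % (bang, name, name)
        return match.group(0)  # no such file: keep the text verbatim

    return BRACKET.sub(replace, text)
```

Anything in angle brackets that is not an existing file, including
ordinary URLs, is left untouched by this sketch.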

-> Note that this component will only work properly if the `raw_html`
   Pandoc feature is disabled.  This is not a problem when using
   official Notes3 scripts (since it is disabled by default), but if
   using this filter outside of Notes3, you will need to disable the
   `raw_html` feature with the command line option `-f markdown-raw_html`.

-> If using this filter separately from the others, the data directory
   support filter must be present in the same directory as this filter
   in order to support finding files in the `.d` directory.