1
package Text::Markdown;
6
use Digest::MD5 qw(md5_hex);
11
our $VERSION = '1.0.19';
12
our @EXPORT_OK = qw(markdown);
16
Text::Markdown - Convert Markdown syntax to (X)HTML
20
use Text::Markdown 'markdown';
21
my $html = markdown($text);
23
use Text::Markdown 'markdown';
24
my $html = markdown( $text, {
25
empty_element_suffix => '>',
30
my $m = Text::Markdown->new;
31
my $html = $m->markdown($text);
34
my $m = Text::MultiMarkdown->new(
35
empty_element_suffix => '>',
38
my $html = $m->markdown( $text );
42
Markdown is a text-to-HTML filter; it translates an easy-to-read /
43
easy-to-write structured text format into HTML. Markdown's text format
44
is most similar to that of plain text email, and supports features such
45
as headers, *emphasis*, code blocks, blockquotes, and links.
47
Markdown's syntax is designed not as a generic markup language, but
48
specifically to serve as a front-end to (X)HTML. You can use span-level
49
HTML tags anywhere in a Markdown document, and you can use block level
50
HTML tags (like <div> and <table> as well).
54
This module implements the 'original' Markdown markdown syntax from:
56
http://daringfireball.net/projects/markdown/
60
Text::Markdown supports a number of options to it's processor which control the behavior of the output document.
62
These options can be supplied to the constructor, on in a hash with the individual calls to the markdown method.
63
See the synopsis for examples of both of the above styles.
65
The options for the processor are:
69
=item empty element suffix
71
This option can be used to generate normal HTML output. By default, it is ' />', which is xHTML, change to '>' for normal HTML.
75
Controls indent width in the generated markup, defaults to 4
77
=item markdown_in_html_blocks
79
Controls if Markdown is processed when inside HTML blocks. Defaults to 0.
87
# Regex to match balanced [brackets]. See Friedl's
88
# "Mastering Regular Expressions", 2nd Ed., pp. 328-331.
89
our ($g_nested_brackets, $g_nested_parens);
90
$g_nested_brackets = qr{
92
[^\[\]]+ # Anything other than brackets
95
(??{ $g_nested_brackets }) # Recursive set of nested brackets
99
# Doesn't allow for whitespace, because we're using it to match URLs:
100
$g_nested_parens = qr{
101
(?> # Atomic matching
102
[^()\s]+ # Anything other than parens or whitespace
105
(??{ $g_nested_parens }) # Recursive set of nested brackets
110
# Table of hash values for escaped characters:
112
foreach my $char (split //, '\\`*_{}[]()>#+-.!') {
113
$g_escape_table{$char} = md5_hex($char);
120
A simple constructor, see the SYNTAX and OPTIONS sections for more information.
125
my ($class, %p) = @_;
127
$p{base_url} ||= ''; # This is the base url to be used for WikiLinks
129
$p{tab_width} = 4 unless (defined $p{tab_width} and $p{tab_width} =~ m/^\d+$/);
131
$p{empty_element_suffix} ||= ' />'; # Change to ">" for HTML output
133
# Is markdown processed in HTML blocks? See t/15inlinehtmldonotturnoffmarkdown.t
134
$p{markdown_in_html_blocks} = $p{markdown_in_html_blocks} ? 1 : 0;
136
my $self = { params => \%p };
137
bless $self, ref($class) || $class;
143
The main function as far as the outside world is concerned. See the SYNOPSIS
149
my ( $self, $text, $options ) = @_;
151
# Detect functional mode, and create an instance for this run..
153
if ( $self ne __PACKAGE__ ) {
154
my $ob = __PACKAGE__->new();
155
# $self is text, $text is options
156
return $ob->markdown($self, $text);
159
croak('Calling ' . $self . '->markdown (as a class method) is not supported.');
165
%$self = (%{ $self->{params} }, %$options, params => $self->{params});
167
$self->_CleanUpRunData($options);
169
return $self->_Markdown($text);
172
sub _CleanUpRunData {
173
my ($self, $options) = @_;
174
# Clear the global hashes. If we don't clear these, you get conflicts
175
# from other articles when generating a page which contains more than
176
# one article (e.g. an index page that shows the N most recent
178
$self->{_urls} = $options->{urls} ? $options->{urls} : {}; # FIXME - document passing this option (tested in 05options.t).
179
$self->{_titles} = {};
180
$self->{_html_blocks} = {};
181
# Used to track when we're inside an ordered or unordered list
182
# (see _ProcessListItems() for details)
183
$self->{_list_level} = 0;
189
# Main function. The order in which other subs are called here is
190
# essential. Link and image substitutions need to happen before
191
# _EscapeSpecialChars(), so that any *'s or _'s in the <a>
192
# and <img> tags get encoded.
194
my ($self, $text, $options) = @_;
196
$text = $self->_CleanUpDoc($text);
198
# Turn block-level HTML blocks into hash entries
199
$text = $self->_HashHTMLBlocks($text) unless $self->{markdown_in_html_blocks};
201
$text = $self->_StripLinkDefinitions($text);
203
$text = $self->_RunBlockGamut($text);
205
$text = $self->_UnescapeSpecialChars($text);
207
$text = $self->_ConvertCopyright($text);
214
Returns a reference to a hash with the key being the markdown reference and the value being the URL.
216
Useful for building scripts which preprocess a list of links before the main content. See t/05options.t
217
for an example of this hashref being passed back into the markdown method to create links.
224
return $self->{_urls};
228
my ($self, $text) = @_;
230
# Standardize line endings:
231
$text =~ s{\r\n}{\n}g; # DOS to Unix
232
$text =~ s{\r}{\n}g; # Mac to Unix
234
# Make sure $text ends with a couple of newlines:
237
# Convert all tabs to spaces.
238
$text = $self->_Detab($text);
240
# Strip any lines consisting only of spaces and tabs.
241
# This makes subsequent regexen easier to write, because we can
242
# match consecutive blank lines with /\n+/ instead of something
243
# contorted like /[ \t]*\n+/ .
244
$text =~ s/^[ \t]+$//mg;
249
sub _StripLinkDefinitions {
251
# Strips link definitions from text, stores the URLs and titles in
254
my ($self, $text) = @_;
255
my $less_than_tab = $self->{tab_width} - 1;
257
# Link defs are in the form: ^[id]: url "optional title"
259
^[ ]{0,$less_than_tab}\[(.+)\]: # id = \$1
261
\n? # maybe *one* newline
263
<?(\S+?)>? # url = \$2
265
\n? # maybe one newline
268
(?<=\s) # lookbehind for whitespace
273
)? # title is optional
276
$self->{_urls}{lc $1} = $self->_EncodeAmpsAndAngles( $2 ); # Link IDs are case-insensitive
278
$self->{_titles}{lc $1} = $3;
279
$self->{_titles}{lc $1} =~ s/"/"/g;
288
# Internal function used to safely MD5sum chunks of the input, which might be Unicode in Perl's internal representation.
290
return unless defined $input;
291
if (Encode::is_utf8 $input) {
292
return md5_hex(Encode::encode('utf8', $input));
295
return md5_hex($input);
299
sub _HashHTMLBlocks {
300
my ($self, $text) = @_;
301
my $less_than_tab = $self->{tab_width} - 1;
303
# Hashify HTML blocks:
304
# We only want to do this for block-level HTML tags, such as headers,
305
# lists, and tables. That's because we still want to wrap <p>s around
306
# "paragraphs" that are wrapped in non-block-level tags, such as anchors,
307
# phrase emphasis, and spans. The list of tags we're looking for is
311
p | div | h[1-6] | blockquote | pre | table |
312
dl | ol | ul | script | noscript | form |
313
fieldset | iframe | math | ins | del
318
(?: # Match one attr name/value pair
319
\s+ # There needs to be at least some whitespace
320
# before each attribute name.
321
[\w.:_-]+ # Attribute name
324
".+?" # "Attribute value"
326
'.+?' # 'Attribute value'
331
my $empty_tag = qr{< \w+ $tag_attrs \s* />}oxms;
332
my $open_tag = qr{< $block_tags $tag_attrs \s* >}oxms;
333
my $close_tag = undef; # let Text::Balanced handle this
335
use Text::Balanced qw(gen_extract_tagged);
336
my $extract_block = gen_extract_tagged($open_tag, $close_tag, undef, { ignore => [$empty_tag] });
339
while ($text =~ s{^(([ ]{0,$less_than_tab}<)?.*\n)}{}m) {
342
# current line could be start of code block
344
my ($tag, $remainder) = $extract_block->($cur_line . $text);
346
my $key = _md5_utf8($tag);
347
$self->{_html_blocks}{$key} = $tag;
348
push @chunks, "\n\n" . $key . "\n\n";
352
# No tag match, so toss $cur_line into @chunks
353
push @chunks, $cur_line;
357
# current line could NOT be start of code block
358
push @chunks, $cur_line;
362
push @chunks, $text; # Whatever is left.
364
$text = join '', @chunks;
366
# Special case just for <hr />. It was easier to make a special case than
367
# to make the other regex more complicated.
368
$text = $self->_HashHR($text);
370
$text = $self->_HashHTMLComments($text);
372
$text = $self->_HashPHPASPBlocks($text);
378
my ($self, $text) = @_;
379
my $less_than_tab = $self->{tab_width} - 1;
383
(?<=\n\n) # Starting after a blank line
385
\A\n? # the beginning of the doc
388
[ ]{0,$less_than_tab}
389
<(hr) # start tag = $2
392
/?> # the matching end tag
394
(?=\n{2,}|\Z) # followed by a blank line or end of document
397
my $key = _md5_utf8($1);
398
$self->{_html_blocks}{$key} = $1;
399
"\n\n" . $key . "\n\n";
405
sub _HashHTMLComments {
406
my ($self, $text) = @_;
407
my $less_than_tab = $self->{tab_width} - 1;
409
# Special case for standalone HTML comments:
412
(?<=\n\n) # Starting after a blank line
414
\A\n? # the beginning of the doc
417
[ ]{0,$less_than_tab}
424
(?=\n{2,}|\Z) # followed by a blank line or end of document
427
my $key = _md5_utf8($1);
428
$self->{_html_blocks}{$key} = $1;
429
"\n\n" . $key . "\n\n";
435
sub _HashPHPASPBlocks {
436
my ($self, $text) = @_;
437
my $less_than_tab = $self->{tab_width} - 1;
439
# PHP and ASP-style processor instructions (<?ā¦?> and <%ā¦%>)
442
(?<=\n\n) # Starting after a blank line
444
\A\n? # the beginning of the doc
447
[ ]{0,$less_than_tab}
454
(?=\n{2,}|\Z) # followed by a blank line or end of document
457
my $key = _md5_utf8($1);
458
$self->{_html_blocks}{$key} = $1;
459
"\n\n" . $key . "\n\n";
466
# These are all the transformations that form block-level
467
# tags like paragraphs, headers, and list items.
469
my ($self, $text) = @_;
471
# Do headers first, as these populate cross-refs
472
$text = $self->_DoHeaders($text);
474
# And now, protect our tables
475
$text = $self->_HashHTMLBlocks($text) unless $self->{markdown_in_html_blocks};
477
# Do Horizontal Rules:
478
$text =~ s{^[ ]{0,2}([ ]?\*[ ]?){3,}[ \t]*$}{\n<hr$self->{empty_element_suffix}\n}gmx;
479
$text =~ s{^[ ]{0,2}([ ]? -[ ]?){3,}[ \t]*$}{\n<hr$self->{empty_element_suffix}\n}gmx;
480
$text =~ s{^[ ]{0,2}([ ]? _[ ]?){3,}[ \t]*$}{\n<hr$self->{empty_element_suffix}\n}gmx;
482
$text = $self->_DoLists($text);
484
$text = $self->_DoCodeBlocks($text);
486
$text = $self->_DoBlockQuotes($text);
488
# We already ran _HashHTMLBlocks() before, in Markdown(), but that
489
# was to escape raw HTML in the original Markdown source. This time,
490
# we're escaping the markup we've just created, so that we don't wrap
491
# <p> tags around block-level tags.
492
$text = $self->_HashHTMLBlocks($text);
494
$text = $self->_FormParagraphs($text);
501
# These are all the transformations that occur *within* block-level
502
# tags like paragraphs, headers, and list items.
504
my ($self, $text) = @_;
506
$text = $self->_DoCodeSpans($text);
507
$text = $self->_EscapeSpecialCharsWithinTagAttributes($text);
508
$text = $self->_EscapeSpecialChars($text);
510
# Process anchor and image tags. Images must come first,
511
# because ![foo][f] looks like an anchor.
512
$text = $self->_DoImages($text);
513
$text = $self->_DoAnchors($text);
515
# Make links out of things like `<http://example.com/>`
516
# Must come after _DoAnchors(), because you can use < and >
517
# delimiters in inline links like [this](<url>).
518
$text = $self->_DoAutoLinks($text);
520
$text = $self->_EncodeAmpsAndAngles($text);
522
$text = $self->_DoItalicsAndBold($text);
524
# FIXME - Is hard coding space here sane, or does this want to be related to tab width?
526
$text =~ s/ {2,}\n/ <br$self->{empty_element_suffix}\n/g;
531
sub _EscapeSpecialChars {
532
my ($self, $text) = @_;
533
my $tokens ||= $self->_TokenizeHTML($text);
535
$text = ''; # rebuild $text from the tokens
536
# my $in_pre = 0; # Keep track of when we're inside <pre> or <code> tags.
537
# my $tags_to_skip = qr!<(/?)(?:pre|code|kbd|script|math)[\s>]!;
539
foreach my $cur_token (@$tokens) {
540
if ($cur_token->[0] eq "tag") {
541
# Within tags, encode * and _ so they don't conflict
542
# with their use in Markdown for italics and strong.
543
# We're replacing each such character with its
544
# corresponding MD5 checksum value; this is likely
545
# overkill, but it should prevent us from colliding
546
# with the escape values by accident.
547
$cur_token->[1] =~ s! \* !$g_escape_table{'*'}!ogx;
548
$cur_token->[1] =~ s! _ !$g_escape_table{'_'}!ogx;
549
$text .= $cur_token->[1];
551
my $t = $cur_token->[1];
552
$t = $self->_EncodeBackslashEscapes($t);
559
sub _EscapeSpecialCharsWithinTagAttributes {
561
# Within tags -- meaning between < and > -- encode [\ ` * _] so they
562
# don't conflict with their use in Markdown for code, italics and strong.
563
# We're replacing each such character with its corresponding MD5 checksum
564
# value; this is likely overkill, but it should prevent us from colliding
565
# with the escape values by accident.
567
my ($self, $text) = @_;
568
my $tokens ||= $self->_TokenizeHTML($text);
569
$text = ''; # rebuild $text from the tokens
571
foreach my $cur_token (@$tokens) {
572
if ($cur_token->[0] eq "tag") {
573
$cur_token->[1] =~ s! \\ !$g_escape_table{'\\'}!gox;
574
$cur_token->[1] =~ s{ (?<=.)</?code>(?=.) }{$g_escape_table{'`'}}gox;
575
$cur_token->[1] =~ s! \* !$g_escape_table{'*'}!gox;
576
$cur_token->[1] =~ s! _ !$g_escape_table{'_'}!gox;
578
$text .= $cur_token->[1];
585
# Turn Markdown link shortcuts into XHTML <a> tags.
587
my ($self, $text) = @_;
590
# First, handle reference-style links: [link text] [id]
593
( # wrap whole match in $1
595
($g_nested_brackets) # link text = $2
598
[ ]? # one optional space
599
(?:\n[ ]*)? # one optional newline followed by spaces
606
my $whole_match = $1;
610
if ($link_id eq "") {
611
$link_id = lc $link_text; # for shortcut links like [this][].
614
$link_id =~ s{[ ]*\n}{ }g; # turn embedded newlines into spaces
616
$self->_GenerateAnchor($whole_match, $link_text, $link_id);
620
# Next, inline-style links: [link text](url "optional title")
623
( # wrap whole match in $1
625
($g_nested_brackets) # link text = $2
629
($g_nested_parens) # href = $3
632
(['"]) # quote char = $5
635
[ \t]* # ignore any spaces/tabs between closing quote and )
636
)? # title is optional
641
my $whole_match = $1;
646
$self->_GenerateAnchor($whole_match, $link_text, undef, $url, $title);
650
# Last, handle reference-style shortcuts: [link text]
651
# These must come last in case you've also got [link test][1]
652
# or [link test](/foo)
655
( # wrap whole match in $1
657
([^\[\]]+) # link text = $2; can't contain '[' or ']'
662
my $whole_match = $1;
664
(my $link_id = lc $2) =~ s{[ ]*\n}{ }g; # lower-case and turn embedded newlines into spaces
666
$self->_GenerateAnchor($whole_match, $link_text, $link_id);
672
sub _GenerateAnchor {
673
# FIXME - Fugly, change to named params?
674
my ($self, $whole_match, $link_text, $link_id, $url, $title, $attributes) = @_;
678
$attributes = '' unless defined $attributes;
680
if ( !defined $url && defined $self->{_urls}{$link_id}) {
681
$url = $self->{_urls}{$link_id};
688
$url =~ s! \* !$g_escape_table{'*'}!gox; # We've got to encode these to avoid
689
$url =~ s! _ !$g_escape_table{'_'}!gox; # conflicting with italics/bold.
690
$url =~ s{^<(.*)>$}{$1}; # Remove <>'s surrounding URL, if present
692
$result = qq{<a href="$url"};
694
if ( !defined $title && defined $link_id && defined $self->{_titles}{$link_id} ) {
695
$title = $self->{_titles}{$link_id};
698
if ( defined $title ) {
699
$title =~ s/"/"/g;
700
$title =~ s! \* !$g_escape_table{'*'}!gox;
701
$title =~ s! _ !$g_escape_table{'_'}!gox;
702
$result .= qq{ title="$title"};
705
$result .= "$attributes>$link_text</a>";
712
# Turn Markdown image shortcuts into <img> tags.
714
my ($self, $text) = @_;
717
# First, handle reference-style labeled images: ![alt text][id]
720
( # wrap whole match in $1
722
(.*?) # alt text = $2
725
[ ]? # one optional space
726
(?:\n[ ]*)? # one optional newline followed by spaces
735
my $whole_match = $1;
739
if ($link_id eq '') {
740
$link_id = lc $alt_text; # for shortcut links like ![this][].
743
$self->_GenerateImage($whole_match, $alt_text, $link_id);
747
# Next, handle inline images: ![alt text](url "optional title")
748
# Don't forget: encode * and _
751
( # wrap whole match in $1
753
(.*?) # alt text = $2
757
($g_nested_parens) # src url - href = $3
760
(['"]) # quote char = $5
764
)? # title is optional
769
my $whole_match = $1;
777
$self->_GenerateImage($whole_match, $alt_text, undef, $url, $title);
784
# FIXME - Fugly, change to named params?
785
my ($self, $whole_match, $alt_text, $link_id, $url, $title, $attributes) = @_;
789
$attributes = '' unless defined $attributes;
792
$alt_text =~ s/"/"/g;
793
# FIXME - how about >
795
if ( !defined $url && defined $self->{_urls}{$link_id}) {
796
$url = $self->{_urls}{$link_id};
799
# If there's no such link ID, leave intact:
800
return $whole_match unless defined $url;
802
$url =~ s! \* !$g_escape_table{'*'}!ogx; # We've got to encode these to avoid
803
$url =~ s! _ !$g_escape_table{'_'}!ogx; # conflicting with italics/bold.
804
$url =~ s{^<(.*)>$}{$1}; # Remove <>'s surrounding URL, if present
806
if (!defined $title && length $link_id && defined $self->{_titles}{$link_id} && length $self->{_titles}{$link_id}) {
807
$title = $self->{_titles}{$link_id};
810
$result = qq{<img src="$url" alt="$alt_text"};
811
if (defined $title && length $title) {
812
$title =~ s! \* !$g_escape_table{'*'}!ogx;
813
$title =~ s! _ !$g_escape_table{'_'}!ogx;
814
$title =~ s/"/"/g;
815
$result .= qq{ title="$title"};
817
$result .= $attributes . $self->{empty_element_suffix};
823
my ($self, $text) = @_;
825
# Setext-style headers:
832
$text =~ s{ ^(.+)[ \t]*\n=+[ \t]*\n+ }{
833
$self->_GenerateHeader('1', $1);
836
$text =~ s{ ^(.+)[ \t]*\n-+[ \t]*\n+ }{
837
$self->_GenerateHeader('2', $1);
844
# ## Header 2 with closing hashes ##
850
^(\#{1,6}) # $1 = string of #'s
852
(.+?) # $2 = Header text
854
\#* # optional closing #'s (not counted)
857
my $h_level = length($1);
858
$self->_GenerateHeader($h_level, $2);
864
sub _GenerateHeader {
865
my ($self, $level, $id) = @_;
867
return "<h$level>" . $self->_RunSpanGamut($id) . "</h$level>\n\n";
872
# Form HTML ordered (numbered) and unordered (bulleted) lists.
874
my ($self, $text) = @_;
875
my $less_than_tab = $self->{tab_width} - 1;
877
# Re-usable patterns to match list item bullets and number markers:
878
my $marker_ul = qr/[*+-]/;
879
my $marker_ol = qr/\d+[.]/;
880
my $marker_any = qr/(?:$marker_ul|$marker_ol)/;
882
# Re-usable pattern to match any entirel ul or ol list:
886
[ ]{0,$less_than_tab}
887
(${marker_any}) # $3 = first list item marker
896
(?! # Negative lookahead for another list item marker
904
# We use a different prefix before nested lists than top-level lists.
905
# See extended comment in _ProcessListItems().
907
# Note: There's a bit of duplication here. My original implementation
908
# created a scalar regex pattern as the conditional result of the test on
909
# $self->{_list_level}, and then only ran the $text =~ s{...}{...}egmx
910
# substitution once, using the scalar as the pattern. This worked,
911
# everywhere except when running under MT on my hosting account at Pair
912
# Networks. There, this caused all rebuilds to be killed by the reaper (or
913
# perhaps they crashed, but that seems incredibly unlikely given that the
914
# same script on the same server ran fine *except* under MT. I've spent
915
# more time trying to figure out why this is happening than I'd like to
916
# admit. My only guess, backed up by the fact that this workaround works,
917
# is that Perl optimizes the substition when it can figure out that the
918
# pattern will never change, and when this optimization isn't on, we run
919
# afoul of the reaper. Thus, the slightly redundant code to that uses two
920
# static s/// patterns rather than one conditional pattern.
922
if ($self->{_list_level}) {
928
my $list_type = ($3 =~ m/$marker_ul/) ? "ul" : "ol";
929
# Turn double returns into triple returns, so that we can make a
930
# paragraph for the last item in a list, if necessary:
931
$list =~ s/\n{2,}/\n\n\n/g;
932
my $result = ( $list_type eq 'ul' ) ?
933
$self->_ProcessListItems($list, $marker_ul)
934
: $self->_ProcessListItems($list, $marker_ol);
935
$result = "<$list_type>\n" . $result . "</$list_type>\n";
945
my $list_type = ($3 =~ m/$marker_ul/) ? "ul" : "ol";
946
# Turn double returns into triple returns, so that we can make a
947
# paragraph for the last item in a list, if necessary:
948
$list =~ s/\n{2,}/\n\n\n/g;
949
my $result = ( $list_type eq 'ul' ) ?
950
$self->_ProcessListItems($list, $marker_ul)
951
: $self->_ProcessListItems($list, $marker_ol);
952
$result = "<$list_type>\n" . $result . "</$list_type>\n";
961
sub _ProcessListItems {
963
# Process the contents of a single ordered or unordered list, splitting it
964
# into individual list items.
967
my ($self, $list_str, $marker_any) = @_;
970
# The $self->{_list_level} global keeps track of when we're inside a list.
971
# Each time we enter a list, we increment it; when we leave a list,
972
# we decrement. If it's zero, we're not in a list anymore.
974
# We do this because when we're not inside a list, we want to treat
975
# something like this:
977
# I recommend upgrading to version
978
# 8. Oops, now this line is treated
981
# As a single paragraph, despite the fact that the second line starts
982
# with a digit-period-space sequence.
984
# Whereas when we're inside a list (or sub-list), that line will be
985
# treated as the start of a sub-list. What a kludge, huh? This is
986
# an aspect of Markdown's syntax that's hard to parse perfectly
987
# without resorting to mind-reading. Perhaps the solution is to
988
# change the syntax rules such that sub-lists must start with a
989
# starting cardinal number; e.g. "1." or "a.".
991
$self->{_list_level}++;
993
# trim trailing blank lines:
994
$list_str =~ s/\n{2,}\z/\n/;
998
(\n)? # leading line = $1
999
(^[ \t]*) # leading whitespace = $2
1000
($marker_any) [ \t]+ # list marker = $3
1001
((?s:.+?) # list item text = $4
1003
(?= \n* (\z | \2 ($marker_any) [ \t]+))
1006
my $leading_line = $1;
1007
my $leading_space = $2;
1009
if ($leading_line or ($item =~ m/\n{2,}/)) {
1010
$item = $self->_RunBlockGamut($self->_Outdent($item));
1013
# Recursion for sub-lists:
1014
$item = $self->_DoLists($self->_Outdent($item));
1016
$item = $self->_RunSpanGamut($item);
1019
"<li>" . $item . "</li>\n";
1022
$self->{_list_level}--;
1028
# Process Markdown `<pre><code>` blocks.
1031
my ($self, $text) = @_;
1035
( # $1 = the code block -- one or more lines, starting with a space/tab
1037
(?:[ ]{$self->{tab_width}} | \t) # Lines must start with a tab or a tab-width of spaces
1041
((?=^[ ]{0,$self->{tab_width}}\S)|\Z) # Lookahead for non-space at line-start, or end of doc
1044
my $result; # return value
1046
$codeblock = $self->_EncodeCode($self->_Outdent($codeblock));
1047
$codeblock = $self->_Detab($codeblock);
1048
$codeblock =~ s/\A\n+//; # trim leading newlines
1049
$codeblock =~ s/\n+\z//; # trim trailing newlines
1051
$result = "\n\n<pre><code>" . $codeblock . "\n</code></pre>\n\n";
1061
# * Backtick quotes are used for <code></code> spans.
1063
# * You can use multiple backticks as the delimiters if you want to
1064
# include literal backticks in the code span. So, this input:
1066
# Just type ``foo `bar` baz`` at the prompt.
1068
# Will translate to:
1070
# <p>Just type <code>foo `bar` baz</code> at the prompt.</p>
1072
# There's no arbitrary limit to the number of backticks you
1073
# can use as delimters. If you need three consecutive backticks
1074
# in your code, use four for delimiters, etc.
1076
# * You can use spaces to get literal backticks at the edges:
1078
# ... type `` `bar` `` ...
1082
# ... type <code>`bar`</code> ...
1085
my ($self, $text) = @_;
1088
(?<!\\) # Character before opening ` can't be a backslash
1089
(`+) # $1 = Opening run of `
1090
(.+?) # $2 = The code block
1092
\1 # Matching closer
1096
$c =~ s/^[ \t]*//g; # leading whitespace
1097
$c =~ s/[ \t]*$//g; # trailing whitespace
1098
$c = $self->_EncodeCode($c);
1107
# Encode/escape certain characters inside Markdown code runs.
1108
# The point is that in code, these characters are literals,
1109
# and lose their special Markdown meanings.
1114
# Encode all ampersands; HTML entities are not
1115
# entities within a Markdown code span.
1118
# Encode $'s, but only if we're running under Blosxom.
1119
# (Blosxom interpolates Perl variables in article bodies.)
1122
if (defined($blosxom::version)) {
1128
# Do the angle bracket song and dance:
1132
# Now, escape characters that are magic in Markdown:
1133
s! \* !$g_escape_table{'*'}!ogx;
1134
s! _ !$g_escape_table{'_'}!ogx;
1135
s! { !$g_escape_table{'{'}!ogx;
1136
s! } !$g_escape_table{'}'}!ogx;
1137
s! \[ !$g_escape_table{'['}!ogx;
1138
s! \] !$g_escape_table{']'}!ogx;
1139
s! \\ !$g_escape_table{'\\'}!ogx;
1144
sub _DoItalicsAndBold {
1145
my ($self, $text) = @_;
1147
# Handle at beginning of lines:
1148
$text =~ s{ ^(\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }
1149
{<strong>$2</strong>}gsx;
1151
$text =~ s{ ^(\*|_) (?=\S) (.+?) (?<=\S) \1 }
1154
# <strong> must go first:
1155
$text =~ s{ (?<=\W) (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }
1156
{<strong>$2</strong>}gsx;
1158
$text =~ s{ (?<=\W) (\*|_) (?=\S) (.+?) (?<=\S) \1 }
1161
# And now, a second pass to catch nested strong and emphasis special cases
1162
$text =~ s{ (?<=\W) (\*\*|__) (?=\S) (.+?[*_]*) (?<=\S) \1 }
1163
{<strong>$2</strong>}gsx;
1165
$text =~ s{ (?<=\W) (\*|_) (?=\S) (.+?) (?<=\S) \1 }
1171
sub _DoBlockQuotes {
1172
my ($self, $text) = @_;
1175
( # Wrap whole match in $1
1177
^[ \t]*>[ \t]? # '>' at the start of a line
1178
.+\n # rest of the first line
1179
(.+\n)* # subsequent consecutive lines
1185
$bq =~ s/^[ \t]*>[ \t]?//gm; # trim one level of quoting
1186
$bq =~ s/^[ \t]+$//mg; # trim whitespace-only lines
1187
$bq = $self->_RunBlockGamut($bq); # recurse
1190
# These leading spaces screw with <pre> content, so we need to fix that:
1199
"<blockquote>\n$bq\n</blockquote>\n\n";
1206
sub _FormParagraphs {
1209
# $text - string to process with html <p> tags
1211
my ($self, $text) = @_;
1213
# Strip leading and trailing lines:
1217
my @grafs = split(/\n{2,}/, $text);
1223
unless (defined( $self->{_html_blocks}{$_} )) {
1224
$_ = $self->_RunSpanGamut($_);
1231
# Unhashify HTML blocks
1234
if (defined( $self->{_html_blocks}{$_} )) {
1235
$_ = $self->{_html_blocks}{$_};
1239
return join "\n\n", @grafs;
1242
sub _EncodeAmpsAndAngles {
1243
# Smart processing for ampersands and angle brackets that need to be encoded.
1245
my ($self, $text) = @_;
1246
return '' if (!defined $text or !length $text);
1248
# Ampersand-encoding based entirely on Nat Irons's Amputator MT plugin:
1249
# http://bumppo.net/projects/amputator/
1250
$text =~ s/&(?!#?[xX]?(?:[0-9a-fA-F]+|\w+);)/&/g;
1253
$text =~ s{<(?![a-z/?\$!])}{<}gi;
1255
# And >'s - added by Fletcher Penney
1256
# $text =~ s{>(?![a-z/?\$!])}{>}gi;
1257
# Causes problems...
1259
# Remove encoding inside comments
1261
(?<=<!--) # Begin comment
1262
(.*?) # Anything inside
1263
(?=-->) # End comments
1274
sub _EncodeBackslashEscapes {
1276
# Parameter: String.
1277
# Returns: The string, with after processing the following backslash
1283
s! \\\\ !$g_escape_table{'\\'}!ogx; # Must process escaped backslashes first.
1284
s! \\` !$g_escape_table{'`'}!ogx;
1285
s! \\\* !$g_escape_table{'*'}!ogx;
1286
s! \\_ !$g_escape_table{'_'}!ogx;
1287
s! \\\{ !$g_escape_table{'{'}!ogx;
1288
s! \\\} !$g_escape_table{'}'}!ogx;
1289
s! \\\[ !$g_escape_table{'['}!ogx;
1290
s! \\\] !$g_escape_table{']'}!ogx;
1291
s! \\\( !$g_escape_table{'('}!ogx;
1292
s! \\\) !$g_escape_table{')'}!ogx;
1293
s! \\> !$g_escape_table{'>'}!ogx;
1294
s! \\\# !$g_escape_table{'#'}!ogx;
1295
s! \\\+ !$g_escape_table{'+'}!ogx;
1296
s! \\\- !$g_escape_table{'-'}!ogx;
1297
s! \\\. !$g_escape_table{'.'}!ogx;
1298
s{ \\! }{$g_escape_table{'!'}}ogx;
1304
my ($self, $text) = @_;
1306
$text =~ s{<((https?|ftp):[^'">\s]+)>}{<a href="$1">$1</a>}gi;
1308
# Email addresses: <address@domain.foo>
1315
[-a-z0-9]+(\.[-a-z0-9]+)*\.[a-z]+
1319
$self->_EncodeEmailAddress( $self->_UnescapeSpecialChars($1) );
1325
sub _EncodeEmailAddress {
1327
# Input: an email address, e.g. "foo@example.com"
1329
# Output: the email address as a mailto link, with each character
1330
# of the address encoded as either a decimal or hex entity, in
1331
# the hopes of foiling most address harvesting spam bots. E.g.:
1333
# <a href="mailto:foo@e
1334
# xample.com">foo
1335
# @example.com</a>
1337
# Based on a filter by Matthew Wickline, posted to the BBEdit-Talk
1338
# mailing list: <http://tinyurl.com/yu7ue>
1341
my ($self, $addr) = @_;
1344
sub { '&#' . ord(shift) . ';' },
1345
sub { '&#x' . sprintf( "%X", ord(shift) ) . ';' },
1349
$addr = "mailto:" . $addr;
1353
if ( $char eq '@' ) {
1354
# this *must* be encoded. I insist.
1355
$char = $encode[int rand 1]->($char);
1357
elsif ( $char ne ':' ) {
1358
# leave ':' alone (to spot mailto: later)
1360
# roughly 10% raw, 45% hex, 45% dec
1362
$r > .9 ? $encode[2]->($char) :
1363
$r < .45 ? $encode[1]->($char) :
1370
$addr = qq{<a href="$addr">$addr</a>};
1371
$addr =~ s{">.+?:}{">}; # strip the mailto: from the visible part
1376
sub _UnescapeSpecialChars {
1378
# Swap back in all the special characters we've hidden.
1380
my ($self, $text) = @_;
1382
while( my($char, $hash) = each(%g_escape_table) ) {
1383
$text =~ s/$hash/$char/g;
1390
# Parameter: String containing HTML markup.
1391
# Returns: Reference to an array of the tokens comprising the input
1392
# string. Each token is either a tag (possibly with nested,
1393
# tags contained therein, such as <a href="<MTFoo>">, or a
1394
# run of text between tags. Each element of the array is a
1395
# two-element array; the first is either 'tag' or 'text';
1396
# the second is the actual value.
1399
# Derived from the _tokenize() subroutine from Brad Choate's MTRegex plugin.
1400
# <http://www.bradchoate.com/past/mtregex.php>
1403
my ($self, $str) = @_;
1405
my $len = length $str;
1409
my $nested_tags = join('|', ('(?:<[a-z/!$](?:[^<>]') x $depth) . (')*>)' x $depth);
1410
my $match = qr/(?s: <! ( -- .*? -- \s* )+ > ) | # comment
1411
(?s: <\? .*? \?> ) | # processing instruction
1412
$nested_tags/iox; # nested tags
1414
while ($str =~ m/($match)/og) {
1416
my $sec_start = pos $str;
1417
my $tag_start = $sec_start - length $whole_tag;
1418
if ($pos < $tag_start) {
1419
push @tokens, ['text', substr($str, $pos, $tag_start - $pos)];
1421
push @tokens, ['tag', $whole_tag];
1424
push @tokens, ['text', substr($str, $pos, $len - $pos)] if $pos < $len;
1430
# Remove one level of line-leading tabs or spaces
1432
my ($self, $text) = @_;
1434
$text =~ s/^(\t|[ ]{1,$self->{tab_width}})//gm;
1440
# Cribbed from a post by Bart Lateur:
1441
# <http://www.nntp.perl.org/group/perl.macperl.anyperl/154>
1443
my ($self, $text) = @_;
1445
# FIXME - Better anchor/regex would be quicker.
1448
#$text =~ s{(.*?)\t}{$1.(' ' x ($self->{tab_width} - length($1) % $self->{tab_width}))}ge;
1450
# Much swifter, but pretty hateful:
1451
do {} while ($text =~ s{^(.*?)\t}{$1.(' ' x ($self->{tab_width} - length($1) % $self->{tab_width}))}mge);
1455
sub _ConvertCopyright {
1456
my ($self, $text) = @_;
1457
# Convert to an XML compatible form of copyright symbol
1459
$text =~ s/©/©/gi;
1468
=head1 OTHER IMPLEMENTATIONS
1470
Markdown has been re-implemented in a number of languages, and with a number of additions.
1472
Those that I have found are listed below:
1476
=item C - <http://www.pell.portland.or.us/~orc/Code/discount>
1478
Discount - Original Markdown, but in C. Fastest implementation available, and passes MDTest.
1479
Adds it's own set of custom features.
1481
=item python - <http://www.freewisdom.org/projects/python-markdown/>
1483
Python Markdown which is mostly compatible with the original, with an interesting extension API.
1485
=item ruby (maruku) - <http://maruku.rubyforge.org/>
1487
One of the nicest implementations out there. Builds a parse tree internally so very flexible.
1489
=item php - <http://michelf.com/projects/php-markdown/>
1491
A direct port of Markdown.pl, also has a separately maintained 'extra' version,
1492
which adds a number of features that were borrowed by MultiMarkdown.
1494
=item lua - <http://www.frykholm.se/files/markdown.lua>
1496
Port to lua. Simple and lightweight (as lua is).
1498
=item haskell - <http://johnmacfarlane.net/pandoc/>
1500
Pandoc is a more general library, supporting Markdown, reStructuredText, LaTeX and more.
1502
=item javascript - <http://www.attacklab.net/showdown-gui.html>
1504
Direct(ish) port of Markdown.pl to JavaScript
1510
To file bug reports or feature requests please send email to:
1512
bug-Text-Markdown@rt.cpan.org
1514
Please include with your report: (1) the example input; (2) the output
1515
you expected; (3) the output Markdown actually produced.
1517
=head1 VERSION HISTORY
1519
See the Changes file for detailed release notes for this version.
1524
http://daringfireball.net/
1526
PHP port and other contributions by Michel Fortin
1529
MultiMarkdown changes by Fletcher Penney
1530
http://fletcher.freeshell.org/
1532
CPAN Module Text::MultiMarkdown (based on Text::Markdown by Sebastian
1533
Riedel) originally by Darren Kulp (http://kulp.ch/)
1535
This module is maintained by: Tomas Doran http://www.bobtfish.net/
1537
=head1 THIS DISTRIBUTION
1539
Please note that this distribution is a fork of John Gruber's original Markdown project,
1540
and it *is not* in any way blessed by him.
1542
Whilst this code aims to be compatible with the original Markdown.pl (and incorporates
1543
and passes the Markdown test suite) whilst fixing a number of bugs in the original -
1544
there may be differences between the behavior of this module and Markdown.pl. If you find
1545
any differences where you believe Text::Markdown behaves contrary to the Markdown spec,
1546
please report them as bugs.
1548
Text::Markdown *does not* extend the markdown dialect in any way from that which is documented at
1549
daringfireball. If you want additional features, you should look at L<Text::MultiMarkdown>.
1551
=head1 COPYRIGHT AND LICENSE
1553
Original Code Copyright (c) 2003-2004 John Gruber
1554
<http://daringfireball.net/>
1555
All rights reserved.
1557
MultiMarkdown changes Copyright (c) 2005-2006 Fletcher T. Penney
1558
<http://fletcher.freeshell.org/>
1559
All rights reserved.
1561
Text::MultiMarkdown changes Copyright (c) 2006-2008 Darren Kulp
1562
<http://kulp.ch> and Tomas Doran <http://www.bobtfish.net>
1564
Redistribution and use in source and binary forms, with or without
1565
modification, are permitted provided that the following conditions are
1568
* Redistributions of source code must retain the above copyright notice,
1569
this list of conditions and the following disclaimer.
1571
* Redistributions in binary form must reproduce the above copyright
1572
notice, this list of conditions and the following disclaimer in the
1573
documentation and/or other materials provided with the distribution.
1575
* Neither the name "Markdown" nor the names of its contributors may
1576
be used to endorse or promote products derived from this software
1577
without specific prior written permission.
1579
This software is provided by the copyright holders and contributors "as
1580
is" and any express or implied warranties, including, but not limited
1581
to, the implied warranties of merchantability and fitness for a
1582
particular purpose are disclaimed. In no event shall the copyright owner
1583
or contributors be liable for any direct, indirect, incidental, special,
1584
exemplary, or consequential damages (including, but not limited to,
1585
procurement of substitute goods or services; loss of use, data, or
1586
profits; or business interruption) however caused and on any theory of
1587
liability, whether in contract, strict liability, or tort (including
1588
negligence or otherwise) arising in any way out of the use of this
1589
software, even if advised of the possibility of such damage.