Writing Extensions for Python-Markdown
======================================

Overview
--------

Python-Markdown includes an API for extension writers to plug their own
custom functionality and/or syntax into the parser. There are preprocessors,
which allow you to alter the source before it is passed to the parser;
inline patterns, which allow you to add, remove or override the syntax of
any inline elements; and postprocessors, which allow munging of the
output of the parser before it is returned. If you really want to dive in,
there are also blockprocessors, which are part of the core BlockParser.

As the parser builds an [ElementTree][] object which is later rendered
as Unicode text, there are also some helpers provided to ease manipulation of
the tree. Each part of the API is discussed in its respective section below.
Additionally, reading the source of some [[Available Extensions]] may be helpful.
For example, the [[Footnotes]] extension uses most of the features documented
here.

* [Preprocessors][]
* [InlinePatterns][]
* [Treeprocessors][]
* [Postprocessors][]
* [BlockParser][]
* [Working with the ElementTree][]
* [Integrating your code into Markdown][]
    * [extendMarkdown][]
    * [OrderedDict][]
    * [registerExtension][]
    * [Config Settings][]
    * [makeExtension][]

<h3 id="preprocessors">Preprocessors</h3>

Preprocessors munge the source text before it is passed into the Markdown
core. This is an excellent place to clean up bad syntax, extract things the
parser may otherwise choke on and perhaps even store them for later retrieval.

Preprocessors should inherit from ``markdown.preprocessors.Preprocessor`` and
implement a ``run`` method with one argument ``lines``. The ``run`` method of
each Preprocessor will be passed the entire source text as a list of Unicode
strings. Each string will contain one line of text. The ``run`` method should
return either that list, or an altered list of Unicode strings.

A pseudo example:

    class MyPreprocessor(markdown.preprocessors.Preprocessor):
        def run(self, lines):
            new_lines = []
            for line in lines:
                m = MYREGEX.match(line)
                if m:
                    pass  # do stuff with the matching line
                else:
                    new_lines.append(line)
            return new_lines
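To make the pseudo example concrete, here is a minimal sketch that runs without Markdown installed. The class is a plain stand-in for a ``Preprocessor`` subclass (only the ``run(lines)`` contract matters here), and the ``%% comment`` syntax, ``COMMENT_RE`` and the ``stash`` attribute are hypothetical names chosen for illustration.

```python
import re

# Hypothetical syntax: a line starting with "%%" is an author comment.
COMMENT_RE = re.compile(r"^%%\s?(.*)$")


class CommentPreprocessor:
    """Stand-in for a markdown.preprocessors.Preprocessor subclass."""

    def __init__(self):
        self.stash = []  # comments removed from the source, kept for later

    def run(self, lines):
        new_lines = []
        for line in lines:
            m = COMMENT_RE.match(line)
            if m:
                # Extract the comment and keep it out of the parser's way.
                self.stash.append(m.group(1))
            else:
                new_lines.append(line)
        return new_lines


pre = CommentPreprocessor()
out = pre.run(["Title", "%% remember to proofread", "Body text"])
```

In a real extension the same class body would inherit from ``markdown.preprocessors.Preprocessor`` instead of being a bare class.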

<h3 id="inlinepatterns">Inline Patterns</h3>

Inline Patterns implement the inline HTML element syntax for Markdown such as
``*emphasis*`` or ``[links](http://example.com)``. Pattern objects should be
instances of classes that inherit from ``markdown.inlinepatterns.Pattern`` or
one of its children. Each pattern object uses a single regular expression and
must have the following methods:

* **``getCompiledRegExp()``**:

    Returns a compiled regular expression.

* **``handleMatch(m)``**:

    Accepts a match object and returns an ElementTree element or a plain
    Unicode string.

Note that any regular expression returned by ``getCompiledRegExp`` must capture
the whole block. Therefore, they should all start with ``r'^(.*?)'`` and end
with ``r'(.*?)$'``. When using the default ``getCompiledRegExp()`` method
provided in the ``Pattern`` class, you can pass in a regular expression without
that wrapping and ``getCompiledRegExp`` will wrap your expression for you. This
means that the first group of your match will be ``m.group(2)``, as
``m.group(1)`` will match everything before the pattern.
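The group numbering is easier to see with the wrapping written out. The sketch below uses only the ``re`` module; ``wrap_pattern`` is a hypothetical helper that mimics the wrapping described above (a leading ``^(.*?)`` and a trailing ``(.*?)$``), not the library's own code.

```python
import re


def wrap_pattern(pattern):
    """Wrap an inline regex so that it captures the whole block of text."""
    # group(1) is the text before the match, the last group the text after it.
    return re.compile(r"^(.*?)%s(.*?)$" % pattern, re.DOTALL)


EMPHASIS = r"\*([^*]+)\*"  # an oversimplified emphasis regex with one group

m = wrap_pattern(EMPHASIS).match("some *emphasized* text")
before = m.group(1)  # everything before the pattern
inner = m.group(2)   # first group of the unwrapped pattern
after = m.group(3)   # everything after the pattern
```

Because the wrapper adds one group in front, every group of your own pattern is shifted up by one.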

For an example, consider this simplified emphasis pattern:

    class EmphasisPattern(markdown.inlinepatterns.Pattern):
        def handleMatch(self, m):
            el = markdown.etree.Element('em')
            # group(1) is the text before the match, so the single group
            # of the pattern below is group(2)
            el.text = m.group(2)
            return el

As discussed in [Integrating Your Code Into Markdown][], an instance of this
class will need to be provided to Markdown. That instance would be created
like so:

    # an oversimplified regex
    MYPATTERN = r'\*([^*]+)\*'
    # pass in pattern and create instance
    emphasis = EmphasisPattern(MYPATTERN)

Actually it would not be necessary to create that pattern (and not just because
a more sophisticated emphasis pattern already exists in Markdown). The fact is,
that example pattern is not very DRY. A pattern for `**strong**` text would
be almost identical, with the exception that it would create a 'strong' element.
Therefore, Markdown provides a number of generic pattern classes that can
provide some common functionality. For example, both emphasis and strong are
implemented with separate instances of the ``SimpleTagPattern`` listed below.
Feel free to use or extend any of these Pattern classes.

**Generic Pattern Classes**

* **``SimpleTextPattern(pattern)``**:

    Returns simple text of ``group(2)`` of a ``pattern``.

* **``SimpleTagPattern(pattern, tag)``**:

    Returns an element of type "`tag`" with a text attribute of ``group(3)``
    of a ``pattern``. ``tag`` should be a string of an HTML element (e.g. 'em').

* **``SubstituteTagPattern(pattern, tag)``**:

    Returns an element of type "`tag`" with no children or text (e.g. 'br').

There may be other Pattern classes in the Markdown source that you could extend
or use as well. Read through the source and see if there is anything you can
use. You might even get a few ideas for different approaches to your specific
situation.

<h3 id="treeprocessors">Treeprocessors</h3>

Treeprocessors manipulate an ElementTree object after it has passed through the
core BlockParser. This is where additional manipulation of the tree takes
place. Additionally, the InlineProcessor is a Treeprocessor which steps through
the tree and runs the InlinePatterns on the text of each Element in the tree.

A Treeprocessor should inherit from ``markdown.treeprocessors.Treeprocessor``
and over-ride the ``run`` method, which takes one argument ``root`` (an
ElementTree object) and returns either that root element or a modified root
element.

A pseudo example:

    class MyTreeprocessor(markdown.treeprocessors.Treeprocessor):
        def run(self, root):
            # do stuff
            return my_modified_root

For specifics on manipulating the ElementTree, see
[Working with the ElementTree][] below.
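As a rough illustration of the ``run(root)`` contract, the sketch below numbers every ``<h2>`` element in a tree. It uses the standard library's ``xml.etree.ElementTree`` in place of ``markdown.etree`` so that it runs on its own, and ``HeadingIdTreeprocessor`` is a hypothetical stand-in class rather than a real ``Treeprocessor`` subclass.

```python
import xml.etree.ElementTree as etree


class HeadingIdTreeprocessor:
    """Stand-in for a markdown.treeprocessors.Treeprocessor subclass."""

    def run(self, root):
        # iter() walks the whole tree in document order.
        for n, el in enumerate(root.iter("h2"), start=1):
            el.set("id", "section-%d" % n)
        return root  # the same root, modified in place


root = etree.fromstring("<div><h2>One</h2><p>text</p><h2>Two</h2></div>")
root = HeadingIdTreeprocessor().run(root)
ids = [h.get("id") for h in root.iter("h2")]
```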

<h3 id="postprocessors">Postprocessors</h3>

Postprocessors manipulate the document after the ElementTree has been
serialized into a string. Postprocessors should be used to work with the
text just before output.

A Postprocessor should inherit from ``markdown.postprocessors.Postprocessor``
and over-ride the ``run`` method, which takes one argument ``text`` and returns
a Unicode string.

For example, this may be an appropriate place to add a table of contents to a
document:

    class TocPostprocessor(markdown.postprocessors.Postprocessor):
        def run(self, text):
            return MYMARKERRE.sub(MyToc, text)
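A self-contained version of that idea might look like the following. Everything named here is an assumption made for the sketch: the ``[TOC]`` marker, the two regular expressions, and the choice to list only ``<h2>`` headings. The class is a bare stand-in for a ``Postprocessor`` subclass.

```python
import re

TOC_MARKER_RE = re.compile(r"<p>\[TOC\]</p>")  # hypothetical marker paragraph
HEADING_RE = re.compile(r"<h2>(.*?)</h2>")


class TocPostprocessor:
    """Stand-in for a markdown.postprocessors.Postprocessor subclass."""

    def run(self, text):
        items = "".join("<li>%s</li>" % t for t in HEADING_RE.findall(text))
        toc = "<ul>%s</ul>" % items
        # A function replacement avoids backslash processing of the TOC text.
        return TOC_MARKER_RE.sub(lambda m: toc, text)


html = "<p>[TOC]</p><h2>Intro</h2><h2>Usage</h2>"
result = TocPostprocessor().run(html)
```

Working on the serialized string keeps the example short, although anything that needs real structure is usually easier to do in a Treeprocessor.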

<h3 id="blockparser">BlockParser</h3>

Sometimes, pre/tree/postprocessors and Inline Patterns aren't going to do what
you need. Perhaps you want a new type of block that needs to be integrated
into the core parsing. In such a situation, you can add/change/remove
functionality of the core ``BlockParser``. The BlockParser is composed of a
number of Blockprocessors. The BlockParser steps through each block of text
(split by blank lines) and passes each block to the appropriate Blockprocessor.
That Blockprocessor parses the block and adds it to the ElementTree. The
[[Definition Lists]] extension would be a good example of an extension that
adds/modifies Blockprocessors.

A Blockprocessor should inherit from ``markdown.blockprocessors.BlockProcessor``
and implement both the ``test`` and ``run`` methods.

The ``test`` method is used by BlockParser to identify the type of block.
Therefore the ``test`` method must return a boolean value. If the test returns
``True``, then the BlockParser will call that Blockprocessor's ``run`` method.
If it returns ``False``, the BlockParser will move on to the next
BlockProcessor.

The **``test``** method takes two arguments:

* **``parent``**: The parent etree Element of the block. This can be useful as
  the block may need to be treated differently if it is inside a list, for
  example.

* **``block``**: A string of the current block of text. The test may be a
  simple string method (such as ``block.startswith(some_text)``) or a complex
  regular expression.

The **``run``** method takes two arguments:

* **``parent``**: A pointer to the parent etree Element of the block. The run
  method will most likely attach additional nodes to this parent. Note that
  nothing is returned by the method. The ElementTree object is altered in place.

* **``blocks``**: A list of all remaining blocks of the document. Your run
  method must remove (pop) the first block from the list (which is altered in
  place, not returned) and parse that block. You may find that a block of text
  legitimately contains multiple block types. Therefore, after processing the
  first type, your processor can insert the remaining text into the beginning
  of the ``blocks`` list for future parsing.
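The ``test``/``run`` contract can be sketched as below. This is a stand-in class, not a real ``BlockProcessor`` subclass, and the ``!!! `` prefix and the ``note`` class name are hypothetical. It uses the standard library's ``xml.etree.ElementTree`` so it runs without Markdown installed.

```python
import xml.etree.ElementTree as etree


class NoteBlockProcessor:
    """Stand-in for a markdown.blockprocessors.BlockProcessor subclass."""

    PREFIX = "!!! "  # hypothetical syntax marking a note

    def test(self, parent, block):
        return block.startswith(self.PREFIX)

    def run(self, parent, blocks):
        block = blocks.pop(0)  # the list is altered in place
        first, _, rest = block.partition("\n")
        div = etree.SubElement(parent, "div")  # attach to parent in place
        div.set("class", "note")
        div.text = first[len(self.PREFIX):]
        if rest:
            # The remaining lines are a different block type; hand them back.
            blocks.insert(0, rest)


root = etree.Element("div")
blocks = ["!!! Be careful\nA normal paragraph", "Another block"]
proc = NoteBlockProcessor()
if proc.test(root, blocks[0]):
    proc.run(root, blocks)
```

After the call, the note has been attached to ``root`` and the leftover paragraph text sits at the front of ``blocks`` for the next processor.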

Please be aware that a single block can span multiple text blocks. For example,
the official Markdown syntax rules state that a blank line does not end a
Code Block. If the next block of text is also indented, then it is part of
the previous block. Therefore, the BlockParser was specifically designed to
address these types of situations. If you look at the ``CodeBlockProcessor``
in the core, you will note that it checks the last child of the ``parent``.
If the last child is a code block (``<pre><code>...</code></pre>``), then it
appends that block to the previous code block rather than creating a new
code block.

Each BlockProcessor has the following utility methods available:

* **``lastChild(parent)``**:

    Returns the last child of the given etree Element or ``None`` if it had no
    children.

* **``detab(text)``**:

    Removes one level of indent (four spaces by default) from the front of each
    line of the given text string.

* **``looseDetab(text, level)``**:

    Removes "level" levels of indent (defaults to 1) from the front of each line
    of the given text string. However, this method allows secondary lines to
    not be indented, as do some parts of the Markdown syntax.

Each BlockProcessor also has a pointer to the containing BlockParser instance at
``self.parser``, which can be used to check or alter the state of the parser.
The BlockParser tracks its state in a stack at ``parser.state``. The state
stack is an instance of the ``State`` class.

**``State``** is a subclass of ``list`` and has the additional methods:

* **``set(state)``**:

    Set a new state to string ``state``. The new state is appended to the end
    of the stack.

* **``reset()``**:

    Step back one step in the stack. The last state at the end is removed from
    the stack.

* **``isstate(state)``**:

    Test that the top (current) level of the stack is of the given string
    ``state``.

Note that to ensure that the state stack doesn't become corrupted, each time a
state is set for a block, that state *must* be reset when the parser finishes
parsing that block.
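The three methods are small enough to sketch in full. The class below is written from the descriptions above and is not copied from the library. The ``try``/``finally`` pairing is one possible way to guarantee the matching ``reset``, shown here as a suggestion.

```python
class State(list):
    """A stack of parser states, sketched from the descriptions above."""

    def set(self, state):
        self.append(state)  # push the new state onto the end of the stack

    def reset(self):
        self.pop()  # step back one level

    def isstate(self, state):
        # True only if the stack is non-empty and its top equals `state`.
        return bool(self) and self[-1] == state


state = State()
state.set("list")
try:
    # ... parse the nested blocks of a list item here ...
    inside_list = state.isstate("list")
finally:
    state.reset()  # always undo the set(), even if parsing raised
```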

An instance of the **``BlockParser``** is found at ``Markdown.parser``.
``BlockParser`` has the following methods:

* **``parseDocument(lines)``**:

    Given a list of lines, an ElementTree object is returned. This should be
    passed an entire document and is the only method the ``Markdown`` class
    calls directly.

* **``parseChunk(parent, text)``**:

    Parses a chunk of markdown text composed of multiple blocks and attaches
    those blocks to the ``parent`` Element. The ``parent`` is altered in place
    and nothing is returned. Extensions would most likely use this method for
    block parsing.

* **``parseBlocks(parent, blocks)``**:

    Parses a list of blocks of text and attaches those blocks to the ``parent``
    Element. The ``parent`` is altered in place and nothing is returned. This
    method will generally only be used internally to recursively parse nested
    blocks of text.

While it is not recommended, an extension could subclass or completely replace
the ``BlockParser``. The new class would have to provide the same public API.
However, be aware that other extensions may expect the core parser provided
and will not work with such a drastically different parser.

<h3 id="working_with_et">Working with the ElementTree</h3>

As mentioned, the Markdown parser converts a source document to an
[ElementTree][] object before serializing that back to Unicode text.
Markdown has provided some helpers to ease that manipulation within the context
of the Markdown module.

First, to get access to the ElementTree module, import ElementTree from
``markdown`` rather than importing it directly. This will ensure you are using
the same version of ElementTree as markdown. The module is named ``etree``
within Markdown.

    from markdown import etree

``markdown.etree`` tries to import ElementTree from any known location, first
as a standard library module (from ``xml.etree`` in Python 2.5), then as a third
party package (``ElementTree``). In each instance, ``cElementTree`` is tried
first, then ``ElementTree`` if the faster C implementation is not available on
your system.

Sometimes you may want text inserted into an element to be parsed by
[InlinePatterns][]. In such a situation, simply insert the text as you normally
would and the text will be automatically run through the InlinePatterns.
However, if you do *not* want some text to be parsed by InlinePatterns,
then insert the text as an ``AtomicString``.

    some_element.text = markdown.AtomicString(some_text)

Here's a basic example which creates an HTML table (note that the contents of
the second cell (``td2``) will be run through InlinePatterns later):

    table = etree.Element("table")
    table.set("cellpadding", "2")                      # Set cellpadding to 2
    tr = etree.SubElement(table, "tr")                 # Add child tr to table
    td1 = etree.SubElement(tr, "td")                   # Add child td1 to tr
    td1.text = markdown.AtomicString("Cell content")   # Add plain text content
    td2 = etree.SubElement(tr, "td")                   # Add second td to tr
    td2.text = "*text* with **inline** formatting."    # Add markup text
    table.tail = "Text after table"                    # Add text after table

You can also manipulate an existing tree. Consider the following example which
adds a ``class`` attribute to ``<a>`` elements:

    def set_link_class(element):
        for child in element:
            if child.tag == "a":
                child.set("class", "myclass")  # set the class attribute
            set_link_class(child)  # run recursively on children
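To try the recursive walk outside of Markdown, the sketch below applies the same function to a small hand-built tree. It imports the standard library's ``xml.etree.ElementTree`` directly, which is only a convenience for a standalone script (inside an extension you would use ``markdown.etree`` as described above). The ``css_class`` parameter is an addition for illustration.

```python
import xml.etree.ElementTree as etree


def set_link_class(element, css_class="myclass"):
    """Add a class attribute to every <a> below `element`."""
    for child in element:
        if child.tag == "a":
            child.set("class", css_class)
        set_link_class(child, css_class)  # recurse into the children


root = etree.fromstring(
    '<div><p>See <a href="#one">one</a> and '
    '<em><a href="#two">two</a></em>.</p></div>'
)
set_link_class(root)
classes = [a.get("class") for a in root.iter("a")]
```

Both links receive the class, including the one nested inside ``<em>``, because the function recurses into every child rather than only the top level.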

For more information about working with ElementTree see the ElementTree
[Documentation](http://effbot.org/zone/element-index.htm)
([Python Docs](http://docs.python.org/lib/module-xml.etree.ElementTree.html)).

<h3 id="integrating_into_markdown">Integrating Your Code Into Markdown</h3>

Once you have the various pieces of your extension built, you need to tell
Markdown about them and ensure that they are run in the proper sequence.
Markdown accepts an ``Extension`` instance for each extension. Therefore, you
will need to define a class that extends ``markdown.Extension`` and over-rides
the ``extendMarkdown`` method. Within this class you will manage configuration
options for your extension and attach the various processors and patterns to
the Markdown instance.

It is important to note that the order of the various processors and patterns
matters. For example, if we replace ``http://...`` links with ``<a>`` elements,
and *then* try to deal with inline html, we will end up with a mess. Therefore,
the various types of processors and patterns are stored within an instance of
the Markdown class in [OrderedDict][]s. Your ``Extension`` class will need to
manipulate those OrderedDicts appropriately. You may insert instances of your
processors and patterns into the appropriate location in an OrderedDict, remove
a built-in instance, or replace a built-in instance with your own.

<h4 id="extendmarkdown">extendMarkdown</h4>

The ``extendMarkdown`` method of a ``markdown.Extension`` class accepts two
arguments:

* **``md``**:

    A pointer to the instance of the Markdown class. You should use this to
    access the [OrderedDict][]s of processors and patterns. They are found
    under the following attributes:

    * ``md.preprocessors``
    * ``md.inlinePatterns``
    * ``md.parser.blockprocessors``
    * ``md.treeprocessors``
    * ``md.postprocessors``

    Some other things you may want to access in the markdown instance are:

    * ``md.htmlStash``
    * ``md.output_formats``
    * ``md.set_output_format()``
    * ``md.registerExtension()``

* **``md_globals``**:

    Contains all the various global variables within the markdown module.

Of course, with access to those items, theoretically you have the option of
changing anything through various [monkey_patching][] techniques. However, you
should be aware that the various undocumented or private parts of markdown
may change without notice and your monkey_patches may break with a new release.
Therefore, what you really should be doing is inserting processors and patterns
into the markdown pipeline. Consider yourself warned.

[monkey_patching]: http://en.wikipedia.org/wiki/Monkey_patch

A simple example:

    class MyExtension(markdown.Extension):
        def extendMarkdown(self, md, md_globals):
            # Insert instance of 'mypattern' before 'references' pattern
            md.inlinePatterns.add('mypattern', MyPattern(md), '<references')

<h4 id="ordereddict">OrderedDict</h4>

An OrderedDict is a dictionary-like object that retains the order of its
items. The items are ordered in the order in which they were appended to
the OrderedDict. However, an item can also be inserted into the OrderedDict
in a specific location in relation to the existing items.

Think of OrderedDict as a combination of a list and a dictionary as it has
methods common to both. For example, you can get and set items using the
``od[key] = value`` syntax and the methods ``keys()``, ``values()``, and
``items()`` work as expected with the keys, values and items returned in the
proper order. At the same time, you can use ``insert()``, ``append()``, and
``index()`` as you would with a list.

Generally speaking, within Markdown extensions you will be using the special
helper method ``add()`` to add additional items to an existing OrderedDict.

The ``add()`` method accepts three arguments:

* **``key``**: A string. The key is used for later reference to the item.

* **``value``**: The object instance stored in this item.

* **``location``**: Optional. The item's location in relation to other items.

    Note that the location can consist of a few different values:

    * The special strings ``"_begin"`` and ``"_end"`` insert that item at the
      beginning or end of the OrderedDict respectively.

    * A less-than sign (``<``) followed by an existing key (e.g.
      ``"<somekey"``) inserts that item before the existing key.

    * A greater-than sign (``>``) followed by an existing key (e.g.
      ``">somekey"``) inserts that item after the existing key.

Consider the following example:

    >>> import markdown
    >>> od = markdown.OrderedDict()
    >>> od['one'] = 1            # The same as: od.add('one', 1, '_begin')
    >>> od['three'] = 3          # The same as: od.add('three', 3, '>one')
    >>> od['four'] = 4           # The same as: od.add('four', 4, '_end')
    >>> od.items()
    [("one", 1), ("three", 3), ("four", 4)]

Note that when building an OrderedDict in order, the extra features of the
``add`` method offer no real value and are not necessary. However, when
manipulating an existing OrderedDict, ``add`` can be very helpful. So let's
insert another item into the OrderedDict.

    >>> od.add('two', 2, '>one')         # Insert after 'one'
    >>> od.values()
    [1, 2, 3, 4]

Now let's insert another item.

    >>> od.add('twohalf', 2.5, '<three') # Insert before 'three'
    >>> od.keys()
    ["one", "two", "twohalf", "three", "four"]

Note that we also could have set the location of "twohalf" to be 'after two'
(i.e.: ``'>two'``). However, it's unlikely that you will have control over the
order in which extensions will be loaded, and this could affect the final
sorted order of an OrderedDict. For example, suppose an extension adding
'twohalf' in the above examples was loaded before a separate extension which
adds 'two'. You may need to take this into consideration when adding your
extension components to the various markdown OrderedDicts.
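The load-order caveat can be played through with a toy stand-in. ``PositionalDict`` below is a hypothetical class written for this illustration. It is not Markdown's ``OrderedDict``, and raising ``ValueError`` for an unknown key is this toy's behaviour only.

```python
class PositionalDict:
    """Toy ordered mapping with an add(key, value, location) helper."""

    def __init__(self):
        self._keys = []
        self._values = {}

    def add(self, key, value, location="_end"):
        if location == "_begin":
            index = 0
        elif location == "_end":
            index = len(self._keys)
        elif location[:1] in ("<", ">"):
            # list.index raises ValueError if the referenced key is missing.
            index = self._keys.index(location[1:])
            if location[0] == ">":
                index += 1
        else:
            raise ValueError("unsupported location: %r" % location)
        self._keys.insert(index, key)
        self._values[key] = value

    def keys(self):
        return list(self._keys)


od = PositionalDict()
od.add("one", 1)
od.add("three", 3)
od.add("twohalf", 2.5, "<three")  # anchored to a key that already exists
od.add("two", 2, ">one")          # added later, still lands in order
order = od.keys()
```

Anchoring 'twohalf' to 'three' gives the same final order whichever of the two extensions loads first, whereas ``'>two'`` would depend on 'two' already being present.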

Once an OrderedDict is created, the items are available via key:

    MyNode = od['somekey']

Therefore, to delete an existing item:

    del od['somekey']

To change the value of an existing item (leaving location unchanged):

    od['somekey'] = MyNewObject()

To change the location of an existing item:

    od.link('somekey', '<otherkey')

<h4 id="registerextension">registerExtension</h4>

Some extensions may need to have their state reset between multiple runs of the
Markdown class. For example, consider the following use of the [[Footnotes]]
extension:

    md = markdown.Markdown(extensions=['footnotes'])
    html1 = md.convert(text_with_footnote)
    md.reset()
    html2 = md.convert(text_without_footnote)

Without calling ``reset``, the footnote definitions from the first document will
be inserted into the second document as they are still stored within the class
instance. Therefore the ``Extension`` class needs to define a ``reset`` method
that will reset the state of the extension (e.g. ``self.footnotes = {}``).
However, as many extensions do not have a need for ``reset``, ``reset`` is only
called on extensions that are registered.

To register an extension, call ``md.registerExtension`` from within your
``extendMarkdown`` method:

    def extendMarkdown(self, md, md_globals):
        md.registerExtension(self)
        # insert processors and patterns here

Then, each time ``reset`` is called on the Markdown instance, the ``reset``
method of each registered extension will be called as well. You should also
note that ``reset`` will be called on each registered extension after it is
initialized the first time. Keep that in mind when over-riding the extension's
``reset`` method.
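The registration mechanism can be mimicked in a few lines. Both classes below are hypothetical stand-ins written for this sketch (``MiniMarkdown`` is not the real ``Markdown`` class). They only model the behaviour described above: registered extensions are reset once after initialization and again on every ``reset`` call.

```python
class FootnoteLikeExtension:
    """Stand-in extension that keeps per-document state."""

    def __init__(self):
        self.footnotes = {"stale": "set up during construction"}

    def extendMarkdown(self, md, md_globals):
        md.registerExtension(self)
        # processors and patterns would be inserted here

    def reset(self):
        self.footnotes = {}


class MiniMarkdown:
    """Stand-in for the Markdown class, modelling only registration."""

    def __init__(self, extensions):
        self.registeredExtensions = []
        for ext in extensions:
            ext.extendMarkdown(self, {})
        self.reset()  # registered extensions are reset once after init

    def registerExtension(self, extension):
        self.registeredExtensions.append(extension)

    def reset(self):
        for ext in self.registeredExtensions:
            ext.reset()


ext = FootnoteLikeExtension()
md = MiniMarkdown([ext])
cleared_on_init = ext.footnotes == {}
ext.footnotes["1"] = "a footnote from the first document"
md.reset()
```

Because ``reset`` also runs right after initialization, it should not assume that a document has already been converted.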

<h4 id="configsettings">Config Settings</h4>

If an extension uses any parameters that the user may want to change,
those parameters should be stored in ``self.config`` of your
``markdown.Extension`` class in the following format:

    self.config = {parameter_1_name : [value1, description1],
                   parameter_2_name : [value2, description2] }

When stored this way the config parameters can be over-ridden from the
command line or at the time Markdown is instantiated:

    markdown.py -x "myextension(SOME_PARAM=2)" inputfile.txt > output.txt

Note that parameters should always be assumed to be set to string
values, and should be converted at run time. For example:

    i = int(self.getConfig("SOME_PARAM"))
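The following sketch shows the ``[value, description]`` layout and the run-time conversion together. ``ConfigurableExtension`` is a hypothetical stand-in for a ``markdown.Extension`` subclass, and its ``getConfig``/``setConfig`` bodies and the dictionary form of ``configs`` are assumptions made so the example runs on its own.

```python
class ConfigurableExtension:
    """Stand-in for a markdown.Extension subclass with one setting."""

    def __init__(self, configs=None):
        # name: [default value, description]
        self.config = {"SOME_PARAM": ["1", "A hypothetical numeric setting"]}
        for key, value in (configs or {}).items():
            self.setConfig(key, value)

    def getConfig(self, key):
        return self.config[key][0]

    def setConfig(self, key, value):
        self.config[key][0] = value  # keep the description untouched


# Values arriving from the command line are strings, so convert on use.
ext = ConfigurableExtension(configs={"SOME_PARAM": "2"})
i = int(ext.getConfig("SOME_PARAM"))
default = int(ConfigurableExtension().getConfig("SOME_PARAM"))
```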
547
548
<h4 id="makeextension">makeExtension</h4>
549
550
Each extension should ideally be placed in its own module starting
551
with the  ``mdx_`` prefix (e.g. ``mdx_footnotes.py``).  The module must
552
provide a module-level function called ``makeExtension`` that takes
553
an optional parameter consisting of a dictionary of configuration over-rides 
554
and returns an instance of the extension.  An example from the footnote 
555
extension:
556
557
    def makeExtension(configs=None) :
558
        return FootnoteExtension(configs=configs)
559
560
By following the above example, when Markdown is passed the name of your 
561
extension as a string (i.e.: ``'footnotes'``), it will automatically import
562
the module and call the ``makeExtension`` function initiating your extension.
563
564
You may have noted that the extensions packaged with Python-Markdown do not
565
use the ``mdx_`` prefix in their module names. This is because they are all
566
part of the ``markdown.extensions`` package. Markdown will first try to import
567
from ``markdown.extensions.extname`` and upon failure, ``mdx_extname``. If both
568
fail, Markdown will continue without the extension.
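The lookup order just described can be approximated with ``importlib``. Both helper functions below are hypothetical and only illustrate the "try one name, then the next, then give up" behaviour. They are not Markdown's own loader.

```python
import importlib


def candidate_modules(ext_name):
    """Module names to try, in order, for an extension named `ext_name`."""
    return ["markdown.extensions.%s" % ext_name, "mdx_%s" % ext_name]


def import_first(candidates):
    """Return the first importable module, or None if every name fails."""
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # fall through to the next candidate
    return None


names = candidate_modules("footnotes")
# Demonstrated with made-up names so it runs without Markdown installed.
missing = import_first(["no_such_package_xyz.footnotes", "mdx_no_such_ext_xyz"])
found = import_first(["no_such_package_xyz.json", "json"])
```

One trade-off of this pattern is that an ``ImportError`` raised inside a broken extension module looks the same as a missing module, so a real loader may want to report the failure.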

However, Markdown will also accept an already existing instance of an extension.
For example:

    import markdown
    import myextension
    configs = {...}
    myext = myextension.MyExtension(configs=configs)
    md = markdown.Markdown(extensions=[myext])

This is useful if you need to implement a large number of extensions with more
than one residing in a module.

[Preprocessors]: #preprocessors
[InlinePatterns]: #inlinepatterns
[Treeprocessors]: #treeprocessors
[Postprocessors]: #postprocessors
[BlockParser]: #blockparser
[Working with the ElementTree]: #working_with_et
[Integrating your code into Markdown]: #integrating_into_markdown
[extendMarkdown]: #extendmarkdown
[OrderedDict]: #ordereddict
[registerExtension]: #registerextension
[Config Settings]: #configsettings
[makeExtension]: #makeextension
[ElementTree]: http://effbot.org/zone/element-index.htm