TextTools - Fast text manipulation tools for Python

mxTextTools - Fast text manipulation tools for Python

Engine : Objects : Functions : Constants : Examples : Structure : Installation : Copyright : History : Home

Version 1.0.2

Introduction

A while ago, in spring 1997, I started out to write some tools that were supposed to make string handling and parsing text faster than what the standard library has to offer. I had a need for this since I was (and still am) working on an online formatting system for web pages. After some initial prototypes of what I call Tagging Engine written totally in Python I started rewriting the main parts in C and soon realized that I needed a little more sophisticated searching tools.

I could walk through text pretty fast, but in many situations I just needed to replace some text with some other text.

The next step was to create a new types for fast searching in text. I decided to code up an enhanced version of the well known Boyer-Moore search algorithm. This made me think a bit more about searching and how knowledge about the text and the search pattern could be better used to make it work even faster. The result was an algorithm that uses a suffix skip array, which I call Fast Search Algorithm.

The two search types are built upon a small C lib I wrote for this. The implementations are optimized for gcc/Linux and from the tests I ran I can say that they out-perform every other technique I have tried. Even the very fast Boyer-Moore implementation of fgrep (1).

Then I reintegrated those search utilities into the Tagging Engine and also added a fast variant for doing 'char out of a string'-kind of tests. These are done using 'sets', i.e. strings that contain one bit per character position (and thus 32 bytes long).

All this got wrapped up in a nice Python package:

a fast search mechanism,
a finite state machine for doing fast tagging,
a set of functions aiding in post-processing the output of the two and
a set of functions handling sets of characters.

One word about the word 'tagging'. This originated from what is done in HTML to mark some text with a certain extra information. I extended this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the object) which can be used to do all kinds of nifty things.

Tagging Engine

Marking certains parts of a text should not involve storing hundreds of small strings. This is why the Tagging Engine uses a specially formatted list of tuples to return the results:

Tag List

A tag list is a list of tuples marking certain slices of a text. The tuples always have the format

(object, left_index, right_index, sublist)

with the meaning: object contains information about the slice [left_index:right_index] in some text. The sublist is either another taglist created by recursively invoking the Tagging Engine or None.

Tag Table

To create such taglists, you have to define a Tag Table and let the Tagging Engine use it to mark the text. Tag Tables are really just standard Python tuples containing other tuples in a specific format:

tag_table = (('lowercase',AllIn,a2z,+1,+2),
	     ('upper',AllIn,A2Z,+1),
	     (None,AllIn,white+newline,+1),
	     (None,AllNotIn,alpha+white+newline,+1),
	     (None,EOF,Here,-4)) # EOF

The tuples contained in the table use a very simple format:

(tagobj, command+flags, command_argument
	      		[,jump_no_match] [,jump_match=+1])

Semantics

The Tagging Engine reads the Tag Table starting at the top entry. While performing the command actions (see below for details) it moves a read-head over the characters of the text. The engine stops when a command fails to match and no alternative is given or when it reaches a non-existing entry, e.g. by jumping beyond the end of the table.

Tag Table entries are processed as follows:

If the command matched, say the slice text[l:r], the default action is to append (tagobj,l,r,sublist) to the taglist (this behaviour can be modified by using special flags; if you use None as tagobj, no tuple is appended) and to continue matching with the table entry that is reached by adding jump_match to the current position (think of them as relative jump offsets). The head position of the engine stays where the command left it (over index r), e.g. for (None,AllIn,'A') right after the last 'A' matched.

In case the command does not match, the engine either continues at the table entry reached after skipping jump_no_match entries, or if this value is not given, terminates matching the current table and returns not matched. The head position is always restored to the position it was in before the non-matching command was executed, enabling backtracking.

The format of the command_argument is dependant on the command. It can be a string, a set, a search object, a tuple of some other wild animal from Python land. See the command section below for details.

A table matches a string if and only if the Tagging Engine reaches a table index that lies out of bounds, e.g. one beyond the end of the table. The engine then returns matched ok.

Tagging Commands

The commands and constants used here are integers defined in Constants/TagTables.py and imported into the package's root module. For the purpose of explaining the taken actions we assume that the tagging engine was called with tag(text,table,start=0,end=len(text)). The current head position is indicated by x.

Command Matching Argument Action

Fail Here Causes the engine to fail matching at the current head position.

Jump To Causes the engine to perform a relative jump by jump_no_match entries.

AllIn string Matches all characters found in text[x:end] up to the first that is not included in string. At least one character must match.

AllNotIn string Matches all characters found in text[x:end] up to the first that is included in string. At least one character must match.

AllInSet set Matches all characters found in text[x:end] up to the first that is not included in the string set. At least one character must match.

Is character Matches iff text[x] == character.

IsNot character Matches iff text[x] != character.

IsIn string Matches iff text[x] is in string.

IsNotIn string Matches iff text[x] is not in string.

IsInSet set Matches iff text[x] is in set.

Word string Matches iff text[x:x+len(string)] == string.

WordStart string Matches all characters up to the first occurance of string in text[x:end]. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the first character of string in the found occurance. At least one character must match.

WordEnd string Matches all characters up to the first occurance of string in text[x:end]. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the last character of string in the found occurance.

sWordStart search object Same as WordStart except that the search object is used to perform the necessary action (which can be much faster) and zero matching characters are allowed.

sWordEnd search object Same as WordEnd except that the search object is used to perform the necessary action (which can be much faster).

sFindWord search object Uses the search object to find the given substring. If found, the tagobj is assigned only to the slice of the substring. The character leading up to it are ignored. The head position is adjusted to right after the substring -- just like for sWordEnd.

Call function Calls the matching function(text,x,end). The function must return the index y of the character in text[x:end] right after the matching substring. The entry is considered to be matching, iff x != y. The engines head is positioned on y in that case.

CallArg (function,[arg0,...]) Same as Call except that function(text,x,end[,arg0,...]) is being called. The command argument must be a tuple.

Table tagtable or ThisTable Matches iff tagtable matches text[x:end]. This calls the engine recursively. In case of success the head position is adjusted to point right after the match and the returned taglist is made available in the subtags field of this tables taglist entry. You may pass the special constant ThisTable instead of a Tag Table if you want to call the current table recursively.

SubTable tagtable or ThisTable Same as Table except that the subtable uses this table's taglist as its tag list. The subtags entry is set to None. You may pass the special constant ThisTable instead of a Tag Table if you want to call the current table recursively.

TableInList (list_of_tables,index) Same as Table except that the matching table to be used is read from the list_of_tables at position index whenever this command is executed. This makes self-referencing tables possible which would otherwise not be possible (since Tag Tables are immutable tuples). Note that it can also introduce circular references, so be warned !

EOF Here Matches iff the head position is beyond end.

Skip offset Always matches and moves the head position to x + offset.

Move position Always matches and moves the head position to text[position]. Negative indices move the head to text[end+position+1], e.g. (None,Move,-1) moves to behind the last input character, or EOF.

Loop count Remains undocumented for this release.

LoopControl Break/Reset Remains undocumented for this release.

The following flags can be added to the command integers above:

CallTag: Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, call tagobj(taglist,text,l,r,subtags).
AppendToTag: Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, call tagobj.append((None,l,r,subtags)).
AppendTagobj: Instead of appending (tagobj,l,r,subtags) to the taglist upon successful matching, append tagobj itself. Note that this can cause the taglist to have a non-standard format, i.e. functions relying on the standard format could fail. This flag is mainly intended to build join-lists usable by the join()-function (see below).

Some additional constants that can be used as argument or relative jump position:

To: Useful as argument for 'Jump'.
Here: Useful as argument for 'Fail' and 'EOF'.
MatchOk: Jumps to a table index out of bounds, causing the current table to immediatly return with 'matches ok'.
ToEOF: Useful as argument for 'Move'.
ThisTable: Useful as argument for 'Table' and 'SubTable'. See there for more information.

Internally, the Tag Table is used as program for a finite state machine which is coded in C and accessible through the package as tag() function along with the constants used for the commands (e.g. Allin, AllNotIn, etc.).

I admit, these tables don't look very elegant. In fact I would much rather write them in some meta language that gets compiled into these tables instead of handcoding them. But I've never had time to do much research into this, so unless someone volunteers, it will stay like this for some time.

Tip: if you are getting an error 'call of a non-function' while writing a table definition, you probably have a missing ',' somewhere in the tuple !

Debugging

The packages includes a nearly complete Python emulation of the Tagging Engine in the Examples subdirectory (pytag.py). Though it is unsupported it might still provide some use since it has a builtin debugger that will let you step through the Tag Tables as they are executed. See the source for further details.

As an alternative you can build a version of the Tagging Engine that provides lots of debugging output. See mxTextTools/Setup for explanations on how to do this. When enabled the module will create several .log files containing the debug information of various parts of the implementation whenever the Python interpreter is run with the debug flag enabled (python -d). These files should give a fairly good insight into the workings of the Tag Engine (though it still isn't as elegant as it could be).

Note that the debug version of the module is almost as fast as the regular build, so you might as well do regular work with it.

Search Objects

These objects are immutable and usable for one search string per object only. They can be applied to as many text strings as you like -- much like compiled regular expressions. Matching is done exact (doing the translations on-the-fly).

Search Object Constructors

There are two types of search objects. The Boyer-Moore type uses less memory, while the Fast Search type gives you enhanced speed with a little more memory overhead.

Note: The Fast Search object is *not* included in the public release, since I wan't to write a paper about it and therefore can't make it available yet.

BMS(match[,translate]): Create a Boyer Moore substring search object for the string match; translate is an optional translate-string like the one used in the module 're', i.e. a 256 character string mapping the oridnals of the base character set to new characters.
FS(match[,translate]): Create a Fast substring Search object for the string match; translate is an optional translate-string like the one used in the module 're'.

Search Object Instance Methods

The two search types have the same methods:

search(text,[start=0,len_text=len(text)]): Search for the substring match in text, looking only at the slice [start:len_text] and return the slice (l,r) where the substring was found, or (start,start) if it was not found.
find(text,[start=0,len_text=len(text)]): Search for the substring match in text, looking only at the slice [start:len_text] and return the index where the substring was found, or -1 if it was not found. This interface is compatible with string.find.
findall(text,start=0,len_text=len(text)): Same as search(), but return a list of all non-overlapping slices (l,r) where the match string can be found in text.

Note that translating the text before doing the search often results in a better performance. Use string.translate() to do that efficiently.

Functions

These functions are defined in the package:

tag(text,tagtable[,startindex=0,len_text=len(text),taglist=[]])

This is the interface to the Tagging Engine. It returns a tuple (success, taglist, nextindex), where nextindex indicates the next index to be processed after the last character matched by the Tag Table. In case of a non match (success == 0), it points to the error location in text. If you provide a tag list it will be used for the processing. Passing None results in no tag list being created at all.

join(joinlist[,sep=''])

This function works much like the corresponding function in module 'string'. It pastes slices from other strings together to form a new string. The format expected as joinlist is similar to a tag list: it is a list of tuples (string,l,r[,...]) (the resulting string will then include the slice string[l:r]) or strings (which are copied as a whole). Extra entries in the tuple are ignored. The optional argument is a seperator to be used in joining the slices together, it defaults to the empty string (unlike string.join).

Important Note: The syntax used for negative slices is different than the Python standard: -1 corresponds to the first character *after* the string, e.g. ('Example',0,-1) gives 'Example' and not 'Exampl', like in Python. To avoid confusion, don't use negative indices.

cmp(a,b)

Compare two valid taglist tuples w/r to their slice position. This is useful for sorting joinlists and not much slower than sorting integers, since the function is coded in C.

joinlist(text,list[,start=0,stop=len(text)])

Produces a joinlist suitable for passing to join() from a list of tuples (replacement,l,r,...) in such a way that all slices text[l:r] are replaced by the given replacement.

A few restrictions apply, though:

the list must be sorted ascending (e.g. using the cmp() as compare function)
it may not contain overlapping slices
the slices may not contain negative indices
if the taglist cannot contain overlapping slices, you can give this function the taglist produced by tag() directly (sorting is not needed, as the list will already be sorted)

If one of these conditions is not met, a ValueError is raised.

set(string[,logic=1])

Returns a character set for string: a bit encoded version of the characters occurring in string. If logic is 0, then all characters not in string will be in the set.

invset(string)

Same as set(string,0).

setfind(text,set[,start=0,stop=len(text)])

Find the first occurence of any character from set in text[start:stop]. set must be a string obtained from set().

setsplit(text,set[,start=0,stop=len(text)])

Split text[start:stop] into substrings using set, omitting the splitting parts and empty substrings. set must be a string obtained from set().

setsplitx(text,set[,start=0,stop=len(text)])

Split text[start:stop] into substrings using set, so that every second entry consists only of characters in set. set must be a string obtained from set().

upper(string)

Returns the string with all characters converted to upper case. Note that the translation string used is generated at import time. Locale settings will only have an effect if set prior to importing the package. This function is almost twice as fast as the one in the string module.

lower(string)

Returns the string with all characters converted to lower case. Same note as for upper().

is_whitespace(text,start=0,stop=len(text))

Returns 1 iff text[start:stop] only contains whitespace characters (as defined in Constants/Sets.py), 0 otherwise.

replace(text,what,with,start=0,stop=len(text))

Works just like string.replace() -- only faster since a search object is used in the process.

find(text,what,start=0,stop=len(text))

Works just like string.find() -- only faster since a search object is used in the process.

findall(text,what,start=0,stop=len(text))

Returns a list of slices representing all non-overlapping occurances of what in text[start:stop]. The slices are given as 2-tuples (left,right) meaning that what can be found at text[left:right].

collapse(text,seperator=' ')

Takes a string, removes all line breaks, converts all whitespace to a single seperator and returns the result. Tim Peters will like this one with seperator '-'.

charsplit(text,char,start=0,stop=len(text))

Returns a list that results from splitting text[start:stop] at all occurances of the character given in char. This is a special case of string.split() that has been optimized for single character splitting running 40% faster.

splitat(text,char,nth=1,start=0,stop=len(text))

Returns a 2-tuple that results from splitting text[start:stop] at the nth occurance of char and return the result as 2-tuple. If the character is not found, the second string is empty. nth may also be negative: the search is then done from the right and the first string is empty in case the character is not found. The splitting character itself is not included in the two substrings.

The TextTools.py also defines some other functions, but these are left undocumented since they may disappear in future releases.

Constants

The package exports these constants. They are defined in Constants/Sets.

a2z: 'abcdefghijklmnopqrstuvwxyz'
A2Z: 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
a2z: 'abcdefghijklmnopqrstuvwxyz'
umlaute: 'äöüß'
Umlaute: 'ÄÖÜ'
alpha: A2Z + a2z
a2z: 'abcdefghijklmnopqrstuvwxyz'
german_alpha: A2Z + a2z + umlaute + Umlaute
number: '0123456789'
alphanumeric: alpha + number
white: ' \t\v'
newline: '\n\r'
formfeed: '\f'
whitespace: white + newline + formfeed
any: All characters from \000-\377
*_set: All of the above as character sets.

Examples of Use

The Examples/ subdirectory of the package contains a few examples of how tables can be written and used. Here is a non-trivial example for parsing HTML (well, most of it):


    from TextTools import *

    error = '***syntax error'			# error tag obj

    tagname_set = set(alpha+'-'+number)
    tagattrname_set = set(alpha+'-'+number)
    tagvalue_set = set('"\'> ',0)
    white_set = set(' \r\n\t')

    tagattr = (
	   # name
	   ('name',AllInSet,tagattrname_set),
	   # skip junk
	   (None,AllInSet,white_set,+1),
	   # with value ?
	   (None,Is,'=',MatchOk),
	   # skip junk
	   (None,AllInSet,white_set,+1),
	   # unquoted value
	   ('value',AllInSet,tagvalue_set,+1,MatchOk),
	   # double quoted value
	   (None,Is,'"',+4),
	     ('value',AllNotIn,'"'),
	     (None,Is,'"'),
	     (None,Jump,To,MatchOk),
	   # single quoted value
	   (None,Is,'\''),
	     ('value',AllNotIn,'\''),
	     (None,Is,'\'')
	   )

    valuetable = (
	# ignore whitespace + '='
	(None,AllInSet,set(' \r\n\t='),+1),
	# unquoted value
	('value',AllInSet,tagvalue_set,+1,MatchOk),
	# double quoted value
	(None,Is,'"',+4),
	 ('value',AllNotIn,'"'),
	 (None,Is,'"'),
	 (None,Jump,To,MatchOk),
	# single quoted value
	(None,Is,'\''),
	 ('value',AllNotIn,'\''),
	 (None,Is,'\'')
	)

    allattrs = (# look for attributes
	       (None,AllInSet,white_set,+4),
	        (None,Is,'>',+1,MatchOk),
	        ('tagattr',Table,tagattr),
	        (None,Jump,To,-3),
	       (None,Is,'>',+1,MatchOk),
	       # handle incorrect attributes
	       (error,AllNotIn,'> \r\n\t'),
	       (None,Jump,To,-6)
	       )

    htmltag = ((None,Is,'<'),
	       # is this a closing tag ?
	       ('closetag',Is,'/',+1),
	       # a coment ?
	       ('comment',Is,'!',+8),
		(None,Word,'--',+4),
		('text',sWordStart,BMS('-->'),+1),
		(None,Skip,3),
		(None,Jump,To,MatchOk),
		# a SGML-Tag ?
		('other',AllNotIn,'>',+1),
		(None,Is,'>'),
		    (None,Jump,To,MatchOk),
		   # XMP-Tag ?
		   ('tagname',Word,'XMP',+5),
		    (None,Is,'>'),
		    ('text',WordStart,'</XMP>'),
		    (None,Skip,len('</XMP>')),
		    (None,Jump,To,MatchOk),
		   # get the tag name
		   ('tagname',AllInSet,tagname_set),
		   # look for attributes
		   (None,AllInSet,white_set,+4),
		    (None,Is,'>',+1,MatchOk),
		    ('tagattr',Table,tagattr),
		    (None,Jump,To,-3),
		   (None,Is,'>',+1,MatchOk),
		   # handle incorrect attributes
		   (error,AllNotIn,'> \n\r\t'),
		   (None,Jump,To,-6)
		  )

    htmltable = (# HTML-Tag
		 ('htmltag',Table,htmltag,+1,+4),
		 # not HTML, but still using this syntax: error or inside XMP-tag !
		 (error,Is,'<',+3),
		  (error,AllNotIn,'>',+1),
		  (error,Is,'>'),
		 # normal text
		 ('text',AllNotIn,'<',+1),
		 # end of file
		 ('eof',EOF,Here,-5),
		)

I hope this doesn't scare you away :-) ... it's fast as hell.

Package Structure

[TextTools]
       [Constants]
              Sets.py
              TagTables.py
       Doc/
       [Examples]
              HTML.py
              Loop.py
              Python.py
              RTF.py
              RegExp.py
              Tim.py
              Words.py
              altRTF.py
              pytag.py
       [mxTextTools]
              test.py
       TextTools.py

Entries enclosed in brackets are packages (i.e. they are directories that include a __init__.py file). Ones with slashes are just ordinary subdirectories that are not accessible via import.

The package TextTools imports everything needed from the other components. It is sometimes also handy to do a from TextTools.Constants.TagTables import *.

Examples/ contains a few demos of what the Tag Tables can do.

Installation

First, download the archive, unzip it in a directory on your Python path and then follow these steps (assuming you have already installed Python):

Note: The archive includes a pre-compiled DLL for Windows NT/95 contributed by Gordon McMillan. It may not be the most current version though (check TextTools.__version__). If you intend to use this file, you can skip the compilation steps below.

Instructions for compiling the C extension on Unix platforms:

run make -f Makefile.pre.in boot in the mxTextTools directory
run make
get Python 1.5 or above running, and execute test.py; it should not report any errors.
give me feedback :-)

Instructions for compiling the C extension on Windows platforms using VC++ (courtesy of Gordon McMillan):

Create a Project for a Win32 Dynamic-Link Library
Rename the mxTextTools.c file to mxTextTools.cpp and add it to the project
Use the same runtime library your Python is built with (should be Multithreaded DLL)
Point the compiler to Python/Include
Point the linker to Python/Lib and Python15.lib (no other libs are necessary)
Build
Move the resulting mxTextTools.dll to mxTextTools.pyd
get Python 1.5 or above running, and execute test.py; it should not report any errors.
give me feedback :-)

Though the module has been tested and is in constant use, there may still be some bugs left. Please post any bug reports, questions etc. directly to me.

What I'd like to hear from you...

Can you get the module to compile/work on other platforms other than Linux ? [yes for SGI/Irix, Macintosh, WinXX]

Copyright & Disclaimer

Permission to use, copy, modify, and distribute this software and its documentation for any purpose and without fee or royalty is hereby granted, provided that the above copyright notice appear in all copies and that both the copyright notice and this permission notice appear in supporting documentation or portions thereof, including modifications, that you make.

THE AUTHOR MARC-ANDRE LEMBURG DISCLAIMS ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE !

History & Future

Things that still need to be done:

Invent a nice matching language that can be compiled into tagtables.
Provide some more examples.
Clean up the C implementation and this document some more.
Do some benchmarking...

Things that changed from 1.0.1 to 1.0.2:

Fixed some of the undocumented printing functions.
Added Tim.py example for dynamic programming using Tag Tables.
Tuned the Tagging Engine a little more. Added optimizations to TextTools.join(). It is faster then string.join() now (but only excepts real Python lists as input).
Added collapse(). Tim Peters will like this one...
Tuned setsplit, setsplitx and joinlist somewhat. The performance is now comparable to string.split (for tasks producing the same output).
Added charsplit() and splitat().
Fixed a bug in join() that prevented the function from returning '' for empty lists. It raised a SystemError instead.
Added better exception reporting to the tagging engine. Errors are now reported together with the index of the Tag Table entry that caused the exception.
Fixed and reformatted included debugging support. If you want the C engine to be very verbose about what it's doing, compile the engine using '-DMAL_DEBUG -DMAL_PYTHON'. If you run the Python interpreter with '-d' option, the engine will print tons of information to stdout, e.g. "python -d Examples/HTML.py Doc/mxTextTools.html". The engine remains silent without the -d switch.
Added special ThisTable constant to simplify writing recursive Tag Tables.

Things that changed from 1.0.0 to 1.0.1:

Added new functions find() and findall().
Fixed a few quirks that caused compilation problems on Windows. Eliminated the dependency on hack.py in TextTools.py and some of the examples.
Added a compiled Windows PYD-file of the C extension. Thanks to Gordon McMillan for providing it and pointing out a couple of portability bugs.
Added instructions on how to build the C extension under WinXX courtesy of Gordon McMillan.
Added some type casts to make CodeWarrior/Mac happy. Thanks to Just van Rossum for this hint.

Things that changed from the really old TagIt module version 0.7 to mxTextTools 1.0.0:

Added lots of new commands, fixed some bugs, added documentation and wrapped everything into a package.
Added character set handling routines and search objects.

Command	Matching Argument	Action
Fail	Here	Causes the engine to fail matching at the current head position.
Jump	To	Causes the engine to perform a relative jump by `jump_no_match` entries.
AllIn	string	Matches all characters found in `text[x:end]` up to the first that is not included in string. At least one character must match.
AllNotIn	string	Matches all characters found in `text[x:end]` up to the first that is included in string. At least one character must match.
AllInSet	set	Matches all characters found in `text[x:end]` up to the first that is not included in the string set. At least one character must match.
Is	character	Matches iff `text[x] == character`.
IsNot	character	Matches iff `text[x] != character`.
IsIn	string	Matches iff `text[x] is in string`.
IsNotIn	string	Matches iff `text[x] is not in string`.
IsInSet	set	Matches iff `text[x] is in set`.
Word	string	Matches iff `text[x:x+len(string)] == string`.
WordStart	string	Matches all characters up to the first occurance of string in `text[x:end]`. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the first character of string in the found occurance. At least one character must match.
WordEnd	string	Matches all characters up to the first occurance of string in `text[x:end]`. If string is not found, the command does not match and the head position remains unchanged. Otherwise, the head stays on the last character of string in the found occurance.
sWordStart	search object	Same as WordStart except that the search object is used to perform the necessary action (which can be much faster) and zero matching characters are allowed.
sWordEnd	search object	Same as WordEnd except that the search object is used to perform the necessary action (which can be much faster).
sFindWord	search object	Uses the search object to find the given substring. If found, the tagobj is assigned only to the slice of the substring. The character leading up to it are ignored. The head position is adjusted to right after the substring -- just like for sWordEnd.
Call	function	Calls the matching `function(text,x,end)`. The function must return the index `y` of the character in `text[x:end]` right after the matching substring. The entry is considered to be matching, iff `x != y`. The engines head is positioned on `y` in that case.
CallArg	(function,[arg0,...])	Same as Call except that `function(text,x,end[,arg0,...])` is being called. The command argument must be a tuple.
Table	tagtable or ThisTable	Matches iff tagtable matches `text[x:end]`. This calls the engine recursively. In case of success the head position is adjusted to point right after the match and the returned taglist is made available in the subtags field of this tables taglist entry. You may pass the special constant `ThisTable` instead of a Tag Table if you want to call the current table recursively.
SubTable	tagtable or ThisTable	Same as Table except that the subtable uses this table's taglist as its tag list. The subtags entry is set to None. You may pass the special constant `ThisTable` instead of a Tag Table if you want to call the current table recursively.
TableInList	(list_of_tables,index)	Same as Table except that the matching table to be used is read from the `list_of_tables` at position `index` whenever this command is executed. This makes self-referencing tables possible which would otherwise not be possible (since Tag Tables are immutable tuples). Note that it can also introduce circular references, so be warned !
EOF	Here	Matches iff the head position is beyond `end`.
Skip	offset	Always matches and moves the head position to `x + offset`.
Move	position	Always matches and moves the head position to `text[position]`. Negative indices move the head to `text[end+position+1]`, e.g. (None,Move,-1) moves to behind the last input character, or EOF.
Loop	count	Remains undocumented for this release.
LoopControl	Break/Reset	Remains undocumented for this release.