Engine : Objects : Functions : Constants : Examples : Structure : Installation : Copyright : History : Home | Version 1.0.2 |
A while ago, in spring 1997, I started out to write some tools that were supposed to make string handling and parsing text faster than what the standard library has to offer. I had a need for this since I was (and still am) working on an online formatting system for web pages. After some initial prototypes of what I call Tagging Engine written totally in Python I started rewriting the main parts in C and soon realized that I needed a little more sophisticated searching tools.
I could walk through text pretty fast, but in many situations I just needed to replace some text with some other text.
The next step was to create a new types for fast searching in text. I decided to code up an enhanced version of the well known Boyer-Moore search algorithm. This made me think a bit more about searching and how knowledge about the text and the search pattern could be better used to make it work even faster. The result was an algorithm that uses a suffix skip array, which I call Fast Search Algorithm.
The two search types are built upon a small C lib I wrote for this. The implementations are optimized for gcc/Linux and from the tests I ran I can say that they out-perform every other technique I have tried. Even the very fast Boyer-Moore implementation of fgrep (1).
Then I reintegrated those search utilities into the Tagging Engine and also added a fast variant for doing 'char out of a string'-kind of tests. These are done using 'sets', i.e. strings that contain one bit per character position (and thus 32 bytes long).
All this got wrapped up in a nice Python package:
One word about the word 'tagging'. This originated from what is done in HTML to mark some text with a certain extra information. I extended this notion to assigning Python objects to text substrings. Every substring marked in this way carries a 'tag' (the object) which can be used to do all kinds of nifty things.
Marking certains parts of a text should not involve storing hundreds of small strings. This is why the Tagging Engine uses a specially formatted list of tuples to return the results:
Tag List
A tag list is a list of tuples marking certain slices of a text. The tuples always have the format
(object, left_index, right_index, sublist)
with the meaning: object
contains
information about the slice
[left_index:right_index]
in some text. The
sublist
is either another taglist created
by recursively invoking the Tagging Engine or
None
.
Tag Table
To create such taglists, you have to define a Tag Table and let the Tagging Engine use it to mark the text. Tag Tables are really just standard Python tuples containing other tuples in a specific format:
tag_table = (('lowercase',AllIn,a2z,+1,+2), ('upper',AllIn,A2Z,+1), (None,AllIn,white+newline,+1), (None,AllNotIn,alpha+white+newline,+1), (None,EOF,Here,-4)) # EOF
The tuples contained in the table use a very simple format:
(tagobj, command+flags, command_argument [,jump_no_match] [,jump_match=+1])Semantics
The Tagging Engine reads the Tag Table starting at the top entry. While performing the command actions (see below for details) it moves a read-head over the characters of the text. The engine stops when a command fails to match and no alternative is given or when it reaches a non-existing entry, e.g. by jumping beyond the end of the table.
Tag Table entries are processed as follows:
If the command
matched, say the slice
text[l:r]
, the default action is to append
(tagobj,l,r,sublist)
to the taglist (this
behaviour can be modified by using special
flags
; if you use None
as tagobj,
no tuple is appended) and to continue matching with the
table entry that is reached by adding
jump_match
to the current position (think of
them as relative jump offsets). The head position of the
engine stays where the command left it (over index
r
), e.g. for (None,AllIn,'A')
right after the last 'A' matched.
In case the command
does not match, the
engine either continues at the table entry reached after
skipping jump_no_match
entries, or if this
value is not given, terminates matching the current
table and returns not matched. The head position is
always restored to the position it was in before the
non-matching command was executed, enabling
backtracking.
The format of the command_argument
is dependant
on the command. It can be a string, a set, a search object,
a tuple of some other wild animal from Python land. See the
command section below for details.
A table matches a string if and only if the Tagging Engine reaches a table index that lies out of bounds, e.g. one beyond the end of the table. The engine then returns matched ok.
Tagging Commands
The commands and constants used here are integers defined in
Constants/TagTables.py and imported into the
package's root module. For the purpose of explaining the
taken actions we assume that the tagging engine was called
with tag(text,table,start=0,end=len(text))
. The
current head position is indicated by x
.
Command | Matching Argument | Action |
Fail | Here | Causes the engine to fail matching at the current head position. |
Jump | To |
Causes the engine to perform a relative jump by
jump_no_match entries.
|
AllIn | string |
Matches all characters found in text[x:end]
up to the first that is not included in string. At least
one character must match.
|
AllNotIn | string |
Matches all characters found in text[x:end]
up to the first that is included in string. At least one
character must match.
|
AllInSet | set |
Matches all characters found in text[x:end]
up to the first that is not included in the string
set. At least one character must match.
|
Is | character |
Matches iff text[x] == character .
|
IsNot | character |
Matches iff text[x] != character .
|
IsIn | string |
Matches iff text[x] is in string .
|
IsNotIn | string |
Matches iff text[x] is not in string .
|
IsInSet | set |
Matches iff text[x] is in set .
|
Word | string |
Matches iff text[x:x+len(string)] == string .
|
WordStart | string |
Matches all characters up to the first occurance of
string in text[x:end] . If string is not
found, the command does not match and the head position
remains unchanged. Otherwise, the head stays on the first
character of string in the found occurance. At least one
character must match.
|
WordEnd | string |
Matches all characters up to the first occurance of
string in text[x:end] . If string is not
found, the command does not match and the head position
remains unchanged. Otherwise, the head stays on the last
character of string in the found occurance.
|
sWordStart | search object | Same as WordStart except that the search object is used to perform the necessary action (which can be much faster) and zero matching characters are allowed. |
sWordEnd | search object | Same as WordEnd except that the search object is used to perform the necessary action (which can be much faster). |
sFindWord | search object | Uses the search object to find the given substring. If found, the tagobj is assigned only to the slice of the substring. The character leading up to it are ignored. The head position is adjusted to right after the substring -- just like for sWordEnd. |
Call | function |
Calls the matching
function(text,x,end) . The function must
return the index y of the character in
text[x:end] right after the matching
substring. The entry is considered to be matching, iff
x != y . The engines head is positioned on
y in that case.
|
CallArg | (function,[arg0,...]) |
Same as Call except that
function(text,x,end[,arg0,...]) is being
called. The command argument must be a tuple.
|
Table | tagtable or ThisTable |
Matches iff tagtable matches text[x:end] .
This calls the engine recursively. In case of success
the head position is adjusted to point right after the
match and the returned taglist is made available in the
subtags field of this tables taglist entry. You may pass
the special constant ThisTable instead of a
Tag Table if you want to call the current table
recursively.
|
SubTable | tagtable or ThisTable |
Same as Table except that the subtable uses this table's
taglist as its tag list. The subtags entry is set to
None. You may pass the special constant
ThisTable instead of a Tag Table if you
want to call the current table recursively.
|
TableInList | (list_of_tables,index) |
Same as Table except that the matching table to be used
is read from the list_of_tables at position
index whenever this command is
executed. This makes self-referencing tables possible
which would otherwise not be possible (since Tag Tables
are immutable tuples). Note that it can also introduce
circular references, so be warned !
|
EOF | Here |
Matches iff the head position is beyond end .
|
Skip | offset |
Always matches and moves the head position to x +
offset .
|
Move | position |
Always matches and moves the head position to
text[position] . Negative indices move the
head to text[end+position+1] ,
e.g. (None,Move,-1) moves to behind the last input
character, or EOF.
|
Loop | count | Remains undocumented for this release. |
LoopControl | Break/Reset | Remains undocumented for this release. |
The following flags can be added to the command integers above:
tagobj(taglist,text,l,r,subtags)
.
tagobj.append((None,l,r,subtags))
.
tagobj
itself. Note that this can cause the taglist to have a
non-standard format, i.e. functions relying on the
standard format could fail. This flag is mainly intended
to build join-lists usable by the
join()
-function (see below).
Some additional constants that can be used as argument or relative jump position:
Internally, the Tag Table is used as program for a finite
state machine which is coded in C and accessible through the
package as tag()
function along with the
constants used for the commands (e.g. Allin, AllNotIn,
etc.).
I admit, these tables don't look very elegant. In fact I would much rather write them in some meta language that gets compiled into these tables instead of handcoding them. But I've never had time to do much research into this, so unless someone volunteers, it will stay like this for some time.
Tip: if you are getting an error 'call of a non-function' while writing a table definition, you probably have a missing ',' somewhere in the tuple !
Debugging
The packages includes a nearly complete Python emulation of the Tagging Engine in the Examples subdirectory (pytag.py). Though it is unsupported it might still provide some use since it has a builtin debugger that will let you step through the Tag Tables as they are executed. See the source for further details.
As an alternative you can build a version of the Tagging Engine that provides lots of debugging output. See mxTextTools/Setup for explanations on how to do this. When enabled the module will create several .log files containing the debug information of various parts of the implementation whenever the Python interpreter is run with the debug flag enabled (python -d). These files should give a fairly good insight into the workings of the Tag Engine (though it still isn't as elegant as it could be).
Note that the debug version of the module is almost as fast
as the regular build, so you might as well do regular work
with it.
These objects are immutable and usable for one search string
per object only. They can be applied to as many text strings
as you like -- much like compiled regular
expressions. Matching is done exact (doing the translations
on-the-fly).
Search Object Constructors
There are two types of search objects. The Boyer-Moore
type uses less memory, while the Fast Search type gives
you enhanced speed with a little more memory overhead.
Note: The Fast Search object is *not* included in
the public release, since I wan't to write a paper about
it and therefore can't make it available yet.
Search Object Instance Methods
The two search types have the same methods:
Note that translating the text before doing the search
often results in a better performance. Use
These functions are defined in the package:
Important Note: The syntax used for negative
slices is different than the Python standard: -1
corresponds to the first character *after* the string,
e.g. ('Example',0,-1) gives 'Example' and not 'Exampl',
like in Python. To avoid confusion, don't use negative
indices.
A few restrictions apply, though:
The TextTools.py also defines some other functions, but
these are left undocumented since they may disappear in future
releases.
The package exports these constants. They are defined in
Constants/Sets.
The Examples/ subdirectory of the package contains a
few examples of how tables can be written and used. Here is a
non-trivial example for parsing HTML (well, most of it):
I hope this doesn't scare you away :-) ... it's
fast as hell.
Entries enclosed in brackets are packages (i.e. they are
directories that include a __init__.py file). Ones with
slashes are just ordinary subdirectories that are not accessible
via
The package TextTools imports everything needed from the other
components. It is sometimes also handy to do a
Examples/ contains a few demos of what the Tag Tables
can do.
First, download the archive, unzip it in a
directory on your Python path and then follow these
steps (assuming you have already installed Python):
Note: The archive includes a pre-compiled DLL for
Windows NT/95 contributed by Gordon McMillan. It may not be
the most current version though (check
TextTools.__version__). If you intend to use this file, you
can skip the compilation steps below.
Instructions for compiling the C extension on Unix platforms:
Instructions for compiling the C extension on Windows
platforms using VC++ (courtesy of Gordon McMillan):
Though the module has been tested and is in constant use,
there may still be some bugs left. Please post any bug
reports, questions etc. directly to me.
© 1997,1998, Copyright by Marc-André Lemburg; All
Rights Reserved. mailto: mal@lemburg.com
Permission to use, copy, modify, and distribute this
software and its documentation for any purpose and without
fee or royalty is hereby granted, provided that the above
copyright notice appear in all copies and that both the
copyright notice and this permission notice appear in
supporting documentation or portions thereof, including
modifications, that you make.
THE AUTHOR MARC-ANDRE LEMBURG DISCLAIMS ALL WARRANTIES WITH
REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF
MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE AUTHOR BE
LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR
ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR
PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR
OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH
THE USE OR PERFORMANCE OF THIS SOFTWARE !
Things that still need to be done:
Things that changed from 1.0.1 to 1.0.2:
Things that changed from 1.0.0 to 1.0.1:
Things that changed from the really old TagIt module version 0.7 to mxTextTools
1.0.0:
Search Objects
BMS(match[,translate])
FS(match[,translate])
search(text,[start=0,len_text=len(text)])
[start:len_text]
and return
the slice (l,r)
where the substring was
found, or (start,start)
if it was not
found.
find(text,[start=0,len_text=len(text)])
[start:len_text]
and return
the index where the substring was found, or
-1
if it was not found. This interface is
compatible with string.find
.
findall(text,start=0,len_text=len(text))
search()
, but return a list of
all non-overlapping slices (l,r)
where
the match string can be found in text.string.translate()
to do that efficiently.
Functions
tag(text,tagtable[,startindex=0,len_text=len(text),taglist=[]])
(success, taglist, nextindex)
, where
nextindex indicates the next index to be processed after
the last character matched by the Tag Table. In case of a
non match (success == 0), it points to the error location
in text. If you provide a tag list it will be used for
the processing. Passing None
results in no
tag list being created at all.
join(joinlist[,sep=''])
(string,l,r[,...])
(the resulting string
will then include the slice string[l:r]
)
or strings (which are copied as a whole). Extra
entries in the tuple are ignored. The optional
argument is a seperator to be used in joining the
slices together, it defaults to the empty string
(unlike string.join).
cmp(a,b)
joinlist(text,list[,start=0,stop=len(text)])
join()
from a list of tuples
(replacement,l,r,...)
in such a way that all
slices text[l:r]
are replaced by the given
replacement.
If one of these conditions is not met, a ValueError is
raised.
set(string[,logic=1])
invset(string)
set(string,0)
.
setfind(text,set[,start=0,stop=len(text)])
text[start:stop]
. set
must be a
string obtained from set()
.
setsplit(text,set[,start=0,stop=len(text)])
set
must be a string obtained from set()
.
setsplitx(text,set[,start=0,stop=len(text)])
set
must be a string obtained from
set()
.
upper(string)
lower(string)
is_whitespace(text,start=0,stop=len(text))
replace(text,what,with,start=0,stop=len(text))
find(text,what,start=0,stop=len(text))
findall(text,what,start=0,stop=len(text))
(left,right)
meaning that
what
can be found at text[left:right].
collapse(text,seperator=' ')
charsplit(text,char,start=0,stop=len(text))
splitat(text,char,nth=1,start=0,stop=len(text))
Constants
a2z
A2Z
a2z
umlaute
Umlaute
alpha
a2z
german_alpha
number
alphanumeric
white
newline
formfeed
whitespace
any
*_set
Examples of Use
from TextTools import *
error = '***syntax error' # error tag obj
tagname_set = set(alpha+'-'+number)
tagattrname_set = set(alpha+'-'+number)
tagvalue_set = set('"\'> ',0)
white_set = set(' \r\n\t')
tagattr = (
# name
('name',AllInSet,tagattrname_set),
# skip junk
(None,AllInSet,white_set,+1),
# with value ?
(None,Is,'=',MatchOk),
# skip junk
(None,AllInSet,white_set,+1),
# unquoted value
('value',AllInSet,tagvalue_set,+1,MatchOk),
# double quoted value
(None,Is,'"',+4),
('value',AllNotIn,'"'),
(None,Is,'"'),
(None,Jump,To,MatchOk),
# single quoted value
(None,Is,'\''),
('value',AllNotIn,'\''),
(None,Is,'\'')
)
valuetable = (
# ignore whitespace + '='
(None,AllInSet,set(' \r\n\t='),+1),
# unquoted value
('value',AllInSet,tagvalue_set,+1,MatchOk),
# double quoted value
(None,Is,'"',+4),
('value',AllNotIn,'"'),
(None,Is,'"'),
(None,Jump,To,MatchOk),
# single quoted value
(None,Is,'\''),
('value',AllNotIn,'\''),
(None,Is,'\'')
)
allattrs = (# look for attributes
(None,AllInSet,white_set,+4),
(None,Is,'>',+1,MatchOk),
('tagattr',Table,tagattr),
(None,Jump,To,-3),
(None,Is,'>',+1,MatchOk),
# handle incorrect attributes
(error,AllNotIn,'> \r\n\t'),
(None,Jump,To,-6)
)
htmltag = ((None,Is,'<'),
# is this a closing tag ?
('closetag',Is,'/',+1),
# a coment ?
('comment',Is,'!',+8),
(None,Word,'--',+4),
('text',sWordStart,BMS('-->'),+1),
(None,Skip,3),
(None,Jump,To,MatchOk),
# a SGML-Tag ?
('other',AllNotIn,'>',+1),
(None,Is,'>'),
(None,Jump,To,MatchOk),
# XMP-Tag ?
('tagname',Word,'XMP',+5),
(None,Is,'>'),
('text',WordStart,'</XMP>'),
(None,Skip,len('</XMP>')),
(None,Jump,To,MatchOk),
# get the tag name
('tagname',AllInSet,tagname_set),
# look for attributes
(None,AllInSet,white_set,+4),
(None,Is,'>',+1,MatchOk),
('tagattr',Table,tagattr),
(None,Jump,To,-3),
(None,Is,'>',+1,MatchOk),
# handle incorrect attributes
(error,AllNotIn,'> \n\r\t'),
(None,Jump,To,-6)
)
htmltable = (# HTML-Tag
('htmltag',Table,htmltag,+1,+4),
# not HTML, but still using this syntax: error or inside XMP-tag !
(error,Is,'<',+3),
(error,AllNotIn,'>',+1),
(error,Is,'>'),
# normal text
('text',AllNotIn,'<',+1),
# end of file
('eof',EOF,Here,-5),
)
Package Structure
[TextTools]
[Constants]
Sets.py
TagTables.py
Doc/
[Examples]
HTML.py
Loop.py
Python.py
RTF.py
RegExp.py
Tim.py
Words.py
altRTF.py
pytag.py
[mxTextTools]
test.py
TextTools.py
import
.
from
TextTools.Constants.TagTables import *
.
Installation
What I'd like to hear from you...
Copyright & Disclaimer
History & Future