List of Text Processing Tools

Introduction

This is a small, hand-maintained list of automated text processing tools. You may also be interested in my list of text editors and IDEs.

General-Purpose Preprocessors

  • m4 - a macro language with some open-source implementations, including GNU m4. (I personally find it very vile.)

  • GPP - a general-purpose preprocessor. Supports several alternative syntax modes. Open source (GPL).

  • filepp - an adaptation and extension of the C preprocessor for general-purpose use. Written in Perl. Open source (GPL-2-or-later).

  • chpp (Chakotay Preprocessor) - a powerful preprocessor that aims to be non-intrusive, and which can be considered a full-fledged programming system. Has been unmaintained since 1999. Open source (GPLv2).
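
To make the idea of a preprocessor concrete, here is a minimal sketch in Python of single-pass macro substitution. It borrows the `#define NAME value` directive from the C preprocessor purely for illustration and does not reproduce the syntax or semantics of any tool listed above; the function name `preprocess` is hypothetical.

```python
import re

def preprocess(text):
    """Expand simple `#define NAME value` macros in a single pass.

    Hypothetical helper: definitions are removed from the output and
    later occurrences of NAME (as a whole word) are replaced by value.
    Replacement text is not rescanned for further macros.
    """
    macros = {}
    output = []
    for line in text.splitlines():
        match = re.match(r"#define\s+(\w+)\s+(.*)", line)
        if match:
            # Record the definition and drop the directive line itself.
            macros[match.group(1)] = match.group(2)
            continue
        if macros:
            # Match any defined macro name as a whole word.
            names = "|".join(re.escape(name) for name in macros)
            pattern = re.compile(r"\b(" + names + r")\b")
            line = pattern.sub(lambda m: macros[m.group(1)], line)
        output.append(line)
    return "\n".join(output)

expanded = preprocess("#define NAME World\nHello, NAME!")
print(expanded)
```

Real preprocessors add conditionals, file inclusion, macro arguments and recursive expansion on top of this basic substitution step.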

General-Purpose Template Systems

  • Template Toolkit - a flexible and highly extensible template processing system for Perl. Open source (same terms as Perl).

  • ClearSilver - a language-agnostic and fast templating system written in C.

  • Jinja2 - a “full-featured” template engine for Python 2 and Python 3. Open source under a BSD-style licence.

  • Tenjin - “the fastest template engine in the world” - available for several dynamic languages.

  • eRuby - a Ruby-based template system with several implementations. Open source.

  • Smarty - a PHP template engine. Open source.

  • HTML-Template and Text-Template - two other CPAN template systems popular in the Perl world. Open source.

  • Cheetah - a Python-Powered Template Engine. “Fast, Flexible, Powerful”. Open Source. Has been unmaintained since 2010 and does not support Python 3.
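
As a rough illustration of what a template system does, the sketch below uses `string.Template` from Python's standard library. This is not the API of any engine listed above, which typically add loops, conditionals, filters and inheritance; the `render` helper and the sample values are assumptions made for the example.

```python
from string import Template

# A template is ordinary text with named placeholders.
greeting = Template("Hello, $name! You have $count new messages.")

def render(values):
    """Fill the placeholders of `greeting` from a dict (hypothetical helper)."""
    # substitute() raises KeyError if a placeholder has no value.
    return greeting.substitute(values)

rendered = render({"name": "Alice", "count": 3})
print(rendered)
```

Keeping the text in the template and the data in a separate mapping is the separation that all of the systems above build on.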

Parser Generators

  • Yacc - the standard LALR parser generator, with popular implementations such as Berkeley Yacc (byacc; open source, public domain) and GNU Bison (open source, GPL).

  • ANTLR - “ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.” Open Source (3-clause BSD licence).

  • Parse-RecDescent - a parser-generator for Perl 5. Open source (same terms as Perl).

  • Marpa - a parser that aims to be able to parse everything in BNF. Open source (LGPL-version-3-or-later).

  • SGLR, the Scannerless Generalized LR Parser.

  • Regexp::Grammars - “Add grammatical parsing features to Perl 5.10 regexes”.

  • Parser::MGC - build simple Recursive-Descent parsers in Perl.

  • Lemon Parser Generator - an LALR parser generator for C that is maintained as part of the SQLite project. Open source (public domain).
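
For readers unfamiliar with the recursive-descent technique mentioned above, here is a small hand-written sketch in Python that parses and evaluates integer arithmetic. It is not generated by, nor modelled on the API of, any of the listed tools; the grammar and the names `tokenize`, `Parser` and `evaluate` are assumptions chosen for the example.

```python
import re

# Grammar assumed for this sketch:
#   expr   := term (('+' | '-') term)*
#   term   := factor (('*' | '/') factor)*
#   factor := NUMBER | '(' expr ')'
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    """Split the input into number and operator tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    """One method per grammar rule; each consumes tokens and returns a value."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        value = self.term()
        while self.peek() in ("+", "-"):
            if self.take() == "+":
                value += self.term()
            else:
                value -= self.term()
        return value

    def term(self):
        value = self.factor()
        while self.peek() in ("*", "/"):
            if self.take() == "*":
                value *= self.factor()
            else:
                value /= self.factor()
        return value

    def factor(self):
        token = self.take()
        if token == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

def evaluate(text):
    parser = Parser(tokenize(text))
    result = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return result

print(evaluate("2 + 3 * (4 - 1)"))
```

A parser generator produces code with a similar job from a grammar description, so that the rules do not have to be written out by hand.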

Regular Expression Libraries

Diffing and Patching Tools

  • GNU Diffutils - an open source (GPLv3+) package which provides diff and other programs.

  • GNU patch - apply a patch/diff file. Open source (GPLv3+).

  • patchutils - a small collection of programs that operate on patch files. Open source.

  • comm - a UNIX command used to compare two files for common and distinct lines.

  • Meld - a GUI diff/merge tool for GTK+. Open source.

  • KDiff3 - a GUI diff/merge tool for KDE. Open source.

  • GNU wdiff - a front-end to GNU diff for comparing files on a word-per-word basis.
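
To show what these tools compute, the following sketch uses Python's standard `difflib` module to produce a line-based unified diff and a word-based comparison. It only approximates what diff and wdiff do and does not reproduce their exact output formats; the sample data and the `changed_words` helper are assumptions for the example.

```python
import difflib

old = ["apple\n", "banana\n", "cherry\n"]
new = ["apple\n", "blueberry\n", "cherry\n"]

# Line-based comparison in the unified format that patch-style tools consume.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old.txt", tofile="new.txt")
)
print("".join(diff_lines))

def changed_words(a, b):
    """Compare two strings word by word (hypothetical helper).

    Returns (operation, old_words, new_words) for every non-equal region.
    """
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, a_words, b_words)
    return [
        (tag, a_words[i1:i2], b_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(changed_words("the quick brown fox", "the slow brown fox"))
```

Comparing words instead of lines is useful for prose, where a one-word change would otherwise mark a whole paragraph as different.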

Specialised Processors

XML Processors

Standard UNIX Text Processing Tools

  • echo - output strings (with some possible transformations).

  • cat - output or concatenate files.

  • cut - extract sections from each line of input.

  • head - output the start of a stream.

  • tail - output the end of a stream.

  • paste - join multiple files horizontally.

  • sort - sorts input.

  • csplit - split files based on context lines.

  • join - merges lines of two files that share a common field.

  • uniq - collapses adjacent duplicate lines; the output is fully unique only if the input is sorted.

  • grep - search for lines matching regular expressions.

  • sed - stream editor - a mini programming language for text processing, based on the ed text editor.

  • AWK - an even more full-fledged programming language for text processing in UNIX (with some quirks and idiosyncrasies).
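
The behaviour of a few of these tools can be sketched in Python to make their semantics explicit. The functions below are simplified stand-ins named after the commands and do not cover the real tools' options or edge cases; the sample data is an assumption for the example.

```python
from collections import deque
from itertools import groupby, islice

lines = ["b", "a", "a", "b", "b", "a"]

def uniq(seq):
    """Collapse runs of adjacent equal items, like `uniq` with no options."""
    return [key for key, _run in groupby(seq)]

def head(seq, n):
    """Return the first n items, like `head -n N`."""
    return list(islice(seq, n))

def tail(seq, n):
    """Return the last n items, like `tail -n N`."""
    return list(deque(seq, maxlen=n))

def cut(line, field, delimiter="\t"):
    """Return one 1-based field, roughly like `cut -d DELIM -f N`."""
    return line.split(delimiter)[field - 1]

# uniq only removes *adjacent* duplicates, which is why it is
# usually run on sorted input (`sort | uniq`).
print(uniq(lines))
print(uniq(sorted(lines)))
print(head(lines, 2), tail(lines, 2))
print(cut("root:x:0:0", 3, ":"))
```

Because each tool does one small job on a stream of lines, they are normally chained together with pipes rather than used alone.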

Some General-Purpose Programming Languages with Good Text Processing Support

Licence

This document is Copyright by Shlomi Fish, 2012, and is available under the terms of the Creative Commons Attribution License 3.0 Unported (or at your option any later version of that licence).

To secure additional rights, please contact Shlomi Fish, and see the explicit requirements spelt out for abiding by that licence.