Corpus Linguistics


This website is maintained by Michael Barlow. I would appreciate hearing about new links and links listed here that are no longer active. Please send email to barlow@ruf.rice.edu. I would also like to solicit your help in building up the non-English corpus listings. Please let me know of any suitable corpora. I am slowly compiling a bibliographic reference section that focusses on Corpus Linguistics and the use of corpora in language teaching. Suggestions for the bibliography and links to actual papers are welcome.

See also the Parallel Corpora page

  • Texts: Corpora, Newspapers and News Sites
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • English-Miscellaneous
  • Estonian
  • Ethiopic
  • French
  • Gaelic
  • German
  • Hebrew
  • Italian
  • Malay
  • Norwegian
  • Polish
  • Portuguese
  • Russian
  • Scandinavian
  • Spanish
  • Swedish
  • Turkish
  • Miscellaneous
  • Learner corpora
  • Corpus searches
  • Word lists and Stop lists
  • Software
  • Text analysis
  • Taggers
  • Online papers, theses, etc. related to CL.
  • Courses in Corpus Linguistics
  • Bibliography
  • Useful Sites and Home Pages
  • Texts: Corpora, Newspapers and News Sites

    Chinese

    Mandarin corpus Big 5 encoding

    Czech

    Czech National Corpus Experimental WWW access to the corpus.
    Institute of the Czech National Corpus

    Danish

    News in Danish. Address has changed.

    Dutch

    The Institute for Dutch Lexicology have several large corpora, which can be accessed for academic research purposes. Contact kruyt@inl.nl

    English

    American National Corpus

    Oxford Text Archive: WEB site and OTA FTP site. (Mirror ftp site for North America -- OTA.) Good starting point. Includes British novels, Dickens, Trollope, etc. The Susanne Corpus is in this archive in the directory pub/ota/public/susanne. For background info, see Susanne.

    OED Online

    Project Gutenberg: (in English)
    Some literary works such as "Moby Dick" and "Through the Looking Glass" are available electronically from Project Gutenberg.

    Corpus of Spoken, Professional American-English The corpus is available commercially from Athelstan. There is a 50,000 word sample available online.

    The Bookstack An experimental index to online books.

    The Faerie Queene and other works by Edmund Spenser

    British National Corpus. A large (100 million words) corpus of modern English (1990s). BNC World Edition is now available. See also BNC Indexer

    International Corpus of English

    COBUILD offers access to a large corpus for a fee. Also has a free demo.

    Wellington Corpus of Spoken New Zealand English. CD-ROM. Written New Zealand English is also included. Corpus-Manager@vuw.ac.nz

    Penn-Helsinki Corpus of Middle English

    Lampeter Corpus Early Modern English

    ICAME, Bergen. This is the ftp site. ICAME also produces an excellent CD-ROM containing Brown, LOB, London-Lund, and Helsinki corpora among others. Also the home of Corpora news-list. Also a web-site.

    The Bergen Corpus of London Teenage Language

    Corpus of Written British Creole

    The TRAINS Spoken Dialogue Corpus

    CCAT Archive Gopher site at U. Penn. A good site for classical, historical, and religious texts.

    U.S. Government publications.

    Voice of America News (Gopher)

    CBC Canadian broadcasting. Includes sound files.

    Time and Time Daily

    Marx & Engels Online Library

    World Religious Texts

    Canterbury Tales Project An electronic Chaucer from Cambridge University Press.

    English-Miscellaneous

    Ftp site for Red Dwarf scripts
    O.J. Simpson Trial Transcripts Another transcript source. O.J. trial transcripts And Another good source.
    Estonian Law (in English)
    Neologisms
    Proper names Ftp site.
    Presidential Inaugural Addresses Old link. All the president's addresses.
    Russian novel Gopher
    Progress: Family Systems research and Therapy Full text journal. (Phillips Graduate Institute?)

    Estonian

    Estonian Corpus of Written Texts and in Estonian

    Estonian Law (in Estonian!)

    Ethiopic

    Thesaurus Linguae Aethiopicae

    French

    Louisiana French MOVED ??

    French novels

    News in French (Gopher) MOVED ??

    Dictionnaire de l'Académie française

    Radio France Internationale

    Old and Middle French

    Gaelic

    CURIA Project (medieval Irish texts)

    Manx

    German

    Mannheimer Corpora A very large, growing, online German corpus archive (778 million words in August 2000). A copyright-free portion of the archive (379 million words in August 2000) is freely searchable. Invited guests have access to the whole archive. Partially tagged.

    Project Gutenberg (German texts)

    German newspapers -- tagged corpus with syntactic structure annotated.

    German News: subscribe by sending an e-mail request to germnews@vm.gmd.de. Today's news in German

    Hebrew

    Spoken Israel Hebrew Description of the project.

    Indo-European

    Comparative Indo-European Includes 200-item lexicostatistical lists for 95 Indoeuropean speech varieties, cognation judgments between the lists, lexicostatistical percentages, etc.
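
    A lexicostatistical percentage of the kind mentioned above can be illustrated with a small sketch. It assumes the simplest definition (the share of meanings for which two varieties fall in the same cognate class); the function name, the data layout, and the toy data are hypothetical and are not taken from the database itself.

```python
def cognate_percentage(judgments_a, judgments_b):
    """Percentage of shared meanings whose cognate-class labels match.

    Each argument maps a meaning (e.g. "hand") to a cognate-class label.
    Meanings missing from either list are skipped.
    """
    shared = [m for m in judgments_a if m in judgments_b]
    if not shared:
        raise ValueError("no meanings in common")
    matches = sum(1 for m in shared if judgments_a[m] == judgments_b[m])
    return 100.0 * matches / len(shared)


# Hypothetical toy data: the labels are illustrative only.
variety_x = {"hand": "A", "water": "B", "two": "C", "fire": "D"}
variety_y = {"hand": "A", "water": "B", "two": "C", "fire": "E"}

print(cognate_percentage(variety_x, variety_y))
```

    Real lexicostatistical work usually also has to handle doubtful cognation judgments and missing items, which this sketch ignores.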

    Italian

    CORIS CORpus di Italiano Scritto being developed at CILTA. Corpus will be available online and on CD-ROM at the end of 2000.

    Italian literature (LiberLiber)

    Italian newspaper

    Malay

    Malay Classical literature. Searchable online.

    Norwegian

    Oslo Corpus of Tagged Norwegian Texts

    Polish

    Polish Newspaper

    Portuguese (Brazilian)

    Projecto Vercial Portuguese literature database

    Tony Berber Sardinha. The files disk1.taz disk2.taz are available in the directory ~ftp/pub/linguistics. The file cbmp.txt contains background information on the corpus. tony1@liverpool.ac.uk Other papers and useful info

    News from Brazil

    Russian

    Russian literature

    Russian foreign affairs articles I have not had much luck with this.

    Vesti: A Canadian-Russian Newspaper.

    Russian word list gopher.

    Scandinavian

    Language Bank of Swedish Texts

    Project Runeberg (Scandinavian classics)

    Norwegian Law

    Swedish

    Spanish

    South American oral and written texts available via ftp from lola.lllf.uam.es.

    Spanish Syntax Research Group University of Santiago de Compostela. Information about ARTHUS (1.5 million words in modern Spanish) and syntactic database (BDS, 160,000 analysed clauses of ARTHUS). In progress: a medieval and classic Spanish corpus ("ARTHUS Medieval y Clasico").

    "Maria" corpus Acquisition of Spanish.

    Mexican Newspapers: El Nacional, La Jornada, etc.

    Swedish

    Bank of Swedish

    Turkish

    Turkish with an Australian flavour.

    Miscellaneous

    Telephone speech corpus: 22 Language Corpus

    Telephone speech corpus: Alpha-numeric corpus

    Learner corpora

    Learner corpora Extensive information from Yukio Tono

    Hungarian EFL Student Writing

    ICLE - Brazilian Portuguese Sub-Corpus

    Corpus searches

    COSMAS search Institut für Deutsche Sprache, Mannheim, Germany.

    IMS Stuttgart (Penn Treebank) search -- OLD LINK??

    Cobuild Corpus Sampler

    University of Michigan Middle English Collection

    Michigan Early Modern English Materials

    Blake, Wordsworth, etc. Web concordance

    Web-based analysis of Gutenberg texts by Ron Reck. See also Corpus Access at the University of Essex.

    VISL Project Denmark. English and German corpora can be searched.

    Concordance of Great Books

    British National Corpus Simple search

    LDC Online

    Word lists and Stop lists

    French Stop list from Jean.Veronis@lpl.univ-aix.fr

    Stop lists and frequency lists for English, French and German. From Patrice Bonhomme.

    Zipped file of n-grams from the Brown Corpus

    Mike Scott's page contains several English wordlists.
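
    For readers unfamiliar with how stop lists, frequency lists, and n-gram lists are used together, the following is a minimal sketch. The tokenizer, the sample sentence, and the two-word stop list are illustrative assumptions; they do not reproduce any of the lists linked above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (and apostrophes) as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens, stop_words=frozenset()):
    """Count tokens, ignoring any that appear in the stop list."""
    return Counter(t for t in tokens if t not in stop_words)


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text and stop list, for illustration only.
sample = "The cat sat on the mat. The dog sat on the cat."
stop_words = {"the", "on"}

tokens = tokenize(sample)
print(frequency_list(tokens, stop_words).most_common())
print(Counter(ngrams(tokens, 2)).most_common(3))
```

    A real stop list would be loaded from a file, one word per line, and the tokenizer would need adjusting for languages with accented characters.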

    Software

    Text analysis

    COSMAS - A corpus analysis toolbox, online accessible since 1995, see COSMAS. 778 million words online, virtual corpus composition, complex query language, concordancing, collocation analysis etc.

    MonoConc Pro. Commercial Windows concordance program (produced by me). See the Athelstan site.

    MonoConc, a Mac/Windows concordance program that allows sorts (2R,1R,2L,1L) and provides simple frequency information. For information on availability, see MonoConc.

    ParaConc, a Mac/Windows concordance program for parallel texts. A version is available for free for research purposes (under license). For other uses, the single user price is $49.95. See ParaConc.

    Conc, a Mac concordance program, is available via ftp from SIL. Also available by anonymous-ftp from clr.nmsu.edu (/clr.nmsu.edu:/CLR/tools/concordances).
    Indiana University LETRS Conc QuickGuide.

    Free Text, a Mac concordance program, should be available from the U. of Michigan site. Also available from ftp://nora.hd.uib.no/pub/mac/

    HUM, developed by William Tuthill, is available by anonymous-ftp from clr.nmsu.edu (/clr.nmsu.edu:/CLR/tools/concordances).

    Perl Dan Melamed's perl tools

    Tact. Available via ftp from University of Toronto (epas.utoronto.ca).
    Indiana University LETRS TACT QuickGuide
    World Wide Web implementation of TACT -- TACTWeb. "TACTweb connects TACT to the World Wide Web -- making a TACT TDB database accessible to the entire WWW community." See also Elisabeth Burr's site.

    LEXA Corpus processing Software version 6 (for DOS) is available via ftp. This is a suite of programs for tagging, lemmatization, word frequency counts, etc.

    TextAnalyst Commercial software that produces a semantic network on the basis of text input. The company, Megaputer, also produces a data mining tool, PolyAnalyst.

    Lexical Freenet Web-based thesaurus

    ShoeBox Fieldwork oriented program. Information available from SIL.

    VisualText A suite of commercial text analysis tools.

    Word Cruncher Info available from WPT

    WordSmith Mike Scott's WordSmith page.

    Paai's text utilities: A set of utilities consisting of unix-scripts and c-programs for frequency-counts and lexical cohesion.

    Windows CLAN
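
    Several of the concordance programs listed above produce keyword-in-context (KWIC) lines and sort them by neighbouring words, as in the 2R, 1R, 2L, 1L sorts mentioned for MonoConc. The sketch below shows the general idea only. It assumes that "1R" means the first word to the right of the keyword and "2L" the second word to the left; the function names and sample text are hypothetical and do not reflect how any listed program is implemented.

```python
import re


def concordance(text, keyword):
    """Return (left_words, keyword, right_words) for each hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())
    return [
        (tokens[:i], tok, tokens[i + 1:])
        for i, tok in enumerate(tokens)
        if tok == keyword
    ]


def sort_key(hit, position):
    """Sort key for a context slot such as "1R" or "2L".

    Assumption: the digit counts words away from the keyword, and the
    letter gives the side (R = right, L = left).
    """
    left, _, right = hit
    n, side = int(position[:-1]), position[-1].upper()
    context = right if side == "R" else left[::-1]
    return context[n - 1] if n <= len(context) else ""


def format_hit(hit, width=3):
    """Lay out one hit with the keyword in a fixed column."""
    left, keyword, right = hit
    left_ctx = " ".join(left[-width:])
    right_ctx = " ".join(right[:width])
    return f"{left_ctx:>25}  [{keyword}]  {right_ctx}"


# Hypothetical sample text, for illustration only.
text = "The old corpus was small. A new corpus is large. Every corpus needs texts."
hits = sorted(concordance(text, "corpus"), key=lambda h: sort_key(h, "1R"))
for hit in hits:
    print(format_hit(hit))
```

    Changing "1R" to "1L" in the call to sorted groups the lines by the word immediately before the keyword instead.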

    Taggers

    Eric Brill's program Ftp site.

    TOSCA/LOB tagger for DOS. Downloadable.

    Rank Xerox in Grenoble have an interesting site. It is possible to enter text in French, English, German etc. and get it tagged.

    AMALGAM Email tagging, conversion of tagsets, ...

    AUTASYS by Alex Chengyu Fang at UCL.

    SemanTag A variant of Brill's Tagger??

    TreeTagger Language-independent HMM tagger. Parameter files for English, French, German.

    CRATER report. Discussion of a modified version of the Xerox Tagger.

    Tagger overview by Linda Van Guilder

    The Corpus Linguistics Group at the University of Birmingham has an experimental email tagger, QTAG. Texts can be sent via email to tagger@clg.bham.ac.uk

    The (LOB) CLAWS1 tag set

    CoreLex -- a tagset and database for semantic tagging based on WordNet

    Online Papers, Theses, etc. Related to CL

    Papers

    Michael Rundell The future of the corpus, and the corpus of the future

    Theses

    Torbjörn Lager. Thesis: A Logical Approach to Computational Corpus Linguistics

    Books

    The BNC Handbook: Exploring the British National Corpus with SARA. Guy Aston and Lou Burnard. Edinburgh Textbooks in Empirical Linguistics.

    Corpus Linguistics: Investigating Language Structure and Use. Douglas Biber, Susan Conrad, Randi Reppen

    An Introduction to Corpus Linguistics. Graeme Kennedy

    Computer Corpus Lexicography Vincent B Y Ooi. Edinburgh Textbooks in Empirical Linguistics.

    Corpus Linguistics Tony McEnery and Andrew Wilson. Edinburgh Textbooks in Empirical Linguistics.

    Language and Computers: A Practical Introduction to the Computer Analysis of Language. Geoff Barnbrook. Edinburgh Textbooks in Empirical Linguistics.

    Pattern Grammar: A corpus-driven approach to the lexical grammar of English. Susan Hunston and Gill Francis. Studies in Corpus Linguistics 4

    Patterns and Meanings Using corpora for English language research and teaching. Alan Partington. Studies in Corpus Linguistics 2

    Statistics for Corpus Linguistics. Michael Oakes. Edinburgh Textbooks in Empirical Linguistics.

    Terms in Context Jennifer Pearson. Studies in Corpus Linguistics 1

    Text and Technology In honour of John Sinclair. Mona Baker, Gill Francis and Elena Tognini-Bonelli (eds.) John Benjamins.

    Courses in Corpus Linguistics

    Tutorial: Concordances and Corpora Cathy Ball, Georgetown.

    Methods and Tools for Large-Scale Corpus Linguistics

    Eugene Charniak: Statistical course

    Elisabeth Burr: Korpuslinguistik course

    Tony Berber Sardinha: Corpus Linguistics courses: 1998-1999; 2000

    Mark Davies: History of the Spanish Language; Assignments and projects

    Chris Brew: Statistical NLP; Probabilistic modelling

    Javier Perez-Guerra: English linguistics (written in Galician)

    Bilge Say: Using Corpora for Language Research

    Sabine Reich: Corpus course

    Bibliography

    References (by name)

    References (by topic)

    References compiled at UCREL (Computational Linguistics)

    Useful Sites and Home Pages

    Centres and Departments

    Corpus Linguistics at Birmingham University, England.

    Center for Electronic Texts in the Humanities.

    Centre for English Corpus Linguistics, Louvain

    CTI Centre for Modern Languages Based in Hull, England. Newsletter, language software guide, info on language teaching.

    Oxford Text Archive

    Oxford University Language Centre

    UCREL Site Lancaster University, England

    Tuscan Word Centre

    Other Useful Sites

    African Languages Lexicon Project (ALLEX)

    Alex gopher site
    Alex allows users to find and retrieve the full-text of documents on the Internet.

    American National Corpus

    Annotation page at Upenn. Describes some 40 tools and formats for creating and managing linguistic annotations.

    Athena Large e-text site

    Books Online

    CHILDES Parent-child interactions.

    Alex Chengyu Fang Page Alex's page contains info on his various corpus tools

    Tim Johns Classroom Concordancing Page.

    Collocations page

    Concordancing page

    Corpus Encoding Standards Coordinated by Nancy Ide

    ECI/MCI Multilingual corpus information

    Electronic Text Archive

    English Language Corpora

    the etext pages

    European Language Resources Association ELRA catalogue

    Human Languages Page at Willamette.

    ICAME and Hong Liang Qiao's web-page

    Index of electronic text projects

    Internet Corpora Index

    Literature in various languages. University of Virginia ETC. See also Le letterature del mondo

    MATE Project Annotation of spoken corpora

    Pittsburgh U. Electronic Text project

    SPIRE Text visualisation analysis

    Survey of English Usage An interesting page.

    Taglog project Logic-based Corpus Theory Development Environment.

    TalkBank

    WInter Web Internationalization Page. Multilingual WWW issues. MOVED ???

    Send additions to Michael Barlow (barlow@ruf.rice.edu)
