Paper

Integrating linguistic and conceptual analysis in a WWW-based tool for terminography

Ingrid Meyer

University of Ottawa
imeyer@aix1.uottawa.ca

Douglas Skuce

University of Ottawa
doug@csi.uottawa.ca

Judy Kavanagh

University of Ottawa
kavanagh@csi.uottawa.ca

Laura Davidson

University of Ottawa
s523009@aix2.uottawa.ca

Keywords: terminography, terminology, knowledge engineering

Abstract

We describe a new computational toolset for terminography, which has lagged seriously behind lexicography in its use of computer aids. We begin by outlining the principal differences between terminography and lexicography, and the implications of these differences for the design of terminography-specific systems. We then describe the IKARUS toolset, which features the possibility of WWW-based cooperative work, a knowledge management tool based on Artificial Intelligence principles, and a variety of aids for identifying and analysing word behaviour in large document collections.

0. Introduction

Basic concepts. This paper is about terminography, the discipline concerned with building specialized dictionaries and term banks (cf. Sager 1990 for details). We will contrast terminography with general-language lexicography, referred to henceforth simply as lexicography.

Motivation for our research. Dictionary-builders have had considerable success in incorporating computer aids into their day-to-day work (Atkins et al. 1994). The advent of corpora in the mid-1980s, in particular, has been hailed as the start of a new era in lexicography. Virtually all the excitement, however, has centered on (general-language) lexicography. Terminography, in contrast, has lagged seriously behind. While terminographers do use computers for storing lexical data, they still make virtually no use of them for capturing and analyzing this data. For the most part, terminographers are still compiling their dictionaries from paper-based documents. This is particularly unfortunate since terminological lexical items (i.e. terms) are growing in importance in almost all areas of the language industries, including technical writing, translation and abstracting.

Organization of the paper. In Section 1, we offer an explanation for this discrepancy between lexicography and terminography. We argue that the two disciplines are significantly different in nature, and hence require different tools, many of which have not become available until recently. In Section 2, we outline the basic functionality of a new toolset called IKARUS, which offers many of the features that have previously been lacking in conventional lexicographic tools. IKARUS is a WWW-enhanced version of a previous system, CODE4, which has been used in a number of academic and commercial applications, both for terminography (Meyer 1992, Meyer et al. 1992) and knowledge engineering (Skuce and Lethbridge 1995). At the conference, we will illustrate the various functions of IKARUS through examples from one of our principal domain applications, which include optical disks, windowing systems, UNIX, and Java.

1. What computational support does terminography need?

Lexicography and terminography have much in common: they are both concerned with describing lexical items in a user-friendly format within a dictionary. The specialized nature of the lexical items studied in terminography, however, gives the discipline its own distinguishing features. In the following sections, we outline three characteristics which most fundamentally define the nature of terminography (a more detailed analysis is found in Meyer and Mackintosh 1996). For each characteristic, we also outline its implications for the design of terminography-specific computer tools.

1.1 Why terminography is done

The purpose of terminography is to identify and analyze lexical items used in specialized domains of knowledge, such as medicine, law and computing. In principle, all domain-specific terms are of interest. In practice, however, terminographers are overwhelmingly preoccupied with new terms: as domains change and grow, often at a frightening pace, terminographers must document the associated lexical changes. Unlike lexicographers, therefore, terminographers spend much of their time looking for the lexical items they will later describe. Also unlike lexicographers, terminographers' tasks are not merely descriptive, but prescriptive as well. Terminographers are often required to make objective judgements on the relative validity of competing terms, since the uncontrolled proliferation of poorly chosen terms can impede the advancement of knowledge. This normative work is often carried out by standardization committees, involving terminographers and domain experts, at both national and international levels.

Implications for tool design. Clearly, if terminographers are to use electronic corpora, they need them fast. Unlike lexicographers, they do not have the luxury of waiting years or even months for a corpus to be built. Furthermore, they need corpus analysis tools that will help them identify, and not just analyze, lexical items. Finally, they require support for standardization efforts, for example tools that facilitate international teamwork.

1.2 Terminography and onomasiology

Lexicography is essentially a semasiological enterprise: it starts with words and tries to get at their meanings. Terminography, on the other hand, has a significant onomasiological component. A typical terminography project starts with the analysis of a domain, in order to establish its limits, relations to other domains, and subdomains. Since understanding a domain means understanding the domain concepts, conceptual analysis is considered the cornerstone of terminography (Meyer and Eck 1996, Picht and Draskau 1985). It is also the most difficult aspect of terminography, since terminographers are not normally domain experts (see 1.3 below). Only when terminographers have done a certain amount of conceptual analysis can they begin to identify and describe the ways the concepts are being lexicalized within the domain. Sometimes, emergent concepts are not even lexicalized yet, and the terminographer may be required to propose a neologism.

Implications for tool design. Most importantly, the onomasiological orientation of terminography requires tools that provide support for conceptual analysis, something which has never been a major issue in lexicographic tool design. As we and others have argued elsewhere (Ahmad et al. 1989, Skuce and Meyer 1991), the terminographer's conceptual analysis tasks have much in common with knowledge engineering, and tool development should therefore draw on research in this area. Also following from the onomasiological orientation of their work, terminographers require corpora that are representative of a specialized domain of knowledge. Unlike lexicography, terminography cannot take an opportunistic approach to corpus-building: the corpus must be carefully selected to cover the concepts of a particular domain, and ideally also of related domains. This, once again, explains why corpora are still so little used in terminography: since the required documents may come from extremely diverse sources, obtaining them in machine-readable form has, in the past, taken more time than was allotted to an entire project.

1.3 Terminography and introspection

Most often, terminographers are not domain experts themselves; rather, their formal training is in a language-related area (terminology proper, translation, linguistics). The amount of domain knowledge they bring to a project may vary greatly. Introspection, therefore, is a much weaker source of knowledge for the terminographer than for the lexicographer. Furthermore, terminographers complement their analysis of written texts with extensive consultation with human experts. Since "true" experts may be a rare commodity, and since experts with linguistic sensitivity to boot may be even rarer, terminographers' experts may be anywhere in the world.

Implications for tool design. While electronic corpora have not been favoured by terminographers traditionally, this is no reason to give up on this avenue of research. On the contrary, the minimal role of introspection in their work makes terminographers even more dependent on texts than their lexicographer counterparts. However, since terminographers need both linguistic and conceptual information, the latter aspect will need to be developed in corpus analysis tools. Furthermore, since terminographers also rely heavily on interactions with experts, tools should facilitate terminographer-expert dialogue, regardless of where the experts may be located.

2. IKARUS: A toolset for linguistic and conceptual support

2.1 Design philosophy

IKARUS (Intelligent Knowledge Acquisition, Retrieval and Universal System) has been designed generically, as a tool to assist anyone faced with the task of acquiring and organizing specific knowledge from on-line texts. The terminographer, who needs to acquire and organize conceptual and linguistic knowledge, is thus seen as a prime potential user. IKARUS offers the following features that respond directly to the terminography-specific needs identified above:

- It features an HTML-based design, running on the WWW (e.g. under Netscape). This allows the terminographer the possibility of accessing many kinds of distributed documentation. It also allows for the type of widescale shareability required by standardization efforts and by terminographer-expert dialogue. Terminographers anywhere in the world can simultaneously interact with the same data, even posting "yellow stickies" to each other in real time (e.g. "Ann, I don't agree with your definition of heuristic" - Jane). This design philosophy is consistent with other recent work, e.g. Nuopponen 1996, Schweighofer and Scheithauer 1996, Jacquin and Liscouet 1996.

- It offers a knowledge engine that allows users to build a knowledge base (a highly structured extension of a database) with frame-like property inheritance, as found in classic Artificial Intelligence systems such as Cyc (Lenat and Guha 1990). These knowledge bases can be located anywhere on the WWW, and can incorporate pointers to retrieved documents or to Web sites, or to each other, so that a network of them appears to the user like one huge knowledge base. Using the knowledge engine, the terminographer can construct a conceptual model of a given domain, and link individual parts of the model to the original knowledge. Such conceptual models are of great importance in standardization efforts, the creation of neologisms, and dialogue with experts.

- It provides a suite of document collection and analysis tools (Kavanagh 1995) that incorporate those features of conventional, lexicographically-oriented tools that have been found to be of greatest interest for terminographers (Ahmad and Rogers 1996, Meyer and Mackintosh 1996). In addition, the corpus analysis toolset offers features of specific interest to terminographers, for example, terminology extraction to help in identifying potential terms, and "conceptual filtering" of corpus output (described in more detail below) to aid in conceptual analysis.

IKARUS has three principal components, each of which is described briefly below.

2.2 The WWW Document Collector

Many people have used a WWW search engine such as Alta Vista or Infoseek, and know that the results require much manual browsing to separate the wheat from the chaff. IKARUS' document collector, in contrast, submits a query to a meta search engine, which in turn passes it on to about eight ordinary search engines, then collates all their outputs. These are presented to the user as a single document, with a check box beside each item. The user decides which of these are useful, browsing the full texts if necessary. One click causes all selected documents to be downloaded to the local disk. Here, an indexing engine (Glimpse) indexes them for rapid retrieval of documents, paragraphs, or even sentences containing a term of interest. The display is ranked, showing those text segments with the most occurrences first.
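As an illustration only, the collation and ranking steps described above might be sketched as follows in Python. This is not the IKARUS implementation: the function names, the engine names, the data layout and the URL normalization are all assumptions made for the example, and no real search engine or Glimpse call is involved.

```python
import re


def collate_results(engine_results):
    """Merge result lists from several engines into one list without duplicates.

    engine_results maps a (hypothetical) engine name to a list of
    (url, title) pairs, in the order that engine returned them.
    """
    merged = {}
    for engine, hits in engine_results.items():
        for url, title in hits:
            # Crude normalization (an assumption): ignore a trailing slash.
            key = url.rstrip("/")
            entry = merged.setdefault(
                key, {"url": url, "title": title, "engines": []}
            )
            entry["engines"].append(engine)
    # Dictionaries keep insertion order, so items appear in first-seen order.
    return list(merged.values())


def rank_segments(segments, term):
    """Return (count, segment) pairs for segments that contain the term,
    with the segments holding the most occurrences first."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counted = [(len(pattern.findall(segment)), segment) for segment in segments]
    hits = [pair for pair in counted if pair[0] > 0]
    # sorted() is stable, so segments with equal counts keep their order.
    return sorted(hits, key=lambda pair: -pair[0])
```

Here a "segment" can be a document, a paragraph or a sentence, depending on how the downloaded texts were split before ranking.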

2.3 The Document Analyzer

The document analyzer is essentially a concordancing program, featuring the basic functions that lexicographers have come to associate with such tools (Atkins 1992). However, it also offers a number of features that many conventional concordancers lack:

- It can identify single and multi-word terms

- It can work with tagged text

- It finds various types of collocational information, such as the frequencies of the words that appear between any two given words, or the frequencies of the verbs that appear after a noun

- It can search for phrases that usually indicate common conceptual relations, such as phrases meaning "X is a kind of Y" or "X is a part of a Y"

- It does not require indexing (a time-consuming preprocessing step); special techniques are used to speed up the search for phrases so that response times are within a few seconds
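Two of the features listed above, the search for phrases signalling conceptual relations and the counting of words that occur between two given words, can be illustrated with a minimal Python sketch. The patterns, function names and window size are hypothetical and far simpler than what a working analyzer would need; in particular, each pattern captures only a single word on either side of the relation phrase, and the word window ignores sentence boundaries.

```python
import re
from collections import Counter

# Hypothetical surface patterns; a realistic inventory would be much larger.
RELATION_PATTERNS = {
    "kind-of": re.compile(
        r"([\w-]+)\s+is\s+an?\s+(?:kind|type)\s+of\s+(?:an?\s+|the\s+)?([\w-]+)",
        re.IGNORECASE,
    ),
    "part-of": re.compile(
        r"([\w-]+)\s+is\s+(?:a\s+)?part\s+of\s+(?:an?\s+|the\s+)?([\w-]+)",
        re.IGNORECASE,
    ),
}


def find_relations(text):
    """Return (relation, x, y) triples for phrases such as 'X is a kind of Y'."""
    found = []
    for relation, pattern in RELATION_PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((relation, match.group(1), match.group(2)))
    return found


def words_between(text, left, right, max_gap=3):
    """Count the words found between `left` and `right` when at most
    `max_gap` words separate them."""
    left, right = left.lower(), right.lower()
    tokens = re.findall(r"[\w-]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != left:
            continue
        # The window holds up to max_gap intervening words plus `right` itself.
        window = tokens[i + 1 : i + 2 + max_gap]
        if right in window:
            counts.update(window[: window.index(right)])
    return counts
```

For instance, `find_relations("The head is a part of the body.")` yields one `part-of` triple linking `head` and `body`, which a terminographer could then review before recording the relation.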

2.4 The Knowledge Base Management System

Terminographers and lexicographers have traditionally "captured" the results of their analyses in a database. IKARUS, in contrast, uses the knowledge base model as developed in the Artificial Intelligence research community. Unlike a conventional database, a knowledge base can capture conceptual relations within the domain, as well as linguistic information. (The specific advantages of knowledge bases for terminography are described in Meyer and Eck 1996.) Concepts are organized in various kinds of hierarchies, of which the most common are generic-specific (a car is a kind of vehicle), topic (algebra is a subdomain of mathematics), and part-of (the head is a part of the body). The terminographer may associate any number of characteristics with a concept, using a frame-like structure. For example, for the concept CD-ROM, one can create the characteristic "storage capacity: 500 MB". When a concept is part of a generic-specific hierarchy, characteristics are automatically inherited by more specific concepts from more general ones. Multiple inheritance is also permitted. We are currently implementing ways of linking a number of knowledge bases together, so that if you (in Australia) have one on cars, and I (in Canada) have one on trucks, they may appear to anyone as one large knowledge base through the magic of the WWW.
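The frame-like structure with inheritance described above can be made concrete with a small Python sketch. It is an illustration under stated assumptions rather than a description of the IKARUS knowledge engine: the class name, the method names and the sample concepts other than CD-ROM are hypothetical, and the rule used when several parents offer a value (nearest ancestor first, parents in the order given) is one possible choice among many.

```python
from collections import deque


class Concept:
    """A frame: a named concept with its own characteristics and zero or
    more more-general parent concepts (generic-specific links)."""

    def __init__(self, name, parents=(), characteristics=None):
        self.name = name
        self.parents = list(parents)
        self.own = dict(characteristics or {})

    def lookup(self, characteristic):
        """Return the value stated on this concept or, failing that, on the
        nearest ancestor (breadth-first; an assumed resolution rule)."""
        queue = deque([self])
        seen = set()
        while queue:
            concept = queue.popleft()
            if id(concept) in seen:
                continue
            seen.add(id(concept))
            if characteristic in concept.own:
                return concept.own[characteristic]
            queue.extend(concept.parents)
        raise KeyError(characteristic)

    def is_a(self, other):
        """True if `other` is this concept or one of its ancestors."""
        return other is self or any(p.is_a(other) for p in self.parents)


# Hypothetical fragment of an optical-disk domain.
medium = Concept("storage medium", characteristics={"stores": "data"})
optical = Concept("optical disk", [medium], {"read by": "laser"})
read_only = Concept("read-only medium", [medium], {"writable": "no"})
cdrom = Concept("CD-ROM", [optical, read_only], {"storage capacity": "500 MB"})
```

With these definitions, `cdrom.lookup("read by")` is answered by the optical disk frame and `cdrom.lookup("writable")` by the read-only medium frame, showing a characteristic reaching CD-ROM along each of its two parents.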

In keeping with the critical importance of texts as a source of linguistic and conceptual information for the terminographer (cf. 1.3), any knowledge base component can be closely linked to the documents on which it is based. For example, every concept (or group of related concepts) in a knowledge base can be associated with a particular document collection by an automatically constructed Unix directory hierarchy. The associated documents can thus be directly browsed from the knowledge base. Furthermore, since IKARUS runs as a WWW application under a browser such as Netscape, one may place WWW addresses (i.e. URLs) within any knowledge base entry, again allowing the terminographer to go directly to a WWW site from the knowledge base. Of course, URLs may point to graphical, audio or video material, and not merely text.
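One simple way to picture the link between concepts and a directory hierarchy is to map a chain of concept names, from general to specific, onto a path. The Python sketch below is purely illustrative: the function name, the root directory and the rule for turning concept names into directory names are assumptions, and the scheme actually used by IKARUS is not described here.

```python
import re
from pathlib import PurePosixPath


def concept_directory(root, concept_chain):
    """Map concept names, ordered from general to specific, to a
    Unix-style directory path under `root`."""
    parts = []
    for name in concept_chain:
        # Replace runs of characters that are awkward in file names
        # with a single underscore (an assumed naming rule).
        safe = re.sub(r"[^A-Za-z0-9_-]+", "_", name.strip()).strip("_")
        parts.append(safe)
    return PurePosixPath(root).joinpath(*parts)


# Hypothetical root directory and concept chain.
path = concept_directory(
    "/ikarus/docs", ["storage medium", "optical disk", "CD-ROM"]
)
```

Here `path` is `/ikarus/docs/storage_medium/optical_disk/CD-ROM`, a directory in which the documents collected for the concept CD-ROM could be kept. A concept with several parents would need an additional convention, for example one directory per concept plus symbolic links.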

3. An extended example using IKARUS

In the conference presentation, the above description of IKARUS will be illustrated with concrete examples (we cannot do this here because of space and graphical constraints). Through the use of screen captures, we will "walk" the audience through the various stages of terminographic research using IKARUS: 1) searching for, and downloading, domain-specific documents found on the WWW, using the Document Collector; 2) identifying possible terms, using the Text Analyzer; 3) analyzing a particular term in depth, for both linguistic and conceptual information, using the Text Analyzer; 4) recording linguistic/conceptual information about the term, using the Knowledge Base Management System; 5) exploring the conceptual relations into which the term enters, using the Knowledge Base Management System.

Interested readers may consult the public version of IKARUS on the WWW: http://www.csi.uottawa.ca/~kavanagh/Ikarus.html.

4. Concluding remarks

IKARUS gives the terminographer the possibility of carrying out a number of tasks which are unique to terminology, and which have previously not been feasible using only tools designed for lexicography. These include: very fast collection of domain-specific documents on the WWW; identification of potential terms in these documents; analysis of both the linguistic and conceptual characteristics of these terms; recording, and browsing through, a conceptual model of the domain; and finally, through the magic of the WWW, sharing the results of terminographic analysis with terminographers or domain experts throughout the world. We hope that tools of this kind will finally launch terminographers into the computer age, and contribute to a new generation of terminological dictionaries that is not only created much faster than in the past, but that is also of higher quality due to the wealth of conceptual information that complements the linguistic.

Acknowledgements

This research has been supported by the Social Sciences and Humanities Research Council of Canada (SSHRC), by the Natural Sciences and Engineering Research Council of Canada (NSERC), and by Mitel, Inc.

References

Ahmad, K., Picht, H., Rogers, M. and Thomas, P. 1989. "Terminology and Knowledge Engineering: A Symbiotic Relationship Explained". Technical Report TR 89/1, Guildford: University of Surrey.

Ahmad, K. and Rogers, M. 1996 (in press). "The Analysis of Text Corpora for the Creation of Advanced Terminology Databases". In Handbook of Terminology Management. Eds. Gerhard Budin and Sue-Ellen Wright. Amsterdam/Philadelphia: John Benjamins.

Atkins, B.T.S. 1992. "Tools for Computer-Aided Corpus Lexicography: the Hector Project". In Papers in Computational Lexicography (Complex 92). Budapest: Linguistics Institute, Hungarian Academy of Sciences.

Atkins, B.T.S., Levin, B., and Zampolli, A. 1994. "Computational Approaches to the Lexicon: An Overview". In Computational Approaches to the Lexicon, Eds. B.T.S. Atkins and A. Zampolli, pp. 17-48.

Jacquin, Christine and Liscouet, Maurice. 1996. "Terminology Extraction from Text Corpora: Application to Document-Keeping Via Internet". TKE 96: Terminology and Knowledge Engineering. Eds. Christian Galinski and Klaus-Dirk Schmitz. Frankfurt: INDEKS Verlag.

Kavanagh, Judy. 1995. The Text Analyzer: A Tool for Knowledge Acquisition from Texts. Master's thesis, Dept. of Computer Science, University of Ottawa, Canada.

Lenat, D. and Guha, R. 1990. Building Large Knowledge-Based Systems. Reading, MA: Addison-Wesley.

Meyer, Ingrid. 1992. "Knowledge Management for Terminology-Intensive Applications: Needs and Tools". In Lexical Semantics and Knowledge Representation, Eds. J. Pustejovsky and S. Bergler. Berlin: Springer Verlag, pp. 21-37.

Meyer, Ingrid, Bowker, Lynne and Eck, Karen. 1992. "Towards a New Generation of Terminological Resources: An Experiment in Building a Terminological Knowledge Base". Proceedings of COLING 92, pp. 956-960.

Meyer, Ingrid and Eck, Karen. 1996 (in press). "Systematic Concept Analysis with a Knowledge-based Approach to Terminology". In Handbook of Terminology Management. Eds. Gerhard Budin and Sue-Ellen Wright. Amsterdam/Philadelphia: John Benjamins, pp. 113-134.

Meyer, Ingrid and Mackintosh, Kristen. 1996 (in press). "The Corpus from a Terminographer's Viewpoint". International Journal of Corpus Linguistics, Vol. 1, No. 2.

Nuopponen, Anita. 1996. "Terminological Information and Activities in World Wide Web". TKE 96: Terminology and Knowledge Engineering. Eds. Christian Galinski and Klaus-Dirk Schmitz. Frankfurt: INDEKS Verlag.

Picht, Heribert and Draskau, Jennifer. 1985. Terminology: An Introduction. Guildford: University of Surrey.

Sager, Juan. 1990. A Practical Course in Terminology Processing. Amsterdam/Philadelphia: John Benjamins.

Schweighofer, Erich and Scheithauer, Dieter. 1996. "Legal Terminology Research in an Internet/WWW Environment". TKE 96: Terminology and Knowledge Engineering. Eds. Christian Galinski and Klaus-Dirk Schmitz. Frankfurt: INDEKS Verlag.

Skuce, Douglas and Lethbridge, Timothy. 1995. "CODE4: A Unified System for Managing Conceptual Knowledge". International Journal of Human-Computer Studies.

Skuce, Douglas and Meyer, Ingrid. 1991. "Terminology and Knowledge Engineering: Exploring a Symbiotic Relationship". Proceedings of the 6th International Workshop on Knowledge Acquisition for Knowledge-Based Systems. (Banff, Oct. 1991), pp. 29-1 to 29-21.