On ‘knowledge organisation’

Brian Vickery

To organise knowledge is to gather together what we know into a comprehensive organised structure, to show its parts and their relationships. This is the work of scholars and encyclopaedists. It is not the role of the information profession. Our tasks are to make knowledge (whether organised or unorganised) available to those who seek it, to store it in an accessible way, and to provide tools and procedures that make it easier for people to find what they seek in those stores.

Among those tools are ‘finding aids’ that aim to indicate the contents of the stores, the ‘subjects’ of different knowledge items within them and the locations of these items within the stores, aids that have traditionally been known as ‘subject catalogues’. These finding aids have themselves developed structures, such as those of a subject heading list, or of a classification, and in recent years the information profession has come to talk of such structures as ‘knowledge organisation systems’. One contentious theme has been this: to what extent should, or can, the structure and organisation of a finding aid reflect the structure and organisation of knowledge itself?

What is now called ‘knowledge organisation’ in this context has a long history. The simplest forms of a knowledge organisation system (KOS) are, after all, the contents list and the index of a textbook. The knowledge is in the text; the KOS is a supplementary tool that helps the reader to find his way around the text. But as such finding aids have become more complex, and taken on wider functions, they have acquired grander names, such as retrieval languages, taxonomies, categorisations, lexicons, thesauri, or ontologies. They are now seen as schemes that organize, manage, and retrieve information.

KOS structures

The basis of any modern KOS is an assembly and display of words or phrases (I will refer to them both as ‘terms’) with some indication of semantic relations between them. This is a definition that would cover dictionaries, glossaries, semantic nets, frame and slot structures, concept or topic maps, and other ‘term lists’, as well as those mentioned above. So it is not surprising that contributions to the ideas of KOS come from a variety of fields - indexers, subject cataloguers, linguists, lexicographers, taxonomists, logicians, computer programmers, artificial intelligence workers, even philosophers. The writings of Sowa provide examples of all these influences.

Why are semantic relations between terms needed in a KOS? This stems from the purposes for which KOS are used, the functions they perform. As ‘finding aids’, they may be used to facilitate, for example:

(a) Generic survey - the selection of a set of knowledge items likely to be relevant to a particular general subject. Since the subjects dealt with in knowledge are of all degrees of specificity, the KOS should be able to relate all those within a particular general subject field, to link them so that all can be found, to group them together for display to the searcher.

(b) Specific search - the selection of a set of knowledge items likely to be relevant to a particular specific topic. Since such topics can normally only be described by the combination of a number of terms, the KOS should be able to link terms together in a meaningfully related way.

(c) Sequential arrangement - the arrangement of a set of stored or selected items of knowledge in a semantically meaningful way.

(d) Choice of search terms - the display of semantically related terms from which the searcher may select those believed to represent the knowledge he seeks.

All these linking procedures require that semantic relations between terms and between subjects should, one way or another, be built into KOS. A relation between words is ‘semantic’, meaningful, if it reflects a relation that we believe to exist in reality. To that extent, at least, it must correspond to the way in which we organise our knowledge.

Types of KOS

There have been four stages in the development of KOS.

(1) First, the era in which KOS took the form of static structures (a subject index on paper, a card catalogue, etc). Subject entries can be single terms, or terms in combination (pre-coordinated). They are arranged either alphabetically, or ‘meaningfully’ in some subject pattern - typically that of a hierarchical classification.

In an alphabetical subject headings list, semantic relations can take the form of see references that link terms to others that, in the KOS concerned, are treated as synonyms; further, it may have see also references indicating indeterminate semantic relations to other terms. Homonyms may be distinguished by a bracketed label. Some such lists introduce ‘pockets’ of hierarchical classification (headings subdivided), thus using generic relations. When terms in the list are combined to form compound index headings, the relations between the combined terms are not explicit, though they may be understood by the knowledgeable user.

In an enumerative, precoordinated classification, the hierarchical links ostensibly represent the generic relation between a class and its subclasses, but in practice they may also be used for the class-membership relation. The nature of the link becomes somewhat indeterminate when, for example, a part or attribute is shown as a subclass of an entity. The schedule may include see references of indeterminate nature from one class to another, and the alphabetical index to the classification may use see references to link synonyms. Homonyms may occur at various places in the classification, and the meaning at each place is evident from the context.

In a faceted classification, a subject field or domain is first divided into explicitly named facet categories before introducing hierarchy. The use of categories such as entity, part, attribute, operation, place and time, and their combination to express subjects, implies a semantic relation between two successive facets, such as entity/part, entity/attribute. A given class of entity may be subdivided in more than one way, e.g. machines (entities) according to the operation they perform, or according to the material on which they work, or according to the nature of their end-product, thus introducing categories associated with the entities as characteristics of division. Faceted classifications (or indeed any classifications) may also use ‘relational indicators’ to express, for example, the ‘influence’ of one subject on another. See references are used as in enumerative schemes.

(2) Second, the ‘post-coordinate’ era, in which KOS took the form of independently manipulable elements (punched cards or entries in a computer file) that were used dynamically. Each element represents a term that can be separately assigned to an item. Search is conducted by coordinating several terms that represent the query, and matching the set against those assigned to items. There are no explicit semantic relations in this procedure; they are formed only in the minds of indexer or searcher. The searcher can introduce implied semantic relations between terms by the use of AND and OR operators.
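
The coordination procedure can be sketched in a few lines of Python. This is only an illustration under assumed data: the item numbers and term assignments are invented, and the function names coordinate_and and coordinate_or are hypothetical. Each term is an independent element carrying the set of items to which it was assigned, and a query is answered by set operations alone.

```python
# Hypothetical postings: each term maps to the set of item numbers
# to which an indexer assigned it.
postings = {
    "surface": {1, 2, 5},
    "cleaning": {2, 3, 5},
    "sandblasting": {5, 7},
    "polishing": {2, 8},
}

def coordinate_and(*terms):
    """Items to which every one of the terms was assigned (AND)."""
    sets = [postings.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def coordinate_or(*terms):
    """Items to which at least one of the terms was assigned (OR)."""
    return set().union(*(postings.get(t, set()) for t in terms))

# Query: surface AND cleaning AND (sandblasting OR polishing)
hits = coordinate_and("surface", "cleaning") & coordinate_or("sandblasting", "polishing")
print(sorted(hits))
```

Nothing in the stored data says how ‘surface’ and ‘cleaning’ are related in any item; the match records only that both terms were assigned to it.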

Semantic relations were introduced into post-coordinate systems by the construction of thesauri. These are alphabetical term lists within which (a) many terms are linked hierarchically in generic, class-membership (and sometimes entity/part) relations, and (b) other indeterminate relations between terms are indicated. Use cross-references indicate terms treated as synonyms. The various meanings of homonyms may be differentiated by subject field markers (qualifiers) or scope notes. Some thesauri may include a definition for each term. The indexer or searcher can browse through such a KOS to aid the choice of indexing and search terms. This tool has been developed into the ‘thesaurofacet’, which displays its terms in the two forms of faceted classification and thesaurus, the latter including relations between terms that do not occur in the classification.
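
A thesaurus of this kind can be pictured as a small data structure. The sketch below is an assumption-laden illustration, not a rendering of any published thesaurus: the terms, the labels BT, NT, RT and USE as dictionary keys, and the functions preferred and expand are all hypothetical choices.

```python
# Hypothetical thesaurus entries. BT = broader term, NT = narrower term,
# RT = related term, USE = preferred term for a non-preferred synonym.
thesaurus = {
    "mammals": {"BT": [], "NT": ["dogs"], "RT": []},
    "dogs": {"BT": ["mammals"], "NT": ["terriers"], "RT": ["kennels"]},
    "terriers": {"BT": ["dogs"], "NT": [], "RT": []},
    "kennels": {"BT": [], "NT": [], "RT": ["dogs"]},
    "canines": {"USE": "dogs"},
}

def preferred(term):
    """Follow a USE cross-reference, if any, to the preferred term."""
    return thesaurus.get(term, {}).get("USE", term)

def expand(term):
    """The preferred term together with all its narrower terms, at any depth."""
    found, stack = [], [preferred(term)]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(thesaurus.get(current, {}).get("NT", []))
    return found

print(expand("canines"))
```

A searcher who enters a non-preferred term is led to the preferred one, and a generic search can then be widened down the hierarchy. The RT link to ‘kennels’ is stored but its nature is left unstated, which is the indeterminacy discussed later.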

Some post-coordinate indexing systems have used ‘role indicators’, attachable to terms when they are combined into a compound. Thus we might have ‘surface/4 - cleaning/8 - sandblasting/9’, where role 4 = entity, 8 = operation, and 9 = agency. In effect, at the time of combination, each term is assigned to a category. One system was even more explicit, expressing, for example, the entity/operation relation by adding a reciprocal role indicator to each term; so we would have ‘surface/A - cleaning/B,C - sandblasting/D’, where A = entity operated on, B = operation on entity, C = operation effected by, D = agency effecting operation. Among the relations used in such systems were entity/attribute, entity/operation, operation/agency, operation/product, entity/component, property/measure.

Term lists used in online search aids found it useful to attach one or more of the following kinds of information to each term: part(s) of speech, semantic category, subject area marker, classification code, scope note, definition, links to semantically related terms, rules for disambiguating homonyms (Vickery & Vickery).

(3) Third, the Internet era, in which three kinds of online search have been used: (a) the hierarchical (and now sometimes faceted) classification used to display terms, through which the searcher can ‘step down’ until he reaches a term that best expresses his query, (b) citation references (URL links) from one Web item to another, (c) the ‘search engine’ index constructed by extracting single words from the texts of individual items. There are no explicit semantic relations in such an index, though the ability to search for terms adjacent or nearby in the text introduces implicit relations. The text word positions that the system records enable the display to the user of a ‘snippet’ of text that does provide him with words-in-relation.
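
The search engine index described under (c) can be sketched as follows. This is a toy illustration with two invented items; the function names tokenize, near and snippet are hypothetical, and real engines are of course far more elaborate. The point is that recording word positions is enough to support both proximity search and the display of a snippet.

```python
import re
from collections import defaultdict

# Hypothetical two-item collection.
documents = {
    1: "Sandblasting is used for cleaning a metal surface before painting",
    2: "The surface of the lake was calm and cleaning crews had left",
}

def tokenize(text):
    """Lower-case the text and split it into single words."""
    return re.findall(r"[a-z]+", text.lower())

# word -> item number -> list of word positions in that item
index = defaultdict(dict)
for doc_id, text in documents.items():
    for pos, word in enumerate(tokenize(text)):
        index[word].setdefault(doc_id, []).append(pos)

def near(a, b, window=3):
    """Items in which words a and b occur within `window` positions of each other."""
    postings_a, postings_b = index.get(a, {}), index.get(b, {})
    hits = []
    for doc_id in postings_a.keys() & postings_b.keys():
        if any(abs(pa - pb) <= window
               for pa in postings_a[doc_id] for pb in postings_b[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

def snippet(doc_id, word, width=2):
    """A few words either side of the first occurrence of `word` in the item."""
    words = tokenize(documents[doc_id])
    pos = index[word][doc_id][0]
    return " ".join(words[max(0, pos - width): pos + width + 1])

print(near("cleaning", "surface"))
print(snippet(1, "surface"))
```

Both items contain ‘cleaning’ and ‘surface’, but only in the first do the words stand close together; adjacency is the only ‘relation’ the index can offer.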

(4) The hoped-for era of the ‘Semantic Web’. While all preceding KOS have been intended for direct human use, the new conception here is to create KOS that can be used by ‘intelligent’ software agents in their search for information. If this search is to be ‘meaningful’, it must make use of semantic relations, and moreover these relationships must be explicit within the KOS. Computer software can only ‘know’ that a particular semantic relation exists between two terms if it has been explicitly told so, or at least supplied with data and inference procedures whereby it can ‘deduce’ the fact. The kind of KOS that can provide this facility has become known as an ‘ontology’.

An ontology has been defined as ‘an explicit conceptualization of a domain of discourse, which provides a shared and common understanding of the domain’. It is therefore intended to be explicitly based on what is understood to be the structure of knowledge. It is also clearly something more than a ‘finding aid’ for text items, though ‘finding’ appropriate items on the Web is part of the function of an intelligent information agent. Ontologies will be further discussed later.

Controversy

There has been continual controversy, in one form or another, about the question: do finding aids need semantic knowledge? This has lain behind the old arguments between advocates of classification (‘they certainly need generic relations’) and of alphabetic indexing (‘maybe we need some, but they should only be supplementary’). Many developments over the years - the elaboration of subject headings lists into thesauri, the move from enumerative to faceted classification, the use of semantic relations in online search aids - have shown an increasing interest in and use of semantic relations in information retrieval. But the coming of the Internet search engines seemed to be sweeping these developments aside.

This lent importance to another controversy: should language control by the use of term lists be undertaken at both the indexing and search stages, or only at the search stage? That is, should we index (and therefore necessarily search) using a term list based on standardised terms and semantic relations, or index and search by ‘natural language’ terms and only use semantic relations at the point of search? To take a simple example: suppose a term A has a number of synonyms or equivalent expressions B, C, D, etc. Should we replace all such equivalents found in text by A in the index, and direct searchers to do the same? Or should we index by A, B, C, D, etc, just as they occur in text, and ensure that in searching all these terms are OR’ed together? The first option certainly requires much more work, by both indexer and searcher. A search engine carries out the second indexing option automatically, and when simply used disregards the corresponding search option. There are however experiments by search engine operators to tackle some synonym problems such as abbreviations (HP or Hewlett Packard) and alternative spellings (or misspellings), and homonyms (people with identical names); and also to display to the searcher the multiple meanings of any homonym entered as a search term.
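
The two options can be set side by side in a short Python sketch. The items, the synonym table and the function names are invented for illustration only; the abbreviation example is the one mentioned above.

```python
# Terms as they occur in the text of each item (hypothetical data).
item_terms = {
    1: ["HP", "printers"],
    2: ["Hewlett Packard", "printers"],
    3: ["Hewlett-Packard", "laptops"],
    4: ["IBM", "laptops"],
}

# Variant expression -> standardised term (the 'A' of the example).
standard = {"HP": "Hewlett Packard", "Hewlett-Packard": "Hewlett Packard"}

def build_index(normalise):
    """Index the items, replacing variants by the standard term if asked to."""
    index = {}
    for item, terms in item_terms.items():
        for term in terms:
            key = standard.get(term, term) if normalise else term
            index.setdefault(key, set()).add(item)
    return index

controlled = build_index(normalise=True)    # option 1: control at indexing
free_text = build_index(normalise=False)    # option 2: terms as they occur

def search_controlled(term):
    """Option 1: the searcher's term is also replaced by the standard term."""
    return controlled.get(standard.get(term, term), set())

def search_free_text(term, expand=True):
    """Option 2: OR together all equivalents at search time, if expand is set."""
    if expand:
        a = standard.get(term, term)
        variants = {a} | {v for v, s in standard.items() if s == a}
    else:
        variants = {term}
    return set().union(*(free_text.get(v, set()) for v in variants))

print(sorted(search_controlled("HP")))
print(sorted(search_free_text("HP")))
print(sorted(search_free_text("HP", expand=False)))
```

With expansion switched off, the free text search behaves as a simply used search engine does: it finds only the items that happen to use the searcher's own expression.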

Since most search engines cover wide areas of subject matter (many indeed aspire to cover the whole of knowledge on the ‘visible’ Web), and vast stores of textual items, the problems of synonyms and homonyms are very severe. Despite attempts to ‘rank’ the search output so that those items most likely to be ‘relevant’ are displayed first, we are all aware of many irrelevant ‘hits’ and of missed items.

It has increasingly been claimed that the use of semantic relations in finding aids is restrictive rather than beneficial. Any single KOS, such as a classification or thesaurus, uses only a limited number of the myriad semantic relations that exist in textual items, any of which may be of interest to one searcher or another. The KOS, it is claimed, imposes an over-rigid framework that impedes flexible search. Only in a narrow subject field, where the interests and terminologies of authors and readers are already homogeneous, can we expect a standardised term list and a limited set of semantic relations to be successful. (It may well be argued that exactly the same may be said of a free text search engine - only if it covers a narrow homogeneous field is it likely to perform well.)

Response to this claim can be of two kinds. One is to accept it, to minimise the importance of semantics, standard terminologies and the like, and to embrace free text search, as do the Internet search engines. The other is to try to improve KOS, to provide more relations and more flexibility, and if possible to make use of some of the capabilities of ‘intelligent’ software. Some of these latter responses will now be considered.

Developing thesauri

The limitations of existing KOS have been summarized by Soergel et al. as follows:

The indeterminate nature of the ‘related term’ link in thesauri has long given rise to discussion, and the ANSI/NISO Guidelines for monolingual thesauri sets out a series of possible semantic relations that an RT link might represent, for example: field of work/practitioner; operation/instrument; process/agent; process/counteragent; action/product; action/target; object/special attribute; property/measure; measure/instrument. Other relations proposed are: source/product; action/property of action; operation/method; object/use.

In the light of the limitations of existing KOS (listed above), Soergel et al. have been exploring the conversion of a traditional (agricultural) thesaurus into ontology format. This would involve representing BT, NT and RT links in the form of specific predicates, for example:

BT replaced by (memberOf) or (isa) or (componentOf) or (spatiallyIncludedIn)
NT replaced by (hasMember) or (includesSpecific) or (hasComponent) or (spatiallyIncludes)

RT links would be replaced by more specific relations, such as those set out below:

X (causes) Y / Y (causedBy) X
X (instrumentFor) Y / Y (performedByInstrument) X
X (processFor) Y / Y (usesProcess) X
X (beneficialFor) Y / Y (benefitsFrom) X
X (treatmentFor) Y / Y (treatedWith) X
X (harmfulFor) Y / Y (harmedBy) X
X (hasPest) Y / Y (afflicts) X
X (growsIn) Y / Y (growthEnvironmentFor) X
X (hasProperty) Y / Y (propertyOf) X
X (hasSymptom) Y / Y (indicates) X
X (similarTo) Y / Y (similarTo) X
X (oppositeTo) Y / Y (oppositeTo) X
X (hasPhase) Y / Y (phaseOf) X
X (ingests) Y / Y (ingestedBy) X
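
Because each relation in the list is given together with its reciprocal, a statement need be entered only once and its inverse can be generated mechanically. The following Python sketch illustrates the idea; the selection of predicate pairs is taken from the list above, but the example statements about rice and the function name with_inverses are my own hypothetical illustrations, not data from the thesaurus concerned.

```python
# A few of the predicate pairs listed above, each with its reciprocal.
PAIRS = [
    ("causes", "causedBy"),
    ("treatmentFor", "treatedWith"),
    ("hasPest", "afflicts"),
    ("growsIn", "growthEnvironmentFor"),
    ("similarTo", "similarTo"),   # a symmetric relation is its own reciprocal
]

# Look-up table that works in both directions.
INVERSE = {}
for forward, backward in PAIRS:
    INVERSE[forward] = backward
    INVERSE[backward] = forward

def with_inverses(statements):
    """Return the given (X, predicate, Y) statements plus their reciprocals."""
    triples = set()
    for x, predicate, y in statements:
        triples.add((x, predicate, y))
        triples.add((y, INVERSE[predicate], x))
    return triples

# Hypothetical statements replacing two undifferentiated RT links.
statements = [
    ("rice", "growsIn", "paddy field"),
    ("rice", "hasPest", "stem borer"),
]

for triple in sorted(with_inverses(statements)):
    print(triple)
```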

Topic maps

The elements with which ‘topic maps’ operate are topics, topic types, topic names, associations and their types and roles, occurrences and their types (Pepper).

A ‘topic’, in this context, can be any specific instance whatsoever - a person, an entity, a concept, really anything - regardless of whether it exists, about which anything whatsoever may be asserted by any means whatsoever. In the domain of Opera, for example, topics might include Tosca, Madam Butterfly, Puccini, Lucca, Rome, Italy: we could call them ‘terms’. Each topic may be an instance of a ‘topic type’: thus Tosca and Madam Butterfly are operas, Puccini is a composer, Lucca and Rome are cities, Italy is a country. These types are classes of which the topics are instances (class-membership relation). Such classes can themselves be declared as topics, and in turn might figure as subclasses of more general topic types. So we may have a hierarchy of topics (terms).

A topic may have one or more ‘names’. ‘Names exist in all shapes and forms: as formal names, symbolic names, nicknames, pet names, everyday names, login names, etc.’ Topic map KOS assign a ‘base name’ (preferred term) to each topic, and other variant names (synonyms) to be used in various contexts. A given name may be a homonym, e.g. Tosca the opera, and Tosca a character in the opera. These can be distinguished by adding scoping labels.

Topics can have ‘associations’ between them, relations between topics. Some examples might be ‘Tosca was composed by Puccini’, ‘Tosca takes place in Rome’, ‘Puccini was born in Lucca’, ‘Lucca is in Italy’, ‘Puccini was influenced by Verdi’. These are semantic relations, comparable to the detailed RT links discussed above, but there is no limit to the variety of associations that may be used. There may be association types, just as there are topic types. Each topic that participates in an association plays a role in that association, the ‘association role’. In the case of the relationship ‘Puccini was born in Lucca’, those roles might be ‘person’ and ‘place’; for ‘Tosca was composed by Puccini’ they might be ‘opera’ and ‘composer’. Association roles thus correspond to the role that categories play in faceted classification. An association specifies the name and nature of a semantic relation (such as ‘composed by’); its roles specify the categories of topic that enter into this instance of the relation. Associations, their types and roles can themselves be declared as topics. [Note that the word ‘facet’ is used with a different meaning in the paper cited.]

Lastly, topics can have ‘occurrences’, i.e. links to documentary items, and individual occurrences may be assigned to various occurrence types such as ‘monograph’, ‘article’, ‘illustration’.

The outcome is that a topic map establishes an intricate network of links layered over a collection of documentary knowledge within a domain. These links can be used to navigate the knowledge in many different ways. The network can be extended to grow with the collection, or can be merged with other topic maps to provide additional paths through the knowledge. A single knowledge collection may have many topic maps, as a topic map just provides one particular set of paths through it.
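
The elements just described can be held in quite simple structures. The Python sketch below uses the Opera examples from the text; the layout of the dictionaries, the occurrence locators such as ‘doc-17’, and the function names are assumptions of mine and do not follow any topic map standard or syntax.

```python
# Topics, each with a topic type and one or more names (base name first).
topics = {
    "tosca": {"type": "opera", "names": ["Tosca"]},
    "puccini": {"type": "composer", "names": ["Puccini", "Giacomo Puccini"]},
    "lucca": {"type": "city", "names": ["Lucca"]},
}

# Associations: a type, and the role played by each participating topic.
associations = [
    {"type": "composed by", "roles": {"opera": "tosca", "composer": "puccini"}},
    {"type": "born in", "roles": {"person": "puccini", "place": "lucca"}},
]

# Occurrences: links from topics to documentary items (hypothetical locators).
occurrences = [
    {"topic": "tosca", "type": "article", "locator": "doc-17"},
    {"topic": "puccini", "type": "monograph", "locator": "doc-42"},
    {"topic": "puccini", "type": "illustration", "locator": "doc-43"},
]

def associated(topic_id):
    """Every (association type, role, topic) linked to the given topic."""
    links = []
    for association in associations:
        players = association["roles"]
        if topic_id in players.values():
            for role, other in players.items():
                if other != topic_id:
                    links.append((association["type"], role, other))
    return links

def occurrences_of(topic_id, occurrence_type=None):
    """Locators of the documentary items attached to a topic."""
    return [o["locator"] for o in occurrences
            if o["topic"] == topic_id
            and (occurrence_type is None or o["type"] == occurrence_type)]

print(associated("puccini"))
print(occurrences_of("puccini", "monograph"))
```

Navigation then consists of moving from a topic to its associated topics and from any topic to its occurrences, which is the sense in which the map is ‘layered over’ the collection.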

All the details of a topic map may be stated in a variant of the XML (eXtensible Markup Language), using RDF, a standard software-understandable format in which statements can be made about the properties and relations of anything that is ‘on the Web’. It is this feature that makes topic maps an important new development. Their semantic features, as illustrated above, parallel those in recent thesauri, but place no restrictions on the variety of relations that may be introduced.

Ontologies

Another definition of an ontology is ‘a systematic formalization of concepts, definitions, relationships, and rules that captures the semantic content of a domain in a machine-readable format’. One important aspect of an ontology is that it is a KOS designed, not only to be in machine-readable format, but also to be usable by computer software in automated knowledge management within the subject domain. In this sense, topic maps may be regarded as a form of ontology.

A very clear exposition of basic ontology development has been provided by Noy and McGuinness. Their basic elements are (a) classes - the basic terms in a domain about which we would like either to make statements or to explain to a user; (b) their subclasses (generic relation); (c) instances of the (sub)classes (class-membership relation); (d) ‘slots’ related to the classes and instances. In general, there are several types of object properties that can become slots in an ontology. For example, where the main class is an entity such as Wine, we may have ‘intrinsic’ properties such as the flavor of a wine; ‘extrinsic’ properties such as a wine’s name, and the area it comes from; parts of the entity, if the object is structured (these can be both physical and abstract ‘parts’, e.g., the courses of a meal); and relationships to other entities, such as the maker of a wine (the winery) and the grape the wine is made from. Thus the slots correspond to what might be other facets in a Wine classification [note that the word ‘facet’ is used with a different meaning in the paper cited]. Prieto-Diaz offers a methodology of ontology development that explicitly uses our concept of facets to structure the top level of classes.

A more elaborate type of ontology is exemplified by those developed by Teknowledge (Nichols and Terry). There are a number of basic types of term: class, individual, attribute, relation (predicate or function). By combination of terms, assertions are created and entered into the KOS.

Classes ‘are like generic nouns that can be applied to distinct, named or nameable, individuals (examples of classes are Human, Dog, Company, Assassination, Cleaning)’. There are classes of entities and of events. To each class is attached a clear definition that captures its meaning. Classes in a domain are arranged hierarchically. The generic relation between a class and a subclass is set out explicitly in an assertion by using the predicate subclass, as in the example ‘subclass Terrier Dog’. The instance relation between a class and an individual is also explicitly asserted, e.g. ‘instance Blackie Terrier’, ‘instance KennedyAssassination Assassination’.

Attributes are the qualities or properties of classes or individuals, and these too are explicitly asserted in the KOS: ‘attribute Terrier Furry’. Attributes too can be arranged hierarchically by assertions such as ‘subattribute Red Colour’, ‘subattribute Scarlet Red’. Any individual or subclass ‘inherits’ all the characteristics of its parent class.

Predicates explicitly display relations between classes or individuals, entities or events. I have already noted the special cases of subclass, instance, attribute. But any relation can be used as a predicate, e.g. ‘father Brian Adam’, ‘employee Newspaper Adam’, ‘belongs Brian Blackie’. A predicate would also be used to assert synonymity: ‘identical Buonaparte Napoleon’.

As well as assertions, such an ontology contains inference rules using an ‘if-then’ operator, for example: ‘if (instance X Dog) then (chases X Cats)’, where X = any individual. Coupling this with ‘instance Blackie Terrier’ and ‘subclass Terrier Dog’ leads us to conclude that ‘chases Blackie Cats’.
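
The deduction can be followed step by step in a small Python sketch. It is an illustration only and makes no claim to reproduce the Teknowledge software: assertions are held as plain tuples, the function name infer is hypothetical, and I have assumed that inheritance is applied by a second rule stating that an instance of a class is also an instance of its parent class.

```python
# Explicit assertions, written as (predicate, first term, second term).
facts = {
    ("subclass", "Terrier", "Dog"),
    ("instance", "Blackie", "Terrier"),
}

def infer(assertions):
    """Apply the two rules repeatedly until no new assertion appears."""
    known = set(assertions)
    while True:
        new = set()
        for predicate, x, cls in known:
            if predicate != "instance":
                continue
            # Assumed inheritance rule:
            # if (instance X C) and (subclass C D) then (instance X D)
            for other, sub, parent in known:
                if other == "subclass" and sub == cls:
                    new.add(("instance", x, parent))
            # Rule from the text: if (instance X Dog) then (chases X Cats)
            if cls == "Dog":
                new.add(("chases", x, "Cats"))
        if new <= known:
            return known
        known |= new

conclusions = infer(facts)
print(("chases", "Blackie", "Cats") in conclusions)
```

The software reaches the conclusion in two passes: first that Blackie is an instance of Dog, then that Blackie chases Cats. Neither statement was entered explicitly.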

The main characteristics of such an ontology, compared to traditional KOS, are therefore (a) every semantic relation between terms, including generic and class-membership relations and synonyms, is explicitly asserted, (b) there is no limit to the variety of relations that can be used, and (c) inference rules link assertions so that deductions can be made from explicit assertions to others that are logically implied by them. In this way, ‘the semantic content of a domain’ is captured.

Conclusions

Can we draw any tentative conclusions about the possible development of KOS as finding aids? It is clearly necessary to distinguish between those tools aiming to cover the whole wide world of knowledge on the Internet, for use by the whole wide world of people, and other tools of much more specific coverage and specialist use.

A general search engine could only afford to create a ‘structured’ index to its vast input if its documentary sources already provided such ‘metadata’ in some agreed standard format, and this is impossible to envisage. Advocates of current search engines explicitly say ‘you can’t trust metadata, we have turned away from it’ (Norvig). It is also impossible to imagine that the millions of users of such a search engine could be persuaded to enter their queries in any structured format - the ease of natural language query is so beguiling. It is therefore difficult to see how search engine operators can act differently from the way they are now acting - seeking to provide tools at search time to aid specific types of synonym and homonym problems. As Norvig put it, ‘Somebody's got to do that kind of canonicalization. The problem of understanding content hasn't gone away; it's just been forced down to smaller pieces’. We may hope that as time goes on they will find ways to tackle ever more ‘small pieces’.

A free text search engine has one great advantage over any structured KOS: it does not have to undertake any intellectual updating of its content of terms and semantic relations - these all occur spontaneously within the textual knowledge they store and index. In any fast-growing field of knowledge, keeping a KOS up-to-date is a severe problem. It has even been said that a field can only be satisfactorily classified when it has ceased to grow. Therefore any developments that enable ready updating, revision, extension and change of a KOS are to be welcomed. The flexibility of such tools as topic maps seems to point in the right direction.

The need for flexibility, and to cater for the variety of possible approaches to documentary knowledge, leads some observers to abandon standardised structures and to encourage each knowledge user to provide his own index ‘tags’ to documentary items that he comes across, collectively building up a ‘KOS’ such as ‘del-icio-us’. The net product may have value, but it does not solve the problem of ‘coming across’ the items in the first place - it is not a general finding aid. Some controlled form of tagging might be a way that individual users could contribute to the development of a tool such as a topic map.

Scientific, industrial, commercial and governmental enterprises all continue to emphasise the value and necessity of standardised terminology for their purposes, and this standardisation includes the identification of specified semantic relations between the terms. Since an ever-increasing proportion of their documentary knowledge is now in machine-readable form, they see the value of being able both to locate it and to manipulate it by computer software. This implies both that the documents themselves should be structured, capable of being taken to pieces that can be independently used, and that these individual pieces should be indexed by KOS that provide sophisticated semantic access to the collection.

But while the computer handling of both documents and their semantic content may need to grow more sophisticated, it becomes ever more necessary to hide that sophistication from the user, to offer to him an easy-to-use interface to the KOS and the document collection.

Clearly, we are at the start of a long period of experimentation and development in the evolution of knowledge organisation systems.

References
