Preface to the Second Edition (1989)

The history of the Oxford English Dictionary (continued)


The New Oxford English Dictionary project (continued)

Data structure

Once firm plans had been made, it was intended that the conversion of the text into electronic form should begin as soon as possible. Preparations at ICC were by now well advanced. But for data capture to begin, a system for structuring the text had to be agreed upon. It was resolved that the tagging language inserted into the electronic version should do more than simply express the typographical features - layout, typeface, type size, font - of the printed text. It must, as its primary function, identify the structural elements which combine to form a dictionary entry. This was a prerequisite both for the development of the database in the future, and, as it turned out, for the automatic processes applied to the text in the course of integration.

Several months were devoted to the analysis of the structure of the OED and its Supplement, resulting in an inventory of the most important structural elements (amounting to between forty and fifty) and their current typographical realizations. The translation of this scheme into a system of tags, though not without its difficulties, was straightforward compared to the immense task of ensuring that each element of Dictionary text was supplied with the correct tag. It emerged from discussions with ICC that a tagging scheme of such size and complexity would be very hard to insert accurately into the text at the stage of initial data capture. It would require so much knowledge that the training of keyboarders would be very long and the typing very slow. It would also require extensive pre-editing of the text, which again would take an excessively long time and require much training. On the other hand, a more modest scheme would be manageable. Accordingly, a compromise mark-up scheme was devised. The fifteen or so most prominent textual elements received tags with structural meaning, while all other features of the text were coded with tags that had a conventional typographical meaning. Further coding was deferred to a later stage. Even with this scheme, ICC found it necessary to carry out a considerable amount of preliminary mark-up, conduct lengthy training sessions, and undertake several proof-reading cycles, before the text was ready to be shipped to Oxford.
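By way of illustration only, the distinction between the two kinds of tag in such a compromise scheme might be pictured in the short Python sketch below; the tag names are invented for the purpose and are not those actually used by ICC.

    import re

    # A hypothetical fragment of captured text: a few elements carry
    # structural tags (hw = headword, sen = sense, def = definition,
    # q = quotation), while other features fall back on purely
    # typographical tags (i = italic, b = bold, sc = small capitals).
    SAMPLE = ("<hw>example</hw> <sen><def>A thing characteristic of its "
              "kind.</def> <q><i>a1400</i> An exaumple to us alle.</q></sen>")

    STRUCTURAL = {"hw", "sen", "def", "q"}   # tags with structural meaning
    TYPOGRAPHICAL = {"i", "b", "sc"}         # tags with typographical meaning

    def classify_tags(text):
        """Split the tags found in a fragment into the two classes."""
        tags = set(re.findall(r"</?([a-z]+)>", text))
        return tags & STRUCTURAL, tags & TYPOGRAPHICAL

    print(classify_tags(SAMPLE))
    # ({'hw', 'sen', 'def', 'q'}, {'i'}), set order may vary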

On 15 May 1984, at a press conference in the premises of the Royal Society in London, a formal announcement of the launching of the New Oxford English Dictionary Project was made, including the news that IBM UK Ltd. would be making a substantial donation to the first phase of the project. Meanwhile, work on the means of carrying out the integration of the text was continuing in collaboration with IBM. Matters needing development were identified as: the database management system for holding and protecting the electronic text, the software tool by which it might be edited, and a means of correcting cross-references affected by integration. There was also the problem of enhancing the system of tagging introduced by ICC so that it should be an entirely ‘generalized’ mark-up language, that is to say, one having structural, not typographical signification. At first this planning was conducted by means of regular meetings between staff from OUP and IBM, but at length, in mid-July, the first secondee from IBM arrived at OUP as the project's computer group manager, and began to build up his team. From then on, the main instrument by which progress was monitored and problems were identified was a formal system of meetings, some at half-yearly and monthly intervals, at which representatives of the management of IBM were present, others occurring weekly and dealing with the minutiae of the project team’s work.

During the following autumn the project gathered momentum. In September the University of Waterloo was granted Canadian Government funding with which to establish a Centre for the New OED as a focus for database research, from the point of view of both the academic user and the computer scientist. Early sketches of a potential database structure had already been made, and, more importantly, the project had attracted the interest of several researchers who might be able to provide parsing software which would facilitate the enhancement of the mark-up language. After some months of experimentation at the University of Waterloo, work was begun on this part of the system by the project's computer group, a vital contribution at the start being made by a secondee from Waterloo.

Also in September 1984, ICC sent to Oxford test data consisting of 100 pages of Dictionary text on magnetic tape. This not only proved the feasibility of the scheme for data capture but also made it possible to try out methods of proof-reading.

In October the project team drew up a formal Statement of User Requirements, which set out the aims of the first phase and the operations which the computer system would be required to perform. This gave the computer group a basis on which to develop their detailed design of the system, an activity which occupied their attention over the two succeeding years. An Editorial Board was constituted, consisting of about forty scholars in a wide range of disciplines, the idea being that they should give advice to the project team, especially when the revision, updating, and enhancement of the dictionary were planned.

Data capture

At the beginning of November 1984 the computer equipment from IBM was installed. At the same time, ICC began data capture in earnest. A team of ICC copy editors, based in Fort Washington, Pennsylvania, began to insert structural mark-up on enlarged copies of the Dictionary pages. These were passed to the data conversion personnel (both on the same site and in Tampa, Florida) for keyboarding. Data-validation routines and sample proof-reading were carried out by ICC before the proofs were shipped to Oxford. It was stipulated that the rate of errors should be no more than 7 in 10,000 keystrokes; and this requirement was met.
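The stipulated ceiling amounts to a simple acceptance test on each sampled batch. A minimal sketch, assuming errors are counted against keystrokes in a sampled proof (the function name is invented):

    def within_tolerance(errors_found, keystrokes, max_per_10k=7):
        """True if the sample meets the stipulated ceiling of at most
        7 errors in 10,000 keystrokes."""
        return errors_found * 10_000 <= max_per_10k * keystrokes

    # A 50,000-keystroke sample may therefore contain at most 35 errors.
    print(within_tolerance(35, 50_000))   # True
    print(within_tolerance(36, 50_000))   # False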

The first batch of magnetic tapes and proofs arrived in January 1985, and proof-reading immediately got under way. From then until June 1986 a regular cycle of data capture, proof-reading, and data correction was maintained. A team of some fifty freelance proof-readers was directed from Oxford. They were required to check not only the accuracy of the text but also the selection and positioning of the computer codes. They were provided with a detailed manual describing the structure of the Dictionary and the correct application of the tagging system. Double proof-reading - the reading of the same section of text by two people independently, followed by cross-collation - was employed for a trial period. It proved, owing mainly to the very low error rate maintained by ICC, not to reveal a markedly higher number of errors than a single reading; certainly not enough to justify the double outlay of expense and editorial effort. A single reading was therefore conducted, but experienced staff checked, emended, and supplemented all the corrections before the proofs were returned to ICC. In addition, a system of monitoring the proof-readers' work by detailed rechecking of random samples was carried out until satisfactory standards had been achieved. During the same stage, a prototype of the parsing program was run on most of the electronic text to validate its structure: this functioned rather like an additional (and, within certain limits, infallible) proof-reader.
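The economics of the double-reading trial can be made concrete with a toy model, offered purely as illustration and not as the project's own analysis: if each reader independently catches a fixed fraction of the errors present, a second reading improves the catch rate, but when the incoming error rate is already very low the absolute number of additional errors found is small.

    def residual_errors(errors_in, catch_rate, readings):
        """Expected errors surviving a given number of independent
        readings, each catching the same fraction of what remains."""
        return errors_in * (1 - catch_rate) ** readings

    # Suppose 70 errors enter (7 per 10,000 over 100,000 keystrokes)
    # and each reader catches 80 per cent of them.
    print(residual_errors(70, 0.8, 1))   # 14.0 remain after one reading
    print(residual_errors(70, 0.8, 2))   # about 2.8 remain after two
    # The second reading catches only some 11 further errors per
    # 100,000 keystrokes, at double the cost in time and money.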

When ICC returned the corrected tapes, these were subjected to a further check, on the screen, to ensure that the corrections had been carried out within the agreed margins. This left the text with an estimated residual error-rate of only 1 in 235,000 characters. Since most of these were minor errors of punctuation and spacing, and the text would subsequently be proof-read a second time, this was felt to be an acceptable level at which the data could proceed to automatic processing by computer.

Computer development

In July 1985 the computer group issued an Outline System Design, describing the essential components and features of the New OED computer system. Over the following eighteen months, in close consultation with the lexicographers, the group built a unique dictionary system tailored to the special needs of the project.

Once the text had been captured, it was loaded on to the project's IBM 4341 mainframe at OUP. It was important that it should be stored in a database system that would allow the necessary access and processing facilities. The operating system used was IBM's VM370; the database management system was SQL/DS. Every new version of the data created by each successive stage of processing and editing was retained in the database; no older version was overwritten, and the whole was regularly archived on to magnetic tape and stored at a remote site for safety.
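The retention policy can be pictured as an append-only version store. A much-simplified sketch follows, using SQLite merely as a stand-in; the table and column names are hypothetical, not those of the project's SQL/DS schema.

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE entry_version (
                    entry_id TEXT,
                    version  INTEGER,
                    stage    TEXT,
                    body     TEXT,
                    PRIMARY KEY (entry_id, version))""")

    def save_version(entry_id, stage, body):
        """Append a new version; earlier versions are never overwritten."""
        (last,) = db.execute(
            "SELECT COALESCE(MAX(version), 0) FROM entry_version"
            " WHERE entry_id = ?", (entry_id,)).fetchone()
        db.execute("INSERT INTO entry_version VALUES (?, ?, ?, ?)",
                   (entry_id, last + 1, stage, body))

    save_version("abandon", "capture", "<hw>abandon</hw> ...")
    save_version("abandon", "parsed", "<entry><hw>abandon</hw> ...</entry>")
    print(db.execute(
        "SELECT version, stage FROM entry_version WHERE entry_id = ?",
        ("abandon",)).fetchall())        # [(1, 'capture'), (2, 'parsed')]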

The structure devised by Sir James Murray and used by him and all his successors for writing Dictionary entries was so regular that it was possible to analyse the entries as if they were sentences of a language with a definite syntax and grammar. They could therefore be parsed, and this was the next process to which the text was submitted. The objective of parsing, as already mentioned, was to transform the text into a version categorized by a system of generalized mark-up, known as SGML (Standard Generalized Mark-up Language), in which each element is identified by its function, not its printed appearance. The programs used for parsing were written by staff of the University of Waterloo. The ‘grammar’ of the Dictionary text with which they operated was written at Oxford. It was developed by running a postulated grammar against the Dictionary text to establish whether the latter could be transformed without rejection of the input or ambiguity in the output. Revised versions of the grammar were run repeatedly until the closest possible approximation was achieved. The grammar had to be descriptive, not prescriptive, since the computer could not be allowed to override lexicographical judgement, and only the most minor rewriting of the text to accommodate computerization was acceptable.
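The iterative refinement of the grammar (run a postulated version against the text, collect the rejections, revise, and run again) can be sketched in miniature. Here the ‘grammar’ is merely a regular expression over the structural tags, an illustration rather than the Waterloo software.

    import re

    # Postulated grammar: an entry is a headword, an optional
    # etymology, then one or more senses.
    GRAMMAR = re.compile(r"^<hw>(<etym>)?(<sen>)+$")

    def tag_skeleton(entry):
        """Reduce an entry to its sequence of opening structural tags."""
        return "".join(re.findall(r"<(?:hw|etym|sen)>", entry))

    def run_grammar(entries):
        """Return the entries the current grammar cannot account for,
        so that the grammar (or the text) can be revised."""
        return [e for e in entries if not GRAMMAR.match(tag_skeleton(e))]

    entries = ["<hw>cat</hw><sen>...</sen>",
               "<hw>dog</hw><etym>...</etym><sen>...</sen><sen>...</sen>",
               "<sen>orphan sense with no headword</sen>"]
    print(run_grammar(entries))   # only the third entry is rejected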

A particularly important proposal in the outline design was that the computer system should automatically carry out as much as possible of the integration of corresponding OED and Supplement entries, leaving the lexicographical team the task of correcting errors, harmonizing adjacent text, and coping with difficult cases. The integration routines used the mark-up to create a single sequence of text from the two component parts, following the main structural cues (headwords and sense divisions) and the instructions in the Supplement that were identified as ‘integration instructions’ during parsing. Subsequent analysis of the integration program's performance showed that it successfully handled about 80 per cent of the text, and spared the lexicographers and keyboarders between 50 per cent and 60 per cent of the number of tasks which they would otherwise have been obliged to perform interactively at the computer screen.
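In much-simplified form, the merging logic might be sketched as follows; the data shapes and the criterion for setting cases aside are invented for illustration, not taken from the project's programs.

    def integrate(oed, supplement):
        """Merge Supplement material into OED entries by headword;
        overlapping cases are set aside for the lexicographers."""
        merged, manual = dict(oed), []
        for headword, senses in supplement.items():
            if headword not in merged:
                merged[headword] = senses                     # new entry
            elif all(s not in merged[headword] for s in senses):
                merged[headword] = merged[headword] + senses  # append senses
            else:
                manual.append(headword)       # overlap: human judgement
        return merged, manual

    oed  = {"cat": ["1. A feline."]}
    supp = {"byte": ["1. A group of eight bits."],
            "cat":  ["1. A feline.", "2. A jazz enthusiast."]}
    merged, manual = integrate(oed, supp)
    print(manual)   # ['cat'] is left for interactive integration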

Integration caused the targets of thousands of cross-references to be changed, rendering the cross-references inaccurate. To cope with this problem, every cross-reference identified by the parser was numbered and copied; after integration, the stored copies were automatically matched with their targets, changed wherever necessary, and returned to the text. In a similar way the pronunciations were copied, translated into the International Phonetic Alphabet, and restored.
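A sketch of this copy-and-rematch approach, with invented names and a hypothetical renumbering map:

    def repair_xrefs(xrefs, renumbering):
        """xrefs: {xref_id: (entry, old_sense)}, recorded before
        integration; renumbering: {(entry, old_sense): new_sense},
        derived from the integrated text."""
        repaired = {}
        for xref_id, (entry, old_sense) in xrefs.items():
            new_sense = renumbering.get((entry, old_sense), old_sense)
            repaired[xref_id] = (entry, new_sense)
        return repaired

    xrefs = {1: ("set", "12b"), 2: ("run", "3a")}
    # Integration inserted Supplement senses, shifting 'set' 12b to 13b.
    renumbering = {("set", "12b"): "13b"}
    print(repair_xrefs(xrefs, renumbering))
    # {1: ('set', '13b'), 2: ('run', '3a')}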

The problem arose of finding editorial software suitable for emending and integrating entries interactively at the computer terminal. Failing to discover any proprietary software that was adequate, the team resolved to develop its own. The product of this development was a new kind of text editor, designed for structured text, and originally known as LEXX. The initial work was carried out by an IBM secondee, and then taken over and extended by the OUP staff. This highly versatile editorial tool was designed to interface with a number of programs that controlled access to the Dictionary data held on the computer, allowed entries to be proofed for immediate checking, and provided a complete working environment with checks and controls to protect the integrity of the text. The combined sub-system was eventually named the OED Integration, Proofing, and Updating System (OEDIPUS).

Once editing was complete, the text was to be transferred for composition of galley and page proofs. It was decided that this part of the process should be performed by an outside supplier.

During 1986 data capture of the main OED and Supplement text was completed (the remaining text - the entirely new entries and the bibliography - was keyboarded during the following half-year). The last of the eighteen monthly batches of proofs was returned, corrected, to ICC in mid-August. A month later the automatic processing of the Dictionary data on the computer system began. First the text was read on to the system and validated. Next the parser was run. Structural errors encountered by the parser were corrected on-line by the editorial group. During the three months that elapsed, 5,711 corrections were made. Automatic integration itself began in March 1987, and the automatic processing of the whole text of the Dictionary was completed at the end of May.

The editing of the integrated text

After subjecting OEDIPUS to acceptance trials, the editorial group was given access to the system at the end of June 1987. The most efficient working method had already been determined by experimentation. Proofs, or, more strictly speaking, printouts, of all entries that were subject to integration and of the modifications resulting from it were run off by the computer system. The lexicographical group would work through these, examining the results of automatic integration and making corrections and other emendations. These alterations would be entered into the text on-line by a separate group of keyboarders. Galley proofs of the complete integrated text would then be produced by an outside supplier. Accordingly, editing of the printouts began in June, and, at the same time, a team of keyboard operators was engaged, trained, and assigned to the task of ‘interactive integration’.

After the first few months, during which no galley proofs were composed, the editorial group found itself occupied on several fronts simultaneously. On account of its huge size, the text was handled by the computer in forty alphabetical ranges or ‘tables’. At any one time, the group would be editing up to half a dozen text tables. Each of these would be undergoing one of four consecutive editorial processes. The first was the editing by lexicographers of proofs of all entries that had in any way been modified by the integration and cross-referencing programs. Next, these marked-up proofs were passed to the keyboard operators, who made the necessary emendations to the electronic text. At this stage, a number of other corrections had also to be made, some unconnected with the action of integration; also, many complicated problems of integration came to light (including entries that had wrongly eluded automatic integration) and had to be resolved, at the keyboard, by the lexicographical staff. Once the integration of a table had been approved, a magnetic tape was produced and sent to the composition suppliers, Filmtype Services Ltd., of Scarborough, North Yorkshire.

Galley proofs of the entire Dictionary text for each text table were produced and distributed to the team of proof-readers (now increased to more than sixty). On their return, the third stage began. The editorial group checked all proof-readers' corrections, and carried out many additional systematic checks, some facilitated by specific computer scans. Cross-references were dealt with at this stage. Once approved, the table was again put on tape and sent for composition. This time fully formatted page proofs were produced, and the breaks between volumes were inserted. The fourth stage consisted of the checking of these proofs to ensure that all galley proof corrections appeared correctly on them, and that no errors had crept into the text for any other reason, such as the malfunctioning of the composition programs. The final corrections to the page proofs were again keyboarded into the database at Oxford; they were applied to the printed version by Filmtype Services either by the processing of a new magnetic tape copy or by simple keyboarding. When the final proof pages for a volume were deemed acceptable, the volume was passed for press.
