Jim Breen's WWWJDIC Server

Last updated: 25 January 2003 (You can jump straight to WWWJDIC).

INTRODUCTION

Welcome to WWWJDIC, the dictionary server I have developed to enable direct WWW access to the Japanese-English dictionary files I have compiled or collected, and to provide some of the functionality on the WWW of the various dictionary search engines with which I have been associated. The server is called WWWJDIC because it is a member of the JDIC/xjdic/MacJDic family of dictionary software. It was based on code from the Unix/Linux xjdic program, but has a rewritten front-end to drive HTML forms.

WWWJDIC operates at several mirror sites around the globe. All sites carry identical information. Check here for the location of the nearest mirror site.

BROWSING IN JAPANESE

As WWWJDIC provides no support for the display of Japanese words in a romanized form (Romaji), you will require some capability for displaying Japanese kana and kanji. The best way to do this is to install the appropriate Japanese fonts and set your browser to use them. Most modern browsers support that facility. If you do not wish to do that, you may access WWWJDIC via a special server that will send out bit-mapped versions of Japanese characters (see below.)

If you are a Unix/Linux person using Mozilla, Netscape, Galeon, etc., all you have to do is make sure that a Japanese font file has been installed in the correct directory (e.g. /usr/X11R6/lib/X11/fonts/misc). Recent releases of Linux come with this included. You may have to make sure mkfontdir has been run too. You will then have to make sure that the browser knows to use this font when it encounters Japanese text. This is done (e.g. in Netscape) via the Edit/Preferences/Appearance/Fonts menu. If the WWW page is correctly marked as using Japanese, any Japanese text should appear immediately. Many WWW pages are not marked correctly, so you may have to turn on Japanese viewing via the View/Character Set/Japanese (autodetect) menu. (Note that some Unix/Linux browsers do not allow input of Japanese via input methods such as kinput2. I use Mozilla, which does support kinput2.)

For Windows users, probably the best method is to make sure a Japanese True-Type Font (ttf) has been installed on your system, and set your browser to use it. The Monash ftp archive has two Microsoft Japanese fonts available: Gothic and Mincho. These are both self-installing executable files. Once a font has been installed, you need to tell your browser to use that font for Japanese text. In Netscape this would be done via the Edit/Preferences/Appearance/Fonts menu. As ever, you will probably need to restart Windows to make it work.

Windows users also have a more complete solution, which is to install the language support Windows Update from Microsoft. It has become hard to find from that page, but fortunately it appears also to be available here. This brings in the Japanese Language Support and Japanese Input Method Editor, which allow users to view and input Japanese with reasonable ease. The IME works with MS-IE and from V4.72 also works with Netscape. (Note that even if you have no intention of using IE, you may need to have it installed in order to be able to install the IME.) Later versions of Windows based on NT (2000, XP) come with fonts and an IME already.

Another alternative for Windows users is to install a package which traps and converts Japanese characters. A good one is Hongbo Ni's NJWIN program. His NJCOMM program extends this to providing Japanese input as well.

Macintosh users have various ways of browsing in Japanese. For example, there is Apple's Multilingual Internet Access, which is available as a Custom installation from the Macintosh OS 8.5 CD-ROM. OS 9 comes with the Apple Language Kit built-in. There is some useful summary information on Asian Languages and Macs at the University of Sydney.

If you do not want to, or cannot, operate a full Japanese environment for your browser, you can access WWWJDIC via another server which will insert bit-mapped graphic characters as required. One such server is available on the Monash site here.

OPERATING INSTRUCTIONS

These will be minimal, as I have tried to make the operation of WWWJDIC as intuitive as possible. There is an FAQ section at the back of this page as people keep asking me things.

Romaji

Care is needed with the form of Romaji used for input. WWWJDIC expects "wapuro romaji", i.e. it should be typed as though it was going into the Input Method (IM or IME) of a Japanese-capable word-processor. Thus it is "toukyou" and "oosaka". Also, I expect an apostrophe (') to disambiguate things like hon'yaku and Shin'ichi. (Some WPs use repeated n's for this, but I don't.) Note that I can accept both Hepburn and kunrei/nihon shiki; both sin'iti and shin'ichi map to the same kana. The only romaji forms that may give trouble are the voiced forms of "tsu" and "chi". For these use "dzu" and "dji". Thus you need to look up "tsudzuku", not "tsuzuku". Note that, as in many IMEs, xa, xi, etc. can be used for the small kana vowels.

For people who don't like having to click the "Japanese Keyword in Romaji" radio button on the dictionary search page, you can enter romaji even when it is set to "English or ..." by prefixing the romaji with an "@" character, e.g. "@koujou".
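The wapuro romaji conventions above can be illustrated with a small converter. This is only a sketch under stated assumptions, not the server's own code: the table covers the basic syllables, the y-glides and the small vowels, and the function name `romaji_to_hiragana` is made up for the example.

```python
# Illustrative wapuro-romaji to hiragana converter (not WWWJDIC's actual code).
VOWELS = "aiueo"
ROWS = {
    "": "あいうえお", "k": "かきくけこ", "s": "さしすせそ", "t": "たちつてと",
    "n": "なにぬねの", "h": "はひふへほ", "m": "まみむめも", "r": "らりるれろ",
    "g": "がぎぐげご", "z": "ざじずぜぞ", "d": "だぢづでど", "b": "ばびぶべぼ",
    "p": "ぱぴぷぺぽ",
}
# Regular syllables; this already covers kunrei forms such as "si", "ti", "tu".
TABLE = {c + v: kana for c, row in ROWS.items() for v, kana in zip(VOWELS, row)}
# Hepburn forms, plus the "dzu"/"dji" spellings for voiced tsu/chi.
TABLE.update({"ya": "や", "yu": "ゆ", "yo": "よ", "wa": "わ", "wo": "を",
              "shi": "し", "chi": "ち", "tsu": "つ", "fu": "ふ", "ji": "じ",
              "dzu": "づ", "dji": "ぢ"})
# Small vowels typed as xa, xi, ...
TABLE.update(zip(("xa", "xi", "xu", "xe", "xo"), "ぁぃぅぇぉ"))
# Glides: kya, sya, sha, cha, ja, ...
SMALL_Y = {"a": "ゃ", "u": "ゅ", "o": "ょ"}
for c in "kstnhmrgzdbp":
    for v, small in SMALL_Y.items():
        TABLE[c + "y" + v] = TABLE[c + "i"] + small
for prefix, base in (("sh", "し"), ("ch", "ち"), ("j", "じ")):
    for v, small in SMALL_Y.items():
        TABLE[prefix + v] = base + small

N_FOLLOWERS = set("aiueoy")  # after these, "n" starts an ordinary syllable


def romaji_to_hiragana(text):
    text = text.lower()
    out, i = [], 0
    while i < len(text):
        nxt = text[i + 1:i + 2]  # empty string at the end of the input
        if text[i] == "n" and nxt not in N_FOLLOWERS:
            # Syllabic n; an apostrophe after it is consumed as a separator.
            out.append("ん")
            i += 2 if nxt == "'" else 1
            continue
        if nxt == text[i] and text[i] not in VOWELS:
            out.append("っ")  # doubled consonant
            i += 1
            continue
        for size in (3, 2, 1):  # longest match first
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += size
                break
        else:
            raise ValueError("cannot convert: " + text[i:])
    return "".join(out)
```

With this table, "sin'iti" and "shin'ichi" both come out as しんいち, and "tsudzuku" gives つづく.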

Exact Match

An option on the Word Search page is "Require exact word-match". If you select this option, only a restricted number of entries will be displayed, as follows:
  1. for a Japanese key, the headword (kanji or kana) must match exactly, without other characters before or following;
  2. for non-Japanese keys, one of the senses in the dictionary entry must match the key exactly; however, two exceptions are made:
    1. any characters in parentheses before the keyword are ignored;
    2. the characters "to " preceding the keyword are ignored (thus allowing matches on English verbs).
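The two exceptions for non-Japanese keys can be sketched as follows. This is an illustration only: the function names are invented, and the case-insensitive comparison is an assumption rather than something stated above.

```python
import re

# Leading parenthesised groups such as "(n) " or "(v5m) (vt) ".
LEADING_PARENS = re.compile(r"^(\([^)]*\)\s*)+")


def normalise_sense(sense):
    """Drop leading parenthesised text and a leading "to " from one sense."""
    sense = LEADING_PARENS.sub("", sense.strip())
    if sense.startswith("to "):
        sense = sense[3:]
    return sense


def exact_match(key, senses):
    """True if any sense equals the key once the two exceptions are applied."""
    # Assumption: comparison ignores case.
    return any(normalise_sense(s).lower() == key.lower() for s in senses)
```

So a key of "read" matches a sense written as "(v5m) to read", while "book" does not match "bookshop".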

Searching for English Words

You need to know that the dictionary files are based on Japanese head-words, and selecting entries using English keys can result in misleading results. For example, looking for "book" in the full EDICT file will return potentially 350 entries. For searching the EDICT file, you may be able to get better results by setting the common word restriction via the checkbox on the initial menu. Using the "Exact Match" option may also improve the results. Checking the example sentences (if available) will help verify if the word is suitable. At all times the user should exercise caution.

Multi-Radical Kanji Selection

I should also mention that the Multi-Radical Kanji Selection feature does not use the 214 classical radicals. Instead it uses a slightly different set which includes more basic shapes. Note that the identification of the kanji is based on the visual appearance of the elements, not on their classical radical.

Customizing

You have the opportunity to change some of the visual aspects of WWWJDIC's input and display. There is a "customization" page which lets you change the basic colours, lines/display, etc. It also lets you change from the default EUC input and output coding to either Shift-JIS or Unicode (UTF-8). Note that for modern browsers like IE and Netscape, this option need not usually be exercised, as the browser will detect the code and display almost all characters correctly. The option is really only for browsers that cannot handle EUC at all (e.g. Japanese mobile phones which only support Shift_JIS), or for regular use of dictionary files such as the Buddhism and French/German files, which contain characters outside the basic Japanese set. For users with modern browsers, Unicode (UTF-8) may be worth using as it avoids the use of bitmapped images.
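For readers unfamiliar with the three codings, the following sketch shows what switching the output coding amounts to: the same text is decoded from EUC-JP and re-encoded as Shift-JIS or UTF-8. It uses Python's standard codecs; the function name `transcode` is an assumption for the example and says nothing about how the server does it internally.

```python
def transcode(data: bytes, target: str) -> bytes:
    """Re-encode EUC-JP bytes in another coding, e.g. "shift_jis" or "utf_8"."""
    text = data.decode("euc_jp")
    return text.encode(target)


euc = "辞書".encode("euc_jp")       # 2 bytes per kanji in EUC-JP
sjis = transcode(euc, "shift_jis")  # 2 bytes per kanji in Shift-JIS
utf8 = transcode(euc, "utf_8")      # 3 bytes per kanji in UTF-8
```

The byte values differ in each coding, which is why a browser that guesses the wrong one shows mojibake.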

The customization can take place either by setting a cookie in your browser, or by setting some URL parameters. Note that the cookies only work for the server which set them.

A word of warning about changing the colours. Since the in-line images of the JIS212 characters were converted from GIF to PNG format, they are now black_on_ivory, not black_on_transparent. This is because many browsers cannot handle transparent PNG files.

A Word about IMEs.

WWWJDIC can be used successfully with the IMEs now available for Win95/98, Windows 2000 and XP:

Stroke Order Diagrams

For the Jouyou and Jinmeiyou kanji WWWJDIC has the option of displaying a semi-animated stroke order diagram. These diagrams use the art-work from Jack Halpern's New Japanese-English Character Dictionary, which was scanned and cleaned up by Jeffrey Friedl to go into Jack's Kanji Learner's Dictionary. Jack agreed to me including them as an option in WWWJDIC, and I was able to create animated GIF files using the panels of the artwork. Some twitch a bit due to the occasional alignment inaccuracies. (See Technical Bits for more details.)

TRANSLATING TEXT

One of the options of WWWJDIC is to translate the words in Japanese text. Please note, the function does NOT attempt to translate Japanese text into English; it simply attempts to identify the words in the text and to display the translations of those words. The user is expected to know enough Japanese grammar to make sense of the results. The input text is displayed in sections, with the words detected/translated in red, or in blue where an inflected verb or adjective is assumed. If a user requests that a word/phrase only be translated once (see below), the text is displayed in brown for subsequent occurrences.

You can use this option in two ways:

  1. cut-and-paste text from another application into the text box on the browser screen. (It usually seems to go automatically into the EUC I require, but if you are having problems, try the option of forcing the server to convert it to EUC.) In some cases the cut-and-paste may break characters up, resulting in a load of mojibake. Sorry if this happens, but it's a browser problem and can't be fixed in the server.
  2. specify the URL of a WWW page, and the server will fetch that page and translate the words in it. Note that in doing so, it deletes everything between < and >, i.e. all HTML labels, etc. and as a default deletes all non-Japanese characters, so all you get is the raw Japanese. (You can override this and get it to leave the non-Japanese in if you wish.) Where non-Japanese has been deleted, a "|" is inserted. (In this option, you may wish to set a new timeout value if the fetch of the WWW page takes longer than the default 60 seconds allowed.) Please note that I make no attempt to handle cookies. If you can't use this facility because the site you are viewing requires cookies enabled, you will have to use the cut-and-paste alternative.
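The clean-up applied to a fetched WWW page can be sketched roughly as below. This is a simplified illustration: the Unicode ranges chosen to define "Japanese" (kana, common kanji and CJK punctuation) and the function name are assumptions, and the real server works on EUC-coded text rather than Unicode strings.

```python
import re

TAG = re.compile(r"<[^>]*>")  # everything between < and >
# Assumed definition of "Japanese": CJK punctuation, kana, and common kanji.
NON_JAPANESE = re.compile(r"[^\u3000-\u303f\u3040-\u30ff\u4e00-\u9fff]+")


def raw_japanese(html, keep_non_japanese=False):
    """Strip HTML labels, then replace each non-Japanese run with "|"."""
    text = TAG.sub("", html)
    if keep_non_japanese:
        return text
    return NON_JAPANESE.sub("|", text)
```

For example, `raw_japanese("<p>Hello 日本語 world</p><b>テスト</b>")` leaves only the two Japanese runs, separated by "|" markers.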

I have recently developed some small Javascript programs which enable text to be marked and then dropped straight into this function by clicking on a Taskbar button. See my buttons for details.

The server detects words in the text as follows:

  1. gairaigo in katakana are detected and looked up;
  2. jukugo beginning with kanji are detected;
  3. where a kanji is followed by two or more hiragana, an attempt is made to match the kana against known verb/adjective inflections. If this succeeds, the equivalent dictionary form of the word is sought. If this is successful, the match is displayed, and the matched text displayed in blue;
  4. single kanji which have not been detected in the above will be matched against dictionary entries (if any). (This may be turned off by the user.)
  5. sequences of four or more hiragana are matched against a small file of words and phrases typically written in kana alone. Only exact matches are reported. (This function may be expanded, but the possibility of false matches is high.)
  6. a special case is made of an o or go hiragana, or the GO kanji preceding a kanji. In this case a check is made to see if the word is present in the dictionary files with and without the prefix.
Matches against complete dictionary entries are favoured over partial matches of longer entries, and if two equivalent matches are found, the longer is returned. Matched jukugo which are followed by what appears to be a particle (i.e. "wa", "no", "ni", "na", etc.) are trimmed back to just the jukugo to avoid misreporting matches from phrases and similar long dictionary entries.
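The first stage of the detection rules above, picking out candidate strings by script, can be sketched with a single regular expression. This covers only the candidate extraction; the dictionary lookup, the inflection matching and the particle trimming are not modelled, and the group names are invented for the example.

```python
import re

# Alternatives are tried in order at each position in the text.
TOKEN = re.compile(
    r"(?P<gairaigo>[\u30a1-\u30fa\u30fc]+)"              # katakana run
    r"|(?P<inflected>[\u4e00-\u9fff]+[\u3041-\u3096]{2,})"  # kanji + 2 or more hiragana
    r"|(?P<kanji>[\u4e00-\u9fff]+)"                      # jukugo or a single kanji
    r"|(?P<kana>[\u3041-\u3096]{4,})"                    # 4 or more hiragana
)


def candidates(text):
    """Return (kind, string) pairs for each candidate found in the text."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]
```

Short hiragana runs such as single particles fall through all four alternatives and are skipped, which mirrors the behaviour described for kana-only words.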

Users may request that translations only appear once for each Japanese word or phrase.

The user can invoke any dictionary file for the matching, but the combination THE_LOT file is the default. One advantage of using this combined file is that it increases the chance of getting a correct match for a word, particularly if the text contains names. Also, the component sub-files in THE_LOT are tagged, and the match function gives preference to entries in the following order (tags shown "EP", etc.):

The reason the EDICT subset is used is so that the appropriate match is made when there are several readings of a jukugo, for example the "adult" compound will be matched against the word "otona" instead of the less common "dainin".

The full details of all the dictionary files are provided below.

Further Comments on WWW Page Translation

Please note that if you are wanting to examine Japanese text within a frame, you may have to examine the source file (e.g. View/Source) to get the address of the actual file containing the text. An alternative is to open the frame in a window of its own.

Please appreciate that the function is somewhat crude and simplistic. Also, a large amount of text will result in hundreds of searches, so the server may take a while to respond.

I have created a front page for this function which uses frames so you can have the viewed page and WWWJDIC side-by-side.

DICTIONARY FILES

In this section I provide a few words about each dictionary file, and a link to that file's documentation. (The many people who contributed to the files are acknowledged in the documentation.)

The dictionary files used by the server are:

WWWJDIC also uses the "radkfile" file from the xjdic distribution, which contains the radical-element breakdown for the JIS208 kanji. This file was originally prepared by Michael Raine and revised and extended by me. The file is used to drive the multi-radical kanji-selection feature.

Some of the dictionary files contain characters used in languages such as French, German, Sanskrit, etc., which are not available in the common JIS X 0208 character set. These characters are coded in the extension set - JIS X 0212 - however most browsers cannot display these characters correctly in the default EUC-JP coding, and they are not available at all in Shift-JIS coding. For this reason the characters are sent from the server either as HTML entities, e.g. &eacute; for é, or as a bit-mapped PNG image. Depending on the font you have chosen for your browser, these characters may appear a little strange.
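The entity substitution can be illustrated with Python's standard table of named HTML entities. This is a sketch only: it handles the named-entity case and leaves every other character untouched, so the bit-mapped PNG fallback is not modelled, and the function name is an assumption.

```python
from html.entities import codepoint2name


def with_entities(text):
    """Replace non-ASCII characters that have a named HTML entity."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 0x7F and name:
            out.append("&" + name + ";")
        else:
            out.append(ch)  # ASCII, kana and kanji pass through unchanged
    return "".join(out)
```

For instance, `with_entities("café")` yields `caf&eacute;`.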

Please note that the dictionary material is for the most part copyright. Unauthorized publication of material from WWWJDIC is prohibited. See the Copyright section below for more information on this.

EXAMPLE SENTENCES

A number of entries in the EDICT dictionary file are linked to Japanese/English example sentences that can be displayed by clicking on the "Ex" tag after the entry. The examples are drawn from a collection of Japanese/English sentences compiled by Professor Yasuhito Tanaka at Hyogo University and his students. The collection is in the Public Domain, and was supplied by Professor Christian Boitet. It was described by Professor Tanaka in his Pacling2001 paper. It is under consideration for use as a source of multi-lingual examples in the Papillon project.

The collection is large (approximately 180,000 pairs) and is being edited as there are a number of errors and duplications in both the Japanese and English texts. A number of the sentences have been tagged with [M] or [F] to show they have gender-specific language or words.

The process of making the examples available to users of WWWJDIC was as follows:

  1. the more obvious duplications were removed automatically, reducing the original file from 210,000 pairs to 180,000 pairs.
  2. each Japanese sentence was parsed using the Chasen morphological analyzer to extract the lexical components.
  3. the components (Japanese words) which contained at least one kanji were retained, along with all gairaigo (in katakana), and an index file was built to relate each word to the sentences which contain it. This resulted in approximately 22,000 words being identified, most of which are in the main EDICT file.
  4. a function was added to WWWJDIC to check each word found to see if it is also in the example index file. If it is, a link to the collection of sentences is added. Only the first 100 examples can be seen for any one word.
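Steps 3 and 4 amount to building an inverted index from words to sentences and capping what is shown. A minimal sketch, assuming the sentences have already been split into words (the Chasen step is not reproduced) and using invented function names:

```python
from collections import defaultdict

MAX_EXAMPLES = 100  # only the first 100 examples are shown for a word


def is_indexable(word):
    """Keep words with at least one kanji, and words written in katakana."""
    has_kanji = any("\u4e00" <= ch <= "\u9fff" for ch in word)
    all_katakana = bool(word) and all("\u30a1" <= ch <= "\u30fc" for ch in word)
    return has_kanji or all_katakana


def build_index(tokenised):
    """tokenised: iterable of (sentence_id, list_of_words) pairs."""
    index = defaultdict(list)
    for sentence_id, words in tokenised:
        for word in dict.fromkeys(words):  # each word once per sentence
            if is_indexable(word):
                index[word].append(sentence_id)
    return index


def examples_for(index, word):
    return index.get(word, [])[:MAX_EXAMPLES]
```

A word with no entry in the index simply gets no "Ex" link.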

As mentioned above, the collection is in need of considerable editing. It also needs pruning, as some words are over-represented and many others have no examples at all. Still it is a beginning, and a method for testing the Tanaka corpus.

I will be progressively editing the file, particularly to remove obvious duplications, and also adding additional example sentences. Any suggested corrections or sentences to add to the collection are welcome, and should be emailed to me. If you are sending suggested sentences, please make sure both the Japanese and English are correct. Also, if the sentences are drawn from an existing publication, please provide the details of that publication.

If you would like to collect a complete copy of the current file of example sentences, including the index words, it is available here.

VERB CONJUGATIONS

Most of the verbs in the main EDICT file allow an optional display of a table of verb conjugations. Where this is available, a [V] tag appears to the right of the verb display.

The table of conjugations is generated automatically according to the part-of-speech tag in the entry. It should not be assumed that for every verb, any single conjugation is as frequently used or as natural as any other.

Associated with the table of conjugations is a page of supplementary comments which attempts to expand some of the more obscure points.
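To show what generating forms from the part-of-speech tag involves, here is a small sketch covering the v1 tag and three of the v5 tags. The four forms chosen and the function name are assumptions for the example; the real table is larger and handles the special classes separately (v5k-s exists because iku has an irregular te-form).

```python
# Per-tag endings: (negative stem, polite stem, te-form, past) replacing the final kana.
GODAN = {
    "v5k": ("か", "き", "いて", "いた"),
    "v5m": ("ま", "み", "んで", "んだ"),
    "v5u": ("わ", "い", "って", "った"),
}


def conjugate(word, pos):
    """Return a few conjugated forms of a dictionary-form verb in kana/kanji."""
    stem = word[:-1]  # drop the final kana of the dictionary form
    if pos == "v1":
        return {"negative": stem + "ない", "polite": stem + "ます",
                "te": stem + "て", "past": stem + "た"}
    if pos in GODAN:
        a, i, te, ta = GODAN[pos]
        return {"negative": stem + a + "ない", "polite": stem + i + "ます",
                "te": stem + te, "past": stem + ta}
    raise ValueError("tag not covered by this sketch: " + pos)
```

As the text warns, a form being generated does not mean it is natural or common for that verb.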

LINKS TO OTHER SYSTEMS

An interesting feature of WWWJDIC is the system of links to other servers and files. These are:

  1. to other WWW kanji/hanzi/hanja character dictionaries. These links go from the kanji information page, and enable direct access to the information about that kanji held on other databases. The databases currently linked are:

    The "unifying" code we use to implement these links is the Unicode (UCS2) code-point. We intend to have all the systems cross-linked. You can index from Chuck's and Rick's systems back to WWWJDIC.

  2. the jeKai Project. This project is developing a WWW-based dictionary of extended information about words & phrases in Japanese. WWWJDIC examines the jeKai index and when it displays a Japanese word which is in the jeKai files, it creates a link. Try it out for a word like "noren" to see how it works.
  3. the online Sanseido dictionary at Goo. The link goes from the normal word display, and triggers the JE server at that site. You can use the other dictionaries at that site, including the big Daijirin.
  4. the Google search engine, which is called with the displayed Japanese word(s) as a search key.
  5. the built-in display of animated stroke order diagrams for about 1000 common kanji.
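The use of the Unicode code-point as the unifying key for the character-dictionary links in item 1 can be sketched as below. The URL template is purely hypothetical (the real link formats of the other servers are not given here), as are the function names.

```python
def code_point(kanji):
    """Four-digit hexadecimal Unicode code-point of a single character."""
    return format(ord(kanji), "04X")


def kanji_link(kanji, template):
    """Fill a (hypothetical) URL template with the character's code-point."""
    return template.format(ucs=code_point(kanji))


# Hypothetical template, for illustration only.
link = kanji_link("漢", "https://example.org/char?ucs={ucs}")
```

Because every server can derive the same code-point from the character, no shared numbering scheme needs to be agreed between the sites.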

ABBREVIATIONS AND CODES USED IN DICTIONARY ENTRIES

The dictionary entries contain a number of abbreviations and codes, mainly to reduce storage and display space.

CODE            MEANING
abbr            abbreviation
adj             adjective (keiyoushi)
adv             adverb (fukushi)
adj-na          adjectival nouns or quasi-adjectives (keiyodoshi)
adj-no          nouns which may take the genitive case particle "no"
adj-pn          pre-noun adjectival (rentaishi)
adj-s           special adjective (e.g. ookii)
adj-t           "taru" adjective
arch            archaism
aux             auxiliary
aux-v           auxiliary verb
conj            conjunction
col             colloquialism
exp             Expressions (phrases, clauses, etc.)
fam             familiar language
fem             female term or language
gikun           gikun (meaning) reading
gram            grammatical term
hon             honorific or respectful (sonkeigo) language
hum             humble (kenjougo) language
id              idiomatic expression
int             interjection (kandoushi)
iK              word containing irregular kanji usage
ik              word containing irregular kana usage
io              irregular okurigana usage
MA              martial arts term
male            male term or language
m-sl            manga slang
n               noun (common) (futsuumeishi)
n-adv           adverbial noun (fukushitekimeishi)
n-t             noun (temporal) (jisoumeishi)
neg             negative (in a negative sentence, or with negative verb)
neg-v           negative verb (when used with)
obs             obsolete term
obsc            obscure term
oK              word containing out-dated kanji
ok              out-dated or obsolete kana usage
pol             polite (teineigo) language
pref            prefix
qv              quod vide (see another entry)
sl              slang
suf             suffix
uK              word usually written using kanji alone
uk              word usually written using kana alone
v1              Ichidan verb
v5              Godan verb (not completely classified)
v5u, v5k, etc.  Godan verb with `u', `ku', etc. endings
v5k-s           Godan verb - Iku/Yuku special class
v5z             Godan verb - -zuru special class (alternative form of -jiru verbs)
v5aru           Godan verb - -aru special class
v5uru           Godan verb - Uru old class verb (old form of Eru)
vi              intransitive verb
vs              noun or participle which takes the aux. verb suru
vs-s            suru verb - special class
vk              Kuru verb - special class
vt              transitive verb
vulg            vulgar expression or word
P               "Priority" entry, i.e. among the 20,000 more common words in Japanese
X               rude or X-rated term (not displayed in educational software)

The following abbreviations are used in the Names dictionary file.

CODE            MEANING
s               surname
p               place-name
u               person name, as-yet unclassified
g               given name, as-yet not classified by sex
f               female given name
m               male given name
h               a full (family plus given) name of a historical person

The THE_LOT file, used for translating words in Japanese text, has the following codes attached to each entry to show the dictionary file from which it has been selected.

CODE            MEANING
SP              special words & phrases
EP              edict (priority subset)
ED              edict (the rest)
NA              enamdict (higher frequency names)
NB              enamdict (lower frequency names)
GE              geodic
PP              pandpdic
AV              aviation
CC              concrete
LW              lawdic
CO              compdic
LS              lifscidic
FM              finmktdic
ST              stardict
PL              j_places (entries not already in enamdict)
ES              engscidic
KD              small hiragana dictionary
LG              lingdic
FO              forsdic_e
BU              buddhdic

COPYRIGHT

The material being displayed in WWWJDIC's pages is copyright. It is drawn from dictionary files the copyright of most of which is held by the Electronic Dictionary Research and Development Group (EDRDG) at Monash University. What does this mean in practical terms? Well:
  1. you can use WWWJDIC in the same way as you use a published dictionary to assist you with translating text and words. The results of your translation may be published, sold, etc. If you make heavy use of WWWJDIC it would be nice to acknowledge that, but there is no requirement to do more;
  2. you can link to WWWJDIC, e.g. using the backdoor entry, from other servers, provided those servers are operating free-of-charge. Servers operating on a fee basis must not use WWWJDIC without authorization.
  3. if you wish to publish significant extracts of the output from WWWJDIC, for example if you use the Translate Words in Text function to generate a vocabulary list for a textbook of reading passages, then this comes under the scope of the licence for the dictionary files, which prohibits unauthorized publication of subsets of the files. You must first obtain approval from the copyright holder, and if it is a large and commercial publication, a fee may be required. Small-scale use of WWWJDIC for these purposes, especially by educational institutions, will usually be approved free-of-charge provided the usage is acknowledged in the publication.
  4. the Stroke Order Diagrams are under Jack Halpern's copyright. You may link to the pages displaying those images, but you must not download and store the images without Jack's permission.
For more details, see the licence statement covering the dictionary files.

FAQ (Frequently Asked Questions)

  1. [Q] I can't use WWWJDIC from a J-Phone. I put in a search word, but get no reply, instead it goes to the main menu.
    [A] Yes, I hope to fix that eventually. J-Phones use MML not HTML, and for some reason forms are sending in information that can't be decoded.
  2. [Q] Are you planning to have a WAP interface for WWWJDIC?
    [A] Perhaps one day, but in the meantime a WAP frontend site which accesses WWWJDIC via backdoor calls is close to being released (March 2002).
  3. [Q] I like the Stroke Order Diagrams. Why do some kanji not have them?
    [A] The raw diagrams were provided by Jack Halpern, and were prepared for the Kodansha Kanji Learners Dictionary. The coverage of that book is a bit over 2,000 kanji, so that's all the diagrams available.
  4. [Q] Your server is very slow. Why don't you rewrite it in ... or move it to the .... server technology?
    [A] Actually the servers are not slow at all. They are all fast systems, and the code is quite light-weight. Most requests are served in a fraction of a second. To some users it may seem slow because of network delays and congestion. If this is your case, try using a mirror site closer to you.
  5. [Q] I have hunted for the source of WWWJDIC and can't find it. Where is it?
    [A] Locked up on the servers. I haven't released it, and at this stage have no intention of doing so. It is continually being modified, and I want to keep it under my control (after all, it is my ego trip.) I don't want any clones of WWWJDIC running around at this stage.
  6. [Q] I want to have WWWJDIC's functions on my PC without having to use an Internet connection. Is there a stand-alone version I can download?
    [A] Not at present. There is no reason why the functionality can't be in a stand-alone program, and some programs such as JQuickTrans do a similar job. One day I may do a port to a stand-alone, but I am still seeking a suitable cross-platform environment, i.e. all of Unix/Linux, Macintosh and Windows. I have some information about stand-alone software on the EDICT home page.
  7. [Q] In the text word translation you don't do all the words written just in hiragana - why is that?
    [A] There are several reasons: (a) the beginnings of such words can be very difficult to detect when they are preceded by other kana as is often the case (particles, etc.). You need sophisticated segmentation software to do this. (b) many Japanese words share the same reading/pronunciation, and hence I would probably pick the wrong word.
    At present I only handle words which are at least 4 kana long and which are found in a small list of kana-only words.
  8. [Q] What are all those "vs" and "an" tags on the dictionary displays? And what are the "ED" and "LS" when I translate words in text?
    [A] Fair question. I have now added a section to this file explaining them.
  9. [Q] It's hard to read the translated web pages. I can't read it so it makes sense.
    [A] Can you display Japanese text on your browser?
    Can you read hiragana and katakana and recognize a few kanji?
    Can you understand the basics of Japanese grammar and syntax?
    If the answer to the above is "no", then I agree the translations of the words from the WWW page will not be much use. What I do is assist a person to read the text by providing translations of some of the words. It is NOT an attempt to translate the page into English.
  10. [Q] Why do you just translate the words in the text? Why don't you go the rest of the way and translate properly into English?
    [A] Machine Translation (MT) is a huge and complex task. The WWWJDIC server is comparatively simple. If I ever developed a Japanese-English MT system (most unlikely), I'd sell it; not have it free on a WWW site.
  11. [Q] I don't have a Japanese-capable browser. Will you support graphics display of kana and kanji?
    [A] Not via my server, however you can use WWWJDIC via Silas Brown's "ACCESS-J" server, which has a link on the WWWJDIC front page. With things like NJWIN, any PC browser can display Japanese. Also there are fonts you can download for Netscape and IE.
  12. [Q] I can't read the kana readings. Will you add romaji display as an option?
    [A] No. Better to learn kana. It will only take a week or two.
  13. [Q] I see you use EUC-JP. Will you add Shift-JIS as an option?
    [A] I stalled on this for years, and finally added it in March 2000, because the damned DoCoMo mobile phones need it. If you aren't using the special DoCoMo interface, you have to set it via the customization page.
  14. [Q] I have a Japanese IM with my browser. Can I key in Japanese keys directly to the dictionary search?
    [A] Yes. (Actually you can call the key "English" if you like; that works.)
  15. [Q] I get blanks instead of kanji/kana, but I have obtained NJWIN.
    [A] If you have changed to Netscape 4.0x, you'll need to get the latest NJWIN. Otherwise, I have no suggestions.
  16. [Q] I don't get any of the JIS X 0212 kanji when I specify a kanji selection.
    [A] You need to click on the button to enable these (normally they are suppressed, as few users need them.)
  17. [Q] How do I specify a JIS X 0212 kanji when selecting a JIS code?
    [A] Put an "h" in front of it, e.g. "h4064". ("h" is for hojo.)
  18. [Q] How do I specify that I want my default dictionary to be "the_lot"? The customization doesn't allow that.
    [A] I really should add that to the customization. In the meantime you can either (a) bookmark the dictionary search screen, then without your browser running edit the URL in the bookmark file to say "wwwjdic?9C", or (b) go to the initial dictionary search screen, change the "wwwjdic?1C" to "wwwjdic?9C", press enter to go to the "new" URL, then bookmark it.

WWWJDIC HISTORY

No sooner had the WWW come into being than servers accessing my dictionary files began to appear. The first, which operated briefly in 1994, was a slight rework of my xjdic program by Otfried Schwarzkopf. It overtaxed his 386, and was closed down fairly quickly, however by that stage Jeffrey Friedl's famous Dictionary engine was running. There are also Rafael Santos' system, the EVA/POETS engine at Notre Dame in Tokyo, PSP's ALISE-based system, etc. etc., as well as Lambert Schomaker's WWW edition of the KANJIDIC file.

I had intended to have a WWW version of xjdic right from the moment I knew about the WWW, and in 1994 collected some information on writing CGI programs ready for the assault. It always seemed too big a task, and anyway Jeffrey's server was doing a good job. Eventually in mid-1997 it got too much for me, as I wanted to experiment with some features not handled by Jeffrey's server, and I also wanted to see my name in the WWW lights too, so I filleted out the search-engine parts of xjdic and dashed off a new CGI-oriented front-end. It only took a week or two of spare time and was up and running. I could easily have done it years before.

WWWJDIC has proved popular, although it has probably not overtaken the early lead Jeffrey's server established. It has been relatively easy to modify, so I have tinkered with it quite a bit (see below.)

Starting late in 1998 I have installed a number of mirrors. The first two were quite a bit of work as I had effectively written a lot of hard-coded stuff pointing at the Monash site. The code is now fairly portable (for a Unix/Linux box running Apache.) Having a lot of mirrors brought in the problem of keeping them up-to-date. To handle this, in 2000 I set up an "rsync server" at Monash and have set "cron" scripts running at the mirror sites which periodically interrogate the Monash site and collect and install any updated files.

WHAT'S NEW

PLANNED IMPROVEMENTS

KNOWN BUGS

(I hope/intend to fix these eventually.)

TECHNICAL BITS

Structure

WWWJDIC is a single C program which takes its parameters from the URL (QUERY_STRING) and from the various buttons (POST method). It carries as much as it can of the user's state by loading the values of the various radio/checkboxes. View the source of some of the screens if you want to see how the CGI stuff is working.
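As a rough illustration of the mechanism described above (not the actual WWWJDIC source, which is in C), the following Python sketch reads the two inputs a CGI program receives and re-emits a radio button with the user's previous choice preselected. The field names "dict" and "keytype" are hypothetical and do not come from the real forms.

```python
from urllib.parse import parse_qs


def read_cgi_request(environ, stdin):
    """Return (query_string, form_fields) for one CGI request.

    environ is a mapping such as os.environ; stdin is a text stream
    such as sys.stdin. Form bodies are percent-encoded ASCII, so
    reading CONTENT_LENGTH characters is adequate for this sketch.
    """
    query = environ.get("QUERY_STRING", "")
    fields = {}
    if environ.get("REQUEST_METHOD", "GET").upper() == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        body = stdin.read(length)
        # Keep only the first value submitted for each field name.
        fields = {name: values[0] for name, values in parse_qs(body).items()}
    return query, fields


def render_radio(name, value, current):
    """Emit one radio button, checked when it matches the current state."""
    checked = " checked" if value == current else ""
    return f'<input type="radio" name="{name}" value="{value}"{checked}>'
```

Because every screen is generated with the current selections already filled in, the next submission carries the state back to the program, and nothing has to be stored on the server between requests.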

No database system is used. Each dictionary file is a single text file with a dictionary entry per line. Associated with each text file is an index file containing pointers to each element in an entry (see the xjdic documentation and source for more details on this.) The dictionary lookup is extremely fast and efficient.
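The general idea of a text file plus an index of pointers can be sketched as follows. This is only an illustration in Python: the sample entries, the 32-byte sort width and the in-memory list of offsets are assumptions, and the real xjdic index format is described in the xjdic documentation rather than here.

```python
import re

# Hypothetical two-entry sample in an EDICT-like layout, held as EUC-JP bytes.
SAMPLE = "工場 [こうじょう] /factory/plant/\n猫 [ねこ] /cat/\n".encode("euc_jp")

KEY_WIDTH = 32  # bytes compared when sorting; search keys must not be longer


def build_index(data):
    """Return the byte offset of every word start, sorted by the text found there."""
    offsets = [m.start() for m in re.finditer(rb"[^\s/\[\]()]+", data)]
    offsets.sort(key=lambda off: data[off:off + KEY_WIDTH].lower())
    return offsets


def entry_at(data, offset):
    """Return the whole line (one dictionary entry) that contains offset."""
    start = data.rfind(b"\n", 0, offset) + 1
    end = data.find(b"\n", offset)
    if end == -1:
        end = len(data)
    return data[start:end].decode("euc_jp")


def lookup(data, index, key):
    """Binary-search the index for words starting with key (a bytes prefix)."""
    key = key.lower()
    lo, hi = 0, len(index)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[index[mid]:index[mid] + len(key)].lower() < key:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    # Walk forward over the run of index slots that share the prefix.
    while lo < len(index) and data[index[lo]:index[lo] + len(key)].lower() == key:
        entry = entry_at(data, index[lo])
        if entry not in hits:
            hits.append(entry)
        lo += 1
    return hits
```

A call such as lookup(SAMPLE, build_index(SAMPLE), b"cat") touches only a handful of index slots before reading the matching line, which is why no database is needed for this kind of lookup.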

The program runs under the Apache server and on a number of different Unix-like operating systems, including Solaris, AIX, FreeBSD and several Linux distributions. No attempt has been made to run it under Windows.

I originally planned to have a permanent dictionary search engine, with CGI programs calling it, as happens with Jeffrey's dictionary server. In the end I did not go ahead with this, as memory-mapped handling of the read-only dictionary files, and the significant caching carried out by the file system, achieves the same efficiency goal anyway.
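To show what memory-mapped, read-only access looks like in practice, here is a small Python sketch (the real program is in C, and the function name is hypothetical). Each invocation maps the file afresh, but pages already held in the file system cache are not read from disk again.

```python
import mmap


def grep_entries(path, needle):
    """Return every line of the file at path that contains the bytes needle."""
    hits = []
    with open(path, "rb") as fh:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(needle)
            while pos != -1:
                start = mm.rfind(b"\n", 0, pos) + 1
                end = mm.find(b"\n", pos)
                if end == -1:
                    end = len(mm)
                hits.append(mm[start:end].decode("euc_jp"))
                # Resume after this line so an entry is reported only once.
                pos = mm.find(needle, end)
    return hits
```

Only the pages actually touched by the search are brought into memory, so a short-lived CGI process pays very little for opening a large dictionary file.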

Mirror Sites

Mirror sites stay up-to-date by connecting to the master site at Monash once each day, retrieving a manifest file, then retrieving any updated source or data files. The file retrieval is done using the rsync system, which is excellent for retrieving small portions of large files. (There is an anonymous rsync server running at Monash for this purpose.) According to the settings in the manifest file, modified source files are compiled, index files are generated, etc. as part of this daily update.
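The daily update could be planned along the following lines. This is a sketch under several assumptions: the manifest layout ("name version action"), the placeholder host rsync.example.org, and the make_index helper are all hypothetical, and the real manifest format is not documented here. Only the rsync and make invocations use options that those tools are known to accept.

```python
# Placeholder address, not the real Monash rsync server.
MASTER = "rsync://rsync.example.org/wwwjdic"


def parse_manifest(text):
    """Parse hypothetical 'name version action' lines into a dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, version, action = line.split()
        entries[name] = (version, action)
    return entries


def plan_update(manifest_text, local_versions, dest="."):
    """Return the argv lists for one daily run; nothing is executed here."""
    plan = []
    for name, (version, action) in parse_manifest(manifest_text).items():
        if local_versions.get(name) == version:
            continue  # this file is already up to date
        plan.append(["rsync", "-az", f"{MASTER}/{name}", f"{dest}/{name}"])
        if action == "compile":
            plan.append(["make", "-C", dest])
        elif action == "index":
            # make_index stands in for whatever regenerates the index file.
            plan.append([f"{dest}/make_index", f"{dest}/{name}"])
    return plan
```

A cron job at the mirror would fetch the manifest, build such a plan, and then run each command in order.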

I get a number of enquiries from people offering to host mirrors. I am not actively seeking many more mirrors; however, I like to have a reasonable geographic spread. The basic requirements for a mirror site are:

  1. I must have an account on the system. Installation is complicated and not well documented.
  2. it must be a permanent arrangement, or at least one capable of being used for several years. I don't want to go to trouble setting it up only to have it withdrawn.
  3. it must be a Unix-like operating system (Solaris, Linux, AIX, etc.)
  4. it must have an Apache server running, plus a full suite of utility software, including gcc, wget, lynx, rsync, etc.
  5. it must be very well connected to the Internet. Having a poorly connected mirror is a waste of time.

Stroke Order Diagrams

The Stroke Order Diagram animation was carried out as follows:
  1. the source of the diagrams is the digitized multi-panel form from the printed kanji dictionaries, in which the kanji is built up stroke by stroke. Jack Halpern sent these to me as BMP files.
  2. I happened to have some software I had written to extract sections of graphics files and create new files from the pieces. This software worked on PNM (Portable Anymap) graphics files so using standard Linux/Unix utilities such as bmptopnm and ppmtogif in conjunction with my own software I was able to break each static diagram up into a series of GIF files; one for each panel of the diagram.
  3. for each kanji, I used the gifsicle utility to make an animated GIF of the whole kanji.
All this took a bit of debugging, but once it was working, it only took a few minutes to generate the diagrams for the whole 2230 kanji. All this was done on a Sun system running Solaris, so the GIF files are quite legal under the Unisys patent.
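The three steps above can be sketched as a small command generator. Several things here are assumptions: the panels are taken to be laid out left to right with equal widths, the file naming is invented, and the standard Netpbm tool pnmcut is swapped in for the panel-extraction software mentioned in step 2, which was the author's own and is not shown. The Python code only builds the shell pipelines and does not run them.

```python
def panel_boxes(width, height, panels):
    """Split a width x height strip into equal panels, left to right."""
    panel_width = width // panels
    return [(i * panel_width, 0, panel_width, height) for i in range(panels)]


def animation_commands(kanji, width, height, panels, delay=50):
    """Return the shell pipelines for one kanji; nothing is executed here."""
    # Step 1: convert the BMP diagram to PNM.
    commands = [f"bmptopnm {kanji}.bmp > {kanji}.pnm"]
    frames = []
    # Step 2: cut out each panel and convert it to a GIF frame.
    for n, (left, top, w, h) in enumerate(panel_boxes(width, height, panels), start=1):
        frame = f"{kanji}-{n:02d}.gif"
        commands.append(
            f"pnmcut -left {left} -top {top} -width {w} -height {h} {kanji}.pnm"
            f" | ppmtogif > {frame}"
        )
        frames.append(frame)
    # Step 3: join the frames; gifsicle delays are in hundredths of a second.
    commands.append(
        f"gifsicle --delay={delay} --loopcount=forever {' '.join(frames)} -o {kanji}.gif"
    )
    return commands
```

Looping such a generator over a list of kanji and feeding the output to a shell is one way a whole set of diagrams could be produced in a single batch.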

Japanese Character Codes

WWWJDIC uses the EUC-JP coding for all its files and all internal processing. EUC-JP is also the default coding for the HTML it generates.

The characters encoded in the files are from the JIS X 0208 character set, which contains the Japanese kana and the 6,355 most common kanji along with the Russian and Greek sets, plus the JIS X 0212 character set, which includes a further 5,801 kanji plus some Latin characters with diacritics (acute, grave, umlaut, etc.)

When pages are displayed using the EUC-JP or Shift_JIS encodings, characters from JIS X 0212 are displayed either as HTML entities or as 16x16 bitmapped images. If the optional UTF-8 coding is used, all characters are displayed in that coding.
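The entity fallback can be illustrated with Python's standard codecs. This is not how WWWJDIC itself does it (the server is written in C); it is a minimal sketch in which any character the page encoding cannot represent is replaced by a numeric HTML entity. It mirrors the Shift_JIS case only, since Python's own euc_jp codec is, as far as I know, able to encode JIS X 0212 characters directly.

```python
def encode_for_page(text, page_encoding):
    """Encode text for a page, using &#NNN; entities for unencodable characters."""
    return text.encode(page_encoding, errors="xmlcharrefreplace")


# JIS X 0208 kanji are encoded directly in either Japanese encoding.
euc_bytes = encode_for_page("日本語", "euc_jp")
sjis_bytes = encode_for_page("日本語", "shift_jis")

# "é" is not available in Shift_JIS, so it falls back to an entity,
# while UTF-8 can carry it as ordinary bytes.
sjis_fallback = encode_for_page("café", "shift_jis")
utf8_direct = encode_for_page("café", "utf-8")
```

Here sjis_fallback contains the ASCII text "caf" followed by the entity for U+00E9, whereas utf8_direct needs no entity at all.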

BACKDOOR ENTRY

If you want to interface to WWWJDIC from another page or a CGI program, there is a "backdoor" entry which enables simple searches to be initiated via the URL QUERY_STRING. To use this, you must use the cgiwrapped URL, with the "backdoor" code set. The format is: where:

Examples

1MKU4ed8 - look up the kanji with the Unicode codepoint "4ed8"

4MDJkoujou - look up the Japanese word "koujou" in dictionary 4.

1MDErabbit - look up the word "rabbit" in EDICT

9MGG%xx%xx%xx%xx%xx%xx%xx - gloss the (EUC) text

Note that if you want to use this method with other sites, you will need to modify the URL accordingly.
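For callers generating such URLs from a program, the following Python sketch builds query strings matching the examples above. The base URL is a placeholder, not a real WWWJDIC address, and the function name is hypothetical. The key is percent-encoded from its EUC-JP bytes, which is what the "%xx" sequences in the glossing example stand for.

```python
from urllib.parse import quote

# Placeholder only; substitute the cgiwrapped URL of the site you are using.
BASE_URL = "http://example.org/cgi-bin/wwwjdic"


def backdoor_url(code, key, base_url=BASE_URL):
    """Append a backdoor code such as '1MDE' and an EUC-JP percent-encoded key."""
    return f"{base_url}?{code}{quote(key, safe='', encoding='euc_jp')}"


english_lookup = backdoor_url("1MDE", "rabbit")
gloss_request = backdoor_url("9MGG", "日本語")
```

ASCII keys such as "rabbit" pass through unchanged, while each Japanese character becomes two percent-encoded bytes.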

ACKNOWLEDGMENTS

I want to record my thanks to a few of the key people who have helped with the server. E&OE.