PicoSearch FAQ: Character Set control
How can I control the international language character set used at indexing time?
Character set becomes an important issue only if it's not working for you.
Most world languages are going to be indexed just fine by PicoSearch, so you won't even have to think about it. The PicoSearch results display has also been translated into many languages; see the Results Language setting of your Account Manager.
Special Notes:
If you have a non-Western character language, such as Arabic or Cyrillic, you may find that the Exact Phrase searching mode works best to prevent extra results; see Any/All/Exact Initializing. Also, if you intend to search characters other than ISO-Latin1 in PDFs or other special filtered formats, be sure to test the behavior to your satisfaction first. You can request a trial version of PicoSearch Professional to do this; just email us.
PicoSearch will search all single-byte character set languages.
This includes the non-Asian languages, and some Asian languages as well. But Asian languages which have hundreds or thousands of glyphs must use double-byte character sets, and these are supported individually by PicoSearch only as they are developed, including for the concordance results. Check the Alternate Character Options section of the Indexing Topics in your Account Manager for these major choices as they become available.
UTF-8 Support: UTF-8 Unicode is increasingly popular with hosters because it can include any language rather transparently. The cost of this is that UTF-8 is a variable-width encoding rather than a single-byte character set: plain Western (ASCII) characters take a single byte, but accented and other characters take two or more bytes. So a page may look good in a browser, but the actual language is less clearly specified, and software may have more problems, not fewer. PicoSearch's current solution is this:
UTF-8 is supported for Western European languages by implicit conversion to the iso-8859-1 character set (see below for details on the ISO sets). So French, German, Spanish, Italian, etc. all work as well as they do in iso-8859-1. But Chinese, Japanese, etc. won't work any better even though UTF-8 made them seem easier, and even East European languages may not work in UTF-8.
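As a rough illustration of the conversion described above, the following Python sketch decodes UTF-8 bytes and tries to re-encode them as iso-8859-1. The function name `utf8_to_latin1` is hypothetical and this is not PicoSearch's actual code; it only shows why Western European text survives the conversion while other text does not.

```python
def utf8_to_latin1(raw: bytes):
    """Convert UTF-8 bytes to iso-8859-1 bytes.

    Returns None when the text contains a character that has no
    iso-8859-1 equivalent (an assumption about how failure is reported).
    """
    text = raw.decode("utf-8")
    try:
        return text.encode("iso-8859-1")
    except UnicodeEncodeError:
        return None


def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    # Plain letter, accented Western letter, Polish letter, Japanese character.
    for sample in ["e", "\u00e9", "\u0142", "\u65e5"]:
        converted = utf8_to_latin1(sample.encode("utf-8"))
        print(ascii(sample), utf8_length(sample), converted)
```

A French word such as "café" converts cleanly, while Polish or Japanese text makes the function return None, which matches the limits described above.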
To correctly break apart your text into all the component words by distinguishing letters from punctuation, PicoSearch needs to decide what your dominant character set is at indexing time. Character set is bigger than language, so don't worry that this means you can't be multi-lingual; it just means that you have to choose the right character set for your languages. Browsers need to know this information too, so if you're designing a non-English, non-West European site then you probably already know which character set to use and where to add it to your web pages. When displaying search results, PicoSearch will insert the charset for the browser in the output HTML page of Free accounts. If you have a paying account, you'll have to put the charset codes that you want in your Customize Template section, since the HTML page design is under your control.
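To see why the character set matters for breaking text into words, here is a minimal Python sketch. It assumes a very simple tokenizer that treats every run of letters as a word; the function `split_words` is hypothetical and is not how PicoSearch is implemented. The same bytes are decoded once with the right set and once with the wrong one.

```python
def split_words(raw: bytes, charset: str) -> list:
    """Decode raw bytes with the given charset and split on non-letters."""
    text = raw.decode(charset)
    words, current = [], []
    for ch in text:
        if ch.isalpha():
            current.append(ch)
        elif current:
            # A non-letter ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


# The Russian word for "rose", stored as iso-8859-5 bytes.
raw = "\u0440\u043e\u0437\u0430".encode("iso-8859-5")

right = split_words(raw, "iso-8859-5")  # one word, as intended
wrong = split_words(raw, "iso-8859-1")  # one byte decodes as a symbol, not a letter
```

With the wrong set, one of the bytes is read as the multiplication sign, so the single word falls apart into two fragments.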
PicoSearch will pick up the character set of your site's pages as specified in any one of the following ways. In these examples, iso-8859-1 is the most common set, which works for English and West European languages. It is the default, but it never hurts to state it explicitly.
HTTP Equivalents in the HTML head
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> (most common)
<meta http-equiv="charset" content="iso-8859-1"> (less common)
<meta charset="iso-8859-1"> (Internet Explorer only, not recommended)
Server specified
In the charset field of the HTTP Content-Type header sent by your web server, which the HTTP equivalent in the page overrides.
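The lookup order described above could be sketched in Python roughly as follows. This is an illustration under assumptions, not PicoSearch's code: the names `MetaCharsetParser` and `detect_charset` are hypothetical, the page is assumed to be already available as a string, and the server value is assumed to arrive as the text of the Content-Type header.

```python
import re
from html.parser import HTMLParser

CHARSET_RE = re.compile(r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE)


class MetaCharsetParser(HTMLParser):
    """Remembers the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        a = {name: (value or "") for name, value in attrs}
        equiv = a.get("http-equiv", "").lower()
        content = a.get("content", "")
        if a.get("charset"):
            # <meta charset="...">
            self.charset = a["charset"].strip().lower()
        elif equiv == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            match = CHARSET_RE.search(content)
            if match:
                self.charset = match.group(1).lower()
        elif equiv == "charset" and content.strip():
            # <meta http-equiv="charset" content="...">
            self.charset = content.strip().lower()


def detect_charset(html_text, http_content_type=None, default="iso-8859-1"):
    """Page declaration first, then the server header, then the default."""
    parser = MetaCharsetParser()
    parser.feed(html_text)
    if parser.charset:
        return parser.charset
    if http_content_type:
        match = CHARSET_RE.search(http_content_type)
        if match:
            return match.group(1).lower()
    return default
```

For example, `detect_charset('<meta charset="iso-8859-7">')` gives `iso-8859-7`, and a page with no declaration and no server header falls back to `iso-8859-1`.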
The following list shows the character sets that PicoSearch knows about, provided that the set is specified for your pages as described above. If you are using an unknown or unspecified set, iso-8859-1 will be used by default. A symptom of PicoSearch not using the right set would be that a search finds a non-Western character individually, as if it were a word by itself. PicoSearch will tell you the character set it used for your index in the Alternate Character Options section of your Account Manager. If you have any problems, please just contact us.
Western European, ISO Latin1 (iso-8859-1)
most versatile, includes English, Spanish, French, German, Italian, Portuguese, Dutch, Danish, Swedish, Catalan, and more
Central European, ISO Latin2 (iso-8859-2)
covers the Slavic languages and more, including Czech, Hungarian, Polish, Romanian, and Croatian
South European, ISO Latin3 (iso-8859-3)
special set for Maltese and some others
North European, ISO Latin4 (iso-8859-4)
for Estonian and Baltic languages including Lithuanian, Latvian, and Lappish
Cyrillic, ISO (iso-8859-5)
Arabic, ISO (iso-8859-6)
Greek, ISO (iso-8859-7)
Hebrew, ISO (iso-8859-8)
Turkish, ISO Latin5 (iso-8859-9)
Nordic, ISO Latin6 (iso-8859-10)
Central European, Win Latin2 (windows-1250)
Slavic, Windows Cyrillic (windows-1251)
Western European, Win Latin1 (windows-1252)
compare to ISO Latin1
Greek, Windows (windows-1253)
Turkish, Win Latin5 (windows-1254)
Hebrew, Windows (windows-1255)
Arabic, Windows (windows-1256)
Baltic, Windows (windows-1257)
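The fallback rule for the list above can be sketched in a few lines of Python. The names `KNOWN_CHARSETS` and `resolve_charset` are hypothetical; the set simply mirrors the names listed here, and the check against Python's own codec registry is only a convenience of this sketch.

```python
import codecs

# The ISO and Windows sets named in the list above.
KNOWN_CHARSETS = (
    {f"iso-8859-{n}" for n in range(1, 11)}
    | {f"windows-{n}" for n in range(1250, 1258)}
)


def resolve_charset(declared, default="iso-8859-1"):
    """Return the declared set if it is in the list, else the default."""
    name = (declared or "").strip().lower()
    return name if name in KNOWN_CHARSETS else default


# Every listed name is also a codec that Python itself can look up.
python_codecs = {name: codecs.lookup(name).name for name in KNOWN_CHARSETS}
```

So `resolve_charset("ISO-8859-7")` keeps the Greek set, while an unlisted or missing declaration resolves to `iso-8859-1`.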
Patents Pending. Copyright © Picosearch LLC