PicoSearch FAQ: Indexing more than just HTML
Can PicoSearch index files besides HTML? (PDF, DOC, ASP, etc.)
Yes, PicoSearch can search many formats.
To begin with, PicoSearch will index any plain text or HTML file. Technically, this means files that are served over HTTP with a content-type of "text/html" or "text/plain". This should cover pages that come from ASP and other scripts.
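If you want to check ahead of time which of your pages qualify, the content-type rule above can be sketched in a few lines of Python. This is only an illustration, not PicoSearch's own code: the names INDEXABLE_TYPES, media_type, and is_indexable are made up for this example, and the input is assumed to be the raw value of a Content-Type response header.

```python
# Hypothetical sketch: decide whether a Content-Type header value
# names one of the two media types described above.
INDEXABLE_TYPES = {"text/html", "text/plain"}


def media_type(content_type_header: str) -> str:
    """Return the bare media type, dropping parameters such as charset."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_indexable(content_type_header: str) -> bool:
    """True when the header names text/html or text/plain."""
    return media_type(content_type_header) in INDEXABLE_TYPES
```

A script-generated page (ASP, CGI, and so on) passes this check as long as the server labels its output as "text/html" or "text/plain", regardless of the file extension in the URL.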
Additionally, PicoSearch keeps getting smarter about more file types! Here is a current list of what PicoSearch can index. Remember that your search engine is also subject to a maximum page limit, where one URL document (file or web page) generally counts as one PicoSearch page. Extra-long HTML or text files, or multi-page non-HTML formats such as PDFs, may yield more than one PicoSearch page per document. If you find yourself needing more pages, see our plan rates and services.
For All Accounts
(Free as well as Professional and Premium Accounts!)
HTML files
(.html types including .htm and .shtml, and any content-type "text/html")
PicoSearch will index your HTML files, including any generated by URLs that point to server scripts like ASP, CGI Perl, etc. This feature is on by default. You can turn on or off the indexing of titles, meta-tags, image alt tags, and even the whole page's body to get different searching effects - see the Index Modes section of your Account Manager's indexing topics. (If your HTML files aren't indexing as you expected, consider the FAQ on Finding all of your Pages.)
Plain Text files
(.txt, and any content-type "text/plain")
PicoSearch will index your plain text files. This feature is on by default, and you can turn it off in the Index Modes section of your Account Manager's indexing topics.
XML files
(.xml, and any content-type "text/xml")
PicoSearch will index the text (not tags) of your XML files. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
Note: Delivery of results in XML with DTD is a separate feature that is available for an install fee; see FAQ.
MP3 Files
(.mp3)
PicoSearch will index the song title, artist, album, and other text tags in your MP3 files, provided they are stored in the ID3 tag format v1.0 or v1.1. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
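To see which text an ID3 v1.0 or v1.1 tag actually holds, here is a minimal Python sketch that reads one. It relies on the published ID3v1 layout (a fixed 128-byte block at the end of the file that begins with "TAG"); the function name parse_id3v1 is hypothetical, and nothing here is taken from PicoSearch's own indexer.

```python
from typing import Optional


def parse_id3v1(data: bytes) -> Optional[dict]:
    """Read the ID3v1 tag from the last 128 bytes of MP3 data, if present."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None  # no ID3v1 tag in this file

    def text(raw: bytes) -> str:
        # Fields are fixed width and padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    fields = {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
    }
    comment = tag[97:127]
    # v1.1 keeps a track number in the last comment byte,
    # marked by a zero byte just before it.
    if comment[28] == 0 and comment[29] != 0:
        fields["comment"] = text(comment[:28])
        fields["track"] = comment[29]
    else:
        fields["comment"] = text(comment)
    return fields
```

Calling parse_id3v1(open("song.mp3", "rb").read()) on one of your own files (the file name is just a placeholder) shows the fields that would be available as searchable text. Files tagged only with newer ID3 versions have no such trailing block, and the function returns None for them.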
MIDI Files
(.midi and .mid)
PicoSearch will index your MIDI files in two ways. First, the name of the file will be indexed, as "song name: filename.mid". Second, all the text events of the MIDI standard will be indexed. These are codes 1 through 7, used respectively for a general text event, copyright info, track name, track instrument name, lyric, marker, and cue. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
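If you are curious which text events your own MIDI files contain, the following Python sketch lists them. It follows the Standard MIDI File layout (an "MThd" header chunk followed by "MTrk" track chunks, with text stored in meta events 0xFF 0x01 through 0xFF 0x07). The function names and the labels in TEXT_EVENT_NAMES are chosen for this example only, and the code is not PicoSearch's implementation.

```python
import struct

# Meta event codes 1-7, labelled as in the description above.
TEXT_EVENT_NAMES = {
    1: "text", 2: "copyright", 3: "track name", 4: "instrument name",
    5: "lyric", 6: "marker", 7: "cue",
}


def read_vlq(data: bytes, pos: int):
    """Decode a MIDI variable-length quantity; return (value, next position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos


def scan_track(track: bytes) -> list:
    """Collect (label, text) pairs for meta events 1-7 in one track chunk."""
    pos, status, found = 0, 0, []
    while pos < len(track):
        _, pos = read_vlq(track, pos)          # skip the delta time
        byte = track[pos]
        if byte == 0xFF:                       # meta event: type, length, data
            meta_type = track[pos + 1]
            length, pos = read_vlq(track, pos + 2)
            payload = track[pos:pos + length]
            pos += length
            if meta_type in TEXT_EVENT_NAMES:
                found.append((TEXT_EVENT_NAMES[meta_type],
                              payload.decode("latin-1")))
        elif byte in (0xF0, 0xF7):             # system exclusive: length, data
            length, pos = read_vlq(track, pos + 1)
            pos += length
        else:                                  # channel message
            if byte & 0x80:
                status = byte                  # new status byte
                pos += 1
            # otherwise running status: reuse the previous status byte
            pos += 1 if (status & 0xF0) in (0xC0, 0xD0) else 2
    return found


def midi_text_events(data: bytes) -> list:
    """Return every text event (codes 1-7) found in a Standard MIDI File."""
    if data[:4] != b"MThd":
        raise ValueError("not a Standard MIDI File")
    pos = 8 + struct.unpack(">I", data[4:8])[0]   # skip the header chunk
    found = []
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        chunk_len = struct.unpack(">I", data[pos + 4:pos + 8])[0]
        pos += 8
        if chunk_id == b"MTrk":
            found.extend(scan_track(data[pos:pos + chunk_len]))
        pos += chunk_len
    return found
```

The sketch assumes a well-formed file and does no recovery from truncated chunks. Text is decoded as Latin-1 here because the MIDI standard does not fix an encoding for these events; that choice is an assumption of this example.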
Shockwave Files
(.swf)
PicoSearch will follow your Shockwave file links, so you can create exciting navigation for your site using Macromedia's Flash tools. PicoSearch can also index the text fields of your Shockwave files, and since you may not have expected these fields to be searchable, this feature is off by default. You can turn field searching on in the Additional Formats section of your Account Manager's indexing topics. The fields that will be indexed are: the frame names associated to the actions in GetUrls, and the default text from editable text boxes in files created with Flash 4.
For Professional and Premium Accounts only
MS Word
(.doc)
PicoSearch will index the text of your Microsoft Word documents for versions 5, 6, 97, and 2000. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
MS Excel
(.xls)
PicoSearch will index the text of Microsoft Excel spreadsheets for versions 5, 6, 97, and 2000. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
MS PowerPoint
(.ppt)
PicoSearch will index the text of Microsoft PowerPoint presentations. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
Rich Text Format
(.rtf)
PicoSearch will index the text of files in Rich Text Format, commonly used in Microsoft applications. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
Adobe PostScript
(.ps)
PicoSearch will index the text of your Adobe PostScript documents. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
Adobe PDF
(.pdf)
PicoSearch will index the text of your Adobe Acrobat PDF documents, and the Adobe titles and meta descriptions will also be found. If your PDF titles look strange, check whether the PDF title property was set (if it is blank, the URL should be used instead). PDF indexing is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
Please note that by default PicoSearch will also honor the Acrobat security profile with which your files have been saved, and will not index files that you have copy-protected. You have a separate option to include copy-protected PDFs. Thus, the two common reasons why a PDF file yields no content are that it is all graphical, or that it is copy-protected and the option to include copy-protected PDFs is off.
Title and Meta Trick:
For non-HTML documents, you may have special application attributes that are not in the body of the document. PicoSearch indexes PDF titles and metas without trouble, but in other formats it may not pick such things up. In that case the title defaults to the URL (see the switch "Show Just File Name if Title Defaults to URL" under Configure Results in your Account Manager), the meta description is taken from the first few lines of the document, and the keywords found are those in the text. However, all documents are converted to plain text and then processed like HTML, so you can plant HTML title and meta tags in your non-HTML documents and PicoSearch will use this information. If you make the text white on white, anywhere in the document, users will never notice it. For example, your document could show a title in search results if the following was in the text: <title>My PDF document</title>
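To make the trick concrete, here is a rough Python sketch of how planted tags could be picked out of a document once it has been converted to plain text. It is purely illustrative: describe_document, the two regular expressions, and the 200-character fallback for the description are assumptions made for this example, and PicoSearch's real processing is not documented here.

```python
import re

# Hypothetical patterns for tags planted in converted text.
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_RE = re.compile(
    r'<meta\s+name="description"\s+content="(.*?)"\s*/?>',
    re.IGNORECASE | re.DOTALL,
)


def describe_document(url: str, text: str) -> dict:
    """Pick a title and description for a document converted to plain text."""
    title_match = TITLE_RE.search(text)
    meta_match = META_RE.search(text)
    # No planted title: fall back to the URL, as described above.
    title = title_match.group(1).strip() if title_match else url
    if meta_match:
        description = meta_match.group(1).strip()
    else:
        # Assumed stand-in for "the first few lines of the document".
        description = " ".join(text.split())[:200]
    return {"title": title, "description": description}
```

With the example tag from above planted in a PDF, describe_document("http://www.example.com/report.pdf", converted_text) would report "My PDF document" as the title (the URL is a placeholder); without it, the title falls back to the URL.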
Patents Pending. Copyright © Picosearch LLC