The Wayback Machine - https://web.archive.org/all/20050825152742/http://www.picosearch.com:80/faqs/faq_text_too.html
PicoSearch FAQ: Indexing more than just HTML



Can PicoSearch index files besides HTML? (PDF, DOC, ASP, etc.)

Yes, PicoSearch can search many formats.
 
To begin with, PicoSearch will index any plain text or HTML file. Technically, this means any file that is served over HTTP with a content-type of "text/html" or "text/plain". This should cover pages generated by ASP and other server scripts.
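
As a rough illustration of the content-type rule above, the following Python sketch decides whether a given Content-Type header value falls into the two types indexed by default. The function names and the DEFAULT_INDEXABLE set are hypothetical and only mirror the wording of this FAQ; this is not PicoSearch's actual code.

```python
# Content types this FAQ says are always indexed (assumption: an exact match
# on the media type, ignoring parameters such as charset).
DEFAULT_INDEXABLE = {"text/html", "text/plain"}


def media_type(header_value: str) -> str:
    """Return the bare media type, e.g. 'text/html' from 'text/html; charset=utf-8'."""
    return header_value.split(";", 1)[0].strip().lower()


def is_default_indexable(header_value: str) -> bool:
    """True if the Content-Type header names one of the default indexable types."""
    return media_type(header_value) in DEFAULT_INDEXABLE


if __name__ == "__main__":
    for value in ("text/html; charset=ISO-8859-1", "Text/Plain", "application/pdf"):
        print(value, "->", is_default_indexable(value))
```

An ASP or CGI page passes this check whenever the script sends "text/html" as its content-type, regardless of the file extension in the URL.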
 
Additionally, PicoSearch keeps getting smarter about more file types! Here is the current list of what PicoSearch can index. Remember that your search engine is also subject to a maximum page limit. One URL document (file or web page) generally counts as one PicoSearch page, although extra-long HTML or text files, and multi-page non-HTML formats such as PDF, may count as more than one PicoSearch page per document. If you find yourself needing more pages, see our plan rates and services.
 
For All Accounts
(Free as well as Professional and Premium Accounts!)
  • HTML files (.html types including .htm and .shtml, and any content-type "text/html")
    PicoSearch will index your HTML files, including any generated by URLs that point to server scripts such as ASP, Perl CGI, etc. This feature is on by default. You can turn the titles, meta tags, image alt tags, and even the whole page body on or off to get different searching effects; see the Index Modes section of your Account Manager's indexing topics. (If your HTML files aren't indexing as you expected, consider the FAQ on Finding all of your Pages.)

  • Plain Text files (.txt, and any content-type "text/plain")
     PicoSearch will index your plain text files. This feature is on by default, and you can turn it off in the Index Modes section of your Account Manager's indexing topics.

  • XML files (.xml, and any content-type "text/xml")
     PicoSearch will index the text (not tags) of your XML files. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
    Note: Delivery of results in XML with DTD is a separate feature that is available for an install fee; see the FAQ.

  • MP3 Files (.mp3)
     PicoSearch will index the song title, artist, album, and other text tags in your MP3 files, provided they are stored in ID3 tag format v1.0 or v1.1. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • MIDI Files (.midi and .mid)
     PicoSearch will index your MIDI files in two ways. First, the name of the file will be indexed as "song name: filename.mid". Second, all of the text events of the MIDI standard will be indexed. These are the codes 1-7, used respectively for a general text event, copyright info, track name, track instrument name, lyric, marker, and cue. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • Shockwave Files (.swf)
     PicoSearch will follow your Shockwave file links, so you can create exciting navigation for your site using Macromedia's Flash tools. PicoSearch can also index the text fields of your Shockwave files. Since you may not have expected these fields to be searchable, this feature is off by default. You can turn field searching on in the Additional Formats section of your Account Manager's indexing topics. The fields that will be indexed are the frame names associated with the actions in GetUrls, and the default text from editable text boxes in files created with Flash 4.
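
For the MP3 entry in the list above, here is a minimal Python sketch of how ID3 v1.0 and v1.1 text tags can be read. It relies on the published ID3v1 layout (a fixed 128-byte block at the end of the file that starts with "TAG"). The function name read_id3v1 and the Latin-1 decoding are assumptions made for illustration; this is not PicoSearch's own indexer.

```python
from typing import Optional


def read_id3v1(data: bytes) -> Optional[dict]:
    """Parse the 128-byte ID3v1 tag at the end of MP3 data, or return None."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None  # no ID3v1 tag present

    def text(raw: bytes) -> str:
        # Fields are fixed width and padded with NUL bytes (sometimes spaces).
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    fields = {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
    }
    # ID3v1.1: a zero byte at offset 125 followed by a non-zero track number
    # at offset 126 shortens the comment field to 28 bytes.
    if tag[125] == 0 and tag[126] != 0:
        fields["comment"] = text(tag[97:125])
        fields["track"] = tag[126]
    return fields


if __name__ == "__main__":
    sample = (
        b"\xff\xfb" * 10                      # stand-in for audio frames
        + b"TAG"
        + b"Song".ljust(30, b"\x00")
        + b"Artist".ljust(30, b"\x00")
        + b"Album".ljust(30, b"\x00")
        + b"2005"
        + b"hi".ljust(28, b"\x00")
        + b"\x00" + bytes([7])                # v1.1 track marker and number
        + bytes([12])                         # genre byte
    )
    print(read_id3v1(sample))
```

Files tagged only with the later ID3v2 format keep their tags at the start of the file, so a reader like this one would return None for them.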

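For the MIDI entry in the list above, the text events with codes 1-7 are meta events in a Standard MIDI File. The Python sketch below walks the track chunks and collects them. The function names, the label strings, and the Latin-1 decoding are assumptions for illustration, and the parser does no error recovery on malformed files; it is not PicoSearch's implementation.

```python
import struct

# Meta event codes 1-7 as listed in this FAQ.
TEXT_EVENT_NAMES = {
    1: "text", 2: "copyright", 3: "track name", 4: "instrument name",
    5: "lyric", 6: "marker", 7: "cue",
}


def read_vlq(data: bytes, pos: int):
    """Read a MIDI variable-length quantity; return (value, new position)."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte < 0x80:
            return value, pos


def track_text_events(data: bytes, pos: int, end: int):
    """Collect (name, text) pairs from one MTrk chunk body."""
    events = []
    status = 0
    while pos < end:
        _, pos = read_vlq(data, pos)            # skip the delta time
        byte = data[pos]
        if byte == 0xFF:                        # meta event: type, length, data
            meta_type = data[pos + 1]
            length, pos = read_vlq(data, pos + 2)
            if meta_type in TEXT_EVENT_NAMES:
                text = data[pos:pos + length].decode("latin-1")
                events.append((TEXT_EVENT_NAMES[meta_type], text))
            pos += length
        elif byte in (0xF0, 0xF7):              # system exclusive: length, data
            length, pos = read_vlq(data, pos + 1)
            pos += length
        else:                                   # channel message
            if byte & 0x80:                     # explicit status byte
                status = byte
                pos += 1
            # Program change and channel pressure carry one data byte, others two.
            pos += 1 if (status & 0xF0) in (0xC0, 0xD0) else 2
    return events


def midi_text_events(data: bytes):
    """Return all text meta events (codes 1-7) found in a Standard MIDI File."""
    if data[:4] != b"MThd":
        raise ValueError("not a Standard MIDI File")
    pos = 8 + struct.unpack(">I", data[4:8])[0]
    found = []
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        chunk_len = struct.unpack(">I", data[pos + 4:pos + 8])[0]
        pos += 8
        end = min(pos + chunk_len, len(data))
        if chunk_id == b"MTrk":
            found.extend(track_text_events(data, pos, end))
        pos = end
    return found
```

Together with a "song name: filename.mid" entry built from the file name, the returned pairs would give the searchable text described for MIDI files.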

For Professional and Premium Accounts only
  • MS Word (.doc)
     PicoSearch will index the text of your Microsoft Word documents for versions 5, 6, 97, and 2000. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • MS Excel (.xls)
     PicoSearch will index the text of Microsoft Excel spreadsheets for versions 5, 6, 97, and 2000. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • MS PowerPoint (.ppt)
     PicoSearch will index the text of Microsoft PowerPoint presentations. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • Rich Text Format (.rtf)
     PicoSearch will index the text of Rich Text Format files, a format commonly used in Microsoft applications. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • Adobe PostScript (.ps)
     PicoSearch will index the text of your Adobe PostScript documents. This feature is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.

  • Adobe PDF (.pdf)
     PicoSearch will index the text of your Adobe Acrobat PDF documents, and the PDF titles and meta descriptions will also be found. If your PDF titles come out strange, you may not have set the PDF title property (if it's blank, the URL should get used). PDF indexing is on by default, and you can turn it off in the Additional Formats section of your Account Manager's indexing topics.
       Please note that by default PicoSearch will also honor the Acrobat security profile with which your files have been saved, and will not index files that you have copy-protected. You have a separate option to include copy-protected PDFs. Thus, two common reasons why a PDF file yields no content are that it is entirely graphical, or that it is copy-protected and the option to include copy-protected PDFs is off.


Title and Meta Trick: Non-HTML documents may have special application attributes, such as a title, that are not in the body of the document. PicoSearch indexes PDF titles and metas with no problem, but in other formats it may not pick up such attributes. In that case the title defaults to the URL (see the switch "Show Just File Name if Title Defaults to URL" under Configure Results in the account manager), the meta description will be the first few lines of the document, and the keywords found will be those in the text.

Just so you know, all documents are converted to plain text and then processed like HTML. So you can literally plant HTML title and meta tags in your non-HTML documents, and PicoSearch will use this information. If you make the text white on white, anywhere in the document, users will never even notice! For example, your document would show a title in search results if the following were in its text: <title>My PDF document</title>
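
To make the trick concrete, here is a small Python sketch of why planted tags work once a document has been reduced to plain text and then scanned like HTML. The regular expressions and the name planted_tags are hypothetical and only approximate the behavior described here; how PicoSearch actually parses the converted text is not documented in this FAQ.

```python
import re

# Patterns for tags planted anywhere in the converted plain text (assumption:
# simple, well-formed tags with quoted attribute values).
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_RE = re.compile(
    r'<meta\s+name=["\']?(description|keywords)["\']?\s+content=["\'](.*?)["\']',
    re.IGNORECASE | re.DOTALL,
)


def planted_tags(plain_text: str, url: str) -> dict:
    """Return the title (falling back to the URL) plus any planted meta tags."""
    result = {"title": url}                     # default when no title is found
    match = TITLE_RE.search(plain_text)
    if match and match.group(1).strip():
        result["title"] = " ".join(match.group(1).split())
    for name, content in META_RE.findall(plain_text):
        result[name.lower()] = content.strip()
    return result


if __name__ == "__main__":
    text = (
        "Quarterly report\n"
        "<title>My PDF document</title>\n"
        '<meta name="description" content="Annual figures">\n'
        "Revenue rose in every region."
    )
    print(planted_tags(text, "http://www.example.com/report.doc"))
```

A document without a planted title simply keeps the URL as its title, which matches the default described above.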




Patents Pending. Copyright © Picosearch LLC