Administering Crawl for Web and File Share Content: Monitoring and Troubleshooting Crawls

Posted July, 2007

This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.

For the Google Search Appliance, information about continuous crawl applies to software version 4.2, and information about full crawl and file system crawl applies to software version 4.6 and later.

For the Google Mini, all information applies to software version 4.4 and later.

Contents

  1. Using the Admin Console to Monitor a Crawl
    1. Crawl Status Messages
  2. Slow Crawl Rate
    1. Non-HTML Content
    2. Complex Content
    3. Host Load
    4. Network Problems
    5. Slow Web Servers
    6. Query Load
  3. Wait Times
  4. Errors from Web Servers
    1. URL Moved Permanently Redirect (301)
    2. URL Moved Temporarily Redirect (302)
    3. Cyclic Redirects
  5. URL Rewrite Rules
    1. BroadVision Web Server
    2. Sun Java System Web Server
    3. Microsoft Commerce Server
    4. Servers that Run Java Servlet Containers
    5. Lotus Domino Enterprise Server
      1. OpenDocument URLs
      2. URLs with # Suffixes
      3. Multiple Versions of the Same URL
    6. ColdFusion Application Server
    7. Index Pages

Using the Admin Console to Monitor a Crawl

The Admin Console provides Status and Reports pages that enable you to monitor crawling. The following table describes the monitoring tasks that you can perform using these pages.

Task: Monitor crawling status
Admin Console page: Status and Reports > Crawl Status
While the search appliance is crawling, you can view summary information about events of the past 24 hours using the Status and Reports > Crawl Status page.

You can also use this page to stop a full crawl, or to pause or restart a continuous crawl.

Task: Monitor crawling history
Admin Console page: Status and Reports > Crawl Diagnostics
While the search appliance is crawling, you can view its history using the Status and Reports > Crawl Diagnostics page. Crawl diagnostics, as well as search logs and search reports, are organized by collection.

When the Status and Reports > Crawl Diagnostics page first appears, it shows the crawl history for the current domain. It shows each URL that has been fetched and timestamps for the last 10 fetches. If a fetch was not successful, an error message is also listed.

From the domain level, you can navigate to lower levels that show the history for a particular host, directory, or URL. At each level, the Status and Reports > Crawl Diagnostics page displays information that is pertinent to the selected level.

At the URL level, the Status and Reports > Crawl Diagnostics page shows summary information as well as a detailed Crawl History.

You can also use this page to submit a URL for recrawl.

Task: Take a snapshot of the crawl queue
Admin Console page: Status and Reports > Crawl Queue
At any time while the search appliance is crawling, you can define and view a snapshot of the crawl queue using the Status and Reports > Crawl Queue page. A crawl queue snapshot displays the URLs that are waiting to be crawled as of the moment of the snapshot.

For each URL, the snapshot shows:
  • Enterprise PageRank
  • Last crawled time
  • Next scheduled crawl time
  • Change interval

Task: View information about crawled files
Admin Console page: Status and Reports > Content Statistics
At any time while the search appliance is crawling, you can view summary information about files that have been crawled using the Status and Reports > Content Statistics page. You can also use this page to export the summary information to a comma-separated values file.

Crawl Status Messages

In the Crawl History for a specific URL on the Status and Reports > Crawl Diagnostics page, the Crawl Status column lists various messages, as described in the following table.

Crawled: New Document
  The search appliance successfully fetched this URL.

Crawled: Cached Version
  The search appliance crawled the cached version of the document. The search appliance sent an If-Modified-Since field in the HTTP header of its request and received a 304 response, indicating that the document is unchanged since the last crawl.

Retrying URL: Connection Timed Out
  The search appliance set up a connection to the Web server and sent its request, but the Web server did not respond within three minutes.

Retrying URL: Host Unreachable while trying to fetch robots.txt
  The search appliance could not connect to a Web server when trying to fetch robots.txt.

Retrying URL: Received 500 server error
  The search appliance received a 500 status message from the Web server, indicating that there was an internal error on the server.

Excluded: Document not found (404)
  The search appliance did not successfully fetch this URL. The Web server responded with a 404 status, which indicates that the document was not found. If a URL gets a 404 status when it is recrawled, it is removed from the index within 30 minutes.

Cookie Server Failed
  The search appliance did not successfully fetch a cookie using the cookie rule. Before crawling any Web pages that match patterns defined for Forms Authentication or Cookie sites, the search appliance executes the cookie rules.
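The If-Modified-Since exchange behind the "Crawled: Cached Version" status can be reproduced with any HTTP client. The following Python sketch (standard library only; the URL and date are placeholders, not values from this document) sends a conditional request and reports whether the server answers with 304 Not Modified or returns the full document.

  import urllib.request
  import urllib.error

  # Placeholder URL; substitute a page on your own Web server.
  url = "http://myserver.com/home_page.html"

  request = urllib.request.Request(
      url,
      headers={"If-Modified-Since": "Mon, 02 Jul 2007 00:00:00 GMT"},
  )

  try:
      with urllib.request.urlopen(request) as response:
          # A 200 response means the server sent the full document.
          print(response.status, "- full document,", len(response.read()), "bytes")
  except urllib.error.HTTPError as err:
      if err.code == 304:
          # The document is unchanged since the date in If-Modified-Since.
          print("304 Not Modified - the cached copy is still current")
      else:
          raise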

Slow Crawl Rate

The Status and Reports > Crawl Status page displays the Current Crawl Rate, which is the number of URLs being crawled per second. A slow crawl rate may be caused by the following factors:

  • Non-HTML content
  • Complex content
  • Host load
  • Network problems
  • Slow Web servers
  • Query load

These factors are described in the following sections.

Non-HTML Content

The search appliance converts non-HTML documents, such as PDF files and Microsoft Office documents, to HTML before indexing them. This is a CPU-intensive process that can take up to five seconds per document. If more than 100 documents are queued up for conversion to HTML, the search appliance stops fetching more URLs.

You can see the HTML that is produced by this process by clicking the cached link for a document in the search results.

If the search appliance is crawling a single UNIX/Linux Web server, you can run the tail command-line utility on the server access logs to see what was recently crawled. The tail utility copies the last part of a file. You can also run the tcpdump command to create a dump of network traffic that you can use to analyze a crawl.
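If you prefer a script to raw tail output, the following Python sketch reads the last lines of an access log and keeps only the crawler's requests. The log path and the "gsa-crawler" user-agent substring are assumptions; adjust both for your server.

  from collections import deque

  LOG_PATH = "/var/log/apache2/access.log"   # assumed log location
  CRAWLER_TOKEN = "gsa-crawler"              # assumed user-agent substring

  # Keep only the last 200 lines, like the tail utility, then filter
  # for lines that record requests from the search appliance crawler.
  with open(LOG_PATH, errors="replace") as log:
      recent = deque(log, maxlen=200)

  for line in recent:
      if CRAWLER_TOKEN in line:
          print(line.rstrip())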

If the search appliance is crawling multiple Web servers, it can crawl through a proxy.

Complex Content

Crawling many complex documents can cause a slow crawl rate.

To ensure that static complex documents are not recrawled as often as dynamic documents, add the URL patterns to the Crawl Infrequently URLs on the Crawl and Index > Freshness Tuning page.

Host Load

If the search appliance crawler receives many temporary server errors (500 status codes) when crawling a host, crawling slows down.

To speed up crawling, you may need to increase the number of concurrent connections allowed to the Web server by using the Crawl and Index > Hostload Schedule page.

Network Problems

Network problems, such as latency, packet loss, or reduced bandwidth, can be caused by several factors.

To find out what is causing a network problem, you can run the following tests from a device on the same network as the search appliance.

Use the wget program (available on most operating systems) to retrieve some large files from the Web server, with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network problems.
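To put numbers on this comparison, you can time the same download yourself. The following sketch (the file URL is a placeholder) measures elapsed time and effective throughput; run it once with crawling running and once with crawling paused.

  import time
  import urllib.request

  # Placeholder: point this at a large file on the Web server being crawled.
  url = "http://myserver.com/downloads/large-file.zip"

  start = time.monotonic()
  with urllib.request.urlopen(url) as response:
      size = len(response.read())
  elapsed = time.monotonic() - start

  # Report size, elapsed time, and throughput for comparison between runs.
  print(f"{size} bytes in {elapsed:.1f} s ({size / elapsed / 1024:.0f} KB/s)")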

Run the traceroute network tool from a device on the same network as the search appliance and the Web server. If your network does not permit ICMP, then you can use tcptraceroute. You should run the traceroute with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network performance problems.

Packet loss is another indicator of a problem. You can narrow down the network hop that is causing the problem by seeing if there is a jump in the times taken at one point on the route.

Slow Web Servers

If response times are slow, you may have a slow Web server. To find out if your Web server is slow, use the wget command to retrieve some large files from the Web server. If it takes approximately the same time using wget as it does while crawling, you may have a slow Web server.

You can also log in to a Web server to determine whether there are any internal bottlenecks.

If you have a slow host, the search appliance crawler fetches lower-priority URLs from other hosts while continuing to crawl the slower host.

Query Load

The crawl processes on the search appliance are run at a lower priority than the processes that serve results. If the search appliance is heavily loaded serving search queries, the crawl rate drops.

Wait Times

During continuous crawling, you may find that the search appliance is not recrawling URLs as quickly as specified by scheduled crawl times in the crawl queue snapshot. The amount of time that a URL has been in the crawl queue past its scheduled recrawl time is the URL's "wait time."

Wait times occur when the crawl queue contains more URLs than the search appliance can recrawl by their scheduled crawl times.

If the search appliance crawler needs four hours to catch up to the URLs in the crawl queue whose scheduled crawl time has already passed, the wait time for crawling the URLs is four hours. In extreme cases, wait times can be several days. The search appliance cannot recrawl a URL more frequently than the wait time.

It is not possible for an administrator to view the maximum wait time for URLs in the crawl queue or to view the number of URLs in the queue whose scheduled crawl time has passed. However, you can use the Status and Reports > Crawl Queue page to create a crawl queue snapshot, which shows the URLs that are waiting to be crawled, along with each URL's Enterprise PageRank, last crawled time, next scheduled crawl time, and change interval.

Errors from Web Servers

If the search appliance receives an error when fetching a URL, it records the error in Status and Reports > Crawl Diagnostics and schedules a retry after a certain time interval. The search appliance maintains an error count for each URL, and the time interval between retries increases as the error count rises. The maximum retry interval is three weeks.

The search appliance crawler distinguishes between permanent and temporary errors. There is a lower retry interval for temporary errors than for permanent errors.

Permanent errors occur when the document is no longer reachable using the URL. When the search appliance encounters a permanent error, it removes the document from the crawl queue and the index, if present.

Temporary errors occur when the URL is unavailable because of a temporary move or a temporary user or server error. When the search appliance encounters a temporary error, it retains the document in the crawl queue and the index, with the intention of recrawling it at a later time.
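The appliance's exact retry schedule is not published, but the general shape described above, a per-URL error count driving a growing interval, with temporary errors retried sooner than permanent ones and an overall maximum of three weeks, can be pictured with a simple model. The base intervals and the doubling factor below are illustrative assumptions only.

  # Illustrative model only: the base intervals and doubling factor are
  # assumptions, not the search appliance's actual retry schedule.
  MAX_RETRY_HOURS = 21 * 24   # documented ceiling of three weeks

  def retry_interval_hours(error_count, temporary=True):
      base = 1 if temporary else 4   # assumed: temporary errors retried sooner
      return min(base * 2 ** (error_count - 1), MAX_RETRY_HOURS)

  for count in range(1, 11):
      print(count,
            retry_interval_hours(count, temporary=True),
            retry_interval_hours(count, temporary=False))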

The following list shows permanent and temporary Web server errors.

  • 301 (permanent): Redirect; the URL moved permanently.
  • 302 (temporary): Redirect; the URL moved temporarily.
  • 401 (temporary): Authentication required.
  • 404 (temporary): Document not found. URLs that get a 404 status response when they are recrawled are removed from the index within 30 minutes.
  • 500 (temporary): Temporary server error.
  • 501 (permanent): Not implemented.

In addition, the search appliance crawler refrains from visiting Web pages that have noindex and nofollow Robots META tags. For URLs excluded by Robots META tags, the maximum retry interval is one month.

You can view errors for a specific URL in the Crawl Status column on the Status and Reports > Crawl Diagnostics page.

URL Moved Permanently Redirect (301)

When the search appliance crawls a URL that has moved permanently, the Web server returns a 301 status. For example, the search appliance crawls the old address, http://myserver.com/301-source.html, and is redirected to the new address, http://myserver.com/301-destination.html. On the Status and Reports > Crawl Diagnostics page, the Crawl Status of the URL displays "Crawled: New Document" for both the source URL and the destination URL.

In search results, the URL of the 301 redirect appears as the URL of the destination page.

For example, if a user searches for info:http://myserver.com/301-source.html, the results display http://myserver.com/301-destination.html.

To enable search results to display a 301 redirect, ensure that start and follow URL patterns on the Crawl and Index > Crawl URLs page match both the source page and the destination page.
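You can confirm how the server answers for a moved URL with a client that does not follow redirects. The following sketch (the source URL is the placeholder used above) prints the status code and the Location header, which is what the crawler sees before it follows the redirect.

  import http.client
  from urllib.parse import urlsplit

  # Placeholder source URL for a permanently moved page.
  url = "http://myserver.com/301-source.html"
  parts = urlsplit(url)

  conn = http.client.HTTPConnection(parts.hostname, parts.port or 80)
  conn.request("GET", parts.path or "/")
  response = conn.getresponse()

  # A 301 response carries the new address in the Location header.
  print(response.status, response.reason)
  print("Location:", response.getheader("Location"))
  conn.close()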

URL Moved Temporarily Redirect (302)

When the search appliance crawls a URL that has moved temporarily, the Web server returns a 302 status. On the Status and Reports > Crawl Diagnostics page, the Crawl Status shows two entries, both listed under the source page; there is no entry for the destination page in a 302 redirect.

In search results, the URL of the 302 redirect appears as the URL of the source page.

To enable search results to display a 302 redirect, ensure that start and follow URL patterns on the Crawl and Index > Crawl URLs page only match the source page. It is not necessary for the patterns to match the destination page.

A META tag that specifies http-equiv="refresh" is handled as a 302 redirect.
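Because a META refresh is treated like a 302 redirect, it can be useful to find the pages that contain one. The following sketch scans a saved HTML file (the path is a placeholder) for http-equiv="refresh" tags using the standard-library HTML parser.

  from html.parser import HTMLParser

  class MetaRefreshFinder(HTMLParser):
      """Collects the content attribute of every <meta http-equiv="refresh"> tag."""
      def __init__(self):
          super().__init__()
          self.refreshes = []

      def handle_starttag(self, tag, attrs):
          attrs = dict(attrs)
          if tag == "meta" and (attrs.get("http-equiv") or "").lower() == "refresh":
              self.refreshes.append(attrs.get("content", ""))

  # Placeholder path to a page saved from your Web server.
  with open("page.html", encoding="utf-8", errors="replace") as f:
      finder = MetaRefreshFinder()
      finder.feed(f.read())

  for content in finder.refreshes:
      # Typical form: "0; url=http://myserver.com/destination.html"
      print("META refresh found:", content)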

Cyclic Redirects

A cyclic redirect is a request for a URL in which the response is a redirect back to the same URL with a new cookie. The search appliance detects cyclic redirects and sets the appropriate cookie.
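The following sketch illustrates the general technique for recognizing a cyclic redirect: follow redirects by hand, remember the URLs already requested, carry any Set-Cookie value forward, and stop when a Location header points back to a URL that has already been seen. The starting URL is a placeholder, and this is an illustration of the idea, not the appliance's implementation.

  import http.client
  from urllib.parse import urljoin, urlsplit

  def follow(url, max_hops=10):
      """Follow redirects manually, detecting a cycle and collecting cookies."""
      seen, cookies = set(), []
      for _ in range(max_hops):
          if url in seen:
              print("Cyclic redirect back to", url)
              return cookies
          seen.add(url)

          parts = urlsplit(url)
          conn = http.client.HTTPConnection(parts.hostname, parts.port or 80)
          path = (parts.path or "/") + ("?" + parts.query if parts.query else "")
          headers = {"Cookie": "; ".join(cookies)} if cookies else {}
          conn.request("GET", path, headers=headers)
          response = conn.getresponse()
          response.read()

          cookie = response.getheader("Set-Cookie")
          if cookie:
              cookies.append(cookie.split(";", 1)[0])   # keep name=value only

          location = response.getheader("Location")
          conn.close()
          if response.status in (301, 302) and location:
              url = urljoin(url, location)
              continue
          print("Final status", response.status, "at", url)
          return cookies

  follow("http://myserver.com/login")   # placeholder URL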

URL Rewrite Rules

In certain cases, you may notice URLs in the Admin Console that differ slightly from the URLs in your environment. The reason for this is that the search appliance automatically rewrites or rejects a URL if the URL matches certain patterns. The search appliance rewrites URLs so that it does not fetch multiple versions of the same document, for example, URLs that differ only in session parameters.

Before rewriting a URL, the search appliance crawler attempts to match it against each of the patterns described for:

  • BroadVision Web Server
  • Sun Java System Web Server
  • Microsoft Commerce Server
  • Servers that run Java servlet containers
  • Lotus Domino Enterprise Server
  • ColdFusion Application Server
  • Index pages

If the URL matches one of the patterns, it is rewritten or rejected before it is fetched.

BroadVision Web Server

In URLs for BroadVision Web server, the search appliance removes the BV_SessionID and BV_EngineID parameters before fetching URLs.

For example, before the rewrite, this is the URL:
http://www.broadvision.com/OneToOne/SessionMgr
/home_page.jsp?BV_SessionID=NNNN0974886399.1076010447NNNN&BV_EngineID=ccceadcjdhdfelgcefe4ecefedghhdfjk.0

After the rewrite, this is the URL:
http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp
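The effect of this rewrite can be reproduced with a few lines of standard-library URL handling. The following sketch strips the two BroadVision session parameters from a query string; it shows the outcome of the rule, not the appliance's internal code.

  from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

  SESSION_PARAMS = {"BV_SessionID", "BV_EngineID"}

  def strip_session_params(url):
      """Remove BroadVision session parameters and keep everything else."""
      parts = urlsplit(url)
      kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in SESSION_PARAMS]
      return urlunsplit(parts._replace(query=urlencode(kept)))

  url = ("http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp"
         "?BV_SessionID=NNNN0974886399.1076010447NNNN"
         "&BV_EngineID=ccceadcjdhdfelgcefe4ecefedghhdfjk.0")
  print(strip_session_params(url))
  # Prints: http://www.broadvision.com/OneToOne/SessionMgr/home_page.jsp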

Sun Java System Web Server

In URLs for Sun Java System Web Server, the search appliance removes the GXHC_qx_session_id parameter before fetching URLs.

Microsoft Commerce Server

In URLs for Microsoft Commerce Server, the search appliance removes the shopperID parameter before fetching URLs.

For example, before the rewrite, this is the URL: http://www.shoprogers.com/homeen.asp?shopperID=PBA1XEW6H5458NRV2VGQ909

After the rewrite, this is the URL: http://www.shoprogers.com/homeen.asp

Servers that Run Java Servlet Containers

In URLs for servers that run Java servlet containers, the search appliance removes jsessionid, $jsessionid$, and $sessionid$ parameters before fetching URLs.
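These identifiers usually appear as a path parameter after a semicolon rather than as a regular query parameter. The following sketch removes that path parameter; the regular expression and the example URL are illustrations, not the appliance's own matching rules.

  import re

  # jsessionid commonly appears as a path parameter, for example:
  #   /cart/view.jsp;jsessionid=0AB12CD34.worker1?item=42
  PATH_SESSION_RE = re.compile(r";\$?j?sessionid\$?=[^?#]*")

  def strip_jsessionid(url):
      """Drop a ;jsessionid, ;$jsessionid$, or ;$sessionid$ path parameter."""
      return PATH_SESSION_RE.sub("", url)

  print(strip_jsessionid(
      "http://shop.example.com/cart/view.jsp;jsessionid=0AB12CD34.worker1?item=42"))
  # Prints: http://shop.example.com/cart/view.jsp?item=42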

Lotus Domino Enterprise Server

Lotus Domino Enterprise URL patterns are case-sensitive and are normally recognized by the presence of .nsf in the URL along with a well-known command such as "OpenDocument" or "ReadForm." If your Lotus Domino Enterprise URL does not match any of the cases below, it does not trigger the rewrite or reject rules.

The search appliance rejects some Lotus Domino URL patterns outright and rewrites others.

The following sections provide details about the search appliance rewrite rules for Lotus Domino Enterprise Server: OpenDocument URLs, URLs with # suffixes, and multiple versions of the same URL.

OpenDocument URLs

The search appliance rewrites OpenDocument URLs to substitute a 0 for the view name. This method accesses the document regardless of view and stops the search appliance crawler from fetching multiple views of the same document.
The syntax for this type of URL is http://Host/Database/View/DocumentID?OpenDocument. The search appliance rewrites this as http://Host/Database/0/DocumentID?OpenDocument.

For example, before the rewrite, this is the URL:
http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf
/8d7955daacc5bdbd852567a1005ae562/c8dac6f3fef2f475852567a6005fb38f

After the rewrite, this is the URL:

http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf/0/c8dac6f3fef2f475852567a6005fb38f?OpenDocument
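The view-to-0 substitution can be pictured as a small pattern rewrite. The regular expression below is an illustration of the documented outcome (Host/Database/0/DocumentID?OpenDocument), not the appliance's own matching logic.

  import re

  # Match ...database.nsf/<view>/<documentID>, optionally already ending in
  # ?OpenDocument, and rewrite the view component to 0.
  OPEN_DOC_RE = re.compile(r"^(?P<db>.+\.nsf)/(?P<view>[^/?#]+)/(?P<doc>[^/?#]+)"
                           r"(?:\?OpenDocument)?$")

  def rewrite_opendocument(url):
      match = OPEN_DOC_RE.match(url)
      if not match:
          return url
      return "{db}/0/{doc}?OpenDocument".format(**match.groupdict())

  print(rewrite_opendocument(
      "http://www12.lotus.com/idd/doc/domino_notes/5.0.1/readme.nsf"
      "/8d7955daacc5bdbd852567a1005ae562/c8dac6f3fef2f475852567a6005fb38f"))
  # Prints the rewritten form ending in .nsf/0/c8dac6f3fef2f475852567a6005fb38f?OpenDocument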

URLs with # Suffixes

The search appliance removes suffixes that begin with # from URLs that have no parameters.

Multiple Versions of the Same URL

The search appliance converts a URL that has multiple possible representations into one standard, or canonical, URL. The search appliance does this conversion so that it does not fetch multiple versions of the same URL that differ only in the order of their parameters. The canonical URL uses a standard form and ordering for the parameters that follow the question mark.
For example, before the rewrite, this is the URL:
http://www-12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Count=30&Expand=3

After the rewrite, this is the URL:
http://www12.lotus.com/ldd/doc/domino_notes/5.0.1/readme.nsf?OpenDatabase&Start=1&Count=1000&ExpandView

ColdFusion Application Server

In URLs for ColdFusion application server, the search appliance removes CFID and CFTOKEN parameters before fetching URLs.

Index Pages

In URLs for index pages, the search appliance removes index.htm or index.html from the end of URLs before fetching them. It also automatically removes them from Start URLs that you enter on the Crawl and Index > Crawl URLs page in the Admin Console.

For example, before the rewrite, this is the URL:
http://www.google.com/index.html

After the rewrite, this is the URL:
http://www.google.com/
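A minimal sketch of the same normalization, for illustration: strip a trailing index.htm or index.html, and nothing else, before fetching or comparing URLs.

  def strip_index_page(url):
      """Drop a trailing index.htm or index.html, keeping the directory URL."""
      for suffix in ("index.html", "index.htm"):
          if url.endswith("/" + suffix):
              return url[: -len(suffix)]
      return url

  print(strip_index_page("http://www.google.com/index.html"))
  # Prints: http://www.google.com/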
