Recrawl and Resume Crawl in Website Search Engine
Some webservers are sluggish or denie access to unknown crawlers like A1 Website Search Engine. To ensure everything gets crawled, you can use "resume" functionality in website search engine tool.
There can be many reasons to resume a website scan after it stopped.
One being webservers can
overload and respond with error codes for URLs, typically
- 503 : Service Temporarily Unavailable
- 500 : Internal Server Error
- -4 : CommError
An easy way to determine if any errors are left throughout entire website is to view
directory summary for the root directory / domain.
Then use the
resume functionality
in
A1 Website Search Engine
to recrawl these URLs. Continue to do so until no errors are left.
While the website scan is running you can at any time:
- Pause and stop the website scan.
- Resume and continue your website scan.
How to use website scan and crawl resume functionality:
- To pause a scan is the same as to stop a scan. If scan is stopped/paused or internet disconnects, you can resume.
- To resume a scan you need to check the Resume option. (In old versions this was a push button.) Then click Start scan button.
- You can save the project. This enables you to load the project at a later time. Then you can resume the project.
You can force recrawl of certain URLs by changing their state flags in
Website analysis before resuming website scan.
Depending on the program and version, the default behavior after a website scan finishes
for whatever reason, e.g. being paused, is that all
URLs that are excluded by
webmaster filters such as
robots.txt file and
output filters
are removed.
Above behavior is not always wanted since it means website crawler will spend time
rediscover
URLs it has tested before. To solve:
- Older versions:
- Uncheck Scan website | Crawler options | Apply "webmaster" and "output" filters after website scan stops
- Newer versions:
- Uncheck Scan website | Output filters | After website scan stops: Remove URLs excluded
- Uncheck Scan website | Webmaster filters | After website scan stops: Remove URLs with noindex/disallow
When filtered/excluded URLs are not removed in the website scan results, you can view their
Crawler and URL state flags:
The way the website crawl works internally, we have:
- Found URLs: These are URLs that have been resolved and tested.
- Analyzed content URLs: The content of these URLs (pages) have been analyzed.
- Analyzed references URLs: The links found in content of these URLs (pages) have been resolved.
This means that all pages where all links have not been resolved will need to be analyzed again when resuming scan.
To avoid this problem the website crawler can use
HEAD
requests to quickly resolve links to verified URLs.
While this causes some extra requests during crawl to server
(albeit all
light), it will minimize waste to almost zero when using resume functionality.
To configure this,
disable:
Scan website >
Crawler engine >
Default to GET for page requests
If you do not want to recrawl the entire website,
but
Resume is not applicable because all URLs have been analyzed,
you can do the following:
- At the left side where all URLs are listed, select those you want to have crawled again.
- In Crawler State Flags uncheck all the Analysis xxx flags except
Analysis required.
- Click the Save button in the Crawler State Flags area.
- In Scan website check Resume (full).
- Start the scan.
If you have trouble getting all URLs included in website crawls, it is important to
first follow above recommendation. The reason is that URLs with error response codes are
not crawled for links. By solving that problem, you will usually also end up with more URLs.
You can find more suggestions and tips in our
website crawl help article.