Hard and Soft 404 Not Found URLs in A1 Website Download
A1 Website Download has an option to crawl error pages for links, since the software has built-in protections against crawling endless chains of error pages.
Generally speaking, crawling URLs that return errors such as 404 - Not Found is a bad idea. To understand why, consider how a naive website crawler handles relative links found on broken pages:
- Crawler detects that http://www.example.com/directory/ returns 404 - Not Found.
- Crawler finds that the error page at http://www.example.com/directory/ links to directory/something.
- Crawler resolves directory/something against http://www.example.com/directory/ into http://www.example.com/directory/directory/something.
- Crawler detects that http://www.example.com/directory/directory/something also returns 404 - Not Found.
- Crawler finds that this error page again links to directory/something.
- Crawler resolves directory/something against http://www.example.com/directory/directory/ into http://www.example.com/directory/directory/directory/something.
- This is a classic spider trap: the website crawl will continue forever, as sketched in the code below.
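The runaway URL growth is easy to reproduce with Python's standard urljoin. This is only a sketch of the naive behaviour described above, reusing the example URLs; it is not A1 Website Download's own code:

from urllib.parse import urljoin

url = "http://www.example.com/directory/"
relative_link = "directory/something"

for step in range(3):
    # The error page for `url` is assumed to contain the path-relative link;
    # resolving it against the current URL yields an ever-deeper path.
    url = urljoin(url, relative_link)
    print(step + 1, url)

# 1 http://www.example.com/directory/directory/something
# 2 http://www.example.com/directory/directory/directory/something
# 3 http://www.example.com/directory/directory/directory/directory/something
# ...and every new URL also returns 404, so a naive crawler never stops.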
This is why most crawlers by default do not continue crawling pages that return 404 - Not Found.
Some websites include important links on the pages returned for errors such as 404 - Not Found. You can force A1 Website Download to scan error pages for links by checking the option:
scan website | crawler options | crawl error pages.
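The effect of such an option can be pictured with a generic sketch in Python; the function name and the crawl_error_pages flag are illustrative assumptions and do not reflect A1 Website Download's internal code:

import urllib.request
from urllib.error import HTTPError

def fetch_for_link_extraction(url, crawl_error_pages=False):
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()   # normal pages are always scanned for links
    except HTTPError as error:
        if error.code == 404 and crawl_error_pages:
            return error.read()      # error page body is scanned for links too
        return None                  # default: 404 pages are not scanned

html = fetch_for_link_extraction("http://www.example.com/directory/",
                                 crawl_error_pages=True)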
Please note that links relative to the current path are ignored when analysing error pages, to avoid getting caught in an endless crawling loop. If you need links on error pages to be picked up, use one of the following kinds of links instead (see the sketch after this list):
- /directory/something
- http://www.example.com/directory/something
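The difference between the link styles can be checked by resolving each against increasingly deep broken URLs; a small Python sketch using the example addresses above:

from urllib.parse import urljoin

bases = [
    "http://www.example.com/directory/",
    "http://www.example.com/directory/directory/directory/",
]

for base in bases:
    print(urljoin(base, "directory/something"))    # path-relative: target keeps growing
    print(urljoin(base, "/directory/something"))   # root-relative: same target every time
    print(urljoin(base, "http://www.example.com/directory/something"))  # absolute: same target every time

Root-relative and absolute links resolve to the same target no matter how deep the broken base URL has become, which is why they remain safe to follow from error pages.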