If A1 Website Scraper freezes during a scan because it needs more memory (or better configuration), first check whether the cause and the solution lie with the website itself.
List of things to check:
- Check if your website generates an infinite number of unique URLs.
If it does, the crawler never stops because it keeps finding new unique page URLs.
A good way to discover and solve this kind of problem is to:
  - Start a website scan.
  - Stop the website scan after e.g. half an hour.
  - Check whether the results look right, i.e. whether most of the URLs found seem correct.
Example #1
A website returns 200 instead of 404 for broken page URLs. Example of an infinite pattern:
The original 1/broken.html links to 1/1/broken.html, which links to 1/1/1/broken.html, etc.
Example #2
The website platform CMS generates a huge number of 100% duplicate URLs for each actual existing URL.
To read more about duplicate URLs, see this help page.
Remember that you can analyze and investigate internal website linking in case something looks wrong.
- Check if your project configuration and website content will cause the crawler to download files that are hundreds of megabytes in size.
Example #1: Your website contains many huge files (hundreds of megabytes each) that the crawler must download.
(While the memory is freed after the download completes, it can still cause problems on computers with low memory.)
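The infinite pattern from Example #1 (1/broken.html, 1/1/broken.html, 1/1/1/broken.html) can often be spotted automatically in the URLs a stopped scan has found. The Python sketch below is one possible check and is not something built into A1 Website Scraper. It assumes you have the found URLs available as a plain list of strings; the function names `longest_segment_run` and `suspicious_urls` and the `max_run` threshold are made up for illustration.

```python
from urllib.parse import urlsplit


def longest_segment_run(url):
    """Return the length of the longest run of identical consecutive path segments."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    longest = 0
    run = 0
    previous = None
    for segment in segments:
        # Extend the current run if the segment repeats, otherwise start a new run
        run = run + 1 if segment == previous else 1
        longest = max(longest, run)
        previous = segment
    return longest


def suspicious_urls(urls, max_run=2):
    """Return the URLs whose path repeats one segment more than max_run times in a row.

    max_run is a hypothetical threshold; tune it to what is normal on your website.
    """
    return [url for url in urls if longest_segment_run(url) > max_run]
```

Called with `http://example.com/1/broken.html` and `http://example.com/1/1/1/broken.html`, `suspicious_urls` returns only the second URL. A hit is a hint to look at the page in a browser, not proof of a problem, since some websites legitimately repeat directory names.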
Note: If you need extended help with analyzing your website for problems, we also offer sitemap and SEO services.
Webserver and database speed is important if you use database-generated content.
Forums, portals and CMS-based websites often trigger many SQL queries per HTTP request.
Based on our experience, here are some examples of webserver performance bottlenecks:
Database
Check if database performance is the primary bottleneck when the database is hit by many simultaneous connections and queries. In addition, make sure the database is not capped (e.g. through its license) at a maximum number of simultaneous users/requests.
Resources and Files
If your pages read from or write to files/resources, this may stall other connections that require access to the same resources.
Resources and Sessions
On IIS webservers, reads and writes of session information can get queued by a locking mechanism. This can cause problems when handling a high load of HTTP requests from the same crawler session across many pages simultaneously.
Scan When Traffic Is Low
Check your webserver logs and scan your website at times with low bandwidth usage.
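If your webserver writes access logs, a small script can tell you which hours carry the least traffic. The Python sketch below is only an illustration under stated assumptions: it expects log lines in the Common Log Format used by default on many Apache setups (IIS and other servers often use different formats, so the pattern would need adjusting), and the names `bytes_per_hour` and `quietest_hours` are invented for this example.

```python
import re

# Assumed format (Common Log Format), for example:
# 1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /page.html HTTP/1.0" 200 2326
LOG_ENTRY = re.compile(
    r'\[\d{2}/[A-Za-z]{3}/\d{4}:(?P<hour>\d{2}):\d{2}:\d{2} [+-]\d{4}\] '
    r'"[^"]*" \d{3} (?P<size>\d+|-)'
)


def bytes_per_hour(log_lines):
    """Sum the response sizes per hour of day (0-23) over all parsable lines."""
    totals = {hour: 0 for hour in range(24)}
    for line in log_lines:
        match = LOG_ENTRY.search(line)
        if match is None:
            continue  # skip lines that do not look like the assumed format
        hour = int(match.group("hour"))
        size = match.group("size")
        if hour in totals and size != "-":  # "-" means no body was sent
            totals[hour] += int(size)
    return totals


def quietest_hours(log_lines, top=3):
    """Return the `top` hours of day with the least bytes served, quietest first."""
    totals = bytes_per_hour(log_lines)
    return sorted(totals, key=lambda hour: (totals[hour], hour))[:top]
```

To try it, open your access log as a text file and pass the file object to `quietest_hours`. The hours are reported in whatever time zone the webserver logs in, and one day of logs may not be representative, so it is worth feeding in a week or more.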
The A1 Website Scraper setup file installs multiple executables into the program installation directory, each optimized for different systems.
While the setup installation picks the one that best matches your computer system, you may want to use one of the others in case memory usage is too high.
The list of installed executables:
- Scraper_64b_UC.exe / Scraper_64b_W2K.exe:
  - Full Unicode support in a 64-bit executable.
  - Minimum requirements: Windows XP / 64-bit.
  - Largest memory usage, but can also access much more memory if available.
- Scraper_32b_UC.exe / Scraper_32b_W2K.exe:
  - Full Unicode support in a 32-bit executable.
  - Minimum requirements: depending on version, either Windows XP / 32-bit or Windows 2000 / 32-bit.
  - Somewhat lower memory usage.
- Scraper_32b_CP.exe / Scraper_32b_W9xNT4.exe:
  - Some Unicode support in a 32-bit executable.
  - Minimum requirements: Windows 98 / 32-bit.
  - Lowest memory usage.
Changing settings can make a huge difference if you are experiencing problems such as:
- Website scans run very slowly.
- The webserver drops connections, preventing the crawler from getting all pages.
- You have large websites with 10,000 or 100,000+ pages you wish to scan.
Lessen overall and/or peak resource usage on crawler computer:
- Disable Scan website | Data collection | Create log file of website scans
- Disable Scan website | Data collection | Verify external URLs exist
- Disable Scan website | Data collection | Store found external URLs
- Disable Scan website | Data collection | Store redirects, links from and to all pages etc.
- Disable Scan website | Crawler options | Crawl error pages
- Disable Scan website | Crawler options | Allow cookies
- Disable Scan website | Webmaster filters | Obey "robots.txt" file "crawl delay" directive
- Consider trying Scan website | Crawler engine | HTTP using WinInet engine and settings (Internet Explorer)
- In top menu disable Tools | After website scans: Calculate summary data (extended)
- In top menu disable Tools | After website scans: Open and show data
Lessen overall resource usage on crawler and webserver computer:
- In Scan website | Analysis filters and Scan website | Output filters:
  - Configure which directories and pages the website crawler can ignore:
    - Many forums generate pages with similar or duplicate content.
    - Excluding URLs can save a lot of HTTP requests and bandwidth.
  - See help for analysis filters and output filters.
  - The following happens when a URL is excluded:
    - Only analysis filters: Defaults to HEAD instead of GET for the HTTP requests.
    - Only output filters: Defaults to removing the excluded URL after the website crawl finishes.
    - Both analysis filters and output filters: No HTTP request at all to the excluded URL.
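The three default behaviours for excluded URLs can be written down as a small decision table. The Python sketch below only restates the list above; it is not A1 Website Scraper code, and the names `Action` and `plan_for_url` are invented. Two details are assumptions and are marked as such in the comments, because the list does not spell them out.

```python
from enum import Enum


class Action(Enum):
    GET = "full GET request"
    HEAD = "HEAD request only"
    SKIP = "no HTTP request"


def plan_for_url(excluded_by_analysis, excluded_by_output):
    """Return (request action, removed_after_scan) for one URL under default settings."""
    if excluded_by_analysis and excluded_by_output:
        # No HTTP request at all. Assumption: the URL is also dropped from the results.
        return Action.SKIP, True
    if excluded_by_analysis:
        # The URL is checked with HEAD but its content is not downloaded and analyzed.
        return Action.HEAD, False
    if excluded_by_output:
        # Assumption: the URL is still fetched normally during the scan.
        # It is removed from the results once the crawl finishes.
        return Action.GET, True
    return Action.GET, False
```

Read this way, excluding a URL in both filter sets is the only combination that saves the HTTP request entirely, which is why it matters most for webserver load.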
Lessen peak resource usage on crawler and webserver computer:
- In Scan website | Crawler engine:
  - Max simultaneous connections (data transfer):
    - Increase if the webserver backend and your own bandwidth can handle more requests.
    - Decrease if the webserver backend (e.g. database queries) is a bottleneck.
  - Max worker threads (transfer, analysis etc.):
    - Increase if your own computer's CPU can handle more work.
    - Decrease if you are doing other work on your computer.
  - Adjust timeout options:
    - Milliseconds to wait before read times out.
    - Milliseconds to wait before connect times out.
    - Connection attempts before giving up.
    - How to change settings:
      - Increase if you want to make sure you grab all links and pages.
      - Decrease if you want the website scan to be as fast as possible.
  - Enable/disable persistent connections, which affects how client and server communicate.
  - Enable/disable GZip compression of all data transferred between crawler and server.
  - In option Default path type and handler, experiment between using Indy and WinInet for HTTP.
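To see how these settings interact, it can help to look at a generic crawler loop. The Python sketch below is a simplified model and not how A1 Website Scraper is implemented: the function `crawl`, its parameters, and the caller-supplied `fetch` function are all hypothetical. It shows worker threads sharing a smaller pool of connection slots, plus a limit on connection attempts per URL.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def crawl(urls, fetch, max_connections=4, max_workers=8, attempts=3):
    """Fetch every URL with bounded concurrency and a limited number of attempts.

    fetch(url) is supplied by the caller. It returns the page body, or raises
    OSError (which covers timeouts and dropped connections) on failure.
    """
    urls = list(urls)
    connection_slots = threading.Semaphore(max_connections)

    def fetch_with_retries(url):
        for _ in range(attempts):
            try:
                # At most max_connections transfers run at the same time,
                # even when more worker threads are available.
                with connection_slots:
                    return fetch(url)
            except OSError:
                continue  # try again until the attempts are used up
        return None  # gave up on this URL

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch_with_retries, urls)))
```

In this model, raising `max_connections` puts more simultaneous load on the webserver, raising `max_workers` puts more load on your own computer, and raising `attempts` trades scan speed for completeness. A `fetch` built on Python's `urllib.request.urlopen` would take a single `timeout` value covering both connecting and reading, whereas the crawler exposes the two timeouts separately.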
When using pause / resume website scan functionality:
- In Scan website | Output filters:
  - See help about output filters.
  - Uncheck Scan website | Output filters | After website scan stops: Remove URLs excluded:
    - Makes the website crawler keep all URLs. This ensures no URL is tested more than once.
- In Scan website | Webmaster filters:
  - See help about webmaster filters.
  - Uncheck Scan website | Webmaster filters | After website scan stops: Remove URLs with noindex/disallow:
    - Makes the website crawler keep all URLs. This ensures no URL is tested more than once.
- See below concerning GET versus HEAD requests and how this can affect resume efficiency.
Lessen resource usage on webserver computer:
- Use a content delivery network solution (CDN for short) to serve content without putting any load on your own server.
Things that can affect resource usage on both server and desktop:
- In Scan website | Crawler engine:
  - Enable/disable using GET requests (instead of HEAD followed by GET). If you plan on using the resume functionality, you may want to enable HEAD requests, as they allow the website crawler to quickly resolve the URLs it finds. This in turn speeds up the count of pages fully analyzed. Due to the internals of the crawler engine, using HEAD can also reduce memory usage.
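The reason a HEAD request can save resources is that it returns only the response headers, so a client can decide whether the full download is worthwhile before transferring the body. The Python sketch below illustrates that general idea using the standard library. It does not describe the internals of the A1 Website Scraper crawler engine, and the function names, the accepted content types, and the `max_bytes` limit are assumptions chosen for the example.

```python
import urllib.request


def head_headers(url, timeout=10):
    """Send a HEAD request and return the response headers with lowercase names."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def worth_downloading(headers, max_bytes=5_000_000):
    """Decide from HEAD headers whether a follow-up GET makes sense.

    Assumption for this sketch: only HTML pages up to max_bytes are wanted.
    """
    content_type = headers.get("content-type", "").split(";")[0].strip().lower()
    if content_type not in ("text/html", "application/xhtml+xml"):
        return False
    length = headers.get("content-length", "")
    # Missing or non-numeric Content-Length: size unknown, so allow the download
    return not length.isdigit() or int(length) <= max_bytes
```

A typical use would be `worth_downloading(head_headers(url))` before issuing the GET. The trade-off is one extra round trip for every page that does get downloaded, so HEAD pays off mainly on websites with many large or non-HTML files.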
Bonus tip: If you have a large website it can often be a good idea to make a few test scans. That way you can configure URL exclude filters before starting a major website crawl.
Increasing the memory address space enables the 32-bit Windows versions of products such as our website scraper to address more than 2 gigabytes of memory.
- Memory Support and Windows Operating Systems: http://www.microsoft.com/whdc/system/platform/server/PAE/PAEmem.mspx
- The oft-misunderstood /3GB switch: http://blogs.msdn.com/oldnewthing/archive/2004/08/05/208908.aspx
- Myth: Without /3GB a single program can not allocate more than 2GB of virtual memory: http://blogs.msdn.com/oldnewthing/archive/2004/08/10/211890.aspx
- Be Aware: 4GB of VAS under WOW, does it really worth it?: http://blogs.msdn.com/slavao/archive/2006/03/12/550096.aspx
- A description of the 4 GB RAM Tuning feature and the Physical Address Extension switch: http://support.microsoft.com/kb/291988/
- The Virtual-Memory Manager in Windows NT: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dngenlib/html/msdn_ntvmm.asp (Note: Only for the very technical)