Website Scraper Analysis Filters in Website Scan
Website scan analysis filters (also known as crawler filters) help you define which pages you want analyzed for content and links during website scans in our website scraper.
Note: We have a video tutorial. Even though the video demonstration uses TechSEO360, some of it is also applicable for users of A1 Website Scraper.
Analysis filters determine which pages have their content analyzed for links and other data.
You can use analysis filters instead of or in conjunction with webmaster filters (robots.txt, noindex, nofollow etc.) and output filters.
- Exclude URLs in both analysis filters and output filters to minimize crawl time, HTTP requests and memory usage.
- Note: If a URL is only linked from pages that are not analyzed due to filters, it will not be found during the website scan.
- Note: For changes in analysis filters to take effect, you will need to crawl your website again.
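The note about URLs that are only linked from non-analyzed pages can be illustrated with a small simulation. This is a minimal sketch and not the scraper's actual crawler: the link graph, the crawl function and the is_analyzed predicate are all hypothetical, and it assumes that a filtered URL is still recorded when another page links to it, while its own links are never read.

```python
from collections import deque


def crawl(links, start, is_analyzed):
    """Breadth-first walk over a link graph (hypothetical model of a scan).

    links maps each page to the pages it links to. is_analyzed(page)
    returns False for pages excluded by an analysis filter.
    """
    found = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if not is_analyzed(page):
            # The URL itself is known, but its content is not read,
            # so links that appear only on this page stay undiscovered.
            continue
        for target in links.get(page, []):
            if target not in found:
                found.add(target)
                queue.append(target)
    return found


# Hypothetical site: the post is linked only from the blogs index page.
links = {
    "/": ["/blogs/", "/shop/"],
    "/blogs/": ["/blogs/post-1/"],
    "/shop/": [],
}

# Exclude everything under /blogs/ from analysis.
found = crawl(links, "/", lambda page: not page.startswith("/blogs/"))
```

In this model the set found contains "/", "/blogs/" and "/shop/", but not "/blogs/post-1/", because the only page linking to it was never analyzed.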
URLs with file extensions not found in the list will not be analyzed during the website scan.
If you remove all file extensions in the list, the file extension list filtering accepts all files.
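The file extension rule can be sketched as follows. This is an illustration under stated assumptions, not the program's implementation: the function name extension_allows_analysis is hypothetical, extensions are compared case-insensitively, and URLs without any file extension (such as directory URLs) are assumed to pass, since the rule above only mentions URLs that have an extension.

```python
import posixpath
from urllib.parse import urlsplit


def extension_allows_analysis(url, allowed_extensions):
    """Return True if the URL passes the file extension list filter."""
    allowed = {ext.lower().lstrip(".") for ext in allowed_extensions}
    if not allowed:
        # An empty list accepts all files.
        return True
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower().lstrip(".")
    if not ext:
        # Assumption: URLs without an extension are not filtered here.
        return True
    return ext in allowed
```

For example, with the list ["html", "php"], a URL ending in page.html passes and a URL ending in manual.pdf does not, while an empty list lets both through.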
Excluding URLs that fully or partially match a text string, path or regular expression pattern from being analyzed is often a good way to limit the crawl.
- Strings:
  - blogs matches relative paths that contain "blogs".
  - @ matches relative paths that contain "@".
  - ? matches relative paths that contain "?".
- Paths:
  - :s matches relative paths that start with "s" such as http://www.microsystools.com/services/ and http://www.microsystools.com/shop/.
  - :blogs/ matches relative paths that start with "blogs/" such as http://www.microsystools.com/blogs/.
- Subpaths:
  - :blogs/* matches relative paths excluding itself that start with "blogs/" such as http://www.microsystools.com/blogs/sitemap-generator/.
- Regular expression:
  - ::blog(s?)/ matches relative paths with regex such as http://www.microsystools.com/blogs/ and http://www.microsystools.com/blog/.
  - ::blogs/(2007|2008)/ matches relative paths with regex such as http://www.microsystools.com/blogs/2007/ and http://www.microsystools.com/blogs/2008/.
  - ::blogs/.*?keyword matches relative paths with regex such as http://www.microsystools.com/blogs/category/products/a1-keyword-research/.
  - ::^$ matches the empty relative path (i.e. the root) with regex such as http://www.microsystools.com/.
From the above examples it can be seen that:
- : alone = special match.
- : at start = paths match.
- : at start and * at end = makes paths into subpaths match.
- :: at start = regular expression match.
- None of the above = normal string text match.
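The prefix rules summarized above can be expressed as a short matching routine. This is only a sketch of how the notation reads, not the scraper's own code: the names relative_path and filter_matches are hypothetical, the relative path is assumed to be the URL path without its leading slash plus any query string, regular expressions are assumed to be searched anywhere in the relative path using Python's re syntax (which may differ from the program's regex engine), and the ":" alone special match is left unmodeled because its behavior is not described here.

```python
import re
from urllib.parse import urlsplit


def relative_path(url):
    """Path relative to the site root, including the query string if any."""
    parts = urlsplit(url)
    rel = parts.path.lstrip("/")
    if parts.query:
        rel += "?" + parts.query
    return rel


def filter_matches(item, rel):
    """Return True if a filter list item matches the relative path."""
    if item.startswith("::"):
        # Regular expression match (assumed unanchored search).
        return re.search(item[2:], rel) is not None
    if item == ":":
        # Special match: behavior not described, so not modeled here.
        return False
    if item.startswith(":") and item.endswith("*"):
        # Subpaths match: starts with the prefix, excluding the prefix itself.
        prefix = item[1:-1]
        return rel.startswith(prefix) and len(rel) > len(prefix)
    if item.startswith(":"):
        # Paths match: relative path starts with the given text.
        return rel.startswith(item[1:])
    # Normal string match: text appears anywhere in the relative path.
    return item in rel
```

Under these assumptions, filter_matches(":blogs/*", relative_path("http://www.microsystools.com/blogs/")) is False because the subpaths rule excludes the path itself, while the same item matches http://www.microsystools.com/blogs/sitemap-generator/.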
To add a list filter item in the dropdown: Type it and use the [+] button.
To remove a list filter item in the dropdown: Select it and use the [-] button.
You can view more information about the user interface controls used.
If you do not need any of the advanced options for analysis filters, you can use the Delete and filter button after you have finished crawling a site.
This makes it easy to optimize settings and limit the number of pages analyzed the next time you need to crawl the website.