Website Robots.txt, Noindex, Nofollow and Canonical
The website crawler in TechSEO360 has many tools and options to ensure it can scan complex websites. These include optional, complete support for obeying the robots.txt file, noindex and nofollow in meta tags, and nofollow in link tags.
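To make the meta tag handling concrete, here is a minimal sketch of what honoring these signals involves, built on Python's standard library html.parser. It illustrates the general technique only; it is not TechSEO360's actual implementation:

```python
# Minimal sketch of how a crawler can honor noindex/nofollow signals in HTML.
# Illustrative only; this is not TechSEO360's actual implementation.
from html.parser import HTMLParser

class RobotsSignalParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False          # page must not be indexed
        self.nofollow = False         # no link on the page may be followed
        self.followable_links = []    # links not blocked by rel="nofollow"

    def handle_starttag(self, tag, attrs):
        attrs = {k: (v or "") for k, v in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            directives = {d.strip() for d in attrs.get("content", "").lower().split(",")}
            self.noindex = self.noindex or "noindex" in directives
            self.nofollow = self.nofollow or "nofollow" in directives
        elif tag == "a" and "href" in attrs:
            # rel="nofollow" on a single link blocks only that link
            if "nofollow" not in attrs.get("rel", "").lower().split():
                self.followable_links.append(attrs["href"])

parser = RobotsSignalParser()
parser.feed('<meta name="robots" content="noindex"><a href="/a" rel="nofollow">x</a>'
            '<a href="/b">y</a>')
print(parser.noindex, parser.nofollow, parser.followable_links)  # True False ['/b']
```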
Tip: Downloading robots.txt will often make web servers and analytics software identify you as a website crawler robot.
You can find most of these options in Scan website | Webmaster filters.
In connection with these, you can also control how they are applied:
- Disable Scan website | Webmaster filters | After website scan stops: Remove URLs with noindex/disallow.
- Enable Create sitemap | Document options | Remove URLs excluded by "webmaster" and "output" filters.
If you use the pause and resume crawler functionality, you can avoid having the same URLs crawled repeatedly by keeping all URLs between scans.
You can read more in our online help for TechSEO360 to learn about analysis and output filters.
The match behavior in the website crawler used by TechSEO360 is similar to that of most search engines.
Support for wildcard symbols in the robots.txt file (modeled in the code sketch after the tip below):
- Standard: Match from the beginning of the path up to the length of the filter. gre will match: greyfox, greenfox and green/fox.
- Wildcard *: Match any characters until another match becomes possible. gr*fox will match: greyfox, grayfox, growl-fox and green/fox.
Tip: Wildcard filters in robots.txt are often configured incorrectly and are a common source of crawling problems.
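A compact way to model both match modes is to translate each filter into a regular expression anchored at the start of the URL path, with * becoming "match any run of characters". This is a sketch of equivalent behavior under that assumption, not TechSEO360's code:

```python
# Sketch of the two robots.txt match modes described above (illustrative).
import re

def robots_match(pattern: str, path: str) -> bool:
    # Standard filters match by prefix; '*' matches any run of characters.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None  # anchored at the start only

assert robots_match("gre", "greenfox")        # prefix match
assert robots_match("gre", "green/fox")
assert robots_match("gr*fox", "growl-fox")    # wildcard match
assert not robots_match("gr*fox", "foxgrey")  # must match from the beginning
```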
The crawler in our technical SEO tool will obey the following user agent IDs in the robots.txt file:
- Exact match against user agent selected in: General options and tools | Internet crawler | User agent ID.
- User-agent: TechSEO360 if the product name is part of the above-mentioned HTTP user agent string.
- User-agent: miggibot if the crawler engine name is part of the above-mentioned HTTP user agent string.
- User-agent: *.
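Assuming the groups are tried in the order listed above, so that the most specific match wins, the selection logic can be sketched as follows; the group names and example user agent string are hypothetical:

```python
# Sketch of the user-agent group precedence described above (illustrative,
# not TechSEO360 internals). 'groups' maps each "User-agent:" value found in
# robots.txt to that group's list of Disallow rules.
def pick_group(groups: dict[str, list[str]], ua_string: str) -> list[str]:
    candidates = [ua_string]                  # 1. exact user agent ID
    if "TechSEO360" in ua_string:
        candidates.append("TechSEO360")       # 2. product name token
    if "miggibot" in ua_string:
        candidates.append("miggibot")         # 3. crawler engine token
    candidates.append("*")                    # 4. catch-all group
    for name in candidates:
        if name in groups:
            return groups[name]
    return []                                 # no group applies: nothing disallowed

# Hypothetical user agent string for illustration:
rules = pick_group({"miggibot": ["/private/"], "*": []},
                   "Mozilla/5.0 (compatible; TechSEO360/1.0; miggibot)")
print(rules)  # ['/private/']
```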
All found disallow instructions in robots.txt are internally converted into both analysis and output filters in TechSEO360.
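Under that description, one Disallow rule drives two checks with the same matching logic. A sketch reusing robots_match() from the wildcard example above (function names are illustrative, not TechSEO360 internals):

```python
# Each Disallow rule feeds both an analysis filter (skip crawling matching
# URLs) and an output filter (keep them out of generated results).
disallows = ["/private/", "/tmp*"]      # e.g. parsed from robots.txt

def should_crawl(path: str) -> bool:    # analysis filter side
    return not any(robots_match(p, path) for p in disallows)

def should_output(path: str) -> bool:   # output filter side
    return not any(robots_match(p, path) for p in disallows)

print(should_crawl("/private/page"), should_output("/public/page"))  # False True
```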
See the state flags of all URLs as detected by the crawler; this uses the options set in Webmaster filters, Analysis filters and Output filters.
Alternatively, use the option Scan website | Crawler options | Use special response codes to have states reflected as response codes.
For details of a specific URL, select it and view all information in Extended data | Details, Extended data | Linked by and similar.
For an overview of all URLs, you can hide and show the data columns you want, including URL content state flags.
You can also apply a custom filter after the scan to show only URLs with a certain combination of URL state flags.