Crawl and Index from Multiple Start Paths and Aliases
Our website search engine tool supports mixing multiple domains and additional start scan paths. While usually not recommended because of duplicate content issues, some websites mix domains, links, and www and non-www usage in URLs. In such cases, after configuring the site root to be scanned, which is usually the primary domain, make a list of root aliases.
Note: You need to use the [+] button to add a root path alias into the dropdown list.
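As a rough illustration of the concept, here is a minimal Python sketch of how a crawler can treat root path aliases as equivalent to the primary root. The ROOT and ROOT_ALIASES values and the normalize helper are assumptions for illustration, not the tool's actual implementation.

    # A minimal sketch (assumed names) of treating root path aliases
    # as equivalent to the configured primary root.
    ROOT = "http://example.com/"
    ROOT_ALIASES = [
        "http://www.example.com/",   # www variant
        "https://example.com/",      # protocol variant
        "https://www.example.com/",
    ]

    def normalize(url: str) -> str:
        """Rewrite a URL under any alias so it uses the primary root."""
        for alias in ROOT_ALIASES:
            if url.startswith(alias):
                return ROOT + url[len(alias):]
        return url

    print(normalize("https://www.example.com/about.html"))
    # -> http://example.com/about.html

With all alias variants mapped to one canonical root, the same page reached through www and non-www URLs is only counted once.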
In Scan website | Crawler options you can configure the website search engine tool to automatically add common root path aliases.
If you use http://example.com/blogs/ as the root, all paths outside it (excluding root path aliases), such as http://example.com/forum/, will be excluded from both output and analysis.
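This scope rule can be pictured with a short sketch; the in_scope helper and the simple prefix matching are illustrative assumptions, not the tool's actual logic.

    # Only URLs under the configured root (or one of its aliases) are
    # crawled; everything else is ignored for both analysis and output.
    ROOT = "http://example.com/blogs/"
    ROOT_ALIASES = ["http://www.example.com/blogs/"]

    def in_scope(url: str) -> bool:
        """True if the URL sits under the root or one of its aliases."""
        return any(url.startswith(prefix) for prefix in [ROOT, *ROOT_ALIASES])

    print(in_scope("http://example.com/blogs/post-1/"))   # True: under root
    print(in_scope("http://example.com/forum/topic-1/"))  # False: outside root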
A better alternative might be to keep the website root as http://example.com/ and then use analysis filters, output filters, and additional start search paths (see below) to control your website crawl and the resulting output.
Note: In this case you may need to uncheck the option Scan website | Crawler options | Fix internal URLs if website root URL redirects to a different address.
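To give a feel for how such filters narrow a crawl that starts at the full site root, here is a hedged sketch. The substring-based matching and the ANALYSIS_FILTERS / OUTPUT_FILTERS names are assumptions for illustration; the tool's actual filter syntax may differ.

    # Illustrative filters: analysis filters decide which pages get crawled
    # and analyzed, output filters decide which pages appear in the results.
    ANALYSIS_FILTERS = ["/blogs/"]
    OUTPUT_FILTERS = ["/blogs/"]

    def passes(url: str, filters: list[str]) -> bool:
        return any(pattern in url for pattern in filters)

    print(passes("http://example.com/blogs/post-1/", ANALYSIS_FILTERS))  # True
    print(passes("http://example.com/forum/", OUTPUT_FILTERS))           # False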
Websites with site areas that have no incoming links from the rest of the website can sometimes cause a problem. Remember that cross-linking hidden pages will not help if none of them are linked from anywhere else in the website. This problem is easy to overcome in our website search engine software: it is possible to start a website scan from multiple paths in addition to the website directory root.
In newer versions, there are also buttons for quickly adding additional start search addresses by:
- importing a list of URLs from search engines.
- importing a list of URLs from a file.
- importing a list of URLs from a website page URL.
- adding common URLs such as typical XML sitemap paths.
Note: You need to use the [+] button to add additional start scan paths into the dropdown list.
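To show why extra start scan paths reach such orphaned areas, here is a minimal sketch of a crawl frontier seeded with multiple start paths. The extract_links stub and all URLs are illustrative assumptions; a real crawler would fetch and parse each page.

    from collections import deque

    START_PATHS = [
        "http://example.com/",             # site root
        "http://example.com/hidden/",      # area with no incoming links
        "http://example.com/sitemap.xml",  # common XML sitemap path
    ]

    def extract_links(url):
        return []  # stub so the sketch runs without network access

    def crawl(start_paths):
        queue = deque(start_paths)  # frontier seeded with every start path
        seen = set(start_paths)
        while queue:
            url = queue.popleft()
            print("visiting", url)
            for link in extract_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)

    crawl(START_PATHS)

Because every start path goes straight into the frontier, pages under http://example.com/hidden/ are visited even though no other page links to them.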
Note: It is often better to make sure your website is cross-linked, so crawlers can find all pages on their own.