URL Rewriting and Similar
Your website can use virtual directories or URL rewriting without problems.
Many people use Apache mod_rewrite to create virtual directories and URLs.
Similar solutions exist for nearly all web servers.
From a crawler perspective, websites can safely use virtual directories and URLs.
Browsers, search engine bots etc. all view your website from the outside.
They have no knowledge of how your URL structure is implemented.
They cannot tell if pages or directories are physical or virtual.
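As a rough illustration of why virtual URLs are invisible from the outside, the following Python sketch mimics what a rewrite rule does on the server. The rule patterns, the /products/ and /blog/ paths and the script names are hypothetical; they are not taken from mod_rewrite or from any real website.

```python
import re

# Hypothetical rewrite rules: (pattern for the public URL path, internal target).
REWRITE_RULES = [
    (re.compile(r"^/products/(\d+)/?$"), r"/product.php?id=\1"),
    (re.compile(r"^/blog/([a-z0-9-]+)/?$"), r"/blog.php?slug=\1"),
]


def rewrite(path):
    """Return the internal path for a public URL path, or the path unchanged."""
    for pattern, target in REWRITE_RULES:
        if pattern.match(path):
            return pattern.sub(target, path)
    return path


# A crawler only ever requests the public path; the mapping stays on the server.
print(rewrite("/products/42"))
print(rewrite("/about.html"))
```

The crawler requests /products/42 and receives HTML. Whether that path is a physical directory or is mapped to a script never shows up in the response.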
Server Side Language and HTML Web Pages
In a modern website, there is often little or no correlation between the URL "file names" and the underlying data, including how it is generated, stored and retrieved.
It does not matter if a website uses ColdFusion, ASP.NET, JSP, PHP or similar as its server-side programming language.
Website crawlers only see the client-side code (HTML/CSS/JavaScript) generated by the code and databases on the server.
Note: In settings, the crawler in our site analysis tool can be set to accept/ignore URLs with certain file extensions and MIME content types.
If you have trouble, read about finding all pages and links.
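To make the idea of filtering by file extension and MIME content type concrete, here is a minimal Python sketch. It does not show how the crawler in our tools implements its filters. The function names and the example extension and MIME lists are assumptions chosen for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical filter lists; real crawler settings will differ.
IGNORED_EXTENSIONS = {".zip", ".exe", ".pdf"}
ACCEPTED_MIME_TYPES = {"text/html", "application/xhtml+xml"}


def has_ignored_extension(url):
    """Check the extension of the URL path, ignoring query string and fragment."""
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lower()
    return extension in IGNORED_EXTENSIONS


def has_accepted_mime_type(content_type_header):
    """Compare only the media type, dropping parameters such as charset."""
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in ACCEPTED_MIME_TYPES


print(has_ignored_extension("http://example.com/files/setup.exe"))
print(has_accepted_mime_type("text/html; charset=UTF-8"))
```

Note that the extension test works on the URL alone, while the MIME test needs the Content-Type header from the server response. Since URLs often carry no meaningful extension, the two checks complement each other.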
Dynamically Created Content on Server
Sites that dynamically generate page content using server-side scripts and databases are crawled without problems by site crawlers and robots.
Note: Some search engine robots may slow down when crawling URLs containing ?.
However, that is mainly because search engines are wary of spending crawl resources on lots of URLs with auto-generated content.
To mitigate this, you can use mod_rewrite or similar on your website.
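If you want to see how many of your URLs carry a query string, a small script can sort a URL list for you. The following Python sketch is only an illustration; the function names and example URLs are made up.

```python
from urllib.parse import parse_qsl, urlsplit


def query_parameters(url):
    """Return the query string of a URL as a list of (name, value) pairs."""
    return parse_qsl(urlsplit(url).query, keep_blank_values=True)


def split_by_query(urls):
    """Separate URLs that contain a non-empty query string from those that do not."""
    with_query = [url for url in urls if urlsplit(url).query]
    without_query = [url for url in urls if not urlsplit(url).query]
    return with_query, without_query


urls = [
    "http://example.com/list.php?category=shoes&page=2",
    "http://example.com/list/shoes/2",
]
with_query, without_query = split_by_query(urls)
print(with_query)
print(query_parameters(with_query[0]))
```

URLs that end up in the first list are candidates for rewriting into path-style URLs.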
Note: Our website download and the MiggiBot crawler engine do not care about how URLs look.
Mobile Websites
Many websites nowadays use responsive and adaptive layouts that adjust themselves in the browser using client-side technologies, e.g. CSS and JavaScript.
However, some websites have special website URLs for:
- Feature phones that only support WAP and similar old technologies.
- Smartphones with browsers that are very similar to desktop browsers and render content the same way.
- Desktop computers, laptops and tablets where the screen area and viewport are larger.
Generally, such mobile optimized websites decide to output content optimized for mobile devices by either:
- Assuming they should always output content optimized for a given set of mobile devices, e.g. smartphones.
- Performing server-side checks on the user agent passed to them by the crawler or browser. Then, if a mobile device is identified, they either redirect to a new URL or simply output content optimized for mobile devices.
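The second approach, a server-side user agent check, can be sketched in Python as follows. This is a deliberately simplistic illustration and not a recommended detection method. The marker list, the function names and the m.example.com host are assumptions.

```python
# Hypothetical substrings that hint at a mobile browser; real detection is more involved.
MOBILE_MARKERS = ("mobile", "android", "iphone")


def is_mobile_user_agent(user_agent):
    """Return True if the user agent string contains one of the mobile markers."""
    lowered = user_agent.lower()
    return any(marker in lowered for marker in MOBILE_MARKERS)


def choose_response(user_agent, path):
    """Return (status code, redirect location or None) for a request."""
    if is_mobile_user_agent(user_agent):
        # Redirect mobile visitors to a separate mobile URL (hypothetical host).
        return 302, "http://m.example.com" + path
    return 200, None


print(choose_response("Mozilla/4.0 (compatible; MSIE 7.0; Win32)", "/news"))
```

Because the decision is based only on the user agent header, a crawler that sends a mobile user agent receives the same redirect or content as a real mobile device.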
If you want the A1 Website Download crawler to see the mobile content and URLs your website outputs to mobile devices, simply change the setting General options and tools | Internet crawler | User agent ID to one used by popular mobile devices, e.g. this:
Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19
You can do the same with most desktop browsers by installing a user agent switcher plugin.
That allows you to inspect the code returned by your website to mobile browsers.
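If you prefer a script over a browser plugin, the same inspection can be done with a few lines of Python using the standard urllib.request module. This is a minimal sketch; the function names are made up, and http://example.com/ stands in for your own website.

```python
import urllib.request

# The mobile user agent string mentioned above.
MOBILE_USER_AGENT = (
    "Mozilla/5.0 (Linux; Android 4.0.4; Galaxy Nexus Build/IMM76B) "
    "AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.133 Mobile Safari/535.19"
)


def build_mobile_request(url):
    """Create a request object that identifies itself as a mobile browser."""
    return urllib.request.Request(url, headers={"User-Agent": MOBILE_USER_AGENT})


def fetch_as_mobile(url, timeout=10):
    """Download the page source the server returns to the mobile user agent."""
    with urllib.request.urlopen(build_mobile_request(url), timeout=timeout) as response:
        return response.read()


request = build_mobile_request("http://example.com/")
print(request.get_header("User-agent"))
```

Calling fetch_as_mobile("http://example.com/") performs the actual download, so you can compare the returned source with what a desktop user agent receives.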
Note: If your mobile optimized website uses a mix of client-side and server-side technologies such as AJAX to detect the user agent and alter content based on it, it will not work in many website crawlers including, as of September 2014 at least, A1 Website Download.
However, it will work in most browsers since they execute JavaScript code, which can query the browser for the user agent ID.
AJAX Websites
If your website uses AJAX, a technology where JavaScript communicates with the server and alters the content in-browser without changing the URL address, it is worth knowing that crawlability will depend on the exact implementation.
Explanation of fragments in URLs:
- Page-relative-fragments: Relative links within a page:
  http://example.com/somepage#relative-page-link
- AJAX-fragments: Client-side JavaScript that queries server-side code and replaces content in-browser:
  http://example.com/somepage#lookup-replace-data
- AJAX-fragments-Google-initiative: Part of the Google initiative Making AJAX Applications Crawlable:
  http://example.com/somepage#!lookup-replace-data
  This solution has since been deprecated by Google themselves. For more information see:
  https://developers.google.com/webmasters/ajax-crawling/docs/getting-started
  https://developers.google.com/webmasters/ajax-crawling/docs/specification
  If you use this solution, you will see URLs containing #! and _escaped_fragment_ when crawled.
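To show where the _escaped_fragment_ URLs come from, here is a simplified Python sketch of the mapping described in the (now deprecated) Google scheme: the part after #! is moved into a query parameter. The function name is made up. The escaping uses urllib.parse.quote, which escapes somewhat more characters than the specification requires, so treat it as an approximation.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def to_escaped_fragment_url(url):
    """Map a #! URL to its _escaped_fragment_ form; return other URLs unchanged."""
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # Page-relative or plain AJAX fragment: nothing to convert.
    escaped = quote(parts.fragment[1:], safe="=")
    parameter = "_escaped_fragment_=" + escaped
    # Append to an existing query string if there is one.
    query = parts.query + "&" + parameter if parts.query else parameter
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment_url("http://example.com/somepage#!lookup-replace-data"))
print(to_escaped_fragment_url("http://example.com/somepage#relative-page-link"))
```

A crawler cannot tell a page-relative fragment from a plain AJAX fragment by looking at the URL alone, which is why only the #! form is converted here.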
Tip: To help successfully crawl AJAX websites:
- Select an AJAX enabled crawler option in Scan website | Crawler engine.
- Enable option Scan website | Crawler options | Try search inside Javascript.
- Enable option Scan website | Crawler options | Try search inside JSON.
Normally, websites do not cloak content based on user agent string or IP address.
However, by setting the user agent ID you can check the HTML source that search engines and browsers see when retrieving pages from your website.
Note: This can also be used to test if a website responds correctly to a crawler/browser that identifies itself as being mobile.
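A simple way to check whether a page differs by user agent is to download it once per user agent and compare the results. The Python sketch below assumes you supply your own download function, here called fetch and taking a URL and a user agent string. Both the function names and the fake downloader in the example are hypothetical.

```python
import hashlib


def content_fingerprint(body):
    """Return a SHA-256 hex digest of the downloaded bytes."""
    return hashlib.sha256(body).hexdigest()


def differs_by_user_agent(fetch, url, user_agents):
    """Return True if fetch(url, user_agent) yields different content per user agent."""
    fingerprints = {
        user_agent: content_fingerprint(fetch(url, user_agent))
        for user_agent in user_agents
    }
    return len(set(fingerprints.values())) > 1


# Fake downloader used only to demonstrate the comparison without network access.
def fake_fetch(url, user_agent):
    return b"<html>mobile</html>" if "Mobile" in user_agent else b"<html>desktop</html>"


print(differs_by_user_agent(fake_fetch, "http://example.com/", ["Desktop", "Mobile Safari"]))
```

Keep in mind that dynamic pages can differ between two requests for unrelated reasons, such as timestamps or session tokens, so a difference is a hint to inspect the source and not proof of cloaking.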
Website Bandwidth Filtering and/or Throttling
A few website platforms and modules take measures against crawlers they do not recognize in order to reserve bandwidth and server usage for real visitors and search engines.
Here is a list of known solutions for those website platforms:
- If you are trying to crawl a forum, check our guide to optimal crawling of forums and blogs with website download.
Website Erratically Sends The Wrong Page Content
We have seen a few cases where the website, server, CDN, CMS or cache system suffered from a bug and sent the wrong output page content when being crawled.
To prove and diagnose such a problem, download and configure A1 Website Download like this:
- Set Scan website | Download options | Convert URL paths in downloaded content to no conversion.
- Enable Scan website | Data collection | Store redirects and links from and to all pages.
- Enable all options in Scan website | Webmaster filters.
You can now compare the downloaded page source code with what is reported in A1 Website Download, and see if the webserver/website sent correct or incorrect content to the A1 crawler engine.
To solve such an issue without access to the website and webserver code, try using some of the configurations suggested further down below.
If you encounter a website that throttles crawler requests, blocks certain user agents or is very slow, you will often get response codes such as:
- 403 : Forbidden
- 503 : Service Temporarily Unavailable
- -5 : TimeoutConnectError
- -6 : TimeoutReadError
To solve these try the following:
- Set Scan website | Crawler engine | Max simultaneous connections to one.
- Set Crawler engine | Advanced engine settings | Default to GET requests to checked/enabled.
- Then, if necessary, have the webcrawler identify itself as a search engine or as a user surfing.
  - Identify as "user surfing website":
    - Set General options and tools | Internet crawler | User agent ID to Mozilla/4.0 (compatible; MSIE 7.0; Win32).
    - Set Scan website | Webmaster filters | Download "robots.txt" to unchecked/disabled.
    - In Scan website | Crawler engine increase the amount of time between active connections.
    - Optional: Set Scan website | Crawler engine to HTTP using WinInet engine and settings (Internet Explorer)
  - Identify as "search engine crawler":
    - Set General options and tools | Internet crawler | User agent ID to Googlebot/2.1 (+http://www.google.com/bot.html) or another search engine crawler ID.
    - Set Scan website | Webmaster filters | Download "robots.txt" to checked/enabled.
    - Set Scan website | Webmaster filters | Obey "robots.txt" file "disallow" directive to checked/enabled.
    - Set Scan website | Webmaster filters | Obey "robots.txt" file "crawl-delay" directive to checked/enabled.
    - Set Scan website | Webmaster filters | Obey "meta" tag "robots" noindex to checked/enabled.
    - Set Scan website | Webmaster filters | Obey "meta" tag "robots" nofollow to checked/enabled.
    - Set Scan website | Webmaster filters | Obey "a" tag "rel" nofollow to checked/enabled.
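To get a feel for what obeying the robots.txt "disallow" and "crawl-delay" directives means, the following Python sketch evaluates a small robots.txt file with the standard urllib.robotparser module. It is an illustration only and not how A1 Website Download processes robots.txt. The robots.txt content, the load_rules helper and the example URLs are made up.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 3
"""

USER_AGENT = "Googlebot/2.1 (+http://www.google.com/bot.html)"


def load_rules(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
# Is the crawler allowed to request these URLs, and how long should it wait?
print(rules.can_fetch(USER_AGENT, "http://example.com/private/report.html"))
print(rules.can_fetch(USER_AGENT, "http://example.com/index.html"))
print(rules.crawl_delay(USER_AGENT))
```

A crawler that identifies itself as a search engine is expected to behave like one, which is why the robots.txt options are enabled in that configuration.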
Note: If you continue to have problems, you can combine the above with:
- Set Scan website | Crawler engine | Crawl-delay in miliseconds between connections to at least 3000.
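The combined effect of a single connection and a delay between requests can be sketched in Python as a sequential loop that waits before every request and retries on the response codes listed above. This is a hedged illustration and not the A1 crawler engine. The function name, the retry limit and the caller-supplied fetch function (which returns a response code for a URL) are assumptions.

```python
import time

# Response codes from the list above that are worth retrying after a pause.
RETRY_STATUS_CODES = {403, 503, -5, -6}


def crawl_politely(urls, fetch, delay_seconds=3.0, max_attempts=3, sleep=time.sleep):
    """Request URLs one at a time, pausing between requests and retrying failures.

    fetch(url) must return a response code; max_attempts must be at least 1.
    """
    results = {}
    requests_made = 0
    for url in urls:
        for _ in range(max_attempts):
            if requests_made:
                sleep(delay_seconds)  # Wait before every request except the first.
            status = fetch(url)
            requests_made += 1
            if status not in RETRY_STATUS_CODES:
                break
        results[url] = status
    return results


# Fake server that refuses the first request to one URL and then answers normally.
responses = {"http://example.com/a": [503, 200], "http://example.com/b": [200]}
pauses = []
outcome = crawl_politely(list(responses), lambda url: responses[url].pop(0), sleep=pauses.append)
print(outcome, pauses)
```

The sleep function is passed in as a parameter so the example runs instantly. Left at its default, the loop really waits three seconds between requests, matching the 3000 milliseconds suggested above.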
Note: If your IP address has been blocked, you can try using General options and tools | Internet crawler | HTTP proxy settings.
Proxy support depends on which HTTP engine has been selected in Scan website | Crawler engine.
Note: If the problem is timeout errors, you can also try repeated crawls with the resume scan functionality.
You can also download our default project file for problematic websites, as you can often apply the same solutions to a wide range of websites.