Crawling Forums and Blogs with Website Download
Crawling blogs and forums such as SMF, vBulletin, etc. can sometimes take a long time. However, configuring our website download tool properly can speed up website scans.
Forums and blogs are no different from other websites. Rarely will you ever need to configure website download in a special way.
However, here is a list of common topics for large and/or database-driven websites:
- How some website platforms cause crawling problems.
- Use resume scan support in our website download tool.
  - Notice that you can improve resume by disabling:
    - Older versions:
      - Scan website | Crawler options | Apply "webmaster" and "output" filters after website scan stops
    - Newer versions:
      - Scan website | Output filters | After website scan stops: Remove URLs excluded
      - Scan website | Webmaster filters | After website scan stops: Remove URLs with noindex/disallow
- About crawling and finding links in websites.
- Adjusting server load and website crawl speed.
- Including content otherwise only available for subscribers using password protected pages.
- Use output filters to exclude certain URLs from being included in website scan output.
- Use analysis filters to prevent certain URLs from being crawled / analyzed.
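The difference between the last two items can be illustrated with a small toy crawl. This is only a sketch in Python and not how the website download tool is implemented: the in-memory `SITE` link map, the `crawl` function and the plain "text appears in URL" matching are all assumptions made for illustration. In this sketch a URL excluded from analysis is never fetched, so links found only on that page are never discovered, while a URL excluded from output is still fetched and followed but left out of the final list. How the real tool lists URLs that were excluded from analysis is not covered here.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the URLs it links to.
SITE = {
    "/index.php": ["/viewtopic.php?t=1", "/search.php", "/profile.php?u=7"],
    "/viewtopic.php?t=1": ["/index.php", "/printthread.php?t=1"],
    "/search.php": ["/search.php?page=2"],
    "/search.php?page=2": [],
    "/profile.php?u=7": ["/viewtopic.php?t=1"],
    "/printthread.php?t=1": [],
}


def matches(url, filters):
    """Assumed matching rule: a filter matches if its text occurs in the URL."""
    return any(text in url for text in filters)


def crawl(start, analysis_exclude=(), output_exclude=()):
    """Breadth-first crawl of SITE. Returns (fetched, output)."""
    seen = {start}
    queue = deque([start])
    fetched = []
    while queue:
        url = queue.popleft()
        if matches(url, analysis_exclude):
            continue  # not fetched, so its links are never discovered
        fetched.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    # Output filters only trim the result list; they do not stop crawling.
    output = [u for u in fetched if not matches(u, output_exclude)]
    return fetched, output


fetched, output = crawl(
    "/index.php",
    analysis_exclude=["search.php"],
    output_exclude=["profile.php"],
)
print("fetched:", fetched)
print("output: ", output)
```

In this toy run the search pages are never fetched, which is where the time saving on large forums would come from, while the profile page is fetched but dropped from the output.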
The following settings are for demonstration purposes. Most likely you will never need to configure these options. Should you need to configure settings, take time to investigate the links above and work out what you need. Then look at the examples below for inspiration. Remember, few blogs and forums are exactly the same.
Note: There may already be Quick presets... available in Scan website that match your website platform and crawl needs.
Note: If in doubt about what Login path and Post form data correspond to, see the help page about password protected pages and login.
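A Post form data value is a standard URL-encoded string of field=value pairs joined by "&". As a minimal sketch, and independent of the website download tool itself, the Python standard library can split such a string into its fields so you can compare them with the fields of your own login form. The string used here is the vBulletin example from this article; the helper names `split_form_data` and `build_form_data` are made up for illustration.

```python
from urllib.parse import parse_qsl, urlencode

# Post form data copied from the vBulletin example in this article.
post_form_data = (
    "vb_login_username=yourusername&vb_login_password=yourpassword"
    "&cookieuser=1&s=&do=login&vb_login_md5password=&vb_login_md5password_utf="
)


def split_form_data(data):
    """Return (field, value) pairs, keeping fields that have empty values."""
    return parse_qsl(data, keep_blank_values=True)


def build_form_data(fields):
    """Re-encode (field, value) pairs into a Post form data string."""
    return urlencode(fields)


fields = split_form_data(post_form_data)
for name, value in fields:
    print(f"{name} = {value!r}")

# Round trip: re-encoding the pairs gives back the original string.
print(build_form_data(fields) == post_form_data)
```

Fields with empty values, such as `s=`, are kept on purpose because a login form may still expect them to be present.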
List of examples:
- phpBB
  - Configure login
    - Login path : http://forum.example.com/login.php
    - Post form data : username=yourusername&password=yourpassword&redirect=index.php?&login=Log in
  - Configure crawler/analysis and output/list exclude filters
    - Necessary
    - Recommended
      - :profile.php
      - :login.php
      - :newreply.php
      - :printthread.php
      - :sendmessage.php
      - :search.php
      - :threadrate.php
- vBulletin
  - Configure login
    - Login path : http://forum.example.com/login.php?do=login
    - Post form data : vb_login_username=yourusername&vb_login_password=yourpassword&cookieuser=1&s=&do=login&vb_login_md5password=&vb_login_md5password_utf=
  - Configure crawler/analysis and output/list exclude filters
- WordPress
  - Configure login
    - Login path : http://blog.example.com/wp-login.php
    - Post form data : log=yourusername&pwd=yourpassword&rememberme=forever&wp-submit=Log+ind&redirect_to=wp-admin%2F&testcookie=1
  - Configure crawler/analysis and output/list exclude filters
    - Necessary
      - :wp-admin/
      - :wp-login.php?action=logout
    - Recommended
    - Note
      - If you do not exclude the "admin" section using filters, try to avoid edit, post, delete, trash, logout and related link types.
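Before starting a long scan, it can help to dry-run an exclude list against a few sample URLs. The sketch below does this in Python for the WordPress "Necessary" entries above. It is not the tool's own matching code: it assumes that the leading ":" marks an entry whose remaining text may match anywhere in the URL, and the sample URLs and the helper names `filter_text` and `is_excluded` are hypothetical.

```python
# Exclude entries copied from the WordPress "Necessary" list above.
WORDPRESS_NECESSARY = [":wp-admin/", ":wp-login.php?action=logout"]


def filter_text(entry):
    """Strip the leading ':' (assumed here to mean 'match anywhere in the URL')."""
    return entry[1:] if entry.startswith(":") else entry


def is_excluded(url, entries):
    """Return True if any exclude entry's text occurs in the URL."""
    return any(filter_text(entry) in url for entry in entries)


# Hypothetical sample URLs for a dry run.
sample_urls = [
    "http://blog.example.com/2020/01/hello-world/",
    "http://blog.example.com/wp-admin/post.php?post=1&action=edit",
    "http://blog.example.com/wp-login.php?action=logout&_wpnonce=abc",
    "http://blog.example.com/wp-login.php",
]

kept = [url for url in sample_urls if not is_excluded(url, WORDPRESS_NECESSARY)]
for url in sample_urls:
    status = "excluded" if is_excluded(url, WORDPRESS_NECESSARY) else "kept"
    print(f"{status:8} {url}")
```

Under these assumptions the admin edit link and the logout link are excluded, which matters for a logged-in crawl because following them could change content or end the session.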