Configuration

Perimeter tab

Crawl configuration - perimeter

Crawl mode

Default spider mode:
the default spider mode crawls pages by following links, redirects, etc. within the website. This is the recommended mode for crawling a website, and it produces comprehensive crawl reports, including linking reports.

List of URLs:
for specific needs or additional checks, you can paste a list of URLs to check by selecting "List of URLs".

Device & rendering

HTML only is the default setting and the one to use most of the time; it is very fast.

JavaScript: why crawl in HTML+JavaScript rendering mode?

  • to crawl websites built with JavaScript frameworks,
  • to get custom data from pages by scraping with XPath, CSS Path or regex,
  • to crawl like Googlebot for smartphones, or to run a mobile vs. desktop SEO comparison audit,
  • to get more performance indicators (DOM content loaded...),
  • to get additional data from the DOM that is not present in the HTML source code...

Crawling in JavaScript rendering mode takes more time than crawling the HTML source code; for websites that don't use JavaScript to render content, you should use the default HTML mode, except for the specific purposes listed above.

Notes: when crawling in JavaScript rendering mode, do not close the Slimjet browser window. When crawling as a mobile device, do not maximize the window. In JavaScript rendering mode, Hextrakt also extracts structured breadcrumbs (marked up with schema.org or data-vocabulary.org).

Other settings

The starting URL is usually the home page. It is recommended not to limit the crawl depth or the number of crawled pages, in order to get a comprehensive crawl.

To include other domains or subdomains, simply add them to the "URLs beginning with" field, separated by spaces, e.g. http://www.mydomain.com http://www.mydomain.fr http://www.mydomain.co.uk http://blog.mydomain.com
List here all the domains that you want to crawl, including the domain of the starting URL.
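As the field name suggests, this presumably works as a simple prefix match. A minimal Python sketch of that assumed behavior (the prefix list and URLs are hypothetical, not Hextrakt's actual implementation):

    # Assumed semantics of the "URLs beginning with" field: a URL is in
    # the crawl perimeter if it starts with any of the listed prefixes.
    prefixes = ["http://www.mydomain.com", "http://blog.mydomain.com"]

    def in_perimeter(url: str) -> bool:
        return any(url.startswith(prefix) for prefix in prefixes)

    print(in_perimeter("http://blog.mydomain.com/2024/post"))  # True
    print(in_perimeter("http://shop.mydomain.com/"))           # False (not listed)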

To crawl all subdomains including http and https URLs, use the "URLs matching Regexp" field with this regexp: (http|https):\/\/[^\.]*\.?mydomain

You can write several regexps separated by spaces; a URL is crawled if it matches the 1st regexp OR the 2nd regexp OR...
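You can check what such a regexp matches before launching a crawl. A minimal Python sketch (the URLs are hypothetical; the pattern is the one given above):

    import re

    # The subdomain regexp from above; "mydomain" is a placeholder.
    pattern = re.compile(r"(http|https):\/\/[^\.]*\.?mydomain")

    for url in [
        "http://mydomain.com/page",     # matches (no subdomain)
        "https://blog.mydomain.com/",   # matches (one subdomain)
        "http://www.otherdomain.com/",  # no match
    ]:
        print(url, bool(pattern.match(url)))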

History size: beware that if you crawl big websites and keep many crawl reports in the history, you will need a lot of available hard disk space.

Connection tab

Crawler connection configuration

Most of the time you can leave the default settings, unless you want to specify the number of parallel connections. Adaptive crawling (for HTML-only crawls) automatically sets the best crawl speed, taking client and server resources into account; if you don't know how to set the number of parallel connections, leave it checked to avoid overloading the server.

Remember that you may crawl a website only with the owner's agreement.

APIs tab

API connection configuration

Enter your email to connect to your Google Analytics and Search Console accounts, and check "Get Search Console data" and "Get Analytics data". When you click "Get list", a browser window opens so you can authorize Hextrakt to read the data from your Google Analytics and Search Console accounts.

Authorize Hextrakt to read your Google Analytics and Search Console data

Then select the domain from the Search Console websites list to get the data for the domain(s) to crawl, and select the right Google Analytics view (usually your master view). For more information about the Google Analytics account hierarchy, read the Google Analytics help.

If you don't see the website that you want to crawl in the list, you may not have the right user permissions for the email address you entered in the Google login field.

Advanced tab

Advanced crawl configuration

You can usually leave the default settings here.

  • Ignore robots.txt file: if checked, Hextrakt will crawl pages even if they are disallowed. Check this if you want to get "Noindex & blocked by robots.txt" conflicts in the Directives > Indexability report.
  • Ignore nofollow: if checked, Hextrakt will crawl nofollow links.
  • Allow cookies: allow Hextrakt to manage cookies.
  • Check [...]: if checked, Hextrakt will get the status code (and Content-Length) for these URLs.
  • Maximal wait timeout: the number of seconds to wait for an HTTP response.
  • Number of retries: the number of retries in case of a network error.

Custom data

Custom data extraction

Custom data extraction is available only in JavaScript rendering mode. You can add up to 10 custom fields to extract additional data from pages, using XPath, CSS Path or regular expressions.
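To illustrate what an XPath custom field returns, here is a minimal sketch using Python and lxml (the markup and expression are hypothetical; Hextrakt evaluates the expressions itself during the crawl):

    from lxml import html  # third-party: pip install lxml

    # Hypothetical product page fragment.
    page = html.fromstring("""
    <html><body>
      <h1 class="product-title">Blue Widget</h1>
      <p class="shipping">Free shipping on all orders.</p>
    </body></html>
    """)

    # Example of an XPath expression you could enter as a custom field.
    print(page.xpath("string(//h1[@class='product-title'])"))  # Blue Widget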

If you need to look for duplicate content in specific HTML parts of the pages, you can apply a hash function to the extracted data (identical data always produce the same hash string), so you can easily find duplicates in Excel or OpenOffice.
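As a sketch of why hashing helps (the hash algorithm shown is an assumption, chosen only for illustration): identical extracted fragments produce identical digests, so filtering or sorting on the hash column in a spreadsheet groups duplicates together.

    import hashlib

    # Identical extracted fragments yield identical digests.
    def content_hash(fragment: str) -> str:
        return hashlib.md5(fragment.encode("utf-8")).hexdigest()

    a = "<p>Free shipping on all orders.</p>"
    b = "<p>Free shipping on all orders.</p>"
    c = "<p>Free returns within 30 days.</p>"

    print(content_hash(a) == content_hash(b))  # True  -> duplicate content
    print(content_hash(a) == content_hash(c))  # False -> distinct content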

Find specific duplicate contents

In order to display this additional data in a report, use the advanced URL explorer and add the "custom data..." columns.

Advanced search with custom data

For instance, when you crawl an e-commerce website, you can extend the crawl data by adding the number of reviews from each product page:

Number of reviews
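A custom field for this could be an XPath count. A minimal sketch of what such an expression evaluates to, using Python and lxml for illustration (the markup and class names are hypothetical; adapt the expression to the site's actual markup):

    from lxml import html  # third-party: pip install lxml

    # Hypothetical markup: one <div class="review"> per customer review.
    page = html.fromstring("""
    <div class="reviews">
      <div class="review">Great!</div>
      <div class="review">Works fine.</div>
      <div class="review">Would buy again.</div>
    </div>
    """)

    # Entered as a custom field, an XPath like this would return 3 here.
    print(page.xpath("count(//div[@class='review'])"))  # 3.0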