Default spider mode:
The default spider mode crawls pages by following links, redirects, etc. within the website. This is the recommended mode for crawling a website: it provides comprehensive crawl reports, including linking reports.
List of URLs:
For specific needs or additional checks, you can paste a list of URLs to check by selecting "List of URLs".
Device & rendering
HTML only is the default setting to use most of the time, and it is very fast. Switch to a rendering mode instead when you need:
- to extract custom data from pages by scraping with XPath, CSS Path or regex,
- to crawl like Googlebot for smartphones or perform a mobile vs desktop SEO comparison audit,
- to get more performance indicators (DOM content loaded...),
- to get additional data from the DOM that is not present in the HTML source code...
The starting URL is usually the home page. For a comprehensive crawl, it is recommended not to limit the crawl depth or the number of crawled pages.
To include other domains or subdomains, simply add them in the "URLs beginning with" field, separated by a space, e.g. http://www.mydomain.com http://www.mydomain.fr http://www.mydomain.co.uk http://blog.mydomain.com
Here, write all the domains that you want to crawl, including the domain of the starting URL.
To crawl all subdomains, including http and https URLs, use the "URLs matching Regexp" field with this regexp: (http|https):\/\/[^\.]*\.
You can write several regexps separated by a space; a URL will be crawled if it matches the 1st regexp OR the 2nd regexp OR...
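As a quick sanity check (plain Python, not Hextrakt itself), you can test which URLs the example pattern above would match before starting a crawl:

```python
import re

# The example pattern from the "URLs matching Regexp" field:
# scheme, then "://", then a subdomain label (no dots), then a dot.
pattern = re.compile(r"(http|https):\/\/[^\.]*\.")

urls = [
    "http://www.mydomain.com/page",   # matches
    "https://blog.mydomain.com/post", # matches
    "ftp://www.mydomain.com/file",    # wrong scheme: no match
]

for url in urls:
    print(url, "->", bool(pattern.match(url)))
```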
History size: beware that if you crawl big websites and keep many crawl reports in the history, you will need a lot of available hard disk space.
Most of the time you can leave the default settings, unless you want to specify the number of parallel connections. Adaptive crawling (for HTML only crawls) automatically sets the best crawl speed, managing client and server resources; if you are not sure how to set the number of parallel connections, leave it checked to avoid overloading the server.
Remember that you can crawl a website only in agreement with the owner.
Enter your email to connect to your Google Analytics & Search Console accounts, and check "Get Search Console data" and "Get Analytics data". Once you click on "Get list", a browser window will open to connect Hextrakt to your Google account and grant it (read) access to the data in Analytics and Search Console.
Then select the domain from the Search Console websites list to get data for the domain(s) to crawl, and select the right Google Analytics view within it (usually your master view). For more information about the Google Analytics account hierarchy, read the Google Analytics help.
If you don't see the website you want to crawl in the list, the email you entered in the Google login field may not have the right user permissions.
You can usually leave the default settings here.
- Ignore robots.txt file: if checked, Hextrakt will crawl pages even if they are disallowed. Check this if you want "Noindex & blocked by robots.txt" conflicts to appear in the Directives > Indexability report.
- Ignore nofollow: if checked, Hextrakt will crawl nofollow links.
- Allow cookies: allow Hextrakt to manage cookies.
- Check [...]: if checked, Hextrakt will get the status code (and Content-Length) for these URLs.
- Maximal wait timeout: number of seconds to wait for the HTTP response.
- Number of retries: in case of network error.
If you need to look for duplicate content in specific HTML parts of the pages, you can apply a hash function to the extracted data (identical data always produces the same hash string), so duplicates are easy to find in Excel or OpenOffice.
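To illustrate the idea (a plain Python sketch, not Hextrakt's internal implementation, with made-up page fragments), hashing extracted fragments lets you group duplicates by sorting or filtering on the hash column:

```python
import hashlib

# Hypothetical HTML fragments extracted from three crawled pages.
fragments = {
    "/page-a": "<p>Free shipping on all orders.</p>",
    "/page-b": "<p>Unique product description.</p>",
    "/page-c": "<p>Free shipping on all orders.</p>",  # duplicate of /page-a
}

# Identical fragments always produce identical digests, so equal hash
# values in the exported report reveal duplicate content.
for url, html in fragments.items():
    digest = hashlib.md5(html.encode("utf-8")).hexdigest()
    print(url, digest)
```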
In order to display this additional data in a report, use the advanced URL explorer and add the "custom data..." columns.
For instance, when you crawl an ecommerce website, you can extend the crawl data by adding the number of reviews from each product page:
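A hypothetical XPath expression for the custom data field, assuming each review on the product page sits in an element with class "review" (adjust the selector to your site's actual markup):

```xpath
count(//div[@class="review"])
```

Combined with the hash option described above, the same mechanism can also flag product pages sharing identical review blocks.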