FAQ

Crawl

What does asynchronous crawl mean?

Most crawlers launch several "threads" to download web pages in parallel. In this threaded architecture, each running thread performs the following steps:

  1. open a web connection to the target website;
  2. send an HTTP request;
  3. wait until all the TCP packets have arrived;
  4. process the response.

Step 3 is mostly "wait, and wait again...": the thread sits idle until the response arrives, yet it still consumes memory and other system resources for little benefit.

Hextrakt uses an asynchronous technology: a single thread manages multiple web connections. This single thread sends and receives all the web requests in parallel, so it is always doing something. This saves a lot of system resources (memory and CPU time), which is why we were able to crawl more than 3 million pages while consuming less than 200 MB of memory. This asynchronous technology is also extremely fast, because it avoids thread locking. On our test infrastructure, the crawl speed exceeds 200 pages per second.
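
As an illustration only (this is not Hextrakt's actual code), here is a minimal single-threaded asynchronous fetcher in TypeScript, assuming Node.js 18+ for the built-in fetch; the URLs are placeholders:

    const urls: string[] = [
      "https://example.com/",
      "https://example.com/about",
      "https://example.com/contact",
    ];

    async function crawl(list: string[]): Promise<void> {
      // Launch every request at once; the single thread regains control at
      // each await and processes whichever response arrives first, instead
      // of blocking one thread per request.
      const responses = await Promise.all(list.map((u) => fetch(u)));
      responses.forEach((res, i) => {
        console.log(list[i], res.status, res.headers.get("content-type"));
      });
    }

    crawl(urls);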

What does adaptive crawl mean?

When crawling a website, some parameters are unknown or difficult to evaluate: the network bandwidth, the processing power of the target web servers, and the available desktop memory and CPU power. On the one hand, you want to download all the pages as fast as possible; on the other hand, if you fire too many requests in parallel, you will surely degrade the response times and possibly bring down the website.

It is therefore nearly impossible to decide in advance the optimal number of requests to send in parallel without overwhelming the server.

This is where Hextrakt brings a major innovation, which we call the adaptive crawler. This feature automatically optimizes download speed, carefully adjusting the number of parallel web requests to respect the capacity of both the target website and the desktop computer.
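
Hextrakt's actual algorithm is not published; the sketch below only illustrates the general idea with an additive-increase / multiplicative-decrease (AIMD) controller, and every name and threshold in it is an assumption:

    let parallelism = 2;             // requests currently allowed in flight
    const MAX_PARALLELISM = 64;      // hypothetical desktop-side ceiling
    const SLOW_THRESHOLD_MS = 1500;  // hypothetical "server is struggling" mark

    function adjustParallelism(lastResponseMs: number): void {
      if (lastResponseMs > SLOW_THRESHOLD_MS) {
        // Response times are degrading: back off sharply to spare the server.
        parallelism = Math.max(1, Math.floor(parallelism / 2));
      } else if (parallelism < MAX_PARALLELISM) {
        // The server keeps up: gently probe for more throughput.
        parallelism += 1;
      }
    }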

Note: asynchronous & adaptive crawling is available for HTML-only crawls.

How does JavaScript rendering work?

Hextrakt embeds the Slimjet browser, which is based on Chromium, to crawl pages the way Googlebot does.

Hextrakt does not use a fixed render timeout but a maximum timeout: when crawling in JavaScript rendering mode, Hextrakt waits for all page resources to load and all JavaScript code to execute, then analyzes the DOM. If rendering finishes before the maximum timeout (which is user-definable), the next page starts loading sooner; this can make the crawl faster.
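
Hextrakt's embedded browser is not exposed as an API, but the same "wait, with a cap" behaviour can be sketched with Playwright (npm i playwright); the function name and the timeout value below are illustrative:

    import { chromium } from "playwright";

    const MAX_TIMEOUT_MS = 10_000; // stands in for the user-definable maximum

    async function renderPage(url: string): Promise<string> {
      const browser = await chromium.launch();
      const page = await browser.newPage();
      // Resolves as soon as the network goes idle (resources loaded, scripts
      // executed); throws only if MAX_TIMEOUT_MS elapses first. A fast page
      // therefore frees its slot early, speeding up the whole crawl.
      await page.goto(url, { waitUntil: "networkidle", timeout: MAX_TIMEOUT_MS });
      const dom = await page.content(); // the fully rendered DOM
      await browser.close();
      return dom;
    }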

Reports

What is the difference between "URL" and "page"?

In Hextrakt, a page means an HTML document, whereas a URL may be the location of any type of file, including img, css, js, HTML...

What are the different URL crawl statuses?

A URL might be:

  • Found or discovered: when it appears in the HTML code of a source page.
  • Fetched: when the crawler has made an HTTP request to get the HTTP status, the headers and possibly the content (only for HTML pages) of the target URL.
  • Crawled: an HTML page is crawled when it has been fetched and analyzed.
  • Orphan: an HTML page is orphan when it was not found by the crawler, but was retrieved from the Analytics or Search Console APIs.

What are the different file natures?

  • html
    This is a crawled HTML page, which has been fetched and analyzed.
  • external
    This is an HTML page which is either out of the crawl perimeter, or an orphan page.
  • uncrawled html
    This is an HTML page which was found by the crawler, but which was not crawled because it is blocked by robots.txt or was found via an <a rel=nofollow> link (if "ignore nofollow" is unchecked). A page may also be uncrawled if a crawl limit (maximum depth or URL count set in the configuration) is reached.
  • other
    This is a file which was found via an <a href> link, but which appears to be something other than an HTML file.
  • js
    This is a JavaScript file.
  • css
    This is a CSS file.
  • img
    This is an image file.
  • embed
    This is a file found from an <embed src= > tag.
  • extra dependency
    When crawling a website in JavaScript rendering mode, the JavaScript code in web pages may trigger additional HTTP requests which do not exist in the initial source code. These requests may be calls to analytics services, advertising or tracking servers, or social plugins, or may fetch other assets such as fonts or JSON content.

Why are there grey bullets on the left of the URL table?

In some reports Hextrakt lists external URLs, e.g. mobile version pages. For these external URLs, you can get some data (status code, content-type, size...) if you checked "Check external links" in the crawl configuration (advanced tab). If not checked, these URLs will not be fetched, and you will see a grey bullet on the left of the URL table. To retrieve all data for external URLs, you have to include them in the crawl perimeter, for example by including your mobile version (include URLs beginning with: http://www.mydomain.com http://m.mydomain.com).

HTML pages that are not included in the crawl perimeter get the "External" nature in Hextrakt.

I don't see any SEO visits (organic entrances) for some pages that should have visits

Please check that Hextrakt is connected to the right Google Analytics profile (in the crawl configuration, "APIs" tab). Check also the Google Analytics tracking code in the pages (or the Google Tag Manager settings) to see if URLs are renamed like this: ga('send', 'pageview', 'new-name.html'). Also check your Google Analytics settings to see if URL parameters are removed.
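
For reference, this is what such a renaming looks like in an analytics.js snippet; 'new-name.html' is the document's own illustrative value:

    // Default call: Analytics records the actual document location.
    ga('send', 'pageview');

    // Renamed call: Analytics records 'new-name.html' instead of the real
    // URL, so the crawled URL and the reported URL no longer match.
    ga('send', 'pageview', 'new-name.html');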

I don't get any Google Analytics or Search console data at all

First ensure that you have connected Hextrakt to your Google account in the configuration window (APIs tab). If you manually stop the crawl, you will get neither Analytics nor GSC data. If you run a complete crawl, or if you set a maximum number of crawled pages in the crawl configuration, Hextrakt will get Analytics & GSC data.

In the advanced URL explorer, the filter "tag != anything" does not show URLs whose tag is empty

The != operator only filters values which are not empty. To get URLs which are not tagged, you have to use the "empty" operator, i.e. "tag empty".
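
The behaviour can be pictured with this sketch (illustrative TypeScript, not Hextrakt's code; the Row type is an assumption):

    type Row = { url: string; tag: string };

    // "tag != anything": rows with an empty tag are never compared,
    // so they never match the filter.
    const notEqual = (rows: Row[], value: string): Row[] =>
      rows.filter((r) => r.tag !== "" && r.tag !== value);

    // "tag empty": the dedicated operator for untagged URLs.
    const emptyTags = (rows: Row[]): Row[] =>
      rows.filter((r) => r.tag === "");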

Installation & setup

How to update Hextrakt Crawler

Just download a new version from our download page. When you update Hextrakt by installing a new version, you will keep all your data, such as crawls, custom reports (filters) and tags.