What does asynchronous crawl mean?
Most crawlers launch several "threads" to download web pages in parallel. In this threaded architecture, each running thread performs the following steps:
- open a web connection to the target website;
- send an HTTP request;
- wait until all the TCP packets have arrived;
- process the response.
Step 3 is mostly "wait, and wait again...": each thread spends most of its time idle, yet it still consumes memory and system resources without much benefit.
Hextrakt uses an asynchronous technology instead. A single thread manages multiple web connections: it sends and receives all the web requests in parallel, so it is always doing useful work. This saves a lot of system resources (memory and CPU time), which is how we have crawled more than 3 million pages while consuming less than 200 MB of memory. This asynchronous technology is also extremely fast because it avoids thread locking: on our test infrastructure, the crawl speed exceeds 200 pages per second.
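The single-thread model above can be sketched with Python's asyncio. This is an illustration of the general technique, not Hextrakt's actual code; the network wait is simulated with asyncio.sleep so the example is self-contained.

```python
import asyncio
import time

async def fetch(url: str, delay: float) -> str:
    # Simulate the "wait until all the TCP packets have arrived" phase.
    # asyncio.sleep stands in for real network I/O: while this request
    # waits, the event loop is free to service other requests.
    await asyncio.sleep(delay)
    return f"{url}: done"

async def crawl(urls):
    # One thread, many concurrent requests: asyncio.gather runs all the
    # fetches on a single event loop, overlapping their wait times.
    return await asyncio.gather(*(fetch(u, 0.1) for u in urls))

start = time.perf_counter()
results = asyncio.run(crawl([f"https://example.com/page{i}" for i in range(20)]))
elapsed = time.perf_counter() - start

# Twenty simulated 0.1 s requests finish in roughly 0.1 s total rather
# than 2 s, because the waits overlap on a single thread.
print(len(results))  # 20
```

With a threaded design, the same overlap would cost one thread (and its stack memory) per in-flight request; here the only per-request cost is a small coroutine object.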
What does adaptive crawl mean?
When crawling a website, some parameters are unknown or difficult to evaluate: the network bandwidth, the processing power of the target web servers, and the available desktop memory and CPU power. On the one hand, you want to download all the pages as fast as possible; on the other hand, if you fire too many requests in parallel, you will surely degrade response times, and possibly bring down the website.
It is therefore nearly impossible to decide in advance the optimal number of requests to send in parallel without overloading the server.
This is where Hextrakt brings a major innovation, which we call the adaptive crawler. It automatically optimizes the download speed, carefully adjusting the number of parallel web requests to respect the capacity of both the target website and the desktop computer.
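One common way to implement this kind of feedback loop is additive-increase / multiplicative-decrease (AIMD), the idea behind TCP congestion control. The sketch below is a hypothetical illustration of that pattern, not Hextrakt's actual algorithm; the target response time and limits are made-up values.

```python
def adjust_concurrency(current: int, response_time: float,
                       target: float = 0.5, max_parallel: int = 64) -> int:
    """Return the next number of parallel requests to use (AIMD sketch)."""
    if response_time <= target:
        # Server is responding quickly: probe for more throughput,
        # adding one request at a time (additive increase).
        return min(current + 1, max_parallel)
    # Responses are slowing down: halve the load to protect the
    # server (multiplicative decrease).
    return max(current // 2, 1)

# Fast responses ramp concurrency up gradually...
level = 4
for _ in range(3):
    level = adjust_concurrency(level, response_time=0.2)
print(level)  # 7

# ...while a single slow response cuts it back sharply.
level = adjust_concurrency(level, response_time=2.0)
print(level)  # 3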
Note: asynchronous & adaptive crawl is available for HTML only crawls.
Hextrakt embeds the Slimjet browser, based on the Chromium browser, to crawl like Googlebot does.
What is the difference between "URL" and "page"?
In hextrakt, a page means a HTML document, whereas a URL may be the location of any type of file, including img, css, js, HTML...
What are the different URL crawl status?
A URL might be:
- Found or discovered: when it appears in the HTML code of a source page.
- Fetched: when the crawler has made an HTTP request to get the HTTP status, the headers and possibly the content (only for HTML pages) of the target URL.
- Crawled: a HTML page is crawled when it has been fetched and analyzed.
- Orphan: a HTML page is orphan when it was not found by the crawler, but was retrieved from the Analytics or Search Console APIs.
What are the different file natures?
This is a crawled HTML page, which has been fetched and analyzed.
This an HTML page which is either out of the crawl perimeter, or an orphan page.
- uncrawled html
This is an HTML page which was found by the crawler, but which was not crawled because it is blocked by robots.txt or found from a <a rel=nofollow> tag (if "ignore nofollow" is unchecked). A page may also be uncrawled if the crawl limit (depth or URL number set in configuration) is reached.
This is a file which was found from a <a href> link, but which appears to be something other than a HTML file.
This is a css file.
This is an image file.
This is a file found from an <EMBED src= > tag.
- extra dependency
Why are there grey bullets on the left of the URL table ?
In some reports Hextrakt lists external URLs, e.g. mobile version pages. For these external URLs, you can get some data (status code, content-type, size...) if you checked "Check external links" in the crawl configuration (advanced tab). If not checked, these URLs will not be fetched, and you will see a grey bullet on the left of the URL table. To retrieve all data for external URLs, you have to include them in the crawl perimeter, for example by including your mobile version (include URLs beginning with: http://www.mydomain.com http://m.mydomain.com).
HTML pages that are not included in the crawl perimeter get the "External" nature in Hextrakt.
I don't see any SEO visits (organic entrances) for some pages that should have visits
Please check if hextrakt is connected to the right Google Analytics profile (in the crawl configuration, "APIs" tab). Check also the Google Analytics tracking code in the pages (or Google Tag Manager settings) to see if URLs are renamed like this : ga('send', 'pageview', 'new-name.html'). Also check your Google Analytics settings to see if URLs parameters are removed.
I don't get any Google Analytics or Search console data at all
First ensure that you have connected Hextrakt to your Google account in the configuration window (APIs tab). If you manually stop the crawl, you won't get neither Analytics nor GSC data. If you run a complete crawl or if you set a maximum number of crawled pages in the crawl configuration, Hextrakt will get Analytics & GSC data.
In advanced URL explorer, the filter "tag != anything" does not show URLs whose tag is empty
The operator != will only filter values which are not empty. To get URLs which are not tagged, you have to use the "empty" operator, i.e "tag empty".
Installation & setup
How to update Hextrakt Crawler
Just download a new version from our download page. When you update Hextrakt by installing a new version, you will keep all your data, such as crawls, custom reports (filters) and tags.