What Is AJAX Crawling?

A growing number of websites make heavy use of JavaScript frameworks such as React, AngularJS, Vue.js, Polymer, and Ember. Crawling JavaScript-heavy websites is challenging because their content is generated dynamically, whereas conventional crawlers can collect only the content present in a page’s static HTML.

[Image: JavaScript code in an editor (source: www.maxpixel.net)]

Websites that transfer data from the server and render the content asynchronously with JavaScript are referred to as AJAX websites, and crawling these sites is referred to as dynamic or AJAX crawling. AddSearch supports AJAX crawling. If you are looking for ways to make your JavaScript-heavy website crawler friendly for other search engines, the following sections of this article explain how.

What Is Crawling?

Crawling is the process of collecting content from a website into an index that a search engine can access. It is carried out by a crawler, an application that visits a website, searches for links to the site’s pages, and collects them.

Usually, the start page or home page acts as the so-called seed page. This is where the crawler begins searching for links to the pages of the website. When the crawler finds links on a page, it follows them to the pages they refer to. It then follows the links found on those pages and repeats the procedure until there are no new links to follow. During this process, all of the links to the pages are collected. In addition to the seed page, the crawler can use the links listed in a sitemap.
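The link-following procedure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production crawler: the `fetch` callable, the `discover_links` function, and the in-memory `SITE` dictionary are hypothetical stand-ins for real HTTP requests, and the sketch assumes the crawler should stay on the host of the seed page.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def discover_links(seed_url, fetch):
    """Follow links breadth-first from the seed page until no new ones appear.

    `fetch` is a hypothetical callable returning the HTML for a URL,
    or None if the page cannot be loaded.
    """
    seed_host = urlparse(seed_url).netloc
    seen = {seed_url}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc != seed_host:
                continue  # assumption: stay on the seed page's host
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen


# A tiny in-memory "website" stands in for real HTTP requests.
SITE = {
    "https://example.com/": (
        '<a href="/about">About</a> <a href="/blog">Blog</a> '
        '<a href="https://other.example/">Elsewhere</a>'
    ),
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/blog": '<a href="/blog/post-1">Post</a>',
    "https://example.com/blog/post-1": '<a href="/blog">Back</a>',
}

found = discover_links("https://example.com/", SITE.get)
print(sorted(found))
```

The `seen` set is what makes the loop stop: a link is queued only the first time it is encountered, so pages that link back to each other do not cause endless repetition.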

Once the crawler has iterated through the links, it requests either the full web pages or a selected set of content from them. Requesting selected content from web pages is referred to as scraping, which is a subset of crawling. In this article, we use the general term crawling for this step as well.

The requested content is saved to a database that is structured as an index; this step is referred to as indexing. For each web page, the index holds a document that contains the selected content crawled from that page. So when a visitor uses the search engine to find content, the search result includes the crawled content as well as a link to the web page from which the content was crawled.
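To make the idea of an index concrete, here is a simplified sketch. It assumes an inverted index (a mapping from each word to the pages containing it), which is one common way to structure a search index and not necessarily how any particular search engine stores its data. The `crawled` pages and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical crawl result: URL -> selected content from that page.
crawled = {
    "https://example.com/": "Welcome to our search product.",
    "https://example.com/blog": "How a crawler builds a search index.",
    "https://example.com/about": "About the team.",
}


def build_index(pages):
    """Store one document per page and map each word to the pages using it."""
    documents = {}
    inverted = defaultdict(set)
    for url, content in pages.items():
        documents[url] = content
        for word in content.lower().split():
            inverted[word.strip(".,!?")].add(url)
    return documents, inverted


def search(query, documents, inverted):
    """Return (link, crawled content) pairs for pages containing the query word."""
    urls = inverted.get(query.lower(), set())
    return [(url, documents[url]) for url in sorted(urls)]


documents, inverted = build_index(crawled)
for url, content in search("crawler", documents, inverted):
    print(url, "->", content)
```

Each search result carries both the stored content and the link back to the page, which mirrors what the visitor sees in the search results.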

AJAX Crawling: Rendering Before Crawling

AJAX websites offer visitors a rich, interactive user experience without having to reload the whole web page each time new content is shown. While the user experience may be great, requesting content from AJAX web pages creates some challenges.

On static web pages, the content can be found in the page source. On AJAX web pages, however, most of the content is generated dynamically: AJAX technologies transfer the data and render the content only after the page has loaded and its scripts have executed.

Crawlers are able to crawl static web pages, but a crawler that only requests the page source cannot crawl dynamically generated content. This is because such a crawler does not execute the scripts that load the content into the browser’s memory and render it visible to the visitor.
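The difference is easy to demonstrate. The sketch below extracts text from two hypothetical page sources the way a crawler that does not run scripts would see them. The page contents, the `/api/products` endpoint, and the helper names are made up for illustration.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collects text outside <script> and <style>, without running any scripts."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Static page: the content is already in the page source.
STATIC_PAGE = "<html><body><h1>Products</h1><p>Our product list.</p></body></html>"

# AJAX page: the source holds only an empty container and a script
# that would fill it in a browser.
AJAX_PAGE = """<html><body>
<div id="app"></div>
<script>
  fetch("/api/products").then(function (r) { return r.json(); }).then(function (data) {
    document.getElementById("app").textContent = data.title;
  });
</script>
</body></html>"""

print(repr(extract_text(STATIC_PAGE)))
print(repr(extract_text(AJAX_PAGE)))
```

For the static page the extractor finds the heading and paragraph text, while for the AJAX page it finds nothing, because the content only comes into existence once a browser executes the script.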

If crawlers can’t execute AJAX requests, how does AJAX crawling work? In short, the web page needs to be loaded in a browser, where the AJAX requests and the JavaScript that renders the page elements are executed. Only after this can the AJAX page be exported as a static HTML page, which can then be requested and indexed. This process is referred to as prerendering.

Two Ways of Prerendering AJAX Web Pages

The idea of prerendering is to make a website crawler friendly for any search engine that crawls and indexes websites. It does this by producing static HTML snapshots of the website’s pages. Prerendering an AJAX site is therefore also a form of search engine optimization (SEO).

Prerendering can be implemented with libraries and frameworks that are available for various programming languages. There are also service providers that offer prerendering through middleware, which gives separate responses to crawler requests and user requests.

Automation Frameworks And Headless Browsers

Programming languages commonly used for crawling web pages include Python and JavaScript (with the Node.js runtime). When crawling AJAX web pages, these languages are used together with automation frameworks, such as Selenium, and with so-called headless browsers.

Selenium is a browser automation framework that makes automated interaction with the browser fairly easy to set up. While Selenium is not mandatory for AJAX crawling, it is handy when crawling the content requires visitor interactions, for instance stepping through paginated AJAX content.

A headless browser is a web browser without a graphical user interface. It can therefore be run from the command line, for instance on a web server. Headless browsers commonly used for prerendering include PhantomJS, Headless Chrome, and Firefox in headless mode.

Headless browsers are the most important component in prerendering dynamically generated AJAX web pages into crawlable static HTML pages. As stated earlier in this article, the browser is the tool that requests the AJAX content and runs the JavaScript that renders the content into the browser’s memory.
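As a rough illustration of how these pieces fit together, the following sketch uses Selenium with headless Chrome to load a page, wait for the rendered content, and save the result as a static HTML snapshot. It assumes that the `selenium` package and a Chrome installation are available. The function names, the snapshot directory layout, and the `#app h1` selector in the usage comment are hypothetical choices made for this example.

```python
from pathlib import Path
from urllib.parse import urlparse


def snapshot_path(url, out_dir="snapshots"):
    """Map a URL to a file path for its static snapshot (hypothetical layout)."""
    parsed = urlparse(url)
    name = parsed.path.strip("/").replace("/", "_") or "index"
    return Path(out_dir) / parsed.netloc / f"{name}.html"


def prerender(url, wait_selector, timeout=10):
    """Load `url` in headless Chrome, wait for rendered content, return the HTML."""
    # Imported here so that snapshot_path() works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # run Chrome without a window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the element that the page's JavaScript renders is present.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_selector))
        )
        return driver.page_source  # the DOM after the scripts have run
    finally:
        driver.quit()


def save_snapshot(url, html, out_dir="snapshots"):
    """Write the prerendered HTML to disk and return the file path."""
    path = snapshot_path(url, out_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


# Example usage (requires Selenium and Chrome):
# html = prerender("https://example.com/products", "#app h1")
# save_snapshot("https://example.com/products", html)
```

Waiting for a specific element rather than a fixed delay is a design choice: the snapshot is taken as soon as the content the crawler needs has actually been rendered, and the call fails with a timeout if it never appears.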

Prerendering As A Service

Prerendering can also be acquired as a service. The service prerenders AJAX pages and saves them as static HTML pages. It also provides middleware that responds differently to crawler requests and user requests.

A user request is routed normally to the dynamically generated page, which the browser renders by making the AJAX requests and executing the JavaScript. Because crawlers can only crawl static HTML content, the middleware routes a crawler’s request to the static HTML page prerendered by the service provider. The middleware recognizes a crawler’s request by the crawler’s user-agent.
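The routing decision made by such middleware can be sketched as follows. This is a simplified illustration and not the implementation of any particular prerendering service. The `CRAWLER_TOKENS` list is deliberately short and incomplete, and the function names and sample pages are hypothetical.

```python
# Illustrative, incomplete list of substrings found in crawler user-agents.
CRAWLER_TOKENS = ("googlebot", "bingbot", "duckduckbot")


def is_crawler(user_agent):
    """Guess from the User-Agent header whether the request comes from a crawler."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def handle_request(path, user_agent, snapshots, app_shell):
    """Serve the prerendered snapshot to crawlers and the dynamic page to visitors."""
    if is_crawler(user_agent) and path in snapshots:
        return snapshots[path]
    return app_shell


# Hypothetical prerendered snapshot and dynamic page shell.
SNAPSHOTS = {"/products": "<html><body><h1>Products</h1></body></html>"}
APP_SHELL = (
    '<html><body><div id="app"></div>'
    '<script src="/app.js"></script></body></html>'
)

crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
visitor_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/120.0"

print(handle_request("/products", crawler_ua, SNAPSHOTS, APP_SHELL))
print(handle_request("/products", visitor_ua, SNAPSHOTS, APP_SHELL))
```

In this sketch, a crawler requesting a page without a snapshot falls back to the dynamic page, so a missing snapshot never results in an error response.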

The idea of a prerendering service is to make an AJAX website crawler friendly, which also makes SEO possible. One commercial service is Prerender.io, which is based on open source software and can also be installed on your own web server. For more information, visit https://prerender.io/.

Conclusion

In this article, we have learned that crawling is a process through which the pages of a website are indexed for use by a search engine. AJAX crawling, however, requires a few extra steps to convert the dynamic content into a static format that can be crawled. You can do this by setting up an environment that combines automation frameworks with headless browsers. There are also services that make it easier to serve static content to crawlers.

If you prefer a service that takes care of AJAX crawling and also provides a search engine that is easy to set up on your site, contact our sales team and order a demo.