What is Crawling?
Crawling is the process of collecting content from a website into an index that a search engine can access. It is carried out by a crawler, an application that visits a website, searches its pages for links to the site's other web pages, and collects them.
Usually, the start page or homepage acts as what is called the seed page. This is where the crawler begins searching for links to the pages of the website. When the crawler finds links on a page, it follows them to the pages they refer to. It then follows the links found on those pages and repeats this procedure until there are no new links to follow. During this process, all of the links to the pages are collected. In addition to the seed page, the crawler can use the links found in a sitemap.
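The link-following loop described above can be sketched in a few lines of Python. The sketch below substitutes an in-memory dict of pages for real HTTP requests so it stays self-contained; the page names and HTML are invented for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(site, seed):
    """Breadth-first crawl starting from the seed page.

    `site` maps a URL to its HTML, standing in for HTTP requests.
    Returns the set of pages reachable from the seed.
    """
    visited = set()
    frontier = [seed]
    while frontier:
        url = frontier.pop(0)
        if url in visited or url not in site:
            continue
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(site[url])
        frontier.extend(parser.links)
    return visited

# A tiny in-memory website: the homepage (seed) links to two
# pages, one of which links onward to a third.
site = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": "<p>About us</p>",
    "/blog": '<a href="/blog/post-1">Post 1</a>',
    "/blog/post-1": '<a href="/">Home</a>',
}

print(sorted(crawl(site, "/")))
# → ['/', '/about', '/blog', '/blog/post-1']
```

Note how the already-visited check stops the loop even though the last post links back to the homepage, which is exactly the "until there are no new links" condition above.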
When the crawler has iterated through the links, the web pages, or a set of selected content from them, are requested. Requesting only selected content from web pages is referred to as scraping, which is a subset of crawling. In this article, we refer to crawling in general rather than scraping when discussing requesting selected content from web pages.
The requested content is saved to a database structured as an index; this step is referred to as indexing. In practice, the selected content crawled from each web page is stored as a reference document in the index.
So when a visitor uses the search engine to find content, the search results include the crawled content as well as a link to the web page from which the content was crawled.
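A minimal way to picture how an index maps crawled content back to its source page is an inverted index, sketched below with made-up page URLs and contents:

```python
from collections import defaultdict

def build_index(documents):
    """Build an inverted index: each word maps to the URLs of
    the pages whose crawled content contains it."""
    index = defaultdict(set)
    for url, content in documents.items():
        for word in content.lower().split():
            index[word].add(url)
    return index

def search(index, query):
    """Return the URLs of all pages containing every query word."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Content crawled from each page, keyed by the page's URL.
documents = {
    "/blog/post-1": "crawling collects content from a website",
    "/blog/post-2": "prerendering makes ajax content crawlable",
}

index = build_index(documents)
print(search(index, "content"))       # both pages mention "content"
print(search(index, "ajax content"))  # only post-2 contains both words
```

The search result carries the page URL alongside the matched content, which is what lets the search engine link back to the crawled page.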
AJAX Crawling: Rendering Before Crawling
AJAX websites offer visitors a rich, interactive user experience without reloading the whole web page each time new content is consumed. While the user experience may be great, requesting content from AJAX web pages poses some challenges.
On static web pages, the content can be found in the page source. On AJAX web pages, however, most of the content is generated dynamically: it is transferred and rendered only after the page has loaded and its scripts have been executed.
Crawlers can request static web pages, but they cannot crawl dynamically generated content on their own, because they are unable to execute the scripts that render the content into the memory of a web browser and make it visible to the visitor. The scripts must first be executed in a browser environment; only then can the AJAX page be exported as a static HTML page, which can in turn be requested and indexed. This process is referred to as prerendering.
Two Ways of Prerendering AJAX Web Pages
The idea of prerendering is to make a website crawler-friendly for any search engine that crawls and indexes websites. This is done by producing static HTML snapshots of the site's web pages. Prerendering an AJAX site is thus also a form of Search Engine Optimization (SEO).
Prerendering can be set up with libraries and frameworks available for various programming languages. There are also service providers that offer prerendering through a middleware that serves separate responses to crawler and user requests.
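The way such a middleware decides which response to serve can be sketched as below. The function, the signature list, and the HTML strings are hypothetical; a real middleware would match against a maintained, much longer list of crawler user agents:

```python
# User-agent substrings identifying some common crawlers
# (an illustrative subset, not an exhaustive list).
CRAWLER_SIGNATURES = ("googlebot", "bingbot", "duckduckbot", "baiduspider")

def choose_response(user_agent, snapshot_html, app_html):
    """Serve the prerendered snapshot to crawlers and the
    normal AJAX application to human visitors."""
    ua = (user_agent or "").lower()
    if any(sig in ua for sig in CRAWLER_SIGNATURES):
        return snapshot_html
    return app_html

snapshot = "<html><body><h1>Static snapshot</h1></body></html>"
app = "<html><body><script src='app.js'></script></body></html>"

print(choose_response("Mozilla/5.0 (compatible; Googlebot/2.1)", snapshot, app))
# → the static snapshot
```

User-agent sniffing like this is the common approach, which is why the same URL can answer a crawler with static HTML and a visitor with the scripted AJAX page.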
Automation Frameworks And Headless Browsers
Selenium is a browser automation framework with which automating browser interaction is fairly easy to set up. While Selenium is not mandatory for AJAX crawling, it is handy when a visitor's interactions, for instance iterating through paginated AJAX content that requires user input, are needed to crawl the content.
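As a rough sketch, and assuming the `selenium` package and a matching Chrome driver are installed, prerendering a single page with Selenium and Headless Chrome might look like this:

```python
def prerender(url, wait_seconds=2.0):
    """Load an AJAX page in Headless Chrome and return the
    rendered DOM as static HTML."""
    import time
    # Imported lazily so the sketch only needs Selenium when called.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Crude wait for scripts to finish rendering; in practice,
        # Selenium's WebDriverWait with an expected condition is
        # more reliable than a fixed sleep.
        time.sleep(wait_seconds)
        return driver.page_source
    finally:
        driver.quit()
```

The HTML string returned here is the static snapshot that can then be stored and served to crawlers.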
A headless browser is a web browser without a graphical interface. It can therefore be used from a command line, for instance on a web server. The most common headless browsers that can be used for prerendering are PhantomJS, Headless Chrome, and Firefox Headless Mode.
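For example, Headless Chrome can dump a rendered page's DOM straight from the command line (the binary name varies by platform, e.g. `chromium` or `google-chrome`):

```shell
# Load the page, execute its scripts, and write the resulting
# DOM to a file as a static HTML snapshot.
google-chrome --headless --dump-dom https://example.com > snapshot.html
```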
Prerendering As A Service
Prerendering can also be acquired as a service. Such a service renders AJAX pages into static HTML pages and typically includes a middleware that serves separate responses to crawler and user requests.
The idea of a prerendering service is to make an AJAX website crawler-friendly, thus also enabling SEO. One commercial example is the open-source-based Prerender.io, which can also be installed on your own web server.
Crawling is the process through which the pages of a website are indexed for use by a search engine. AJAX crawling, however, requires a few extra steps to convert dynamic content into a static format that can be crawled.
You can do this by setting up an environment with automation frameworks and headless browsers, or use a service that makes it easier to serve static content to crawlers.