The use of Node.js has made JavaScript a more powerful web scraping language. Node.js, also written as Node js or nodejs, runs JavaScript code without the need for a browser.
The Node.js Package Manager (npm) also makes it easier to scrape with Node.js thanks to its huge collection of libraries. JavaScript web scraping with Node.js isn't only easier, it has a low learning curve for those who already know JavaScript, and it provides the speed that successful web scraping needs.
In this JavaScript web scraping tutorial, we will discuss all there is to know about web scraping using JavaScript in a real-life scenario. This will help you understand how JavaScript and Node.js work together in web scraping.
All you need is a basic understanding of JavaScript, the Chrome or Firefox developer tools, and some knowledge of CSS or jQuery selectors.
Requirements for This Tutorial
In this web scraping tutorial, you will need two pieces of software:
- Any code editor you are comfortable with
- Node.js which comes with the Node.js Package Manager
You need to know at this point that Node.js is a runtime environment. This means that JavaScript code doesn't need a browser to run. Node.js is available for Windows, macOS, and Linux, and you can download it from the official Node.js website.
Node.js
Setting It Up
Before you begin writing your code to scrape using Node.js, create a folder to store JavaScript files. These are the files where all the code you require for web scraping would be stored. After creating the folder, navigate to it and run the initialization command:
npm init -y
Doing this will create a package.json file in the directory that will contain all information about the packages that have been installed in the folder. The next thing to do is the installation of the Node.js Packages.
The Available Packages
To carry out web scraping using Node.js, certain packages, referred to as libraries, have to be used. These libraries contain prepackaged code that can be reused. You can download and install them using the npm install command (npm being the Node.js Package Manager).
To install a package, run npm install followed by the package name, for example:
npm install axios
This command also supports installing more than one package at once. To install all the packages that will be used in this tutorial, run the following command:
npm install axios cheerio json2csv
Running this command downloads the packages into the node_modules directory and updates the package.json file.
Web Scraping Libraries
Cheerio
Cheerio is an efficient library that gives you the power of the jQuery API on the server side. It takes out all the inconsistencies of the DOM and removes all browser-related features, so that you are left with an efficient parsing API.
Cheerio is different from a web browser in that it doesn't execute JavaScript, render any parsed or manipulated elements, or load any external resources. So if your target website is JavaScript-heavy, you should consider using another library.
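As a quick illustration, here is a minimal sketch (using an inline HTML string rather than a live page) of loading markup into Cheerio and querying it with a jQuery-style selector:
const cheerio = require("cheerio");

// Load a small HTML snippet into Cheerio
const $ = cheerio.load('<ul><li class="book">Sherlock Holmes</li><li class="book">And Then There Were None</li></ul>');

// Query it with a jQuery-style selector and print each title
$("li.book").each(function () {
    console.log($(this).text());
});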
JSDOM
JSDOM is a library that parses HTML and lets you interact with the resulting DOM just as a browser would. Since it acts like a browser, it allows you to interact with your target web page programmatically, so you can perform actions like button clicks.
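For instance, here is a minimal sketch (assuming JSDOM has been installed with npm install jsdom) that parses an HTML string and triggers a button click programmatically:
const { JSDOM } = require("jsdom");

// Parse an HTML string into a browser-like DOM
const dom = new JSDOM('<button id="more">Load more</button>');
const button = dom.window.document.querySelector("#more");

// Interact with the page on a programming level, e.g. a button click
button.addEventListener("click", () => console.log("Button clicked"));
button.click();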
Puppeteer
Just as the name suggests, the Puppeteer library allows you to manipulate the browser programmatically. It does this by providing a high-level API that gives you automatic control of a headless version of the Chrome browser, and you can also tweak it to run non-headless.
Puppeteer is flexible enough to let you use features like taking a screenshot of a web page. This comes in handy, for example, when you want to keep a copy of your competitors' product pages.
Puppeteer is preferred to other tools because of its interaction with the browser. It crawls the web page in a human-like manner, making it more difficult for your bot to be detected and blocked.
Other features and possibilities that Puppeteer brings are automated user interaction with the browser, such as form submissions and page navigation, web page PDF generation, and crawling pages to generate pre-rendered content.
You can install puppeteer by running the following command: npm install puppeteer
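As an example of the screenshot use case mentioned above, here is a minimal sketch that opens a page in headless Chrome and saves an image of it (the file name is just an illustration):
const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto("http://books.toscrape.com");
    await page.screenshot({ path: "books.png" }); // save a screenshot of the page
    await browser.close();
})();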
Nightmare
Nightmare is a great alternative to Puppeteer if for some reason you need a different browser automation library. You can install Nightmare with the following command: npm install nightmare
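Here is a minimal sketch of how Nightmare might be used to visit a page and read its heading; the selector is just an illustration:
const Nightmare = require("nightmare");
const nightmare = Nightmare({ show: false }); // run without a visible browser window

nightmare
    .goto("http://books.toscrape.com")
    .evaluate(() => document.querySelector("h1").innerText) // runs inside the page
    .end()
    .then((heading) => console.log(heading))
    .catch((error) => console.error(error));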
Regular Expressions
Using regular expressions makes it simple to begin web scraping without many dependencies. This simplicity comes with a big con: a loss of flexibility, as users often find it difficult to write the correct expressions.
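To show both the simplicity and the fragility, here is a minimal sketch that pulls the page title out of raw HTML with a regular expression (the page is fetched with axios here purely for illustration):
const axios = require("axios");

(async () => {
    const response = await axios.get("http://books.toscrape.com");
    // A naive pattern: it breaks easily if the markup changes even slightly
    const match = response.data.match(/<title>([\s\S]*?)<\/title>/);
    console.log(match ? match[1].trim() : "No title found");
})();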
A Practical Example and Steps of JavaScript Web Scraping
A very common use of JavaScript web scraping is to scrape eCommerce stores. You can start with a fictional book store: http://books.toscrape.com. It's similar to a real eCommerce store, but it was made for practicing web scraping.
Create The Selectors
Creating selectors is the first step to take before you begin JavaScript web scraping. This is to identify the specific elements that would be queried.
To begin, open the URL using Chrome or Firefox: http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
Once the page opens, right-click on the title of the relevant genre, 'Mystery', and then select Inspect. Doing this should open the Developer Tools with the h1 element containing 'Mystery' selected in the Elements tab. You can create the selector by right-clicking that h1 tag in the Developer Tools, going to Copy, and clicking Copy selector. It will produce something like this:
#content_inner > article > div.row > div.col-sm-6.product_main > h1
This is a valid selector and it will work well, but the pitfall of this method is the long selector it creates, which makes the code difficult to understand and maintain.
After looking at the page a while longer, you will notice that there is only one h1 tag on it. Because of this, you can create a much shorter selector than the previous one:
h1
You can also use a tool like the SelectorGadget extension for Chrome to quickly create selectors. It's a useful tool for scraping the web using JavaScript.
Note that even though this works most of the time, there are cases where it will fail.
Scrape The Genre Of Interest
In scraping the genre, the first thing to do is define the constants that will hold references to Cheerio and Axios:
const cheerio = require("cheerio");
const axios = require("axios");
Save the address of the page to be scraped in a variable named url for readability:
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
Axios has a get() method that sends the HTTP GET request. It is an asynchronous method and thus needs the await prefix:
const response = await axios.get(url);
If you need to pass more headers like User-Agent, this can be sent as the second parameter:
const response = await axios.get(url, {
    headers: {
        "User-Agent": "custom-user-agent string",
    },
});
This site doesn’t need any unique header, making learning easier.
Axios is compatible with both the Promise pattern and the async-await pattern; in this tutorial, we will focus on the async-await pattern. The response has several properties, such as headers and data. The HTML we need is in the data property, and it can be loaded into an object for querying with the cheerio.load() method.
const $ = cheerio.load(response.data);
Cheerio's load() method returns a reference to the document, which can be stored in a constant under any name. To make the web scraping code look more like jQuery, a $ can be used instead of a name.
This specific element can be found easily within the document just by writing $("h1").
The text() method is used often when writing scraping code in JavaScript, as it gets the text inside any selected element. The genre can be extracted and saved in a local variable:
const genre = $("h1").text();
Lastly, console.log() will print the value of the variable to the console:
console.log(genre);
For the avoidance of errors, the code will have a try-catch block around it. Remember that it's good practice to use console.error for any errors, and console.log for other messages.
Below is the complete code put together. Save the code as genre.js in the folder you created at the beginning, where you ran the command npm init.
const cheerio = require("cheerio");
const axios = require("axios");
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

async function getGenre() {
    try {
        const response = await axios.get(url);
        const document = cheerio.load(response.data);
        const genre = document("h1").text();
        console.log(genre);
    } catch (error) {
        console.error(error);
    }
}

getGenre();
Finally, you need to run this JavaScript web scraping code using Node.js. Open the terminal and run the following command:
node genre.js
The code output will be the genre name:
Mystery
Scrape Target Book Listings
Let’s try scraping book listings from the web page of the mystery genre:
http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
The first thing to do is to analyze the page and understand the HTML structure used. Open the page in the Chrome browser and press F12 to examine the elements.
Each book is contained in its own article element.
Here is the code to begin extracting the book titles:
const books = $("article"); //Selector to get all books
books.each(function ()
\ { //running a loop
title = $(this).find("h3 a").text(); //extracting book title
console.log(title);//print the book title
});
By analyzing the code, you will see that the extracted details need to be saved inside the loop. It is best to store the values in an array; other book attributes can also be extracted and stored as JSON objects in the same array.
Below is the complete code. Begin by creating a new file, paste the code, and save it as books.js in the same folder from before where you ran npm init.
const cheerio = require("cheerio");
const axios = require("axios");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];

async function getBooks(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);
        const books = $("article");
        books.each(function () {
            const title = $(this).find("h3 a").text();
            const price = $(this).find(".price_color").text();
            const stock = $(this).find(".availability").text().trim();
            books_data.push({ title, price, stock }); // store in the array
        });
        console.log(books_data); // print the array
    } catch (err) {
        console.error(err);
    }
}

getBooks(mystery);
From the terminal, run the following file using Node.js
node books.js
Doing this should print an array of books to the console. This is, however, limited to the single page it scrapes.
How to Handle Pagination
Book listings like this are usually spread across multiple pages. Websites paginate in different ways, but the most common is a 'next' button on every page except the last one.
To handle pagination in a situation like this, create a selector for the next-page link. If it matches an element, take its href attribute value and call the getBooks function with the new URL recursively. Add the following right after the books.each() loop:
if ($(".next a").length > 0) {
\ next_page = baseUrl + $(".next a").attr("href"); //converting to absolute URL
\ getBooks(next_page); //recursive call to the same function with new URL
}
Notice that in the above code, the href returned is a relative URL. To change it to an absolute URL, you need to concatenate a fixed part to it. That fixed part of the URL is stored in the baseUrl variable:
const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
Once the scraper gets to the last page, which has no next button, the recursive calls stop and the array will contain the information from every page. The final step of scraping data using Node.js is saving the extracted data.
Save Extracted Data to a CSV File
Scraping the web with JavaScript has been an easy process so far, and it's even easier to save the data into a CSV file. You can do it with two packages: fs and json2csv. The fs package is built into Node.js and represents the file system, whereas json2csv requires installation using the following command:
npm install json2csv
After installation, create a constant to store the package's Parser:
const j2cp = require("json2csv").Parser;
You will need access to the file system to write the file to disk. Initialize the fs package to do this:
const fs = require("fs");
In the code, find the line where the array with the scraped data is complete, and insert the following code to create the CSV file:
const parser = new j2cp();
const csv = parser.parse(books_data); // json to CSV in memory
fs.writeFileSync("./books.csv", csv); // CSV is now written to disk
Below is the complete script. Save it as a .js file in the Node.js project folder. Once you run it with the node command in the terminal, data from every page will be extracted and made available in the books.csv file.
const fs = require("fs");
const j2cp = require("json2csv").Parser;
const axios = require("axios");
const cheerio = require("cheerio");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];

async function getBooks(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);
        const books = $("article");
        books.each(function () {
            const title = $(this).find("h3 a").text();
            const price = $(this).find(".price_color").text();
            const stock = $(this).find(".availability").text().trim();
            books_data.push({ title, price, stock });
        });
        // console.log(books_data);
        const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
        if ($(".next a").length > 0) {
            const next = baseUrl + $(".next a").attr("href");
            getBooks(next);
        } else {
            const parser = new j2cp();
            const csv = parser.parse(books_data);
            fs.writeFileSync("./books.csv", csv);
        }
    } catch (err) {
        console.error(err);
    }
}

getBooks(mystery);
Run the file using Node.js from the terminal:
node books.js
We now have a new file, books.csv, which holds all the desired data. You can view it in any spreadsheet program of your choice.
HTTP Clients That Assist With The Web Scraping Process
Axios
Axios is an HTTP client that runs both in the browser and in Node.js. It comes with promise support and is straightforward to use for making HTTP requests.
Axios is available on GitHub and can be installed using the following command: npm install axios
Request
The Request client is one of the most popular HTTP clients thanks to its features and simplicity. It's available on GitHub and can be installed using the following command: npm install request
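Request uses a callback style rather than promises. Here is a minimal sketch of a GET request with it (note that the request package has since been deprecated, so it is mainly of interest when maintaining older code):
const request = require("request");

request("http://books.toscrape.com", (error, response, body) => {
    if (error) {
        return console.error(error);
    }
    console.log(response.statusCode); // HTTP status code
    console.log(body.length);         // length of the raw HTML of the page
});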
Superagent
This is another HTTP client that assists with web scraping. It supports promises as well as async/await syntax sugar, and even though it's fairly straightforward to use, it's not as popular as the others.
It’s available on GitHub and you can install it with the following command: npm install superagent
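Here is a minimal sketch of a GET request with Superagent using the async-await style:
const superagent = require("superagent");

(async () => {
    try {
        const response = await superagent.get("http://books.toscrape.com");
        console.log(response.status);      // HTTP status code
        console.log(response.text.length); // length of the raw HTML body
    } catch (error) {
        console.error(error);
    }
})();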
Basic JavaScript Web Scraping Steps
Most JavaScript or Node.js web scraping consists of three basic steps:
- The HTTP request would be sent
- The HTTP response would be parsed and data extracted
- The resulting data would be saved in a persistent storage such as a database
We will now see how Axios can be used to send the HTTP request, Cheerio to parse the response and for data extraction, and then saving the extracted data to CSV with json2csv.
Send The HTTP Request
JavaScript web scraping begins with finding a package that can send the HTTP request and return the response. The request package and its promise wrapper, request-promise, used to be popular but have since been deprecated.
Many examples and older code samples still use those packages, and Axios is a good alternative. It's compatible with both the Promise syntax and the async-await syntax.
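Both styles look like this with Axios (a minimal sketch, requesting the same book store used throughout this tutorial):
const axios = require("axios");

// Promise syntax
axios
    .get("http://books.toscrape.com")
    .then((response) => console.log(response.status))
    .catch((error) => console.error(error));

// async-await syntax
async function fetchPage() {
    try {
        const response = await axios.get("http://books.toscrape.com");
        console.log(response.status);
    } catch (error) {
        console.error(error);
    }
}

fetchPage();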
Parse The HTTP Response with Cheerio
The Cheerio package is also installed from npm. What makes it valuable is that it converts the raw HTML captured by Axios into something that can be queried with a jQuery-like syntax.
Since most JavaScript developers are familiar with jQuery, Cheerio is a very good choice for extracting data from HTML.
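For instance, anyone who has written jQuery will recognize this minimal sketch, which loads a small HTML string and reads both the text and an attribute of a link (the markup is just an illustration):
const cheerio = require("cheerio");

// The raw HTML string (response.data from axios) goes straight into load()
const $ = cheerio.load('<a class="title" href="/catalogue/book_1/index.html">A Book</a>');

console.log($("a.title").text());       // A Book
console.log($("a.title").attr("href")); // /catalogue/book_1/index.html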