The use of Node.js has made JavaScript a more powerful web scraping language. Node.js, also written as Node js or nodejs, runs JavaScript code without the need for a browser.
The Node.js Package Manager (npm) also makes it easier to scrape with Node.js thanks to its huge collection of libraries. JavaScript web scraping with Node.js isn't only easier, it has a low learning curve for those who already know JavaScript, and it provides the speed that successful web scraping needs.
In this JavaScript web scraping tutorial, we will discuss all there is to know about web scraping using JavaScript in a real-life scenario. This will help you understand how JavaScript and Node.js work together in web scraping.
All you need is a basic understanding of JavaScript, the Chrome or Firefox developer tools, and some knowledge of CSS or jQuery selectors.
Requirements for This Tutorial
In this web scraping tutorial, you will need two pieces of software:
- Any code editor you are comfortable with
- Node.js which comes with the Node.js Package Manager
You need to know at this point that Node.js is a runtime environment. This means that JavaScript code doesn't need a browser to run. Node.js is available for Windows, macOS, and Linux, and you can download it from the official Node.js website.
Node.js
Setting It Up
Before you begin writing your code to scrape using Node.js, create a folder to store JavaScript files. These are the files where all the code you require for web scraping would be stored. After creating the folder, navigate to it and run the initialization command:
npm init -y
Doing this will create a package.json file in the directory that will contain all information about the packages that have been installed in the folder. The next thing to do is the installation of the Node.js Packages.
The Available Packages
To carry out web scraping using Node.js, certain packages, referred to as libraries, have to be used. These libraries contain prepackaged code that can be reused. You can download and install them using the npm install command (npm being the Node.js Package Manager).
To install a package, run npm install followed by the package name, for example:
npm install axios
This command also supports installing more than one package at once. To install all the packages that will be used in this tutorial, run the following command:
npm install axios cheerio json2csv
Running this command downloads the packages into the node_modules directory and updates the package.json file.
Web Scraping Libraries
Cheerio
Cheerio is an efficient library that gives you the power of the jQuery API on the server side. It takes out all the inconsistencies of the DOM and removes all browser-related features, so that you are left with an efficient parsing API.
Cheerio is different from a web browser in that it doesn't execute JavaScript, render any parsed or manipulated elements, or load any external resources. So if your target website is JavaScript-heavy, you should consider using another library.
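As a quick illustration, here is a minimal sketch (using an inline HTML string rather than a live page) of loading markup into Cheerio and querying it with a jQuery-style selector:
const cheerio = require("cheerio");

// Load a small HTML snippet into Cheerio
const $ = cheerio.load('<ul><li class="book">Sherlock Holmes</li><li class="book">And Then There Were None</li></ul>');

// Query it with a jQuery-style selector and print each title
$("li.book").each(function () {
    console.log($(this).text());
});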
JSDOM
JSDOM is a library that parses HTML and lets you interact with the resulting DOM just as a browser would. Since it acts like a browser, it allows you to interact with your target web page programmatically, so you can perform actions like button clicks.
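For instance, here is a minimal sketch (assuming JSDOM has been installed with npm install jsdom) that parses an HTML string and triggers a button click programmatically:
const { JSDOM } = require("jsdom");

// Parse an HTML string into a browser-like DOM
const dom = new JSDOM('<button id="more">Load more</button>');
const button = dom.window.document.querySelector("#more");

// Interact with the page on a programming level, e.g. a button click
button.addEventListener("click", () => console.log("Button clicked"));
button.click();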
Puppeteer
Just as the name suggests, the Puppeteer library allows you to manipulate the browser programmatically. It does this by providing a high-level API that gives you automatic control of a headless version of the Chrome browser, and you can also tweak it to run non-headless.
Puppeteer is flexible enough to let you use features like taking a screenshot of a web page. This comes in handy, for example, when you want to keep a copy of your competitors' product pages.
Puppeteer is preferred to other tools because of its interaction with the browser. It crawls the web page in a human-like manner, making it more difficult for your bot to be detected and blocked.
Other features and possibilities that Puppeteer brings are automated user interaction with the browser, such as form submissions and page navigation, web page PDF generation, and crawling pages to generate pre-rendered content.
You can install puppeteer by running the following command: npm install puppeteer
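As an example of the screenshot use case mentioned above, here is a minimal sketch that opens a page in headless Chrome and saves an image of it (the file name is just an illustration):
const puppeteer = require("puppeteer");

(async () => {
    const browser = await puppeteer.launch(); // headless by default
    const page = await browser.newPage();
    await page.goto("http://books.toscrape.com");
    await page.screenshot({ path: "books.png" }); // save a screenshot of the page
    await browser.close();
})();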
Nightmare
Nightmare is a great alternative to Puppeteer if for some reason you need a different browser automation library. You can install Nightmare with the following command: npm install nightmare
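Here is a minimal sketch of how Nightmare might be used to visit a page and read its heading; the selector is just an illustration:
const Nightmare = require("nightmare");
const nightmare = Nightmare({ show: false }); // run without a visible browser window

nightmare
    .goto("http://books.toscrape.com")
    .evaluate(() => document.querySelector("h1").innerText) // runs inside the page
    .end()
    .then((heading) => console.log(heading))
    .catch((error) => console.error(error));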
Regular Expressions
Using regular expressions makes it simple to begin web scraping without many dependencies. This simplicity comes with a big con: a loss of flexibility, as users often find it difficult to write the correct expressions.
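To show both the simplicity and the fragility, here is a minimal sketch that pulls the page title out of raw HTML with a regular expression (the page is fetched with axios here purely for illustration):
const axios = require("axios");

(async () => {
    const response = await axios.get("http://books.toscrape.com");
    // A naive pattern: it breaks easily if the markup changes even slightly
    const match = response.data.match(/<title>([\s\S]*?)<\/title>/);
    console.log(match ? match[1].trim() : "No title found");
})();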
A Practical Example and Steps of JavaScript Web Scraping
A very common use of JavaScript web scraping is to scrape eCommerce stores. You can start with a fictional book store: http://books.toscrape.com. It's similar to a real eCommerce store, but it was made for practicing web scraping.
Create The Selectors
Creating selectors is the first step to take before you begin JavaScript web scraping. This is to identify the specific elements that would be queried.
To begin, open the URL using Chrome or Firefox: http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
Once the page opens, right-click on the title of the relevant genre, 'Mystery', and then select Inspect. Doing this should open the Developer Tools with the h1 element containing 'Mystery' selected in the Elements tab. You can create the selector by right-clicking that h1 tag in the Developer Tools, going to Copy, and clicking Copy selector. It will produce something like this:
#content_inner > article > div.row > div.col-sm-6.product_main > h1
This is a valid selector and it will work well, but the pitfall of this method is the long selector it creates, which makes the code difficult to understand and maintain.
After looking at the page a while longer, you will notice that there is only one h1 tag on it. Because of this, you can create a much shorter selector than the previous one:
h1
You can also use a tool like the SelectorGadget extension for Chrome to quickly create selectors. It's a useful tool for scraping the web using JavaScript.
Note that even though this works most of the time, there are cases where it will fail.
Scrape The Genre Of Interest
In scraping the genre, the first thing to do is define the constants that will hold references to Cheerio and Axios:
const cheerio = require("cheerio");
const axios = require("axios");
Save the address of the page to be scraped in a variable named url for readability:
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
Axios has a get() method that sends the HTTP GET request. It is an asynchronous method and thus needs the await prefix:
const response = await axios.get(url);
If you need to pass more headers like User-Agent, this can be sent as the second parameter:
const response = await axios.get(url, {
    headers: {
        "User-Agent": "custom-user-agent string",
    },
});
This site doesn’t need any unique header, making learning easier.
Axios is compatible with both the Promise pattern and the async-await pattern; in this tutorial, we will focus on the async-await pattern. The response has several properties, such as headers and data. The HTML we need is in the data property, and it can be loaded into an object for querying with the cheerio.load() method.
const $ = cheerio.load(response.data);
Cheerio's load() method returns a reference to the document, which can be stored in a constant under any name. To make the web scraping code look more like jQuery, a $ can be used instead of a name.
This specific element can be found easily within the document just by writing $("h1").
The text() method is used often when writing scraping code in JavaScript, as it gets the text inside any selected element. The genre can be extracted and saved in a local variable:
const genre = $("h1").text();
Lastly, console.log() will print the value of the variable to the console:
console.log(genre);
For the avoidance of errors, the code will have a try-catch block around it. Remember that it's good practice to use console.error for any errors, and console.log for other messages.
Below is the complete code put together. Save the code as genre.js in the folder you created at the beginning, where you ran the command npm init.
const cheerio = require("cheerio");
const axios = require("axios");
const url = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";

async function getGenre() {
    try {
        const response = await axios.get(url);
        const document = cheerio.load(response.data);
        const genre = document("h1").text();
        console.log(genre);
    } catch (error) {
        console.error(error);
    }
}

getGenre();
Finally, you need to run this JavaScript web scraping code using Node.js. Open the terminal and run the following command:
node genre.js
The code output will be the genre name:
Mystery
Scrape Target Book Listings
Let’s try scraping book listings from the web page of the mystery genre:
http://books.toscrape.com/catalogue/category/books/mystery_3/index.html
The first thing to do is to analyze the page and understand the HTML structure used. Open the page in the Chrome browser and press F12 to examine the elements.
Each book is contained in its own article element.
Here is the code to begin extracting the book titles:
const books = $("article"); //Selector to get all books
books.each(function ()
\ { //running a loop
title = $(this).find("h3 a").text(); //extracting book title
console.log(title);//print the book title
});
By analyzing the code, you will see that the extracted details need to be saved inside the loop. It is best to store the values in an array; other book attributes can also be extracted and stored as JSON objects in the same array.
Below is the complete code. Begin by creating a new file, paste the code, and save it as books.js in the same folder from before where you ran npm init.
const cheerio = require("cheerio");
const axios = require("axios");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];

async function getBooks(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);
        const books = $("article");
        books.each(function () {
            const title = $(this).find("h3 a").text();
            const price = $(this).find(".price_color").text();
            const stock = $(this).find(".availability").text().trim();
            books_data.push({ title, price, stock }); // store in the array
        });
        console.log(books_data); // print the array
    } catch (err) {
        console.error(err);
    }
}

getBooks(mystery);
From the terminal, run the following file using Node.js
node books.js
Doing this should print an array of books to the console. This is, however, limited to the single page it scrapes.
How to Handle Pagination
Book listings like this are usually spread across multiple pages. Websites paginate in different ways, but the most common is a 'next' button on every page except the last one.
To handle pagination in a situation like this, create a selector for the next-page link. If it matches an element, take its href attribute value and call the getBooks function with the new URL recursively. Add the following right after the books.each() loop:
if ($(".next a").length > 0) {
\ next_page = baseUrl + $(".next a").attr("href"); //converting to absolute URL
\ getBooks(next_page); //recursive call to the same function with new URL
}
Notice that in the above code, the href returned is a relative URL. To change it to an absolute URL, you need to concatenate a fixed part to it. That fixed part of the URL is stored in the baseUrl variable:
const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
Once the scraper gets to the last page, which has no next button, the recursive calls stop and the array will contain the information from every page. The final step of scraping data using Node.js is saving the extracted data.
Save Extracted Data to a CSV File
Scraping the web with JavaScript has been an easy process so far, and it's even easier to save the data into a CSV file. You can do it with two packages: fs and json2csv. The fs package is built into Node.js and represents the file system, whereas json2csv requires installation using the following command:
npm install json2csv
After installation, create a constant to store the package's Parser:
const j2cp = require("json2csv").Parser;
You will need access to the file system to write the file to disk. Initialize the fs package to do this:
const fs = require("fs");
In the code, find the line where the array with the scraped data is complete, and insert the following code to create the CSV file:
const parser = new j2cp();
const csv = parser.parse(books_data); // json to CSV in memory
fs.writeFileSync("./books.csv", csv); // CSV is now written to disk
Below is the complete script. Save it as a .js file in the Node.js project folder. Once you run it with the node command in the terminal, data from every page will be extracted and made available in the books.csv file.
const fs = require("fs");
const j2cp = require("json2csv").Parser;
const axios = require("axios");
const cheerio = require("cheerio");
const mystery = "http://books.toscrape.com/catalogue/category/books/mystery_3/index.html";
const books_data = [];

async function getBooks(url) {
    try {
        const response = await axios.get(url);
        const $ = cheerio.load(response.data);
        const books = $("article");
        books.each(function () {
            const title = $(this).find("h3 a").text();
            const price = $(this).find(".price_color").text();
            const stock = $(this).find(".availability").text().trim();
            books_data.push({ title, price, stock });
        });
        // console.log(books_data);
        const baseUrl = "http://books.toscrape.com/catalogue/category/books/mystery_3/";
        if ($(".next a").length > 0) {
            const next = baseUrl + $(".next a").attr("href");
            getBooks(next);
        } else {
            const parser = new j2cp();
            const csv = parser.parse(books_data);
            fs.writeFileSync("./books.csv", csv);
        }
    } catch (err) {
        console.error(err);
    }
}

getBooks(mystery);
Run the file using Node.js from the terminal:
node books.js
We now have a new file, books.csv, which holds all the desired data. You can view it in any spreadsheet program of your choice.
HTTP Clients That Assist With The Web Scraping Process
Axios
Axios is an HTTP client that runs both in the browser and in Node.js. It comes with promise support and is straightforward to use for making HTTP requests.
Axios is available on GitHub and can be installed using the following command: npm install axios
Request
The Request client is one of the most popular HTTP clients thanks to its features and simplicity. It's available on GitHub and can be installed using the following command: npm install request
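Request uses a callback style rather than promises. Here is a minimal sketch of a GET request with it (note that the request package has since been deprecated, so it is mainly of interest when maintaining older code):
const request = require("request");

request("http://books.toscrape.com", (error, response, body) => {
    if (error) {
        return console.error(error);
    }
    console.log(response.statusCode); // HTTP status code
    console.log(body.length);         // length of the raw HTML of the page
});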
Superagent
This is another HTTP client that assists with web scraping. It supports promises as well as async/await syntax sugar, and even though it's fairly straightforward to use, it's not as popular as the others.
It’s available on GitHub and you can install it with the following command: npm install superagent
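Here is a minimal sketch of a GET request with Superagent using the async-await style:
const superagent = require("superagent");

(async () => {
    try {
        const response = await superagent.get("http://books.toscrape.com");
        console.log(response.status);      // HTTP status code
        console.log(response.text.length); // length of the raw HTML body
    } catch (error) {
        console.error(error);
    }
})();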
Basic JavaScript Web Scraping Steps
Most JavaScript or Node.js web scraping consists of three basic steps:
- The HTTP request would be sent
- The HTTP response would be parsed and data extracted
- The resulting data would be saved in a persistent storage such as a database
We will now see how Axios can be used to send the HTTP request, Cheerio to parse the response and for data extraction, and then saving the extracted data to CSV with json2csv.
Send The HTTP Request
JavaScript web scraping begins with finding a package that can send the HTTP request and return the response. The request package and its promise wrapper, request-promise, used to be popular but have since been deprecated.
Many examples and older code samples still use those packages, and Axios is a good alternative. It's compatible with both the Promise syntax and the async-await syntax.
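Both styles look like this with Axios (a minimal sketch, requesting the same book store used throughout this tutorial):
const axios = require("axios");

// Promise syntax
axios
    .get("http://books.toscrape.com")
    .then((response) => console.log(response.status))
    .catch((error) => console.error(error));

// async-await syntax
async function fetchPage() {
    try {
        const response = await axios.get("http://books.toscrape.com");
        console.log(response.status);
    } catch (error) {
        console.error(error);
    }
}

fetchPage();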
Parse The HTTP Response with Cheerio
The Cheerio package is also installed from npm. What makes it valuable is that it converts the raw HTML captured by Axios into something that can be queried with a jQuery-like syntax.
Since most JavaScript developers are familiar with jQuery, Cheerio is a very good choice for extracting data from HTML.
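For instance, anyone who has written jQuery will recognize this minimal sketch, which loads a small HTML string and reads both the text and an attribute of a link (the markup is just an illustration):
const cheerio = require("cheerio");

// The raw HTML string (response.data from axios) goes straight into load()
const $ = cheerio.load('<a class="title" href="/catalogue/book_1/index.html">A Book</a>');

console.log($("a.title").text());       // A Book
console.log($("a.title").attr("href")); // /catalogue/book_1/index.html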