JavaScript Web Scraping Guide (Node.js, Puppeteer, Playwright) 2026
TL;DR: JavaScript is one of the best languages for web scraping in 2026. Use Cheerio for static HTML (fast, lightweight), Playwright for JavaScript-rendered pages (recommended over Puppeteer), and rotating residential proxies to avoid IP bans at scale. Key stack: Node.js + Playwright + got + cheerio + rotating-proxy endpoint.


Why JavaScript for Web Scraping?

JavaScript (Node.js) has unique advantages for web scraping:

  1. Same language as the web — most websites run JavaScript; scraping with it means you understand the target's code
  2. Async-first architecture — Node.js handles thousands of concurrent requests efficiently without threads
  3. Native browser automation — Playwright and Puppeteer are built primarily for Node.js
  4. Network interception — capturing XHR/fetch API responses is cleaner with the same event model the browser uses
  5. npm ecosystem — hundreds of scraping-related packages (got, axios, cheerio, playwright, puppeteer, p-limit)
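
Point 2 above is easy to see in miniature: the single-threaded event loop fans out concurrent work with Promise.all. The fake requests and timings below are illustrative, not from any real site:

```javascript
// Simulate three concurrent "requests" of ~50 ms each; because they
// run concurrently on the event loop, total wall time stays ~50 ms.
const fakeFetch = (id) =>
  new Promise(resolve => setTimeout(() => resolve(`page-${id}`), 50));

const start = Date.now();
const pages = await Promise.all([1, 2, 3].map(fakeFetch));
const elapsed = Date.now() - start;

console.log(pages);          // ['page-1', 'page-2', 'page-3']
console.log(elapsed < 150);  // true (not 3 × 50 ms)
```

The same pattern scales to real HTTP calls; the only change is swapping `fakeFetch` for `got` or `fetch`.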

Prerequisites and Setup

What You Need

  • Node.js (v18+ recommended — LTS as of 2026)
  • npm or pnpm (package manager)
  • A code editor (VS Code recommended)
# Verify Node.js installation
node --version    # Should show v18+ or v20+
npm --version

# Create project directory
mkdir amazon-scraper && cd amazon-scraper
npm init -y

Installing Core Dependencies

# For static HTML scraping
npm install got cheerio

# For dynamic/JavaScript-rendered pages
npm install playwright

# Install browser binaries for Playwright
npx playwright install chromium

# For concurrent request management
npm install p-limit

# For routing got requests through a proxy
npm install https-proxy-agent

# For data export
npm install csv-writer

Method 1: Static HTML Scraping with Cheerio

Best for: Pages where content is fully rendered in the initial HTML response.

Cheerio implements jQuery's API on the server, giving you familiar $('.selector').text() syntax without a browser.

Basic Scraping Example

import got from 'got';
import * as cheerio from 'cheerio';

async function scrapeProductPage(url) {
  try {
    const { body } = await got(url, {
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
        'Accept-Language': 'en-US,en;q=0.9',
      },
    });

    const $ = cheerio.load(body);

    const product = {
      title: $('h1.product-title').text().trim(),
      price: $('.price-current').first().text().trim(),
      rating: $('.rating-value').text().trim(),
      reviews: parseInt($('.review-count').text().replace(/\D/g, ''), 10),
      description: $('.product-description').text().trim(),
    };

    return product;
  } catch (error) {
    console.error(`Failed to scrape ${url}:`, error.message);
    return null;
  }
}

// Run it
const data = await scrapeProductPage('https://example.com/product/123');
console.log(data);

Scraping Multiple Pages in Parallel (with Rate Limiting)

import got from 'got';
import * as cheerio from 'cheerio';
import pLimit from 'p-limit';

// Limit to 5 concurrent requests maximum
const limit = pLimit(5);

const urls = [
  'https://example.com/product/1',
  'https://example.com/product/2',
  'https://example.com/product/3',
  // ... hundreds more
];

async function scrapeWithDelay(url) {
  // Random delay 1-3 seconds between requests
  await new Promise(resolve => setTimeout(resolve, Math.random() * 2000 + 1000));
  return scrapeProductPage(url);
}

// Scrape all URLs with rate limiting
const results = await Promise.all(
  urls.map(url => limit(() => scrapeWithDelay(url)))
);

const validResults = results.filter(Boolean);
console.log(`Scraped ${validResults.length} products`);
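
Rate limiting pairs naturally with retries: a transient 429 or 503 usually succeeds on a later attempt. A minimal exponential-backoff helper (the name `withRetry` and the thresholds are our own, not from any library):

```javascript
// Retry an async function with exponential backoff plus jitter.
// The wait doubles each attempt: baseMs, 2×baseMs, 4×baseMs, ...
async function withRetry(fn, { retries = 3, baseMs = 500 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      const delayMs = baseMs * 2 ** attempt + Math.random() * 100;
      await new Promise(resolve => setTimeout(resolve, delayMs));
    }
  }
}
```

Usage with the limiter above: `limit(() => withRetry(() => scrapeProductPage(url)))`.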

Method 2: Dynamic Content Scraping with Playwright

Best for: React, Vue, Angular, Next.js pages, infinite scroll, JavaScript-triggered content.

Basic Playwright Setup

import { chromium } from 'playwright';

async function scrapeDynamicPage(url) {
  const browser = await chromium.launch({ headless: true });
  const context = await browser.newContext({
    userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
    viewport: { width: 1366, height: 768 },
  });

  const page = await context.newPage();

  try {
    await page.goto(url, { waitUntil: 'networkidle', timeout: 30000 });

    // Wait for specific element to confirm page is ready
    await page.waitForSelector('.product-title', { timeout: 10000 });

    // Extract data
    const data = await page.evaluate(() => ({
      title: document.querySelector('.product-title')?.textContent?.trim(),
      price: document.querySelector('.price')?.textContent?.trim(),
      rating: document.querySelector('.rating')?.textContent?.trim(),
    }));

    return data;
  } finally {
    await browser.close();
  }
}

Playwright with Proxy Integration

import { chromium } from 'playwright';

async function scrapeWithProxy(url, proxyConfig) {
  const browser = await chromium.launch({
    headless: true,
    proxy: {
      server: proxyConfig.server,     // 'http://gate.limeproxies.com:5432'
      username: proxyConfig.username,
      password: proxyConfig.password,
    },
  });

  const page = await browser.newPage();

  try {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    // ... extraction logic
    return await page.title();
  } finally {
    await browser.close();
  }
}

// Use a rotating residential proxy endpoint
const proxyConfig = {
  server: 'http://gate.limeproxies.com:5432',
  username: 'your-username',
  password: 'your-password',
};

const result = await scrapeWithProxy('https://amazon.com/dp/B08N5WRWNW', proxyConfig);

Anti-Detection: Stealth Mode

Playwright can be detected as a headless browser. Use these techniques to evade detection:

import { chromium } from 'playwright';

const browser = await chromium.launch({
  headless: true,
  args: [
    '--disable-blink-features=AutomationControlled',  // Remove automation flag
    '--disable-features=site-per-process',
  ],
});

const context = await browser.newContext({
  userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
  viewport: { width: 1366, height: 768 },
  locale: 'en-US',
  timezoneId: 'America/New_York',
  permissions: ['geolocation'],
});

const page = await context.newPage();

// Override automation detection properties
await page.addInitScript(() => {
  // Remove webdriver flag
  Object.defineProperty(navigator, 'webdriver', { get: () => undefined });

  // Fake plugins (real browsers have these)
  Object.defineProperty(navigator, 'plugins', {
    get: () => [{ name: 'PDF Plugin' }, { name: 'Chrome PDF Viewer' }],
  });
});

Method 3: API Interception (Best for SPAs)

Many modern websites load data via API calls (XHR/fetch). Capturing these is often faster and more reliable than parsing HTML.

import { chromium } from 'playwright';

async function captureApiResponse(pageUrl, apiPattern) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();

  const capturedData = [];

  // Listen for network responses matching our pattern
  page.on('response', async (response) => {
    if (response.url().includes(apiPattern) && response.status() === 200) {
      try {
        const json = await response.json();
        capturedData.push(json);
      } catch (e) {
        // Not JSON — skip
      }
    }
  });

  await page.goto(pageUrl, { waitUntil: 'networkidle' });

  // For infinite scroll pages — scroll to trigger more API calls
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(2000);
  }

  await browser.close();
  return capturedData;
}

// Example: capture search-results API responses (the pattern below is
// illustrative; inspect the target's network tab for the real endpoint)
const products = await captureApiResponse(
  'https://www.amazon.com/s?k=laptops',
  '/api/s?'
);
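
Captured payloads typically arrive as one JSON document per page of results, so the last step is flattening them into a single record list. This sketch assumes a hypothetical `{ items: [...] }` payload shape; adjust the key to whatever the target API actually returns:

```javascript
// Merge an array of captured API payloads into one flat record list.
// Payloads without an `items` array (errors, metadata) are skipped.
function flattenPayloads(payloads) {
  return payloads.flatMap(p => (Array.isArray(p.items) ? p.items : []));
}
```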

Rotating Proxies in Node.js

For large-scale scraping, rotating proxies are essential to avoid IP bans.

Option 1: Rotating Residential Proxy Endpoint (Recommended)

The simplest approach — use a single endpoint that automatically rotates IPs:

import got from 'got';
import { HttpsProxyAgent } from 'https-proxy-agent';

// LimeProxies rotating residential endpoint
const proxyUrl = 'http://username:password@gate.limeproxies.com:5432';

const response = await got('https://target-site.com/products', {
  // Some proxy gateways present their own TLS certificate; only disable
  // verification if your provider requires it.
  https: { rejectUnauthorized: false },
  agent: {
    https: new HttpsProxyAgent(proxyUrl),
  },
  headers: {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/121.0.0.0 Safari/537.36',
  },
});

Option 2: Manual Proxy Pool with Rotation

import got from 'got';
import { HttpsProxyAgent } from 'https-proxy-agent';

const proxyPool = [
  'http://user:pass@proxy1.example.com:8080',
  'http://user:pass@proxy2.example.com:8080',
  'http://user:pass@proxy3.example.com:8080',
  // ... more proxies
];

let proxyIndex = 0;

function getNextProxy() {
  const proxy = proxyPool[proxyIndex % proxyPool.length];
  proxyIndex++;
  return proxy;
}

async function scrapeWithRotation(url) {
  const proxy = getNextProxy();

  return got(url, {
    agent: { https: new HttpsProxyAgent(proxy) },
    timeout: { request: 30000 },
    retry: {
      limit: 3,
      statusCodes: [503, 429],
    },
  });
}
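
The round-robin above treats every proxy as equally healthy. In practice you also want to bench proxies that keep failing; a minimal sketch (the class name and failure threshold are our own):

```javascript
// Round-robin proxy pool that skips proxies after repeated failures.
class ProxyPool {
  constructor(proxies, maxFailures = 3) {
    this.entries = proxies.map(url => ({ url, failures: 0 }));
    this.maxFailures = maxFailures;
    this.index = 0;
  }

  // Return the next proxy URL, cycling over the healthy subset only.
  next() {
    const healthy = this.entries.filter(e => e.failures < this.maxFailures);
    if (healthy.length === 0) throw new Error('No healthy proxies left');
    return healthy[this.index++ % healthy.length].url;
  }

  // Call this from your request error handler.
  reportFailure(url) {
    const entry = this.entries.find(e => e.url === url);
    if (entry) entry.failures++;
  }
}
```

Wire `reportFailure` into the catch block of `scrapeWithRotation` so dead proxies drop out of rotation automatically.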

See LimeProxies rotating residential proxies and SOCKS5 proxies for proxy plans suited to large-scale Node.js scraping.


Handling Pagination

URL-Pattern Pagination

import pLimit from 'p-limit';

async function scrapeAllPages(baseUrl, totalPages) {
  const limit = pLimit(3); // 3 concurrent requests

  const pageUrls = Array.from({ length: totalPages }, (_, i) =>
    `${baseUrl}?page=${i + 1}`
  );

  const results = await Promise.all(
    pageUrls.map(url => limit(() => scrapeProductPage(url)))
  );

  return results.filter(Boolean);
}

Infinite Scroll Pagination

import { chromium } from 'playwright';

async function scrapeInfiniteScroll(url) {
  const browser = await chromium.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto(url);

  const allItems = new Set();
  let previousHeight = 0;

  while (true) {
    // Extract currently visible items
    const items = await page.$$eval('.product-card', elements =>
      elements.map(el => ({
        title: el.querySelector('.title')?.textContent?.trim(),
        price: el.querySelector('.price')?.textContent?.trim(),
      }))
    );

    items.forEach(item => allItems.add(JSON.stringify(item)));

    // Scroll to bottom
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
    await page.waitForTimeout(2000);

    const newHeight = await page.evaluate(() => document.body.scrollHeight);

    // Stop if page didn't grow (end of content)
    if (newHeight === previousHeight) break;
    previousHeight = newHeight;
  }

  await browser.close();
  return [...allItems].map(s => JSON.parse(s));
}
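
The `Set`-of-JSON-strings trick used above can be factored into a reusable helper. Note it only deduplicates records whose keys serialize in the same order:

```javascript
// Deduplicate objects by their JSON serialization, preserving order.
function dedupeByJson(items) {
  const seen = new Set();
  const unique = [];
  for (const item of items) {
    const key = JSON.stringify(item);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}
```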

Exporting Scraped Data

Save to CSV

import { createObjectCsvWriter } from 'csv-writer';

const csvWriter = createObjectCsvWriter({
  path: 'products.csv',
  header: [
    { id: 'title', title: 'Title' },
    { id: 'price', title: 'Price' },
    { id: 'rating', title: 'Rating' },
    { id: 'url', title: 'URL' },
  ],
});

await csvWriter.writeRecords(products);
console.log('CSV written: products.csv');

Save to JSON

import { writeFileSync } from 'fs';

writeFileSync('products.json', JSON.stringify(products, null, 2));
console.log(`Saved ${products.length} products to products.json`);

JavaScript vs Python for Web Scraping

| Feature | JavaScript (Node.js) | Python |
|---|---|---|
| Async concurrency | Excellent (native event loop) | Good (asyncio, but more complex) |
| Browser automation | Best-in-class (Playwright native) | Excellent (Playwright Python port) |
| Data processing | Good (lodash, streams) | Excellent (pandas, NumPy) |
| ML integration | Limited | Extensive (scikit-learn, TensorFlow) |
| Learning curve | Moderate (async syntax) | Low (beginner friendly) |
| Production ops | Good (Node.js ecosystem) | Excellent (mature tooling) |
| Speed (static scraping) | Very fast (got + cheerio) | Fast (httpx + BeautifulSoup) |

Verdict: For pure scraping with browser automation, JavaScript/Node.js and Python are essentially equivalent. Choose based on your team's existing expertise. JavaScript has a slight edge for React-heavy SPAs since engineers understand the runtime natively.


Complete Production Scraper Example

import * as cheerio from 'cheerio';
import got from 'got';
import pLimit from 'p-limit';
import { createObjectCsvWriter } from 'csv-writer';
import { HttpsProxyAgent } from 'https-proxy-agent';

const CONFIG = {
  concurrency: 5,
  delayMs: { min: 1500, max: 4000 },
  proxy: 'http://user:pass@gate.limeproxies.com:5432',
  outputFile: 'products.csv',
};

function randomDelay() {
  const ms = Math.random() * (CONFIG.delayMs.max - CONFIG.delayMs.min) + CONFIG.delayMs.min;
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function scrapePage(url) {
  await randomDelay();

  try {
    const { body } = await got(url, {
      agent: { https: new HttpsProxyAgent(CONFIG.proxy) },
      headers: {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 ...',
        'Accept': 'text/html,application/xhtml+xml',
        'Accept-Language': 'en-US,en;q=0.9',
      },
      timeout: { request: 30000 },
    });

    const $ = cheerio.load(body);

    return {
      url,
      title: $('h1').first().text().trim(),
      price: $('.price').first().text().trim(),
      rating: $('.rating').first().text().trim(),
      scrapedAt: new Date().toISOString(),
    };
  } catch (error) {
    console.error(`Error scraping ${url}:`, error.message);
    return null;
  }
}

async function main(urls) {
  const limit = pLimit(CONFIG.concurrency);

  console.log(`Scraping ${urls.length} URLs with concurrency ${CONFIG.concurrency}...`);

  const results = await Promise.all(
    urls.map(url => limit(() => scrapePage(url)))
  );

  const validResults = results.filter(Boolean);

  if (validResults.length === 0) {
    console.error('No pages scraped successfully; nothing to write.');
    return;
  }

  const csvWriter = createObjectCsvWriter({
    path: CONFIG.outputFile,
    header: Object.keys(validResults[0]).map(id => ({ id, title: id })),
  });

  await csvWriter.writeRecords(validResults);
  console.log(`Done. Saved ${validResults.length}/${urls.length} records to ${CONFIG.outputFile}`);
}

// Run
const targetUrls = ['https://example.com/page/1', /* ... */];
main(targetUrls);

Last updated: March 2026

About the author

Rachael Chapman

A complete gamer and a tech geek who brings out all her thoughts and love in writing techie blogs.