Web Scraping with Playwright

While preparing for a new task involving Playwright, I researched how to scrape webpages with it. I noticed some discrepancies between the way various tutorials do things and Playwright’s own documentation, so I started an exploratory project to resolve them.

In this article, I will start by examining the steps in this tutorial (representative of the first four or five Google results for “playwright web scraping”) and reviewing whether they worked for me. Next, I’ll note which of the tutorial’s methods Playwright’s docs discourage. Then I’ll explore the alternatives the docs suggest and review whether they worked as substitutes. By “works as a substitute,” I mean that it achieves the same outcome as the tutorial’s code.

First steps: tutorial

The first step was to try things as laid out in the tutorial. The tutorial’s goal was to scrape repository data from GitHub’s JavaScript topic page. To make things fun for myself, I decided to scrape the Vue topic instead.

Since I’ve linked the tutorial above, I won’t write out all the steps here. Instead, I’ll limit myself to a brief summary and the code that I actually changed.

The process was as follows: make a new project directory, start a new project with npm init, install Playwright, and then write the code. The code imports the chromium browser from Playwright, launches the browser, opens a new page and navigates to GitHub, clicks on the text “Load more” to load more repositories, and then waits for the page to show more than 20 results. This is where I ran into my first problem.
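As a rough sketch, the steps just described might look like the following. This paraphrases the flow rather than reproducing the tutorial’s exact code: the function name `loadMoreRepos` is made up for illustration, and the URL and the `article.border` selector are borrowed from the complete code at the end of this article.

```javascript
// Sketch of the tutorial's flow. `page` is a Playwright Page object;
// `loadMoreRepos` is a hypothetical name used only for this example.
async function loadMoreRepos(page) {
    await page.goto('https://github.com/topics/vue');
    // Click the "Load more" button to request another batch of results.
    await page.click('text=Load more');
    // Wait until more than 20 repository cards are in the DOM.
    // This is the step that raises the error discussed next.
    await page.waitForFunction(() => {
        const repoCards = document.querySelectorAll('article.border');
        return repoCards.length > 20;
    });
}

// Usage (assumes Playwright was installed with npm). Call main() to run it.
async function main() {
    const { chromium } = await import('playwright');
    const browser = await chromium.launch();
    const page = await browser.newPage();
    await loadMoreRepos(page);
    await browser.close();
}
```

Taking `page` as a parameter keeps the scraping steps separate from browser setup, which makes them easier to try against a stub object.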

Does it work?

Right away I got an error:

page.waitForFunction: EvalError: Refused to evaluate a string as JavaScript because 'unsafe-eval' is not an allowed source of script in the following Content Security Policy directive: "script-src github.githubassets.com".

After a little research into the error, I found a solution in a GitHub issue, which suggested replacing waitForFunction with evaluate. This worked for me. I was able to scrape the data from GitHub’s Vue topic page with no further problems.
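One thing to keep in mind with this swap: `page.evaluate` runs its function once and returns the result, whereas `waitForFunction` keeps checking until the function returns a truthy value. If an actual wait is needed, one possible alternative is to wait on a locator for the 21st card. The sketch below is only an assumption about what might work here; it has not been tested against GitHub’s Content Security Policy, and `waitForMoreThan` is a hypothetical helper name.

```javascript
// Hypothetical helper: resolves once the element at index `count`
// (that is, match number count + 1) is visible on the page.
// nth() is zero-based, and waitFor() waits for visibility by default.
async function waitForMoreThan(page, selector, count) {
    await page.locator(selector).nth(count).waitFor();
}

// Usage inside a script that already has a Playwright `page`:
//   await page.click('text=Load more');
//   await waitForMoreThan(page, 'article.border', 20);
```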

However, there were still a few methods I wanted to look into, specifically $eval and $$eval, since Playwright’s documentation discourages them. What do the methods do? $eval takes two arguments, a selector and a callback function (called pageFunction in the docs). It finds the first element that matches the selector, passes it to pageFunction as the first argument, and returns the value of pageFunction. $$eval does something similar, but passes pageFunction an array of all the elements that match the selector.
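To make the two signatures concrete, here is a small sketch. The `h1` selector and the function name `countAndTitle` are placeholders chosen for illustration; `article.border` is the repository-card selector used later in this article.

```javascript
// `page` is a Playwright Page object; `countAndTitle` is a hypothetical name.
async function countAndTitle(page) {
    // $eval: the callback receives the first element matching the selector.
    const heading = await page.$eval('h1', (el) => el.innerText.trim());
    // $$eval: the callback receives an array of all matching elements.
    const cardCount = await page.$$eval('article.border', (cards) => cards.length);
    return { heading, cardCount };
}
```

In both cases the callback runs in the browser, and only its return value comes back to the Node script.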

Regarding $eval, Playwright’s docs read:

This method does not wait for the element to pass actionability checks and therefore can lead to the flaky tests. Use locator.evaluate(), other Locator helper methods or web-first assertions instead.
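As a hedged illustration of what the docs suggest, the sketch below reads similar data through locators. `locator.count()`, `locator.first()`, `locator.evaluate()`, and `locator.allInnerTexts()` are Locator methods; the selectors and the function name `readWithLocators` are assumptions made for the example.

```javascript
// `page` is a Playwright Page object; `readWithLocators` is a hypothetical name.
async function readWithLocators(page) {
    const cards = page.locator('article.border');
    // Number of elements currently matching the locator.
    const cardCount = await cards.count();
    // evaluate() expects the locator to resolve to a single element,
    // so first() narrows the match before the callback runs.
    const firstCardText = await cards.first().evaluate((card) => card.innerText.trim());
    // allInnerTexts() returns the innerText of every matching element.
    const topicNames = await page.locator('a.topic-tag').allInnerTexts();
    return { cardCount, firstCardText, topicNames };
}
```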

Next steps: improve on tutorial

Having confirmed that the tutorial’s approach (with the evaluate workaround) scraped the expected data, I moved on to replacing $eval and $$eval with the alternatives the docs suggest. The goal at this point in the exploration was to keep the code working while adhering to good practices.

Does it work?

The result was as successful as the original code. I used a locator to find the repository cards on the page and then used the evaluateAll method to extract data from them.

// locator
const repos = page.locator('article.border');
// extract data with evaluateAll
const reposResults = await repos.evaluateAll((repoCards) => {
    return repoCards.map((card) => {
        // Do stuff
    });
});

The last thing to change was the variable passed to the console.log and console.dir calls:

console.log(`We extracted ${reposResults.length} repositories.`); 
console.dir(reposResults);

…and then I ran the program in the terminal with node scraping.js, which produced the same output as the tutorial.

Conclusion

In this article, I walked through my process for trying out a Playwright scraping tutorial and updating its code to follow Playwright’s recommendations. Thanks for joining me in this exploration, and I hope you learned something helpful or interesting!

Code

Here is the complete code, after my revisions:

// Run as an ES module (for example, with "type": "module" in package.json),
// since the script uses import and top-level await.
import { chromium } from "playwright";

// Launch a visible browser window and open the Vue topic page.
const browser = await chromium.launch({ headless: false });
const page = await browser.newPage();
await page.goto('https://github.com/topics/vue');

// Load another batch of repositories.
await page.click('text=Load more');

// Check, once, whether more than 20 repository cards are on the page.
await page.evaluate(() => {
    const repoCards = document.querySelectorAll('article.border');
    return repoCards.length > 20;
});

// Locate every repository card, then extract data from all of them in the page.
const repos = page.locator('article.border');
const reposResults = await repos.evaluateAll((repoCards) => {
    return repoCards.map((card) => {
        const [user, repo] = card.querySelectorAll('h3 a');
        const stars = card.querySelector('#repo-stars-counter-star').getAttribute('title');
        const description = card.querySelector('div.px-3 > p');
        const topics = card.querySelectorAll('a.topic-tag');
        // Helpers live inside the callback because it runs in the browser.
        const toText = (element) => element && element.innerText.trim();
        const parseNumber = (text) => Number(text.replace(/,/g, ''));

        return {
            user: toText(user),
            repo: toText(repo),
            url: repo.href,
            stars: parseNumber(stars),
            description: toText(description),
            topics: Array.from(topics).map((t) => toText(t)),
        };
    });
});

console.log(`We extracted ${reposResults.length} repositories.`);
console.dir(reposResults);

// Keep the browser open for 10 seconds before closing it.
await page.waitForTimeout(10000);
await browser.close();
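Because the callback given to evaluateAll is sent to the browser, it has to be self-contained; it can’t reach variables defined elsewhere in the Node script. One way to keep that callback testable is to give it a name and exercise it in Node with stub objects. The sketch below is a variation on the code above, not part of the original script: `extractCards` is a hypothetical name, and the optional chaining is an added guard for cards that lack a repository link or a star counter.

```javascript
// Hypothetical named version of the evaluateAll callback.
// Everything it needs is defined inside the function, so it still works
// when Playwright serializes it and runs it in the page.
function extractCards(repoCards) {
    const toText = (element) => (element ? element.innerText.trim() : null);
    const parseNumber = (text) => (text ? Number(text.replace(/,/g, '')) : null);
    return repoCards.map((card) => {
        const [user, repo] = card.querySelectorAll('h3 a');
        const starsTitle = card
            .querySelector('#repo-stars-counter-star')
            ?.getAttribute('title');
        return {
            user: toText(user),
            repo: toText(repo),
            url: repo?.href ?? null,
            stars: parseNumber(starsTitle),
            description: toText(card.querySelector('div.px-3 > p')),
            topics: Array.from(card.querySelectorAll('a.topic-tag')).map(toText),
        };
    });
}

// Usage with the locator from the script above:
//   const reposResults = await repos.evaluateAll(extractCards);
```

With this shape, a card that is missing an expected element yields null fields instead of throwing inside the page.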

Madeline's Machine Learning Log

