As an ultra marathon enthusiast, I often face a common challenge: how do I estimate my finish time for longer races I haven’t attempted yet? When discussing this with my coach, he suggested a practical approach: look at runners who’ve completed both a race I’ve done and the race I’m targeting. Their overlapping results could provide valuable insight into possible finish times. But manually searching through race results would be incredibly time-consuming.
This led me to build Race Time Insights, a tool that automatically compares race results by finding athletes who’ve completed both events. The application scrapes race results from platforms like UltraSignup and Pacific Multisports, allowing runners to input two race URLs and see how other athletes performed across both events.
Building this tool showed me just how powerful DigitalOcean’s App Platform could be. Using Puppeteer with headless Chrome in Docker containers, I could focus on solving the problem for runners while App Platform handled all the infrastructure complexity. The result was a robust, scalable solution that helps the running community make data-driven decisions about their race goals.
After building Race Time Insights, I wanted to create a guide showing other developers how to use these same technologies: Puppeteer, Docker containers, and DigitalOcean App Platform. Of course, when working with external data, you need to be mindful of things like rate limiting and terms of service.
Enter Project Gutenberg. With its vast collection of public domain books and clear terms of service, it’s an ideal candidate for demonstrating these technologies. In this post, we’ll explore how to build a book search application using Puppeteer in a Docker container, deployed on App Platform, while following best practices for external data access.
Project Gutenberg Book Search
I’ve built and shared a web application that responsibly scrapes book information from Project Gutenberg. The app, which you can find in this GitHub repository, allows users to search through thousands of public domain books, view detailed information about each book, and access various download formats. What makes this particularly interesting is how it demonstrates responsible web scraping practices while providing genuine value to users.
Being a Good Digital Citizen
When building a web scraper, it’s important to follow good practices and respect both technical and legal boundaries. Project Gutenberg is an excellent example for learning these principles because:
- It has clear terms of service
- It provides robots.txt guidelines
- Its content is explicitly in the public domain
- It benefits from increased accessibility to its resources
Our implementation includes several best practices:
Rate Limiting
For demonstration purposes, we implement a simple rate limiter that ensures at least one second between requests:
```javascript
const rateLimiter = {
  lastRequest: 0,
  minDelay: 1000,
  async wait() {
    const now = Date.now();
    const timeToWait = Math.max(0, this.lastRequest + this.minDelay - now);
    if (timeToWait > 0) {
      await new Promise(resolve => setTimeout(resolve, timeToWait));
    }
    this.lastRequest = Date.now();
  }
};
```

This implementation is intentionally simplified for the example. It assumes a single application instance and stores state in memory, which wouldn’t be suitable for production use. More robust solutions might use Redis for distributed rate limiting or implement queue-based systems for better scalability.
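To make that more concrete, here is a minimal sketch of a Redis-backed limiter that would work across multiple instances. It assumes the ioredis package and a REDIS_URL environment variable; neither is part of the sample repo.

```javascript
// Hypothetical distributed rate limiter (not from the sample repo).
// A single Redis key acts as a token that expires after minDelay, so
// at most one request per interval is allowed across all instances.
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);
const KEY = 'gutenberg:rate-limit';
const MIN_DELAY_MS = 1000;

async function waitForSlot() {
  // NX means "set only if absent": the first caller claims the interval.
  while ((await redis.set(KEY, '1', 'PX', MIN_DELAY_MS, 'NX')) === null) {
    // Another instance holds the slot; back off briefly and retry.
    await new Promise(resolve => setTimeout(resolve, 100));
  }
}

module.exports = { waitForSlot };
```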
This rate limiter is used before each request to Project Gutenberg:
```javascript
async searchBooks(query, page = 1) {
  await this.initialize();
  await rateLimiter.wait();
  // ... perform the search (shown in full later in this post) ...
}

async getBookDetails(bookUrl) {
  await this.initialize();
  await rateLimiter.wait();
  // ... fetch and parse the book page (elided) ...
}
```

Clear Bot Identification
A custom User-Agent helps website administrators understand who is accessing their site and why; a per-page example follows the list below. This transparency allows them to:
- Contact you if there are issues
- Monitor and analyze bot traffic separately from human users
- Potentially provide better access or support for legitimate scrapers
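In this project the User-Agent is set through a Chrome launch flag (shown later in this post), but Puppeteer can also set it per page with page.setUserAgent. A minimal sketch; the helper name newIdentifiedPage is my own, not from the repo.

```javascript
// Hypothetical helper: open a page that identifies itself clearly.
async function newIdentifiedPage(browser) {
  const page = await browser.newPage();
  await page.setUserAgent(
    'GutenbergScraper/1.0 (+https://github.com/wadewegner/doappplat-puppeteer-sample)'
  );
  return page;
}

module.exports = { newIdentifiedPage };
```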
Efficient Resource Management
Chrome can be memory-intensive, especially when running multiple instances. Properly closing browser pages after use prevents memory leaks and ensures your application runs efficiently, even when handling many requests:
```javascript
try {
  // ... scraping work with browserPage (elided) ...
} finally {
  await browserPage.close();
}
```

By following these practices, we create a scraper that’s both effective and respectful of the resources it accesses. This is particularly important when working with valuable public resources like Project Gutenberg.
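One way to make this pattern hard to forget is to wrap page usage in a small helper that owns the page lifecycle. The withPage helper below is my own sketch, not code from the repo.

```javascript
// Hypothetical helper: run a task with a freshly opened page and
// guarantee the page is closed even if the task throws.
async function withPage(browser, task) {
  const page = await browser.newPage();
  try {
    return await task(page);
  } finally {
    await page.close();
  }
}

// Usage sketch:
// const title = await withPage(browser, async (page) => {
//   await page.goto('https://www.gutenberg.org', { waitUntil: 'networkidle0' });
//   return page.title();
// });
```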
Web Scraping in the Cloud
The application leverages modern cloud architecture and containerization through DigitalOcean’s App Platform. This approach provides a clean balance between development simplicity and production reliability.
The Power of App Platform
App Platform streamlines the deployment process by handling:
- Web server configuration
- SSL certificate management
- Security updates
- Load balancing
- Resource monitoring
This allows us to focus on the application code while App Platform manages the infrastructure.
Headless Chrome in a Container
The core of our scraping functionality uses Puppeteer, which provides a high-level API to control Chrome programmatically. Here’s how we set up and use Puppeteer in our application:
```javascript
const puppeteer = require('puppeteer');

class BookService {
  constructor() {
    this.baseUrl = 'https://www.gutenberg.org';
    this.browser = null;
  }

  async initialize() {
    if (!this.browser) {
      console.log('Environment details:', {
        PUPPETEER_EXECUTABLE_PATH: process.env.PUPPETEER_EXECUTABLE_PATH,
        CHROME_PATH: process.env.CHROME_PATH,
        NODE_ENV: process.env.NODE_ENV
      });

      const options = {
        headless: 'new',
        args: [
          '--no-sandbox',
          '--disable-setuid-sandbox',
          '--disable-dev-shm-usage',
          '--disable-gpu',
          '--disable-extensions',
          '--disable-software-rasterizer',
          '--window-size=1280,800',
          '--user-agent=GutenbergScraper/1.0 (+https://github.com/wadewegner/doappplat-puppeteer-sample) Chromium/120.0.0.0'
        ],
        executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || '/usr/bin/chromium-browser',
        defaultViewport: { width: 1280, height: 800 }
      };

      this.browser = await puppeteer.launch(options);
    }
  }

  async searchBooks(query, page = 1) {
    await this.initialize();
    await rateLimiter.wait();

    const browserPage = await this.browser.newPage();
    try {
      await browserPage.setExtraHTTPHeaders({
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'X-Bot-Info': 'GutenbergScraper - A tool for searching Project Gutenberg'
      });

      const searchUrl = `${this.baseUrl}/ebooks/search/?query=${encodeURIComponent(query)}&start_index=${(page - 1) * 24}`;
      await browserPage.goto(searchUrl, { waitUntil: 'networkidle0' });
      // ... extract results from the page (elided) ...
    } finally {
      await browserPage.close();
    }
  }
}
```

This setup allows us to:
- Run Chrome in headless mode (no GUI needed)
- Execute JavaScript in the context of web pages
- Safely manage browser resources
- Work reliably in a containerized environment
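To tie these pieces together, here is a hypothetical usage of BookService. The require path, and the assumption that searchBooks resolves to parsed results, are mine rather than details from the repo.

```javascript
// Hypothetical usage sketch; adjust the path to wherever BookService
// is defined in your project.
const BookService = require('./services/bookService');

(async () => {
  const books = new BookService();
  try {
    // Assumes searchBooks resolves to the parsed search results.
    const results = await books.searchBooks('sherlock holmes');
    console.log(results);
  } finally {
    if (books.browser) {
      await books.browser.close();
    }
  }
})();
```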
The setup also includes several important configurations for running in a containerized environment:
- Proper Chrome Arguments: Essential flags like --no-sandbox and --disable-dev-shm-usage for running in containers
- Environment-aware Path: Uses the correct Chrome binary path from environment variables
- Resource Management: Sets the viewport size and disables unnecessary features
- Professional Bot Identity: A clear user agent and HTTP headers identifying our scraper
- Error Handling: Proper cleanup of browser pages to prevent memory leaks (a related shutdown sketch follows this list)
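On the cleanup theme, a hedged sketch of a graceful-shutdown hook: when the platform stops the container, dumb-init (installed in the Dockerfile in the next section) forwards SIGTERM to Node, giving us a chance to close the shared browser. The bookService instance name is an assumption.

```javascript
// Hypothetical shutdown hook (not from the repo): close the shared
// Chrome instance before exiting so no orphaned browser processes
// linger in the container. bookService is assumed to be the app's
// shared BookService instance.
process.on('SIGTERM', async () => {
  if (bookService.browser) {
    await bookService.browser.close();
  }
  process.exit(0);
});
```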
While Puppeteer makes it easy to control Chrome programmatically, running it in a container requires proper system dependencies and configuration. Let’s look at how we set this up in our Docker environment.
Docker: Ensuring Consistent Environments
One of the biggest challenges in deploying web scrapers is ensuring they work the same way in development and production. Your scraper might work perfectly on your local machine but fail in the cloud due to missing dependencies or different system configurations. Docker solves this by packaging everything the application needs, from Node.js to Chrome itself, into a single container that runs identically everywhere.
Our Dockerfile sets up this consistent environment:
```dockerfile
FROM node:18-alpine

# Install Chromium and dependencies
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont \
    dumb-init

# Set environment variables
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
    PUPPETEER_DISABLE_DEV_SHM_USAGE=true
```

The Alpine-based image keeps our container lightweight while including all necessary dependencies. When you run this container, whether on your laptop or in DigitalOcean’s App Platform, you get the exact same environment with all the correct versions and configurations for running headless Chrome.
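The excerpt above stops after the environment variables. A plausible completion, assuming the server entry point is server.js (the actual file name in the repo may differ), copies in the app code and starts it under dumb-init so Chrome receives signals correctly:

```dockerfile
# Hypothetical remainder of the Dockerfile (not verbatim from the repo)
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .

EXPOSE 8080
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]
```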
Development to Deployment
Let’s walk through getting this project up and running:
1. Local Development
First, fork the example repository to your GitHub account. This gives you your own copy to work with and deploy from. Then clone your fork locally:
```bash
# Clone your fork
git clone https://github.com/YOUR-USERNAME/doappplat-puppeteer-sample.git
cd doappplat-puppeteer-sample

# Build and run with Docker
docker build -t gutenberg-scraper .
docker run -p 8080:8080 gutenberg-scraper
```
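With the container running, a quick sanity check from another terminal (this assumes the app serves its search UI at the root path):

```bash
# Expect an HTTP 200 response from the local container
curl -I http://localhost:8080/
```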
2. Understanding the Code

The application is built around three main components:
- Book Service: Handles web scraping and data extraction

```javascript
async searchBooks(query, page = 1) {
  await this.initialize();
  await rateLimiter.wait();

  const itemsPerPage = 24;
  const searchUrl = `${this.baseUrl}/ebooks/search/?query=${encodeURIComponent(query)}&start_index=${(page - 1) * itemsPerPage}`;
  // ... load and parse the results (elided) ...
}
```

- Express Server: Manages routes and renders templates

```javascript
app.get('/book/:url(*)', async (req, res) => {
  try {
    const bookUrl = req.params.url;
    const bookDetails = await bookService.getBookDetails(bookUrl);
    res.render('book', { book: bookDetails, error: null });
  } catch (error) {
    // ... error handling (elided) ...
  }
});
```

- Frontend Views: Clean, responsive UI using Bootstrap

```html
<div class="card book-card h-100">
  <div class="card-body">
    <span class="badge bg-secondary downloads-badge">
      <%= book.downloads.toLocaleString() %> downloads
    </span>
    <h5 class="card-title"><%= book.title %></h5>
    <!-- ... more UI elements ... -->
  </div>
</div>
```
3. Deployment to DigitalOcean
Now that you have your fork of the repository, deploying to DigitalOcean App Platform is straightforward:
- Create a new App Platform application
- Connect it to your forked repository
- Under resources, delete the second resource (the one that isn’t a Dockerfile); it is auto-generated by App Platform and not needed
- Deploy by clicking Create Resources
The application will be built and deployed automatically, with App Platform handling all the infrastructure details.
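If you prefer configuration over clicking through the UI, App Platform can also create the app from an app spec. A minimal sketch; the region, branch, and instance size here are my assumptions, not values from the repo:

```yaml
# .do/app.yaml - hypothetical app spec for this sample
name: gutenberg-scraper
region: nyc
services:
  - name: web
    github:
      repo: YOUR-USERNAME/doappplat-puppeteer-sample
      branch: main
      deploy_on_push: true
    dockerfile_path: Dockerfile
    http_port: 8080
    instance_count: 1
    instance_size_slug: basic-xxs
```

You could then create the app with `doctl apps create --spec .do/app.yaml` instead of stepping through the control panel.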
Conclusion
This Project Gutenberg scraper demonstrates how to build a practical web application using modern cloud technologies. By combining Puppeteer for web scraping, Docker for containerization, and DigitalOcean’s App Platform for deployment, we’ve created a solution that’s both robust and easy to maintain.
The project serves as a template for your own web scraping applications, showing how to handle browser automation, manage resources efficiently, and deploy to the cloud. Whether you’re building a data collection tool or just learning about containerized applications, this example provides a solid foundation to build upon.
Check out the project on GitHub to learn more and deploy your own instance!