johnpfeiffer

Web scraping and crawling as a rite of passage

Contents

  • High Level Diagram in ASCII
  • Good old C
  • Firecrawl
  • firecrawl scrape a single page
  • Known Limitations

The internet and HTTP have had an "agent" concept since long before AI: https://www.rfc-editor.org/rfc/rfc9110.html#name-user-agent

Connecting to a server (with TCP) underpins FTP and HTTP; a simple client can download resources.

Such clients are commonly called a "scraper" or "crawler", and I can remember one of my first joys in coding was writing a "download a page" program.

High Level Diagram in ASCII

A client first queries a DNS server to convert the domain name into an IP address; the dig utility shows this lookup: dig example.com

;; ANSWER SECTION:
example.com.        159 IN  A   172.66.147.243
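The same A record lookup can be done programmatically; a minimal sketch with Python's standard library (getaddrinfo is the modern replacement for gethostbyname):

```python
import socket

def resolve_ipv4(hostname: str) -> str:
    """Return the first IPv4 address for a hostname, like an A record lookup with dig."""
    # getaddrinfo yields (family, type, proto, canonname, sockaddr) tuples;
    # for AF_INET the sockaddr is an (ip, port) pair
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET)
    return infos[0][4][0]

if __name__ == "__main__":
    print(resolve_ipv4("example.com"))
```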

Client Server Interactions

CLIENT                                  [DNS SERVER]
  |   What IP Address is example.com ->    |
  |     <- example.com = "172.66.147.243"  |
  |
  |      TCP connect to 172.66.147.243 ->            | [WEB SERVER at 172.66.147.243]
  |      HTTP GET / Host: example.com  ->            |
  |     <- HTTP 200 OK, response: headers + HTML     |

Good old C

#include <stdio.h>       // printf, perror
#include <string.h>      // memset, memcpy
#include <unistd.h>      // close
#include <netdb.h>       // gethostbyname, herror, struct hostent
#include <sys/socket.h>  // socket, connect
#include <netinet/in.h>  // sockaddr_in, htons

int main(void) {
    struct hostent *h = gethostbyname("example.com");
    if (!h) {
        herror("gethostbyname");
        return 1;
    }
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) {
        perror("socket");
        return 1;
    }
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(80);
    memcpy(&addr.sin_addr, h->h_addr, h->h_length);

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("connect");
        close(s);
        return 1;
    }
    printf("successful CONNECT\n");

    const char *req =
        "GET / HTTP/1.0\r\n"
        "Host: example.com\r\n"
        "\r\n";

    // send to the socket the HTTP string
    write(s, req, strlen(req));
    char buf[4096];
    ssize_t n;
    while ((n = read(s, buf, sizeof(buf))) > 0) {
        fwrite(buf, 1, n, stdout);
    }

    printf("\ndone\n");
    close(s);
    return 0;
}

Oh the joys of hours of syntax and compilation errors - to finally successfully reach across the network =]

Compile and run with something like: gcc client.c -o client && ./client

Each popular programming language has this foundational piece:

  • Java
  • Python (thank you BeautifulSoup4 and Requests https://requests.readthedocs.io/en/latest/)
  • Golang https://pkg.go.dev/net/http
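As a sketch of that foundational piece in Python, using only the standard library (urllib in place of Requests, html.parser in place of BeautifulSoup4, so nothing needs to be installed):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html: str) -> list:
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

if __name__ == "__main__":
    html = urlopen("https://example.com").read().decode("utf-8")
    print(extract_links(html))
```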

Slowly providing examples: https://github.com/johnpfeiffer/web-scrapers

And of course there are some excellent dedicated tools:

  • curl https://curl.se/docs/tutorial.html , e.g. curl -H "Host: example.com" 172.66.147.243:80
  • wget https://www.gnu.org/software/wget/manual/wget.html

Firecrawl

There is a new cool kid on the block, an open source project you can even run locally with Docker: https://github.com/firecrawl/firecrawl

The local stack includes:

  • Firecrawl API
  • Redis
  • Postgres / queue backing store
  • Playwright service
  • workers
  • queue UI

vim .env

# Minimal local Docker Compose Firecrawl config

PORT=3002
HOST=0.0.0.0
USE_DB_AUTHENTICATION=false

# Queue admin:
# http://localhost:3002/admin/localdev/queues
BULL_AUTH_KEY=localdev

# prefer explicit even though these are in docker compose
REDIS_URL=redis://redis:6379
REDIS_RATE_LIMIT_URL=redis://redis:6379
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000/scrape

# Reasonable laptop defaults
NUM_WORKERS_PER_QUEUE=4
CRAWL_CONCURRENT_REQUESTS=4
MAX_CONCURRENT_JOBS=2
BROWSER_POOL_SIZE=2

LOGGING_LEVEL=INFO

Gotcha: to avoid building everything locally in Docker containers you also have to modify docker-compose.yaml

grep -n "build:" docker-compose.yaml

Comment out those build statements and activate the corresponding "image" statements:

x-common-service: &common-service
  image: ghcr.io/firecrawl/firecrawl:latest
  # build: apps/api

docker compose build

docker compose up

firecrawl scrape a single page

Use the Scrape endpoint - it only retrieves a single page:

curl -X POST http://localhost:3002/v1/scrape -H 'Content-Type: application/json' -d '{"url": "https://example.com","formats": ["html"]}' > temp.json

{
  "success": true,
  "data": {
    "metadata": {
      "title": "Example Domain",
      "sourceURL": "https://example.com",
      "url": "https://example.com",
      "statusCode": 200,
      "contentType": "text/html",
    },
    "html": "<!DOCTYPE html><html lang=\"en\"><body><div><h1>Example Domain</h1><p>This domain is for use in documentation examples without needing permission. Avoid use in operations.</p><p><a href=\"https://iana.org/domains/example\">Learn more</a></p></div>\n</body></html>"
  }
}

Some jq fun to just see the html: jq -r '.data.html' temp.json
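The same request can be made from Python with only the standard library; a sketch assuming the local Firecrawl instance configured above is listening on port 3002:

```python
import json
from urllib.request import Request, urlopen

def scrape_request(url: str, formats: list) -> Request:
    """Build the POST request for the local Firecrawl /v1/scrape endpoint."""
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return Request(
        "http://localhost:3002/v1/scrape",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    with urlopen(scrape_request("https://example.com", ["html"])) as response:
        result = json.load(response)
    print(result["data"]["html"])  # same output as: jq -r '.data.html' temp.json
```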

If you just want the links from a given page:

curl -X POST http://localhost:3002/v1/scrape -H 'Content-Type: application/json' -d '{"url": "https://example.com","formats": ["links"]}' > example-links.json

cat example-links.json | jq

TODO: how to use it to crawl a whole website

The local queue system will show you each page as it prepares to scrape/download:

  • http://localhost:3002/admin/localdev/queues

Known Limitations

These approaches all focus on static HTML. The rise of JavaScript and React means that retrieving content requires even more advanced tools.

Often people use Playwright (the spiritual successor to Selenium, with bindings for Python and other languages) as a headless browser to interact with dynamic websites.

TODO: section on playwright



Published

Jan 11, 2025

Category

programming


Tags

  • c
  • golang
  • html
  • http
  • programming
  • python