SumYou Logo
SumYou
April 22, 2026·8 min read·SumYou Team

How does website monitoring work? A technical overview

What actually happens when a website monitor detects a change? A practical look at scraping, hashing, diffing and AI summaries.

What is website monitoring?

Website monitoring describes automated processes that regularly check whether the content of one or more websites has changed. Unlike an RSS feed it works even where no feed exists: on modern single-page apps, government portals, shop pages or competitor websites.

Every website monitor consists of four core steps: Fetch, Extract, Detect and Notify. SumYou adds a fifth: Summarize.

Step 1: Fetch — pulling the page

The monitor fetches the URL. That sounds trivial, but it isn't — many sites behave very differently depending on who's calling:

  • Static HTML pages can be fetched with a simple HTTP client (SumYou uses Python's `requests`).
  • JavaScript-rendered pages return a near-empty HTML skeleton on a normal request. The real content is loaded later by JavaScript in the browser. For those the monitor needs an actual browser — SumYou uses Playwright, which spins up Chromium in the background, executes the JS and returns the fully rendered DOM.

In both cases: respect robots.txt, throttle to at most one request per 10 seconds per domain, and use a realistic User-Agent.

Step 2: Extract — isolate the relevant content

An average news page is maybe 20 % article content — the rest is navigation, cookie banners, ads, footers, newsletter popups. If you compared all of that, you'd get constant false-positive change alerts.

Good monitors apply an extraction heuristic:

  1. Parse the HTML into a DOM tree (SumYou uses BeautifulSoup4)
  2. Strip known noise (`script`, `style`, `nav`, `footer`, cookie banners)
  3. Find the main content heuristically — typically the `<main>` or `<article>` element, or the `<div>` with the highest text density
  4. Normalize whitespace and line breaks

What's left is clean plain text representing the editorial content of the page.

Step 3: Detect — find changes

Now the monitor needs to decide: did anything change since the last check?

Most tools use a content hash like SHA-256 or MD5. The cleaned text is run through a cryptographic hash function and the result is a 64-character string. Change a single character and the entire hash changes.

Advantages of this approach:

  • Fast: hash comparisons are microseconds
  • Storage-efficient: 64 characters instead of the whole page
  • Precise: no false positives from whitespace differences (assuming you normalized first)

The downside: a hash only tells you "something changed", not what. SumYou therefore also stores the text and computes a diff on change — a list of changed lines.

Step 4: Summarize — AI makes the diff readable

A raw text diff is hard for humans to read. "Line 145 deleted, line 146 added" — what does that mean?

This is where SumYou's key step comes in: a large language model (default GPT-4o-mini, with Anthropic Claude as fallback) receives the old and new content and produces a 2-3 sentence summary in your chosen language.

Example: instead of "Line 145: 'iPhone 15 Pro $1,299' -> 'iPhone 15 Pro $1,199'" you get:

> "Apple cut the price of the iPhone 15 Pro by $100 to $1,199. Other models remained unchanged."

To prevent hallucinations the LLM only sees the changed regions, not the whole page. Plus: a strict system prompt that only allows observations, not speculation.

Step 5: Notify — inform the user

Once the summary is ready, it goes out:

  • In the updates feed on SumYou.com
  • Optionally by email (Pro / Pro+)
  • Soon by push notification (PWA)

How often do we check?

The polling interval trades freshness against server load and cost. SumYou plans:

  • Free: every 24 hours
  • Pro: every hour
  • Pro+: every 15 minutes

Behind the scenes Celery Beat spreads checks across the day so we don't hammer all sources at once. Sources with persistent errors (server down, captcha) get automatically backed off to a longer interval.

Conclusion

Website monitoring isn't rocket science, but there are many small detail problems: correct content extraction, robust change detection, meaningful AI summaries, friendly crawling behavior. SumYou solves those out of the box so you can focus on the thing that actually matters: knowing what changed.

Try SumYou for free and monitor your first 10 sources with GPT-4o-mini summaries.

Ready to get started?

Start free with 10 sources and get AI summaries.

Start Free