Change detection explained: content hash, diff, and AI
How does a tool know whether a website has changed — without comparing every word? A look at hashing, diffs and their limitations.
The problem: what does "changed" even mean?
Most people intuitively think of change detection like this: compare the old content with the new content, and if something differs, it's a change. Sounds simple. In practice it's a minefield.
An average website has three layers:
- Real content — articles, product info, prices, press releases
- Structural chrome — navigation, footer, cookie banner, ads
- Dynamic noise — timestamps like "3 minutes ago", live counters, rotating ads
Good change detection has to recognise layer 1 and ignore layers 2 and 3. Otherwise you get constant false alarms.
Method 1: visual pixel diff
Tools like Visualping diff screenshots pixel by pixel.
- Works on any page, including JavaScript-heavy apps
- Problem: extremely sensitive. A new ad triggers an alert; a button shifting position triggers another
- Workaround: you can usually exclude regions — but that's manual work per source
Method 2: DOM-based comparison
Compare the HTML tree (DOM) using CSS selectors that target exactly the elements you care about.
- Works very precisely if you know what you want
- Problem: per-page setup is work. As soon as the target site changes its layout, the selectors break
- Tool: Distill.io is the classic example
Method 3: content hash (how SumYou does it)
The page is first reduced to its editorial content (layer 1 above), and that content is then run through a cryptographic hash — typically SHA-256 or MD5.
Example:
```
Original text: "Apple iPhone 15 Pro - $1,299"
SHA-256: a4f8e2c9d1b7...
New text: "Apple iPhone 15 Pro - $1,199"
SHA-256: 9c2e1f6a4d8b... (completely different hash)
```
- Pro 1: fast. Hash comparison is O(1) instead of O(n) like a text diff
- Pro 2: storage-efficient. 64 characters instead of the full page
- Pro 3: precise. With good content extraction there are no false positives from layout changes
- Prerequisite: the content extraction has to be solid. If the cookie banner sneaks in, the hash flips on every banner update
Method 4: hash + diff (the full picture)
A hash only tells you "something changed". It doesn't tell you what changed. So SumYou also keeps the original text and computes a classic text diff on change — a list of added and removed lines.
With hash + diff you get:
- Fast change detection
- A precise description of what changed
- Low storage cost
Method 5: AI as a reading layer
Even a diff is tedious for humans. "Line 145 removed, line 146 added — what does that mean?"
SumYou hands the diff to a large language model (GPT-4o-mini, Claude as fallback) and asks it for a 2-3 sentence summary. The model only receives the changed regions, not the whole page, plus a strict prompt:
> "Describe in at most three sentences what changed. Do not speculate. If unclear, say 'unclear'."
That turns a raw diff into a readable sentence like:
> "Apple cut the price of the iPhone 15 Pro by $100 to $1,199."
Where the method hits its limits
Hash-based detection is robust but not perfect:
- Timestamps in the content — "updated 3 minutes ago" flips the hash without anything substantial changing. SumYou tries to detect and normalise such stamps during extraction.
- Personalised content — if a page returns different content based on geo-IP, the hash can flip between checks even though nobody edited anything
- A/B tests — some pages show variant A to 50 % and variant B to the rest. The hash will flip randomly
- Real but unimportant changes — for example typo fixes. The hash doesn't distinguish between "comma fixed" and "price halved"
Point 4 is exactly where the AI layer helps: it can classify importance (low / medium / high / breaking) so you don't get an email for every typo fix.
Conclusion
Change detection sounds simple but is a tradeoff between sensitivity (catch everything) and precision (only flag what matters). Hash-based detection with good content extraction is today's gold standard for textual change — combined with an AI layer that makes the result readable.
Try SumYou for free and feel for yourself how hash + diff + AI behaves in practice.