CS253 Lecture Summaries: Part VIII: Fingerprinting and Privacy
From Pete Snyder (Brave)
Why do websites track? Primarily for advertising. In the print world you matched ads to content as a proxy for people. The early web replicated this, but then ads started to target people directly, not just through the proxy of content.
Databases are built with user profiles, and ads start to target lower-quality websites (for less money), since advertisers know they can reach the same people there. So there's no value in advertising on more expensive sites.
There's now a huge internet ad industry of middle parties sitting between marketers and consumers. But if you're not Google or Facebook, you are almost certainly not profitable; essentially none of these middle-party companies make a profit.
Findings from the Englehardt/Narayanan paper: the average website had around 20 trackers at the time of the study.
Google Analytics on its own has a record of your visits to around 70-80% of the websites you view. Google's trackers combined appear on over 80% of websites; Facebook's on around 40%.
Classic Tracking
Origins of tracking lie in the early decisions around how to authenticate web requests. In the earliest days of HTTP there was no state at all, the client would request a resource, the server would return html, that’s it.
How do we authenticate in a stateless system without the user having to log in all the time, given that HTTP auth is terrible and limited?
The answer was that the server would give us a token, the user returns the token in future requests. This is cookies.
Hosting was expensive then, so sites wanted to include resources such as images from third-party domains. The question was raised: do we send cookies on those third-party requests? The decision was yes, and that gave birth to tracking.
Cookies + Third party Resources = tracking.
I visit site A. Site A includes a request to a tracking server which returns a cookie. I visit site B. That site includes a request to the same tracking server. The cookie is sent. And there you are, I am tracked.
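That flow can be sketched as a small simulation. The site names, tracker domain, and cookie format below are made up for illustration; the point is only the mechanics of a shared third-party cookie:

```python
# Minimal simulation of cross-site tracking via third-party cookies.
# All domains and IDs are illustrative.

class Tracker:
    """Plays the role of tracker.example, embedded on many sites."""
    def __init__(self):
        self.next_id = 0
        self.profiles = {}  # cookie id -> list of first-party sites visited

    def handle_request(self, cookie, first_party_site):
        if cookie is None:                      # first visit: mint an ID
            cookie = f"uid-{self.next_id}"
            self.next_id += 1
            self.profiles[cookie] = []
        self.profiles[cookie].append(first_party_site)  # log the visit
        return cookie                           # Set-Cookie on the response

class Browser:
    def __init__(self):
        self.cookie_jar = {}  # domain -> cookie value

    def visit(self, site, tracker):
        # The page on `site` embeds a resource from the tracker's domain;
        # the browser attaches any cookie it already holds for that domain.
        sent = self.cookie_jar.get("tracker.example")
        self.cookie_jar["tracker.example"] = tracker.handle_request(sent, site)

tracker = Tracker()
browser = Browser()
browser.visit("site-a.example", tracker)
browser.visit("site-b.example", tracker)
print(tracker.profiles)  # {'uid-0': ['site-a.example', 'site-b.example']}
```

The tracker never needed the cooperation of site A or site B beyond the embedded resource: the browser's own cookie behaviour links the two visits.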
Browsers started pushing back on this. Safari was the first; now browsers like Brave block or otherwise restrict third-party cookies.
So then the trackers fought back: they moved ID information out of cookies and into query parameters, local storage, the cache, etc.
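The query-parameter variant is often called "link decoration": the ID rides along on outbound links instead of in a cookie. A minimal sketch, with a made-up parameter name (`tid`):

```python
# Sketch of link decoration: carrying a tracking ID in a query parameter
# instead of a cookie. The `tid` parameter name is illustrative.
from urllib.parse import urlencode, urlparse, parse_qs

def decorate(url, tracking_id):
    """Append the tracker's ID to an outbound link."""
    sep = "&" if urlparse(url).query else "?"
    return url + sep + urlencode({"tid": tracking_id})

def extract(url):
    """The destination site's tracker reads the ID back out."""
    return parse_qs(urlparse(url).query).get("tid", [None])[0]

link = decorate("https://site-b.example/article?page=2", "uid-0")
print(link)           # https://site-b.example/article?page=2&tid=uid-0
print(extract(link))  # uid-0
```

Because the ID travels inside an ordinary URL, blocking third-party cookies alone does nothing to stop it.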
Check out HSTS tracking for a nasty example.
Fingerprinting / “passive tracking”
In the classic mode, the website stores an id on the client and finds a way to return it to the server.
In fingerprinting, the website finds things that differ between visitors, and those differences allow re-identification.
How is it done? Through a large number of semi-identifiers: browser size, extra fonts, audio hardware, plugins, colour depth, etc.
Each identifier might not be that discriminatory (Windows users, Firefox users, Office fonts etc) but it can successfully identify individuals if you have:
Sufficient breadth - large number of semi-identifiers
Sufficient depth - how strongly each semi-identifier discriminates between users
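Breadth and depth can be made concrete with a bit of entropy arithmetic: each semi-identifier contributes bits, and (assuming rough independence) the bits add up. The probabilities below are illustrative, not real measurements:

```python
# Back-of-the-envelope math: surprisal of one observed value is -log2(P),
# and independent semi-identifiers add their bits together.
# All probabilities here are invented for illustration.
import math

def bits(probability):
    """Bits of identifying information in one observed value."""
    return -math.log2(probability)

# P(a random user shows this exact value) for a few semi-identifiers:
observed = {
    "user agent":  1 / 50,    # a fairly common browser/OS combo
    "screen size": 1 / 20,
    "timezone":    1 / 24,
    "font list":   1 / 1000,  # an oddball installed font (deep identifier)
}

total = sum(bits(p) for p in observed.values())
print(f"{total:.1f} bits")  # -> 24.5 bits
```

Roughly 33 bits is enough to single out one person among the whole world's population, so even a handful of weak identifiers plus one deep one gets a tracker most of the way there.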
In terms of breadth here are some well-used identifiers:
User Agent String
The User Agent string is the Katamari Damacy of identifiers, check out the history for more.
Fonts
Installed fonts are another critical one - there are three categories of fonts: system, local, and web.
Local fonts are the tricky part where you can fingerprint people - applications like Office and Photoshop install their own fonts, and web pages can read those back out. One oddball font you've installed can really help identify you.
A fingerprinting script carries an extremely long list of font names; for each one it applies the font to a span and checks whether the span's rendering changes (by measuring its width) relative to the fallback font.
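The probing loop itself runs as browser-side JavaScript against a hidden element; here is just its logic, sketched in Python with stubbed width measurements standing in for the layout engine:

```python
# Core of font probing: render a test string as `candidate, fallback`;
# if the measured width differs from the fallback alone, the candidate
# font must be installed. Widths here are stubbed; in a real page they
# come from measuring a hidden <span>. Font names are illustrative.

FALLBACK_WIDTH = 100  # width of the test string in the generic fallback font

# Stub for the layout engine: only installed fonts change the width.
installed = {"Wingdings": 137, "SomeOddballFont": 92}

def measure(candidate_font):
    return installed.get(candidate_font, FALLBACK_WIDTH)

probe_list = ["Wingdings", "SomeOddballFont", "NotInstalledFont"]
detected = [f for f in probe_list if measure(f) != FALLBACK_WIDTH]
print(detected)  # ['Wingdings', 'SomeOddballFont']
```

Note that the page never reads the font list directly; it infers each font's presence one yes/no measurement at a time, which is why simply removing a "list fonts" API doesn't close the leak.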
What’s the defence to this? It’s a hard problem.
Canvas / WebGL
See the Pixel Perfect paper.
To exploit this you draw a line on a canvas, then read it back out; there will be micro differences in the image (anti-aliasing, GPU, driver quirks), and those subtle differences are revealing.
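The readback-and-hash step is what turns those micro differences into a stable identifier. A sketch of the idea, with fabricated pixel data in place of a real canvas:

```python
# Canvas fingerprinting boils down to: render, read the pixels back,
# hash them. A single off-by-one pixel from a rendering quirk produces
# a completely different hash. Pixel data here is fabricated.
import hashlib

def fingerprint(pixels: bytes) -> str:
    return hashlib.sha256(pixels).hexdigest()[:16]

machine_a = bytes([255, 0, 0, 255] * 100)  # RGBA pixels: "drew a red line"
machine_b = bytearray(machine_a)
machine_b[2] ^= 1  # one anti-aliasing quirk: a single channel off by one

print(fingerprint(machine_a) == fingerprint(bytes(machine_b)))  # False
```

The hash is short, stable for a given machine, and cheap to compute, which is why this is such a popular fingerprinting vector.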
Hardware Identifiers
Many Web APIs leak capabilities.
Number of cores, audio channels, number of shaders and similar, device memory, network.
Semi identifying and easy to extract.
Height and Width
Check out https://firstpartysimulator.org/ for a simulator that tracks how easy you are to identify.
Check out https://github.com/fingerprintjs/fingerprintjs for the most popular fingerprint javascript library.
Countermeasures
Remove the functionality (not great)
Make the functionality consistent
Restrict access (permissions, first party vs third party, user interaction)
Noise
‘Privacy budget’
Brave, for example, or uBlock Origin, checks requests against a list of known trackers and blocks them: known third-party tracking requests are blocked, and fingerprinting vectors are turned off.
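The "noise" countermeasure deserves a sketch. Brave's real version of this ("farbling") differs in its details; the idea is to perturb fingerprintable outputs with per-session randomness, so the value stays stable within a session but differs across sessions and users. Everything below is illustrative, not Brave's implementation:

```python
# Sketch of the noise defence: key tiny perturbations to a per-session
# seed. Deterministic within a session, unpredictable to the tracker,
# and different next session - breaking cross-session linkage.
import hashlib
import os

session_seed = os.urandom(16)  # fresh per browsing session

def farble(pixels: bytes, seed: bytes = session_seed) -> bytes:
    # Flip the low bit of some bytes, keyed by a hash of the seed.
    mask = hashlib.sha256(seed).digest()
    return bytes(b ^ (mask[i % len(mask)] & 1) for i, b in enumerate(pixels))

canvas = bytes([255, 0, 0, 255] * 100)
assert farble(canvas) == farble(canvas)  # stable within this session
# A different session seed yields a different readback, so the hashed
# "fingerprint" no longer matches across sessions.
```

The perturbation is small enough that the page still works visually, but large enough that hashing the readback (as in canvas fingerprinting) gives the tracker a different answer every session.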