CS253 Lecture Summaries: Part VI: XSS
From Web Security - Stanford CS 253
Cross Site Scripting (XSS)
In XSS the goal of the attacker is to get their code to run in the context of a site that they are trying to attack.
The Same Origin Policy prevents me from, for example, embedding a site and then reaching into it and adding code. So the attacker needs some other means.
The approach is based on code injection. Code injection is caused when untrusted user data unexpectedly becomes code.
Any code that combines a command with user data is susceptible.
In cross site scripting, the unexpected code is JavaScript in an HTML document. In SQL injection, the unexpected code is extra SQL commands included in a query string.
If XSS is successful, the attacker can do anything the target can do through their browser: view and exfiltrate cookies, send HTTP requests to the site with the user’s cookies attached, and so on.
One good test string for finding XSS vulnerabilities is <script>alert(document.cookie)</script>
This won’t cause any harm since you’re just showing yourself your own cookies.
So for example, let’s say that you have a search box, that takes the user’s input string and then on the server interpolates it into something like <p> showing results for ${input}</p>
That could be interpolated as <p> showing results for <script>alert(...)</script></p>
which would cause the script to run.
An attacker could include malicious code in a query parameter in the url of your site, which if you don’t handle could then execute.
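The search box example can be sketched as a small server-side template function. This is only an illustration: the function name renderResults is hypothetical, and a real server would receive the input from a request rather than a literal.

```javascript
// Hypothetical server-side template for a search results page.
// VULNERABLE: `input` is interpolated into the HTML without any escaping.
function renderResults(input) {
  return `<p>showing results for ${input}</p>`;
}

// A benign query is rendered as plain text.
const benign = renderResults("kittens");

// An attacker-supplied query becomes a live <script> element in the page.
const attack = renderResults("<script>alert(document.cookie)</script>");

console.log(benign);
console.log(attack);
```

The second output contains a complete script element, which the browser would execute when it parses the page.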
XSS is very prevalent because data can be used in many contexts: HTML has at least 5 contexts to understand. Each context has different control characters, and some contexts have complex rules. If you slip up once, you’re vulnerable.
Reflected vs. Stored XSS
In reflected XSS the attack code is placed into the HTTP request itself. Attacker goal: find a URL that you can make the target visit that includes your attack code. The limitation is that the code must be added to the URL path or query parameters.
In stored XSS the attack code is persisted in the DB. Attacker goal is to use any means to get the attack code in the DB. Once there, the server includes it in all pages sent to clients.
Attack vectors
HTML elements:
<p>USER_DATA_HERE</p>
Fix? Change < to &lt; and & to &amp;
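A minimal sketch of that fix for the element-body context, replacing < with &lt; and & with &amp;. The function name escapeHtmlText is an assumption made for illustration; in practice a templating library would normally do this for you.

```javascript
// Escape untrusted data for use inside an HTML element body,
// e.g. <p>USER_DATA_HERE</p>.
// "&" is replaced first so the "&" in a freshly written "&lt;" is not
// escaped a second time.
function escapeHtmlText(untrusted) {
  return String(untrusted)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;");
}

const safeBody = `<p>${escapeHtmlText("<script>alert(document.cookie)</script>")}</p>`;
console.log(safeBody);
```

With the opening angle bracket encoded, the HTML parser never sees a tag start inside the paragraph, so the payload is displayed as text.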
HTML attributes:
<img src='avatar.png' alt='USER_DATA_HERE' />
user input : Alex' onload='alert(document.cookie)
Result: <img src='avatar.png' alt='Alex' onload='...'/>
Fix? Change ' to &#39; and " to &quot;
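A sketch of the attribute fix applied to the avatar example. The names escapeHtmlAttribute and renderAvatar are hypothetical. The sketch also escapes & and <, which is an addition beyond the two quote characters named above.

```javascript
// Escape untrusted data for use inside a QUOTED HTML attribute value.
function escapeHtmlAttribute(untrusted) {
  return String(untrusted)
    .replace(/&/g, "&amp;") // first, so later entities are not re-escaped
    .replace(/</g, "&lt;")
    .replace(/'/g, "&#39;")
    .replace(/"/g, "&quot;");
}

// Hypothetical template for the avatar image from the example.
function renderAvatar(altText) {
  return `<img src='avatar.png' alt='${escapeHtmlAttribute(altText)}' />`;
}

const rendered = renderAvatar("Alex' onload='alert(document.cookie)");
console.log(rendered);
```

Because the quotes in the input are encoded, the value can no longer close the alt attribute and start a new onload attribute.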
HTML attributes without quotes:
You don’t have to have quotes in your html attributes, you just lose the ability to include spaces in them.
fix? always quote attributes.
NB: beware attributes like src and href. Letting your user set the src of a script can never be safe.
Beware of data: and javascript: URLs in hrefs.
data: URLs let you specify a MIME type, then a comma, then data of that type. E.g. data:text/html,<html contenteditable></html> will create an editable page.
They let you save an HTTP request in an html page, eg the logo of a website. Or in CSS you can inline an image. Only use this for small images though (no caching).
javascript: URLs allow you to execute JavaScript in the context of the page you are on. This was a legacy way to execute JS in response to a click.
So don’t let users choose arbitrary URLs: you could end up with a javascript: URL and code execution. Don’t let users choose a page to iframe either, since you can get the same thing.
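One way to act on this advice is to allow only known-safe URL schemes before placing a user-chosen URL in an href. The sketch below uses the standard URL class. The function name isSafeLinkUrl, the allowlist contents, and the base URL https://example.com/ are assumptions for illustration.

```javascript
// Schemes we are willing to link to. Anything else (javascript:, data:, ...)
// is rejected.
const ALLOWED_PROTOCOLS = new Set(["http:", "https:"]);

// Returns true only if `untrusted` resolves to an allowed scheme.
// Relative URLs are resolved against `base` (a placeholder origin).
function isSafeLinkUrl(untrusted, base = "https://example.com/") {
  let parsed;
  try {
    parsed = new URL(untrusted, base);
  } catch {
    return false; // not parseable as a URL at all
  }
  return ALLOWED_PROTOCOLS.has(parsed.protocol);
}

console.log(isSafeLinkUrl("https://example.org/profile"));
console.log(isSafeLinkUrl("javascript:alert(document.cookie)"));
console.log(isSafeLinkUrl("data:text/html,<html contenteditable></html>"));
```

Parsing the URL rather than searching the string for "javascript:" means the check follows the same normalisation the URL parser applies, such as lower-casing the scheme.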
Another issue on attributes:
let’s say we’re adding an event listener and want to add user data to the listener like this:
<div onmouseover='handleHover(USER_DATA_HERE)'>
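Escaping ' and " is not enough here. What if the user’s data is ); alert(document.cookie
The handler’s own closing parenthesis then completes the second call, giving handleHover(); alert(document.cookie) and we’ve lost control of the JS.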
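To see why quote escaping does not help here, the sketch below builds the handler source the way the template would. The payload is written without a trailing parenthesis so that the template's own closing parenthesis completes the injected call. The helper name buildHandler is hypothetical, and new Function is used only to check that the result parses as valid JavaScript; nothing is called.

```javascript
// Hypothetical template for the value of the onmouseover attribute.
function buildHandler(userData) {
  return `handleHover(${userData})`;
}

// The payload contains no quotes at all, so quote escaping leaves it unchanged.
const handlerCode = buildHandler("); alert(document.cookie");
console.log(handlerCode);

// Parse (but do not run) the handler body to confirm it is valid JavaScript.
const parsedHandler = new Function(handlerCode);
console.log(typeof parsedHandler);
```

The resulting handler body is two statements: the original call with no argument, followed by the attacker's call.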
Another gotcha: the id attribute on a DOM node automatically creates, via the DOM API, a global variable named after the id that refers to the node. So if a user can set an id to the name of a variable you’re relying on, they can change the script’s behaviour.
Script Elements:
This is very common - a site wants to use some dynamic user string like this:
<script>
let username = 'Bob Dole'
alert(`Hi there ${username}`)
</script>
What if the user input is Bob'; alert(document.cookie);//'
Then we get:
username = 'Bob'; alert(document.cookie); //
Could we just escape the string terminators by changing ' to \' and " to \" ?
Not if done naively: if the user has entered \' it would become \\' which escapes the backslash, not the quote. We’d have to escape the backslash too, ending up with \\\'
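The ordering problem can be shown with two small functions. Both names are hypothetical, and this only illustrates the backslash issue: as the notes go on to explain, even the corrected version is not enough inside a script element.

```javascript
// Naive: escapes the quote characters only.
function naiveEscape(s) {
  return s.replace(/'/g, "\\'").replace(/"/g, '\\"');
}

// Escapes backslashes first, then the quote characters.
function escapeBackslashFirst(s) {
  return s
    .replace(/\\/g, "\\\\")
    .replace(/'/g, "\\'")
    .replace(/"/g, '\\"');
}

// User input that already contains a backslash before the quote:
//   \'; alert(document.cookie);//
const payload = "\\'; alert(document.cookie);//";

// Naive result starts with \\' : an escaped backslash, then a live quote.
console.log(naiveEscape(payload));

// Corrected result starts with \\\' : an escaped backslash, then an escaped quote.
console.log(escapeBackslashFirst(payload));
```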
We could try HTML entities like &apos; and &quot;
But this doesn’t preserve the user’s input, the JS string parser has no knowledge of html entities so those would not be parsed back to the original character.
Also it’s still insecure.
What if the user input is </script><script>alert(document.cookie)</script><script>
Then the JS becomes let username = '</script><script>alert(document.cookie)...'
This looks fine from a JS parser perspective since we’re in a string. But the HTML parser runs first. So we can see this from an HTML parser point of view we get:
<script>
let username= '
</script>
<script>
alert(document.cookie)
</script>
<script>
'
alert(`Hi there, ${username}`)
</script>
So we have two scripts with syntax errors, and the malicious script in the middle, which executes. Critical to understand the parser sequence here:
First the HTML parser runs greedily - searching for HTML tags and producing a DOM tree.
Second the JavaScript and CSS parsers run - JS parser on content inside <script> tags, CSS parser on content in <style> tags.
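The parser sequence can be mimicked with a deliberately simplified model: the HTML parser ends a script element at the first closing script tag it finds, with no knowledge of JS string syntax. The functions renderScript and firstScriptBody are hypothetical, and a real HTML parser has more rules than this (case-insensitive matching, for example).

```javascript
// Hypothetical template that places the username inside a JS string.
function renderScript(username) {
  return `<script>let username = '${username}'</script>`;
}

// Simplified model of the HTML parser: the script body runs from the opening
// tag to the FIRST "</script>", regardless of any JS quotes in between.
function firstScriptBody(html) {
  const start = html.indexOf("<script>") + "<script>".length;
  const end = html.indexOf("</script>", start);
  return html.slice(start, end);
}

const page = renderScript("</script><script>alert(document.cookie)</script><script>");
const body = firstScriptBody(page);
console.log(body);
```

The first script body is cut off in the middle of the string literal, which is why that script fails with a syntax error while the injected one runs.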
So what’s the actual fix?
One is to hex-encode the user data, producing a string that contains only the characters 0-9 and A-F. Include that in a JS string, then decode the hex string in the script.
let username = hexDecode('HEX_ENCODED_USER_DATA')
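The notes do not show how hexDecode is implemented, so the following is one possible sketch, assuming the data is UTF-8 text. hexEncode would run on the server when building the page and hexDecode in the page's script. Both rely on the standard TextEncoder and TextDecoder classes.

```javascript
// Encode text as upper-case hex of its UTF-8 bytes (characters 0-9, A-F only).
function hexEncode(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => b.toString(16).padStart(2, "0").toUpperCase()).join("");
}

// Decode a hex string produced by hexEncode back into the original text.
function hexDecode(hex) {
  const bytes = new Uint8Array(hex.length / 2);
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] = parseInt(hex.slice(i * 2, i * 2 + 2), 16);
  }
  return new TextDecoder().decode(bytes);
}

const hostile = "</script><script>alert(document.cookie)</script>";
const encoded = hexEncode(hostile);

console.log(encoded);
console.log(hexDecode(encoded) === hostile);
```

Because the encoded form contains no quotes, backslashes or angle brackets, neither the HTML parser nor the JS parser can be steered by it.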
Another way is to put the data in a <template> tag, which won’t visibly render. Then we can just encode strings as if they were HTML text (so encode < and &).
Then we grab the contents of that template tag in our script and get its text content.
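A sketch of the template approach, written as a server-side function that emits both the template element and the script that reads it. The names renderPage and escapeHtmlText and the id username-data are assumptions. The emitted script reads the decoded text through the template's content fragment.

```javascript
// Escape untrusted data as HTML text (encode & and <).
function escapeHtmlText(untrusted) {
  return String(untrusted)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;");
}

// Hypothetical page fragment: the user data lives in an inert <template>,
// and the script reads it back as text instead of embedding it in JS source.
function renderPage(username) {
  return [
    `<template id="username-data">${escapeHtmlText(username)}</template>`,
    "<script>",
    "  const username = document.getElementById('username-data').content.textContent",
    "  alert(`Hi there ${username}`)",
    "</script>",
  ].join("\n");
}

const html = renderPage("</script><script>alert(1)</script>");
console.log(html);
```

The user data never appears inside the script element, so only the HTML text escaping rules apply to it.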
Unsafe contexts
Contexts that are never a safe place for user data:
In script tags:
<script>USER_DATA_HERE</script>
In comments:
<!-- USER DATA HERE -->
As elements:
<USER_DATA_HERE href='/'>Link</a>
As attributes:
<div USER_DATA_HERE='some value'></div>
In style tags:
<style>USER_DATA_HERE</style>
HTML parsers are very forgiving so these are ‘valid’ html
<script/XSS src="myevilsite.com">
will be parsed as
<script XSS src=...
So we can’t just naively search for
<script ...
tags.
<body onload!#$%&()*~+-_.,:;?@[/|]^`=alert(document.cookie)>
In Firefox the characters after onload and before the equals sign will just be ignored,
so again we can’t search naively.
<img """><script>alert(document.cookie)</script>">
- this will run the script for some reason
<iframe src=myevilsite.com/xss.js <
HTML parsing was designed this way based on the robustness principle: be conservative in what you send, liberal in what you accept.
This is known as Postel’s law, after Jon Postel, who wrote the TCP spec. It is bad for security.
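The first example in this section shows how easily a naive filter is bypassed. The filter below is hypothetical: it looks for the text "<script" followed by whitespace, which is what a blocklist might plausibly check.

```javascript
// Hypothetical naive filter: flags "<script" followed by whitespace.
function naiveHasScriptTag(html) {
  return /<script\s/i.test(html);
}

// The conventional form is caught...
console.log(naiveHasScriptTag('<script src="myevilsite.com">'));

// ...but the forgiving-parser variant with "/" instead of a space is not.
console.log(naiveHasScriptTag('<script/XSS src="myevilsite.com">'));
```

Patching the pattern for this one case does not help much, since the other examples above use different tricks. This is why escaping for the output context is preferred over trying to detect bad input.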
Summary
User data can be safely escaped in HTML element bodies, HTML attributes (surrounded by quotes) and JavaScript strings, but avoid the others.
Beware nested parsing chains, take this code:
<div onclick="setTimeout('doStuff(\'USER_DATA_HERE\')', 1000)"></div>
There are three rounds of parsing here:
The HTML parser extracts the onclick attribute and adds it to the DOM.
When the element is clicked, the JS parser parses the setTimeout call and executes it.
One second later, the string passed as the argument to setTimeout is parsed as JS, and then executed.
If the user data isn’t double-encoded with JS backslash sequences and then html encoded you’re in trouble.
Just avoid this type of code.
Another example:
<script>
let someValue = 'USER_DATA_HERE'
setTimeout("doStuff('" + someValue + "')", 1000)
</script>
It’s easy to forget to further escape the value for the setTimeout construction. Better to avoid this type of code.
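One way to avoid the extra round of parsing is to pass setTimeout a function instead of a string, so the value is never spliced into source code. The sketch below contrasts the two. doStuff is a stand-in implementation and both builder names are hypothetical.

```javascript
// Stand-in for the page's real doStuff.
function doStuff(value) {
  return `did stuff with ${value}`;
}

// String form: the value becomes part of code that is parsed later.
function buildTimeoutCode(someValue) {
  return "doStuff('" + someValue + "')";
}

// Function form: the value stays data and is passed as an argument.
function buildTimeoutCallback(someValue) {
  return () => doStuff(someValue);
}

const hostileValue = "'); alert(document.cookie); ('";

// The string form now contains an extra statement chosen by the attacker.
console.log(buildTimeoutCode(hostileValue));

// The function form treats the same value as an ordinary string.
console.log(buildTimeoutCallback(hostileValue)());
```

In a page this would be used as setTimeout(buildTimeoutCallback(someValue), 1000), or simply setTimeout(() => doStuff(someValue), 1000). The value still has to be placed safely into the script in the first place.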