One of the major things that really bugs me about the web is how poorly the average web programmer handles strings. Here we are, changing the way the world works on top of text-based protocols and languages like HTTP, MIME, JavaScript and CSS, yet some of the biggest issues that still plague us are cross-site scripting and mangled text due to aggressive filtering, mismatched encodings or overzealous escaping.
Almost two years ago I said I'd write down some formal notes on how to avoid issues like XSS, but I never actually posted anything. See, once I sat down to actually try and untangle the do's and don'ts, I found it extremely hard to build up a big coherent picture.
But here we are now, and I'm going to try anyway. The text is aimed at people who have had to deal with these issues, who are looking for a bit of formalism to frame their own solutions in.
Update: Google's DocType wiki has an excellent section with instructions for escaping for various contexts.
The problem
At the most fundamental level, all the issues mentioned above come down to this: you are building a string for output to a client of some sort, and one or more pieces of data you are using are triggering unintended effects, because they have an unexpected meaning to the client.
For example this little PHP snippet, repeated in variations across the web:
<?php print $user->name ?>'s profile
If $user->name contains JavaScript, your users are screwed.
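To make the failure concrete, here is a minimal sketch of the same pattern. It is written in Python purely for illustration, and the `profile_heading` helper and the sample names are made up. It mirrors the PHP snippet above by gluing a user name straight into HTML output.

```python
# Hypothetical helper mirroring the PHP snippet: no conversion is applied.
def profile_heading(name: str) -> str:
    return name + "'s profile"

# A harmless name works as expected.
print(profile_heading("Alice"))

# A hostile name is passed through verbatim, so a browser would treat
# the <script> element as code rather than as part of the name.
hostile = '<script>alert("gotcha")</script>'
print(profile_heading(hostile))
```

Nothing in the function distinguishes the two calls. The difference only appears once the client interprets the result.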
What this really comes down to is concatenation of data, or more literally, strings. So with that in mind, let's take a closer look at...
The humble string
What exactly is a string? It seems like a trivial question and I'm sure I'll come across as slightly nutty and overly analytic, but I really think a lot of people don't really know a good answer to this. Here's mine:
A string is an arbitrary sequence (1) of characters composed from a given character set (2), which acquires meaning when placed in an appropriate context (3).
This definition covers three important aspects of strings:
- They have no intrinsic restrictions on their content.
- They are useless blobs of data unless you know which symbols they represent.
- The represented symbols are meaningless unless you know the context to interpret them in.
This is a much more high-level concept than what you encounter in e.g. C++, where the definition is more akin to:
A string is an arbitrary sequence of bytes/words/dwords, in most cases terminated by a null byte/word/dword.
I think this latter definition is mostly useless for learning how to deal with strings, because it only describes their form, not their function.
So let's take a closer look at the three points above.
1. Representation of Symbols
They are useless blobs of data unless you know which symbols they represent.
This issue is relatively well known these days and is commonly described as encodings and character sets. A character set is simply a huge, numbered list of characters to draw from. An encoding is a mechanism for turning characters into sequences of bits. Theoretically they are independent of each other, but in practice, they are coupled together and the two terms are used interchangeably to describe a particular encoding/character set pair.
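A small sketch can show why the blob is useless without knowing the encoding. It uses Python for illustration, and the sample text is arbitrary. The same characters become different byte sequences under UTF-8 and Latin-1, and decoding with the wrong one silently mangles the text.

```python
text = "naïve café"

# The same ten characters become different byte sequences per encoding.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")
print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding with the wrong encoding raises no error, it just produces
# different (mangled) characters.
mangled = utf8_bytes.decode("latin-1")
print(mangled)

# Decoding with the matching encoding restores the original characters.
restored = utf8_bytes.decode("utf-8")
print(restored == text)
```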
You can't say much about them these days without delving into Unicode. Fortunately, Joel Spolsky has already written up a great crash course on Unicode, which explains it much better than I could.
For the purposes of security though, encodings are mostly irrelevant, as the problems occur regardless of the encoding used. So below, we'll talk about strings above the encoding level, as sequences of known characters. Like so:
2. Arbitrary content
They have no intrinsic restrictions on their content.
The second point seems self-evident, but can be rephrased into an important mantra for coding practices: there are no restrictions on a string's contents except those you enforce yourself. This makes strings fast and efficient, but also a possible carrier of unexpected data.
The typical response to this danger is to apply strict filtering to any textual input your program receives, before doing anything else with the data. The idea is to remove anything that may be interpreted later as unwanted mark-up or dangerous code. On the web, this usually means stripping out anything that looks like an HTML tag, doing funky things with ampersands and getting rid of quotes. While this approach is often advocated as an effective and bulletproof solution, it is rather short-sighted, inflexible and restricted in scope, and I strongly oppose it.
This is of course very different from regular input validation, like ensuring a selected value is one of a list of given options, or checking if a given input is numeric and in the accepted range. These are different from regular textual inputs, because the desired result is in fact not a string, but either a more restricted data type (like an integer) or a more abstract reference to an existing, internal object.
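As a rough illustration of this kind of regular validation, consider the sketch below. It is in Python, and the option list, the age range and the function names are all assumptions made for the example. The point is that the result is no longer free text: it is either an integer or a reference to a known option.

```python
ALLOWED_COLOURS = {"red", "green", "blue"}  # hypothetical option list

def parse_age(raw: str) -> int:
    """Turn raw input into an integer within an accepted range, or raise."""
    age = int(raw)  # raises ValueError for non-numeric input
    if not 0 <= age <= 150:
        raise ValueError("age out of range")
    return age

def parse_colour(raw: str) -> str:
    """Accept only one of the known options."""
    if raw not in ALLOWED_COLOURS:
        raise ValueError("unknown colour")
    return raw

print(parse_age("42"), parse_colour("green"))
```

Rejecting bad values here loses nothing, because there is no legitimate age of "abc". Free-form text has no such closed set of valid values.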
To understand why textual strings are such poor candidates for input validation, we need to look at the third point.
3. Different contexts
The represented symbols are meaningless unless you know the context to interpret them in.
Context, or the lack of it, is essentially the cause of issues such as SQL injection, XSS and HTTP hijacking. And, I think it is exactly because it is so essential to processing strings, that it is often taken as self-evident and forgotten.
Let's go back to our example string:
Everyone will see this string represents two English words. That's because people are great at deriving context from free floating pieces of data. However even with natural languages, confusion can arise. Take for example this string:
Bonjour
Is it a French greeting? Sure. But it is also the name used by Apple for its zero-configuration network stack. We can only know which one is meant, by knowing more about the context it is used in.
Now why bother with this trivial exercise? Because the web is all about textual protocols and languages. While people are great at deriving contexts automatically, computers aren't, and generally rely on strict semantics.
Imagine a discussion forum, and people post topics with the following subjects:
Each string contains the character < in a slightly different context. The first uses it as part of intended bold tags. The second seems to use the same bold tag, but is actually just talking about the tag instead of using it for markup. More formally, we can say the first string is written in an HTML context, the second in a plain-text context.
If we were to try and display these strings in the wrong context, we'd see tags printed when they should be interpreted, or text marked up when it should be shown as is.
Context conversion
To unify the two strings above, we can convert the plain-text string to HTML without loss of meaning, like so:
This kind of context conversion is commonplace under the term escaping and in this case, will replace any character that has a special meaning in HTML with its escaped equivalent. This ensures the resulting string still means the same thing in the new context.
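Here is a minimal sketch of that conversion, using Python's standard `html` module. The subject line is made up for the example. It shows that the conversion is reversible, which is what "without loss of meaning" amounts to in practice.

```python
import html

# A plain-text subject that talks about a tag instead of using it.
plain = "How do I use the <b> tag & friends?"

# Plain text -> HTML: characters with a special meaning are escaped.
as_html = html.escape(plain, quote=False)
print(as_html)  # How do I use the &lt;b&gt; tag &amp; friends?

# HTML -> plain text recovers the original, so nothing was lost.
print(html.unescape(as_html) == plain)
```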
On the Web, Contexts happen
Usually, the lesson above of escaping input to HTML-safe text is where the discussion about XSS ends. However, armed with only the knowledge that HTML-special characters must be escaped to be safe, it can be hard to see why in fact you should not just filter all your data on input to ensure it contains none of these pesky characters in the first place. After all, how many people really need to use angle brackets and ampersands anyway?
Well, first of all, I think that's underestimating certain users. The following subject might not be so rare on a message board, yet would be mangled by typical aggressive character stripping:
More fundamentally though, it implies that there is only one kind of string context used on the web. Nothing could be further from the truth. Let's look at three different, common contexts.
HTML
We take a simple snippet of HTML by itself with some assumed user-generated text in it:
We look at some different segments of the snippet, and look at what 'forbidden characters' would break or change the semantics of each.
| Snippet | Forbidden | Escaped as |
|---|---|---|
| attribute text | " & | &quot; &amp; |
| inline text | < & | &lt; &amp; |
For example, quotes are disallowed in attribute text values, because otherwise a string with a quote could alter the meaning of the HTML snippet considerably:
Note:
- All ampersands need to be escaped (including those in URLs) for it to validate. HTML's stricter cousin XML will refuse to parse unescaped ampersands as well, and requires apostrophes to be escaped inside single-quoted attribute values.
- XSS does not necessarily require using angle-brackets. In the attribute context, all you need is a ".
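The attribute case can be sketched as follows. It is in Python, and the link markup, the helper names and the payload are invented for the example. The payload contains no angle brackets at all, yet it still breaks out of the attribute unless quotes are escaped too.

```python
import html

def link_unsafe(title: str) -> str:
    # No conversion: the title is pasted into the attribute as is.
    return '<a href="/profile" title="' + title + '">profile</a>'

def link_safe(title: str) -> str:
    # quote=True also escapes quote characters, which the attribute
    # context needs in addition to & and <.
    return '<a href="/profile" title="' + html.escape(title, quote=True) + '">profile</a>'

payload = '" onmouseover="alert(1)'
print(link_unsafe(payload))  # the title ends early and a new attribute appears
print(link_safe(payload))    # the quotes become &quot; and stay inside the title
```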
URLs
The situation is more complicated with URLs. The common HTTP URL for example:
http://user:password@host.com/path?variable=value
| Snippet | Forbidden | Escaped as |
|---|---|---|
| (all) | <>"#%{}\|\^~[]` (and non-printables) | %3C %3E %22 %23 ... |
| user | @: | %40 %3A |
| password | : | %3A |
| host.com | /@ | disallowed |
| path | ? | %3F |
| variable | &= | %26 %3D |
| value | &+ | %26 %2B |
Note:
- Many forget that a + in a query value actually means a space, not a plus.
- Even completely valid URLs can still be malicious, through the javascript: protocol.
- Defined by RFC 1738 and RFC 2616.
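To illustrate these sub-contexts, here is a sketch with Python's `urllib.parse`. The sample value, the `q` parameter name and the `has_safe_scheme` helper are assumptions for the example, and the scheme check is deliberately simplistic. It escapes the same value for a path segment and for a query value, shows the + pitfall, and rejects a javascript: URL by its scheme.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit

value = "1+1=2 & more"

# Path segment: escape everything reserved, including "/" and "?".
path_part = quote(value, safe="")

# Query string: urlencode() escapes "&", "=" and "+", and writes spaces as "+".
query = urlencode({"q": value})
url = "http://host.com/" + path_part + "?" + query
print(url)

# Decoding the query restores the value exactly.
decoded = parse_qs(urlsplit(url).query)["q"][0]

# An unescaped "+" in a query value is read back as a space.
plus_means_space = parse_qs("q=1+1")["q"][0]

def has_safe_scheme(candidate: str) -> bool:
    """Hypothetical check: accept only URLs with an explicit http(s) scheme."""
    return urlsplit(candidate).scheme in {"http", "https"}

print(decoded == value, repr(plus_means_space), has_safe_scheme("javascript:alert(1)"))
```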
MIME headers
Several protocols such as HTTP and SMTP employ the same mechanism of providing metadata for pieces of content. This includes data such as e-mail subjects, senders, cookie headers or HTTP redirects, likely to contain user-generated data.
Content-Type: text/html; charset=utf-8
| Snippet | Forbidden | Escaped as |
|---|---|---|
| message subject | CRLF (if not followed by space), ()<>@,;:\"/[]?= + any non-printable | =?...?= |
Note:
- CRLF sequences without trailing space start a new field and can be used for header injection.
- Lines should be wrapped at 80 columns with CRLF + space.
- Defined by RFC 2045.
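A sketch of the header case, using Python's standard `email.header` module: the `subject_header` helper is hypothetical, and refusing line breaks outright is just one possible policy. It encodes the subject into the =?...?= form and rejects the raw CR and LF characters that header injection relies on.

```python
from email.header import Header, decode_header

def subject_header(subject: str) -> str:
    """Build a Subject field, refusing raw line breaks (hypothetical helper)."""
    if "\r" in subject or "\n" in subject:
        raise ValueError("line break in header value")
    # With an explicit utf-8 charset, Header.encode() emits encoded-words
    # of the form =?charset?encoding?text?=
    return "Subject: " + Header(subject, "utf-8").encode()

line = subject_header("Héllo wörld")
print(line)

# Decoding the encoded-word gives back the original characters.
encoded = line[len("Subject: "):]
roundtrip = b"".join(part for part, _ in decode_header(encoded)).decode("utf-8")
print(roundtrip)
```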
Lolwut?
If the above three tables seem complicated and confusing, that's normal. It should be obvious that each of the three contexts is unique and has its own special range of 'forbidden characters' for user input (and even some sub-contexts). From this perspective, it would be impossible to define a safe input filtering mechanism for text on the web that didn't destroy almost all legitimate content.
You would have to filter or escape only for a single context, which would create a situation where the exact same approach to a problem can be safe in some cases, but unsafe in others, thus promoting bad coding practices.
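The "safe in some cases, unsafe in others" problem can be sketched like this. It is in Python, and the stored value and the `q` parameter are made up. A value escaped once on input, for HTML only, is then reused in a URL query, where its ampersand is misread as a parameter separator.

```python
import html
from urllib.parse import parse_qs, quote_plus

raw = "Tom & Jerry"

# Escaped once, on input, with only the HTML context in mind.
stored = html.escape(raw)  # "Tom &amp; Jerry"

# Reused in a URL query without further conversion: the "&" inside
# "&amp;" is read as a parameter separator and the value is cut short.
broken = parse_qs("q=" + stored)["q"][0]

# Converting the raw value for the URL context at output time works.
intact = parse_qs("q=" + quote_plus(raw))["q"][0]

print(repr(broken), repr(intact))
```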
With the selection above, I also ignored other important contexts (notably JS/JSON or SQL). However, the fact that I was able to make my point using only old school Web 1.0 techniques should show how this problem becomes even hairier in today's Web 2.0.
So what then?
The right way around string incompatibilities is to use appropriate conversions to change content from one context to another without changing its meaning, and do so when outputting text in a particular instance. We already did it above for the plain-text example, but similar conversions can be made in almost every other instance. Most web languages (like PHP) contain pre-built and tested functions for doing this.
Whenever you put strings together, you need to ask yourself what context the strings are in. If they are not the same, an appropriate conversion needs to be made, or you can run into bugs or worse, exploits.
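Contexts are often nested, so one output can need more than one conversion. The sketch below is in Python, and the `/search` path, the parameter names and the `search_link` helper are assumptions for the example. It takes a plain-text term into a URL query, then takes that URL into an HTML attribute, and separately takes the term into inline HTML for the link label.

```python
import html
from urllib.parse import urlencode

def search_link(term: str) -> str:
    # Step 1: plain text -> URL query context.
    url = "/search?" + urlencode({"q": term, "page": "1"})
    # Step 2: URL -> HTML attribute context. The "&" that separates the
    # parameters is itself escaped to "&amp;" here.
    href = html.escape(url, quote=True)
    # Separately: plain text -> HTML inline context for the visible label.
    label = html.escape(term, quote=False)
    return '<a href="' + href + '">' + label + "</a>"

print(search_link('a"b & <c>'))
```

Each conversion is chosen by looking at where the string comes from and where it is going, not by what characters it happens to contain.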
For the snippet at the very beginning, the appropriate fix is:
<?php print htmlspecialchars($user->name) ?>'s profile
(or an appropriate wrapper from your development platform of choice)

Why htmlspecialchars?
Just curious why you suggested that over the check_plain function?
Any thoughts on the meta process of escaping this post?
The display of different characters within one big string (the content) had to deal with the issue of using html markup and displaying html tags, and using and displaying the escaped version of some special characters as well.
Was this cumbersome? You have the codefilter module in place for such display– what do you feel about the way this leaves the data stored internally?
Steven, Your articles that
Steven,
Your articles that deal with high level concepts are always well done. You do a good job at setting up the problems, and the tone is appropriate (from my pov). I would like to see more such articles on other high level subjects, whether they touch drupal or not. In this case, it's nice that there is some drupal context, but it's not always necessary for the subject/article to be worthwhile.
If there are any gripes, I would like to see the last section expanded, covering what to do and when to use one method vs. another.
Thanks for taking the time to think this out.
regards!
Filtering vs. interpreting
Thanks Steven for raising more awareness for this important topic. I once read (can’t remember where anymore) that filtering is not the best solution for tackling input validation problems, at least not for HTML. While it’s probably possible to create a filtering function that filters out all malicious elements recognized by today’s browsers, it is not future proof since browsers might accept other forms of elements in the future (e.g. external XSL stylesheets could embed custom JavaScript code) which could be exploited in some way.
Instead, the article suggested using a parser/tokenizer that tries to make sense of the user input (e.g. HTML) and rebuilds the input based on the gathered information. By using this method, it’s less likely that malicious code still could go through the “filter” due to bad regexes, unknown elements, unforeseen uses etc. The additional advantage of this approach is that it can be somewhat permissive in terms of syntax since the tokenizer may be able to parse malformed input as well and construct a valid output.
This technique is for example used in the YUICompressor, a Java application for compressing JavaScript. Contrary to many other JavaScript compression solutions, it does not remove unneeded elements from the script but feeds it into Mozilla’s Rhino parser, retrieves the parsed JavaScript and reassembles the script based on this information. Rhino does not filter the JavaScript but interprets it.
The advantage is obvious: JavaScript has a permissive syntax (e.g. you can leave out the semicolon if your command is terminated by a \n) which causes all kinds of problems when compressing it. When using a conventional regex compressor, you have to feed in completely valid and syntactically immaculate files so that your file is still parseable after the compression took place. Using a parsing engine in the first place to make sense of the input and using the syntactically correct result removes that problem.
@Nicholas Thompson: This post is not specifically geared to Drupal, not even to PHP. Additionally, Steven mentioned that there might be wrapper/alternative functions for your development platform.
Marking up posts
@Benjamin: actually I hardly used codefilter here... most of it was just done by hand. Given the subject matter, escaping a few characters here and there is hardly difficult.
Of course, being the original author of codefilter, I'm a big proponent of it, even though it is kind of a contradiction or at least an inconsistency with HTML. But generally it causes far fewer problems than it solves.
The codefilter code tag does what most people think the pre tag does. Pre only changes the way whitespace and newlines are processed, but it does not negate the need to escape angles and ampersands.
I've even seen numerous posts before from people who think something is wrong with their vanilla Drupal install without codefilter even.
JS parsing for compression
@Konstantin: the JS parsing/compressing is a very interesting idea and an interesting reuse of code. But, for JS compression, all you need to do is tokenize the code and then reassemble it with appropriately compact expansions for everything.
This is already part of what the Kses-derived XSS filter in Drupal does: it goes through the HTML and uses a very conservative parser to normalize everything into nicely formed tags, as well as normalizing escaped characters.
But then it also uses a very strict white-list of tags and attributes, with an additional rule that anything that remotely looks like a URL has to start with a valid protocol.
There are two things it doesn't do... enforcing a properly nested tag structure (this is what htmlcorrector does afterwards), and ensuring the resulting mark-up actually is valid or at least meaningful HTML.
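The parse, normalize and white-list idea can be sketched roughly as follows. This uses Python's standard `html.parser` and is not the Kses or Drupal code: the tag list, the attribute list and the protocol list are assumptions, and like the filter described above it makes no attempt to enforce proper nesting.

```python
import html
from html.parser import HTMLParser
from urllib.parse import urlsplit

ALLOWED_TAGS = {"a": {"href"}, "b": set(), "em": set()}  # assumed white-list
ALLOWED_SCHEMES = {"http", "https", "mailto"}            # assumed protocols

class Rebuilder(HTMLParser):
    """Parse input leniently, then re-emit only white-listed markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED_TAGS:
            return  # unknown tag is dropped, its text content is kept as text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_TAGS[tag] or value is None:
                continue
            # URLs must start with an explicitly allowed protocol.
            if name == "href" and urlsplit(value).scheme not in ALLOWED_SCHEMES:
                continue
            kept.append(' %s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text is re-escaped, which also normalizes escaped characters.
        self.out.append(html.escape(data, quote=False))

def rebuild(markup: str) -> str:
    parser = Rebuilder()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)

print(rebuild('<b onclick="x()">hi</b><a href="javascript:alert(1)">x</a>'))
```

The output is assembled only from pieces the parser understood, so anything it did not recognize never reaches the page as markup.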
I don't think the issue of meaningful HTML can really be solved in a filter. In theory, HTML can only be validated as a whole document (i.e. a themed page), and in practice, both people and browsers go beyond the spec.
The equivalent in the JS situation would be to verify if the resulting code actually runs and doesn't try anything harmful (e.g. by exploiting ActiveX objects in IE). This is a step up from just making it syntactically valid, and not at all obvious to do, given the endless amount of obfuscation one can employ in a programming language.
You'd not only need a JS parser, but a full JS virtual machine that emulates a browser with honeypots for all possible exploits.
'Why htmlspecialchars()'
@Nicholas / Anonymous: As I said in the beginning, the point of this post was to take a more theoretical approach to the subject that is (mostly) platform agnostic... I think there are already countless recipes for safe code around, including the Drupal documentation that lists its custom wrappers. But merely applying them on a case by case basis, you rarely get the big picture.
What's interesting for me is the story between the snippets at the beginning and the end.
I do want to do a follow up that describes the nesting of contexts that often happens online though... from a GET query variable to a URL, to an HTML attribute, etc. But this would really have to be specific to a particular platform/environment.
The equivalent in the JS
I know you were pointing out the holes of the original commenter's idea but that still wouldn't do the trick... it's been a while since CS311 but I'm pretty sure that if you could determine if code is malicious, then you could solve the halting problem. The halting problem is undecidable and therefore malicious code detection is undecidable.
Thanks for the article
Cool article. I hope you will write more about safe string theory in your coming posts.
Regards,
Sudheer
Halting
@Andrew: You're right of course. I was more thinking about the fact that "honeypots for all possible exploits" was by itself a pretty unrealistic thing. We have to go beyond mere browser exploits and think about privacy violations (such as history tracking) and sticky trojan JS code, which doesn't necessarily try to break out of the browser sandbox.
There is a slight twist though, which is that we know that it is a virus' goal to exploit a vulnerability in as many environments as possible. Given such a (hypothetical) set of perfect honeypots, while it would be impossible to prove that code is safe, I think you could make the chance of it being malicious arbitrarily small, by running enough trials in enough randomized environments. Furthermore, the chance of detection would be proportional to the chance of infection on a real target, meaning only very ineffective exploits would sneak past.
Prohibit print
Prohibiting or deprecating the direct use of print without string conversion would promote safe programming. E.g. by redefining print to require an additional argument $context, and adding automatic conversion based on that context. This would force the programmer to think about the context.
<?php drupal_print('html', $user->name) ?>'s profile
Mollom?
I see a business opportunity here ;-)
Prohibiting print
Actually this is a topic I've also thought about before and want to follow up with. If we have context information, how can we avoid XSS errors?
One can imagine two implementations of this: 1) a run-time mechanism that actively chooses the appropriate context conversion and ensures safe code automatically, or 2) a code verification tool that merely identifies bad concatenations for the programmer to resolve.
However, trying to do this in Drupal/PHP, in the way you proposed, would not work, because PHP itself has no way of tagging strings with metadata. Your drupal_print() only knows about the desired target context, not the source context.
In theory you could build your own custom string object that carries this information, but the performance hit would probably be pretty bad, not to mention you'd have to rewrite all your code.
People have actually been experimenting with some string tagging mechanisms in other languages, but afaik it only works by marking them as either 'dirty' or clean (i.e. a binary flag). This is only useful when targeting a single context.
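For what it's worth, a custom string object that carries its source context could look something like the sketch below. It is in Python, and the `Tagged` class, the context names and the conversion table are all invented for illustration. The output function then knows both the source and the target context, and refuses any combination it has no conversion for.

```python
import html
from dataclasses import dataclass
from urllib.parse import quote

@dataclass(frozen=True)
class Tagged:
    """A string that carries the context it was written in."""
    text: str
    context: str  # "plain", "html" or "url" in this sketch

# Known conversions, keyed by (source context, target context).
CONVERTERS = {
    ("plain", "html"): lambda s: html.escape(s, quote=True),
    ("plain", "url"): lambda s: quote(s, safe=""),
}

def emit(value: Tagged, target: str) -> str:
    """Return the text converted for the target context, or raise."""
    if value.context == target:
        return value.text  # already in the right context
    key = (value.context, target)
    if key not in CONVERTERS:
        raise ValueError("no conversion from %s to %s" % key)
    return CONVERTERS[key](value.text)

print(emit(Tagged("<b>", "plain"), "html"))  # escaped
print(emit(Tagged("<b>", "html"), "html"))   # passed through
```

Unlike a binary dirty/clean flag, the tag says which context the text is in, so the same value can be sent to more than one target.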
Nicely said
Really nice article, Steven. Thanks for writing it.
From a strictly writing point of view, to write a more persuasive article, it might be more effective to amplify and emphasize this point at the end, so that the article ends on a stronger note: "Whenever you put strings together, you need to ask yourself what context the strings are in. If they are not the same, an appropriate conversion needs to be made, or you can run into bugs or worse, exploits." It's very clear as is -- I just felt there was room for a bit more forcefulness or something at the end.
You might make techniques for handling context conversions the subject of another essay, and simply reference the next article from this one -- if you are interested in writing more on the subject.