Acko.net

Safe String Theory for the Web

2008-04-03T00:00:00+02:00

One of the major things that really bugs me about the web is how poorly the average web programmer handles strings. Here we are, changing the way the world works on top of text based protocols and languages like HTTP, MIME, JavaScript and CSS, yet some of the biggest issues that still plague us are cross-site scripting and mangled text due to aggressive filtering, mismatched encodings or overzealous escaping.

Almost two years ago I said I'd write down some formal notes on how to avoid issues like XSS, but I never actually posted anything. See, once I sat down to actually try and untangle the do's and don'ts, I found it extremely hard to build up a big coherent picture.

But I'm going to try anyway. The text is aimed at people who have had to deal with these issues, who are looking for a bit of formalism to frame their own solutions in.

The Problem

At the most fundamental level, all the issues mentioned above come down to this: you are building a string for output to a client of some sort, and one or more pieces of data you are using are triggering unknown effects, because they have an unexpected meaning to the client.

For example this little PHP snippet, repeated in variations across the web:

<?php print $user->name ?>'s profile

If $user->name contains malicious JavaScript, your users are screwed.

What this really comes down to is concatenation of data, or more literally, strings. So with that in mind, let's take a closer look at...

The Humble String

What exactly is a string? It seems like a trivial question, but I really think a lot of people don't really know a good answer to this. Here's mine:

A string is an arbitrary sequence (1) of characters composed from a given character set (2), which acquires meaning when placed in an appropriate context (3).

This definition covers three important aspects of strings:

They have no intrinsic restrictions on their content.
They are useless blobs of data unless you know which symbols they represent.
The represented symbols are meaningless unless you know the context to interpret them in.

This is a much more high-level concept than what you encounter in e.g. C++, where the definition is more akin to:

A string is an arbitrary sequence of bytes/words/dwords, in most cases terminated by a null byte/word/dword.

This latter definition is mostly useless for learning how to deal with strings, because it only describes their form, not their function.

So let's take a closer look at the three points above.

1. Representation of Symbols

They are useless blobs of data unless you know which symbols they represent.

This issue is relatively well known these days and is commonly described as encodings and character sets. A character set is simply a huge, numbered list of characters to draw from; an 'alphabet' like ASCII or Unicode. An encoding is a mechanism for turning characters (i.e. numbers) into sequences of bits. For example, Latin-1 uses one byte per character, and UTF-8 uses 1-4 bytes per character. Theoretically encodings and character sets are independent of each other, but in practice the two terms are used interchangeably to describe one particular pair.
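To make the distinction concrete, here is a minimal sketch in Python (used here purely for illustration; the sample character is my own choice, not from the text above). It shows one character, its number in the Unicode character set, and the different byte sequences that two encodings produce for it.

```python
# One character, one number in the character set, but different byte
# sequences depending on the encoding.
char = "\u00e9"                          # e with acute accent (sample input)

code_point = ord(char)                   # position in the Unicode character set
latin1_bytes = char.encode("latin-1")    # one byte per character
utf8_bytes = char.encode("utf-8")        # one to four bytes per character

print(hex(code_point))                   # 0xe9
print(latin1_bytes)                      # b'\xe9'
print(utf8_bytes)                        # b'\xc3\xa9'

# Decoding with the wrong encoding does not fail; it silently yields
# two unrelated characters instead of the original one.
mangled = utf8_bytes.decode("latin-1")
print(len(mangled))                      # 2
```

The last two lines are the "mismatched encodings" problem in miniature: nothing raises an error, the text is simply wrong.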

You can't say much about them these days without delving into Unicode. Fortunately, Joel Spolsky has already written up a great crash course on Unicode, which explains it much better than I could. The main thing to take away is that every legacy character set in use can be converted to Unicode and back again without loss. That makes Unicode the only sensible choice as the internal character set of a program, and on the web.

For the purposes of security though, encodings and character sets are mostly irrelevant, as the problems occur regardless of which you use. All you need to do is be consistent, making sure your code can't get confused about which encoding it's working with. So below, we'll talk about strings above the encoding level, as sequences of known characters. Like so:

String Theory

2. Arbitrary Content

They have no intrinsic restrictions on their content.

The second point seems self-evident, but can be rephrased into an important mantra for coding practices: there are no restrictions on a string's contents except those you enforce yourself. This makes strings fast and efficient, but also a possible carrier of unexpected data.

The typical response to this danger is to apply strict filtering to every textual input your program has, before doing anything else with the data. The idea is to remove anything that may be interpreted later as unwanted mark-up or dangerous code. On the web, this usually means stripping out anything that looks like an HTML tag, doing funky things with ampersands and getting rid of quotes. While this is an approach that is often advocated as an effective and bulletproof solution, it is rather short-sighted and limited in scope, and I strongly oppose it.

This is of course very different from regular input validation, like ensuring a selected value is one of a given list of options, or checking if a given input is numeric and in the accepted range. These are different from regular textual inputs, because the desired result is in fact not a string, but either a more restricted data type (like an integer) or a more abstract reference to an existing, internal object.
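As a rough sketch of the kind of validation meant here, in Python: the option list, the function names and the age range below are all hypothetical, picked only to illustrate the two cases. In both, the result is no longer free text but a value from a fixed set or an integer.

```python
ALLOWED_COLORS = {"red", "green", "blue"}    # hypothetical list of options


def parse_color(raw: str) -> str:
    """Return the input only if it is one of the known options."""
    if raw not in ALLOWED_COLORS:
        raise ValueError("unknown color")
    return raw


def parse_age(raw: str, lowest: int = 0, highest: int = 150) -> int:
    """Convert the input to an integer and check the accepted range."""
    value = int(raw)             # raises ValueError for non-numeric input
    if not lowest <= value <= highest:
        raise ValueError("age out of range")
    return value


print(parse_color("red"))        # red
print(parse_age("42"))           # 42
```

Neither function strips or rewrites anything: the input is either accepted as the restricted type or rejected outright.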

To understand why textual strings are such poor candidates for input validation, we need to look at the third point.

3. Different Contexts

The represented symbols are meaningless unless you know the context to interpret them in.

Context, or the lack of it, is essentially the cause of issues such as SQL injection, XSS and HTTP hijacking. And, I think it is exactly because it is so essential to processing strings, that it is often taken as self-evident and forgotten.

Let's go back to our example string:

String Theory

Everyone will see this string represents two English words. That's because people are great at deriving context from free floating pieces of data. However even with natural languages, confusion can arise. Take for example this string:

Bonjour

Is it a French greeting? Sure. But it is also the name used by Apple for its zero-configuration network stack. We can only know which one is meant, by knowing more about the context it is used in.

Now why bother with this trivial exercise? Because the web is all about textual protocols and languages. While people are great at deriving contexts automatically, computers aren't, and generally rely on strict semantics.

Imagine a discussion forum, and people post topics with the following subjects:

<b> is deprecated

Each string contains the character < in a slightly different context. The first uses it as part of intended bold tags. The second seems to use the same bold tag, but is actually just talking about the tag instead of using it for markup. More formally, we can say the first string is written in an HTML context, the second in a plain-text context.

If we were to try and display these strings in the wrong context, we'd see tags printed when they should be interpreted, or text marked up when it should be shown as is.

Context Conversion

To unify the two strings above, we can convert the plain-text string to HTML without loss of meaning, like so:

&lt;b&gt; is deprecated

This kind of context conversion is commonplace under the term escaping and in this case, will replace any character that has a special meaning in HTML with its escaped equivalent. This ensures the resulting string still means the same thing in the new context.
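In Python (used here only as an illustration language), this particular conversion is available in the standard library as html.escape. A minimal sketch, using the forum subject from above as sample data:

```python
import html

plain_subject = "<b> is deprecated"       # a string in the plain-text context
html_subject = html.escape(plain_subject)  # the same meaning, in the HTML context

print(html_subject)                        # &lt;b&gt; is deprecated

# The conversion is lossless: unescaping gives back the original text.
round_trip = html.unescape(html_subject)
print(round_trip == plain_subject)         # True
```

Nothing was removed or filtered; the string was only rewritten so that an HTML parser reads it as the same text.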

On the Web, Contexts Happen

Usually, the lesson above of escaping input to HTML-safe text is where the discussion about XSS ends. However, armed with only the knowledge that HTML-special characters must be escaped to be safe, it can be hard to see why in fact you should not just filter all your data on input to ensure it contains none of these pesky characters in the first place. After all, how many people really need to use angle brackets and ampersands anyway?

Well, first of all, I think that's underestimating certain users. The following subject might not be so rare on a message board, yet would be mangled by typical aggressive character stripping:

<_< so sad

More fundamentally though, it implies that there is only one kind of string context used on the web. Nothing could be further from the truth. Let's look at three different, common contexts.

HTML

We take a simple snippet of HTML by itself with some assumed user-generated text in it:

attribute text">inline text

We look at some different segments of the snippet, and look at what 'forbidden characters' would break or change the semantics of each.

Snippet	Forbidden	Escaped as
attribute text	" &	&quot; &amp;
inline text	< &	&lt; &amp;

For example, quotes are disallowed in attribute text values, because otherwise a string with a quote could alter the meaning of the HTML snippet considerably:

attribute with injected" property="doEvil() ">inline text

Note:

All ampersands need to be escaped (including those in URLs) for it to validate. HTML's stricter cousin XML will refuse to parse unescaped ampersands as well, and even requires that apostrophes be escaped too.
XSS attacks do not necessarily involve angle-brackets. In the attribute context, all you need is a " to wreak havoc.

URLs

The situation is more complicated with URLs. The common HTTP URL for example:

http://user:password@host.com/path/?variable=value&foo=bar#

Snippet	Forbidden	Escaped as
(all)	<>"#%{}\|\^~[]` (and non-printables)	%3C %3E %22 %23 ...
user	@:	%40 %3A
password	:	%3A
host.com	/@	disallowed
path	?	%3F
variable	&=	%26 %3D
value	&+	%26 %2B

Note:

Many forget that a + in a query value actually means a space, not a plus.
Even completely valid URLs can still be malicious, through the javascript: pseudo-protocol.
Defined by RFC 1738 and RFC 2616.
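A short Python sketch of the table above, using the standard urllib.parse module. The sample values and the has_safe_scheme helper are my own illustrations, and the scheme whitelist is just one possible policy.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlencode, urlsplit

# Path segment: with safe="" the reserved characters / and ? are escaped too.
print(quote("a/b?c", safe=""))             # a%2Fb%3Fc

# Query value: a literal plus becomes %2B, because + itself means a space.
print(quote_plus("1+1 = 2&more"))          # 1%2B1+%3D+2%26more

# urlencode() builds a whole query string from key/value pairs.
query = urlencode({"variable": "value&foo=bar", "q": "C++"})
print(query)                               # variable=value%26foo%3Dbar&q=C%2B%2B

# Parsing the query string gives back the original values.
print(parse_qs(query))


def has_safe_scheme(url: str) -> bool:
    """Hypothetical policy: only absolute http(s) URLs are accepted."""
    return urlsplit(url).scheme in {"http", "https"}


print(has_safe_scheme("http://host.com/path/"))   # True
print(has_safe_scheme("javascript:alert(1)"))     # False
```

Note that escaping and scheme checking are separate concerns: a URL can be perfectly escaped and still point somewhere you do not want.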

MIME Headers

Several protocols such as HTTP and SMTP employ the same mechanism of providing metadata for pieces of content. This includes data such as e-mail subjects, senders, cookie headers or HTTP redirects, likely to contain user-generated data.

Subject: message subject
Content-Type: text/html; charset=utf-8

Snippet	Forbidden	Escaped as
message subject	CRLF (if not followed by space), ()<>@,;:\"/[]?= + any non-printable	=?UTF-8?B?...?=

Note:

CRLF sequences without trailing space start a new field and can be used for header injection.
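Long lines should be folded with CRLF + space; RFC 2822 recommends at most 78 characters per line.
Defined by RFC 2822 (header syntax and folding), RFC 2045 and RFC 2047 (encoded words such as =?UTF-8?B?...?=).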
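As an illustration in Python, the standard email package can do both jobs: producing encoded words for header values and refusing values that contain line breaks. The subject text and address below are made-up sample data, and the behaviour shown is that of Python's default email policy, not something the protocols themselves enforce.

```python
from email.header import Header, decode_header
from email.message import EmailMessage

subject = "Caf\u00e9 news"         # sample subject with a non-ASCII character

# Header() turns the text into an RFC 2047 encoded word.
encoded = Header(subject, "utf-8").encode()
print(encoded)

# Decoding gives back the original text.
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset) == subject)      # True

# Header injection attempt: a CRLF followed by a new field name.
msg = EmailMessage()
try:
    msg["Subject"] = "hello\r\nBcc: victim@example.com"
    injected = True
except ValueError:
    injected = False                       # the default policy rejects line breaks
print(injected)                            # False
```

If you assemble headers by hand instead, the CRLF check is yours to do.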

Lolwut?

If the above three tables seem complicated and confusing, that's normal. It should be obvious that each of the three contexts is unique and has its own special range of 'forbidden characters' for user input (and even some sub-contexts). From this perspective, it would be impossible to define a safe input filtering mechanism for text on the web that didn't destroy almost all legitimate content.

You would have to filter or escape only for a single context, which would create a situation where the exact same approach to a problem can be safe in some cases, but unsafe in others, thus promoting bad coding practices. In particular, any framework that has a concept of tainted strings that are auto-escaped has it wrong: it's marking strings as safe/unsafe, when it should be treating them as plain text, unsafe HTML, filtered HTML, MIME value, etc. Whether that happens implicitly through data flow, or explicitly through typing depends on the architecture.
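One way to picture the "explicit typing" option is the following Python sketch. The class and function names are hypothetical and this is not how any particular framework does it; the point is only that a string carries its context with it, rather than a safe/unsafe flag.

```python
import html


class PlainText(str):
    """Text with no markup meaning; must be converted before use in HTML."""


class Html(str):
    """A string that is already valid in the HTML context."""


def to_html(value) -> Html:
    """Convert a value to the HTML context, based on its current context."""
    if isinstance(value, Html):
        return value                        # already in the target context
    return Html(html.escape(str(value)))    # anything else is treated as plain text


def render_title(name) -> Html:
    """Concatenate only after both pieces are in the same context."""
    return Html("<h2>" + to_html(name) + "'s profile</h2>")


print(render_title(PlainText("<b>")))       # <h2>&lt;b&gt;'s profile</h2>
print(render_title(Html("<b>Bob</b>")))     # <h2><b>Bob</b>'s profile</h2>
```

A MIME value or a URL component would get its own type and its own conversion in the same way.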

With the selection above, I also ignored other important contexts (notably JS / JSON or SQL). However, the fact that I was able to make my point using only old school Web 1.0 techniques should show how this problem becomes even hairier in today's Web 2.0.

So What Then?

The right way around string incompatibilities is to use appropriate conversions to change content from one context to another without changing its meaning, and do so when outputting text in a particular instance. We already did it above for the plain-text example, but similar conversions can be made in almost every other instance. Most web languages (like PHP) contain pre-built and tested functions for doing this.

Whenever you put strings together, you need to ask yourself what context the strings are in. If they are not the same, an appropriate conversion needs to be made, or you can run into bugs or worse, exploits.

The best way is to mentally color in all your variables by their type: text, HTML, SQL, JS, MIME, etc. You should aim to keep data in the most natural form possible, until right before output. Storing possible XSS code in your database might seem like it's tempting fate, but if you apply proper escaping on the way out, you can be sure that it's safe, no matter where the data comes from or what protocol it's being sent on. It turns out to be a far more paranoid approach than input filtering, because you don't have to trust anything you didn't hardcode.

So, use prepared SQL statements to leave the escaping to your database interface. Make your HTML templates escape dynamic variables implicitly if possible. Never compose JSON manually. Finally, be careful when tying web code into other systems: a Bash command containing a user-supplied filename is an exploit waiting to happen.
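A compact Python sketch of three of these recommendations, using only the standard library. The table name, the hostile sample input and the wc command are illustrative assumptions.

```python
import json
import shlex
import sqlite3

name = "Robert'); DROP TABLE users;--"      # hostile sample input

# SQL: a parameterized query keeps the data out of the SQL text entirely.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT)")
db.execute("INSERT INTO users (name) VALUES (?)", (name,))
stored = db.execute("SELECT name FROM users").fetchone()[0]
print(stored == name)                       # True: stored verbatim

# JSON: let the serializer handle quotes, backslashes and control characters.
payload = json.dumps({"name": 'He said "hi"\n'})
print(payload)

# Shell: quote the filename for the shell context before concatenating.
filename = "report; rm -rf ~.txt"
command = "wc -l " + shlex.quote(filename)
print(command)                              # wc -l 'report; rm -rf ~.txt'
```

For the shell case, passing an argument list to subprocess.run without a shell avoids the string context altogether, which is usually the better choice.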

For the example snippet at the very beginning, the appropriate fix is:

<?php print htmlspecialchars($user->name) ?>'s profile

The trick is in understanding why that call goes there, and not somewhere else. Ideally however, the framework has been constructed so that context issues simply cannot happen, by passing data through interfaces, and letting well-tested code handle the concatenation for you.

Update: Google's DocType wiki has an excellent section with instructions for escaping for various contexts.

Because there are too many serious websites around

2008-02-07T00:00:00+01:00

I finished designing and building this year's edition of LeuvenSpeelt.be, a site that promotes student theater at my old university. You can read about the background in my previous blog posts.

The site is a simple Drupal installation with heavy content and theme work. The design is heavy on graphics and built as an experimental semi-fluid layout that adapts to different screen resolutions. Peripheral design elements are shifted in or out of the browser frame to make more space for content as needed.

Tools used: Photoshop, Illustrator, 3D Studio Max, TextMate. Uses the beautiful Fontin font available freely from Jos Buivenga's exljbris foundry.

And no, no easter eggs this year.

On not doing Drupal anymore

2008-01-17T00:00:00+01:00

Various people have prodded me to explain my recent involvement in Drupal, or rather the lack of it. Unfortunately, I haven't found a way to do so in a way that is constructive and tactful, especially not when it comes to other contributors. Like Soylent Green, Open Source is made of people, and it's these people who are at the basis of a mountain of frustration that has driven me off. At the end of the day, I feel that the vast majority of contributors is not willing or not able to apply the level of diligence that I apply to my own Drupal work. This is both in terms of technical background and research, as well as in the actual execution and quality assurance. I find that too little effort is spent on polishing things so they really shine, e.g. in the actual development (back-end and UI), but equally in, say, outreach and marketing. It also seems that any exceptional efforts that go beyond this typical fare are often wasted, because the author invariably has to fight a prolonged (and sometimes never-ending) battle to keep the polish from being obliterated by someone else's refactoring.

This is further aggravated by the fact that a certain group of people always seems ready to chime in their two cents (or more) in long, repetitive e-mail threads or project issues, while not actually contributing to the end result or even bringing solid, technical arguments to the table. Said persons seem more interested in maintaining the business revenue that Drupal provides to them, rather than producing a better CMS.

This leads to a culture where actual expertise becomes a burden rather than a benefit, because whoever does something first is often expected to keep doing it indefinitely, for the benefit of everyone else. Rather than contributors having a symbiotic relationship with each other, it becomes more and more parasitic and unidirectional.

After many years in this environment, I find myself utterly drained and unmotivated to participate in that sort of charade anymore.

Ensuring your contributed code gathers dust

2007-05-01T00:00:00+02:00

8 tips for the aspiring Drupal developer

Open source is really great. You get to cherry pick from some of the best software out there and build neat stuff with it, fast. Most open source projects will also encourage you to contribute your own work back to the project. Supposedly, so others can benefit from your work.

While that's often an easy, karma-scoring move, it can have some unintended, annoying consequences. For example, people might start sending in bug reports for your code or may offer suggestions on how to improve it. Even worse, meddling know-it-alls may even offer to 'help' with development and do things with it that you never intended. Some projects, like Drupal, even trick you into such forced participation, by automatically supplying issue trackers, RSS feeds, revision control and other, undesirable community interaction.

Luckily, there are several things you can do to keep those pesky contributors away altogether. You can participate in open source without suffering any of its extended effects. Here are some concrete tips for the aspiring Drupal developer. Of course, a true creative will find novel ways of keeping their open source involvement to an absolute minimum!

Using an existing platform can be daunting. If it does something weird that you really can't explain, just work around it. You have better things to do. Warn others away with a comment that explains that you "Don't know why, but with this, the problem goes away." Avoid getting help from the original coders: they were the ones that broke it, remember?
Lift pieces from the core and change them. Drupal core provides a rich platform, but the core developers are often short-sighted. For example, several key API functions contain so-called 'security measures' that often just get in your way. Feel free to copy code from core and tweak as needed. Slap a mymodule_ prefix on the function name and you're good to go. Comments about what you tweaked are unnecessary—people can just use diff. And, unlike core, your copy will never be broken by 'fixes' from the community.
Include everything but the kitchen sink. If your project needs several related features, group them together into one big module. This ensures that others will always see something they don't need and can't easily get rid of. The idea of serious remodeling in foreign code is an easy turn-off.
Forget about backwards compatibility and upgrade paths. If it works for you, that's good enough. Besides, if someone else needs to keep their data, they can still figure out the changes on their own if they're determined. Extra points scored for ensuring that your module can only be installed correctly on a blank site, as this ensures low uptake.
Avoid conforming to coding conventions. There's nothing like non-standard names and confusing syntax to make that first read through of the code just that little bit more annoying. And really, why should you conform to the whims of a bunch of like-minded zombies?
Skip code comments when possible. As the original author, you already know what the code does. Plus, leaving out Doxygen sections for functions is a great way to make it harder for someone to 'get the big picture' and see how your code fits together. If you must leave a comment as a reminder, don't use sentences, but just slap some related keywords together. Avoid verbs, as they easily expose data relationships and control logic.
Don't make it localizable. If you don't speak English, you can even code directly in your native language, and save yourself a lot of trouble translating later. This means you also don't have to deal with those annoying Localization APIs that are over-engineered only to handle the moon language of some godforsaken mountain pygmies anyway.
If you must, obfuscate and bloat. Some projects are just too large, trendy or otherwise high-profile to not attract a crowd. In this case, going through the trouble of making your code more verbose and complicated is often needed to keep these nosy people at bay. Go through the extra effort—it's worth it.

Some of the better methods include:
- Clustering pieces of your code together in large mega-functions with no clear delimiters.
- Writing out all your control flow in long, repeated logic structures that contain as many squares from the truth table as possible.
- Keeping local data in static and global variables, for much longer than needed.
- Subtly misnaming functions, methods and variables for maximum effect: a _check() that doesn't, a _save() that deletes or a _form() that returns a table can wreak havoc in the most unexpected places.
- Creatively using APIs for accessing internal or otherwise unrelated data. The API reference is a great resource for finding such useful backdoors.

With these simple guidelines, you'll keep even the most seasoned contributor far away before they ever think of submitting a patch!

jQuery OSCMS presentation slides

2007-03-23T00:00:00+01:00

Update: a raw video is now available of (almost) the entire session. Thanks to Jon F Hancock for recording it.

Today I did my second session at OSCMS, which was basically a repeat of the jQuery talk I did at DrupalCon Brussels.

You can download a PDF (2.2MB) of the (slightly tweaked) presentation slides.

Design presentation slides

2007-03-22T00:00:00+01:00

I did my OSCMS talk Designer eye for the geek guy today. My main plan for this talk was to blast as many basic graphical design concepts into people's heads as possible and sort of teach some of the principles, vocabulary and methods that a lot of designers take for granted.

The response was great as far as I could tell. I also got the inevitable "How do we deal with Internet Explorer?" spin-off discussion in the questions round at the end ;).

Steven Peck recorded my session on video.

You can download the slides as PDF (36.5 MB), though because of all the graphics it's quite large. I think some sections will not be clear at all without the spoken explanation to go along with it though.

Going to the US sucks

2007-03-17T00:00:00+01:00

Ugh. So, I just found out that I will be fingerprinted and photographed when I go to the United States next week for the OSCMS conference. For some reason, I was under the impression that visitors with a visa waiver (like EU citizens) were not subject to this rule, and this was why I was still going. Perhaps they changed it recently.

Today was the first time I heard this issue mentioned, so I want to give a clear signal to the US members of the Drupal community: this is unacceptable and hinders development and business. For sure, it's the last time I'll be going into the United States while this continues, and if this trip wasn't so important to me I'd have cancelled my flight already.

Your country's government is messed up beyond belief, and its policies scare away the entire world. It wouldn't be so bad if the US didn't perpetually tout its everlasting respect for personal rights and freedom. As a European who has just moved to North America, I can say for sure that I constantly feel as though my privacy is under assault, because corporate interests, advertising and other annoyances take precedence over the right to be left alone here. It's a huge cultural difference.

This news immediately set in motion plans for subverting the system. I'd be having a little kitchen vegetable slicing accident if it were to help. Unfortunately, my prior experience with US customs a couple years back leads me to believe that any sort of irregularity would only lead to hours of delay and a much more thorough printing and photographing.

For now, the US is reduced to a bizarre, quirky, sad laughing stock for me. My visit will be bathed in the surreal air of stepping into an asylum.

Drupal.org Explosion and Trends

2007-03-11T00:00:00+01:00

One of the things I do occasionally is collect some of my own statistics on Drupal.org. An interesting one is to look at the size, expressed in nodes, and the growth rate of Drupal.org, expressed as nodes per hour.

Just like rings on a tree trunk, you can see Drupal.org evolve.

Now while the nodes graph (red, left axis) looks exponential at first, it turns out that its growth (blue, right axis) is actually quite stair-stepped. In fact, a remarkable trend can be observed: with every major release of Drupal, Drupal.org's growth rate doubles nearly instantly.

This has been true mostly since Drupal 4.5 and happened recently again with Drupal 5.0. The only exception was a relatively unexpected, but sustained linear growth burst throughout the fall of 2005 and the spring of 2006. There were many high profile projects at that time as well as a lot of anxious waiting for 4.7 (with a long train of betas and release candidates), so it is possible our unnaturally long dev cycle for 4.7 smeared out the growth to a more regular line.

Note that the first year of Drupal is missing, as the statistics were too small to show up.

What's also interesting is that you can clearly see the times when project.module has changed. The first tiny bump is the conversion of projects to nodes, while the other two spikes are the recent conversion of project releases to nodes. For some reason, a huge amount of existing releases got a creation timestamp a while ago, possibly due to some change in project.module housekeeping at that time.

Other noticeable events include the holidays (when activity slows down a lot) as well as the server crash of July 2005, and the subsequent move to OSUOSL.

I've attached the raw Excel Spreadsheet for anyone interested. See if you can match up any other events to the graph.

ComicJuice gets even better

2007-03-09T00:00:00+01:00

I finished some more tweaks to ComicJuice:

IE6 and 7 are now supported, thanks to the amazing ExplorerCanvas by Google. It emulates the <canvas> tag in IE, meaning that client-side scriptable vector graphics are now available on all the major browsers (IE, Firefox, Safari, Opera). I doubt Konqueror will be far behind.

This opens up some cool abilities, like dynamic in-page graphs, mini-widgets (sliders, dials, maps, ...) and even pure JS games. There's a bunch of examples linked on Wikipedia (though most don't use ExplorerCanvas yet).
I added support for uploading your own images rather than using pictures on the web. It uses a customized and themed version of core's JS uploader.
I improved the clipping of speech bubbles so there should be less useless whitespace around comics, especially when embedding them.

Announcing.... ComicJuice!

2007-03-06T00:00:00+01:00

I'm proud to announce the start of ComicJuice, a web 2.0 social mashup tool that lets you create comics in your browser and share them with others.

Update: Now with Internet Explorer support! Thanks to Google's ExplorerCanvas. Viewing comics works in IE6 and 7, while editing still requires IE7.

The crazy part is that I started working on this only Friday evening (that's 4 days ago). Once I had the initial idea and a rough plan, I simply couldn't not code it.

A lot of jQuery and JavaScript later, with some

You can also embed comics with iframes, and copy/pastable code is provided. Like this lame example:

(No longer available)

I figured a Web 2.0 mash-up would not be complete without a fitting design to go in, so I designed icons, sliders and toolbars for the editor, as well as a theme for the website. The theme is a Garland knock-off: I guess I'm proving myself wrong that it's a bad base theme. It's actually quite good as it has fluid/fixed 1-3 column layouts in it.

I'm curious to see if ComicJuice takes off and what people do with it. It was a blast to code in any case. Check it out.

Drupal's Designer Future

2007-03-01T00:00:00+01:00

In the past months I've been doing a lot more graphical design, and it's caused me to think about how it relates to Drupal. This prompted me to write a rather long blog piece with some insights and a call to action. If you are interested in the future of Drupal, please read on.

The trigger was that I noticed that I'm getting less and less motivated to do graphics work for Drupal. It's not that I don't like design... I loved designing and building that LeuvenSpeelt.be site last month for example. But when it comes to Drupal graphics, the personal reward that I feel from doing it doesn't seem proportional to the effort I put in. This includes designing little banners for the Drupal.org spotlight, doing a t-shirt, making ad buttons, doing the association theme and more.

The most recent big example was the Garland theme. When Stefan Nagtegaal showed a work-in-progress version of his 'Themetastic' theme (as it was then called) in September, I was instantly charmed and knew that this was our new default theme in the making and said so clearly.

Many others were not convinced though and hammered on details, even though the basic design for the theme was rock solid. Some were not convinced of the theme's potential, or simply didn't see that we needed a theme that was graphically smashing rather than a good base to develop on.

At that point, I essentially said "screw the community, this is going to be our default theme" and started refining the theme so it was perfect for core. This took several weeks.

Until then, the rest of the community put its eggs in the wrong baskets and got a lot of useless design-by-committee done. These designs, which were in my opinion mediocre at best, were being pushed for inclusion. This may sound a bit harsh, but I honestly believe that if the most popular candidate theme, Deliciously Zen, had become the new default core theme, we'd have been ridiculed for still not 'getting' design after 6 years and Drupal 5 would not have been such a big release. Just like 4.7, most people would not stick to Drupal long enough to discover how good it is.

Now, when the Garland theme was finally done, everyone suddenly changed their opinion and congratulated the community on its excellent work. I have to admit this hit a nerve, especially after I'd been spending countless days and nights the two weeks before fixing annoying IE rendering bugs, redoing the CSS layout and adding a whole new layer of Glitz und Glanz to Drupal core.

Only three people did serious work on what became the Garland theme: Stefan Nagtegaal did the original design from scratch and worked with Adrian Rossouw to come up with a proof-of-concept of the recolorable theme. I wrote the color picker, improved the theme and coded what became the color.module based on Adrian's stuff.

Only a handful of people helped with testing of the theme during its development and only after the main theme was finished; most of the bugs were in the recoloring mechanism. How can such a vital piece of Drupal 5 have only 3 serious contributors, when the whole release had almost 500 people submitting patches?

To me, this shows that we have a problem in the Drupal community, or rather a knowledge void. Not enough Drupal people are savvy enough about theming and design to help out with even small tasks (like a banner) or even give quality tips and feedback on other work. The result is that theming and design receives little attention. Most contributed themes and sites could look a lot better, if they just themed it some more. And getting patches into core that give the defaults a little more oomph is tough, as they are often considered to be useless embellishments.

Still, ever since Drupal started, there has been the recurring cry of doing more to attract great designers to the platform. The overall effects of this have been minimal. However, something similar did happen before.

Before Drupal 4.0 was released, the focus was mainly on features and Drupal was a highly experimental project. After a while, as more people started using it, many users complained that Drupal was too hard or confusing to use. Because of this, 4.0 was the first of many releases that contained significant usability improvements, in this case in the administration area. Many small and large usability features were added. With the menu system and tabs having been added to core by Drupal 4.5, even contributed modules started using the same UI concepts as core. Drupal's UI ended up much more consistent across configurations and it became easier to learn and document. Now with Drupal 5.0, we have undoubtedly produced the most usable release yet.

How did this happen? Over the years, the idea has popped up many times to bring usability experts on board to do a review, and the hope has lived that a usability expert or two will magically pop up in the community and solve all our problems (sound familiar?). Neither has happened so far.

What did happen is that usability became a big priority for the project, and as a result, many people started educating themselves about it. The community quickly identified those in its ranks knowledgeable about usability and listened to their advice. Soon, big UI gurus were being quoted on the mailing lists and "-1 isn't usable" became a valid reason to dismiss a feature. Sure, this process took time, but it definitely happened. Plus, the combined usability knowledge and effort of the community, though individually not at expert level, had a much larger effect in the long run than any single expert could have.

The same needs to happen for design. For years now, the Drupal community has been hoping for a group of prodigy designers to magically appear and design a set of jaw dropping themes and UI. They have not shown up. Talking about and maintaining a high quality of design across Drupal still often feels like swimming upstream, because most community members don't care much for design unless it is delivered in front of their noses on a silver platter. For many, design is still something to only be enjoyed, not something to be created.

Now, I really want to see this change. For one thing, the shortage of design talent means Drupal is generally perceived to be ugly. It's quite demotivating, because we put a lot of time into it. Unfortunately people illogically, but consistently, assume a relation between how something looks and how well it is built. With Drupal 5 we've done a lot to improve this, but we could still do a lot better. Drupal is no OS X (yet).

For another, when only a handful of people are always doing the same jobs, the passion tends to slip out and the challenge becomes a chore. I honestly have no ideas left for a spotlight banner at the moment. That's why the scalability banner is so mind-numbingly boring, though I made plenty of cool ones before.

This is also why I'm holding that OSCMS talk about design this month: I want more people to realize that if your site and/or module is ugly, people aren't going to like it or use it. It's as simple as that. If you mess up something as basic as text formatting, your message simply doesn't get through (hello MySpace users). The only way to change that is to put in the effort to make things look clean and nice. Nice products and nice sites tend to cause happy, dedicated and long-term users.

The community not only needs to realize this, but also needs to teach itself the knowledge and skill to do something about it. Drupal has infinite potential, but it only goes where the community takes it. If the majority remains allergic to design and graphics, very little will change and only at a glacial pace.

LeuvenSpeelt another year

2007-02-15T00:00:00+01:00

Update: check out the poster I did for the event as well.

Just like last year, my (now) ex-university, the Catholic University of Leuven, still has a theater festival for and by students. Friends of mine organise it and I'm the resident web monkey and designer for their site and poster. The site's domain name means "Leuven plays" and is a pun on theater and plays (it works in English too). So, every year we try to base ourselves on some playful theme when coming up with the promotional material.

In the past, there's been blackjack and chess. Last year's design came out really well, so I posed myself the challenge of doing even better.

I redesigned the site from scratch, this time using a 1930's carnival/fair as the theme. A lot of Google Image researching by the organisers found lots of tents, wagons, clowns, snake ladies, acrobats and bearded women. I modelled the carnival setting in 3D Studio Max entirely from scratch, using free textures found on the web as well as custom painted textures. We also found some very nice ticket designs, which led to the typography and framing of the page.

The renders were cut up into layers and composed into a semi-fluid CSS-based layout (no flash). Then I turned that into a Drupal theme and installed it on the site. The result is very heavy on images (with several hundred KB of PNGs/JPEGs), but in this case it's acceptable as the site's audience will consist 99% of students browsing through the university network or someplace close.

To make the ticket metaphor even better, I used some JavaScript and cookie magic to issue each visitor a unique ticket with a printed serial number. Each page you visit punches a hole in the paper, though of course you can still browse the site as much as you want. If you wait an hour or more and come back to the site, you'll get a fresh ticket with a new number. The serial number is in fact just a hit counter in disguise.
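The ticket logic described above can be sketched roughly like this. It is a Python illustration of the idea, not the site's actual JavaScript: the `Ticket` class, the `visit` function and the module-level counter are hypothetical names, and the counter merely stands in for whatever hands out serial numbers on the real site.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

TICKET_LIFETIME = 60 * 60  # assumption: a ticket goes stale after one hour away

# Hypothetical hit counter: every fresh ticket takes the next serial number.
_serials = count(1)


@dataclass
class Ticket:
    serial: int
    last_seen: float                             # time of the most recent page view
    punches: list = field(default_factory=list)  # pages visited on this ticket


def visit(ticket: Optional[Ticket], page: str, now: float) -> Ticket:
    """Return the visitor's ticket after viewing `page` at time `now`.

    `ticket` is whatever was stored in the visitor's cookie (None if absent).
    """
    if ticket is None or now - ticket.last_seen >= TICKET_LIFETIME:
        # No cookie yet, or the visitor stayed away for an hour: fresh ticket.
        ticket = Ticket(serial=next(_serials), last_seen=now)
    ticket.punches.append(page)  # each page view punches one hole
    ticket.last_seen = now
    return ticket
```

Because a new serial is only drawn when a fresh ticket is issued, the latest serial number doubles as a count of visits, which is the "hit counter in disguise".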

Just like the previous years, there are two easter eggs hidden around the site. You'll know when you've found them.

The only hiccup was getting IE6 and IE7 to play nice. In the end, there are only a few minor issues that don't damage the design overall, except for one IE7 issue left to solve. All other browsers played nice from the start.

The site can be found at LeuvenSpeelt.be (Dutch).

Next up is finishing the work on this year's poster, which uses the same overall setting.

OSCMS Talk: Designer Eye for the Geek Guy/Gal

2007-02-14T00:00:00+01:00

Update: I've posted the presentation slides and a video is available as well.

I'll be attending the OSCMS conference in Sunnyvale CA at Yahoo next month. Aside from a repeat of my DrupalCon jQuery talk (though with a few more examples), I just submitted another proposal for a talk. It's something that I've wanted to do for a while now:

In meetings and lectures across the globe, people are made to endure hideous presentation slides featuring some of the wildest colors, clip art and typography. Many websites are so confusingly laid out, that you get dizzy from the overload of boxes, images or links. And every day, people receive resumés, invoices and ads ... *cue lightning and thunder* set in the Comic Sans font.

It's enough to make the average designer's hair turn blue, fall out, morph into a ninja and stab him/her in the eyes.

But, all hope is not lost! Contrary to popular belief, graphical design is not some arcane voodoo magic, but a straightforward discipline that values experience, reusability, elegance and good tools just like programming. Just like code, there are plenty of objective ways to measure the quality of a design. However, just like art is subjective, so may two programmers disagree on which implementation is the best. No designer is born with a genetic sense of proportion... it's just that while you were busy writing BASIC code on your C64, they were busy drawing superheroes.

I myself am an engineering geek who's never had any sort of formal design or art training, but has earned the title of "design nazi" on numerous occasions.

This session will teach geeks some basic principles about graphical design (especially on the web), from a geek perspective. This means we won't talk about "visually balanced design" but "here's a good approach to spacing". Soon, you'll be hearing the oooh's and aaah's when you don your designer hat.

You can vote on the session page if you're interested.

Vancouver PHP Conference

2007-02-12T00:00:00+01:00

Ahoy from the Vancouver PHP conference. I gave a talk titled "A Closer Look at Drupal 5" earlier. Overall response was positive, although according to Boris I wouldn't have managed to squeeze everything into 1 hour if I hadn't put on my zippy fast presentation speaking voice, so there might have been some information overload at times.

Oh well.. I figure anyone generally only remembers at most 50% of a talk, so I might as well blast you with a bunch of things and hope some of it sticks ;).

Thanks to Dries and James for letting me use their earlier presentations as a base.

The slides are no longer available by Dries' request, as he has had problems with people stealing slides without permission before. Sorry.

Authenticated Distributed Search (OpenSearch, OpenID)

2007-02-06T00:00:00+01:00

I've been working on Drupal distributed search for a while now, releasing a beta of the OpenSearch Aggregator as well as a release of the OpenSearch feed module. The aggregator has a friendly UI for setting up any number of sources and the feed contains relevance information from the Drupal search system. Results are also cached on the aggregator for performance reasons.

More information about these modules can be found in my earlier blog posts about OpenSearch.

The ultimate goal however is to set up distributed search for a Bryght client between a network of secure Drupal sites. The searches for logged-in users should include content that is visible to them across all the different Drupal sites.

OpenID is the obvious choice as an identity mechanism for the users, but it does not immediately help us with the authentication. I've written a document after some research that details possible approaches and solutions. Because we're talking about frontier technology here, it seemed best to repost it publicly to solicit feedback from anyone interested. I could certainly use some extra opinions on this, as it is all very new to me.

Essential requirements

The most basic requirement can be summed up as follows:

We have the search aggregator (master) and a group of sites that return results (slaves). Whenever a search is performed by a user logged in on the master, the master contacts all the slaves and passes along the identity of the user making the request. This information needs to be unambiguous and secure. Slaves will then return results respecting the user's access permissions back to the master, which aggregates and caches the results for browsing.

All communication to the slaves is done by the master. For practical and performance reasons, none of this can be implemented on the browser/client-side (other than maybe a trivial form/redirect to log-in once).

1. Naive implementation

A very simple implementation could use each user's e-mail address as the global identifier for identity, and map it to the local Drupal uid on each site. The request could be signed using a simple keyed hash (HMAC) with a shared secret key that is set on all the participating sites (thus encoding trust). A query from the master to a slave might look like:

GET http://example.com/opensearch/node/keywords?user=name@example.com&hmac=123456789abcdef

There are several problems with this:

The user needs to register manually on each of the participating sites using the same e-mail address.
A shared secret needs to be manually set on each of the participating sites.
All participating sites must be part of a completely trusted, closed network.
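To make the naive scheme concrete, here is a minimal sketch in Python of how the master could sign a request and how a slave could check it. The function names, the choice of SHA-256 and the `SHARED_SECRET` value are assumptions for illustration only; none of this is taken from the OpenSearch modules themselves.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

SHARED_SECRET = b"change-me"  # assumption: the same key is configured on every site


def sign_search_url(base_url, user, secret=SHARED_SECRET):
    """Master side: append the user identity and an HMAC over it."""
    mac = hmac.new(secret, user.encode("utf-8"), hashlib.sha256).hexdigest()
    return base_url + "?" + urlencode({"user": user, "hmac": mac})


def verify_search_url(url, secret=SHARED_SECRET):
    """Slave side: return the user identity if the HMAC checks out, else None."""
    params = parse_qs(urlsplit(url).query)
    user = params.get("user", [""])[0]
    received = params.get("hmac", [""])[0]
    expected = hmac.new(secret, user.encode("utf-8"), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking where the first mismatch occurs via timing.
    if user and hmac.compare_digest(received.encode("utf-8"), expected.encode("utf-8")):
        return user
    return None
```

Like the example URL, this sketch signs only the identity. A real deployment would probably also want to cover the search keywords and a timestamp, so that a captured URL cannot simply be replayed.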

2. Improvements: OpenID log-in

Some first low-hanging fruit is using distributed OpenID for the end users. This avoids explicit registering on each of the participating sites, as the Drupal OpenID module does this for us when you log in the first time. It also gives us a globally unique identifier (the user's OpenID) which is verified (by DNS), to cross reference the local numeric uids with.

However, OpenID is distributed in nature, and does not immediately help us when we want to have the master prove to the slave that it is allowed to fetch content for a certain user. The only entity that knows and holds the keys/cookies to which sites a user is logged in to is the user's browser (User-Agent), while the home site knows only which sites are allowed and have been logged in to in the past.

The trust between master and slave would still have to be implemented through some other means, for example again using a shared secret key and HMAC verification:

GET http://example.com/opensearch/node/keywords?user=user.openid.com&hmac=123456789abcdef

Instead of authenticating every request, we could also implement an authenticated 'back door' through which the master can force log-ins on the slaves without doing actual OpenID authentication with the home server. The result would be a session cookie for each slave that can be used normally by the master:

POST http://example.com/trust/?user=user.openid.com&hmac=123456789abcdef => Cookie is set
GET http://example.com/opensearch/node/keywords
Cookie: PHPSESSID=123456789abcdef123456789abcdef

This backdoor login would be provided by another module, and would have to rely on e.g. a DNS or IP-based whitelist of allowed hosts, optionally with SSL to ensure confidentiality.

Access control would be respected by Drupal using the normal session mechanism and the OpenSearch client module would not need to be altered. The cookies can be stored along with the local user account on the master, and aggregated OpenSearch data would be cached on the master per user.
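The slave side of such a back door could look roughly like the following Python sketch. Everything here is hypothetical: the `trust_login` and `user_for_session` functions, the in-memory `sessions` dictionary and the whitelisted address are placeholders, and a real Drupal module would use Drupal's own session handling instead.

```python
import hashlib
import hmac
import secrets

SHARED_SECRET = b"change-me"    # assumption: shared key, as in the naive scheme
ALLOWED_HOSTS = {"192.0.2.10"}  # assumption: IP whitelist containing only the master

# Hypothetical in-memory session store on the slave: session id -> user identity.
sessions = {}


def trust_login(remote_addr, user, mac):
    """Slave side of POST /trust/: return a new session id, or None if refused."""
    if remote_addr not in ALLOWED_HOSTS:
        return None
    expected = hmac.new(SHARED_SECRET, user.encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mac.encode("utf-8"), expected.encode("utf-8")):
        return None
    session_id = secrets.token_hex(16)  # random value sent back as the session cookie
    sessions[session_id] = user
    return session_id


def user_for_session(session_id):
    """Later requests: map the session cookie back to the logged-in user."""
    return sessions.get(session_id)
```

Once the master holds the returned session id, its search requests look like those of any logged-in browser, which is why the OpenSearch client module needs no changes.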

3. Search master as home server

A possible solution is to restrict the OpenID home server to be the search master.

This would allow the master to log in to the slaves directly, as it can produce all the necessary cryptographic tokens without needing user action. No modifications are needed to the slaves. After the master has logged in once to each site, it has a valid session cookie for each (like in (2)).

4. Multi-login extension to OpenID

Another solution would require an extension to OpenID, both to the client module for Drupal and the code that runs the home server. Still, it would allow the home server to be any other server (or even a set of servers), provided the home server supports the custom extension.

When the user logs in to the search master, the module would know that it will need access to each of the search slaves. When it sends a request to the home server, it would not only ask for a log-in to itself, but also for each of the slaves. The user would get a single screen on the home site to log in to (with correct notification that this is a multi-login), and would then be returned to the search master.

The search master logs in the user locally, and uses the cryptographic tokens for each of the slaves to log into them. The slaves can verify the log-in tokens using direct communication with the OpenID provider. Like in (3), normal Drupal cookies are returned to the master and used to perform the searches.

5. Other ideas

Of the above methods, only (2) is really immediately practical. Using OpenID at the base does not help if you still need proprietary extensions, or if you take away the ability to choose a home site. And if we do want to do (4) properly, then we need to develop an actual spec that respects the principles and security of OpenID. Not an easy job.

The downside of (2) is that there is no actual proof that the log-in took place, and that we rely on the shared key to ensure trust.

However, because the whole Identity 2.0 space is still developing, I think it would be silly to try and build something elaborate that implements some sort of utopian federation/whatchamacallit model. It would be an insane amount of work and would not be future proof or even useful today as very few real-world services would support it. I think we just have to wait here to see what develops, once OpenID gains some more widespread use and people get more comfortable with these concepts (which should happen in 2007).

5.1. SAML

SAML has been suggested as a standardized way of passing along security assertions. However, SAML is really a competitor to OpenID and started as a way of doing single sign-on between trusted sites.

The main difference is that OpenID is tied to a particular HTTP exchange pattern, while SAML better separates the message from the delivery method. Still, SAML is based on exactly the same principles, so assertions have to be generated and signed by the home server. So, they are still useless if we want the master to securely prove the log-in of a certain OpenID. Of course, we could encapsulate the message from master to slave in an unsigned SAML message, but that would really defeat the point of using SAML in the first place.

SAML itself doesn't do trust either. The slaves would still have to have a whitelist that includes only the master and which would be verified again by DNS/IP or SSL.

5.2. OpenID Proof-of-login token

The big functionality hole in OpenID is that the only one who can verify the cryptographic tokens for a log-in is the intended recipient (the relying party). There is no way in the specs to 'forward' the assertion by the home server that the user logged in to the relying party in a way that makes the information verifiable for everyone.

An extension to OpenID for this would be good. It's essentially a variant on the multi-login idea, where, instead of asking the user to log in to the cloud of sites, the user only logs in to the master, and the master sends proof of this log-in to the slaves:

POST http://example.com/trust/?user=user.openid.com&....crypto-here.... => Cookie is set

The slaves would again have a whitelist based on IP/DNS or SSL. The main difference with (4) is that here, trust is again set by the administrator rather than explicitly given by the user through OpenID.

Conclusion

(4) sounds like the cleanest solution, as it does not rely on explicit whitelisting for trust. The user simply has to check that the master is not asking for a log-in to an external site. However, it would require an extension to the OpenID log-in process on the home server, including the UI (as the extra log-ins have to be communicated somehow), so it is unlikely that it would be implemented easily.

(5.2) sounds better in this light, because only changes under the hood need to be made. The home server would simply send along extra cryptographic tokens that can be forwarded to other parties. Depending on security issues, these tokens could be requested in general (by amount) or for specific recipients (for each specific slave).

Both solutions require serious crypto and spec work to do properly.

Expanding Textareas

2007-01-20T00:00:00+01:00

See the Drupal.org issue.

Quicktime embedding doesn't seem to be working. Try downloading the file.

Making Drupal smarter

2007-01-19T00:00:00+01:00

One of the things I worked on for Drupal 5 was to make Drupal smarter about itself. For example, the new status report tells you about security or maintenance problems, and can alert you if you need to run update.php after updating a module. These warnings are also pushed out to the main administration page.

This is a significant improvement in usability, as we try to make site setup and maintenance as painless as possible. The user can focus on the things they care about. If there is something that needs the admin's attention, it is clearly indicated.

The inspiration for this came from OS X. There are some really great examples of this in both Apple and third-party applications. For example, Mail.app automatically figures out if your mail server requires SSL or not: it simply tries both and sees which works when you set up the account. It happens transparently.

NewsFire does something similar. When you click the "Add feed" button, it automatically takes any URL on the clipboard, pings it and fetches the feed title (provided it points to a valid feed). All you need to do is press ok on the pre-filled form. If there is no URL on the clipboard, you get a blank form.

With our new _install and _enable hooks, and _requirements to examine the server environment, Drupal 5 modules have a lot more opportunities to 'do the right thing' transparently.

We should encourage this practice as much as possible. Modules that require Drupal's configuration to match certain settings should make sure they are set correctly. If the setting is relatively benign and related to the module's purpose, it can be set automatically. Otherwise, it can be a requirement for the module to be installed/enabled.

We do need to walk a fine line between information overload and too much magic. But I think we've managed to do fine so far.

Updated Drupal TextMate Bundle

2007-01-14T00:00:00+01:00

I've updated my Drupal TextMate bundle script to also generate snippets for all PHP internal functions, including correct placeholders for the function arguments. It's a lifesaver when navigating PHP's bizarre Array or String APIs.

The script fetches the PHP function list straight from PHP CVS, but it still needs a Drupal tree to work. You can also copy in the contrib documentation to get snippets for hooks too (which even auto-fill in the module name). All PHP files within the given path are parsed.

To use it, place it in ~/Library/Application Support/TextMate/Bundles and run it:

php generate.php.txt [path to drupal]

In TextMate, go to Bundles › Bundle Editor › Reload Bundles to activate it.
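For a sense of what "placeholders for the function arguments" means, here is a small Python sketch that turns a function name and its argument names into a TextMate snippet body using numbered tab stops. The `make_snippet` helper is hypothetical and is not part of the actual generator script, which is written in PHP.

```python
def make_snippet(function_name, arguments):
    """Build a TextMate snippet body with one numbered tab stop per argument."""
    placeholders = ", ".join(
        "${%d:%s}" % (index, name) for index, name in enumerate(arguments, start=1)
    )
    return "%s(%s)" % (function_name, placeholders)


print(make_snippet("str_replace", ["search", "replace", "subject"]))
# str_replace(${1:search}, ${2:replace}, ${3:subject})
```

Pressing tab in TextMate then jumps from one argument to the next, with the argument name pre-selected so it can be typed over.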

Download

(License: GPL)

On Breaking Things

2007-01-05T00:00:00+01:00

See the Drupal.org issue.

Quicktime embedding doesn't seem to be working. Try downloading the file.

Drupal OpenSearch Aggregator

2006-12-16T00:00:00+01:00

I just committed a working version of my new OpenSearch Aggregator module to Drupal Contrib CVS.

OpenSearch is a standard by Amazon which allows you to share search results through RSS. The feeds are valid RSS, they just contain extra meta-data for searching. So, you can use OpenSearch with any RSS reader to set up feeds to track tags or keywords for example.
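As an illustration of "valid RSS with extra meta-data", the Python sketch below reads a small hand-written feed. The sample feed and the `parse_feed` function are made up for this example, and it assumes the OpenSearch 1.1 namespace; the Drupal modules may emit a different version or additional elements.

```python
import xml.etree.ElementTree as ET

# OpenSearch 1.1 response-element namespace (assumed version for this sketch).
OPENSEARCH_NS = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

# Hand-written sample feed, not actual module output.
SAMPLE_FEED = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Search results for drupal</title>
    <opensearch:totalResults>2</opensearch:totalResults>
    <opensearch:startIndex>1</opensearch:startIndex>
    <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
    <item><title>First hit</title><link>http://example.com/node/1</link></item>
    <item><title>Second hit</title><link>http://example.com/node/2</link></item>
  </channel>
</rss>"""


def parse_feed(xml_text):
    """Return (total result count, list of (title, link)) from an OpenSearch RSS feed."""
    channel = ET.fromstring(xml_text).find("channel")
    total = int(channel.findtext("opensearch:totalResults", default="0",
                                 namespaces=OPENSEARCH_NS))
    items = [(item.findtext("title"), item.findtext("link"))
             for item in channel.findall("item")]
    return total, items
```

A plain RSS reader simply ignores the `opensearch:` elements and shows the items, while a search-aware client can use them for paging.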

We also have an OpenSearch client module that provides these feeds, and I just updated it to send search relevance information along. So, you could set up 5 Drupal sites with OpenSearch module, and a sixth site with the OpenSearch aggregator. Now, you can search all 5 sites simultaneously, and get a single, ordered list of global results.
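The aggregation step can be pictured with a short Python sketch. It assumes each site reports a relevance score that is comparable across sites (say, between 0.0 and 1.0); the `merge_results` function and the tuple layout are hypothetical and do not reflect the aggregator module's internals.

```python
def merge_results(feeds):
    """Combine per-site result lists into one list ordered by relevance.

    `feeds` maps a site name to a list of (relevance, title, link) tuples.
    """
    merged = [
        {"site": site, "relevance": relevance, "title": title, "link": link}
        for site, results in feeds.items()
        for relevance, title, link in results
    ]
    # Highest relevance first; sorted() is stable, so ties keep their feed order.
    return sorted(merged, key=lambda result: result["relevance"], reverse=True)
```

In practice the hard part is making the scores comparable in the first place, since each site computes relevance against its own index.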

However, because OpenSearch is an open standard, it can be used for anything. Amazon's A9 search already offers media search for example. The possibilities really are endless.

The best part? The OpenSearch Aggregator presents its results through the normal search system. So, if you install the OpenSearch module on top of this, you automatically provide OpenSearch feeds for the aggregated search. In other words, Drupal is now a complete OpenSearch processing suite! There is no other CMS out there that can claim this.

More info is on the Drupal.org project page.