From: Marc Slemko [marcs@ZNEP.COM]
Sent: Wednesday, February 23, 2000 11:39 AM
To: VULN-DEV@SECURITYFOCUS.COM
Subject: Re: [imp] sanitizing html

On Wed, 23 Feb 2000, Mikael Olsson wrote:

> Stuart Henderson wrote:
> >
> > Not sufficiently global, since an attacker can still use,
> > for example hrEf=script:foo -- however, this is tricky to
> > filter without hitting some legitimate addresses, for example
> > http://foo.bar.com/womble.cgi?user=someone&page=something.
>
> Correct. And you can also use UTF-7 (Unicode) chars to make
> script tags and everything look like something else altogether.
>
> This means that
>
> ## $data = preg_replace('|<([^>]*)[Ee][Mm][Bb][Ee][Dd]|', '
>
> wouldn't protect you at all.

Well, no, it won't, but that is for other reasons.

To avoid charset issues, as long as you don't actually have to worry
about pages really using that charset (yea, which is a problem for a
fair chunk of people), you simply have to specify the charset
explicitly. If you don't know what charset the client will use, you
have no way to know what has to be encoded. This is the reason that
the patches released for Apache allow you to force a charset on all
pages that don't have an explicit charset in the HTTP headers.

Some of the other things to worry about, some of which work only in IE
or only in Navigator:

    &{alert('foo')}; (in Navigator, in an attribute value)

    foo foo foo foo foo

    can be used (note no quotes around userentered.gif) if
    userentered.gif is entered as something like:

        xxx.gif onmouseover="alert('foo')

If you are putting user data inside javascript, then there are other
characters to be wary of.

If you are outputting a text/plain page with user content embedded,
then you can't make it safe, because IE has a major hole (yea, yea, MS
calls it a "feature", but it should be more and more obvious why it
isn't): it will try to guess the MIME type. So if you send a
text/plain page, then you can't encode any characters since there is
no encoding defined.
Yet, if IE feels like it, it will go ahead and interpret it as HTML
anyway. Perhaps having this brought up as a security bug in IE will
make MS fix it. Probably not. It isn't like this horribly broken
behaviour is anything new.

The list goes on and on. And that doesn't even include all the random
HTML tags that are obviously dangerous. The only thing that I can
almost guarantee is that any list you make won't be complete.

There is no way to safely filter HTML by specifying what not to allow.
Even if you somehow did create a filter that magically worked 100%
with one or two or three browsers today, your filter will break
tomorrow or the next day.

You need to be explicit about what you do allow, and make sure that it
is in a very restricted form. Things like PHP's strip_tags function
that only allow certain tags through are not stringent enough, because
they allow arbitrary attributes.

In addition, remember that this problem isn't just about scripting.
Say you have an auction site that lets people bid by viewing an item,
then entering their username and password at the bottom of the page.
All the attacker needs is the ability to insert a form tag and
associated stuff to exploit this. No scripts involved; this is not a
scripting problem. The name is unfortunate.

As a real life example of this problem, take eBay. They are wide open
to almost this exact example (they do allow scripting languages, and
do have the enter username/password bit on a second page, which
changes little), have known about it for a long time, and just don't
give a damn.
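
To make the "be explicit about what you do allow" point concrete, here
is a minimal sketch in Python (not the PHP mentioned above). The names
ALLOWED_TAGS, AllowListSanitizer and sanitize() are hypothetical and
made up for illustration; the tag set is only an assumption about what
a site might want to permit. It drops every attribute, drops every tag
that is not on the list, and re-escapes all text. Treat it as an
illustration of the idea, not a vetted filter, and note that it still
depends on the page being served with an explicit charset.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: tag names only. No attributes are ever allowed,
# which is the restriction a tags-only filter such as strip_tags lacks.
ALLOWED_TAGS = {"b", "i", "em", "strong", "p", "br"}
VOID_TAGS = {"br"}  # allowed tags that take no closing tag


class AllowListSanitizer(HTMLParser):
    """Re-emit only allow-listed tags, with no attributes, and escaped text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names; attrs are deliberately ignored.
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text (including text inside dropped tags) is kept, but escaped.
        self.out.append(escape(data, quote=True))

    # Comments, declarations and processing instructions fall through to
    # the base-class handlers, which do nothing, so they are dropped.


def sanitize(text):
    parser = AllowListSanitizer()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.out)


if __name__ == "__main__":
    print(sanitize('<b onmouseover="alert(1)">hi</b>'))  # prints <b>hi</b>
```

The form-injection case from the auction example is handled the same
way as the scripting cases: form and input are simply not on the list,
so they never reach the page, script or no script.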