A type-based solution to the "strings problem": a fitting end to XSS and SQL-injection holes?

Posted by Tom Moertel Thu, 19 Oct 2006 01:40:00 GMT

Even skilled programmers have a hard time keeping their web applications free of XSS and SQL-injection vulnerabilities. And it shows: a sobering portion of web sites are open to some scary security threats.

Why are so many sites vulnerable to these well-known holes? Probably because it’s insanely hard for programmers to solve the fundamental “strings problem” at the heart of these vulnerabilities. The problem itself is easy to understand, but we humans aren’t equipped to carry out the solution. Simply put, we just plain suck at keeping a bazillion different strings straight in our heads, let alone consistently and reliably rendering their interactions safe whenever they cross paths in a modern web application. It’s easy to say, “just escape the little buggers,” but it’s hard to get it right, every single time.

Computers, on the other hand, are pretty good at keeping track of details by the bucket-full. Wouldn’t it be nice, then, if our programming languages gave us the power to delegate this nasty “strings problem” to our computers, which could then devote their unwavering mechanical precision to grinding the problem out of existence? Isn’t that the kind of thing modern programming languages are supposed to be good at?

I’d like to think the answer to that question is a big “you betcha.”

So let’s grab a modern programming language and solve the strings problem.

Let’s solve the strings problem in Haskell

In this article, we will look at one way (among many) to solve the strings problem: by adding Ruby-style string templates to Haskell. These templates support “interpolation” via the usual, convenient #{var} syntax, but here interpolation is type safe. Haskell’s type system will prevent us from inadvertently mixing incompatible string types, and it will detect mistakes at compile time, before they can become live XSS or SQL-injection holes. Further, our solution will offer us these benefits without making us jump through hoops or pay some onerous syntax penalty.

To be more specific, the system offers the following benefits:

  • It provides a string-management kernel that lets you create “safe strings” by certifying a regular string as representing either text or a fragment of a known language.
  • It allows you to conveniently define new language types for any string-based language that you can provide an escaping rule for (e.g., XML, URLs, SQL, untrusted user input).
  • It provides compile-time syntactic sugar (via Template Haskell) that makes working with safe strings as convenient as working with string interpolation in languages like Ruby and Perl.
  • It catches and reports (at compile time) the following commonly made programming errors:
    • failing to escape a plain-old-text string before mixing it into a string that represents a language fragment
    • mixing strings that represent fragments of incompatible languages
    • mixing strings that represent fragments of compatible languages in an ambiguous way (the system will force you to disambiguate)

(This is a long one, so grab an espresso, lean back, and read on in style. Also, if you have a smoking jacket, you might want to get it now.)

Before I describe this Haskell-based solution, let’s take a closer look at the strings problem and review why a type-based approach makes sense. (If you already understand the strings problem and are convinced that it is both important and tricky to solve, feel free to skim the first third of this article.)

Examining the “strings problem”

Most web applications are just business-logic-driven string processors. They take strings from user-submitted forms, database queries, web-service responses, templates, and myriad other sources, and they combine the strings to generate yet more strings, which they emit as output and fling across the Internet, into your web browser.

For example, consider this snippet of Ruby (on Rails) code that I used to add submit-to-Reddit and submit-to-del.icio.us buttons to articles on my blog:

def submit_this_article_links(article)
  site_list(article).map do |submit_title, submit_url, image_tag|
    %(<a href="#{h submit_url}" 
         title="#{h submit_title}: &#x201C;#{h article.title}&#x201D;" 
      >#{image_tag}</a>)
  end.join("&#160;")
end

def site_list(article)
  u_title = u(article.title)
  u_url = u(url_of(article, false))
  [  # I really belong in a database table
    [ "Submit to Reddit.com",
      "http://reddit.com/submit?url=#{u_url}&title=#{u_title}",
      image_tag("reddit.gif", :size => "18x18", :border => 0)
    ],
    [ "Save to del.icio.us",
      "http://del.icio.us/post?v=2&url=#{u_url}&title=#{u_title}",
      image_tag("delicious.gif", :size => "16x16", :border => 0)
    ]
  ]
end

When writing this code, I had to keep track of at least three different kinds of strings:

  • Plain-old text, e.g., article titles
  • URLs, e.g., article permalinks
  • XHTML fragments, e.g., the hypertext link to Reddit’s submission form

In code like this, each type of string must conform to the requirements of its own little language, and it’s the programmer’s job – your job – to make sure that differences in these requirements are accounted for when combining strings. Getting it right is a difficult trick to pull off, and getting it right consistently is something even the best developers have difficulty doing.

In the tiny snippet of code above, for example, I had to remember to do all of these things:

  1. URL-escape (using the u helper method) the article’s title before inserting it into the submit-URL template
  2. URL-escape the URL for the article’s permalink before inserting it into the submit-URL template
  3. HTML-escape (using the h helper method) the final, expanded submit-URL template before inserting it into the hypertext-link template
  4. HTML-escape the submit-title (e.g., “Submit to Reddit”) before inserting it into the hypertext-link template
  5. HTML-escape the article’s title before inserting it into the hypertext-link template

That’s a lot to keep track of when coding.

But that’s not all. I also had to know not to escape the result of calling image_tag, because that helper method returns an HTML fragment, which is already in the language of the hypertext-link template into which it is inserted. Escaping it would have turned the image-element markup into embedded text that happens to look a lot like HTML markup.

And that’s not the worst of it. If you screw up any one of these steps for the typical web application, you open the door to a host of nasty problems. If you’re lucky, the damage will be contained to broken links or a rendering problem that most people won’t notice, maybe a weird database error now and again. In the worst case, however, you’re screwed: Your application’s customers become vulnerable to cross-site-scripting (XSS) attacks and your database is opened to injected SQL, through which enterprising crackers might steal your customers’ account data or do even nastier things.

Clearly, the strings problem is common enough and nasty enough to merit our attention. Many of our favorite problem-stomping practices, however, have not proved effective on the ever-tricky strings problem.

Unit testing is an inefficient solution to the strings problem

Unit testing is one of the most efficient programming practices for increasing the quality of software. If you write unit tests pervasively as you code, you are likely to nip many kinds of programming problems in the bud, saving time and effort, which you can then re-invest in your code. Further, unit-testing suites make for swell regression-detection nets and thus free you to refactor crufty code without fear of introducing breakage elsewhere. As a result, you’re more likely to keep your code lean and mean.

Despite its general effectiveness, unit testing is an inefficient way to defend against the perils of the strings problem. That’s because the strings problem is caused by knowledge deficits, which you can’t test for. If you don’t realize that you must escape one URL before you stuff it into another URL, you probably won’t think to write tests for that requirement.

Moreover, if you do think to write the tests, it’s expensive to get them right. In most unit-testing scenarios, getting the tests right is easier than, or at least comparable in difficulty to, getting the code under test right. That’s why unit testing is usually so efficient. For the strings problem, however, getting the tests right is often much more expensive than writing typical string-handling code. In my code sample above, for example, there are at least six ways the strings problem can cause trouble. How do you test for them all without making a mistake? It’s not easy.

In sum, unit testing probably isn’t the answer to the strings problem.

Other solutions to the strings problem

If unit testing isn’t the answer, what is?

Joel Spolsky wrote about the strings problem and suggested that using Hungarian notation was an effective solution. It might work, but it’s clunky.

In the database-programming world, many programmers have adopted the convention of never inserting a string into a SQL template by hand. Instead, they insert placeholders, typically question marks, into a template to indicate where they would like strings to be inserted. The template and the strings are then given to a special function that safely inserts the strings, escaping them as necessary. In Ruby on Rails, which has a fairly typical implementation, template expansion looks like this:

Post.find_by_sql \
  [ "SELECT * FROM posts WHERE author = ? AND created > ?",
    author_id, start_date ]

The question-marks-in-the-template solution is effective, but it’s also clunky, especially when you’re trying to insert a lot of strings. By comparison, Ruby’s native string-interpolation feature, in which the syntax #{...} lets us inject strings into a string template, is unsafe but much easier to follow:

chunkiness = "extra chunky" 
"I love #{chunkiness} bacon!" 
# ==> "I love extra chunky bacon!" 

In sum, the Hungarian-notation solution and the question-marks solution are reasonable responses to the strings problem, but both are clunky, especially when compared to the straightforwardness of good-old string interpolation.

Perhaps we can do better.

Eating and having one’s cake: a type-based solution

An ideal solution would combine the safety of the question-marks solution with the straightforward convenience of string interpolation, and it would work for all kinds of strings, not just SQL. And, because I’m implementing it in Haskell, it would lovingly nestle into Haskell’s type system and gain the full benefits of type-inferencing goodness.

How would it work? Well, let’s back up and think about strings for a moment. We can divide strings into two classes: (1) those that represent text, in which every character represents literally itself; and (2) those that represent fragments of interpreted languages, such as XML or SQL, where each character’s interpretation depends on the rules of the associated language. In text, for example, an ampersand (“&”) represents an ampersand, but in XML an ampersand represents the start of a character-entity reference.

It doesn’t make sense, then, to join text strings directly with language-fragment strings. If you did join them, text characters could be misinterpreted as language characters. For the same reason, it doesn’t make sense to join fragments of different languages together. (It does make sense, however, to escape text strings or language fragments “into” a target language and then join them with strings in the target language.)

A sound solution, therefore, should enforce the following fundamental, safe-string-handling rule: Do not allow strings that represent fragments of one language to be directly joined with strings that represent either plain text or fragments of another language.

The trick is making the computer enforce this rule for us. As it turns out, modern type systems absolutely love to do this kind of thing.
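To make the rule concrete before building the real library, here is a minimal sketch of the idea. It assumes two hypothetical wrapper types, Html and Sql, and a toy escaping rule; none of these names belong to the library developed below, and a real module would hide the constructors so that nobody could forge values.

```haskell
-- Hypothetical wrapper types, one per language (illustration only).
newtype Html = Html String deriving (Eq, Show)
newtype Sql  = Sql  String deriving (Eq, Show)

-- Joining is defined only for two fragments of the same language.
joinHtml :: Html -> Html -> Html
joinHtml (Html a) (Html b) = Html (a ++ b)

-- Plain text gets into Html only by being escaped first.
textToHtml :: String -> Html
textToHtml = Html . concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

-- A safe union: a fragment joined with escaped text.
greeting :: Html
greeting = Html "<b>Hello, </b>" `joinHtml` textToHtml "Tom & Jerry"

-- Each of these is rejected by the type checker if uncommented:
-- bad1 = Html "<b>" `joinHtml` "plain text"    -- a String is not Html
-- bad2 = Html "<b>" `joinHtml` Sql "SELECT 1"  -- Sql is not Html
```

Writing one join function and one escaping function per language gets old quickly, which is why the library below abstracts over the language with a type class.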

A solution to the strings problem in Haskell

Making the computer enforce our safe-string-handling rule in Haskell is fairly easy. All it takes is a little code. (As we go through the following code, remember that we’re writing a library. Normally, as users of the library, we would never see this code.)

To begin, we create a module for our code and export the essential types and functions that make up our about-to-be-written safe-string kernel:

module SafeStrings
(
  Language(..)
, SafeString -- we export the data type but not the constructors
, empty, frag, text
, cat, (+++)
, render, renders, lang
, q
, declareSafeString
)
where

In order to create safe strings that correspond to particular languages, we need to tell the computer what we mean by Language:

class Language l where
    litfrag  :: String -> l   -- String is a literal language fragment
    littext  :: String -> l   -- String is literal text
    natrep   :: l -> String   -- Gets the native-language representation
    language :: l -> String   -- Gets the name of the language

Here we’re saying that Language is the class of languages, i.e., all data types l for which we can provide four functions:

  1. litfrag – converts a string that represents a language fragment into a language fragment
  2. littext – converts a string that represents plain text into a language fragment that represents the text (via escaping)
  3. natrep – converts a language fragment, verbatim, into a string that represents the language fragment
  4. language – returns the name of the language associated with a given fragment

Further, we need to declare a few “language laws” that conforming Language types must obey. These laws are for us. They will keep us honest when teaching the computer about new languages. Here are the two laws we will require language types to satisfy:

  • natrep (litfrag s) == s
  • natrep (littext s) == (escapeL s)

The first law requires that (natrep . litfrag) be equivalent to the identity function for strings. The second law requires that (natrep . littext) be equivalent to the text-escaping function for a given language L. For example, for the language XML:

natrep (litfrag "<em>wow!</em>") ==> "<em>wow!</em>" 
natrep (littext "ham & eggs")    ==> "ham &amp; eggs" 
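Because the laws are promises rather than something the compiler checks, it can help to spot-check them for each new language. The following is a minimal sketch under stated assumptions: ToyXml, escapeToy, lawFrag, lawText, and lawsHold are hypothetical names invented for this illustration, and the toy escaping rule covers only three characters.

```haskell
-- The Language class, repeated from above so the sketch stands alone.
class Language l where
    litfrag  :: String -> l
    littext  :: String -> l
    natrep   :: l -> String
    language :: l -> String

-- A toy XML-like language type, for illustration only.
newtype ToyXml = ToyXml String

-- The toy language's text-escaping function (plays the role of escapeL).
escapeToy :: String -> String
escapeToy = concatMap esc
  where
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '&' = "&amp;"
    esc c   = [c]

instance Language ToyXml where
    litfrag           = ToyXml
    littext           = ToyXml . escapeToy
    natrep (ToyXml s) = s
    language _        = "toyxml"

-- Law 1: natrep (litfrag s) == s
lawFrag :: String -> Bool
lawFrag s = natrep (litfrag s :: ToyXml) == s

-- Law 2: natrep (littext s) == escapeL s
lawText :: String -> Bool
lawText s = natrep (littext s :: ToyXml) == escapeToy s

-- Spot-check both laws on a handful of sample strings.
lawsHold :: Bool
lawsHold = all lawFrag samples && all lawText samples
  where
    samples = ["", "ham & eggs", "<em>wow!</em>", "a < b > c"]
```

With a property-testing library such as QuickCheck, lawFrag and lawText could be run against random strings instead of a fixed sample list.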

Next, let’s construct a type-safe container for strings having a known language:

-- No class context on the data type itself (current GHC rejects one unless
-- the deprecated DatatypeContexts extension is enabled); the kernel
-- functions below carry the Language constraint instead.
data SafeString l
    = SSEmpty
    | SSFragment l
    | SSCat (SafeString l) (SafeString l)

This data-type definition says that for any language type l, we can construct SafeString values for that language. Each value can represent an empty fragment of the language (via SSEmpty), a non-empty fragment of the language (via SSFragment), or the concatenation of two other SafeString values for the language (via SSCat).

Now comes the interesting part. We are going to use the type system to enforce the safe-string-handling rule for us.

We will do this using the SafeString data type we just defined. We have already placed the data type’s definition into a module that does not export the type’s data constructors. That means we will not be able to create SafeString values for ourselves. Instead, we must ask a small set of kernel functions, which are exported, to create the values on our behalf.

These kernel functions, which we are about to write, will create SafeString values only in accordance with our safe-string-handling rule. In particular, they will require us to certify that an existing string represents either text or a language fragment before creating a corresponding SafeString value for us. From then on, the type system will know which language the string is associated with and prevent us from joining it to regular strings or to SafeString values associated with other languages.

Let’s write these constructor functions now:

empty      :: Language l => SafeString l
empty       = SSEmpty

frag, text :: Language l => String -> SafeString l
frag f      = SSFragment (litfrag f)
text s      = SSFragment (littext s)

Here’s what the functions do:

  • empty – creates an empty SafeString in the Language l
  • frag f – takes a string that you certify as representing a fragment in the Language l and returns a corresponding SafeString
  • text s – takes a string that you certify as representing text and returns a corresponding SafeString in the Language l

Once the kernel creates SafeString values for us, we need some way to combine them safely. Thus we define the (+++) operator and the cat function:

-- join two SafeStrings of the same language
(+++) :: Language l => SafeString l -> SafeString l -> SafeString l
(+++)  = SSCat

-- join a list of same-language SafeStrings
cat   :: Language l => [SafeString l] -> SafeString l
cat    = foldr (+++) empty

Finally, we need a way to convert SafeString values into normal strings so that we can pass them through the boundaries of our safe-string-protected code and into the outside world. For this, we write the render function:

render :: Language l => SafeString l -> String
render ss = renders ss ""

renders :: Language l => SafeString l -> ShowS
renders SSEmpty        = id
renders (SSFragment a) = (natrep a ++)
renders (SSCat l r)    = renders l . renders r

(Don’t worry about the renders stuff. It implements a Haskell idiom for fast string concatenation.)
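For readers who do want to worry about it, the idiom can be sketched in isolation. The Rope type below is a hypothetical stand-in for SafeString, and renderNaive is included only for contrast; neither is part of the library.

```haskell
-- A toy rope of string pieces, standing in for SafeString (illustration only).
data Rope
    = Empty
    | Piece String
    | Join Rope Rope

-- Naive rendering: (++) re-copies the left result at every Join, so a
-- left-nested chain of joins costs quadratic time.
renderNaive :: Rope -> String
renderNaive Empty      = ""
renderNaive (Piece s)  = s
renderNaive (Join l r) = renderNaive l ++ renderNaive r

-- The idiom: each node becomes a function that prepends its text to
-- whatever comes after it (the Prelude calls this function type ShowS).
renders :: Rope -> ShowS
renders Empty      = id
renders (Piece s)  = (s ++)
renders (Join l r) = renders l . renders r

-- Supplying the empty string as "what comes after" yields the result.
render :: Rope -> String
render r = renders r ""

-- A deliberately left-nested rope: Join (Join (Join (Join Empty a) b) c) d
sample :: Rope
sample = foldl Join Empty (map Piece ["a", "b", "c", "d"])
```

Both renderers produce the same string; the difference is that function composition lets every piece be prepended exactly once, no matter how the joins are nested.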

As a convenience, let’s round out our kernel with a Show instance that tells Haskell how to format SafeString values for display.

instance Language l => Show (SafeString l) where
    showsPrec _ ss =
        (lang ss ++) . (":\"" ++) . renders ss . ('"':)

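-- The fragment is never inspected; it only fixes the type for 'language'.
lang :: Language l => SafeString l -> String
lang ss =
    let SSFragment e = ss in language (undefined `asTypeOf` e)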

And that’s our SafeStrings kernel.
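One detail in the Show support is easy to miss: lang reports the language name even for a safe string that holds no fragment at all. The sketch below isolates that trick. It uses cut-down stand-ins for the kernel’s class and data type (only the language method, and no SSCat), so treat it as an illustration of the laziness involved rather than as the library’s code.

```haskell
-- Cut-down stand-ins for the kernel's class and data type (illustration only).
class Language l where
    language :: l -> String   -- implementations must ignore the argument

data SafeString l
    = SSEmpty
    | SSFragment l

newtype XmlString = XmlString String

instance Language XmlString where
    language = const "xml"

-- The let-pattern is lazy and is never forced: 'e' serves only to give
-- 'undefined' the right type, and 'language' ignores the value it receives.
lang :: Language l => SafeString l -> String
lang ss =
    let SSFragment e = ss in language (undefined `asTypeOf` e)

-- Works even though SSEmpty contains no fragment to look at.
nameOfEmpty :: String
nameOfEmpty = lang (SSEmpty :: SafeString XmlString)
```

The flip side is that an instance whose language method pattern-matched on its argument would crash here, so language should always be a constant function. A signature that takes a Data.Proxy value instead of a fragment would be one way to make that requirement explicit.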

Another look at the SafeStrings kernel

The following illustration, complete with poorly chosen colors, provides a visual summary of our system:

[Illustration: stunning visual interpretation of the SafeStrings kernel and its relationship to the evil outside world]

(Don’t worry about the $(q ...) stuff for the moment; we’ll talk about it later.)

Activating our mad art-interpretation skillz, we can now decipher the illustration:

Regular strings gain “admittance” to the SafeStrings kernel only via the text and frag certification functions, which we use to create corresponding safe strings for a given language. Once created, the safe strings live their entire lives in the fleshy-colored, egg-shaped protective sac that is the kernel, whose safe-string functions and operators use Haskell’s type system to prevent us from accidentally mixing the strings in unsafe ways. Further, because the kernel does not export its underlying data structures, we can’t screw around with the innards of our safe strings to break the kernel’s promises. When our safe strings have finally reached their ultimate, beautiful state, we can render them into regular strings and pass them bravely into the cruel outside world – where, most likely, somebody else’s broken code will screw them up anyway. But at least we tried.

Our first SafeString module: SafeXml

Now that we have written our SafeStrings kernel, let’s use it to create a SafeXml module that we can use for working with XML. Again, we will be writing library code that under normal circumstances would be hidden from view.

First, we will create a new module that uses the SafeStrings kernel:

module SafeXml
( Xml, xml, renderXml, module SafeStrings )
where
import SafeStrings

Next, we will create a wrapper type to testify that a string represents a fragment of XML:

newtype XmlString
    = XmlString { unXmlString :: String }
    deriving Show

If you go back and look at the export list for the module, you’ll see that the XmlString data type is not exported. It is internal to the module, and thus we, as clients of the module, can’t create values of that type. That means we can’t “forge” XML strings into existence. We can create them only through the safe-string kernel, and even then only by certifying a regular string as representing text or a language fragment. (The kernel, in turn, will create the needed values through the Language interface, which we now discuss.)

Like all good language types, XmlString needs to be a member of the Language type class, so we provide the necessary instance functions:

instance Language XmlString where
    litfrag  = XmlString
    littext  = XmlString . escapeXml
    natrep   = unXmlString
    language = const "xml"

Note that the functions satisfy the language laws we defined earlier. (The proof follows immediately from the definitions of XmlString, unXmlString, and escapeXml.)

Next, we need to write a function to implement the escaping rule for XML:

escapeXml xs =
    concatMap esc xs
  where
    esc '<'  = "&lt;"
    esc '>'  = "&gt;"
    esc '&'  = "&amp;"
    esc '"'  = "&#34;"
    esc '\'' = "&#39;"
    esc x    = [x]
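As a quick sanity check of the escaping rule, here is a small sketch that runs it over a piece of hostile input. The names hostile, escaped, and doubled are hypothetical and exist only for this illustration.

```haskell
-- escapeXml as defined above, repeated so the sketch stands alone.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '<'  = "&lt;"
    esc '>'  = "&gt;"
    esc '&'  = "&amp;"
    esc '"'  = "&#34;"
    esc '\'' = "&#39;"
    esc x    = [x]

-- A hypothetical piece of hostile user input.
hostile :: String
hostile = "<script>alert('xss')</script>"

-- After escaping, the markup characters are inert character references:
--   &lt;script&gt;alert(&#39;xss&#39;)&lt;/script&gt;
escaped :: String
escaped = escapeXml hostile

-- Escaping is not idempotent: a second pass re-encodes every '&',
-- turning "&lt;" into "&amp;lt;" and so on.
doubled :: String
doubled = escapeXml escaped
```

The doubled value shows why "just escape everything, everywhere" is not a fix: escaping too often corrupts the text, and escaping too rarely opens a hole. Tracking which strings are already XML is precisely the bookkeeping the type system takes over.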

Next, because we expect to work with XML frequently, we will create a convenient type synonym, Xml, for SafeString values that represent XML:

type Xml = SafeString XmlString

Finally, we will create a few convenience functions to create and render XML fragments. These functions are identical to the SafeString kernel’s frag and render functions, except that they are specialized to the Xml type. When we use them, we won’t need to provide additional type annotations; the computer will know we are dealing with XML strings:

xml :: String -> Xml
xml = frag

renderXml :: Xml -> String
renderXml = render

And we’re done.

Before going on, let me point out two things:

  1. If you think the code we have written so far is long or perhaps confusing, please remember that it is library code. Typically, you would never see it. All you would do is import SafeXml and start using the library.
  2. The SafeXml implementation is formulaic, and we can replace all of it except for the escaping function’s definition with a single line of code, something we will do later.

A quick test drive of our SafeXml module

Let’s give our SafeXml module a spin in the GHC interactive shell.

We can create an XML fragment by certifying that a regular string represents a language fragment (via the frag function) and telling Haskell that we expect a result of type Xml.

Ok, modules loaded: SafeXml, SafeStrings.
*SafeXml> frag "<em>wow!</em>" :: Xml
xml:"<em>wow!</em>" 

Note how the output is prefixed with the label “xml:” to tell us that our kernel certifies this value to represent an XML fragment.

Because entering type annotations can be inconvenient, we can instead use the xml function, which certifies a string not just as a fragment but as an XML fragment:

*SafeXml> xml "<em>wow!</em>" 
xml:"<em>wow!</em>" 

If we want to represent text in XML, the kernel will automatically escape it for us:

*SafeXml> text "ham & eggs" :: Xml
xml:"ham &amp; eggs" 

Now let’s try to do something naughty. Will the type system let us?

*SafeXml> let someXml = xml "<em>Hi!</em>" 
*SafeXml> let plainOldText = "ham & eggs" 
*SafeXml> someXml ++ plainOldText

<interactive>:1:0:
    Couldn't match `[a]' against `Xml'
      Expected type: [a]
      Inferred type: Xml
    In the first argument of `(++)', namely `someXml'
    In the definition of `it': it = someXml ++ plainOldText

In Haskell, the (++) operator is used (among other things) to join strings. In the code above, we tried to use this operator to join an XML fragment to a plain-old string, which would have violated our safe-string-handling rule. Fortunately, we were unable to fool the type system into allowing this ill-conceived union to occur.

In fact, the union was never even attempted: our mistake was caught at compile time, before the code was ever converted into executable form. This is a big deal. Mistakes like this are programming errors that open security holes. Being able to catch these errors at compile time means you have the opportunity to track the errors to their source and fix them there. If you caught ill-conceived string unions only at run time, the logical errors that led to the attempted unions could have been in upstream code that has already executed – launching the missiles, perhaps. By then, it may be too late to undo the consequences.

Returning to our example, if we certify that the plain-old string represents text, we can make a safe union, so the type system lets us go ahead:

*SafeXml> someXml +++ text plainOldText
xml:"<em>Hi!</em>ham &amp; eggs" 

And that’s basically all there is to it.

Syntactic sugar for safe strings

Not having to worry about the strings problem is fabulous and all, but having to type in frag, text, and +++ is kind of clunky. Let’s get rid of the clunkiness by introducing some syntactic sugar.

The common case when dealing with strings in web applications is templates. For example, here’s a simplified version of the link_to method from the deservedly popular Ruby on Rails. The method wraps a hypertext link around some content by “interpolating” the content and a URL into a link template:

# NOTE: this example is in Ruby

def link_to(content_xhtml, url)
  "<a href=\"#{h url}\">#{content_xhtml}</a>" 
end

In this code, we need to HTML-escape the URL (via the h helper) before interpolating it into the template. We do not need to escape the content, however, because it is already in the template’s language, XHTML.

Now, to introduce our syntactic sugar, here’s link_to rewritten in Haskell and using safe strings:

-- Haskell code

link_to :: Xhtml -> Url -> Xhtml
link_to content url =
    $(q "<a href=\"#{r url}\">#{=content}</a>")

The type signature makes clear to everybody that the content parameter is XHTML, the url parameter is a URL, and the result is XHTML. The signature isn’t needed, but link_to is the stuff of libraries, and so annotations are good form.

The interpolation syntax is like Ruby’s, but with slightly different modifiers:

  • The template-quoting syntax is $(q "this is a template"). (Mnemonic: q for quote).
  • Within a template, we can interpolate variables using the familiar #{var} syntax.
  • If an interpolated variable holds a plain string, it will be escaped into the template automatically.
  • If an interpolated variable holds a safe string, we must use an interpolation modifier to specify how it should be interpolated (to avoid ambiguity):
    • #{r var} renders the safe string in var into text, and then interpolates the text into the template, escaping as necessary (mnemonic: r for render).
    • #{= var} inserts the safe string in var directly into the template, which must be of the same language (mnemonic: = for equal language types).
  • As a bonus, #{s var} interpolates any Show-able value in var into the template as text, escaping as necessary.

It’s pretty easy to tell which interpolation option is right for any situation, but late-night coding sessions make fools of us all. That’s why the type system is there to catch us when we make a dumb mistake.

Let’s try out the sugary link_to method:

> link_to (text "Tom's Weblog") (url "http://blog.moertel.com/")
xml:"<a href="http://blog.moertel.com/">Tom&#39;s Weblog</a>" 

Let’s take advantage of type inferencing in the next example:

> link_to $(q "<em>Espresso!</em>")
          $(q "http://google.com/search?q=espresso&oe=utf-8")

xml:"<a href="http://google.com/search?q=espresso&amp;oe=utf-8">
     <em>Espresso!</em></a>" 

In the above example, we supplied templates as input parameters. Haskell figured out their types and took care of the escaping (or not escaping) for us.

Now that we know what the syntactic sugar looks like, let’s see how to implement it.

Implementing the syntactic sugar using Template Haskell

We implement the SafeString library’s syntactic sugar using Template Haskell. A small function q (for “quote”) parses the sugared syntax at compile time and emits equivalent code using our safe-string functions frag, text, and so on. For example, the following sugar:

$(q "<em>#{mystr}</em>")

becomes the following code:

cat [frag "<em>", text mystr, frag "</em>"]
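The other interpolation modifiers expand in the same spirit. As a rough sketch, here is a template that uses all four forms, written out by hand the way q would expand it. To keep the example in a single file, it uses compact stand-ins for the kernel and for the XML and URL languages; the link function, its template, the escXml and escUrl helpers, and the deliberately simplified URL escaping rule are all assumptions made for this illustration.

```haskell
-- Compact stand-ins for the kernel and two languages (illustration only).
class Language l where
    litfrag :: String -> l
    littext :: String -> l
    natrep  :: l -> String

data SafeString l
    = SSEmpty
    | SSFragment l
    | SSCat (SafeString l) (SafeString l)

frag, text :: Language l => String -> SafeString l
frag = SSFragment . litfrag
text = SSFragment . littext

cat :: [SafeString l] -> SafeString l
cat = foldr SSCat SSEmpty

render :: Language l => SafeString l -> String
render SSEmpty        = ""
render (SSFragment a) = natrep a
render (SSCat l r)    = render l ++ render r

newtype XmlString = XmlString String
newtype UrlString = UrlString String

escXml :: Char -> String
escXml '<' = "&lt;"
escXml '>' = "&gt;"
escXml '&' = "&amp;"
escXml '"' = "&#34;"
escXml c   = [c]

escUrl :: Char -> String   -- deliberately simplified rule
escUrl ' ' = "+"
escUrl '&' = "%26"
escUrl c   = [c]

instance Language XmlString where
    litfrag              = XmlString
    littext              = XmlString . concatMap escXml
    natrep (XmlString s) = s

instance Language UrlString where
    litfrag              = UrlString
    littext              = UrlString . concatMap escUrl
    natrep (UrlString s) = s

type Xml = SafeString XmlString
type Url = SafeString UrlString

-- Hand expansion of the hypothetical template
--   <a href="#{r url}" title="#{title}" tabindex="#{s n}">#{=content}</a>
-- with one list element per template part.
link :: Url -> String -> Int -> Xml -> Xml
link url title n content = cat
    [ frag "<a href=\""
    , text (render url)   -- #{r url}    : render, then escape as text
    , frag "\" title=\""
    , text title          -- #{title}    : plain string, escaped
    , frag "\" tabindex=\""
    , text (show n)       -- #{s n}      : show, then escape as text
    , frag "\">"
    , content             -- #{=content} : same language, inserted as is
    , frag "</a>"
    ]

demo :: String
demo = render (link (frag "http://example.com/?a=1&b=2")
                    "ham & eggs" 7 (frag "<em>hi</em>"))
```

Here demo evaluates to `<a href="http://example.com/?a=1&amp;b=2" title="ham &amp; eggs" tabindex="7"><em>hi</em></a>`. Swapping content for a Url, or dropping the render around url, makes the list ill-typed, which is the whole point.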

The code that makes it happen is fairly straightforward if you know Template Haskell, so I’ll skip the explanation because this article is already way too long. As usual, it’s library code, so normally we wouldn’t see it or care about it. All we care about is the $(q "...") sugar that the code makes available to us.

Here it is:

import Control.Monad (liftM)
import Language.Haskell.TH
import qualified Text.ParserCombinators.ReadP as P

-- Convert template sugar into calls to frag, text, cat, etc.
-- This function is exported by the SafeStrings module.

q spec =
    [| cat $(parts) |]
  where
    parts = case xparse spec of
        []   -> error ("bad template: " ++ show spec)
        ps:_ -> foldr gen [| [] |] ps
    gen p ps' = (\p' -> [| $p' : $ps' |]) $ case p of
        SFrag s  -> [| frag $(litE (stringL s))         |]
        SIFrag s -> [| $(varE (mkName s))               |]
        SIShow s -> [| text (show $(varE (mkName s)))   |]
        SITxt s  -> [| text $(varE (mkName s))          |]
        SIRTxt s -> [| text (render $(varE (mkName s))) |]

-- AST for template-specification parts

data SpecPart
    = SFrag String  -- ^ language fragment
    | SIFrag String -- ^ insert fragment by variable reference
    | SIShow String -- ^ insert rendered variable via show
    | SITxt String  -- ^ insert literal text variable
    | SIRTxt String -- ^ insert rendered safe string var as text
  deriving Show

-- Parse a template specification

xparse spec = do
    -- keep only the parses that consumed the entire template
    (result, "") <- P.readP_to_S templateP spec
    return result
 where
    templateP = do
        P.many ((liftM SFrag (P.munch1 (/= '#'))) P.<++
                interpolationP P.<++
                liftM SFrag (P.string "#"))

    interpolationP = do
        P.string "#{"
        spec <- P.manyTill P.get (P.char '}')
        return $ case spec of
          'r':' ':var -> SIRTxt (strip var)
          's':' ':var -> SIShow (strip var)
          '=':var     -> SIFrag (strip var)
          var         -> SITxt  (strip var)

strip = frontAndBack (dropWhile (== ' '))
frontAndBack f = reverse . f . reverse . f
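Since the parser carries most of the weight, it may help to see what it produces without any Template Haskell in the way. The following standalone sketch reuses the same grammar; parseTemplate is a hypothetical wrapper that returns the first complete parse as a Maybe, and the Eq instance is added only so that results can be compared.

```haskell
import Control.Monad (liftM)
import qualified Text.ParserCombinators.ReadP as P

-- AST for template parts, as above, with Eq added for comparisons.
data SpecPart
    = SFrag String   -- literal language fragment
    | SIFrag String  -- #{=var}
    | SIShow String  -- #{s var}
    | SITxt String   -- #{var}
    | SIRTxt String  -- #{r var}
  deriving (Eq, Show)

-- Hypothetical wrapper: the first parse that consumes the whole template.
parseTemplate :: String -> Maybe [SpecPart]
parseTemplate spec =
    case [ parts | (parts, "") <- P.readP_to_S templateP spec ] of
        parts : _ -> Just parts
        []        -> Nothing
  where
    -- A template is a run of literal text, interpolations, and stray '#'s.
    templateP =
        P.many (liftM SFrag (P.munch1 (/= '#')) P.<++
                interpolationP P.<++
                liftM SFrag (P.string "#"))

    -- Everything between "#{" and the first '}' selects the modifier.
    interpolationP = do
        _ <- P.string "#{"
        body <- P.manyTill P.get (P.char '}')
        return $ case body of
            'r':' ':var -> SIRTxt (strip var)
            's':' ':var -> SIShow (strip var)
            '=':var     -> SIFrag (strip var)
            var         -> SITxt  (strip var)

    strip = frontAndBack (dropWhile (== ' '))
    frontAndBack f = reverse . f . reverse . f

-- parseTemplate "<em>#{mystr}</em>"
--   == Just [SFrag "<em>", SITxt "mystr", SFrag "</em>"]
example :: Maybe [SpecPart]
example = parseTemplate "<em>#{mystr}</em>"
```

Each SpecPart then maps onto exactly one kernel call in q, so the parse result above corresponds to the list frag "<em>", text mystr, frag "</em>".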

More sugar: defining additional safe-string types

One additional bit of Template Haskell code, which I won’t reprint here, defines declareSafeString. This function lets us eliminate the boilerplate code when defining new safe-string types. For example, compare our earlier definition of the SafeXml module with the following implementation of a module for safe URL strings:

module SafeUrl (Url, url, renderUrl, module SafeStrings) where
import SafeStrings
import Text.Printf
import Data.Char (ord)

escapeUrl xs =
    concatMap esc xs
  where
    esc x | isReserved x || x > '~' = urlEncode x
          | x == ' '                = "+"
          | otherwise               = [x]

urlEncode x  = '%' : printf "%02x" (ord x)
isReserved   = (`elem` "!#$%&'()*+,/:;=?@[]")  -- '%' included so literal percent signs get encoded

$(declareSafeString "url" "Url" [| escapeUrl |])

The final line generates the boilerplate code for the wrapper type, the language definition, the Url type synonym, and the url and renderUrl language-specific convenience functions.
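To see the URL escaping rule at work on its own, here is a small sketch that applies it to the kind of values the blog example passes around. The rule is modeled on escapeUrl above; titleParam and urlParam are hypothetical names for this illustration.

```haskell
import Data.Char (ord)
import Text.Printf (printf)

-- Escaping rule modeled on escapeUrl above.
escapeUrl :: String -> String
escapeUrl = concatMap esc
  where
    esc x | isReserved x || x > '~' = urlEncode x
          | x == ' '                = "+"
          | otherwise               = [x]

-- Percent-encode one character as '%' plus its code point in hex.
urlEncode :: Char -> String
urlEncode x = '%' : printf "%02x" (ord x)

-- '%' is treated as reserved so that literal percent signs are encoded too.
isReserved :: Char -> Bool
isReserved = (`elem` "!#$%&'()*+,/:;=?@[]")

-- An article title as a query-string value: "I+love+chunky+bacon%21"
titleParam :: String
titleParam = escapeUrl "I love chunky bacon!"

-- A whole URL as a query-string value:
--   "http%3a%2f%2fblog.moertel.com%2f%3fa%3d1%26b%3d2"
urlParam :: String
urlParam = escapeUrl "http://blog.moertel.com/?a=1&b=2"
```

One caveat that this sketch shares with the rule it is modeled on: for characters beyond code point 255, printf "%02x" emits more than two hex digits, so text outside Latin-1 would need to be UTF-8 encoded before being percent-encoded.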

One big example to wrap things up

Because we have been discussing mainly library code, let’s take a step back and see some typical user-level code that uses safe strings. After all, that’s what counts.

Here is a Haskellized, safe-strings version of the Ruby (on Rails) code that I presented at the beginning of the article to add submit-to-Reddit and submit-to-del.icio.us buttons to my blog:

{-# LANGUAGE TemplateHaskell, ScopedTypeVariables #-}
module Example where
import Data.List (intersperse)
import SafeXml
import SafeUrl

type Xhtml = Xml

submit_this_article_links :: Article -> Xhtml
submit_this_article_links (Article title url) =
    cat . intersperse nbsp $ do
    (submit_title, submit_url :: Url, image_tag) <- site_list
    return $(q
      "<a href=\"#{r submit_url}\" \
         \title=\"#{submit_title}: &#x201C;#{title}&#x201D;\" \
        \>#{=image_tag}</a>" )

  where

    nbsp = xml "&#160;"

    site_list = [  -- move me into a database table
      ( "Submit to Reddit.com"
      , $(q "http://reddit.com/submit?url=#{r url}&title=#{title}")
      , image_tag "reddit.gif" "18x18" 0
      ),
      ( "Save to del.icio.us"
      , $(q "http://del.icio.us/post?v=2&url=#{r url}&title=#{title}")
      , image_tag "delicious.gif" "16x16" 0
      ) ]

The code looks fairly similar to the original Ruby code, with the exception of some extra backslashes, courtesy of Haskell’s rather-unfortunate syntax for multi-line string constants. (Perl and Ruby’s <<HERE syntax would be a welcome addition.)

The other big difference is that, in this version, the type system has automatically checked the code for strings-problem errors.
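
As a rough illustration of what "checked by the type system" means here, the sketch below uses two plain newtypes instead of the SafeStrings machinery. The names Xml, Url, text, xml, url, cat, and renderUrl echo the article, but their definitions here are simplified assumptions, and linkTo is a hypothetical hand-written analogue of a template such as the one in link_to.

```haskell
-- Two distinct string "languages"; the type checker keeps them apart.
newtype Xml = Xml String deriving (Eq, Show)
newtype Url = Url String deriving (Eq, Show)

-- Escape the characters that are special in XML text and attribute values.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc '"' = "&quot;"
    esc c   = [c]

-- Plain text becomes Xml only by being escaped.
text :: String -> Xml
text = Xml . escapeXml

-- Certify a string that is already well-formed markup (no escaping).
xml :: String -> Xml
xml = Xml

-- Certify a string that is already a valid URL (no escaping).
url :: String -> Url
url = Url

renderXml :: Xml -> String
renderXml (Xml s) = s

renderUrl :: Url -> String
renderUrl (Url s) = s

-- Concatenate fragments of the same language.
cat :: [Xml] -> Xml
cat = Xml . concatMap renderXml

-- Embedding a Url in markup: render it to text, then escape it as Xml.
linkTo :: Xml -> Url -> Xml
linkTo content u =
    cat [xml "<a href=\"", text (renderUrl u), xml "\">", content, xml "</a>"]

-- Rejected at compile time (a Url is not an Xml), so it stays commented out:
-- broken content u = cat [xml "<a href=\"", u, xml "\">", content, xml "</a>"]
```

The commented-out definition is the kind of mistake the article is about: forgetting to convert the URL before splicing it into markup. In this sketch it simply fails to type-check.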

For completeness, here is the example’s supporting code (again modeled on Ruby on Rails). This code also makes extensive use of safe-string templates:

image_tag :: String -> String -> Int -> Xhtml
image_tag file_name size border =
    $(q "<img src=\"#{r image_url}\" height=\"#{height}\" \
         \width=\"#{width}\" border=\"#{s border}\"/>")
  where
    image_url         = $(q "#{=site_root}images/#{file_name}")
    (width, _:height) = break (=='x') size

link_to :: Xhtml -> Url -> Xhtml
link_to content url =
    $(q "<a href=\"#{r url}\">#{=content}</a>")

data Article = Article
  { article_title  :: String
  , article_url    :: Url
    -- more fields here
  }

sample_article =
    Article "I love chunky bacon!" $
    url "http://blog.moertel.com/permalink/to/article"

site_root :: Url
site_root =  url "http://blog.moertel.com/"

Have we done it?

Have we rid ourselves of the strings problem? If we use a programming language like Haskell and a library like SafeStrings, I think we can answer yes.

To be clear, the fundamental problem of having to manage different kinds of strings is still with us. As programmers, we still must understand the differences between URLs, XML, SQL, untrusted user input, and so on. But now, we don’t have to be perfect. As long as we can reliably slap the right type on a string when it first appears, we can let the computer worry about it from then on. If we forget to escape the string later, as it winds its way through the twisty code of a large web application and interacts with other strings in potentially dangerous ways, the computer will catch our mistake – at compile time, before it can possibly become a live security hole.

But if slapping the right types on strings – certifying them – is a pain in the neck, we won’t do it. We will happily go back to our days of winging it, where every string interaction becomes an opportunity for a perfectly human mistake to give birth to a nasty security vulnerability.

That’s why syntax matters. That’s why Template Haskell, Lisp macros, and other meta-programming tools are important: they let us craft friendly syntaxes that encourage the use of programming aids like SafeStrings. That’s why type inferencing is important: it lets us do away with redundant annotations and makes working with types convenient, so we can reap the benefits of strong guarantees without having to pay prohibitive costs.

If there is a moral to this story, it’s that modern type systems and macro systems are powerful tools. They let us do things that otherwise would be impractically inconvenient. They extend our reach as programmers and let us solve problems that we couldn’t solve before.
Update: minor edits for clarity.


Comments

  1. Mac said about 3 hours later:

    Since many of us don’t write web applications in Haskell, the utility of this particular implementation is limited for us. What features in Haskell make it possible to write such a system and make it easy to use?

    Given that list of features, what other languages could we apply this to? In particular, what common web application languages could support this?

    I’m most interested personally in a solution for PHP, but I’m sure there are others who would like to see it in Ruby, Python, Java, JSP, ASP.Net, or Perl, as a few examples.

    Great work! Analyses like this are a great way to raise the bar for security in the web application community.

    Mac

  2. Tom Moertel said about 4 hours later:

    Mac,

    You asked some great questions. Let me get to them.

    What features in Haskell make it possible to write such a system and make it easy to use?

    The two things that make this particular solution practical are the two things I singled out at the end of the article: the type system and the compile-time meta-programming support. The first makes the solution possible, and the second makes the solution convenient enough that you would actually want to use it.

    The type system is the backbone upon which everything else rests. The solution, in fact, is just a thin veneer over the type system. Without the type system, there is no solution.

    The meta-programming support reduces the cost – or “pain” – of using the solution. Haskell doesn’t offer friendly, Ruby-like string interpolation, but we were able to add that feature using Template Haskell (TH). If we didn’t have something like TH, we would have needed to write a pre-processor to integrate the friendly syntax.

    Given that list of features, what other languages could we apply this to?

    Any language that has an underlying compile-time type system could, in theory, support a solution like the one presented in the article. Java, C++, and C#, for example, fall into that category. To make the solution’s syntax friendly enough, though, you would probably need to write an external pre-processor.

    Perl (5), PHP, Python, and Ruby don’t offer compile-time type checking (yet). This solution, which effectively is compile-time type checking, would therefore be difficult to adapt to those languages.

    Cheers,
    Tom

  3. Kig said about 13 hours later:

    This isn’t so relevant to the topic at hand, but I’d just like to say it: Tom, your blog is pretty damn great. Good depth and breadth, and always educational.

    Keep it up!

  4. Neil K said about 13 hours later:

    I love this. Thanks for writing!

  5. Tom Moertel said about 14 hours later:

    Mac,

    A quick update on your second question: Via the comments about this article on Reddit, I learned that somebody has, in fact, implemented a simple type-based XSS-prevention solution in a Python web-programming framework. About three years ago, too.

    I don’t think the solution will catch mistakes at compile time, but if you get the type annotations right, it will make sure that strings are properly escaped when combined.

    See nas’s comment for the details.

  6. Mac said about 14 hours later:

    Tom,

    Thanks for your answers. That’s what I was afraid the answer would be, but at least that gives us something to push for in the next version of [insert your favorite web language here]. Hopefully some of the people building PHP6 right now will notice this and at least add an optional static typing system.

    I’m also really glad that you hit the nail on the head about the importance of making a solution easy enough to use that it is worth using. A solution that costs too much (in time, effort, money, or whatever other metric you’re using) doesn’t really solve anything.

    Again, this is great work, and a much-needed look at an area (web development) that the academic Computer Science community often ignores or belittles. The fact of the matter is that web applications are more and more being selected as the most appropriate solution to a wide variety of problems, and it doesn’t seem to be slowing down. It’s about time that web programming tools and methods get some of the same attention that has been paid to other, more traditional methods for building applications. Many of the problems facing web development and web languages have already been solved, but many of those solutions haven’t yet been incorporated into popular new web development languages and tools. If we don’t hurry, we’ll start getting more people saying all our problems would be solved by writing web apps in LISP… and there’s just not room in the web world for all those parentheses.

    Thanks, Mac

  7. Kirit said about 15 hours later:

    I’ve been thinking along exactly the same lines, but doing it in C++. Again the templates, type inference and operator overloading make it all very simple from an application writer’s perspective.

    This sort of type based solution is much better than Joel’s solution (which by co-incidence I was re-reading today).

    One aspect of the problem that I didn’t notice you mention is the complication that the strings are broadly compatible in nature – most escaped strings are exactly the same as the un-escaped strings and rather critically the escaping (and un-escaping) process itself is not idempotent.

    An idempotent escaping scheme would be lovely!

  8. Tom Moertel said about 15 hours later:

    Kirit, on the issue of idempotence, I’m not sure I understand what you mean. Can you describe the properties that your ideal escaping scheme would have?

  9. Greg Buchholz said about 15 hours later:

    I’m a big fan of static typing and Haskell, but I think you can reproduce this in most any language, if you are willing to move the compile time error messages to run time.

    #!/usr/bin/ruby
    
    class SafeString
      attr_reader :str
      def initialize(str); @str = str; end
      def render; self.str; end
    end
    
    class Xml < SafeString
      def +(arg2)
        arg2.kind_of?(Xml) ? Xml.new(self.str + arg2.str) \
                           : Xml.new(self.str + escape(arg2))        
      end
      def escape(s)
        s.gsub(/&/,"&amp;").gsub(/</,"&lt;").gsub(/"/,"&#34;")
      end
    end
    

    Let’s try a few things out…

    irb(main):001:0> load "safe_strings.rb" 
    => true
    irb(main):002:0> someXml = Xml.new("<em>wow!</em>")
    => #<Xml:0xb7d88ac4 @str="<em>wow!</em>">
    irb(main):003:0> plainOldText = "bacon & eggs" 
    => "bacon & eggs" 
    irb(main):004:0> plainOldText + someXml
    TypeError: can't convert Xml into String
            from (irb):4:in `+'
            from (irb):4
            from :0
    irb(main):005:0> (someXml + plainOldText).render
    => "<em>wow!</em>bacon &amp; eggs" 
    
  10. Kirit said about 16 hours later:

    We can pick a pretty stupid scheme like this:

    For an SQL field replace all single quotes with the sequence BOM followed by ’s’ (for single quote). Because BOM can’t be in a valid string we can safely use it to escape and because we have changed the single quote into a valid character (s) that is also fine.

    Now if we call

    SQLEscape( SQLEscape( “Kirit’s string” ) )

    We will get the properly escaped string

    KiritBOMss string

    Unescape this and it is also idempotent. So that we can just keep escaping or unescaping at every opportunity and not have to worry about it.

    The problem with escaping is not really that we lose track of whether or not the strings are escaped, but rather that the escaping must be done exactly once in order to work.

    The normal SQL expansion of replacing a single quote with two single quotes is not idempotent. Not only do we have to remember to escape it, but we also have to ensure that it is escaped only once.

    Similar problems abound in URL escaping and SGML escaping.
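
    The "exactly once" point can be made concrete with a small Haskell sketch. It is not part of the comment: the function names are hypothetical, and U+FEFF stands in for the reserved BOM marker that the scheme assumes never occurs in valid input.

```haskell
-- Conventional SQL escaping: double every single quote.
sqlEscape :: String -> String
sqlEscape = concatMap esc
  where
    esc '\'' = "''"
    esc c    = [c]

-- Reserved marker, assumed never to occur in valid input (U+FEFF, the BOM).
marker :: Char
marker = '\xFEFF'

-- The scheme sketched in the comment: a quote becomes the marker plus 's'.
markerEscape :: String -> String
markerEscape = concatMap esc
  where
    esc '\'' = [marker, 's']
    esc c    = [c]

-- True when applying f twice gives the same result as applying it once.
idempotentOn :: (String -> String) -> String -> Bool
idempotentOn f s = f (f s) == f s
```

    Quote doubling fails idempotentOn for any input containing a single quote, because the second pass doubles the quotes again. The marker scheme passes because its output contains nothing left to escape.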

  11. Tom Moertel said about 16 hours later:

    Greg,

    The big benefit of this solution is that it provides compile-time guarantees. The type checker will prove at compile time that your program cannot, by some strange set of occurrences, mix strings in unsafe ways at run time. If the type checker finds an error, you can (and, in fact, must) fix it before your code can be deployed.

    A run-time solution, on the other hand, offers you little choice but to abort the current computation if a dangerous string interaction is detected in a live web application. But that’s a horrible thing to do to your users, so relying on a run-time solution requires you to be confident that no such errors slip into a live application.

    In effect, then, you’re back to trying to catch the errors at “compile time,” but via your unit testing suite instead of a compile-time type system. If you adopted the practice of always including something like assert_isa_safe_string_of_type in your tests for final view output, I suspect that you could get reasonably strong assurances against the possibility of run-time string-type errors. It would take a lot more work than borrowing an existing compile-time type system, but I suspect that the effort would still be worthwhile.

    Thanks for your great comment.

    Cheers,
    Tom

  12. Tom Moertel said about 16 hours later:

    Kirit,

    Thanks for the clarification.

    I think the desire for idempotent escaping misses something important: that whether something is sufficiently escaped depends upon intent. If for example, I have some XHTML and I want to insert it into an XHTML template, I have two options:

    1. Insert the XHTML as XHTML, so that it is interpreted as part of the template into which it is inserted
    2. Insert the XHTML as text, so that it is escaped and interpreted as text within the template

    Which option is correct depends upon my intent. That’s why the following code contains a type error under my solution:

    let markup = xml "<em>Hi!</em>" in $(q "#{markup}") :: Xml
    

    There are two valid ways of inserting markup into the template, so you must supply the r or = modifier to eliminate the ambiguity.

    The following code shows a typical use of both schemes together:

    showMeTheCode :: Xhtml -> Xhtml
    showMeTheCode markup =
        $(q "<p>The XHTML markup <code>#{r markup}</code> \
            \looks like this in your browser: #{= markup}.</p>")
    

    Note that if I call the code like so:

    showMeTheCode (text "rice & beans")
    

    the text must be escaped twice at one point, something an idempotent scheme would not allow.

    > showMeTheCode (text "rice & beans")
    xml:"<p>The XHTML markup
    <code>rice &amp;amp; beans</code>
    looks like this in your browser:
    rice &amp; beans.</p>" 
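
    The same effect can be reproduced without the template machinery. In this minimal sketch, escapeXml is a simplified stand-in for the library's XML escaping, and asText and asQuotedMarkup are hypothetical names for the two intents:

```haskell
-- Escape the characters that are special in XML text.
escapeXml :: String -> String
escapeXml = concatMap esc
  where
    esc '&' = "&amp;"
    esc '<' = "&lt;"
    esc '>' = "&gt;"
    esc c   = [c]

-- Intent 1: show the text as text, so it is escaped once.
asText :: String -> String
asText = escapeXml

-- Intent 2: show the markup's source, so the escaped text is escaped again.
asQuotedMarkup :: String -> String
asQuotedMarkup = escapeXml . escapeXml
```

    Here asText "rice & beans" gives rice &amp; beans and asQuotedMarkup "rice & beans" gives rice &amp;amp; beans. An idempotent escape could not produce the second result.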
    

    Cheers,
    Tom

  13. Kirit said 1 day later:

    Tom, you are of course exactly right, but this is still a consequence of the encoding scheme used.

    With SQL we can choose our own encoding for putting and getting strings into/from the database, and we could reserve a character that we make illegal in the strings in order to create an idempotent escape and unescape scheme.

    For XML the problem is that the characters used in escaping are also valid in the text and the structural markers are also legal data too.

    This is great in that it makes it convenient to use normal text processing tools to build XML files, but means we have to use an awkward escape/unescape system.

    But that shouldn’t stop us from imagining another encoding, even if only to see what sort of world it would leave us.

    So what if the XML tags were delimited with the character codes 0x16 & 0x17 rather than angled brackets? Now if we wanted to talk about XML we would need to use some other characters to show the correct format. Unicode already defines these and we’d simply drop in U+240e and U+240f. Add in ESC (0x1b) as a proper escape leader for things like quotes and we could now have XML that looked more like this:

    ␎tag attribute=”My ␛”Value␛” here”␏Text part ␎/tag␏

    Note that I can talk about the format without having to worry about double escaping, again because we use illegal characters the whole thing is idempotent.

    It’s also a right royal pain in the wossname to write by hand.

    I’m not really seriously proposing that anybody should have used a scheme like this, but we should recognise that the core reason we have so much trouble with the escaping and unescaping is the lack of idempotence in the transformation schemes, coupled with the actual idempotence of the transformations for the vast majority of text.

  14. Adam Fitzpatrick said 7 days later:

    On a somewhat related note, there’s an article in this week’s LWN about detecting endianness mismatches in the Linux kernel. New typedefs were introduced which allow the programmer to declare the endianness of the value. While the C compiler treats __be32 and __le32 (for example) as equivalent, a separate tool, sparse, can detect incorrect handling of these types.

  15. Oleg Kiselyov and Chung-chieh Shan said 10 days later:

    As you are aware, this approach generalizes beyond strings: it can prevent dereferencing a null pointer, dividing by zero, as well as buffer overflow (especially easily when the buffer is statically allocated). The idea of enforcing an invariant using an abstraction barrier dates back to James Morris’s “Protection in Programming Languages” (CACM 1973). Thirty years ago, the same idea prompted Robin Milner to design ML as a “scripting language” for his LCF theorem prover. ML’s static typing guaranteed that scripts (aka tactics) produce theorems only by applying valid inference rules. This guarantee as well as the solution to the present “string problem” can be achieved in any statically typed language that supports abstract data types—ML, Clean, and even Java and C++. It is hard to prove that the invariant is preserved (and language features such as introspection and coercion obviously break the proof). Currently the formal proof is given only for simple cases.
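
    A minimal Haskell sketch of the abstraction-barrier idea, applied to division by zero. The names NonZero, nonZero, and safeDiv are hypothetical and not taken from the cited work:

```haskell
-- In a real program this type would live in its own module that exports
-- NonZero without its constructor, so nonZero is the only way to build one.
newtype NonZero = NonZero Int deriving (Eq, Show)

-- The trusted check: the one place where the invariant is established.
nonZero :: Int -> Maybe NonZero
nonZero 0 = Nothing
nonZero n = Just (NonZero n)

-- Division that cannot divide by zero, because the divisor's type
-- records that the check has already been made.
safeDiv :: Int -> NonZero -> Int
safeDiv n (NonZero d) = n `div` d
```

    The parallel with safe strings is that the check happens once, when the value is certified, and the type carries that fact everywhere the value goes.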

  16. Tom Moertel said 13 days later:

    Oleg, thanks for sharing some of the historical context for type-based solutions to common programming problems. It is not surprising that the foundational ideas go back thirty years. What does surprise me, however, is that these ideas are not more widely used, given how effective they are.

    I can only hope that it doesn’t take another thirty years before some of the more recent advances reach the mainstream.

    Note to readers: If you want a glimpse into the future of type-based solutions to interesting programming problems, see Oleg Kiselyov and Chung-chieh Shan’s paper on Lightweight Static Capabilities. In the paper, they show how you can use types as static capabilities to verify safety conditions and, e.g., guard against programming errors such as trying to read beyond the bounds of arrays. They also crystallize the idea of using a trusted kernel to hand out type-based capabilities that are verified at compile time.

  17. paul@cogito.org.uk said 18 days later:

    For those who don’t know, Oleg Kiselyov is the World Grand Master of Haskell type hackery. He is the man who writes factorial functions in the type system. The rest of us normally-brained Haskell programmers measure our type-hackery in milli-Olegs.

  18. Justin Bailey said 27 days later:

    Tom,

    Thanks for the excellent article. I really appreciate your attention to showing the design of the library. As a beginning Haskell programmer, seeing your design and implementation process gave me a lot of insight into designing “in” Haskell, not just slapping some code together.

    There does seem to be one mistake in the link_to example. It appears to have improperly escaped “&” in the URL:

    > link_to $(q "<em>Espresso!</em>")
              $(q "http://google.com/search?q=espresso&oe=utf-8")
    
    xml:"<a href="http://google.com/search?q=espresso&amp;oe=utf-8">
         <em>Espresso!</em></a>" 
    
    

    Unless I misunderstood and that is intentional?

  19. Tom Moertel said 27 days later:

    Justin, thanks for your kind words about the article.

    On your question about the ampersand, in the URL itself, the ampersand is interpreted under the rules of the “URL language,” and thus serves to separate the two query parameters (q=espresso and oe=utf-8), which is the intended interpretation, so escaping is not needed. Nor is escaping needed when the URL is passed as a Url value into the link_to call: Url values represent fragments of the native URL language, and thus can represent our URL naturally. (Nor would escaping be correct here: it would alter the location the URL represents.)

    When the URL is embedded within XHTML as part of a hypertext link, however, its ampersand must be escaped to avoid being misinterpreted as XHTML (where the ampersand would represent the start of an XML character-entity reference).

    Does that clear things up?

  20. Daniel Martin said 39 days later:

    Justin, it’s a common mistake to believe that the ampersand in that context shouldn’t be html-escaped as well. However, it should be – see section B.2.2 of the HTML 4.01 specification. The clumsiness of having to do this double escaping is why many sensible CGI libraries accept “;” as a separator on a par with “&”.

    Tom, I’m glad to see this – I’ve suspected since seeing Joel Spolsky’s article that this problem was screaming out for a type-based solution.

    One quibble, though: the question marks with SQL are not really addressing the same type of problem as the xml/url/regexp/etc. string-escaping problem.

    First off, depending on your database the strings are not in fact escaped and inserted into the string – instead, a more common model is that the string with its question marks is sent to the database to be compiled, a “compiled statement” handle is returned, and this handle is then sent to the database with an array of values. This compiled statement can then be cached and re-used carefully to slightly speed things up if the same query is particularly frequent.

    However, even if you only use each precompiled statement once and therefore are using the question marks as a substitute for string escaping, there are at least two other issues:
    1. Your string substitution engine must be extended to handle inserting numeric values in a different fashion from strings. (and you should probably also handle date/time values in a type-safe manner too) This is a twist, but not an unsurmountable one.
    2. There is no general string escape method that works on both MySQL and non-MySQL SQL databases. (Briefly: how do you escape the three character string “\''”?) The advantage of the ? method is that the “escaping” happens in the database or the database driver. Offhand, this would seem to require that you know whether your application will target a MySQL database at compile time.

    Then there’s also the issue of escaping bits of a “LIKE” pattern, where the escaping depends on the query, and again on whether or not your target database is MySQL.

  21. DoubleD said 49 days later:

    You seem to mistake the idea of the ? in SQL statements. You are not passing off the string to a magic function for escaping or templating but compiling the SQL. The database interprets, optimizes, and plans the query. You then have holes in which to insert your data. There is no need to escape the data as the SQL has already been interpreted.

    This has a number of advantages:
    1. Even the best escaping routines tend to break large complex objects that need lots of escaping.
    2. You overcome command-text limits: your SQL statement is limited to a couple of thousand characters, but the amount of data that you place in your ‘holes’ is unlimited.
    3. You can take the performance hit of preparing the statement off stage, as it were, decreasing time spent waiting at time-critical stages.
    4. You get increased performance, as the prepared statement can be reused without the database firing up its interpreter, planner, and optimizer again.
    5. Smart drivers will cache the prepared statement, increasing performance even if you don’t explicitly reuse it.
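
    The "holes" model can be sketched as a plain data structure in which values travel next to the SQL text and are never spliced into it. This is only an illustration with hypothetical names (SqlValue, Bound, bind); it does not talk to a database or mirror any particular driver's API.

```haskell
-- Values are kept separate from the SQL text and never spliced into it.
data SqlValue = SqlText String | SqlInt Integer
  deriving (Eq, Show)

-- A statement template plus the values bound to its '?' holes.
data Bound = Bound String [SqlValue]
  deriving (Eq, Show)

-- Count '?' placeholders. This is naive: a '?' inside a quoted SQL
-- literal in the template would be counted as a hole too.
countHoles :: String -> Int
countHoles = length . filter (== '?')

-- Pair a template with its values only when the counts agree.
bind :: String -> [SqlValue] -> Maybe Bound
bind template values
  | countHoles template == length values = Just (Bound template values)
  | otherwise                            = Nothing
```

    Because the user-supplied text O'Brien would sit in a SqlText value rather than in the statement, no quoting decision ever has to be made on the application side.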

  22. Jonathan Allen said 50 days later:

    > Why are so many sites vulnerable to these well-known holes? Probably because it’s insanely hard for programmers to solve the fundamental “strings problem” at the heart of these vulnerabilities.

    You have no idea what you are talking about. It is trivially easy to handle this just by using stored procs or parameterized queries.

  23. Jonathan Allen said 50 days later:

    > The problem with escaping is not really that we lose track of whether or not the strings are escaped, but rather that the escaping must be done exactly once in order to work.

    Then only do it the moment you append it to the sql string. It isn’t hard, we have been doing it for decades in non-web applications just so that varChar fields won’t break.

  24. Tom Moertel said 50 days later:

    DoubleD:

    I didn’t mean to suggest that the only reason ? appears in SQL statements is for escaping. That reason, however, is the only one that’s relevant to the article, and so that’s the only one I discussed.

    You are not passing off the string to a magic function for escaping or templating but compiling the SQL.

    For the record, in some popular database-integration implementations (e.g., RoR’s ActiveRecord), the SQL statements are not compiled but pretty much expanded like a big string template and then sent to the database, as is.

    Jonathan Allen:

    The vulnerabilities caused by the “strings problem” are not limited to SQL injection. Take XSS, for example. You can’t solve that problem with stored procs or parameterized queries. If, then, you want to solve the whole problem, you’ll need a solution that works everywhere, like the type-based approach I presented in the article.

    Cheers,
    Tom

  25. Simon said 53 days later:

    Perl’s Taint provides the “certification” part for Perl CGI apps, by automatically identifying unsafe strings and requiring one to match it to a regular expression to make it safe, but no type checking of course, so one can still mangle the layering of strings. Still half a safety net is better than none.

    Clearly I need a language combining the best parts of Perl and Haskell, we should call it Paskell ;)

  26. Daniel Axelrod said 86 days later:

    Jonathan Allen, you are advocating programmer discipline as a solution to this problem. That’s a legitimate solution, but it’s harder to enforce, especially as the application becomes more complex.

    Programming is all about making the computer worry about things so you don’t have to.

  27. dbt said 118 days later:

    Putting user data in SQL commands is always wrong. All sane database APIs let you send data separate from the SQL, which is what ? placemarks are for.

    Python’s web.py is the best at this by far. It lets you do, say, web.select("select * from table where column = $value", {'value': userVariable})

    And it will turn that into a compilable SQL statement with arguments. Fantastic stuff.

  28. Joachim Breitner said 121 days later:

    I think we need something like that in Haskell to distinguish between bytestreams (such as the content of a file), which might contain encoded text, and encoding-independent text. I really wish functions like getContent would return [Char8], so people (like me) don’t accidentally mix UTF-8 encoded text with unencoded or differently encoded text.

  29. Guillaume said 166 days later:

    Where is the submit button to do searches on your blog please?

  30. Tom Moertel said 167 days later:

    Guillaume, this blog’s search system (unfortunately) requires Javascript and (also unfortunately) doesn’t have a submit button.

    If you want to search my site, the best option is to use Google. Just go to Google, enter your search terms, add the term site:blog.moertel.com, and finally submit your query.

    Cheers,
    Tom

  31. Dave Hinton said 198 days later:

    Very useful :-)

    Is this code packaged up anywhere? I looked in Hackage and Hoogle but couldn’t find it.

  32. Tom Moertel said 200 days later:

    Dave: No, the code still isn’t packaged yet. It’s on my list of things to do when I get a new shipment of spare time – or a clone. ;-)

    Cheers,
    Tom

  33. Victor Nazarov said 260 days later:

    Joachim Breitner, I agree about the encoding problem. Char seems to be a Unicode character in modern Haskell implementations. But it seems that Word8 should be used instead of the Char8 you mentioned. There are some implementations of binary I/O and encode/decode functions. Google for it.

    Tom, my opinion is that the Haskell way is not to use strings at all. SQL, URL, and XML should be embedded DSLs. There are implementations like HaskellDB and lots of XML combinator libraries, especially those for transforming XML. XML-transforming libraries make CGI programming with XML templates amazingly easy, and TH is not needed in user code. By using a DSL for every language (there are not so many in common web programming), you gain static type safety for many aspects of SQL and XML that otherwise may be tricky. And String objects will play the modest role of text literals, as they have for many years.

  34. Jeremy Hughes said 309 days later:

    Always nice to see a type system being put to good use.

    For interest’s sake I slapped together an implementation in Java (sans the Template Haskell sugar).

    public interface SafeString<L extends SafeString> {
    
        public String natural();
    
        public L join(L l);
    
        public String language();
    
    }
    
    public abstract class AbstractSafeString<L extends SafeString> implements SafeString<L> {
    
        @Override
        public String toString() {
            return language() + ": " + natural();
        }
    
    }
    
    public class SafeXMLFrag extends AbstractSafeString<SafeXMLFrag> {
    
        private final String content;
    
        public SafeXMLFrag(final String s) {
            content = s;
        }
    
        public SafeXMLFrag join(SafeXMLFrag l) {
            return new SafeXMLFrag(content + l.natural());
        }
    
        public String language() {
            return "XML";
        }
    
        public String natural() {
            return content;
        }
    
    }
    
    public class SafeXMLText extends SafeXMLFrag {
    
        public static final String AMP_REPL = "@AMP_REPLACEMENT!";
    
        public SafeXMLText(final String s) {
            super(s.replaceAll("&", AMP_REPL)
                   .replaceAll("<", "&lt;")
                   .replaceAll(">", "&gt;")
                   .replaceAll(AMP_REPL, "&amp;"));
        }
    
        public static void main(String...args) {
            final String ta = "<em>This is XML</em>";
            System.out.println(ta);
            final SafeXMLFrag xa = new SafeXMLFrag(ta);
            System.out.println(xa);
            final String tb = "Ampersands (&) need escaping.";
            System.out.println(tb);
            final SafeXMLFrag xb = new SafeXMLText(tb);
            System.out.println(xb);
            final SafeXMLFrag xc = xa.join(xb);
            System.out.println(xc);
        }
    
    }
    

    Nowhere near as nice as Haskell, but it works. Running SafeXMLFrag.main() prints:

    <em>This is XML</em>
    XML: <em>This is XML</em>
    Ampersands (&) need escaping.
    XML: Ampersands (&amp;) need escaping.
    XML: <em>This is XML</em>Ampersands (&amp;) need escaping.
    
  35. Jeremy Hughes said 309 days later:

    Woops. Typo. Should be, “Running SafeXMLText.main()...”

  36. Andy Armstrong said 357 days later:

    Here’s something similar for Perl

    http://search.cpan.org/~andya/String-Smart/

    It does no compile time checking of course. Instead it concentrates on tracking the current encoding of a string and computing and applying the correct transformations when asked for a particular representation of a string.

  37. Andy Armstrong said 357 days later:

    How do you know you won’t find ’@AMP_REPLACEMENT!’ in your strings Jeremy? :)

  38. Jeremy Hughes said 594 days later:

    ’@AMP_REPLACEMENT!’ was just a quick-n-dirty hack. Escaping in a single pass would solve the problem :-)
