Posterous theme by Cory Watilo

Bourne is not bash (or: read, echo, and backslash)

Bourne is not bash

Anyone who has ever written a shell script is familiar with the incantation, #!/bin/sh. This tells a Unix loader that the file should be interpreted by a Bourne shell. The Bourne shell is an appropriate lightweight interpreter for simple scripts that invoke a couple other programs. When you are using it, though, it's important to remember that bash and Bourne are subtly different shells.

The conflation of the two shells is exacerbated further by many Linux distributions having /bin/sh simply be a symlink to /bin/bash. Debian-based distributions, including Ubuntu, depart from this practice, instead having /bin/sh linked to /bin/dash, a faster, simpler, and POSIX-compliant version of the Bourne shell. In the process, of course, many bash conveniences are lost, and unexpected problems can arise if you are expecting bash behavior.

A catlike example

Let me give a simple illustration of the problem. Let's implement a Bourne version of cat:

If you've never used while read ... in a shell script, it's a very useful command. If you run man builtins, you'll learn that read name... reads a line from standard input, assigning each (whitespace-delimited) word to each variable name. When the last name is reached, the rest of the line is placed into that variable.

In my example above, the loop iterates for each line of standard input, assigning the text to the line variable. The echo command then prints that line out. Simple enough, right?

Problems with backslash

Let's try out our new script:

$ echo 'abcdefghijklmnop
123456789' | /bin/dash bourne-cat.sh
abcdefghijklmnop
123456789

So, exactly what we expect. Now let's break it:

$ echo 'abcdefghijklm\nop
123456789'
abcdefghijklm\nop
123456789

$ echo 'abcdefghijklm\nop
123456789' | /bin/dash bourne-cat.sh
abcdefghijklmnop
123456789

It swallowed the backslash—definitely not what we wanted. So what's going on?

read interprets backslash

Referring back to the builtins man page, "The backslash character (\) may be used to remove any special meaning for the next character read and for line continuations." That is, read considers backslash to be a special character, so the line variable does not contain the backslash. If we do not want that behavior, we have to use read -r:

Now our problem should be fixed:

$ echo 'abcdefghijklm\nop
123456789' | /bin/dash bourne-cat-read-fixed.sh
abcdefghijklm
op
123456789

Hmmm... So what happened there?

Bourne's echo interprets backslash

Since we're reasonably sure that read is now behaving nicely, there must be a problem with echo. Reading the man page, we see that echo accepts a -E option that disables the interpretation of backslash escapes. According to the man page, however, this is the default behavior. Oh well, let's throw the option in there to see if it makes a difference:

$ echo 'abcdefghijklm\nop
123456789' | /bin/dash bourne-cat-E.sh
-E abcdefghijklm
op
-E 123456789

Seriously? Our echo command is ignoring the -E option entirely, happily outputting both it and the interpreted newline. This is getting a bit nutty.

When echo(1) isn't echo

Returning to the echo man page, you will see an important caveat near the bottom, saying "NOTE: your shell may have its own version of echo, which usually supersedes the version described here. Please refer to your shell's documentation for details about the options it supports." Reading on, we see the AUTHORS: Brian Fox and Chet Ramey. Head on over to man bash, where you will see that it has the same authors. In other words, the echo man page we are reading is for bash's echo, and we are using dash's echo in our script (because we are using the dash interpreter). Let's try bourne-cat-read-fixed.sh again, this time using bash instead of dash:

$ echo 'abcdefghijklm\nop
123456789' | /bin/bash bourne-cat-read-fixed.sh
abcdefghijklm\nop
123456789

Using bash, it works exactly as expected. If we had been using Arch Linux or Gentoo, we would not have even realized the differences between putting #!/bin/sh and #!/bin/bash at the top of our script, but there are differences, and it's important for us to remain aware of them.

Two correct examples

The correct way to do it, portable on both bash and dash systems, is to use printf instead of echo. In doing so, note that it is important that the string to be printed is not used as the FORMAT string, but instead as one of the ARGUMENT strings, interpolated with the "%s" sequence:

$ echo 'abcdefghijklm\nop
123456789' | ./bourne-cat-printf.sh
abcdefghijklm\nop
123456789

Another way to do it, if you prefer to use bash's echo instead of printf (as I do), is to ensure that bash is specified as the interpreter:

$ echo 'abcdefghijklm\nop
123456789' | ./bash-cat.sh
abcdefghijklm\nop
123456789

Conclusion

In summary, there are a few things to keep in mind when writing a shell script:

  • Although /bin/sh is often linked to /bin/bash, it is unsafe to assume that /bin/sh is a bash interpreter. On Debian it's /bin/dash, and many BSDs have a dedicated Bourne shell executable.
  • If you prefer using #!/bin/sh (there may be some speed advantages), you must ensure that your script is fully POSIX-compliant. For example, you might need to use printf instead of echo.
  • Much of the time, the options, escaping, and other features provided by bash make it a better choice for scripting. Make sure to specify the bash interpreter on such scripts.
  • When you do want to use bash, you should set the first line to be #!/usr/bin/env bash so that it can be found anywhere on the user's PATH. It's not at /bin/bash for all POSIX systems.
  • Most of the time that you are using read, you really want to use read -r. Without -r, read will strip out backslashes.

Happy scripting!

Foolproof HTML escaping in Javascript

For those who don't want to read a lengthy blog post:

Users can be malicious

As any good web developer knows, it's important to be constantly vigilant with the handling of user data. We avoid buffer overflows and format string exploits (remember those?) by using safer languages or being careful in our C. To avoid SQL injection, we never build database queries by concatenating user-supplied data. These measures protect the integrity of the data on our servers, but what about our (non-malicious) users?

These days, most new websites accept user input which will later be displayed to other users. This brings in a host of new issues, including XSS and CSRF attacks. The problem in both cases is that a user-supplied string might contain arbitrary HTML, including Javascript or carefully-crafted <img> tags, and if we mindlessly dump it to other users' browser, such scripts could hijack user identities or send sensitive data to an attacker. Browsers are starting to offer some defense to these attacks, but historically the onus has been on web developers to ensure that we recognize where we are sending user input to the browser and that we properly escape that data.

Escaping user input

Many developers have opinions on the proper place to escape user input. I'll review three options:

  1. Before storage
  2. On the backend, while building the HTML
  3. On the frontend, in the Javascript that builds HTML

Before storage

One approach is to escape data as soon as you receive it. This is widely used because it's one of the most foolproof methods: just filter the input once and forget about. The core problem with this approach is that in its usual implementation, it's essentially a one-way operation; the data that the user originally submitted is no longer retrievable.

You've probably noticed clumsy implementations on various discussion sites that allow editing: your edit window shows your text all mangled. All your < have been converted to &lt;. Furthermore, if don't change it back, the next edit will be &amp;lt;. This highlights another weakness of this method: it is susceptible to double-escaping, since there is often no bookkeeping to indicate that the data has been escaped already.

I prefer to maintain the data in its original form and handle it with care elsewhere.

On the backend

For years, the languages commonly used for web development have included libraries that properly handle HTML escaping. Good developers clearly indicate in the code and documentation where user-created data exists, and they use appropriate libraries to escape all such data as it is converted into an HTML page. Barring problems in the library implementation or lapses in vigilance, this is a solid approach, and it allows a lot more flexibility than the previously-discussed method. For example, it is now possible to echo the input as-is back to the creator for editing purposes.

This is a valid approach to the problem, but as the popularity of AJAX, JSON, and Javascript widget-based rendering continues to increase, the backend approach is often not an option. We have to perform the escaping on the frontend

On the frontend

A lot of new web development today centers on dynamically-created content. We build a basic HTML frame with some Javascript that pulls in everything else and places it on the page. A user action results in an AJAX request being sent to the server with a JSON response sent back that contains data to be placed on the page. As with HTML escaping, most web languages have excellent libraries for converting arbitrary objects to JSON. All characters that are sensitive in a Javascript context are properly escaped in the conversion to JSON. But what happens when it's received on the frontend?

Escaping in Javascript

Quite often, we're receiving user-created data as part of a JSON response, and eventually we have a string of that data assigned to a Javascript variable. Now we want to build the HTML using that string. Let's call it unsafe_str.

The unsafe way

This approach is vulnerable to every problem I outlined at the outset. Don't do this in your code! It's not much more difficult to do it in a safe way.

The safe way

This uses the browser's own knowledge of which characters are sensitive to properly escape the string. It's fast, and according to quirksmode, it's supported by every browser out there. But when we're concatenating strings or building some widget via a class hierarchy, this isn't always possible. Sometimes we need to escape the string way before we add it to a DOM node. Enter various hacks to make that happen.

Hack #1: inline

I see this all the time, and I've been guilty of similar methods on both the backend and the frontend. You know it's inefficient, but it's trivial and it works and it's basically a one-liner. Then you notice a random bug where you converted < to > by mistake, or maybe you forgot the pesky semicolon, so you decide to create a canonical escape function. Then it turns out that you sometimes need to escape part of an HTML tag attribute. You eventually settle on something like the following.

Hack #2: the catchall

This works pretty well. It handles the most important cases, but you know in the back of your mind that it's wasteful. You've traversed the string five times (creating five new strings!) just to return the escaped version, and you have &quot; characters all over the page where they aren't necessary. Your programming aesthetic takes over and one evening you convert it.

Hack #3: more efficient catchall

Now you're pretty happy. You only traverse the string once. You handle escaping both within and outside of attributes. But eventually you want to un-escape the strings you've escaped. In the process of writing that function, you learn that there are 252 named entities in HTML 4, in addition to literal entities like &#dddd; (decimal) and &xhhhh; (hex). Wow, you think, there must be a better way. And then you think back to "the safe way":

Can't we leverage that? We can!

The best way to escape HTML in Javascript

The browser already knows how to escape strings; the document.createTextNode method does it. We can take advantage of this to make string escaping fast, safe, and dead-simple. You can never beat the browser's builtin (often C++) code with your Javascript. So don't reinvent the wheel; let the browser solve the problem for you!

Acknowledgments

Many thanks to Big Dingus and ceefour who provided the inspiration for most of the code on this page.

Tips for Remote Unix Work (SSH, screen, and VNC)

I am not where the work is

If you are anything like me, you have programs running on all kinds of different servers. You probably have a github account, a free Heroku instance, a work desktop, a couple website instances, and maybe even a home server. The best part is that using common Unix tools, you can connect to all of them from one place.

In this post, I will review some of the more interesting aspects of my workflow, covering the usage of SSH, screen, and VNC, including a guide for getting started with VNC. I’ll give some quick start information and quickly progress to advanced topics (like SSH pipes and auto-session-creation) that even experienced Unix users may not be aware of.

SSH to rule them all

By now you’ve almost certainly used SSH. It’s the easiest way to login to a remote machine and get instant command line access. It’s as easy as ssh user@example.com. You type in your password, and you’re in! But you might not know that it can be even easier (and more secure) than that.

Logging in via SSH without a password

We have only recently seen websites start to offer solutions for logging in without a password. SSH has provided a secure mechanism for this (based on public-key cryptography) since its inception. It’s pretty easy to setup once you know how it works.

1. Generate a public-private key pair

If you haven’t already, run ssh-keygen on your laptop, or whatever computer you will be doing your work from. You can just continue pressing Enter to accept the defaults, and you can leave the password blank (if you secure your laptop with encryption, a locking screensaver, and a strong password, your SSH key doesn’t require a password). This will generate a public key at ~/.ssh/id_rsa.pub and a private key at ~/.ssh/id_rsa. The private key should never leave your computer.

2. Copy the public key to each computer you connect to

For each computer that you connect to, run the following command (edited thanks to a suggestion by rmc in the Hacker News comment thread):

ssh-copy-id user@example.com

Note that you can specify -p PORT or any other SSH arguments before the user@example.com portion of the above command

This should be the last time you ever have to type your login password when connecting to the remote server. From now on, when you SSH to the remote server, its sshd service will encrypt some data using the public key that you appended to authorized_keys, and your local machine will be able to decode that challenge with your private key.

3. There is no step 3

It’s that easy! Don’t you wish you had set this up a long time ago?

SSH and pipes

If you take a look at the ssh-copy-id script, you'll see a line that roughly translates to:

cat ~/.ssh/id_rsa.pub | ssh user@example.com "umask 077; test -d ~/.ssh || mkdir ~/.ssh ; cat >> ~/.ssh/authorized_keys"

When you ran ssh-copy-id above, here's what that line did:

  1. The contents of ~/.ssh/id_rsa.pub were piped into the SSH command
  2. SSH encrypted that data and sent it across the network to your remote machine
  3. Everything in double quotes after the host is a single argument to ssh; this specified that instead of giving you an interactive login, you instead wanted to run a command.
  4. The first portion of that command (umask 077; test -d ~/.ssh || mkdir ~/.ssh ;) created a .ssh directory on the remote machine if it did not already exist, ensuring that it had the proper permissions.
  5. The second portion (cat >> ~/.ssh/authorized_keys) received the standard input via the SSH tunnel and appended it to the authorized_keys file on the remote machine.

This avoids the need to use SCP and login multiple times. SSH can do it all! Here are some more examples to show you some of the neat things you can do with SSH pipe functionality.

Send the files at ~/src/ to example.com:~/src/ without rsync or scp

cd && tar czv src | ssh example.com 'tar xz'

Copy the remote website at example.com:public_html/example.com to ~/backup/example.com

mkdir -p ~/backup/
cd !$
ssh example.com 'cd public_html && tar cz example.com' | tar xzv

See if httpd is running on example.com

ssh example.com 'ps ax | grep [h]ttpd'

Other SSH tunnels

If piped data were the only thing that could be securely tunneled over SSH connections, that would still be useful. But SSH can also make remote ports seem local. Let’s say that you’re logged into example.com, and you’re editing a remote website that you’d like to test on port 8000. But you don’t want just anyone to be able to connect to example.com:8000, and besides, your firewall won’t allow it. What if you could get a connection to example.com, localhost:8000, but from your local computer and browser? Well, you can!

Create an SSH tunnel

ssh -NT -L 9000:localhost:8000 example.com

Using the -L flag, you can tell SSH to listen on a local port (9000), and to reroute all data sent and received on that port to example.com:8000. To any process listening on example.com:8000, it will look like it’s talking to a local process (and it is; an SSH process). So open a terminal and run the above command, and then fire up your browser locally and browse to localhost:9000. You will be whisked away to example.com:8000 as if you were working on it locally!

Let me clarify the argument to -L a bit more. The bit before the colon is the port on your local machine that you will connect to in order to be tunneled to the remote port. The part after the second colon is the port on the remote machine. The “localhost” bit is the remote machine you will be connected to, from the perspective of example.com. When you realize the ramifications of this, it becomes even more exciting! Perhaps you have a work computer to which you have SSH access, and you have a company intranet site at 192.168.10.10. Obviously, you can’t reach this from the outside. Using an SSH tunnel, however, you can!

ssh -NT -L 8080:192.168.10.10:80 work-account@work-computer.com

Now browse to localhost:8080 from your local machine, and smile as you can access your company intranet from home with your laptop’s browser, just as if you were on your work computer.

But my connection sucks, or, GNU screen

Have you ever started a long-running command, checked in on it periodically for a couple hours, and then watched horrified as your connection dropped and all the work was lost? Don’t let it happen again. Install GNU screen on your remote machine, and when you reconnect you can resume your work right where you left off (it may have even completed while you were away).

Now, instead of launching right into your work when you connect to your remote machine, first start up a screen session by running screen. From now on, all the work you are doing is going on inside screen. If your connection drops, you will be detached from the screen session, but it will continue running on the remote machine. You can reattach to it when you log back in by running screen -r. If you want to manually detach from the session but leave it running, type Ctrl-a, d from within the screen session.

Using screen

Screen is a complex program, and going into everything it can do would be a series of blog posts. Instead, check out this great screen quick reference guide. Some of screen’s more notable features are its ability to allow multiple terminal buffers in a single screen session and its scrollback buffer.

What happened to Control-a??

Screen intercepts Control-a to enable some pretty cool functionality. Unfortunately, you may be used to using Control-a for readline navigation. You can now do this by pressing Ctrl-a, a. Alternatively, you can remap it by invoking screen with the -e option. For example, running screen -e ^jj would cause Control-j to be intercepted by screen instead of Control-a. If you do this, just replace references to “C-a” in the aforementioned reference guide with whatever escape key you defined.

Shift-PageUp is broken

Like vim and less, screen uses the terminal window differently from most programs, controlling the entire window instead of just dumping text to standard output and standard error. Unfortunately, this breaks Shift-PageUp and Shift-PageDown in gnome-terminal. Fortunately, we can fix this by creating a ~/.screenrc file with the following line in it:

termcapinfo xterm ti@:te@

And while you’re mucking around in .screenrc, you might as well add an escape ^jj line to it, so that you can stop typing in -e ^jj every time you invoke screen.

Starting screen automatically

It’s pretty easy to forget to run screen after logging in. Personally, any time I am using SSH to login and work interactively, I want to be in a screen session. We can combine SSH’s ability to run a remote command upon login with screen’s ability to reconnect to detached sessions. Simply create an alias in your ~/.bashrc file:

alias sshwork='ssh -t work-username@my-work-computer.com "screen -dR"'

This will automatically fire up a screen session if there is not one running, and if there is one running, it will connect to it. Detaching from the screen session will also logout of the remote server.

Remote graphical work

Even in spite of SSH’s port forwarding capabilities, we still sometimes need to use graphical applications. If you have a fast connection or a simple GUI, passing the -Y flag to SSH could be enough to allow you run the application on your local desktop. Unfortunately, this often is a very poor user experience, and it does not work well with screen (a GUI application started in a screen session dies when you detach from the screen session).

The time-tested Unix solution to this problem is VNC. This is effectively a combination of screen and a graphical environment. Unfortunately, it has several drawbacks.

  • It can be tricky to setup reasonably
  • It is inherently insecure, with unencrypted data and a weak password feature
  • Its performance on a sub-optimal connection is less-than-stellar
  • It doesn’t transfer sounds over the network

I’m going to help you solve all of these problems, except the sound one. Who needs sounds, anyway?

VNC installation and setup

On the remote machine, you’ll need to install a VNC server and a decent lightweight window manager. I chose fluxbox and x11vnc:

sudo apt-get install x11vnc fluxbox

The programs that are started when you first start a VNC session are controlled by the ~/.vnc/xstartup file. I prefer something a bit better than the defaults, so mine looks like this:

#!/bin/sh
[ -x /etc/vnc/xstartup ] && exec /etc/vnc/xstartup
[ -r $HOME/.Xresources ] && xrdb $HOME/.Xresources
netbeans &
gnome-terminal &
fluxbox &

Modify this to suit your own needs; I only invoke netbeans because it’s the only reason I ever use a remote GUI at all. NB: Although it may seem counterintuitive, it’s typically best to put the window manager command last.

You can start a VNC server with the following command (this isn’t the way you should normally do it! Read on…):

vncserver -geometry WIDTHxHEIGHT

where WIDTHxHEIGHT is your desired resolution. For me, it’s 1440x900. The first time you run this, it will ask you to create a password. We are going to ensure security through other means, so you can set it to whatever you want. Running the above command will give a message like “New ‘remote-machine:1 (username)’ desktop is remote-machine:1”. The “:1” is the display number. By adding 5900 to this, we can determine which port the VNC server is listening on. At this point, we can connect to remote-machine:5901 with a vncviewer and log in to the session we’ve created. We don’t want the entire Internet to be able to connect to our poorly-secured session, so let’s terminate that VNC server session:

vncserver -kill :1

Securing the VNC server

Remember how we tunneled ports with SSH? We can do the same thing with VNC data. First, we’ll invoke our VNC server slightly differently:

vncserver -localhost -geometry WIDTHxHEIGHT -SecurityTypes None

This causes the VNC server to only accept connections that originate on the local machine. It also indicates that we will not need a password to connect to our session; simply being logged in locally as the user who created the session is enough. You should now have a VNC server running on a remote machine listening on localhost:5901.

On your local machine, install a VNC viewer. I personally use gvncviewer, though I don’t particularly recommend it. Now, to connect to that remote port, you’ll need to start an SSH tunnel on your local machine:

ssh -NT -L 5901:localhost:5901 remote-machine.com

We can now run the VNC viewer on our local machine to connect via the tunnel to our VNC session:

gvncviewer :1

Speeding up VNC?

When starting an SSH tunnel, we can compress the data it sends by including the -C flag. Depending on your connection speed, it may be worth including the flag in your tunnel command. Experiment with this option and see what works best for you.

If you are really having problems, you might also want to check out the -deferUpdate option, which can delay how often display changes are sent to the client. For more information, man Xvnc.

Automatically starting and connecting to your VNC session

Putting everything together, we can create a script that does all of this for us. Simply set the GEOMETRY and SSH_ARGS variables appropriately (or modify it slightly to accept them as command line arguments).

#!/bin/bash
set -e

GEOMETRY=1440x900
SSH_ARGS='-p 22 username@remote-server.com'

# Get VNC display number. If there is not a VNC process running, start one
vnc_display="$(ssh $SSH_ARGS 'ps_text="$(ps x | grep X[v]nc | awk '"'"'{ print $6 }'"'"' | sed s/://)"; if [ "$ps_text" = "" ]; then vncserver -localhost -geometry '$GEOMETRY' -SecurityTypes none 2>&1 | grep New | sed '"'"'s/^.*:\([^:]*\)$/\1/'"'"'; else echo "$ps_text"; fi')"
port=`expr 5900 + $vnc_display`
ssh -NTC -L $port:localhost:$port $SSH_ARGS &
SSH_CMD=`echo $!`
sleep 3
gvncviewer :$vnc_display
kill $SSH_CMD

The vnc_display line is pretty gross, so I’ll give some explanation. It uses SSH to connect to the remote server and look for a running process named Xvnc: this is the running VNC server. If there’s one running we extract the display number. Otherwise, we start one up with the specified geometry and grab the display number from there. This all happens within a single command executed by ssh, and the resulting output is piped across the network back into our vnc_display variable.

Either way we get the value, we now know which port to connect to in order to reach our VNC server. We start our SSH tunnel and get the resulting PID. Finally, we invoke the vncviewer on that tunneled local port. When the VNC viewer exits, we automatically kill our SSH tunnel as well.

Concluding remarks

One of the best parts of Unix is that it was built to be run remotely from Day 1. Just about anything you can do on your local computer can also be done on a remote one. By leveraging tools like SSH, screen, and VNC, we can make remote work as easy and convenient as local work. I hope this post gave you some ideas for how you can create a productive workflow with these very common Unix tools.

Oneiric Upgrade Breaks Readline (Alt-B and Alt-F) in gnome-terminal

Ubuntu Oneiric Ocelot (11.10) pulls in gnome-terminal 3.0

As with all distribution upgrades, the recently-released Oneiric Ocelot (Ubuntu 11.10) pulls in a slew of package upgrades. One of those upgrades was gnome-terminal, which was bumped from version 2.32.1 to 3.0.1. Unfortunately, this upgrade introduced a major problem for someone used to using Alt+B and Alt+F to move around the command line (for moving one word backward and forward, respectively).

By default, Alt-F opens the File menu. This behavior catches me off-guard every time I begin using the terminal in a freshly-installed environment. The problem was easily fixed, however, by deselecting ViewShow Menubar. This change can be made permanent by going to EditProfile Preferences and deselecting Show menubar by default in new terminals.

The latest gnome-terminal introduces Unity awareness

With the upgrade to Ubuntu 11.10, however, this change is no longer enough. Even with the option set as laid out above, gnome-terminal swallows Alt-B and Alt-F (though it does still allow Alt-Backspace for deleting the previous word). If you are used to using shortcuts like this, you will understand the frustration of having your workflow (and thought process) interrupted when a productive habit no longer works.

The fix

Go to EditKeyboard Shortcuts and deselect Enable menu access keys (such as Alt+F to open the File menu). Breathe easy as your command-line effectiveness returns to you.

Thanks go to Bryan Murdock whose old blog post pointed me in the right direction.

Random Teams: Exploring Statistics with IPython

The problem

My company, EasyESI shares office space with another company, Magoosh. A few times a day, we like to take a break from work to get in a game of foosball. A Magoosh employee even implemented a Heroku app for recording the games and tracking each player’s record.

The hardest part of getting a game going is probably selecting teams. We typically try to split up the best players, but that doesn’t always solve the problem, and besides, they should occasionally get to play together (after all, you can’t have dramatic underdog victories without a few powerhouse teams). The obvious solution is to randomly select teams, but it seems silly to use a program for that. Instead I came up with a solution that uses coinflips to create two teams of two players each.

Flip three coins to create two-on-two teams

The solution I came up with involves three coins; one person doesn’t have one. Each remaining player flips a coin. If they all come up heads or tails, there is a re-flip. Otherwise, there will be either two heads or two tails. Those two players are on a team, and the remaining players (the one who didn’t flip and the one whose flip was unmatched) are on a team, and we can get on with our 2-on-2 match.

Intuitively, it seems like each coin flipper is equally likely to be on the team of the player who didn’t flip a coin. But do the teams actually work out evenly? Let’s use Python interactively to investigate.

IPython

If you program in Python and you don’t have IPython installed, your programming is about to get a lot more enjoyable. IPython automatically indents your code as you’re entering it. It keeps logically-grouped lines (e.g., loops, function definitions) together in history, so modifying and re-entering them is much easier. You can move around the filesystem by using shell commands like cd, ls, and pwd. Forget the name of that string method you need? Just type help str to see its documentation. And best of all, it tab-completes everything: variable names, module names, object methods, file names, and pretty much anything else that could possibly be tab completed. I’ll wait as you install it.

Once you have that installed, check out the ipythonrc file (~/.ipython/ipythonrc on Unix systems). It has several options you may be interested in, but the one that annoys me most is the confirm_exit 1 configuration. Change that to 0 so that pressing Ctrl-D will exit without a confirmation, and let’s get on with this.

A huge strength of a language interpreter is the ability to prototype with it. You can answer questions interactively, defining and refining the problem as you go. This approach was pioneered by Lisp, is a huge part of what makes Unix powerful (the Bash shell is effectively a language interpreter where any executable is a library function), and is an integral part of programming in languages like Ruby and Python.

We’ll get started by firing up IPython and defining a couple functions.

Flipping three coins

from random import randint
def triple_coin_flip ():
    return [randint(0, 1) for i in range(3)]

The randint call above will randomly return either 0 or 1 with equal probability. It is within a Python structure called a list comprehension, a concise method for producing a list. Instead of saying:

list = []
for i in range(3):
    list.append(randint(0, 1))

we can instead achieve equivalent functionality with a list comprehension:

[randint(0, 1) for i in range(3)]

So this function simply returns a list of three integers, where the value of each is either 0 or 1.

Let’s say 0 represents heads and 1 represents tails (though it doesn’t actually matter which one is which). An interesting result of our representation is that four possible outcomes sum to four separate values:

  • 0: Three heads
  • 1: Two heads and one tails
  • 2: Two tails and one heads
  • 3: Three tails

These properties are leveraged in the next function.

Using the coinflips to randomly select a teammate

def get_teammate ():
    while 1:
        flip = triple_coin_flip()
        total = sum(flip)
        if 0 < total < 3:
            break
    if total == 1:
        return flip.index(1) + 2
    else:
        return flip.index(0) + 2

This function performs two separate duties. It uses an infinite (while 1) loop to continue flipping coins until it gets a heterogeneous result. At this point, the loop exits. Its other purpose is implemented in the final ifelse statement; it returns the number of the player (in the range 2-4) that flipped the unique value. That is, when the total is 1, we have a single tails flip, and we return that player by getting the index of the 1 in the three-item sequence. When the total is 2, we have a single heads value, so we get the index of the 0. We add 2 to both results so that our number falls in the range [2, 4] instead of [0, 2].

The idea is to simulate our coin flips, using the return value of 2, 3, or 4 to indicate which player flipped a unique value. That player will be on the team of player 1 (the one who did not flip a coin). Now we want to test whether the result is a uniform distribution of teams.

Determining the distribution of the teams

from collections import defaultdict
team_counts = defaultdict(int)
for i in xrange(1000000):
    t1_mate = get_teammate()
    t1 = (1, t1_mate)
    t2 = [2, 3, 4]
    t2.remove(t1_mate)
    team_counts[t1] += 1
    team_counts[tuple(t2)] += 1

This loop simulates 1,000,000 team selections. Each iteration creates two teams, t1 (player 1 and the randomly-selected teammate) and t2 (the other two players). It then keeps track of how many times each team was together by keeping a count in a dict using the team as the key.

defaultdict

Python’s defaultdict is an extremely useful data structure for writing elegant code. Usually when you try to access a key that is not in a dict, you get a KeyError. With a defaultdict, however, it creates a default element associated with the missing key by calling a function. That function is specified when you create the defaultdict, as in the defaultdict(int) code above. Since calling int() in Python simply gives us 0, we have created a dict that can be used as a counter. Using a regular dict, those team_counts... lines would have instead been:

team_counts[t1] = team_counts.get(t1, 0) + 1
team_counts[tuple(t2)] = team_counts.get(tuple(t2), 0) + 1

tuples, lists, and dict keys

A second item worth addressing is the distinction between a Python tuple and a Python list. A tuple is a sequence of items that is immutable; that is, once it is created, items cannot be added to it, removed from it, or modified within it. A list is a mutable sequence. A Python dict relies on an element being hashable in order to access it quickly within the dictionary. One aspect of hash functions is that they should produce the same value for the same object; this is not possible with a mutable object. Thus, Python tuples are hashable and Python lists are not. Thus, Python tuples can be used as keys in a dict and Python lists cannot. Let’s take another look at the relevant lines:

t1 = (1, t1_mate)
t2 = [2, 3, 4]
t2.remove(t1_mate)
team_counts[t1] += 1
team_counts[tuple(t2)] += 1

Here we create t1 as a tuple containing 1 and t1_mate; t1 cannot be modified now. We create t2 as a list containing 2, 3, and 4; we want to remove the number of the player who is on player 1’s team, so we cannot initially use a tuple. Executing t2.remove(t1_mate) means that t2 now contains two elements, the players who are not on Team 1 (i.e., Team 2). When using them as dict keys, t1 is already a tuple so we can use it directly. Since t2 is a list, we have to first convert it to a tuple in order to use it as a dict key.

Inspecting the distribution of teams

We now have a structure representing the makeup of all 2,000,000 teams:

In [42]: team_counts
Out[42]: defaultdict(, {(1, 2): 332986, (1, 3): 332808, (1, 4): 334206, (2, 3): 334206, (3, 4): 332986, (2, 4): 332808})

Those distributions look about even, but let’s make sure:

total = sum(team_counts.itervalues())
for team, occurrences in team_counts.iteritems():
    print "Team %s: together %.1f%% of the time" % (team, occurrences * 100.0 / total)

# which output:
Team (1, 2): together 16.6% of the time
Team (1, 3): together 16.6% of the time
Team (1, 4): together 16.7% of the time
Team (2, 3): together 16.7% of the time
Team (3, 4): together 16.6% of the time
Team (2, 4): together 16.6% of the time

That looks pretty even to me! It seems my method for randomly selecting 2-on-2 teams is an accurate one.

itervalues and iteritems

A dict is a structure that associates a key with a value. When we want to iterate over that dict, it is not immediately clear whether we would like to iterate over the keys, the items, or both. By default (when we say something like for x in some_dict: ...), Python iterates over the keys of a dict. Using itervalues, we can instead visit each value in the dict one-by-one. Since sum accepts an iterable as an argument, the sum(team_counts.itervalues()) expression will sum up the total number of teams produced.

The latter method, iteritems, is one that I find myself using with some regularity. It allows us to iterate simultaneously over the keys and associated values of the dict. Each element of the iteration is a tuple of (key, value) for each item in the dict. Our loop head, for team, occurrences in team_counts.iteritems(), unpacks that tuple into the variables team and occurrences. So inside the loop body, team is a tuple representing a team (remember our previous discussion of lists vs. tuples?), and occurrences is the number of times that team was together in the 1,000,000 iterations of team creation.

How long will it take? or

What about the infinite loop in get_teammate?

In our get_teammate function, we relied on an infinite loop to repeatedly flip coins until we got a result that was neither all heads nor all tails:

while 1:
    flip = triple_coin_flip()
    total = sum(flip)
    if 0 < total < 3:
        break

We have no guarantee that our infinite while loop will terminate. Theoretically, we could have an unending sequence of [0, 0, 0] and [1, 1, 1]. On average, how many times should we expect the loop to run?

There are eight possibilities when performing three random flips:

000
001
010
011
100
101
110
111

Two of those result in another coinflip, so we are reflipping ¼ of the time. Thus, our number of flips is a probabilistic value which can be represented by the following infinite summation:

1 * 3/4 + 2 * 1/4 * 3/4 + 3 * 1/4 * 1/4 * 3/4 + ...

That is, we have ¾ probability of one flip, ¼ * ¾ probability of two flips, and so on. Let’s find a pattern:

1 * 3 / 4^1 + 2 * 3 / 4^2 + 3 * 3 / 4^3 + ...

There are formulas for calculating these sorts of infinite sums, but let’s just use Python:

flips = 0
for i in xrange(100):
    flips += i * 3.0 / 4.0 ** i
flips
# The result is 1.33333333333

It looks like we have converged on 4/3, meaning that on average it will take 1-1/3 coinflips (where a coinflip is three players each flipping a coin) to determine teams. There is little danger of our loop running for very long, much less infinitely.

Review

  • Install and run ipython instead of the standard python interpreter. it has much better support for interactive, incremental development.
  • Use list comprehensions and defaultdict to make code more elegant.
  • A list is mutable (modifiable) and a tuple is not. Only a tuple can be used as a dict key.
  • There are multiple ways to iterate over a dict: use iterkeys (the default), itervalues, and iteritems based on which best suits your needs.
  • It’s often possible to use a statistically significant number of iterations in an interactive programming environment to answer questions that would otherwise require analysis using mathematical or statistical theory.
  • If you want to randomly select two teams from four players, have three of them flip a coin. When exactly two players flip the same value, put them on a team together.

Unix Home Field Advantage - vim and awk

Baseball and Unix

I've been a baseball fan my entire life. Back in 1995, I saved up my money and bought the Strat-O-Matic player cards for 10 different teams. For years I rolled those dice and battled wits with my stepdad, my friend Stevie down the street, and anyone else I could rope into playing with me. I read Bill James and bought into sabermetrics as a middle schooler. A half-dozen years later I found CSFBL and tested my managerial skills against other players.

What does this have to do with Unix? Tonight I watched the Tigers beat the Yankees in a close game 5. I didn't follow much baseball this year, so I checked out the final standings to see how the season wrapped up. While looking at the standings, I noticed that the home team seemed to win a disproportionately high number of games, and I was curious exactly what the percentage was. Unix to the rescue!

Sometimes copy-paste and vim is the easiest way

I usually use the Python Beautiful Soup library for extracting the desired content from webpages. In this case, unfortunately, the data I wanted was not part of a static webpage, so I instead extracted it the easiest way possible: copy-paste.

Copy-paste-standings
I pasted it into a vim buffer as follows (anything after ';' is a comment):

:set paste<RET>  ; set to paste verbatim mode (to avoid auto-formatting)
i<C-V><ESC>      ; Insert, paste (using Ctrl-Shift-V in gnome-terminal)

The result was several columns, and I was only concerned with the Home and Road records. With vim, it's easy enough to isolate these two columns using macro editing.

gg    ; return to the beginning of the file
qq    ; begin recording into the 'q' buffer
$     ; move to the end of the line
5dB   ; delete back to the beginning of the (space-delimited) word 5x
4x    ; delete 4 characters
BB    ; go back two (space-delimited) words
d^    ; delete from cursor to the beginning of the line
jq    ; proceed to the following line and end macro recording
@q    ; repeat 'q' macro once (confirm it did what you expected)
99@@  ; repeat previously executed macro to the end of the file

Now we just need to clean up the file by removing the non-record lines (e.g., heading lines like "WCGB L10") using a simple dd on each one. Once we've done that, we can save the file (:w records). The result is a file that looks like this:

52-29   45-36
47-34   44-37
45-36   45-36
42-39   39-42
39-42   30-51
50-31   45-36
44-37   36-45
36-45   43-38
40-41   31-50
33-48   30-51
52-29   44-37
45-36   41-40
43-38   31-50
39-45   28-50
52-29   50-31
47-34   42-39
44-36   36-45
34-47   43-38
31-47   41-43
57-24   39-42
45-36   45-36
42-39   37-44
36-45   36-45
39-42   32-49
31-50   25-56
51-30   43-38
46-35   40-41
42-39   40-40
38-43   35-46
35-46   36-45

sed and awk finish the job

But how to determine the home win percentage? Time to whip out Python? I elected to use awk instead:

sed 's/-/ /g' records |
  awk '{hw += $1; hl += $2; aw += $3; al += $4}
    END{printf "HOME\n  Wins: %i\n  Losses: %i\n  Win %%: %.3f\nAWAY\n  Wins: %i\n  Losses: %i\n", hw, hl, hw*1.0/(hw+hl), aw, al}'

And the resulting output:

HOME
  Wins: 1277
  Losses: 1152
  Win %: 0.526
AWAY
  Wins: 1152
  Losses: 1277

We can do a quick sanity check by comparing Home W-L to Away W-L. As expected, their values are complements (for every home win, there is an away loss, and vice versa). And here we get the answer we were looking for: the home winning percentage for all Major League teams in 2011 was .526.

The sed command is pretty basic: it simply replaces the '-' in the W-L records with a space. The 'g' flag indicates that we want to perform all replacements, instead of the default one per line. We use sed in this way because awk splits on whitespace sequences by default, so this step ensures that awk recognizes four fields in each row: Home Wins, Home Losses, Away Wins, and Away Losses.

The awk command is where it gets fun. The first thing to note is that the first {...} portion is executed once for each line in the file. Each $ variable represents a field in the row, so $1 is Home Wins, $2 is Home Losses, and so on. I'm totaling these values by accumulating them in the variables hw, hl, aw, and al. Variables in awk are automatically initialized to 0.

The END command separates line-by-line processing from post-processing. Thus, the second {...} portion is executed after the entire file has been processed. Here we use printf in exactly the same way that it is used in C.

Conclusion

And that's that! With just a few dozen keystrokes in vim, it was possible to transform a copy-paste mess into a couple consistent columns. Using sed for preprocessing, I passed the result to an awk script that calculated and reported the data I was looking for.

This is where Unix really shines. Textual data is only a few minutes away from being useful information. I love using Unix tools for one-off tasks; often they later transform into more general, longer-lived bash scripts. Once a script reaches a certain size, however, it usually makes sense to either break it up or translate it into a more general programming language like Python.

In my next part, I'll show how we can use Beautiful Soup and Python to calculate the same information over a longer time period, and we'll get an idea of just how much of an advantage it is to play at home.