Erlang hot code swapping

February 16, 2010

Categories: erlang, hot code swapping

Hot code swapping is an attractive feature of Erlang. It's what allows those fancy Ericsson telecom systems to achieve 99.9999999% reliability. Implementing it in your own code is simple. All of your favorite OTP behaviors take care of the heavy lifting for you. You're responsible only for writing a function to update the state of your stateful processes and the built-in Erlang release handler takes care of the rest.


Here we have a very basic gen_server module that stores an integer counter as its state. It provides a function, print_state/0, that prints the current state to standard output. As the session below shows, the counter is incremented each time it is printed.
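The module listing itself is not reproduced here, so the following is only a minimal sketch of what such a gen_server could look like. It assumes the process is registered locally under the module name t and that the counter is incremented on each print_state/0 call, which matches the shell output shown below; everything else is an illustrative assumption.

```erlang
-module(t).
-behaviour(gen_server).

%% API
-export([start_link/0, print_state/0]).
%% gen_server callbacks
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

%% Register the server locally under the module name.
start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

print_state() ->
    gen_server:call(?MODULE, print_state).

%% The state is a plain integer counter.
init([]) ->
    {ok, 0}.

%% Print the current counter, then increment it (assumed behaviour).
handle_call(print_state, _From, State) ->
    io:format("state: ~p~n", [State]),
    {reply, ok, State + 1}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.

%% Nothing to convert in this first version.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```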

After compiling the module, we start the gen_server process and print its state a few times.

jvorreuter$ erl
Erlang R13B01 (erts-5.7.2) [source] [smp:2:2] [rq:2] [async-threads:0] [kernel-poll:false]

Eshell V5.7.2  (abort with ^G)
1> c(t).
2> t:start_link().
3> t:print_state().
state: 0
4> t:print_state().
state: 1

Now we decide to change the state counter to a float. We make the following changes, updating where the state is initialized and also editing the code_change function to convert an integer to a float.
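The changed listing is also missing from this page. A sketch of the updated module, under the same assumptions as before, might look like this: init/1 now starts from a float, and code_change/3 converts an integer state left over from the old version.

```erlang
-module(t).
-behaviour(gen_server).

-export([start_link/0, print_state/0]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2,
         terminate/2, code_change/3]).

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

print_state() ->
    gen_server:call(?MODULE, print_state).

%% New servers start with a float counter.
init([]) ->
    {ok, 0.0}.

handle_call(print_state, _From, State) ->
    io:format("state: ~p~n", [State]),
    {reply, ok, State + 1}.

handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.

%% Convert the integer state of the old version to a float.
code_change(_OldVsn, State, _Extra) when is_integer(State) ->
    {ok, float(State)};
%% Already a float (or something else): leave it untouched.
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```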

In the same console that we were working in before we can recompile and load the new version of our gen_server module.

5> sys:suspend(t).
6> c(t).
7> l(t).
8> sys:change_code(t,t,[],[]).
9> sys:resume(t).
10> t:print_state().           
state: 3.0
11> t:print_state().
state: 4.0

We use the sys module to suspend the registered process for our gen_server. While a process is suspended, it responds only to system messages. Next, the change_code/4 function is called:

change_code(Name, Module, OldVsn, Extra) -> ok | {error, Reason}

    Name = pid() | atom() | {global, atom()}
    Module = atom()
    OldVsn = undefined | term()
    Extra = term()

Since the process whose state we are changing is registered locally under its module name, we use the same atom for the first two arguments. Also, in our simple example we are not specifying an old version or an extra term. The OldVsn argument is useful if your module has been deployed to multiple environments and may be running more than one old version. You can then pattern match on the version that is currently running to generate the new state for your process. The documentation for the sys module says that "The Extra argument is reserved for each process to use as its own." Basically, you're free to do whatever you want with this value or just ignore it.
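As a rough illustration of matching on OldVsn, here is a standalone sketch. The module name t_vsn, the version strings "0.1" and "0.2", and the {counter, N} state shape are all hypothetical; with sys:change_code/4 the OldVsn value is whatever you pass in, while the release handler derives it from the old module's vsn attribute.

```erlang
-module(t_vsn).
-export([code_change/3]).

%% Hypothetical version "0.1" stored the counter as a bare integer.
code_change("0.1", State, _Extra) when is_integer(State) ->
    {ok, float(State)};
%% Hypothetical version "0.2" wrapped the counter in a tagged tuple.
code_change("0.2", {counter, N}, _Extra) when is_integer(N) ->
    {ok, float(N)};
%% Downgrades arrive as {down, Vsn}; go back to an integer counter.
code_change({down, _Vsn}, State, _Extra) when is_float(State) ->
    {ok, trunc(State)};
%% Unknown versions: keep the state as it is.
code_change(_OldVsn, State, _Extra) ->
    {ok, State}.
```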

Finally, we resume the suspended process. The print_state/0 function shows that the state has been updated to a float.
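The manual steps can also be wrapped in a small helper. This is only a sketch: the module name hot_swap and the function upgrade/2 are made up for illustration, and it assumes the new .beam file has already been compiled into the code path. The try ... after makes sure the process is resumed even if loading or the state conversion fails.

```erlang
-module(hot_swap).
-export([upgrade/2]).

%% Name:   pid or registered name of the process to upgrade.
%% Module: callback module whose new version should be loaded.
upgrade(Name, Module) ->
    ok = sys:suspend(Name),
    try
        %% Drop any leftover old version, then load the new code.
        code:purge(Module),
        {module, Module} = code:load_file(Module),
        %% Let the process convert its state via its code change callback.
        ok = sys:change_code(Name, Module, undefined, [])
    after
        ok = sys:resume(Name)
    end.
```

With the first example running, calling hot_swap:upgrade(t, t) would stand in for the suspend, load, change_code and resume commands typed by hand.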

A note on triggering code changes

In the example above we are forcing a code change manually from the shell. It's typically cleaner to let the release handler take care of code swapping. To do this you need to generate a release package containing an appup file. The upgrade and downgrade instruction lists of an example appup file look like this:

  [{"0.1", [{update, t, {advanced,[]}}]}], 
  [{"0.1", [{update, t, {advanced,[]}}]}]

The complete appup tuple is of the following form, where Vsn is the version being upgraded to:

  {Vsn,
   [{UpFromVsn, Instructions}, ...],
   [{DownToVsn, Instructions}, ...]}.

The update instruction specifies the module to update and an optional change parameter, which in this case is {advanced, Extra}. Extra is passed to the code_change function as its last argument. If you leave out the advanced tuple, the module code is still reloaded, but code_change is not called and the process state is not converted.
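Put together, a complete appup file for this example might look like the sketch below. The new version "0.2" is an assumption made purely for illustration; only "0.1" and the update instruction come from the example above.

```erlang
%% Hypothetical appup file; "0.2" is an assumed new version number.
{"0.2",
 %% Upgrade from "0.1": reload t and run its code_change with Extra = [].
 [{"0.1", [{update, t, {advanced, []}}]}],
 %% Downgrade to "0.1": same instruction, applied in the other direction.
 [{"0.1", [{update, t, {advanced, []}}]}]}.
```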

Digging a little deeper

In the next example we'll implement functionality similar to that of the gen_server in our first example, but instead of using the gen_server behavior's callback functions we will add code that responds directly to the underlying system messages that the gen_server behavior abstracts away.

Our new server is started with proc_lib:start_link/3, which in turn calls the init function and starts a loop with an integer state. The print_state/0 function makes use of gen:call/3 to send a message to the registered server process. The server replies with gen:reply/2 and then calls the loop function again with an updated state. The interesting part of this module is the system message that the loop handles in its receive block. The sys documentation has this to say about handling system messages: "The content and meaning of this message are not interpreted by the receiving process module. When a system message has been received, the function sys:handle_system_msg/6 is called in order to handle the request."

Once a process is set up to receive system messages, it must also provide a callback module that exports the system_continue/3, system_code_change/4 and system_terminate/4 functions. One thing to notice in our module is that after passing the system message data to sys:handle_system_msg/6 we do not call loop/1 again. Instead, sys:handle_system_msg/6 calls the callback function system_continue/3 when the process should continue. The current state variable is passed to sys:handle_system_msg/6 and then on to system_continue/3. This allows system_continue/3 to re-enter the server's loop with the same state it had when it received the system message.

The system_code_change/4 function is almost identical to the gen_server callback function code_change/3; the main difference is the additional Module parameter. Lastly, the system_terminate/4 function is responsible for any cleanup that needs to occur before the process exits.
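The module described above is not included on this page, so here is a hedged reconstruction. The message label '$print' is made up, and the loop carries Parent and Debug explicitly, so it is loop/3 rather than the loop/1 mentioned in the text. Note that gen is an internal, undocumented OTP module; it is used here only because the text refers to it.

```erlang
-module(p).
-export([start_link/0, print_state/0, init/1]).
%% Callbacks required by sys:handle_system_msg/6
-export([system_continue/3, system_terminate/4, system_code_change/4]).

start_link() ->
    proc_lib:start_link(?MODULE, init, [self()]).

print_state() ->
    %% '$print' is a hypothetical label chosen for this sketch.
    {ok, Reply} = gen:call(?MODULE, '$print', print_state),
    Reply.

init(Parent) ->
    register(?MODULE, self()),
    Debug = sys:debug_options([]),
    %% Tell proc_lib:start_link/3 that start-up succeeded.
    proc_lib:init_ack(Parent, {ok, self()}),
    loop(Parent, Debug, 0).

loop(Parent, Debug, State) ->
    receive
        {'$print', From, print_state} ->
            io:format("state: ~p~n", [State]),
            gen:reply(From, ok),
            loop(Parent, Debug, State + 1);
        {system, From, Request} ->
            %% Does not return; re-entry happens via system_continue/3.
            sys:handle_system_msg(Request, From, Parent, ?MODULE,
                                  Debug, State)
    end.

system_continue(Parent, Debug, State) ->
    loop(Parent, Debug, State).

system_terminate(Reason, _Parent, _Debug, _State) ->
    exit(Reason).

%% Nothing to convert in this first version.
system_code_change(State, _Module, _OldVsn, _Extra) ->
    {ok, State}.
```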

As with our first module, we compile and start our new server and print its state a couple of times.

1> c(p).
2> p:start_link().
3> p:print_state().
state: 0
4> p:print_state().
state: 1

We make the same change as we did with our gen_server example: the initial state becomes a float, and system_code_change/4 now returns the old state converted to a float.
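Again as a sketch under assumptions (the '$print' label and the loop/3 shape are illustrative, not taken from the original listing), the updated module could look like this. Only init/1 and system_code_change/4 differ from the integer version.

```erlang
-module(p).
-export([start_link/0, print_state/0, init/1]).
-export([system_continue/3, system_terminate/4, system_code_change/4]).

start_link() ->
    proc_lib:start_link(?MODULE, init, [self()]).

print_state() ->
    {ok, Reply} = gen:call(?MODULE, '$print', print_state),
    Reply.

init(Parent) ->
    register(?MODULE, self()),
    Debug = sys:debug_options([]),
    proc_lib:init_ack(Parent, {ok, self()}),
    %% New servers start with a float counter.
    loop(Parent, Debug, 0.0).

loop(Parent, Debug, State) ->
    receive
        {'$print', From, print_state} ->
            io:format("state: ~p~n", [State]),
            gen:reply(From, ok),
            loop(Parent, Debug, State + 1);
        {system, From, Request} ->
            sys:handle_system_msg(Request, From, Parent, ?MODULE,
                                  Debug, State)
    end.

system_continue(Parent, Debug, State) ->
    loop(Parent, Debug, State).

system_terminate(Reason, _Parent, _Debug, _State) ->
    exit(Reason).

%% Convert the integer state of the old version to a float.
system_code_change(State, _Module, _OldVsn, _Extra) when is_integer(State) ->
    {ok, float(State)};
system_code_change(State, _Module, _OldVsn, _Extra) ->
    {ok, State}.
```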

Suspend the process, re-compile, re-load, call change_code and resume the process. The server's state is now a float.

5> sys:suspend(p).
6> c(p).
7> l(p).
8> sys:change_code(p,p,[],[]).
9> sys:resume(p).
10> p:print_state().           
state: 2.0
11> p:print_state().
state: 3.0