Making C Extensions More Pythonic


Jan01: Making C Extensions More Pythonic

Andrew is a consultant in Santa Fe, New Mexico. He can be contacted at dalke@acm.org.


It is relatively easy to turn C libraries into Python extensions, especially with automatic interface generators like SWIG (see http://www.swig.org/ and "SWIG and Automated C/C++ Scripting Extensions," by David Beazley, DDJ, February 1998). However, viewed from Python, the new extensions still feel like C libraries because they don't take advantage of features such as automatic garbage collection, classes, and exceptions. Extension users are thus forced to write more code to deal with the mismatch between the two language styles. In this article, I'll present PyDaylight (http://starship.python.net/crew/dalke/PyDaylight/), an object-oriented Python wrapper built on top of the low-level interface to the underlying C libraries. Wrappers like this let you build more "Pythonic" interfaces to existing C libraries.

During the past couple of years, I have been working with the Daylight toolkit (http://www.daylight.com/). Broadly speaking, Daylight lets you model, store, and analyze chemical structures using a computer. The toolkit is a set of C libraries for different aspects of chemical informatics. As a rough, nonchemistry analogue, some of the libraries correspond to a string library, a regular expression package, display widgets, and database clients.

The toolkit API is a set of function calls and compile-time constants. Internal data structures are only exposed through opaque integer valued handles. This design is similar to many other C libraries, such as file descriptors in the Standard C Library, so the methods I describe are widely applicable.

I started using Daylight when working at Bioreason (http://www.bioreason.com/), which uses the toolkit as one component of its chemical knowledge discovery and management system. We needed a flexible system for rapid development and testing of new algorithms. Since I had had good experiences using Tcl and Perl in previous projects, I wanted to use a very high-level language. I had also found that nondeveloper programmers — physicists, biologists, and chemists who program — usually didn't leverage the benefits of either language. The problem seemed to be the difficulty in describing complex data structures, especially as combined with Tcl's quoting rules or Perl's syntax. Python appeared to be a cleaner language, so we started experimenting with it. We were helped by Roger E. Critchlow, Jr., then working at Daylight, who had written DaySWIG (http://www.daylight.com/meetings/mug98/Critchlow/dayscript/title.html) — a program that massages the library header files into a SWIG interface file, then creates the dayswig_python.so Python extension module.

Automatic Garbage Collection

Thinking that Roger had already done the hard work, I proceeded to write my first program using the dayswig_python.so module. It worked fine on my small test set, so I tried it on our full chemical library. After the disk started thrashing, I realized the extension module must be leaking memory because of a missing dt_dealloc call, the equivalent of free for Daylight's opaque handles.

Python has automatic garbage collection based on reference counting. When an object's count goes to 0, a special method named "__del__" is called. This hook let me write a wrapper object (see Listing One), which stores the handle and calls dt_dealloc on the handle when the object is no longer referenced. The result is that Python manages toolkit handles, rather than you.

The __del__ method is tricky in that it may be called during program termination, when the interpreter is cleaning up. Modules are removed in a somewhat arbitrary order, so the dayswig_python module may be cleaned up before the wrapper object. If that happens, accessing dayswig_python.dt_dealloc creates a NameError exception because dt_dealloc no longer exists. Instead, an extra reference to the function is stored as a default parameter in the argument list. The reference will only be removed after all instances of the class are deleted. Using default arguments this way is a common technique, especially for performance reasons, but is best left to functions like __del__ that aren't part of the public interface.

PyDaylight is a Python package that builds on the shared library produced by DaySWIG. It uses a smart_ptr class similar to that in Listing One, so that toolkit handles are deallocated automatically. The toolkit functions still expect an integer, not a wrapper object, so the smart_ptr also implements the special method named __int__, which Python uses for implicit integer conversion. In general, C libraries might use a pointer for the handle. SWIG converts these into specially formatted strings, so the proper coercion would use the __str__ method.

Attributes

The handles are identifiers for objects in the toolkit's data model, such as molecules, atoms, and bonds. Properties, like the charge of an atom, are normally found through function calls, as in Listing Two(a).

Python supports object-oriented programming, so it's more natural to access properties as attributes; see Listing Two(b). Using attributes instead of accessor methods may surprise you if you're used to C++ and Java, and expect to see methods such as getCharge and setCharge. Those two languages use methods to separate interface from implementation. Python supports a different mechanism based on the special methods __getattr__ and __setattr__.

During attribute lookup, the Python runtime checks if the attribute exists in the instance or class namespaces. If that fails, and the __getattr__ method exists, it is called with the attribute name as the sole parameter. The method can execute arbitrary code and either return the appropriate value or raise an AttributeError to signify that the attribute does not exist. For the rare case of a write-only property, it should raise a TypeError.

When setting an attribute, the runtime first checks if the method __setattr__ exists. If it doesn't, the default action puts the name and value into the instance's namespace, called __dict__. If it does exist, the method is called with two parameters: the attribute name and its new value. Again, the method can execute arbitrary code to set the value. Special attribute names are forwarded to toolkit functions, or if the attribute is read-only, TypeError is raised. If an attribute name isn't special, the new __setattr__ method should implement the default action of adding the name and value into __dict__.

Listing Two(c) shows a simple Atom class to get and set the atom's charge, and get the atomic symbol. It is another wrapper, so it stores the handle and implements integer coercion with __int__, with a call to the int function in case the handle is actually a smart_ptr.

Dispatch Table

Having a list of if/elif statements for all of the attributes is cumbersome and causes a lookup time linear in the number of attributes. The toolkit accessor functions have the same functional form with the handle as the first parameter, and for setter functions, the new value as the second. Python allows polymorphic parameters and return values, so the accessors can be listed in a dispatch table based on the attribute name. A table item's value is the tuple of the getter and the setter. Listing Three(a) also raises the expected TypeError exception for read-only properties. The extension to write-only properties, which do exist in the toolkit, is not shown but is straightforward.

If a derived class's __getattr__ fails, it should normally call the __getattr__ of its parents. Python method calls are somewhat expensive and should be minimized, especially as lookups often occur inside inner loops. All of the PyDaylight base classes use a dispatch table, so for performance, a derived class can merge the parent's dispatch table with its own; see Listing Three(b). That isn't a pure object-oriented design, but it does give a noticeable performance increase and is hidden from external users.

The actual PyDaylight code is more complicated than the example, but faster. The dispatch table is converted at module import time into a getter list and a setter list, which removes the overhead of indexing into the get_set tuple. Those lists are passed as default arguments so they can be looked up in constant time instead of resolving self._properties. This requires that each class implement its own __getattr__ and __setattr__ instead of using the base class methods, but has the advantage of reducing call time, as the methods can be found without a full traversal of the class hierarchy.

Listing Available Attributes

Python has an introspection command called "dir," which lists all of the attributes of an object. That function is useful in an interactive environment to see what attributes are available from an object. Most objects store their attributes in their __dict__ dictionary so the list of variables is available by calling __dict__.keys().

The __getattr__ and __setattr__ methods expose new attributes on demand and will not be in the list of keys. To support this case, dir uses the list of keys then checks for two special variables — __members__ and __methods__. The first lists all additional attributes and the second lists any methods that are created dynamically. All three lists are combined to produce the list of available attributes.

Because all instances of a class have the same attributes, I often use a class variable to store the list of keys from the dispatch table, as shown twice in Listing Three.
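The __members__ and __methods__ hooks belong to the Python of this article's era. In current Python 3, dir no longer consults them and instead calls a __dir__ method if the class defines one. As a hedged sketch of the same idea in that form, with get_charge and get_symbol as hypothetical stand-ins for toolkit accessors:

```python
# Hypothetical stand-ins for two toolkit accessors.
def get_charge(handle):
    return 0

def get_symbol(handle):
    return "C"

class Atom:
    _properties = {"charge": get_charge, "symbol": get_symbol}

    def __init__(self, handle):
        self.handle = handle

    def __getattr__(self, name):
        getter = self._properties.get(name)
        if getter is None:
            raise AttributeError(name)
        return getter(self.handle)

    def __dir__(self):
        # Start from the default listing, then add the on-demand names
        # taken from the keys of the dispatch table.
        names = set(object.__dir__(self))
        names.update(self._properties)
        return sorted(names)

atom = Atom(3)
listed = dir(atom)
```

As in the article's approach, the extra names come from the keys of the dispatch table, so the listing cannot drift out of step with the attributes that __getattr__ serves.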

Other Special Methods for Objects

There are three other special methods you should know about when making wrapper objects. The first is __cmp__, which defines how two objects are compared. In PyDaylight, two objects are equal if their handles are the same, so the base __cmp__ method simply compares their integer values. Watch out for equivalency, as it may cause some confusion because two different Python objects can be wrappers around the same toolkit object. Modifying the nontoolkit attributes of one object does not affect the other. This is a problem with almost any wrapper system.

Python has a key/value container called a "dictionary." Any object can be a key as long as it defines a constant hash value such that if two objects are equal, they have the same value. The value is returned via the special method named __hash__. Toolkit handles have the right properties, so they are used as the hash values in the wrapper objects. If your library uses SWIG-style string encoded pointers, the built-in function hash can be used to return the string's hash value.

The final special method is __nonzero__, which is used in a Boolean context to tell if an object is True or False; for instance, if x: print "true". Objects are normally considered True unless __nonzero__ exists and returns 0 or __len__ exists and returns 0.

Even if your objects are always True, __nonzero__ should be defined since lookup failure has the overhead of calling __getattr__ to see if the method is created dynamically. During performance tuning, it's helpful to print __getattr__ failures to see if there are extra lookups for special methods like these.
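The three methods can be seen working together in a short sketch. It is written for current Python 3, where the corresponding hooks are named __eq__ and __bool__ rather than __cmp__ and __nonzero__, and __hash__ is unchanged. The Wrapper class is a hypothetical example, not a PyDaylight class.

```python
class Wrapper:
    def __init__(self, handle):
        self.handle = handle

    def __int__(self):
        return int(self.handle)

    # Two wrappers are equal when they wrap the same toolkit handle.
    def __eq__(self, other):
        if not isinstance(other, Wrapper):
            return NotImplemented
        return int(self) == int(other)

    # Equal objects must hash alike, so the handle is the hash value.
    def __hash__(self):
        return hash(int(self))

    # Defined explicitly so truth testing never falls through to other
    # lookups.
    def __bool__(self):
        return True

a = Wrapper(42)
b = Wrapper(42)          # a second wrapper around the same handle
a.note = "seen"          # a nontoolkit attribute, stored on a only
table = {a: "first"}
found = table[b]         # b finds the entry stored under a
```

The last lines show both sides of the equivalence issue: b can stand in for a as a dictionary key, yet the note attribute set on a is not visible through b.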

Ownership and Lifetime

In the toolkit data model, atoms are part of a molecule. Their handles are not stored in a smart_ptr object because their lifetime is determined by the molecule containing them. Molecule handles are different. In some cases, they are independent objects and need to be reference counted. In other cases, they are part of another object, called a "reaction," which determines their lifetimes. (Reactions contain reactant, agent, and product molecules.) The implementation difference is the handle passed to the molecule's constructor. In the first case it is a smart_ptr while in the second it is just the integer handle.

Ownership rules cause some complications with SMARTS match objects. A SMARTS pattern is to chemical graphs like a regular expression is to a string. It defines a subgraph that may match one or more parts of a compound. Each match, called a "path," can be queried to identify which atoms and bonds were part of the SMARTS pattern. Paths allocate new memory so a smart_ptr object is used to ensure proper garbage collection.

The path contains references to part of a molecule. If that molecule is deleted, the path is invalid and the toolkit automatically deletes it. The Python smart_ptr object doesn't know about this toolkit relationship. When the smart_ptr finally goes out of scope, it will try to delete the already deleted handle, and fail.

The Python class wrapper for the path data type knows how to enforce the dependency. The Path object stores the path handle and the molecule used when creating the match. Because instance variables are normally removed in arbitrary order during cleanup, Path objects define a __del__ method (see Listing Four) that removes the path handle first and then the molecule. This guarantees that the molecule always outlives any match that uses it.

Lists and Iterators

The toolkit has two types of list containers — a stream and a sequence. Each is both a container and a single associated forward iterator. The difference is the container for a stream is owned by another toolkit object (for example, a list of atoms in a molecule), while the container for a sequence is owned by the caller.

The two lists allocate some memory, so they must be held by a reference counted object. A stream only owns its iterator so it can be held by a smart_ptr. A sequence may also own all of the data in the container, depending on the context. If the sequence does own its data, the elements must be deleted when the sequence is finished.

This calls for a new type of smart wrapper, which I call a "smart_list." It is identical to the smart_ptr except for the __del__ method. This resets the sequence iterator then traverses the list, deallocating as it goes. When finished, it deallocates the sequence handle.

Python has its own list container with a different interface than streams or sequences. The easiest way to make toolkit lists work like Python lists is to copy each element into a new Python list; see Listing Five. Since the items are usually object handles, the copy routine takes an optional converter function, which will likely be a class constructor.

Copying works best if the lists are small and there is no need for list modifications to affect the original stream. If those conditions don't hold, you need a class that implements the Python list behavior and translates it to the underlying toolkit calls. List behavior is fully described in the Python documentation.

PyDaylight doesn't need the full behavior, only a list wrapper for iteration through streams and sequences. Classes can implement iteration by defining a __getitem__ method, which takes the integer offset as its parameter. Python assumes lists are random access while the toolkit lists only offer forward iteration. Random access can be emulated by resetting the iterator and seeking forward to the right position, but this leads to unexpected performance characteristics such as order N**2 reverse traversal, so my interface tracks the current position and raises an exception if anything other than forward access is tried; see Listing Six. Just like the list copy function, the list iterator takes an optional conversion function to turn toolkit handles into their wrapped form.

The class also implements a method named "next," which returns the next object from the list. This is not part of Python's list interface, but is a common idiom for forward iterators. It returns a None object at the end of the list rather than raising an exception.
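To show how a for loop drives such a class, here is a cut-down sketch in the spirit of Listing Six, in current Python 3 syntax. The function fake_next, the _streams and _cursor dictionaries, and the 0 end marker are hypothetical stand-ins for a toolkit stream and dt_next.

```python
# Hypothetical stand-in for a toolkit stream: each handle has a list of
# element handles and a single forward cursor.
_streams = {5: [101, 102, 103]}
_cursor = {5: 0}

def fake_next(handle):
    items = _streams[handle]
    pos = _cursor[handle]
    if pos >= len(items):
        return 0                   # assumed end-of-stream marker
    _cursor[handle] = pos + 1
    return items[pos]

class Iterator:
    def __init__(self, handle, converter=None):
        self.handle = handle
        self._i = 0
        self.converter = converter

    def __getitem__(self, i):
        if i != self._i:
            raise IndexError("forward iteration only")
        element = fake_next(self.handle)
        if not element:
            raise IndexError("list index out of range")
        self._i = i + 1
        if self.converter:
            return self.converter(element)
        return element

# A for loop asks for items 0, 1, 2, ... until IndexError is raised.
seen = [item for item in Iterator(5, str)]

# Anything other than the next index is rejected.
_cursor[5] = 0
stream = Iterator(5)
first = stream[0]
try:
    stream[0]
    rejected = False
except IndexError:
    rejected = True
```

The loop never needs random access, so the forward-only restriction is invisible to ordinary iteration and only surfaces when a caller indexes out of order.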

It's helpful to know the size of the list. A list class can make this known by implementing the __len__ method. An empty list is considered to be False when in a Boolean context, so the __nonzero__ method can be identical to the __len__ method.

Exceptions

As with most C libraries, the Daylight library returns errors via out-of-range values and the corresponding error message via a global function. In my long experience, few people check the return values of every function call, so errors have a tendency to hide and pop up unexpectedly elsewhere.

The problem arises because errors in C are implicitly ignored, which is usually the wrong action. Instead, most of PyDaylight checks the return values and raises an exception if there are any problems. The exception contains the relevant message from the global error function, which helps with debugging and diagnostics.
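The pattern might look like the sketch below, written in current Python 3 syntax. None of these names come from the toolkit or from PyDaylight: ERROR_VALUE, fake_charge, fake_errormsg, ToolkitError, and checked_charge are all hypothetical, standing in for an out-of-range return value, a toolkit call, the global error function, and the wrapper's exception type.

```python
ERROR_VALUE = -9999                # hypothetical out-of-range return value

# Hypothetical stand-in for toolkit state: atom handle -> charge.
_atoms = {1: 0, 2: -1}
_last_error = [""]

def fake_charge(handle):
    if handle not in _atoms:
        _last_error[0] = "handle %d is not an atom" % handle
        return ERROR_VALUE
    return _atoms[handle]

def fake_errormsg():
    # Plays the role of the library's global error-message function.
    return _last_error[0]

class ToolkitError(Exception):
    pass

def checked_charge(handle):
    result = fake_charge(handle)
    if result == ERROR_VALUE:
        # Turn the silent error value into an exception that carries the
        # library's own message.
        raise ToolkitError(fake_errormsg())
    return result

good = checked_charge(2)
try:
    checked_charge(99)
    message = None
except ToolkitError as err:
    message = str(err)
```

A legitimate negative charge passes through untouched, while the error value can no longer be mistaken for data by the caller.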

Checking every function return value does have some overhead because there are cases where it isn't needed. For those rare performance-critical regions of code, I bypass PyDaylight entirely and call the underlying toolkit calls directly.

Several classes of errors don't even need to be checked because Python ensures they will never occur. Some toolkit errors occur because the wrong object type was passed to a function. For example, dt_charge takes an atom handle and passing it any other handle returns an error value. PyDaylight only calls dt_charge for Atom objects so it can never pass in the wrong data type. There is no need to check for an error.

Conclusion

PyDaylight improves upon the Daylight toolkit API by presenting it in a more modern form. Developers who have tried it find it much easier to use than the standard API, so people are able to get much more work done in less time. The techniques I used are appropriate for many other C libraries, and I hope they will be useful to you in migrating existing code to Python. Finally, I've also included a test suite (available electronically; see "Resource Center," page 5) for the code presented here. The test suite assumes you are working on a UNIX-based system and Python is in your path. To test the examples, simply enter "make test." If everything works correctly, the last line printed should read, "All tests passed successfully."

DDJ

Listing One

# Wrapper object to garbage collect the toolkit handle when no longer needed.
import dayswig_python
class smart_ptr:
    def __init__(self, handle):
        self.handle = handle
    def __del__(self, dt_dealloc = dayswig_python.dt_dealloc):
        dt_dealloc(self.handle)
    def __int__(self):
        return self.handle

Listing Two

#  Getting and setting an atom's charge using (a) toolkit function calls
#  and (b) attributes. (c) shows how attributes are converted to function calls.

# Part (a)
print "The charge is", dt_charge(atom)
dt_setcharge(atom, 1)

# Part (b)
print "The charge is", atom.charge
atom.charge = 1

# Part (c)
class Atom:
    def __init__(self, handle):
        self.handle = handle
    def __int__(self):
        return int(self.handle)
    def __getattr__(self, name):
        if name == "charge":
            return dt_charge(self.handle)
        elif name == "symbol":
            return dt_symbol(self.handle)
        raise AttributeError, name
    def __setattr__(self, name, val):
        if name == "charge":
            dt_setcharge(self.handle, val)
        elif name == "symbol":
            raise TypeError, "readonly attribute"
        else:
            self.__dict__[name] = val

Listing Three

#  (a) Part of the dispatch table used in PyDaylight's base class.
#  (b) A derived class which adds atom-specific attributes.

# Part (a)
dayobject_properties = {
   "type": (dt_type, None),
   "typename": (dt_typename, None),
   "stringvalue": (dt_stringvalue, dt_setstringvalue),
}
class dayobject:
    __members__ = dayobject_properties.keys()
    _properties = dayobject_properties
    def __init__(self, handle):
        self.handle = handle
    def __int__(self):
        return int(self.handle)
    def __getattr__(self, name):
        get_set = self._properties.get(name, None)
        if get_set is None:
            raise AttributeError, name
        return get_set[0](self.handle)
    def __setattr__(self, name, val):
        get_set = self._properties.get(name, None)
        if get_set is None:
            self.__dict__[name] = val
        else:
            set = get_set[1]
            if set is None:
                raise TypeError, "readonly attribute"
            set(self.handle, val)
# Part (b)
atom_properties = dayobject_properties.copy()
atom_properties.update( {
    "charge": (dt_charge, dt_setcharge),
    "symbol": (dt_symbol, None),
    "weight": (dt_weight, dt_setweight),
})
class Atom(dayobject):
    __members__ = atom_properties.keys()
    _properties = atom_properties

Listing Four

#  Enforcing a toolkit dependency by deleting dependent objects first.
class Path(dayobject):
    def __init__(self, path, mol):
        self.handle = path
        self.mol = mol
        #  .. more initialization code ..
    def __del__(self):
        del self.handle
        del self.mol

Listing Five

#   Converting from toolkit streams and sequences to a Python list.
def toList(seq, converter = None):
    if not seq:
        return []
    dt_reset(seq)
    result = []
    while 1:
        element = dt_next(seq)
        if not element:
            return result
        if converter:
            result.append(converter(element))
        else:
            result.append(element)

Listing Six

#  A list-like class for forward iteration through toolkit streams.
class Iterator:
    def __init__(self, handle, converter = None):
        self.handle = handle
        self._i = 0
        self.converter = converter
    def __len__(self):
        return dt_count(self.handle, TYP_ANY)
    __nonzero__ = __len__
    def __getitem__(self, i):
        if i != self._i:
            raise IndexError, "forward iteration only"
        element = dt_next(self.handle)
        if not element:
            raise IndexError, "list index out of range"
        self._i = i + 1
        if self.converter:
            return self.converter(element)
        return element
    def next(self):
        try:
            return self.__getitem__(self._i)
        except IndexError:
            return None



