20 Nov 2012 crhodes   » (Master)

For posterity, and to postpone my deletion from planet.lisp a few more months... I was asked by Faré about whether it is possible to speed up the first calls to some user-defined generic functions in SBCL.

To understand that request, first we need to understand the implementation strategy for the discriminating function in SBCL's metaobject protocol. The discriminating function is responsible for computing the set of methods which are applicable given the arguments to the generic function, determining the effective method (the actual code to be run given the generic function's method combination), and then running that effective method.

One possible strategy for implementation is to do nothing at all in advance – simply at each call of the generic function, call compute-applicable-methods to find the ordered set of applicable methods, apply method combination to that set to generate an effective method form, compile it, and then call the resulting function with the argument list and the methods list. This is about as slow as it sounds: while it is correct, since it is typical for generic functions to be called more than once with the same classes of arguments, there is wasted effort in this strategy from repeating the same computations over and over again.

Fortunately, it is possible to do better. Quite apart from some special cases (slot accessors, predicates, single methods – documented in the SBCL Internals manual), if the generic function has the same methods, the result of compute-applicable-methods on arguments of the same classes will be the same (hence compute-applicable-methods-using-classes), and if it also has an unchanged method combination the effective method form and function will be unchanged. This suggests cacheing the result of computing the effective method, attempting a lookup based on the classes of the arguments before the slow path, and invalidating the cache if things change (e.g. adding or removing methods to the generic function or a change in the class hierarchy).

All the above is in “Efficient Method Dispatch in PCL” (Kizcales & Rodruigez, 1990); while there are more details in SBCL's implementation (including an alternate strategy for cacheing based on type dispatch rather than class hash codes) that paper is recommended reading for anyone wanting to understand how to get tolerable function call speed in generic-function-based environments. But there remains the problem of the initial state of the discriminating function: what should it be? The natural choice is of an empty cache, so that the cache gets filled based on what effective methods are actually used, but that means that there can be a substantial amount of work to do at startup to warm up all the caches for all the generic functions in a system. In fact, in the days when I used to develop and deploy CLIM-based applications, this was so noticeable that the deployment script I used started the application up, exited it and scrubbed the application state before dumping the memory image, precisely so that the generic function caches had relevant entries, meaning that application startup was much faster. How much faster? To get some kind of sense, let's look at an example:


(defgeneric foo (x)
  (:method ((x string)) (concatenate 'string "x" x))
  (:method ((x integer)) (1+ x))
  (:method ((x pathname)) (pathname-type x))
  (:method ((x generic-function)) (sb-mop:generic-function-name x)))

(time (foo 3))     ; 239,230 processor cycles
(time (foo 3))     ; 223,855 processor cycles
(time (foo 3))     ; 73,925 processor cycles
(time (foo 3))     ; 4,100 processor cycles
(time (foo #'foo)) ; 176,225 processor cycles
(time (foo #'foo)) ; 5,340 processor cycles
(time (foo "x"))   ; 103,090 processor cycles
(time (foo "x"))   ; 9,645 processor cycles
(time (foo #p""))  ; 136,780 processor cycles
(time (foo #p""))  ; 6,560 processor cycles

Eyeballing these (and knowing something more about the details of the implementation, which always helps), we can estimate the overhead as being about 400,000 processor cycles plus about 100,000 per distinct concrete class passed as an argument: call it about 0.5ms total per generic function based on a 1GHz processor. It doesn't take too many generic functions with empty caches called in sequence (as in protocol-heavy frameworks like CLIM) to make this startup delay noticeable.

What if it isn't possible to run the application beforehand? Well, it is possible to fill caches by hand. Here's one way to do it:


(defun precompile-gf (gf)
  (let* ((lock (sb-pcl::gf-lock gf)))
    (setf (sb-pcl::gf-precompute-dfun-and-emf-p (sb-pcl::gf-arg-info gf)) t)
    (let ((classes-list (mapcar (lambda (x) (sb-mop:method-specializers x))
                                (sb-mop:generic-function-methods gf))))
      (multiple-value-bind (dfun cache info)
          (sb-pcl::make-final-dfun-internal gf classes-list)
        (sb-thread::call-with-recursive-system-lock ; or -SPINLOCK
         (lambda () 
           (sb-pcl::set-dfun gf dfun cache info)
           (sb-mop:set-funcallable-instance-function gf dfun))
         lock)))))

This computes the set of classes directly named in the method specializers, and pre-fills the cache with entries corresponding to direct instances of those classes: as long as no subsequent changes occur, every call involving direct instances will be a cache hit and there will be no expensive recomputations:


(fmakunbound 'foo)
(defgeneric foo (x)
  (:method ((x string)) (concatenate 'string "x" x))
  (:method ((x integer)) (1+ x))
  (:method ((x pathname)) (pathname-type x))
  (:method ((x generic-function)) (sb-mop:generic-function-name x)))
(precompile-gf #'foo)

(time (foo #p""))  ; 5,995 processor cycles
(time (foo #p""))  ; 5,940 processor cycles
(time (foo 3))     ; 173,955 processor cycles
(time (foo "x"))   ; 81,945 processor cycles
(time (foo #'foo)) ; 148,435 processor cycles

The “direct instances” restriction there is a strong restriction: if the hierarchy is based around protocol classes (used in specializers) and implementation classes (used as concrete classes for instantiation) then the initial cache filling will be useless (as in the case above: in SBCL, 3 is a direct instance of the (implementation-specific) FIXNUM class.

What's the second try, then? If the class hierarchy is not too deep and the generic functions don't have too many multiple-argument specializers, then pre-filling caches with all possible class argument combinations might not be totally prohibitive:


(defun class-subclasses (class)
  (let ((ds (copy-list (sb-mop:class-direct-subclasses class))))
    (if (null ds)
        (list class)
        (append (list class) 
                (remove-duplicates (mapcan #'class-subclasses ds))))))

(defun class-product (list)
  (if (null list)
      (list nil)
      (let ((first (class-subclasses (car list)))
            (rest (class-product (cdr list)))
            result)
        (dolist (c first result)
          (setf result (nconc (mapcar (lambda (r) (cons c r)) rest) result))))))

(defun exhaustive-class-list (gf)
  (let ((specs (mapcar #'sb-mop:method-specializers 
                       (sb-mop:generic-function-methods gf))))
    (remove-duplicates (mapcan #'class-product specs) :test #'equal)))

(defun precompile-gf-harder (gf)
  (let* ((lock (sb-pcl::gf-lock gf)))
    (setf (sb-pcl::gf-precompute-dfun-and-emf-p (sb-pcl::gf-arg-info gf)) t)
    (let ((classes-list (exhaustive-class-list gf)))
      (multiple-value-bind (dfun cache info)
          (sb-pcl::make-final-dfun-internal gf classes-list)
        (sb-thread::call-with-recursive-system-lock ; or -SPINLOCK
         (lambda () 
           (sb-pcl::set-dfun gf dfun cache info)
           (sb-mop:set-funcallable-instance-function gf dfun))
         lock)))))

(fmakunbound 'foo)
(defgeneric foo (x)
  (:method ((x string)) (concatenate 'string "x" x))
  (:method ((x integer)) (1+ x))
  (:method ((x pathname)) (pathname-type x))
  (:method ((x generic-function)) (sb-mop:generic-function-name x)))
(precompile-gf-harder #'foo)

(time (foo 3))     ; 4,510 processor cycles
(time (foo #p""))  ; 6,060 processor cycles
(time (foo "x"))   ; 10,685 processor cycles
(time (foo #'foo)) ; 5,985 processor cycles

So, nirvana? Well, what happens if the class hierarchy is not so friendly, and the exhaustive classes list is too exhausting? Stay tuned for the next thrilling episode...

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!