In which I take on the great @stevewoz at code golf

(subtitle: when does doubling the code size and messing with the bear seem like a good idea?)

I’m like the boy that couldn’t stop washing his hands, with the Sweet16. I wasn’t using the BK instruction for anything in PETTIL, so I wanted to transform it into a NXT instruction which returns to 6502 (just like RTN) but also performs the “jmp next” to terminate the primitive.  Then my Sweet16 primitives could change from the “before” to “after,” speeding things up a few cycles and gaining three whole bytes!  Because this is how code golf is played. Just please don’t ask me why I play.  Here’s what I wanted to be able to do in the client code:

    .byt ld | R4
    ... more Sweet16 code
    .byt push
    .byt st | R4
    .byt rtn
    jmp next        ; 4 bytes

    .byt ld | R4
    ... more Sweet16 code
    .byt push
    .byt st | R4
    .byt nxt        ; 1 byte

Wow. Such score. So hip. Many holiday.

While rooting around in there, I noticed a bit of non-DRY code in the half dozen register-oriented branch instructions (BP, BM, BZ, BNZ, BM1, BNM1) They all (or could easily) begin with the phrase [ ASL; TAX; LDA R0H,X ]

This sent me down a new wormhole.  What could I do to consolidate these four bytes repeated six times into a single DRY instance?  The logical place to put it would be in the TOBR dispatcher routine.  Here’s the original Woz TOBR:

        BNE  TOBR2          ;INCR PC
        INC  R15H
        PHA                 ;ONTO STACK FOR NON-REG OP
        LDA  R14H           ;"PRIOR RESULT REG" INDEX
        LSR                 ;PREPARE CARRY FOR BC, BNC.
        RTS                 ;GOTO NON-REG OP ROUTINE

and the original BRTABLE opcodes:

00    RTN
01    BR
02    BNC
03    BC
04    BP
05    BM
06    BZ
07    BNZ
08    BM1
09    BNM1
0a    BK
0b    RS
0c    BS
0d    unused
0e    unused
0f    unused

Repurposing R14 register
Instead of combining the (doubled) prior result register with the carry flag in R14H, I’m using R14L for the prior result register, because it was never used for anything else and it shortens the final code.  The original way:

 f e d c b a 9 8 7 6 5 4 3 2 1 0
|     | prior |C|               |  R14  "Woz"

R14H    combined prior register/carry flag
R14L    unused

Carry bits are always shifted in with a ROL or ASL (to bit R14[8]) but they are nondestructively tested, so nothing ever actually clears R14H.  This causes bits 9-F of R14 to fill with remnant carries (“trash”).  These bits are ignored.  New and improved:

 f e d c b a 9 8 7 6 5 4 3 2 1 0
|    trash    |C|     | prior |0|  R14 PETTIL

R14H    carry flag only
R14L    doubled prior result register

So, kind of a big change.  Here’s another big change.  After some analysis, I’m renumbering the BRTABLE opcodes, and reorganizing things on the “opcode handler page” (where all entry points must fit within the page).  There are a few things that need to happen to BRTABLE opcodes:

A) bump the R15 program counter
B) low byte of the handler address to the stack
C) preload the Carry flag
D) preload the high byte of the prior result
E) dispatch the handler with RTS

Renumbered (and new) BRTABLE opcodes
BE      00    RTN    ; needs to remain $00
ABDE    01    BP
ABDE    02    BM
ABDE    03    BZ
ABDE    04    BNZ
ABDE    05    BM1
ABDE    06    BNM1
; new - RTN followed by JMP NEXT
BE      07    NXT
; new - perform a DUP operation on the split PETTIL data stack
BE      08    PUSH
; new - perform a DROP on the PETTIL data stack
BE      09    PULL
ABE     0a    BR
; new - native 6502 extensions to Sweet16
BE      0b    EXT
ABE     0c    BS     ; needs to remain $0c
BE      0d    RS
ABCE    0e    BC
ABCE    0f    BNC

Let’s rewrite some code. Here’s the new TOBR, with comments mixed in.

all opcodes need to do “B”.  Let’s get that out of the way first.
B) low byte of the handler address to the stack
some opcodes (on the high half of the page) can be dispatched immediately (NXT, PUSH, PULL, EXT, RS, RTN)
A) bump the R15 program counter
    INC R15L
distinguish between “C” and “D” branch opcodes
    CPX #nxt*2 ; #14
D) preload the prior result high byte, setting N and Z flags for  BP, BM, BZ, BNZ, BM1, BNM1These next two instructions used to be replicated within each handler.
    LDX R14L
    LDA R0H,X
E) dispatch the handler with RTS
C) preload the prior result Carry flag for BC, BNC.   BR and BS don’t use carry, but they do require A) to get to the offset byte.  They’re here just because it’s the shorter code path.
    LDA R14H
E) dispatch the handler with RTS
A) bump the R15 program counter.   Incrementing the high byte is rare, only occurring when Sweet16 code spans a page boundary.  Putting the rest of it here saves a clock for the most typical case.
    INC R15H

Adding up the scorecard

WozTOBR is spare, clean, elegant.  It always runs in 26 clocks, or 30 clocks when R15 crosses a page boundary, for an average of 26.015625 clocks.  For code golf purposes, it is fair to assume that a page boundary is not crossed.

; A has the opcode
; X has the branch table index
    INC R15L     ; [5]
    BNE TOBR2    ; [3]
    INC R15H     ; [0]
    LDA BRTBL,X  ; [4]
    PHA          ; [3]
    LDA R14H     ; [3]
    LSR          ; [2]
    RTS          ; [6]
   ;[== 13 bytes]  [== 26 clocks]

PETTIL-TOBR is messier, harder to understand, and yields only a quite minor improvement in overall memory and clock cycles.  Arguably it was a complete waste of time to do this, but I had fun and learned something. 

    LDA BRTBL,X  ; [4]
    PHA          ; [3]
    BMI TOBR4    ; [2/3]
    INC R15L     ; [5]
    BEQ TOBR5    ; [2]
    CPX #7*2     ; [2]
    BCS TOBR3    ; [2/3]
    LDX R14L     ; [0]  (free)
    LDA R0H,X    ; [0]  (free)
    RTS          ; [6]
    LDA R14H     ; [3]
    LSR          ; [2]
    RTS          ; [6]
    INC R15H     ; [0]
    BNE TOBR2    ; [0]
   ;[== 26 bytes]  [== 16 clocks -- RTN, NXT, PUSH, PULL, EXT, RS]
                 ; [== 26 clocks -- BP, BM, BZ, BNZ, BM1, BNM1]
                 ; [== 32 clocks -- BR, BS, BC, BNC]
Assuming an (unlikely) evenly distributed instruction mix, we average 23.75 clocks per opcode through PETTIL-TOBR vs. 26 clocks for WozTOBR.  I’m calling this one for PETTIL. As for memory usage, I’m adding 13 bytes here but removing 24 bytes from the Sweet16 handler page. The benefits are 11 fewer bytes and DRYer code.

But wait, there’s more!</ronpopeil>

Since we just shaved 24 bytes off the crowded opcode handler page, there is room to move some of the off-page handler code (e.g. the SET –> SETZ branch) back onto the page.  The memory was trimmed in a place where it can do more good.  Even more win.

Tune in later for more 6502 chop-O-matic action