Beamhacking: Cycle Counting

We've determined that we can get in sync with the raster on a TRS-80 Model 3. But to stay in sync there's nothing for it but to keep track of all the cycles our program uses. At a high level the process is simple enough. Write up the code that changes the screen on the fly. Then add code that controls the display whether that means moving stuff around, checking for user input or whatever. Then put in some code to use up any remaining T-states before the whole process must repeat again. Hopefully everything goes right and you're rewarded with a solid display. Typically there's some error in calculation and the display "drifts" as the program slides slowly (or sometimes rapidly) out of sync with the beam.

For the moment let's ignore the detail of counting T-states because even once we know how many T-states our code uses there's still the problem of using up the excess. What we really need is a subroutine to use up any given number of T-states. Z-80 programmers may wish to stop at this point and try and write such a subroutine — it's a very fun challenge. You may find this chart useful.

I went through several variants before my current version which is small and, most importantly, has relatively little overhead. The subroutine obviously can't use zero T-states if asked. In fact, there will be a minimum number of T-states it uses no matter what. What we end up with is something that burns some passed in number of T-states plus a fixed overhead. Here's the best I've got so far.

; wHL -- Waste HL + 100 T states. Only uses A, HL.

wHL256:
        dec     h               ;<0>  | <4>
        ld      a,256-4-4-12-4-7-17-81       ; 81 is wA overhead
                                ;<0>  | <7>
        call    wA              ;<0>  | <17+A>
wHL:    inc     h               ;<4>  | <4>
        dec     h               ;<4>  | <4>
        jr      nz,wHL256       ;<7>  | <12>
        ld      a,l             ;<4>
wA:     rrca                    ;<4>
        jr      c,wHL_0s        ;<7>  | <12> 1 extra cycle if bit 0 set
        nop                     ;<4>  | <0>
wHL_0s: rrca                    ;<4>
        jr      nc,wHL_1c       ;<12> | <7>  2 extra cycles if bit 1 set
        jr      nc,wHL_1c       ;<0>  | <7>
wHL_1c: rrca                    ;<4>
        jr      nc,wHL_2c       ;<12> | <7>  4 extra cycles if bit 2 set
        ret     nc              ;<0>  | <5>
        nop                     ;<0>  | <4>
wHL_2c: rrca                    ;<4>
        jr      nc,wHL_3c       ;<12> | <7>  8 extra cycles if bit 3 set
        ld      (0),a           ;<0>  | <13>
wHL_3c: and     a,0fh           ;<7>
        ret     z               ;<11> | <5>  done if no other bits set
wHL_16: dec     a               ;<0>  | <4>  loop away 16 for remaining count
        jr      nz,wHL_16       ;<0>  | <12>
        ret     z               ;<0>  | <11>
; Last jr was 7, but the extra 5 from "ret z" keeps us at 16 * A.
; The "ret z" cost balances the previous "ret z" in the 0 case.

The numbers in angle brackets are a bit of legacy from a semi-automated cycle counting system, but they also show how the extra cycles get added as a function of HL. The first column of numbers gives instruction times if a branch isn't taken, the second column when the branch is taken. Looking at the first section of wA we see that if bit 0 of A is clear it takes 4+7+4==15 T-states. But if bit 0 is set then we pay 4+12+0==16 T-states. Similar machinations occur down to wHL_16 to effect A mod 16 extra T-states. At wHL_16 A counts the number of 16 cycle chunks that remain which we can burn up with a simple loop. We're pretty lucky to have jr; using jp would use 14 T-states with no way of balancing it to 16 T-states — there is no instruction that uses less that 4 T-states. The first column T-state counts from wA to the end add up to 81. Thus wA will use A + 81 T-states.

It's no accident that the entire first column adds up to 100 and represents the minimal cost of the subroutine. Calling wHL with HL=0 will cost 100 T-states. The last bit of code we need to examine is at the entry point. If H == 0 if falls through to wA with A == L. For non-zero H we jump back to wHL256 which counts down H and wastes 256 T-states using wA. The trick is figuring out the proper value of A that accounts for the overhead in wA and the other time spent passing through this sub-loop of the code.

Nothing more to say. wHL first wastes H*256 T-states and then L T-states which is exactly the desired effect as HL==H*256+L.

wHL is really handy for soaking up a few thousand cycles to pad a program's time to a full frame. It won't do you much good in balancing the action when the screen is being modified. At 10 T-states to load HL, another 17 to call wHL and 100 minimum there's only 1 T-state left per line. The screen update code (or graphics kernel, if you will) must be crafted by hand with the odd instruction added here and there for timing considerations. Put it this way. With each line you'll likely spend 40 or 50 T-states loading up graphics data and then another 40 or 50 writing it to the screen. Using up a remainder of say, 28 T-states is something best done by consulting a table of Z-80 instruction timings. I suggest you use LD (0),HL followed by JR $+2 to burn 16 and 12 T-states respectively.

What about the rest of the code? Here I recommend my version of zmac which has T-state counting built in. Each instruction assembled adds its instruction time to a running total of T-states. The t(mem) operator returns this total for all instructions assembled just before memory location mem. You can change this running total using the tstate n psuedo-op which sets the current total to n. Instruction timing on the Z-80 is a little complicated because some instructions have fast and slow paths depending on input conditions. Assembled instructions always add the minimum number of T-states. Two addition operators help in accounting. tilo(mem) and tihi(mem) return the fast and slow T-cycle counts of the instruction at memory location mem.

The simplest thing to do is count the T-states of a linear section of code.

code:   ld      a,5
        ld      (hl),a
        inc     a
        inc     hl
        ld      (hl),a
cost    equ     t($)-t(code)

cost will be the number of T-states used by those 5 instructions. We could define a bunch of symbols like this to measure the various blocks of our code and add them up later but its much easier to let the assembler keep its running total and adjust it when there's something it doesn't know. For example, here we add in 100 T-states because we know that's the minimum cost of wHL.

        call    wHL
        tstate  t($)+100

Obviously a macro that adds to the current T-state count would be preferable and in this case we'd even want a macro for calling wHL that hides it all. But for now let's keep everything exposed. Putting these two techniques together we can now easily write a loop that will execute in exactly the same time as a Model 3 frame.

loop:   ld      hl,264*128-total
        call    wHL
        tstate  t($)+100
        jp      loop
total   equ     t($)-t(loop)

Loops are one area where zmac just can't get the timing right. As long as the loop has a fixed number of iterations the loop cost can be added quite simply.

cnt     equ     10
        ld      c,cnt
        ld      hl,buf
lp:     ld      (hl),0
        dec     c
        jp      nz,lp
        tstate  t($)+[t($)-t(lp)]*[cnt-1]

Note that t($) accounts for the first iteration of the loop. We must add one less than the number of iterations to balance things out. A simpler expression can be obtained by working with the start of the loop as our basis.

        tstate  t(lp)+[t($)-t(lp)]*cnt

The same loop using jr instead of jp is a bit trickier. Only one iteration of the loop incurs the low T-state cost of jr; all the rest are the maximum. zmac counts the low cost so those cnt-1 interations must have the difference between the high and low cost added. We also need to know what the jr instruction is two bytes long in order to query its high and low users — that's where the $-2 comes from.

cnt     equ     10
        ld      c,cnt
        ld      hl,buf
lp:     ld      (hl),0
        dec     c
        jr      nz,lp
        tstate  t($)+[t($)-t(lp)+tihi($-2)-tilo($-2)]*[cnt-1]

What about loops that execute a variable number of times? I don't recommend them. Your program has a fixed amount of time before it must get back to the business of hammering screen memory. The beam waits for no man. Any loop that executes a variable number of times must have a maximum and you can't run out of time even if that maximum is hit. Thus it is just as well to process the maximum every time to guarantee that the program stays in sync in that case.

Unless you're getting fancy. And why not? The most agressive technique I've used was in Doubled Dancing Demon (and to a lesser extent in the bload simulator). Even if we take away sound the demon's animation needs to be updated every frame. He might need to move to the left, or change to a different frame or do nothing. Making the various small subroutines take exactly the same amount of time is tedious (but more feasible since I added assert to zmac). Instead they're all expected to subtract the number of T-states they use from HL. We'll assume there's some subroutine which updates animstep to point to the animation subroutine for this frame. I can level them out to using 3000 T-states by doing something like this.

        ld      hl,retaddr
        push    hl
        ld      hl,3000
        ld      ix,(animstep)
        jp      (ix)
retaddr:
        call    wHL

Since the Z-80 doesn't have CALL (IX) I had to fool around with pushing the return address I want on the stack and doing a jump. Each animation subroutine does the T-state accounting at run time. For example:

right:  ld      de,(position)
        inc     de
        ld      (position),de
        ld      de,t(right)-t(r_end)
        add     hl,de
        ret
r_end:

That t(right)-t(r_end) might look backwards, but the idea was to count the negative number of T-states used so that adding it to HL will actually subtract the number of T-states used.

One more thing. I should account for the cost of the dispatch. Here's it is again with things adjusted so that the animation subroutine and the dispatcher itself will be an even 3000 T-states.

dispatch:
        ld      hl,retaddr
        push    hl
        ld      hl,3000-d_cost
        ld      ix,(animstep)
        jp      (ix)
retaddr:
        call    wHL
        tstate  t($)+100
d_cost  equ     t($)-t(dispatch)

These examples adequately describe the basic mechanisms needed to track T-states. Like any programming problem you may be best off contemplating the fundamental operations and putting them together in your own way. It all gets tied up in how the rest of the program operates.

Up next is the more straightforward topic of Model 3 video timing.

George Phillips, July 28, 2009, george -at- 48k.ca