Doctor Who Video on the 4P

Around June of 2014 Dusty kicked off a Doctor Who intro contest (https://web.archive.org/web/20141106002721/http://www.trs-80.org.uk/Tandy_Page_5x.html) to see who could write a TRS-80 program that best plays any Doctor Who title sequence. The contest was open to all models of TRS-80 which is quite a range, but for me the Model I/III/4 line was a given due to my long experience with it. The 4th Doctor's (Tom Baker) intro was my only choice. In the 70's I went to England with my parents. I'd never seen Doctor Who in Canada, but our cousin was a fan and we watched a few episodes together. I don't remember a thing about the episodes but the title sequence stuck with me. The visuals were cool and who can forget the wonderful theme music?

After considerable effort, I came up with a technique for streaming audio and video off floppy disks that resulted in this:

112 x 48 pixels at 25 FPS with 1 bit 31,250 Hz audio. The movie data is 694 KB spread across 4 diskettes.

Streaming playback from the floppy drives wasn't what I originally planned at all. This is typical of most software projects but rarely are the details published. Program walkthroughs give the impression of a straightforward path starting from an idea of what needs to be done leading to a rough outline which is implemented as designed. The implementation has some minor bugs which are dispatched and the program is done.

Instead of jumping straight to a walkthrough I'm going to go through the various twists, turns and dead ends I encountered along the way. It is somewhat like the Commodore 64 Paradroid diary (http://www.zzap64.co.uk/zzap3/para_birth01.html) or the Sinclair Spectrum R-Type book (http://bizzley.com/) but a much smaller effort.

First Try

Remember when I said I wasn't going to gloss over details? Well, there is one big dead end that I won't say too much about. I wrote a standalone program that plays the Tom Baker intro:

I did review the other opening sequences just in case I saw something I could do well, but there was little doubt which one I'd pick. I chose the Model 4P purely based on availability (I have one) and the numbers. It is the fastest stock Model I/III/4 TRS-80 (4 MHz Z-80 CPU with no wait states) and it can boot from the RS-232 serial line. Most of my development is done on a Windows PC where I run zmac (http://48k.ca/zmac.html) to assemble code and run it under my trs80gp TRS-80 emulator (http://48k.ca/trs80gp.html). For final checks I can download to the 4P. I decided against using hi-res graphics because only my slower Model 4 has that option and the board can't really manage high-speed full-frame effects due to limited bandwidth between it and the processor.

It took quite a bit of work. There are some hyper-fast line and circle drawing subroutines, various ad-hoc compression methods and some tricks to overlay the various effects. And there was no small amount of work in getting it to look close to the original. In some cases (like the start with the walls) I wrote code to produce a similar effect. At other times (the Doctor's picture) I converted frames from the original to 128 x 48 graphics and did a little touchup by hand.

You can download the prototype here: tb.zip. Run it directly under my emulator:

        trs80gp -m4p "tb.zip?tb.cmd"
or extract tb.cmd and run it on a real Model 4.

I was quite pleased with the result, but it was missing something very important: sound. Given the iconic nature of the Doctor Who theme music I daresay that it was more important than the graphics. So why didn't I start with that? Well, you know, graphics programming is generally more fun and in many ways easier than audio. There is always some surprise, but generally one can visualize what a programmed visual effect will look like. The tricks of good sound are less familiar to me. Not to mention that the TRS-80's audio capabilities are very limited. The 4P has a built-in speaker that is driven by single bit output. Tones are easy enough but require careful timing. The Z-80 must toggle the speaker at the frequency of the tone which is usually done with carefully constructed timing loops. Even then I'd have to transcribe the score though at least that is readily available from http://dwtheme.com/.

I did do that. I even tried out some ways to mix the melody and bass line. Getting that music into the existing program was going to be difficult, but it seemed feasible. To do A just above middle C (a 440 Hz tone) the speaker must be toggled twice per cycle, or every 1/880th of a second. That may seem fast to us humans, but that's roughly 4,500 machine cycles at 4 MHz. More than enough time to get some appreciable chunk of graphics drawing done. For lower tones there is even more time. If the TRS-80 had a programmable, high-resolution timer interrupt the job would be easy. Sadly, it does not. Instead I'd have to break down the graphics drawing into precisely timed chunks. As long as each chunk is reasonably small (say 3,000 cycles) I could switch between drawing graphics and toggling the speaker as needed.

Those draw routines would have to be timed exactly, though. The other big downside of audio over graphics is the precision required. With graphics if you draw frames a little bit too fast or too slow you'll notice the difference but it will still look fine. Not so with audio. Being too fast or slow pushes notes out of tune. Variable timing leads to warbling and other unwanted side effects. Not only would it be quite a challenge, but the overhead of switching between graphics and audio would mean I'd have to lower the frame rate of the graphics. I'm not good at that kind of compromise. I'd worked hard to get the graphics running at 30 frames/second and I'd hate to see that drop.

In fact, I did originally have a rough plan for audio. I'd run the graphics at 30 FPS but make sure all the drawing could be done in 1/60th of a second. I could then use the 60 Hz timer interrupt to switch between graphics drawing and audio output. The audio would have 1/60th second gaps in it but otherwise could be quite high fidelity. Those gaps would have a horrible effect on the quality, but compromises have to be made. I'd still have to cut back on the graphics. About half of the drawing was under 1/60th of a second, but parts where the TARDIS zooms in and when the logo zooms out were just barely under 1/30th of a second.

So I had two ways I could add semi-decent audio to what I had. Except for one rather large problem -- I was out of memory. The existing graphics display used up pretty much all of the 64 KB on my 4P. TRS-80 Model 4 computers can have up to 128K, but only my slower Model 4 had that. If I set that machine as my target the graphics would take a bigger hit. Even then I don't think it would have been enough memory. The most direct way to do the tones was to store the number of cycles between each speaker toggle. A list of frequencies and durations is quite short but listing the time between each speaker toggle is not. Maybe it could be computed on the fly if the tone music was restricted to just the melody or the bass line. That would imply doing an arrangement of the music which is a much more difficult task for me than programming.

The data space needed for playback of digitized audio with gaps was even worse. I knew from previous experiments that 15 KHz 1 bit audio can sound pretty good. 30 seconds of that audio is about 56,000 bytes. Due to the nasty gaps I could cut that in half to 28,000 bytes. Half my memory! The graphics would take an even bigger hit.
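As a rough check on that estimate, here is a minimal C sketch of the arithmetic. The function name is made up for illustration and is not part of the demo.

```c
#include <stdint.h>

/* Bytes needed for 1-bit audio: one bit per sample, 8 samples per byte.
   Hypothetical helper, named for illustration only. */
uint32_t one_bit_audio_bytes(uint32_t sample_rate_hz, uint32_t seconds)
{
    return (sample_rate_hz * seconds) / 8;
}
```

Calling one_bit_audio_bytes(15000, 30) gives 56,250 bytes, and playing audio only half the time halves that to 28,125.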

I wasn't enthused about the alternatives. Either way it meant a bunch more programming, a cutback on the graphics and all for audio that was on the barely acceptable side of things. Faced with unpalatable alternatives I did the only reasonable thing and hoped for a better way.

The Alternative

At a minimum I needed to find a way to squeeze audio data into my graphics demo. The usual response to running out of RAM is to load data from disk. The 4P has two single-sided, double density drives capable of storing 180 KB each. More than enough space for the extra 64 KB of data I need. It wouldn't really do to stop the intro half-way through, load the second half and continue. Perhaps it could load smaller pieces as it goes? The chunks would have to be small. Ideally they would load in 1/60th of a second allowing the program to interleave graphics drawing and data loading. How fast can the floppy transfer data? Once it is up to speed the floppy disk spins at 300 RPM or 5 revolutions per second. There are 18 sectors of 256 bytes each on a track. That amounts to 90 sectors per second or 1.5 sectors in 1/60th of a second. Loading one sector every 1/30th of a second over the first half of the video (about 366 frames) would give us over 90 KB of data. Good, more than enough.
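The throughput figures above can be reproduced with a short C sketch. The constants come from the paragraph; the names are illustrative assumptions, not anything from the player.

```c
#include <stdint.h>

enum {
    RPM               = 300,  /* nominal spindle speed */
    SECTORS_PER_TRACK = 18,
    BYTES_PER_SECTOR  = 256
};

/* Sectors passing under the head each second: 5 revolutions x 18 sectors. */
uint32_t sectors_per_second(void)
{
    return (RPM / 60) * SECTORS_PER_TRACK;
}

/* Bytes delivered when one sector is loaded per frame for `frames` frames. */
uint32_t bytes_loaded(uint32_t frames)
{
    return frames * BYTES_PER_SECTOR;
}
```

With these numbers sectors_per_second() is 90 and bytes_loaded(366) is 93,696, a little over 90 KB.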

Loading the audio data will be easier than loading the graphics data because of its simple structure. The "gappy" digitized audio is a better choice than the tones. Not only is it easier to generate but the result will sound closer to the original.

There are some timing problems. One can't simply ask the floppy for one of its 720 sectors and start loading it instantly. Instead one has to tell it to move the head to the track that contains the sector (called seeking). And then one might have to wait almost a full revolution before the sector comes around under the head to be read. Seeking will be OK. Obviously I'll put the sectors in order such that the first bunch are on the first track, the second on the second track and so on. I'll only need to seek every 18 sectors. The disk can seek from one track to the next in 6 milliseconds and will do so by itself when instructed. A 1/60th of a second is 16.67 milliseconds so when I do need to seek it can be done before the graphics start drawing and I'll know it is done before we start reading the next sector.

Rotational latency can be dealt with easily enough. Arrange the sectors so that they'll be in position when needed. At one sector every 1/30th of a second and with 3 sectors going by every 1/30th of a second all that needs to happen is to put data in every 3rd sector. You can't depend on exact synchronization of the video frames and disk data. The disk won't be exactly 300 RPM and there's certainly no guarantee it will be an exact multiple of the 60 Hz video interrupt. That's OK as I can let the disk data rate drive our code. There will be some tearing of the video if we update it at will, but that won't be terribly noticeable.

There is a problem with the plan. Reading the sector data will take up 2/3 of the 1/60th of a second allotted to audio output. Making a bigger gap without audio is going to bring the quality down. Maybe there's a way around that? Could I update the audio while reading bytes from the disk?

At this point I needed to learn more about how data is read from the floppy. With the drive on the right track the procedure is fairly straightforward. You set the sector to read and issue the "read sector" command. Then poll the floppy controller until it signals that the disk data is coming in. At that point the data can be read a byte at a time. Synchronization is guaranteed as the disk controller will issue wait states to the processor until each byte is ready. All well and good, but how fast is the data coming? Well, 18 sectors, 256 bytes/sector at 200 milliseconds per revolution works out to 43.4 microseconds per byte. As it turns out, the bytes come faster than that. Space on the disk is used to separate the sectors. There are header bytes to denote the start of a sector, trailing CRC checksums and small ID sectors used to identify the number of each sector. The WD1773 manual gives the actual answer -- a nice round 32 microseconds/byte. That's 129.76 Z-80 cycles per byte for the 4P which runs at 4.05504 MHz. The fastest Z-80 instruction is 4 cycles so we can get at most 32 instructions done in that time but more likely, on average, about 16. Not a lot, but seems like more than enough to get a byte and update the audio. And to do 15 KHz the audio only needs to be updated every second disk byte.
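The cycle budget follows directly from the clock speed and the byte time. A minimal C sketch of that calculation, with function names that are my own and purely illustrative:

```c
/* Z-80 clock cycles available per disk byte.
   The clock and byte time are the figures quoted in the text. */
double cycles_per_disk_byte(double clock_hz, double us_per_byte)
{
    return clock_hz * us_per_byte / 1e6;
}

/* Upper bound on instructions per byte if every one took 4 cycles. */
int max_instructions_per_byte(double cycles)
{
    return (int)(cycles / 4.0);
}

/* Audio sample rate when one sample is output every N disk bytes. */
double audio_rate_hz(double us_per_byte, int bytes_per_sample)
{
    return 1e6 / (us_per_byte * bytes_per_sample);
}
```

At 4.05504 MHz and 32 microseconds per byte this comes to about 129.76 cycles, at most 32 of the shortest instructions, and 15,625 Hz when the speaker is updated every second byte.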

With such a workable scheme in place it was surely time to build it, right? Instead I decided to indulge in greed. Partly I looked at the disk and saw that it wasn't being used to its full capacity. If I'm going to learn how to read data from it I may as well get all I can out of it. Mainly I wanted to do better than audio with 1/60th second gaps in it. The disk could readily supply enough data to be playing audio all the time. All I'd need to do is update the graphics while reading audio data. It would be a more aggressive form of the interleaving I contemplated earlier when thinking about putting tone music into the demo. The relative overhead would be much greater. Each chunk of graphics update would need known timing. That may need a little explanation. When reading data from the disk the data bytes will act as a sort of clock. As long as I don't take too much time the act of reading the data will cause the Z-80 to wait until the next data byte is ready. Thus the toggling of the speaker will occur at precise intervals. But when not reading data the code will have to maintain the interval timing itself by executing for 129 cycles.

Once again I have a workable idea. The graphics will likely need modification as they won't get full use of the CPU to keep up the frame rate. The graphics drawing itself will have to be broken into little chunks. It can work, but it will be a lot of work. The pressures of fundamental laziness and lack of copious spare time press upon me for an easier solution. Certainly the graphics update would be a lot easier to time and manage if it was something simple like drawing a byte at a time. Maybe I could stream the graphics from disk?

The disk can deliver 23,040 bytes/second. A screen full of graphics is 1,024 bytes. Looks like a respectable 20 frames/second is achievable. Maybe more as the graphics screen could be packed into 768 bytes. The resolution is 128 x 48 which is 6,144 pixels that can be represented in 768 bytes with 8 pixels per byte. The natural format is larger as the TRS-80 has a 64 x 16 screen and does graphics using a set of 64 graphics characters that only have 6 pixels per byte. Some bytes will be needed for audio but the technique looks feasible. And it is very attractive as I'll have full audio and no restrictions on graphics. I just need to toss together an implementation.
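To make the two frame sizes concrete, here is a small C sketch. It assumes the usual TRS-80 block graphics layout, where each character cell is 2 pixels wide by 3 tall and bit 0 is the top-left pixel; the function names are illustrative and not taken from the demo.

```c
#include <stdint.h>

/* Build one TRS-80 graphics character ($80..$BF) from a 2 x 3 cell.
   Arguments are 0 or 1: top-left, top-right, middle-left, middle-right,
   bottom-left, bottom-right.  Assumed bit order: bit 0 = top-left
   through bit 5 = bottom-right. */
uint8_t graphics_char(int tl, int tr, int ml, int mr, int bl, int br)
{
    uint8_t bits = (uint8_t)((tl ? 1 : 0) | (tr ? 2 : 0) | (ml ? 4 : 0) |
                             (mr ? 8 : 0) | (bl ? 16 : 0) | (br ? 32 : 0));
    return (uint8_t)(0x80 | bits);
}

/* Frames per second the disk could sustain for a given frame size. */
double frames_per_second(uint32_t disk_bytes_per_sec, uint32_t frame_bytes)
{
    return (double)disk_bytes_per_sec / frame_bytes;
}
```

At 23,040 bytes/second a 1,024 byte screen allows 22.5 frames/second and a packed 768 byte screen allows 30, before any bytes are set aside for audio.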

Stream Dreams

All the various estimates of hardware speed made the idea of streaming audio and video from the floppy disk look feasible. All I had to do is refine the high level concept of the program into actual Z-80 code:

  for (;;) {
    read data byte from disk
    update audio
    update graphics
  }

Getting into the details, that "read data byte from disk" is actually a series of different operations. It must set up the disk controller to do a read, wait for the data to spin under the head, read the data bytes and repeat. In that sense the loop would be unrolled so the actual program structure is more along these lines:

  for (;;) {
    set track
    update audio + graphics
    set sector
    update audio + graphics
    while (no data)
      update audio + graphics
    while (have data) {
      read data byte from disk
      update audio + graphics
    }
  }

Of course, there's still the need to seek from time to time but that doesn't complicate the picture too much. What I've not yet considered is buffering. The code will have to read ahead a bit so it actually has graphics and audio data to output while it sets up and reads the next sector. I'll need to have an initial phase where sufficient data is buffered up. I'll assume the program consumes audio and graphics data at close to the same average rate as the floppy can produce it. If their rates don't match then either the buffer will be gradually depleted or it will gradually increase.

How much buffer does it need? There are the gaps between sectors. I don't need to figure out the time exactly, so let's assume it is at most 1/18th of a rotation (11.1 milliseconds). The manual says track to track seek times are only 6 milliseconds. We can arrange the sectors so that there is no rotational latency so 12 milliseconds should be enough. At 43 microseconds per byte, that works out to 280 bytes -- a pretty trivial amount.

Although I'm trying to present this as it happened, I can't remember all the little mistakes made and it is really hard to convey how muddled up one's thinking can get. For instance, the notion of matching data rates and the actual disk data rate spent a lot of time being quite unclear in my mind. I often confused the average rate with peak rate. And half the time I was working on bits of Z-80 code to assure myself that the 129 cycle budget could be met.

With that in mind, I went quite a ways down the path thinking that buffering was so small as to be hardly necessary. It was quite a shock when I looked at the data requirements. At 20 frames per second and 1 KB per frame the 30 second intro would require 3.33 disks to hold all the data. That itself wasn't too bad. A bit inconvenient but in principle a TRS-80 Model 4P with 4 drives could play the demo. Mine doesn't have 4 drives but 7 or 8 seconds for each disk leaves plenty of time for me to swap in a new disk.
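The disk count is easy to verify. A minimal sketch in C, assuming 1 KB means 1,024 bytes and 180 KB per diskette; the helper names are hypothetical.

```c
#include <stdint.h>

/* Total bytes of video for the whole intro. */
uint32_t movie_bytes(uint32_t fps, uint32_t frame_bytes, uint32_t seconds)
{
    return fps * frame_bytes * seconds;
}

/* Whole diskettes needed, rounding up. */
uint32_t disks_needed(uint32_t total_bytes, uint32_t disk_bytes)
{
    return (total_bytes + disk_bytes - 1) / disk_bytes;
}
```

movie_bytes(20, 1024, 30) is 614,400 bytes, which is 3.33 disks of 184,320 bytes, so disks_needed() rounds it up to 4.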

At first I thought that switching disks wouldn't be a problem. Partly that was assuming that I could control the drives independently. Once I'd learned a bit about the hardware it was clear that only one drive could operate at a time. Even then it took a long time to grasp that I could not reasonably expect the drives to be in rotational phase. I might have to wait an entire revolution before the first sector was available. That pushes the buffer up to 200 milliseconds.

It gets worse. The drive needs time to get up to full speed. The manual says to allow one entire second for that. One second! Five entire rotations! Now I'm looking at around 22 KB of buffer. Talk about sticker shock. After I calm down a bit I realize the program itself is small and it otherwise has no serious amounts of data and with 64 KB available there is plenty of RAM to spare.
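Each of those buffer estimates is the same calculation with a longer gap. A rough C sketch, assuming the player drains data at the disk's average rate of 23,040 bytes/second; the function name is illustrative only.

```c
#include <stdint.h>

/* Bytes the player consumes while the disk delivers nothing for gap_ms,
   rounded up.  Assumes playback drains at the disk's average rate. */
uint32_t buffer_bytes_for_gap(uint32_t bytes_per_second, uint32_t gap_ms)
{
    return (bytes_per_second * gap_ms + 999) / 1000;
}
```

A 12 millisecond gap needs 277 bytes, a full 200 millisecond revolution needs 4,608 bytes, and a one second spin-up needs 23,040 bytes (22.5 KB).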

Buffering will make things a little more expensive. Well, that isn't precisely true. The buffering was always needed and rather implied in the pseudo-code. I've just come to realize that there is some extra work to be done. Fortunately, a simple ring buffer will suffice. In C we could put data into the ring buffer like so:

  ring[write_idx++] = disk_byte;
  write_idx %= sizeof ring;

and similarly read from the ring buffer:

  data = ring[read_idx++];
  read_idx %= sizeof ring;

If "sizeof ring" is a power of 2 (in our case, 32 KB), then the relatively expensive modulus operation can be replaced with a bitwise "and"

  write_idx &= 0x7fff;

In Z-80 code we wouldn't bother with an array but use pointers to memory. By lining up the ring buffer in memory the "and" trick can still be used. The top 32KB of RAM on the 4P will do nicely putting the ring buffer from address $8000 to $FFFF. Reading and writing to the ring buffer are basically the same. Assuming Z-80 register DE serves as the buffer pointer:

	ld	(de),a		; write disk byte
	inc	de		; move to next location
	ld	a,$80
	or	a,d		; keep D in $80 to $FF range
	ld	d,a

OK, it isn't actually the same "and" trick. It would be if we used $0000 to $7FFF as our buffer but luckily we can get away with $8000 to $FFFF using "or".

The cycles used (7 + 6 + 7 + 4 + 4) add up to 28. The same work has to be done reading from the buffer so right off we're using 56 cycles from our 129 cycle budget. That's quite a bit for just moving data around, but it doesn't break the bank. And after thinking about it for some time I came up with something a little more efficient:

	ld	(de),a
	inc	de
	set	7,d		; set bit 7 of D

That gets the cycle count down to 7 + 6 + 8, a mere 21 which takes away only 42 from our budget. Hurray for optimization!
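For readers more at home in C, the two wrapping tricks can be put side by side. This is only a sketch for comparison: the struct and function names are my own, and the real player works directly on Z-80 registers rather than through anything like this.

```c
#include <stdint.h>

#define RING_SIZE 0x8000u            /* 32 KB, a power of two */
#define RING_MASK (RING_SIZE - 1u)   /* 0x7fff */

typedef struct {
    uint8_t  data[RING_SIZE];
    uint16_t write_idx;   /* next slot the disk side fills    */
    uint16_t read_idx;    /* next slot the player side drains */
} ring_t;

/* Array version: wrap the index with a bitwise "and". */
void ring_put(ring_t *r, uint8_t byte)
{
    r->data[r->write_idx] = byte;
    r->write_idx = (uint16_t)((r->write_idx + 1u) & RING_MASK);
}

uint8_t ring_get(ring_t *r)
{
    uint8_t byte = r->data[r->read_idx];
    r->read_idx = (uint16_t)((r->read_idx + 1u) & RING_MASK);
    return byte;
}

/* Pointer version: a 16-bit address held in $8000..$FFFF by forcing
   bit 15 high, which is what "set 7,d" does to the high byte. */
uint16_t next_ring_address(uint16_t addr)
{
    return (uint16_t)((addr + 1u) | 0x8000u);
}
```

Stepping past $FFFF overflows to $0000 and the forced high bit turns that into $8000, so the pointer wraps without a compare or a branch.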

I should note that at the same time I had been keeping an eye on the cost of those disk reads, audio update and graphics updates. Seeking commands and the setup of reads are pretty much a few "out" instructions and waiting for them to be acknowledged. Reading the bytes is the most expensive as the disk controller must be told to issue wait states before every read. It boils down to:

	ld	a,$c1
	out	($f4),a
	in	a,($f3)

Uses a total of 7 + 11 + 11 or 29 cycles.

Audio updates are quite simple. The core operation is to output a bit and shift the next one into place. If we can use the Z-80 A' register this will do it:

	ex	af,af'
	out	($90),a
	rrca
	ex	af,af'

Cycle count there is 4 + 11 + 4 + 4 = 23. Another possibility is having the C register pre-loaded with $90 and have our audio bits in B register:

	out	(c),b
	srl	b

That is a mere 12 + 8 = 20 cycles. A bit of a saving if we can afford to tie up the BC register.

Graphics updates are just a matter of writing to the screen. If HL points to the screen then the core operation is:

	ld	(hl),a
	inc	hl

Cycle count is 7 + 6 = 13. Tallying that all up, there are 42 cycles for the ring buffers, 29 cycles for disk operations, 23 (or perhaps 20) for audio and 13 for graphics for a total of 107. Huzzah! 22 cycles to spare.

22 cycles is not a lot, but seems like enough time to glue these little code fragments together and take care of any little details. Running a loop only takes 14 cycles (dec reg; JP cond,label).

But, of course, it isn't quite that simple. The problem is that my little picture of the program as an infinite loop misses out on the fact that the disk reading and graphics/audio updates are decoupled. If instead of a floppy disk we were reading from a CD or DVD it would be that easy. Those media are designed for streaming. Unlike a floppy, which has 40 concentric tracks that are rings on the diskette, CDs and DVDs have a single, spiral track like an old vinyl record. Once data starts coming in it will arrive at a nice fixed rate. Not so with a floppy where there are gaps between sectors and seeks between tracks and huge gaps when we switch to another drive.

A slightly more realistic loop is something like this:

  for (;;) {
    do whatever disk operation is currently required
    update audio and graphics
  }

In other words, instead of putting the disk operation directly in the main loop, I'd need to call out to a subroutine, which is 17 cycles for the CALL and 10 more cycles for the "RET" to return back. Now I'm over budget by 5 cycles. And I need to do something different than a CALL. What I need is an indirect CALL, something like "CALL (HL)" because I don't know what the disk is working on. Oh, and about those 5 cycles I'm over by? Well, that "for(;;)" loop needs a jump to run the infinite loop which means I'm really over by 15 cycles.
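The running tally is easier to follow laid out in one place. A small C sketch using the cycle counts quoted so far; the enum and function names are illustrative.

```c
/* Z-80 cycle costs quoted in the text for each step of the loop. */
enum {
    BUDGET      = 129,  /* cycles available per disk byte        */
    RING_BUFFER = 42,   /* one write plus one read, 21 each      */
    DISK_READ   = 29,
    AUDIO       = 23,   /* the EX AF,AF' version                 */
    GRAPHICS    = 13,
    CALL_RET    = 27,   /* CALL (17) + RET (10)                  */
    LOOP_JP     = 10    /* JP back to the top of the loop        */
};

/* Cycles left over; a negative result means over budget. */
int spare_cycles(int with_call, int with_loop_jump)
{
    int used = RING_BUFFER + DISK_READ + AUDIO + GRAPHICS;
    if (with_call)
        used += CALL_RET;
    if (with_loop_jump)
        used += LOOP_JP;
    return BUDGET - used;
}
```

The bare fragments leave 22 cycles, adding the CALL and RET puts the loop 5 over, and the loop jump makes it 15 over.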

Once again there is just one little and possibly insurmountable problem to solve before I can get this to work.

Threading

I put an awful lot of thought into figuring out how to interleave the audio+video display update and the disk reading. I can't help but think that I can do better than the solution I ultimately arrived at, but let's just confine ourselves to what I did rather than what I could have done.

Conceptually the program is really two processes or threads if you will. One thread reads from the floppy disk, the other thread updates the graphics and audio. The two threads communicate using a ring buffer. General purpose threads are a common programming abstraction used to give the illusion that you have many processors working on a problem simultaneously. However, you may only have a single processor switching rapidly between the different threads. That is certainly the case here with only one Z-80. The switch between threads is called a context switch and generally involves saving all the machine registers for the current thread, finding the next thread to execute, loading the saved registers for that thread and running it.

I need to switch between the two threads every 129 cycles. There's no way I can afford to save and restore registers. Instead I'll have to have the threads cooperate and share the registers between them. That is no big deal except for the program counter. I will absolutely have to save and restore the program counter when switching between the two.

Here is where we run into trouble. The Z-80 has only one way to save the program counter (PC) register and that is as a side effect of the CALL instruction which pushes it onto the stack. Or the RST instruction which is a faster CALL with a fixed destination. There are many instructions which load the PC but only two that do it indirectly: RET and "JP (HL)" (well, there is "JP (IX)" and "JP (IY)" but they're slower than "JP (HL)"). All well and good but hooking them together in various combinations leads to something slower than a CALL and RETURN pair which was already deemed too costly.

Then again, I don't have to save and restore the program counter, I only need to change it. I can use the HL and HL' as pseudo program counters. A context switch is a matter of switching to the other register set and jumping to HL:

	exx
	jp	(hl)

A very attractive 8 cycles, but there is the matter of updating the pseudo program counter. In order for the current thread to return to the next step of its processing it will need to load HL with the address of that step. A general purpose context switch looks like this:

	ld	hl,next_step
	exx
	jp	(hl)

This is more than I can afford at 18 cycles which is really 36 as there are two context switches every 129 cycles. That's 14 or maybe 11 cycles over, but at least the technique is fully fleshed-out -- it will work as stands. I thought I could make it work by squeezing some cycles out of the other steps. Even the context switch can be made to give up a few cycles. A full HL load isn't needed all of the time. A "LD L,N" will suffice if the next step is in the same page as the current one. That cuts out 3 cycles from each switch for a savings of 6. Heck, with a very sparse program organization I could use "INC H" to move to the next step. Doing that would save a whole 12 cycles which is enough. And in some cases we may not need to change HL at all. We could be looping over the same step. That idea really didn't pan out, though. Maintaining a loop was almost always more expensive than a full HL load.

While the symmetry of the context switch was appealing, I could never quite get it to work. Looking at it now, I'm not exactly sure why. General uncertainty that comes with new territory may have been confounding. Or perhaps there were too many states for "INC H" to work. Or maybe HL was too valuable a register to tie up.

There was another program counter loading technique I'd used before that I realized could help. I originally came up with the idea when I had a program that needed to do a series of steps quickly but the steps varied. Normally I'd unroll the steps into one long chunk of code but the varying foiled that and there were too many possible variants to unroll them all. I did have space to list all variants as a series of CALL's as that only requires 3 bytes per step and the steps themselves are reused subroutines. Nothing too special about that, but the time overhead was a problem. Each step took 27 cycles longer -- 17 for the CALL and 10 for the RET.

Considering the list of calls I thought there was a lot of unnecessary work there:

	call	part1
	call	part2
	call	part3
	call	part3
	call	part10

The first call saves the return address of the next call which itself is just saving the address of the next call. When a subroutine returns it always goes back to a call to the next one. You'd think there would be some way to pre-compute the calls somehow, some way to cut out the middle man.

Well, I thought, why bother calling "part1"? Just push the address of "part2" on to the stack and jump to "part1". When "part1" returns it will go immediately to "part2". Though when "part2" returns there's a problem. But we can simply repeat the trick and put "part3" on the stack and then "part2". This won't be any faster if we write code to do this, but if we know the sequence ahead of time this will do the trick:

	seq:	dw	part1
		dw	part2
		dw	part3
		dw	part3
		dw	part10
		dw	done
	;
		ld	sp,seq
		ret
	done:			; we come back here when all the "calls" have been made

There, no more calls. Each step only costs 10 more cycles for the RET instructions. I think this is a similar technique to what is called "return oriented programming" (http://en.wikipedia.org/wiki/Return-oriented_programming) but used for good.
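A loose C analogy may help show the shape of the idea. Here a table of function pointers stands in for the list of addresses and a pointer walking the table stands in for SP. The step functions and their bodies are invented for illustration, and C cannot reproduce the cycle savings because its dispatch loop still makes an indirect call per step.

```c
#include <stddef.h>

typedef void (*step_fn)(int *acc);

/* Hypothetical steps; the bodies are placeholders. */
void part1(int *acc)  { *acc += 1;  }
void part2(int *acc)  { *acc *= 10; }
void part3(int *acc)  { *acc += 3;  }
void part10(int *acc) { *acc *= 2;  }

/* The pre-computed sequence.  The NULL sentinel plays the role of "done". */
const step_fn seq[] = { part1, part2, part3, part3, part10, NULL };

/* Walk the table.  `sp` advances by itself after every step, just as
   RET pops the next address and moves the stack pointer along. */
int run_sequence(const step_fn *sp)
{
    int acc = 0;
    while (*sp != NULL)
        (*sp++)(&acc);
    return acc;
}
```

Running run_sequence(seq) applies the five steps in table order with no explicit list of calls in the code.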

The trick can't be applied to both threads, but certainly a context switch to one of the threads can be a single, 10 cycle "RET" instruction. This frees up HL which turns out to help the disk reading thread. Which is a little unlucky as the straight-line execution of the graphics and audio update works better with the technique. But if we need to loop back to a previous step all we need do is load up the stack pointer to that step. One way of looking at this is that SP (the Z-80 stack-pointer register) is being used as our pseudo program counter. But it has one big advantage over HL in that it automatically moves to the next step where we always had to be explicitly updating HL.

The "RET" context switch at 10 cycles is a little faster than even the possibly workable 12 cycle "INC H; EXX; JP (HL)". It will definitely be easier to use and it frees up one of the HL registers. Freeing HL helps considerably. The disk byte read and write into the ring buffer can now be expressed as:

	ld	a,$c1
	out	($f4),a
	ini
	set	7,h

At 7 + 11 + 16 + 8 = 42 cycles that is 8 cycles faster than the 21 + 29 = 50 of the individual steps.

There are a few cycles to be saved on the graphics and audio updates. If I can make assumptions about the data stream alignment then always forcing the 7th bit of D register high is not necessary. Neither is a full increment of DE needed all the time. We can get away with incrementing E register only as long as we know its value is not 255. Similarly the register pointing to the screen doesn't need a full increment every time. Sure, "INC BC" is only 2 cycles slower than "INC C", but every cycle counts!

At this point the code fragments had become large enough to make it all look eminently viable. I'd even got a pretty concrete idea of what the data format would look like. Most of the graphics and audio data would be broken into 8 byte chunks. The first 7 bytes are graphics which are written directly to the screen. Those bytes can be pulled from the ring buffer without any concern over setting the high bit of D register and only need increment E register because their address will always be 0, 1, 2, 3, 4, 5 or 6 modulo 8. The 8th byte is audio which contains the 8 audio bits needed to be output for itself and the next 7 graphics bytes.

I hadn't worked out what sort of padding would be needed to balance the input and output rates. What was clear is that some number of audio-only bytes would be added to each frame to achieve rate balance. When they are processed the graphics + audio update is only reading a byte from the ring buffer every 8 steps. Enough of those can be added to balance out the times when the disk reading thread is not bringing in bytes. For the record, I cannot tell you how many times I made mistakes in this padding. The padding itself has to come in chunks of 8 audio bytes to keep the chunks always aligned. Each audio byte itself is composed of 8 one bit audio samples. Many times I'd do a calculation based on a multiple of 8 audio samples rather than 64. Or thinking about 64 audio bytes rather than 8. Something about the two 8's that caused me to mix things up.
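The chunk layout and the 8 versus 64 bookkeeping can be sketched in C from the encoder's side. This is a guess at how data might be packed rather than the actual conversion tool. It assumes the first audio sample goes in bit 0, since the playback fragments shift right after each output, and every name here is hypothetical.

```c
#include <stdint.h>

enum {
    CHUNK_BYTES            = 8,  /* 7 graphics bytes + 1 audio byte */
    GFX_PER_CHUNK          = 7,
    SAMPLES_PER_AUDIO_BYTE = 8
};

/* Pack eight 1-bit samples into a byte, first sample in bit 0 (assumed). */
uint8_t pack_audio(const uint8_t samples[8])
{
    uint8_t b = 0;
    for (int i = 0; i < 8; i++)
        if (samples[i])
            b |= (uint8_t)(1u << i);
    return b;
}

/* Build one chunk: 7 graphics bytes followed by 1 audio byte. */
void build_chunk(uint8_t out[CHUNK_BYTES],
                 const uint8_t gfx[GFX_PER_CHUNK],
                 const uint8_t samples[8])
{
    for (int i = 0; i < GFX_PER_CHUNK; i++)
        out[i] = gfx[i];
    out[GFX_PER_CHUNK] = pack_audio(samples);
}

/* Audio samples covered by padding blocks of 8 audio-only bytes each. */
uint32_t padding_samples(uint32_t blocks)
{
    return blocks * CHUNK_BYTES * SAMPLES_PER_AUDIO_BYTE;
}
```

One padding block is 8 bytes but 64 samples, which is exactly the pair of numbers that is easy to mix up.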

Writing and Reading Data

With a prototype of the player ready it was time to write data to a diskette. Some of that work was already done when I was learning how to read data from the floppy. Although I've been discussing data transfer rates and other matters in terms of sector reads, I had decided to read data a track at a time. Why? For the usual reasons: greed and laziness. By reading a track at a time there would be no need for sectors and all their overhead. The considerable increase in the amount of data available per track translates into higher bandwidth and therefore a higher frame rate. It also means the player issues a single floppy controller command per track rather than multiple sector reads. A distinct benefit given the difficult programming environment of the disk thread.

Above all, it made the process of writing the data even easier. For various reasons I avoided using LDOS and its ability to write sectors. Since I was going to have to deal with the floppy controller at a low level when playing back the intro, it wouldn't be much extra effort to write code that puts movie data onto floppies. I also don't normally boot the 4P into LDOS or any other operating system. And in order to exert the necessary control over formatting I'd have to become an expert in using LDOS. It also wasn't clear how well LDOS supported track writes. The normal process of writing data to a floppy requires two steps. First all the tracks must be formatted with empty sectors by writing the entire track. Then individual sectors can be written as desired. LDOS does a track write for formatting but seemed like it wouldn't write an arbitrary track of data. If I wrote my own track write routine it would effectively do the format and data write in one step without even having to figure out sector writing.

Track at a time operations come at a price. Track reads all start at the same rotational position on the disk -- when the index hole comes around. Therefore it isn't possible to read an entire track, seek to a new one and start reading as the index hole will be missed. It would function, but it would take an entire revolution before the next track came in. Instead one must read most of the track then seek to the next with enough time to spare to start reading the next.

A formatted double-density diskette with 18 sectors of 256 bytes each per track can store 4,608 bytes of data per track. At the raw data rate of one byte every 32 microseconds a track can pack in 6,250 bytes in the 200 milliseconds of a revolution. Cutting off 6 milliseconds of that time due to the seek gives a theoretical capacity of a little over 6,000 bytes per track. Yowsa!
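The capacity figures above are simple enough to check by machine. This is a minimal sketch in Python, not a tool used in the project; the constant and function names are made up for illustration and the numbers are the ones quoted in the paragraph.

```python
# Figures quoted in the text; the names are illustrative only.
BYTE_TIME_US = 32       # one double-density byte every 32 microseconds
REVOLUTION_MS = 200     # 300 RPM means 200 ms per revolution
SEEK_MS = 6             # time given up to step to the next track
SECTORS = 18
SECTOR_BYTES = 256


def track_budget():
    """Return (formatted, raw, raw_minus_seek) bytes per track."""
    formatted = SECTORS * SECTOR_BYTES
    raw = REVOLUTION_MS * 1000 // BYTE_TIME_US
    raw_minus_seek = (REVOLUTION_MS - SEEK_MS) * 1000 // BYTE_TIME_US
    return formatted, raw, raw_minus_seek


if __name__ == "__main__":
    print(track_budget())
```

The last value works out to 6,062 bytes, in line with the round figure of 6,000.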

Reading the WD1773 floppy disk controller manual it became clear there was another slight downside to working with whole tracks. A track write mostly writes literal data to the track. But a few bytes instruct the controller to write special marks indicating the start of a sector, the CRC checksum of a track and one other special mark. As a consequence the movie data cannot contain $F5, $F6 or $F7. Not too much of a hardship. The graphics bytes are $80 through $BF so they can all be written. Audio bytes could contain those values. I can just flip a bit in those cases to avoid the illegal value. The very slight change won't affect the already noisy one bit audio.
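A minimal sketch of that bit flip in Python follows. Which bit actually gets flipped is not stated above, so the choice of bit 2 here is an assumption, as are the names. Bit 2 is convenient because it maps $F5, $F6 and $F7 to $F1, $F2 and $F3, none of which is forbidden. Flipping bit 0, by contrast, would turn $F6 into $F7.

```python
# Bytes the controller treats as commands during a double-density track write.
FORBIDDEN = {0xF5, 0xF6, 0xF7}


def sanitize_audio_byte(value):
    """Return value unchanged unless it is a forbidden byte.

    Assumption: bit 2 is the one flipped (F5->F1, F6->F2, F7->F3).
    One of the eight 1-bit samples in the byte changes state.
    """
    if value in FORBIDDEN:
        value ^= 0x04
    return value


def sanitize_track(data):
    """Apply the fix to every byte of a track image."""
    return bytes(sanitize_audio_byte(b) for b in data)
```

Graphics bytes pass through untouched since $80 through $BF never collide with the forbidden values.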

I finally had to come to grips with pinning down the disk data rate which will determine the frame rate and movie format. It pains me to relate the confusion that pervaded my calculations. Part of the difficulty was the assumption that I could choose the number of bytes per track which, naturally, I wanted to maximize. That changes the data rate which makes the frame format change. Add to this the confusion I mentioned earlier about the audio byte padding used to adjust the rate. I plead difficulties in doing this simultaneously while figuring out how to code the playback and trying to make it even possible. Recall that exploiting data alignment was critical in saving CPU cycles and meeting the 129 cycle limit. I also had made some estimates of frame rate which became promises I was unable to keep.

As it was the data rate issue was only finally resolved at about the same time the player was debugged. However, I'll outline the final parameters now in order to keep the discussion more concrete. One can write an equation for the effective frame rate based on the number of bytes in the frame and the number of bytes of "padding" audio. But because the audio samples can only be added in 8 byte chunks there are not very many possible frame sizes that can be supported by the disk data rate. I restricted the display to 112 x 48 (rather than the full 128 x 48) based on slightly flawed reasoning. A full frame is 1,024 bytes and the absolute limit of disk transfer rate is less than 30 KB/second so there was no way to do 30 frames/second with a full frame. Yes, despite the fact that a 112 x 48 screen requires 1,024 bytes of data once audio is included. I think I forgot to include the audio in my initial estimate. It may be just as well as the frame rate was going to be below 30 for sure.

Finally I looked at it from a simple angle. Each block of 8 audio bytes padded to the 1,024 byte frame would lower the frame rate and thus the graphics and audio data consumption rate. I just needed to try the few possible combinations until the data rate looked reasonable. Then I could set the number of bytes per track to exactly match. It turned out that adding 4 blocks of audio bytes gave the right balance. Each frame takes 1,024 steps plus 256 for the audio padding. That works out to 40,960 microseconds/frame or 24.41 frames/second. The total frame size is 1,056 bytes giving a data rate of 25,781 bytes/second. At 5 revolutions per second that means 5,156 bytes per track. A nicely conservative number but at least more than what a formatted track can hold.
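Trying "the few possible combinations" is easy to mechanize. The following Python sketch is not the calculation used at the time, and all names in it are invented for illustration. It assumes what the text states: each of the 1,024 frame bytes costs one 32 microsecond step, each padding byte costs 8 steps, and the drive turns 5 times per second.

```python
STEP_US = 32              # one player step per disk byte time
FRAME_BYTES = 1024        # 896 graphics bytes + 128 audio bytes
PAD_BLOCK_BYTES = 8       # padding comes in blocks of 8 audio bytes
STEPS_PER_PAD_BYTE = 8    # an audio-only byte lasts 8 steps
REVS_PER_SECOND = 5       # 300 RPM


def frame_params(pad_blocks):
    """Return (frame_us, fps, bytes_per_second, bytes_per_track)."""
    pad_bytes = pad_blocks * PAD_BLOCK_BYTES
    steps = FRAME_BYTES + pad_bytes * STEPS_PER_PAD_BYTE
    frame_us = steps * STEP_US
    fps = 1_000_000 / frame_us
    bytes_per_second = (FRAME_BYTES + pad_bytes) * fps
    bytes_per_track = bytes_per_second / REVS_PER_SECOND
    return frame_us, fps, bytes_per_second, bytes_per_track


if __name__ == "__main__":
    for blocks in range(0, 7):
        frame_us, fps, bps, bpt = frame_params(blocks)
        print(blocks, frame_us, round(fps, 2), int(bps), int(bpt))
```

With no padding at all the player would want 6,250 bytes per revolution, which is the entire raw track and leaves no time to seek. With 4 blocks the result is the 5,156 bytes per track quoted above.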

I put together some placeholder data by doing a straightforward conversion of the intro. I wasn't going to worry about quality until the player was known to be working. This involved writing a program for the 4P which reads data from the RS-232 and writes it as tracks onto a floppy. And another program for Windows that writes the track data to the RS-232.

The result? It worked perfectly! Ha ha, just kidding. It kind of worked, but the display was corrupted strangely:

One thing was a success. Hearing the floppy drive go "chunk chunk chunk" as it sped through the entire disk in under 8 seconds was great. I'd never heard a TRS-80 drive sound like that before. It was like the roar of a powerful engine being pushed to its limit. I was really making the machine work.

It was really odd that the display retained the rough shape of the images but it had impossible characters in it. I could see data bytes being dropped or even possibly duplicated, but it was showing characters that were simply not in the data stream.

I was mystified so I decided I should have some other program read the data to rule out having made some kind of mistake. Here LDOS's ability to read tracks looked quite attractive. I wrote up a program, booted LDOS and ran it. It did not seem to work, at all. After some puzzlement I looked at the LDOS source code (http://nemesis.lonestar.org/computers/tandy/software/os/logical_systems/lsdos6/src631/) and discovered that the floppy driver does not support track reading. I could only laugh; I was on my own.

I modified the 4P track writer and Windows track writer to do read back of track data. Sure enough, the track data I read back corresponded to what I was seeing on the 4P's screen. I couldn't blame the problem on some obscure timing difficulty associated with the streaming disk reader. Either I was making some kind of fundamental and systematic coding mistake or there was something else going on.

After a bunch of Googling about I came upon the answer: false syncs. My first mistake was thinking that track reads would return the raw data of a track. Not quite. The controller still has to get into byte synchronization. It will read the data a bit at a time until it finds a special pattern that marks byte sync. At that point it knows that every 8 bits is a byte. I should have known better as I wrote a sync byte at the start of the track. What was more surprising is that the sync byte pattern used in standard MFM recording was poorly chosen. Certain combinations of data bytes can generate this pattern. This isn't a problem when reading normal sector data because the controller knows it is reading data and ignores sync patterns until it has finished with the sector. Not so when reading track data, where the controller is always on the hunt for sync byte patterns. When it sees one it doesn't return it as data but simply starts reading bytes starting at the new alignment. This bit shift is where the "impossible" characters were coming from. These false syncs are pretty interesting and well described at http://info-coach.fr/atari/hardware/FD-Hard.php#False_Sync_Byte_Pattern. Apparently they are quite useful in copy protection schemes.

It was pretty easy to add an extra processing step in conversion of the movie data to track data. It computes the MFM pattern the disk will be writing and alters data bytes if they will produce a sync pattern. A little more complicated but in spirit the same issue as avoiding $F5, $F6 and $F7. Once again we can hope and probably depend on any of the graphical or audio noise introduced to be not a big problem.
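The detection half of that step can be sketched in a few lines of Python. This is not the converter's actual code. It assumes the controller reacts to the two standard MFM sync marks, $A1 and $C2 written with a missing clock bit, which appear on disk as the 16-bit patterns $4489 and $5224. It also assumes the data bit before the first byte is 0. The function names are invented.

```python
# Assumed on-disk sync patterns: A1 and C2 with a missing clock bit.
SYNC_PATTERNS = (0x4489, 0x5224)


def mfm_encode(data, prev_bit=0):
    """Return the MFM bit cells (clock, data, clock, data, ...) for data.

    A clock bit is 1 only between two 0 data bits.
    """
    cells = []
    for byte in data:
        for shift in range(7, -1, -1):
            bit = (byte >> shift) & 1
            cells.append(1 if (prev_bit == 0 and bit == 0) else 0)
            cells.append(bit)
            prev_bit = bit
    return cells


def find_false_syncs(data):
    """Return (cell_offset, pattern) for every sync pattern in the stream."""
    hits = []
    window = 0
    for pos, cell in enumerate(mfm_encode(data)):
        window = ((window << 1) | cell) & 0xFFFF
        if pos >= 15 and window in SYNC_PATTERNS:
            hits.append((pos - 15, window))
    return hits


if __name__ == "__main__":
    # $14 followed by a byte with its top bit set hides a $5224.
    print(find_false_syncs(bytes([0x14, 0x80])))
```

A hit at an odd cell offset means the pattern straddles the clock/data boundary, which is exactly the half-bit shift that scrambles everything read afterwards. A converter would alter one of the offending bytes and re-check until no hits remain.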

Now I could try switching between disks. I wrote data to a second floppy, made some fairly simple modifications to the player so it could stream from the second floppy and ran the new program. It worked perfectly! Ha, I'm kidding again. Playback was OK until it started reading from the second drive and the data seemed to be getting shifted and otherwise out of alignment.

I solved that problem fairly quickly. I had been worrying from the start that depending on a steady 129 cycles between bytes might be too much to expect from even a single drive. Variations in drive RPM are to be expected and could cause the program to miss data bytes. It seemed even more likely that little differences in the speed between drives would be a problem. I had, in fact, written the diskette used in drive 1 on drive 0. Writing the second disk on drive 1 fixed it. I really should have taken the time to extend the disk writing program at the start but foolish optimism ruled the day.

Things seemed to be working fairly well but I wasn't able to get a clean run. It seemed like I was missing bytes. I wrote a test movie which was an ASCII box around the screen which made it pretty clear that bytes were being dropped. I got rather confused looking at what was happening. Skipping over disk bytes was clearly the issue, but after a bit of unaligned playback the display would get back in sync again. I just could not imagine any way that bytes could be duplicated never mind just enough bytes to get back into alignment. In hindsight the answer was obvious. The track reading code might well miss some bytes when reading, but it would still read an exact number of bytes per track (the tracks were always padded to full length). So even if, say, 3 bytes of the expected data are missed it simply means that 3 of the padding bytes at the end will be read. After each track the data stream will be back in alignment. A track has about 5 frames worth of data so the bursts of misalignment last long enough to be quite noticeable.

It took me a while to get to the bottom of it. Eventually I reasoned that I must be taking too much time between disk reads. Yet all of the steps were within the time budget. I had even accounted for possible extra time due to wait states when writing to display memory. Nonetheless I decided to test the theory by taking my disk track reading utility and padding the byte read loop to take exactly 129 cycles. Sure enough, it started to lose bytes. Reassuringly, when I reduced it to 128 cycles there were no bytes lost. Fortunately the audio+video update routines were all padded by a decent number of cycles so it was easy if a little tedious to shave a cycle off of each step. Finally I had something that worked.

Funny thing is that, at the time, I thought I knew why I had gone wrong. In writing this up I realized I was wrong about how I was wrong. I don't know what error I made exactly, but I had re-figured the number of Z-80 cycles in 32 microseconds and came up with a number slightly over 128. Boy, did I feel dumb. Using more cycles than the numbers supported was sure not to work. I had gotten to the right conclusion based on faulty reasoning. I can't speak for the world, but I think this kind of thing happens a lot in programming and human endeavours in general. Though in my defense I'll point out that I wasn't working solely with equations. The empirical result from artificially slowing down the disk read was the real driving factor. The additional reasoning was to reassure myself that I'd solved the actual problem instead of papering over the real issue.

The real problem was wait states on output to port $F4 (i.e., "out ($f4)" instructions). Having just re-read the Model 4P service manual I found there were warnings about delays added to accessing that port on the order of 1 to 2 microseconds. Or 0.5 to 1 microseconds which applies to my particular Model 4P. Absurdly, I had already discovered these delays. When I ran into trouble reading a disk in drive 1 that had been written on drive 0 I wrote a program to test the drive speed to verify that drive speed variances were the issue. It gave some figures that seemed too far off the expected 300 RPM to be reasonable. I used a different technique that gave more reasonable numbers and concluded that "out ($f4)" was suffering wait states. I even measured those wait states to be somewhere in the range of 1 to 4.

Had I been a little bit more on the ball I'd have realized the implications for the playback code and taken steps to change it. But it wasn't my focus at the time and, besides, the playback code was just about working. I think I understand the situation now though there are still some questions that need investigation. At 128 cycles per step the code is coming in 1.76 cycles under budget. That's 0.434 microseconds which is a little under the minimum delay (according to the manual) of 0.5 microseconds. My measurements of wait states were quite variable but in some runs it would be 2 or 3 and in others it would be 3 or 4. Perhaps with regular access at fixed timing intervals the wait states can consistently hit 2 cycles or maybe less? In the playback code there are as many as 3 cycles available as video writes may have anywhere between 0 to 3 wait states added but I always ensure there are 4 to spare. That isn't true of the track reading code so it doesn't really explain why a 128 cycle budget works there.
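The budget numbers can be reproduced with a short Python sketch. The CPU clock of 4.05504 MHz is an assumption here. It is the figure implied by 129.76 cycles in 32 microseconds, and the names are invented for illustration.

```python
CPU_HZ = 4_055_040      # assumed Model 4P clock, implied by 129.76 cycles/byte
BYTE_TIME_US = 32       # one disk byte every 32 microseconds


def cycles_per_disk_byte():
    """Z-80 cycles that elapse while one disk byte arrives."""
    return CPU_HZ * BYTE_TIME_US / 1_000_000


def slack(step_cycles):
    """Return (cycles, microseconds) left over in each step."""
    spare_cycles = cycles_per_disk_byte() - step_cycles
    spare_us = spare_cycles * 1_000_000 / CPU_HZ
    return spare_cycles, spare_us


if __name__ == "__main__":
    for step in (128, 129):
        print(step, slack(step))
```

Under that assumption a 129 cycle step leaves under a fifth of a microsecond to absorb any port delay, while a 128 cycle step leaves the 0.434 microseconds mentioned above.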

At any rate, even though I couldn't explain exactly why it worked, clearly it did. Now I needed to feed it some better movie data.

Movie Conversion

The first step was to get the Tom Baker intro sequence off a DVD and into a file. I used VideoLAN (http://www.videolan.org/vlc/index.html) to do that. VideoLAN wasn't particularly easy to use in this regard, but after a bit of fiddling I got a "ribos1.mp4" movie file.

The next step was to convert the image data of the video into 128 x 48 monochrome. Here ffmpeg (https://www.ffmpeg.org/) was my tool of choice. I'd used it before and as a command line program it was easy to automate steps. After reading the documentation and doing the requisite Google searches I tried this as a conversion step:

  ffmpeg -i ribos1.mp4 -vf scale=128x48 -flags gray -pix_fmt monow out%03d.bmp

The command would scale the movie down to 128 x 48, convert it to grayscale and then to monochrome (black and white) and write the output as a series of individual images out000.bmp, out001.bmp, ... out733.bmp. The thumbnail image view in Windows file explorer makes it pretty easy to inspect the results. It turned out not too badly if the tiny images were viewed at 1:1 resolution, but scaled up to TRS-80 screen size they looked pretty bad.

I'd hoped it would be that easy but wasn't terribly surprised. Automated conversion of images to monochrome can work pretty well, but losing both colour and spatial resolution makes it hard to get something resembling the original image. Even fairly sophisticated dithering techniques (see http://en.wikipedia.org/wiki/Dither) weren't up to the task as they are pretty dependent on the pixels being small enough to blend together at least somewhat. The intro video is a bit of a worst case too as those "tunnel" effects turn into a fairly uniform lump of gray when scaled down. As this called for some experimentation, I dumped the full-resolution and colour frames:

  ffmpeg -i ribos1.mp4 out%03d.bmp

I knew the best way to convert the images: hand them over to an artist. Whether by re-drawing the sequence entirely or touching up automated conversions a human would produce outstanding results. Well, as long as they have some reasonable amount of artistic talent which I certainly do not. Nor did I have anybody I could coerce into doing the job. Nor was I willing to pay for it. Automated conversion would have to do.

I did have a pretty good idea what general conversion strategy would work. Enhancing edges would bring out detail that could survive a brutal threshold conversion to monochrome. Some experiments in GIMP (http://www.gimp.org/) with edge detection filters were promising but not quite there. I tried quite a number of GIMP's filters that seemed up to the job before settling on "cartoonify" with a large radius parameter to produce thick cartoon outlines.

The initial part of the intro with the "walls" streaming past was still a problem. Cartoonify preserved none of the detail leaving two wedges that didn't appear to move at all. For that part I used "posterize" to reduce the number of colours to 3 and then applied a bit of custom code to convert the colours into black or white while avoiding merging adjacent areas of the same colour into a single blob.

For the final conversion pass I had ffmpeg change the frame rate to 25 per second and crop the image slightly to get rid of the edges of the video that were blank. Come to think of it, it may have made sense to more aggressively crop the edges as the change would not be obvious and I'd effectively get higher resolution. The ffmpeg command was:

  ffmpeg -i ribos1.mp4 -vf crop=690:480:20:0 -r 25 tmp\frm%%03d.bmp

GIMP has the ability to run in a batch processing mode so I was able to write a script to do the posterizing and cartoonifying. I think it now supports Python but having used the Scheme scripting before I opted for that. It isn't too bad for doing the image manipulation, but it is terrible and practically undocumented when it comes to file operations like reading a directory. I wrote a Perl script that generates a Scheme script to process a set of images. Then the Perl script hands that generated script to GIMP for processing. I know it sounds complicated, but it was a lot easier that way.

Audio conversion was easier as I could afford to manually edit the single track as needed. The player outputs an audio sample every step. At 32 microseconds/step that works out to 31,250 Hz. Extracted using:

  ffmpeg -i ribos1.mp4 -ar 31250 -ac 1 ribos31.wav

As the player frame rate is not exactly 25 FPS the audio track was a bit too long even after trimming off some initial and trailing silence using Wavosaur (http://www.wavosaur.com/). My first thought was to use Audacity's (http://audacity.sourceforge.net/) "Effect -> Change Tempo" to speed up the audio while preserving the pitch. The difference in tempo was not subtle so instead I simply faded the audio out early.

There was still the matter of converting the audio from 16 bits/sample down to 1 bit/sample. I wrote a small C program to do the conversion. It outputs a normal .wav audio file having only two levels and a stream of bits suitable for playback on the TRS-80. I would play the .wav output on the PC to gauge the quality of the conversion. I've found that playback on the PC always sounds better than on the TRS-80 so while it is useful there's no substitute for trying it on the real hardware.

Converting audio to 1 bit is at least mathematically similar to converting graphics to monochrome. The first-order techniques like thresholding are the same and improvements can be had through dithering (see http://en.wikipedia.org/wiki/Dither). Yep, the same page I referenced when discussing graphics conversion.

I think the 31,250 Hz sampling rate is helpful as thresholding without any post-processing produces reasonable results. Watch the following video to get an idea what it sounds like. You'll notice some distortion and a whole lot of hissing. The hissing is less of an issue on the TRS-80 itself as the higher frequencies are lost.

Thresholding introduces noise into the audio because it cannot accurately represent the waveform. Each output sample will end up having quite a different value and that difference manifests itself as noise. That noise can be reduced by taking the error from each output sample and adding it on to the next sample before thresholding. As I understand it, this pushes the noise into higher frequencies which greatly reduces it. Here's a video that demonstrates the improvement. There's pretty much no hiss, but still the distortion and overall the sound is crackly.
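Both approaches fit in one short routine. The sketch below is in Python rather than the C of the real converter and is not that program. It shows plain thresholding and first-order error feedback on signed 16-bit samples. The packing order (first sample in the most significant bit) and all names are assumptions for illustration.

```python
HIGH, LOW = 32767, -32768   # the two output levels as 16-bit values


def to_one_bit(samples, diffuse_error=False):
    """Convert signed 16-bit samples to a list of 0/1 values.

    With diffuse_error set, the error made on each sample is added
    to the next one before it is thresholded.
    """
    bits = []
    error = 0
    for sample in samples:
        value = sample + error if diffuse_error else sample
        bit = 1 if value >= 0 else 0
        if diffuse_error:
            error = value - (HIGH if bit else LOW)
        bits.append(bit)
    return bits


def pack_bits(bits):
    """Pack 8 one-bit samples per byte, first sample in bit 7 (assumed order)."""
    padded = list(bits) + [0] * (-len(bits) % 8)
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)
```

A steady quiet level shows the difference. Thresholding turns eight samples of 8000 into eight 1s, which is full volume, while error feedback produces a mix of 1s and 0s whose average is close to the original level.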

Reviewing the conversion code I do wonder if my noise shaping has some bugs. I also tried applying rectangle and triangle filters to help matters but they didn't seem to have much effect. When it came down to choosing the audio track to use I went with thresholding mainly because it sounded the loudest. The audio had to compete with the rather loud sound of the disk drives constantly seeking. Here I would have really liked to have a plain old fast Model 4 which requires an external speaker and can be amplified.

Once the video and audio were converted to something the TRS-80 could handle it was easy enough to write a utility program to combine the two into a movie data file specifying the bytes to write to each track. A slightly more complicated program was needed to download the tracks over the RS-232 and write them all to floppy diskette.
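As a rough illustration of the video half of that utility, here is a Python sketch that turns a monochrome frame into TRS-80 graphics bytes. It is not the utility's code. It assumes the usual TRS-80 block graphics layout, where each character covers 2 x 3 pixels, bit 0 is the top-left pixel, bit 1 the top-right and so on down to bit 5 at the bottom-right, with $80 added to give the $80 through $BF range. The function name and the input format are invented.

```python
def frame_to_graphics(pixels):
    """Convert rows of 0/1 pixels into TRS-80 graphics character codes.

    pixels is a list of rows; the height must be a multiple of 3 and
    the width a multiple of 2.  Characters are emitted row by row.
    """
    height = len(pixels)
    width = len(pixels[0])
    assert height % 3 == 0 and width % 2 == 0
    out = bytearray()
    for top in range(0, height, 3):
        for left in range(0, width, 2):
            code = 0x80
            for dy in range(3):
                for dx in range(2):
                    if pixels[top + dy][left + dx]:
                        code |= 1 << (dy * 2 + dx)
            out.append(code)
    return bytes(out)


if __name__ == "__main__":
    blank = [[0] * 112 for _ in range(48)]
    print(len(frame_to_graphics(blank)))
```

A 112 x 48 frame comes out as 56 x 16 characters or 896 bytes. Add one audio byte for every 7 graphics bytes and the total is the 1,024 byte frame used by the player.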

The Player

Here's a pretty extensively commented version of the movie player. Click on play3.z to get the text file by itself if you're a mind to work on it. Otherwise, just read on.

;
; play3.z - dual task audio/video streaming from diskette
;
; Author: George Phillips
;
; This program can play back a short movie from several floppy diskettes.
; It outputs both video and audio and is known to handle 30 second clips.
; Playback time is theoretically unlimited as long as you're OK with
; constantly swapping floppies.
;
; Requires a TRS-80 Model 4 with 64 KB of no wait state RAM and dual floppies.
; These limitations could be checked but are not.
;
; The program will wait for a key to be pressed before beginning.  The user
; must have the first two data disks in the floppy drives and ready.  Once
; the light on the first drive light goes off the 3rd disk should be inserted.
; And the 4th disk when the second drive light goes off.  And so on if more
; floppies are needed.
;
; On any error the program jumps to 'err' which displays a single '!' in
; the top left corner.  Anyone modifying the program would do well to add
; register dumps at that point to assist debugging.
;
; There are two threads of control.  The disk thread reads data from
; the floppies and puts it into a ring buffer.  The output thread reads
; data from the ring buffer and outputs the video and audio.  The two
; threads are rate matched so only a small amount of buffering is needed
; to cover the times when the disk thread is switching between tracks or
; across floppies.  The rate matching is not perfect so that will ultimately
; limit how long the streaming can continue without problem.
;
; The threads are run in lockstep with each getting a fixed portion of a
; 128 cycle step.  A pre-built stack allows the output thread to pass
; control to the disk thread using the 'RET' instruction.  The disk thread
; passes control to the output thread using 'JP (HL)'.  The disk thread
; moves to its next step automatically but can jump to other steps by loading
; the stack pointer.  The output thread must load HL with a new value to
; choose a different step.  Both use unrolling to save time and keep the
; coding simpler.  The program itself is small but uses up considerable
; space as it unpacks this unrolled program code.
;
; The output thread has BC, DE and AF' for its use.  The disk thread can use
; BC', DE' and HL'.  Both may use AF but only within a step as the other
; thread may change it.  IX and IY are unused.
;
; The 128 cycle step was chosen because floppy disk bytes arrive every
; 129.76 cycles.  When reading bytes from the floppy the timing will be
; fixed by the floppy data rate as the Z-80 will be forced to wait until
; the data is ready.  During disk seeks and such the timing is maintained
; by ensuring the disk and output threads always use exactly their allotted
; time quanta.  Each step is padded with otherwise useless instructions
; in order to meet this restriction.  Macros and assert statements are used
; extensively to enforce these rules and ease the programming burden.
; The latest version of zmac is needed and recommended to assemble:
;	http://48k.ca/zmac.html
;
; The disk thread is given a 54 cycle quantum which leaves a 74 cycle
; quantum for the output thread.
;
; Approximate memory map:
;
;   $0800 - $37ff  12 KB "ret" stack for disk thread
;   $4000 - $7aff  14.75 KB unrolled code for output thread
;   $7b00 - $7fff  Main program startup code
;   $8000 - $ffff  32 KB ring buffer for disk data
;

;	1 byte audio, 7 bytes video - basic movie unit
;

stack	equ	$3800	; end of ~ 12 K for ret-controlled disk thread
audvid	equ	$4000	; unrolled audio/video display code (output thread)
ring	equ	$8000	; start of 32 KB disk input ring buffer

; Hard coded parameters from movie generator.
audblk	equ	4	; 64 bit (8 byte) audio blocks per frame
tdatln	equ	5156	; data bytes per track
framcnt	equ	684	; frames in movie
numdsk	equ	4	; diskettes to read

	org	$7b00
proglow:

; -------------------- Disk Thread --------------------------

; Various macros for construction of each step in the thread.
; Ultimately the "ret" stack is an array of addresses of each step to use
; in sequence.  To speed loading this "ret" stack is assembled as
; run-length encoded (RLE) data.  A control word with the high bit ($8000)
; set uses the lower 15 bits to record twice the number of times the
; following word is repeated.  All the other control words indicate the
; length of literal data to copy onto the "ret" stack.
;
; The RLE data for the "ret" stack is appended on to the end of the program
; assembly.  The macros will ORG to the program end, add some RLE data
; and then ORG back to where assembly is happening.

dskqnt	equ	54	; Disk thread quantum

; Fail assembly if the current cycle count is not exactly the disk quantum
dq_check macro
quant	defl	t($)
	assert	quant == dskqnt
	endm

; Working variables to track the construction of the disk thread stack.
stpnum	defl	0		; size of "ret" stack
stpbase	defl	stack_init	; pointer to literal data size count
stporg	defl	stack_init+2	; where to store next step address in block
stpcnt	defl	0		; number of steps in literal block

; Macro for starting the next step in the disk thread.  The step is called
; <name> and stpoff_<name> is defined to record the step number.  It adds
; the state to the current block of literal data under construction.

step	macro	name
stpoff_`name defl stpnum	; remember step number for goto
stpnum	defl	stpnum+2	; "ret" stack has another entry
stpcnt	defl	stpcnt+1	; one more step in the current literal block
	sett	0		; reset zmac's cycle counter
name:				; label the step
	org	stporg		; record address of step in RLE data
	dw	name
stporg	defl	$		; update RLE data pointer
	org	name		; go back to where we were assembling
	endm

; End the current block of literal data.

endlit	macro
	assert	stpcnt > 0	; fail if no steps in literal block
tmp	defl	$		; remember where we are
	org	stpbase
	dw	stpcnt*2	; record size of literal block
	org	tmp		; return to assembling where we were
stpcnt	defl	0		; no steps in literal data
stpbase	defl	stporg		; get ready for next
stporg	defl	stporg+2	; RLE data record
	endm

; Reuse a step.  The current step in the disk thread does not require new
; code but simply repeats a previously generated step.  No extra code is
; assembled, but the "ret" stack still grows by "count" words.

reuse	macro	name,count
	endlit
tmp	defl	$
	org	stpbase			; directly emit
	dw	$8000|((count)*2)	; 1 or more repeats
	dw	name			; of step 'name'
stpnum	defl	stpnum+(count)*2	; record "ret" stack size growth

stpbase	defl	$		; get ready for next
stporg	defl	$+2		; RLE data record
	org	tmp		; as you were
	endm

; Load the stack pointer so that the next step will be the one given.
; "goto" is a little misleading as it doesn't immediately transfer control
; but instead means control will transfer to "name" when the output thread
; returns back to the disk thread.
goto	macro	name
	ld	sp,stack_top+stpoff_`name
	endm

; Helper macros to get the timing right for conditional jumps.  The tricky bit
; is that each branch of a conditional jump must end up using the same number
; of cycles.  "baljp" records the number of cycles used in the current step.
; "tail" is used to record the cycles used if the conditional jump is not
; taken.  Then "tail_check" is called after the code at "label" to ensure that
; the cycle count is the same as the not taken case.

baljp	macro	cond,label
	jp	cond,label
bjt0	defl	t($)
	endm

tail	macro	name
bjt1	defl	t($)
case1	defl	bjt1 - bjt0
name:
	endm

tail_check macro
case2	defl	t($) - bjt1
	assert	case1 == case2
	endm

; -------------------- Disk Thread --------------------------

; Using the helper macros we can program the disk thread in a straight
; line fashion.  Instructions added for the purpose of padding timing
; are generally commented as "balance" instructions.
;
; The disk thread reads all the tracks from 0 to 39 off each drive in
; turn until "numdsk" floppies have been read.  The tracks themselves
; are raw data without sector markings save for a few sync bytes at the
; start followed by a 0 to indicate the start of the data.
;
; The entire track is not read, only the first "tdatln" data bytes.  This
; means I don't need (and don't get to) the NMI that signals the end of
; the track (though I do set it up as a vestige of original testing code).
; Moreover, it is critical to data throughput.  Track reads all start at
; the same physical rotation of the diskette (at the index hole).  Reading
; an entire track would mean waiting a full rotation before the next track
; could be read.  Instead we read most of the track and seek to the next
; one leaving enough time for the seek to happen (6 milliseconds says the
; manual) and for the head to settle and be able to read data.  I don't
; recall any recommended time to wait for settling.  Instead, the limits
; were determined by experiment and even a mere 6 ms or less is enough.
; I think mechanically the seeks are faster than required by the controller.
;
; The main program sets "drive" to 0, "track" to 0, loads HL' with the
; start of the ring buffer ("ring") and starts us as step "stream0".

; Prepare to read a track from drive 0.
; stream0 and strm0a modify the track reading code to
; select drive 0 access with and without wait states as necessary.

	step	stream0
	jp	$+3		; balance
	ld	a,$81		; balance
	ld	a,$81		; select drive 0 with no wait states
	ld	(dr0),a		; modify track
	ld	(dr1),a		; reading code
	jp	(hl)		; run output thread
	dq_check
; remember that we fall through to the next step
	step	strm0a
	ld	a,$c1		; balance
	ld	a,$c1		; select drive 0 with wait states
	ld	(drw0),a	; modify track
	ld	(drw1),a	; reading code
	goto	stream		; continue on at "step stream"
	jp	(hl)		; run output thread
	dq_check

; Prepare to read a track from drive 1.

	step	stream1
	jp	$+3		; balance
	ld	a,$82		; balance
	ld	a,$82		; select drive 1 with no wait states
	ld	(dr0),a		; modify track
	ld	(dr1),a		; reading code
	jp	(hl)		; run output thread
	dq_check
; remember that we fall through to the next step
	step	strm1a
	ld	a,$c2		; balance
	ld	a,$c2		; select drive 1 with wait states
	ld	(drw0),a	; modify track
	ld	(drw1),a	; reading code
	goto	stream		; continue on at "step stream"
	jp	(hl)		; run output thread
	dq_check

; Begin streaming data from the drive selected via self-modification.
	step	stream
	ld	a,$81
dr1	equ	$-1
	out	($f4),a		; select drive with no wait states
	ld	a,$08		; restore to track 0 without verify, 6 ms step
	out	($f0),a		; start the disk seek to track 0
	jp	$+3		; balance
	nop			; balance
	jp	(hl)		; run output thread (I won't say it again)
	dq_check

; Disk commands require a short amount of time to pass before the
; processor will get reliable status data from the controller.  This step
; does nothing but waste time and we'll be reusing it quite a bit.

	step	rest
	jp	$+3
	rept	10
	nop
	endm
	jp	(hl)
	dq_check

	reuse	rest,1		; Repeat that step, so 256 cycles pass

; Some of the most painful series of steps as we wait for the seek to
; track 0 to complete and check for any errors.  It'd be easier if we
; had more time to check multiple conditions at once but instead must
; break the process down into multiple steps.

	step	cw1
	in	a,($f0)		; get status of "restore" command
	exx			; switch to our working registers
	ld	b,a		; save status in B'
	exx
	and	1		; look at bit 0 of status
	baljp	nz,cont		; if bit 0 (busy) set then keep waiting
	goto	cwechk		; restore complete, check for error next
	jp	(hl)
	dq_check
	tail	cont
	goto	cw2		; check for timeout
	jp	(hl)
	tail_check

	step	cwechk
	ld	a,0		; balance
	nop			; balance
	exx
	ld	a,b		; get the status we saved
	exx
	and	~($20|4|2)	; track 0, head loaded and one other bit
	call	nz,err		; are expected/don't care, otherwise error!
	goto	nxttrk		; All OK, then start reading the track
	jp	(hl)
	dq_check

	step	cw2
	exx
	bit	7,b		; Did the restore operation time out?
	call	nz,err		; Yes, too bad.
	ld	bc,$f3		; Set data port ahead of time (and balance!)
	exx
	nop			; balance
	goto	cw1		; No timeout, keep polling the status
	jp	(hl)
	dq_check

; At this point either a restore or a track seek has completed.
; Therefore we issue the track read command.

	step	nxttrk
	jp	$+3		; balance
	nop			; balance
	ld	a,$e8		; read track, no settle
	out	($f0),a
	ld	a,$80
	out	($e4),a		; allow NMI
	jp	(hl)
	dq_check

; Wait for the track read to start sending data.
; No checking for errors when waiting for data.  Bit hard to
; get it fitting without using AF' and it occurs to me that
; I can't really afford missing a second step.
	step	wtdat
	in	a,($f0)		; get status
	bit	1,a		; any data yet?
	baljp	nz,havdat	; yes, go read it
	ld	a,0		; balance
	nop			; balance
	goto	wtdat		; no, keep waiting
	jp	(hl)
	dq_check
	tail	havdat
	in	a,($f3)		; get data (balances 'ld a,0; nop')
	goto	sync
	jp	(hl)
	tail_check

; Since we're pulling data, we don't have to balance, just hit the minimum.
	step	sync
	ld	a,$c1		; drive select, with waits
drw0	equ	$-1
	out	($f4),a
	in	a,($f3)		; Waits until track data is ready
	or	a
	jr	z,synced	; if taken we lose 7, gain 10 so will be OK
	goto	sync		; No zero byte?  Keep looking for it.
synced:	jp	(hl)
	dq_check

; Now we have the track data to read and place into the ring buffer.
; Disk data is coming so fast we must process a byte every step.  Thus
; the cycle count for this step defines the disk thread quantum.  The faster
; the better to give more time for processing graphics and audio data in
; the output thread.

	step	dskbyt
	exx
	ld	a,$c1
drw1	equ	$-1
	out	($f4),a		; Keep telling the drive we want to wait.
	ini			; Put byte from disk into ring buffer
	set	7,h		; Keep HL in the ring (i.e., $8000 - $ffff)
	exx
	jp	(hl)
	dq_check	; doesn't have to be equal, but it is our basis

	reuse	dskbyt,tdatln-1	; repeat byte reads for the rest of the track

; We've read enough of the data.  Stop NMI and wait states.
	step	datend
	ld	a,(0)		; balance
	nop			; balance
	xor	a
	out	($e4),a		; turn off NMI
	ld	a,$81
dr0	equ	$-1
	out	($f4),a		; turn off wait states (prob. not needed)
	jp	(hl)
	dq_check

; The track read is still active so cancel it.
	step	cancmd
	jp	$+3		; balance
	jp	$+3		; balance
	nop			; balance
	nop			; balance
	nop			; balance
	ld	a,$d0
	out	($f0),a		; terminate any commands in progress
	jp	(hl)
	dq_check

; The disk controller requires quite a long time before it will be
; willing to accept new commands.
	reuse	rest,26		; long wait (wasting stack)

; Now some pretty straightforward code to increment the track and determine
; if we should seek to a new track or move on to the next diskette.
	step	trk1
	ld	a,(0)		; balance
	ld	a,(0)		; balance
	ld	a,0		; cute trick to save a few cycles
track	equ	$-1		; "track" is stored in the code.
	inc	a
	ld	(track),a	; track++ and, oops, out of time!
	jp	(hl)
	dq_check

	step	trk2
	jp	$+3		; balance
	ld	a,(track)
	cp	40		; at the end of the disk?
	baljp	c,trkok	
	goto	dsknxt		; yes? move to the next drive
	jp	(hl)
	dq_check
	tail	trkok
	goto	seek		; no? head off to seek to next track
	jp	(hl)
	tail_check

; We've just finished reading an entire diskette.  Reset some variables
; and figure out which drive to start reading next (or stop).
	step	dsknxt
	ld	a,r		; balance
	xor	a
	ld	(track),a	; set track counter back to 0
	ld	a,0
drive	equ	$-1
	inc	a
	ld	(drive),a	; drive++ and, oops, out of time!
	jp	(hl)
	dq_check

	step	dn2
	jp	$+3		; balance
	ld	a,(drive)
	cp	numdsk		; Have we read all the diskettes?
	baljp	c,dskok
	goto	drain		; Yes?  We can rest easy.
	jp	(hl)
	dq_check
	tail	dskok
	goto	seldsk		; No, figure out which drive next
	jp	(hl)
	tail_check

	step	seldsk
	jp	$+3		; balance
	ld	a,(drive)	; disk number mod 2 is the drive to read
	and	1
	baljp	z,sel0
	goto	stream1		; == 1 then go read drive 1
	jp	(hl)
	dq_check
	tail	sel0
	goto	stream0		; == 0 then go read drive 0
	jp	(hl)
	tail_check

; Issue command to seek to the next track.  If you've read the rest then
; the drill should be well known by now.  Send command, wait, wait for
; ready and check for errors.
;
; In other words, the "restore" sequence was so thoroughly commented that I'll not
; say too much here.

	step	seek
	nop			; balance
	nop			; balance
	ld	a,(track)
	out	($f3),a		; set track to seek to
	ld	a,$18		; seek, motor on, no verify
	out	($f0),a
	jp	(hl)
	dq_check

	reuse	rest,2		; short delay

	step	cs1
	in	a,($f0)
	exx
	ld	b,a
	exx
	and	1
	baljp	nz,cont1
	goto	csechk		; Ready? Then go check for errors.
	jp	(hl)
	dq_check
	tail	cont1		; Not Ready?  Then go check for timeout.
	goto	cs2
	jp	(hl)
	tail_check

	step	csechk
	ld	a,0		; balance
	nop			; balance
	exx
	ld	a,b
	exx
	and	~$20		; Ignore "head loaded" bit.
	call	nz,err		; Anything else set is an error.
	goto	nxttrk		; Otherwise we're ready to read the track
	jp	(hl)
	dq_check

	step	cs2
	jp	$+3		; balance
	nop			; balance
	exx
	bit	7,b
	exx
	call	nz,err		; Error if command timed out.
	goto	cs1
	jp	(hl)
	dq_check

; We've read all the data required.  Nothing for us to do but use up
; our time quantum.  The output thread is responsible for deciding when
; all the data has been processed.

	step	drain
	jp	$+3		; balance
	jp	$+3		; balance
	jp	$+3		; balance
	jp	$+3		; balance
	goto	drain
	jp	(hl)
	dq_check

; Now a bit of bookkeeping.  A call to "endlit" to finish the last block
; of literal RLE data.  Then lay down a 0 control record to mark the
; end of the RLE data stream.  And do an assert to ensure we are not
; assembling code into the ring buffer.  Which, actually, could be
; survivable but we unpack the data on each run to keep things simple.
; Besides, if an error occurs it uses "call" which will wipe out a bit
; of the "ret" stack.

	endlit			; finish last set of literals
tmp	defl	$
	org	stpbase
	dw	0		; emit 0 length to end run-length encoding
	assert	$ <= ring
	org	tmp

; -------------------- Output Thread --------------------------

; The output thread must first wait for the ring buffer to fill up so
; that it does not overrun the disk data.  Then it just displays frames
; until it has done "framcnt" of them and that's it.  The biggest requirement is
; sending an audio bit each step.  With approximately 128 cycles per step
; that gives an audio bit rate of 31250 Hz.  Given the quality of one bit
; audio there wouldn't be terrible harm in missing the odd step.  On the
; other hand, it is very easy to generate audio artifacts and a systematic
; skip could easily generate a nasty 25 Hz buzz.
;
; The main program initializes BC to the top left of the display area,
; DE to the start of the ring buffer and "frame" to "framcnt" and starts
; at "fillwt".
;
; The threading overhead isn't as complicated as the disk thread.  We
; just "RET" to pass control to the disk thread and "LD HL,code" to set
; the next output thread step to execute.  The main complication comes
; in the unrolling of the code.  There isn't enough time to maintain
; loop counters.  Instead we make copies of each step and link them
; together by modifying "LD HL,nn" instructions.  By following the
; convention of putting "LD HL,nn; ret" as the last instructions in each
; step the unrolling code always knows where to place the links.
;
; The movie data format is organized into two types of 8 byte blocks.
; A graphics block has 7 graphics characters followed by one byte
; (8 samples) of audio.  An audio block is 8 bytes (64 samples)
; of audio.  A frame consists of 128 graphics blocks (8 per line, 16 lines)
; followed by "audblk" audio blocks (as determined by the data converter).
; The graphics characters are arranged into a 56 x 16 array centered in the
; 64 x 16 text mode screen for a resolution of 112 x 48.  Timing is driven by
; the program and works out to about 24.4 frames/second.
;
; A block size of 8 divides our ring buffer size.  And the blocks all
; end in an audio byte.  This means that the ring buffer pointer DE only
; needs to be forced into the ring when we read an audio byte.  And 8
; also divides 256 so when reading a graphics byte we only have to
; increment E and not DE saving 2 more cycles.

shwqnt	equ	128-dskqnt	; Output thread quantum

; Use this macro at the end of a step to ensure the time taken is
; exactly that allocated to output thread steps.
sq_check macro
quant	defl	t($)
	assert	quant == shwqnt
	endm

; When we write to video we must leave 4 cycles free as a video write
; in 64x16 mode may incur as many as 4 wait states.  Any output thread
; step that writes to the screen uses this macro to ensure the time
; taken is the output thread quantum less the 4 cycles.
vwq_check macro
quant	defl	t($)
	assert	quant == shwqnt - 4
	endm

; Wait until the ring buffer is mostly full ($6000 bytes, currently).
	sett	0
fillwt:	ld	a,(0)		; balance
	jr	$+2		; balance
	exx
	ld	a,h		; peek at disk streaming pointer (HL')
	exx
	cp	high(ring + $6000)
	baljp	c,notyet
	ld	hl,audvid	; move on to audio/video output
	ret			; let the disk thread run
	sq_check
	tail	notyet		; not enough bytes, continue waiting
	ld	hl,fillwt	; unnecessary, but for balance
	ret			; let the disk thread run
	tail_check

; Code to read and display a graphics character.
; Now that we have finished waiting each step must output a sound sample.
; And these bits of code are copied to create unrolled code linked together
; by changing the address loaded by the "LD HL," instruction near the end.
; This is the most repeated step (56 * 16 == 896 times) so its size
; largely determines the size of the unrolled output thread code.
	sett	0
vidbyt:	ex	af,af'		; get audio samples
	rrca			; move to next 1 bit sample
	out	($90),a		; output audio bit
	ex	af,af'		; save audio samples
	ld	a,(de)		; get graphics byte
	inc	e		; we know E is never == 255
	ld	(bc),a		; write graphics to screen
	inc	c		; also C is never == 255
	ret	z		; balance (C cannot be 0)
	ld	hl,0
	ret
	vwq_check
vidbyt_len equ $ - vidbyt

; Code for reading the audio byte at the end of a graphics block.
; With no graphics to update we have plenty of time to do a fully
; general increment and modulus of the ring buffer pointer.
	sett	0
audbyt:	ld	a,(de)		; get data byte
	inc	de		; move to next position in ring
	set	7,d		; keep DE in ring ($8000 - $ffff)
	out	($90),a		; output audio bit
	ex	af,af'		; save it in AF'
	ld	l,(hl)		; balance
	add	hl,hl		; balance
	ld	hl,0		; link to next step in unrolling
	ret			; switch to disk thread
	sq_check
audbyt_len equ $ - audbyt

; The last byte in a block is always audio.  In a few cases special
; processing is required.  At the end of a line (after 8 graphics blocks)
; we must load BC with the start of the next line.  The very last
; audio byte of the audio blocks must load BC with the top of the
; screen.  And the last audio byte after the end of graphics data for
; the frame must load BC with loop counters so that the audio blocks
; can be output efficiently without having to unroll code for each
; audio bit.
;
; In short, sometimes we need to load a new audio byte and load BC
; with something.  This macro provides for that case.

m_audlbc macro	bcval,next
	ld	a,(de)
	inc	de
	set	7,d
	out	($90),a
	ex	af,af'
	bit	0,a		; balance
	ld	bc,bcval
	ld	hl,next
	ret
	endm

; Instantiate the BC loading variant of reading an audio byte for
; use by the unrolling code.
	sett	0
audlbc:	m_audlbc 0,0
	sq_check
audlbc_len equ $ - audlbc

; Instead of unrolling code for each audio bit in the audio blocks
; we go to the trouble of running a loop over 6 of the bits in each
; audio byte and over the audio bytes themselves. B is used as the
; bit counter, C as the byte counter.  The branches involved are
; always a bit painful to get right but the savings in unrolled code
; size is worth it.

; Output the first bit in an audio byte and load B with 6 so the
; next 6 bits of the byte are output with a loop.

audb7	macro
	sett	0
	ex	af,af'
	rrca
	out	($90),a
	ex	af,af'
	ld	b,6		; remaining bits
	ld	l,(hl)		; balance
	inc	hl		; balance
	add	hl,hl		; balance
	ld	hl,$+4
	ret
	sq_check
	endm

; Output an audio sample and loop on B.  If B != 0 then we repeat this
; step otherwise we move on to the next.

audblp	macro
	local	tobyt,me
	sett	0
me:	ex	af,af'
	rrca
	out	($90),a
	ex	af,af'
	add	hl,hl		; balance
	inc	hl		; balance
	dec	b
	baljp	z,tobyt		; any more bits in this byte?
	ld	hl,me		; yes? Then repeat this step.
	ret
	sq_check
	tail	tobyt
	ld	hl,$+4		; no? Move to the step just after us.
	ret
	tail_check
	endm

; Load an audio byte and loop on C for more bytes.  This step is used
; for audio bytes both in the middle of a block and at the end so it
; must do a fully general increment and modulus of the ring buffer
; pointer to guarantee it stays in the $8000 - $ffff range.

audonl	macro	back
	local	aodn
	sett	0
	ld	a,(de)
	inc	de
	set	7,d
	out	($90),a
	ex	af,af'
	dec	c
	baljp	z,aodn		; more audio bytes?
	nop			; balance
	ld	hl,back		; yes? loop back as directed
	ret
	sq_check
	tail	aodn
	nop			; balance
	ld	hl,$+4		; no? continue to the step after us
	ret
	tail_check
	endm

; Output of the last 7 bits of the last audio byte of the audio blocks
; is done specially so we can update the "frame" countdown and jump to
; the program end when it reaches 0.

; A few of the bits can be output simply.

audbit	macro
	sett	0
	ex	af,af'
	rrca
	out	($90),a
	scf			; balance (and helping audbf0)
	ret	nc		; balance
	ex	af,af'
	add	hl,hl		; balance
	add	hl,hl		; balance
	ld	hl,$+4
	ret
	sq_check
	endm

; For three of the bits we use the time to test the "frame" counter.
; As usual, this would be trivial if cycle balancing and budgets were
; not a factor.

; First step, load frame counter into BC and decrement it.

audbf0	macro
	sett	0
	ex	af,af'
	ret	nc		; balance (works because of scf in audbit)
	rrca
	out	($90),a
	ex	af,af'
	ld	bc,(frame)
	dec	bc
	ld	hl,$+4
	ret
	sq_check
	endm

; Second step.  Save the frame counter.

audbf1	macro
	sett	0
	ex	af,af'
	rrca
	out	($90),a
	ex	af,af'
	ld	(frame),bc
	add	hl,hl		; balance
	ld	hl,$+4
	ret
	sq_check
	endm

; Third step.  Test the frame counter for zero and exit the thread if so.

audbf3	macro
	sett	0
	ex	af,af'
	rrca
	out	($90),a
	ex	af,af'
	ld	a,b
	or	c
	jp	z,done		; We're done if frame count == 0
	ld	l,(hl)		; balance
	inc	hl		; balance
	ld	hl,$+4
	ret
	sq_check
	endm

; To process the audio blocks in a loop we need to calculate the number
; of audio bytes, which is simply the number of blocks times 8 and is assigned
; to "extra".  I've left the original code in place, but you can see that
; the first two asserts can never fail and the calculation is rather
; convoluted.

extraT	equ	audblk*64
	assert	(extraT % 8) == 0
extra	equ	extraT / 8
	assert	extra % 8 == 0
	assert	extra > 1	; We insist on having at least 2 audio bytes

; The code to handle the audio blocks.  The unrolling code will figure out
; the number of audio bytes less 1 and load that into C register in the
; step before us.  This chunk of code handles looping over C bytes.  If
; there are more than 256 bytes then we unroll this loop as many times as
; necessary to handle the 256 bytes.  To be honest, I don't think I've
; tested that case.

aud_only:
	rept	(extra-1+255)/256
	local	loop
loop:	audb7			; Handle first bit, loading B with 6
	audblp			; loop over the 6 bits using B
	audonl	loop		; loop over the bytes using C
	endm

; Now we have that very last audio byte of the frame where we check
; the frame count.

	rept	3
	audbit			; 3 easy bits.
	endm

	audbf0			; decrement "frame"
	audbf1			; save it
	audbf3			; if "frame" == 0 then we're done.

; Output last audio bit, load BC with the top of the screen and
; go back to the first step of a frame (the audvid unrolled code buffer).

	m_audlbc 15360+4,audvid

frame:	dw	0	; frame number countdown to detect end of movie

; -------------- Main Program Subroutines -------------------------

; Restore drive selected by D (e.g., $81 for drive 0, $82 for drive 1)
; to track 0.  The straightforward code here that commands the disk
; controller might help make sense of the Data Thread which does pretty
; much the same things but in a much more convoluted style.

restore:
	ld	a,$d0
	out	($f0),a		; terminate any commands in progress
	ld	b,0
	djnz	$
	ld	a,d
	out	($f4),a		; select drive
	ld	a,$08		; restore, no verify, 6 ms step
	out	($f0),a
	call	stat		; get command status result.
	; Seems that '4' may be OK (TR00 indication)
	; As perhaps '2'
	; And head loaded ($20) is fine
	and	~($20|4|2)
	ret

; Wait for and return disk controller command result status.

stat:	ld	b,$12		; wait for disk controller
	djnz	$		; to be ready for answer
.wst	in	a,($f0)		; read disk command status
	bit	0,a
	ret	z		; return if not busy
	bit	7,a
	ret	nz		; return if not ready
	jr	.wst		; wait until not busy or not ready

; Non-maskable interrupt is vectored here.  Because we never read a track
; to completion it should never happen and is treated as an error condition.

nmi:	ld	a,'N'
	ld	(15360),a
	xor	a
	out	($e4),a		; turn off NMI

	in	a,($f0)
	or	a
	jp	err		; say where it happened

; Copy C bytes from HL to DE and put the resulting DE value into the
; copied block as a link to the next block.  In other words, a subroutine
; perfect for copying output thread steps into the unrolled buffer.

copy:	ld	b,0
	ldir			; copy audio/video program step
	push	de
	pop	ix
	ld	(ix-2),d	; link to next step following immediately
	ld	(ix-3),e
	ret

; Unroll 7 graphics byte output steps into the unroll buffer.

vid7:	ld	a,7
block:	ld	hl,vidbyt
	ld	c,vidbyt_len
	call	copy
	dec	a
	jr	nz,block
	ret
rowcnt:	defb	0
colcnt:	defb	0

; ------------------ Main Program -----------------------------

; Besides a minor bit of hardware setup, the main program must RLE
; uncompress the "ret" stack for the disk thread and unroll the
; audio and video output code for the output thread.

start:	di
	im	1

; Choose memory map 1 which has 64K of RAM with keyboard and video
; mapped to the customary Model I and III locations.
	ld	a,1
	out	($84),a
; Switch out the Model 4P boot ROM.
	ld	a,0
	out	($9c),a

	ld	a,$48
	out	($ec),a		; fast CPU + wingdings

; Vector Non Maskable Interrupt to our own handler.  We should never get
; an NMI and shouldn't really ask for it, but it can act as a minor error
; check for the programmer.
	ld	a,$c3		; Z-80 "JP" instruction opcode.
	ld	($66),a
	ld	hl,nmi
	ld	($67),hl	; JP target is our NMI handler

; We loop back here when the movie display is done.

done:
	di			; No interrupts, please!
	ld	sp,stack	; A little stack space while we set up

	ld	hl,15360
	ld	de,15360+1
	ld	bc,1024-1
	ld	(hl),128
	ldir			; Clears the screen.

; Wait for any key to be pressed

wk:	ld	a,($38ff)
	or	a
	jr	z,wk

; Unroll the output thread loop
	ld	hl,15360+4	; Start address of first line of graphics
	ld	de,64		; Offset to next line of text
	exx

	ld	de,audvid	; Output thread unroll buffer

	ld	a,16
row:	ld	(rowcnt),a
	ld	a,7
col:	ld	(colcnt),a
	call	vid7		; Unroll 7 steps for graphics bytes
	ld	hl,audbyt
	ld	c,audbyt_len
	call	copy		; Unroll audio byte step
	ld	a,(colcnt)
	dec	a
	jr	nz,col		; Do 7 of the 8 graphics blocks in a line

	call	vid7		; Graphics bytes for last block

	ld	hl,audlbc
	ld	c,audlbc_len
	call	copy		; Unroll special BC loading audio byte step

	exx
	add	hl,de		; next line address
	ld	(ix-5),h	; B: modify "LD BC," instruction to
	ld	(ix-6),l	; C: have BC get address of next line
	exx

	ld	a,(rowcnt)
	dec	a
	jr	nz,row		; Do all 16 rows.

; We don't need the address of the next line.  Instead we're linking to
; the code that outputs the audio bytes en masse.  The number of audio
; bytes was computed previously as "extra".  We set up C with that less
; one as the very last byte is unrolled in "aud_only" and handles
; frame count checking and moving back to the top of the screen.
	;ld	(ix-5),6		; B (not needed)
	ld	(ix-6),low(extra-1)	; C
	ld	(ix-2),high aud_only
	ld	(ix-3),low aud_only

; This is a debugging check that could be removed.  It guarantees that
; the unrolled output thread code does not run into the start of the
; program.

	ld	hl,proglow
	or	a
	sbc	hl,de
	call	c,err
;

	ld	d,$81
	call	restore		; Restore drive 0 to track 0
	call	nz,err
	ld	d,$82
	call	restore		; Restore drive 1 to track 0
	call	nz,err

; Unpack the RLE data of the "ret" stack for the disk thread.
; We're going to overwrite our stack so no subroutine calls
; from here on out, please.  Or PUSH!  POP could be OK.

	ld	hl,stack_init	; pointer to RLE data
stack_top equ	stack-stpnum
	ld	de,stack_top	; destination pointer is the "ret" stack

silp:	ld	c,(hl)
	inc	hl
	ld	b,(hl)
	inc	hl
	ld	a,b
	or	c
	jr	z,sidn		; RLE code 0 means end of data
	bit	7,b
	jr	nz,sirun	; High bit set means a run.
	ldir			; Otherwise, literal data, copy it
	jr	silp		; and keep uncompressing
sirun:	res	7,b		; Clear repeat bit to get count in BC
	ldi			; must have two bytes to copy
	ldi
	jp	po,silp		; only 2 bytes?  Then we're done.
	ld	(hlsv+1),hl	; save HL (note, PUSH is not safe right now)
	ld	hl,-2		; set up overlapping LDIR
	add	hl,de
	ldir			; to copy out the repeats
hlsv:	ld	hl,0		; restore our RLE data pointer
	jr	silp
sidn:

; BTW, that "jp po," has got to be one of my favorite bits of Z-80 programming.
; It is not often that you get to use the flags set by LDI!

; Wait for all the keys to be released.

wtup:	ld	a,($38ff)
	or	a
	jr	nz,wtup

	xor	a
	ld	(drive),a	; Disk thread to start on disk 0 in drive 0
	ld	hl,framcnt	; Number of frames in movie
	ld	(frame),hl	; for the output thread to count down.

	; Initialize registers for output thread
	ld	bc,15360+4	; Top left of display for centering 56 wide.
	ld	de,ring		; Ring buffer read pointer
	ld	hl,fillwt	; first step in output thread
	xor	a
	ex	af,af'		; clear first audio output

	; Initialize registers for disk thread
	exx
	ld	hl,ring		; Ring buffer write pointer
	exx

	goto	stream0		; first step in disk thread
	ret			; and send it off and running

; Called when an error occurs.  A jump would be fine but I do a call so
; that the place where the error happened can be reported.  But that
; diagnostic code isn't necessary for a demo.
err:	pop	hl
	ld	a,'!'
	ld	(15361),a
	jp	wk

stack_init:

	end	start
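
The RLE format consumed by the unpack loop at "silp" can be checked off-target. What follows is a minimal Python sketch of that format as read from the listing, not part of the program itself. The name rle_unpack and the sample bytes are made up for illustration, and it assumes run counts are even and at least 2, which is what the pair of LDI instructions requires.

```python
def rle_unpack(data: bytes) -> bytes:
    """Host-side model of the "silp" loop (hypothetical helper)."""
    out = bytearray()
    pos = 0
    while True:
        # Each record starts with a 16-bit little-endian control word.
        control = data[pos] | (data[pos + 1] << 8)
        pos += 2
        if control == 0:                 # 0 marks the end of the RLE stream
            return bytes(out)
        if control & 0x8000:             # high bit set: a run
            count = control & 0x7FFF     # total number of bytes produced
            pair = data[pos:pos + 2]     # two-byte pattern (one step address)
            pos += 2
            for i in range(count):       # same result as LDI, LDI, overlapping LDIR
                out.append(pair[i % 2])
        else:                            # literal block: copy "control" bytes
            out += data[pos:pos + control]
            pos += control


# Made-up sample: one literal word, then a run of three copies of $5678.
packed = bytes([0x02, 0x00, 0x34, 0x12,
                0x06, 0x80, 0x78, 0x56,
                0x00, 0x00])
unpacked = rle_unpack(packed)
print(unpacked.hex())
```

A run record costs 4 bytes no matter how many times the address repeats, which is why the long "reuse" sequences in the disk thread stay cheap in the assembled image.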

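
The frame rate quoted in the output thread comments follows from counting steps, since every step emits exactly one audio bit. Here is a rough Python sketch of that arithmetic. It assumes a nominal 4 MHz clock and 128 cycles per step, takes the 128 graphics blocks per frame from the unrolling loop (16 rows of 8 blocks), and uses audblk = 4 purely as a hypothetical value because the real one comes from the data converter and is not shown in the listing.

```python
GRAPHICS_BLOCKS = 16 * 8         # 16 text rows, 8 graphics blocks per row
STEPS_PER_GRAPHICS_BLOCK = 8     # 7 graphics bytes + 1 audio byte load
STEPS_PER_AUDIO_BLOCK = 64       # 8 audio bytes, one step per bit
BLOCK_BYTES = 8


def frame_stats(audblk, clock_hz=4_000_000, cycles_per_step=128):
    """Return (steps, bytes, audio rate in Hz, frames/second) for one frame."""
    steps = (GRAPHICS_BLOCKS * STEPS_PER_GRAPHICS_BLOCK
             + audblk * STEPS_PER_AUDIO_BLOCK)
    frame_bytes = (GRAPHICS_BLOCKS + audblk) * BLOCK_BYTES
    audio_hz = clock_hz / cycles_per_step   # one audio bit per step
    fps = audio_hz / steps
    return steps, frame_bytes, audio_hz, fps


# audblk=4 is an assumed value chosen for illustration only.
steps, frame_bytes, audio_hz, fps = frame_stats(audblk=4)
print(steps, frame_bytes, audio_hz, round(fps, 2))
```

With those assumptions a frame is 1280 steps, which lands on the "about 24.4 frames/second" figure. Changing audblk trades frame rate against audio padding in units of 64 steps.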

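
The claim that a graphics byte only needs "inc e" while an audio byte needs the full "inc de; set 7,d" can be checked exhaustively on a PC. This is a small Python sketch under the assumption stated in the comments that the ring occupies $8000 - $ffff and that blocks are 8 byte aligned. The helper names are invented for the example.

```python
RING_BASE = 0x8000


def ring_next(ptr):
    """Model of "inc de; set 7,d": full increment, forced back into the ring."""
    return ((ptr + 1) & 0xFFFF) | RING_BASE


def inc_low_only(ptr):
    """Model of "inc e": only the low byte of the pointer changes."""
    return (ptr & 0xFF00) | ((ptr + 1) & 0xFF)


def check_fast_path():
    """Walk every 8-byte block and confirm the cheap increment is always safe."""
    checked = 0
    for block in range(RING_BASE, 0x10000, 8):
        ptr = block
        for _ in range(7):                 # the 7 graphics bytes of a block
            assert inc_low_only(ptr) == ring_next(ptr)
            ptr = inc_low_only(ptr)
            checked += 1
        assert ptr == block + 7            # now at the block's audio byte
    return checked


checked = check_fast_path()
print(checked, hex(ring_next(0xFFFF)))
```

The only place the pointer can cross a 256 byte page or wrap from $ffff is on the eighth byte of a block, and that byte is always read by a step with cycles to spare.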
George Phillips, February 27, 2015. george -at- 48k.ca