Double-wide Bug
I found a hardware bug in the TRS-80 Model 3 video circuitry. You may find this implausible due to the absurd simplicity of TRS-80 graphics and the age of the machine, but hear me out. It goes something like this:
- If the display is in double-wide mode (32 x 16 characters) ...
- And the processor reads or writes video memory ...
- Near the time the video circuitry is reading video memory ...
- Then the next RAM access by the processor will be corrupted.
- In particular, it will be the bitwise AND of the desired RAM value and the video memory byte.
So a code sequence as simple as
    ld (15360),a
    jp somewhere
can go crazy because the next RAM access after the write to the screen is the fetch of the next instruction. Instead of executing a jp the processor might get a nop or any number of other instructions. Even a nop is bad news, as the processor would then execute the address of the jump as code!
Put another way, any access to video memory in double-wide mode will lead to disaster.
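To make the failure mode concrete, here is a minimal Python sketch of the corrupted fetch. It only models the rule described above and is not a measurement from real hardware. The helper name corrupted_fetch and the choice of a space (character 32) as the colliding screen byte are my assumptions for illustration; the opcodes are the standard Z-80 values for jp and nop.

```python
JP = 0xC3   # Z-80 opcode for "jp nn"
NOP = 0x00  # Z-80 opcode for "nop"

def corrupted_fetch(ram_byte, video_byte):
    """Model a collision: the CPU sees the AND of both bytes."""
    return ram_byte & video_byte

# Assumption: the video circuitry happens to be reading a space (32)
# at the moment the CPU fetches the jp opcode from RAM.
SPACE = 0x20
seen = corrupted_fetch(JP, SPACE)
print(hex(seen))  # 0x0
```

In this model a screen full of spaces turns the jp into a nop, after which the two address bytes of the jump would be executed as instructions.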
Impossible!
It does seem a little unlikely, doesn't it? Sure, double-wide mode wasn't used in a lot of software, but it was far from unknown. And you could even list entire programs in double-wide mode in BASIC without a crash. It's a big enough bug that surely it would be common knowledge by now.
Well, for a start you can try it out for yourself. Download bad32.zip and run it on your Model 3. A real Model 3, not one of your new- or even old-fangled emulators. Please! There is some minuscule chance this is all just a problem with my computer. I really don't think so, but replicating the bug elsewhere is the only way to be sure.
The safe money says you'll see the bug. Keep in mind that this bug only affects the next RAM access after the CPU's video memory access. It does not bother programs that run in ROM. No BASIC program would ever crash due to this bug as it will always be doing video reads and writes from code in ROM. Most programs that do wide character output use the ROM subroutines and thus are safe.
The only program I've seen so far that does double-wide itself is Star Castle when it displays "The Cornsoft Group presents...". Tellingly, it never accesses video memory when the screen is in double-wide mode. The code first writes the message, then it switches into double-wide mode, switches out of it and erases the screen.
In short, the bug is believable because most programs that use double-wide mode are safe as they use the ROM and those that don't are careful to avoid the bug. Note also that the Model 1 does not suffer from this problem. Was the bug fixed in the Model 4? Sadly, I don't have one to test; let me know if you do.
By the way, I found this bug when I experimented with beam synchronization in double-wide mode.
But why?
I don't know exactly what's going on, but it is clear enough that a byte from RAM and a byte from video memory end up on the data bus at the same time. That's just not supposed to happen. Each of the 8 bits from both sources wants to get its value put onto the bus. The one bits do so by putting 5 volts onto the wire. The zero bits ground the wire to give it a 0 volt level. When two zero bits collide you get double grounding which is still 0. Two 1 bits just doubly assert the 5 volts or maybe put it to 10 volts? (I'm no digital electronics expert). Either way it is interpreted as a 1 on the bus. The combination of a 1 and a 0 leads to a 0 as the grounding of the bus line drains away the 5 volts from the 1. Thus the net effect is to AND the two values together.
I've looked at the video schematic quite a bit but just don't see what could cause it to behave incorrectly. Some day I'll have to run a simulator or get a logic probe and figure out what is the source of the bug.
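The wire-by-wire argument for why the result is an AND can be checked with a small Python sketch. It assumes, as the explanation above does, that ground always wins when a 1 and a 0 share a wire. The function names bus_line and bus_byte are made up for this illustration and say nothing about the real circuit.

```python
def bus_line(drivers):
    """One bus wire: it reads 0 if any driver pulls it to ground."""
    return 0 if 0 in drivers else 1

def bus_byte(a, b):
    """Combine two bytes driven onto the same 8 wires, bit by bit."""
    result = 0
    for bit in range(8):
        level = bus_line([(a >> bit) & 1, (b >> bit) & 1])
        result |= level << bit
    return result

# The per-wire model agrees with a plain bitwise AND for every byte pair.
assert all(bus_byte(a, b) == (a & b) for a in range(256) for b in range(256))
```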
What else can go wrong?
My test program exploits the instruction change to observe when a collision occurs. However, there are many cases where the next memory access will not be an instruction fetch. I think the rule will still apply but it gets a little harder to predict or describe what will happen. For example, a ld hl,(16383) will load l correctly from video memory, but h will be the current video character AND memory location 16384 (if a collision occurs), because the Z-80 reads the low byte from the first address and the high byte from the next. ld hl,(15360) may not be a problem as the second access is another video memory access. Thus the load may work fine but the next instruction could go wrong. Similar shenanigans could arise with inc hl when hl points to video memory or ldd with hl pointing to video memory and de pointing to RAM.
The test program sees collisions on only a percentage of its many accesses. The exact circumstances leading to a collision are unknown but my best guess is they occur when the Z-80 takes video access away from the video display. A test with beam synchronization could verify this theory. At any rate, collisions are obviously not guaranteed but are very likely given any non-trivial video memory access pattern.
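The two ld hl examples can be sketched as a list of memory accesses in Python. This is only a model built on assumptions: a RAM access is treated as at risk when it comes immediately after a video memory access, refresh cycles are ignored, and the instruction is placed at a hypothetical address 0x8000. Video memory is taken as 15360 to 16383 with RAM starting at 16384.

```python
VIDEO_START, VIDEO_END = 0x3C00, 0x3FFF  # 15360..16383
RAM_START = 0x4000                        # 16384

def is_video(addr):
    return VIDEO_START <= addr <= VIDEO_END

def at_risk(accesses):
    """Return the RAM addresses accessed right after a video access."""
    risky = []
    for prev, cur in zip(accesses, accesses[1:]):
        if is_video(prev) and cur >= RAM_START:
            risky.append(cur)
    return risky

CODE = 0x8000  # hypothetical address of the instruction
# ld hl,(16383): opcode, two operand bytes, two data reads, next opcode
print(at_risk([CODE, CODE + 1, CODE + 2, 16383, 16384, CODE + 3]))  # [16384]
# ld hl,(15360): both data reads hit video memory
print(at_risk([CODE, CODE + 1, CODE + 2, 15360, 15361, CODE + 3]))  # [32771]
```

In the first case the data read at 16384 is exposed; in the second the load itself is clean and the exposure moves to the fetch of the following instruction at 0x8003 (32771).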
Are there workarounds?
You can always find some good ROM subroutine to do your video accesses. Or just put a nop after every instruction that accesses video memory. Since the opcode for nop is 0, no collision with a video memory value will change that. Alternatively, only work with a screen full of character 255 as that value will not alter another when ANDed with it. Actually, only the even addresses matter as the odd ones are skipped in double-wide mode (not that it helps).
If I'm right about the exact cause of collision, accessing video memory only in the vertical and horizontal blank times will be fine as well.
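Both AND-based workarounds come down to which screen bytes leave a given opcode untouched. A short Python sketch, with the helper name safe_video_bytes invented for this purpose, lists them under the assumption that a collision is a plain bitwise AND:

```python
def safe_video_bytes(opcode):
    """Video bytes that leave `opcode` unchanged when ANDed with it."""
    return [v for v in range(256) if (opcode & v) == opcode]

NOP = 0x00  # Z-80 opcode for "nop"
JP = 0xC3   # Z-80 opcode for "jp nn"

print(len(safe_video_bytes(NOP)))  # 256
print(len(safe_video_bytes(JP)))   # 16
print(all(255 in safe_video_bytes(op) for op in range(256)))  # True
```

A nop survives any screen contents, a jp survives only the 16 bytes that have all four of its one bits set, and 255 is safe for every opcode.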
About bad32
The bad32 test program exploits the bug by filling video memory with particular character values and then doing 200 video memory reads. The instruction after the read and the video memory data are set up such that a collision doesn't crash the program but changes the increment of the c register into an increment of the b register. If no collisions occur it will show C=200,B=0. But if there are collisions some of the c increments will be changed into b increments and the display will be something like C=164,B=36. The exact results are timing dependent.
The 4 different tests demonstrate that only even addresses matter and the ROM is not affected by the bug. If you hold down the space bar before selecting a test the screen fill pattern will be displayed until you let go of the space bar. I put that in to make doubly sure the test was functioning correctly.
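The counting trick can be imitated in Python. This is a rough simulation and not the bad32 source: the fill byte 0xF7, the collision rate and the random model are all assumptions chosen for illustration. What is solid is the opcode arithmetic, since inc c is 0x0C and inc b is 0x04, so any screen byte with bit 2 set and bit 3 clear turns one into the other.

```python
import random

INC_B = 0x04  # Z-80 opcode for "inc b"
INC_C = 0x0C  # Z-80 opcode for "inc c"

def run_test(fill_byte, reads=200, collision_rate=0.18, seed=1):
    """Count increments landing in c and in b after `reads` video reads."""
    rng = random.Random(seed)
    b = c = 0
    for _ in range(reads):
        opcode = INC_C
        if rng.random() < collision_rate:  # hypothetical collision model
            opcode &= fill_byte            # fetch is ANDed with the screen byte
        if opcode == INC_C:
            c += 1
        elif opcode == INC_B:
            b += 1
    return c, b

# Assumed fill byte: 0xF7 clears only bit 3, so 0x0C & 0xF7 == 0x04.
c, b = run_test(fill_byte=0xF7)
print(c + b)  # 200
```

The two counters always add up to the number of reads, and the share that ends up in b is the observed collision rate. Filling the screen with 255 in this model gives C=200,B=0 no matter how many collisions occur.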
That's all for now
I'll write another blog entry if further investigations take place or other information becomes available. There's still quite a lot I could determine with more subtle Z-80 test programs. Almost certainly I could get my emulator bug-compatible based on those experiments.
George Phillips, October 5, 2009, george -at- 48k.ca