Tail merging might be a performance hit on older CPU's, as jumps are expensive instructions on pre-486 CPU's. The Pentium's branch prediction hardware allows it to do a branch in one cycle if the target is in the I-cache, and a 486 only takes 3 cycles if the target is in its onboard cache. As always, let a profiler guide your actions here. If the code isn't in a hot spot, then the merging is a good thing. It may even allow some speed critical code to remain parked in a CPU's internal cache by bringing things a little closer together. As we saw in earlier chapters, keeping heavily used sections of code and data in the 486 and Pentium's onboard cache can provide major speed wins. If things get sufficiently tight, we may manage to pull some code back enough that a page of working set gets saved in a virtual memory system.
Use a little common sense here too. Tail merging the exits of a function that executes thousands of cycles before returning isn't going to hurt in any visible way, even if the profiler says 99% of the program's time is spent in that function. The cost of the tail merged exit is insignificant compared to the work going on in the rest of the function.
Let the purpose of the code guide your actions. Here's a real life war story:
While working on PC DOS 6.1 I decided to take a look at the PRINTER.SYS device driver to see if it could be slimmed down some. PRINTER.SYS was an important module for many customers in countries that work with code pages other than 437. PRINTER.SYS was fairly big as DOS device drivers go too. The previous DOS 5 version was in the 11.6K range when loaded. What's worse is that the customers most likely to use PRINTER.SYS are the same customers who are probably loading TSR's and drivers like KEYB.COM, NLSFUNC.EXE, and DISPLAY.SYS. These folks are getting hammered on memory compared to most users in the USA who won't need the national language support modules. Any memory relief thrown their way is usually appreciated.
Well, after doing a bunch of tail merging and a few other tricks, the PC DOS 6.1 version of PRINTER.SYS's resident size had dropped by over 900 bytes when compared to the DOS 5 version. Now it was down to around 10.7K resident. It's goodness when 900 bytes can be stripped out of a device driver that consumes that precious DOS memory below the 1M point. Some people would kill for an extra 900 bytes of memory.
Was the slimmer PRINTER.SYS in PC DOS 6.1 (and all subsequent versions of IBM's DOS) a little slower than the DOS 5 version? Probably, but I could never tell the difference and there have been no complaints from customers so far. Printers are so slow compared to CPU's that any difference just isn't going to be noticeable. Would you rather have that 900 bytes of low DOS memory back, or have a print job complete a couple of microseconds faster? I'd have to say that for 99.999% of the users of PRINTER.SYS, the answer to that question is a no brainer.
Let's get an idea of what jumps cost on some different CPU's so we can make rational decisions on the cost/benefit of using merging techniques. Here's a section of code designed to work within the context of Michael Abrash's Zen Timer.
;-------------------------------------------------------------
; A test program that can be run with Michael Abrash's Zen
; Timer to test what JMP instructions cost on different CPU's.
;-------------------------------------------------------------
        jmp     TestJMPs

;-------------------------------------------------------------
; IcacheFlush should cause the 8K internal cache on a 486 and
; the 8K I-cache on a Pentium to be purged of relevant data.
; On a 386, 286 or 8088 this function won't do anything.
;-------------------------------------------------------------
        align   16
IcacheFlush proc near
        db      8192 dup (90h)          ; 8K worth of NOP's
        ret
IcacheFlush endp

REPCOUNT = 800

DOJMP   macro
        local   target
        jmp     short target
        db      38 dup (0)
target:
        endm

TestJMPs label near
;------------------------------------------------------------
; Do jumps in a way that should be friendly to a 486's and
; Pentium's internal caches
;------------------------------------------------------------
        align   16
        call    IcacheFlush     ; Purge internal caches of anything useful
        call    ZtimerOn
        align   16
        rept    REPCOUNT
        jmp     $+2             ; The majority of these JMP's should run
                                ; out of the 486 or Pentium cache due to
                                ; the cache line loading.
        endm
        call    ZtimerOff
        call    ZTimerReport

;--------------------------------------------------------------
; Do jumps in a way that should be hostile to a 486 and
; Pentium's internal caches.
;--------------------------------------------------------------
        align   16
        call    IcacheFlush     ; Purge internal caches of anything useful
        call    ZtimerOn
        align   16
        rept    REPCOUNT
        DOJMP                   ; These JMP's should cause a cache line
                                ; load on both 486's and Pentium most times.
        endm
        call    ZtimerOff
The program does 800 jump instructions two different ways. The first batch of jumps is packed tightly together in memory. The second batch is scattered around. I ran the resultant test file on several test machines and got some interesting results. All timings are in microseconds from the Zen Timer.
                    First way    Second way
4.77mhz 8088          3027         3028
Uncached 386SX-25      353          387
Uncached 386DX-20      416          421
Uncached 486SX-25      165          548
60mhz Pentium           81          354
Isn't it amazing that the 60mhz Pentium slowed down to the same speed as a 386SX-25 when the jumps were causing misses on its I-cache! Equally amazing is that the 486SX-25 actually ran markedly slower than the clunky old 386SX when it was missing the jumps in its cache too!
LESSON: These newer CPU's are really, really, really dependent on getting hits in their on-chip caches. When you're in the cache you're humming. When you're not, you can be slogging along slower than some of the older generation CPU's without on-chip caches.
Suppose we've got two functions f1 and f2 that look like this:
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        pop     di
        pop     es
        pop     ds
        ret
f2      endp
There's a common 4 byte exit sequence that happens in each of these functions:
        pop     di
        pop     es
        pop     ds
        ret
A standard tail merge transformation on these two functions would look like this:
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        jmp     MergePoint1
f2      endp
There's nothing that says you can't merge into other merges as well. Suppose there's a third function, f3, involved here?
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        jmp     MergePoint1
f2      endp

f3      proc    near
        push    ds
        push    es
        push    di
        call    frob
        call    bar
        jmp     MergePoint1
f3      endp
We did the obvious tail merge on the pops and ret exit sequence, but there's more to be had. Notice that both f2 and f3 have this sequence in common now:
        call    bar
        jmp     MergePoint1
This lets us merge again with a second merge like this:
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
MergePoint2:
        call    bar
        jmp     MergePoint1
f2      endp

f3      proc    near
        push    ds
        push    es
        push    di
        call    frob
        jmp     MergePoint2
f3      endp
A straightforward unmerged version of the f1, f2, f3 functions would take up 33 bytes (18 one byte PUSH/POP's, 3 one byte RET's, and four 3 byte CALL's). The fully merged version of the f1, f2, f3 functions only takes up 26 bytes. Not bad, eh? The size of this code was chopped by around 20% and it does the exact same thing it did before the merging.
Often we'll see a situation similar to these two little functions:
func1   proc    near
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        ret
func1   endp

func2   proc    near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        pop     si
        ret
func2   endp
In this case, we can't just do an obvious tail merge with a jump. However, what if we were to change "func1" so that it pushes and pops the SI register just like "func2" does? Now we've created a situation where tail merging these two functions is possible, provided that saving/restoring SI is harmless in func1. The transformed code would look like this:
func1   proc    near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
MergePoint1:
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        pop     si
        ret
func1   endp

func2   proc    near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        jmp     MergePoint1
func2   endp
By adding 2 bytes in "func1", we were able to replace the 6 byte exit sequence in "func2" with a JMP: a 3 byte JMP if MergePoint1 was more than 127 bytes away, or a 2 byte JMP if it was in range of a short JMP. The net savings would be 1 or 2 bytes.
It's pretty common to see code where some functions are pushing and popping the same set of registers, just in a different order - like this:
func1   proc    near
        push    dx
        push    si
        push    bx
        ; Something
        pop     bx
        pop     si
        pop     dx
        ret
func1   endp

func2   proc    near
        push    si
        push    dx
        push    bx
        ; Something
        pop     bx
        pop     dx
        pop     si
        ret
func2   endp
By changing the order of the pushes and pops in "func1" so they match the order of those in "func2" we'll create a situation where we can do a tail merge:
func1   proc    near
        push    si
        push    dx
        push    bx
        ; Something
MergePoint1:
        pop     bx
        pop     dx
        pop     si
        ret
func1   endp

func2   proc    near
        push    si
        push    dx
        push    bx
        ; Something
        jmp     MergePoint1
func2   endp
Reordering the pushes and pops in a function is usually an OK thing to do, and doesn't affect the code size. In this case it enabled a net win of 2 bytes. The only case where this might not be a desirable thing to do is if the function you'd like to reorder the pushes and pops for is one that accesses, or calls code that looks back at, the stack frame and expects certain register values to be saved at specific locations in the stack frame.
Sometimes there's going to be two functions with a blob of repeated code in the middle. The obvious approach, if the entry and exit points of the repeated blob are one way in and one way out, is to just make the repeated code a function on its own. Suppose the code does things like jump to an error routine though? Extracting the code into a function will cause the stack layout to be altered, and this might cause problems if the code at the multiple exit points isn't prepared to deal with extra return addresses on the stack. There are also situations where the common code may call a routine that depends on parameters on the stack and is frame dependent. Such a situation might look something like this:
f1      proc    near
        call    foo
        mov     bar,1
        call    blat
; Common middle code starts here
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        call    barf            ; barf() looks for a parameter on the stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
; Common middle code ends here
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
; Common middle code starts here
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        call    barf            ; barf() looks for a parameter on the stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:
; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp
In this example, we've got both of those conditions mentioned above. There's the JMP outside the common code to "BarfBad", and the barf() function creates a stack frame to get at a parameter. In a situation like this we'd possibly be altering the semantics of the code by extracting the common stuff into a function of its own. This is a difficult situation for merging all of the common code. One straightforward and safe merge opportunity would be to just make this instruction sequence a function of its own:
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
That group of instructions is 14 bytes and doesn't depend on a stack frame. It's also straight line code that doesn't jump anywhere. Making it a function would cost the 14 bytes, plus a byte for a RET, plus 6 bytes for two CALL's to call it for a total of 21 bytes. This gives a 7 byte savings over having the group duplicated in two places. This version of the code would look like this:
CommonStuff proc near
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        ret
CommonStuff endp

f1      proc    near
        call    foo
        mov     bar,1
        call    blat
; Common middle code starts here
        call    CommonStuff
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
; Common middle code ends here
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
; Common middle code starts here
        call    CommonStuff
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:
; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp
This prior use of the CommonStuff function wasn't too bad, and it would be even better if the CommonStuff() function could be applied somewhere else in the code too. However, suppose the common sequence occurred only in two places and we know something special about the common code? Special you say? Yes, special. In this particular example, one notable thing about the code in the CommonStuff function is that it doesn't alter the arithmetic flags. More specifically, it doesn't alter the carry flag.
The carry flag is the only flag on the 80x86 family of CPU's that we can both set and clear directly with a one byte instruction and do a conditional jump on. The direction flag can be set and cleared, but there's no way to jump based on DF's value.
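For reference, here's what that boils down to in bytes (opcode values per Intel's documentation):

        stc                     ; 1 byte (0F9h) - set the carry flag
        clc                     ; 1 byte (0F8h) - clear the carry flag
        cmc                     ; 1 byte (0F5h) - complement the carry flag
        jc      SomeWhere       ; 2 bytes in the short form
        jnc     SomeWhere       ; 2 bytes in the short form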
Knowing that the common sequence doesn't alter the carry flag enables us to do a sleazy trick like this one:
f1      proc    near
        call    foo
        mov     bar,1
        call    blat
        stc                     ; CF=1 means entry was from f1(),
                                ; CF=0 means entry was from f2()
SleazeInFromF2:
        mov     si,1234         ; Common middle code starts here
        mov     di,5678         ;
        push    es              ;
        pop     ds              ;
        cld                     ;
        mov     cx,20           ;
        rep     movsw           ; Common middle code ends here
        jnc     short SleazeBackToF2
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
        clc
        jmp     short SleazeInFromF2    ; JMP to the merged code
SleazeBackToF2:                         ; Now we're back
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:
        call    Foobie2
        call    Bletch2
        ret
f2      endp
What did this buy us here? Well, we traded two 3 byte CALL's and a one byte RET for two 2 byte JMP's plus a one byte CLC and a one byte STC. This version that used the carry flag is a byte shorter than the version that had the common code as a separate function.
If the common code sequence happened to be in 32 bit code in a USE32 segment, the savings are going to be even greater using this carry flag trick. "near" CALL instructions in USE32 type segments are 5 byte instructions as opposed to near CALL's in USE16 segments which are 3 bytes. Doing the separate CommonStuff() function will have 11 bytes worth of overhead to implement in 32 bit code versus the 7 bytes of overhead we had in the USE16 version. In USE32 code, if you can use the carry flag trick, it's going to be 5 bytes shorter than a separate function.
TIP: The INC and DEC instructions do not alter the carry flag, so a common code sequence that uses them can still work with this carry flag trick.
Suppose we've got a common blob of code in a USE32 code segment and it does alter the flags (specifically the carry flag)? Does this force us into having to go with the fatter separate function form of middle merging that's 5 bytes bigger than the carry flag trick? Not necessarily. If the entry and exit points are well defined and the common code doesn't depend on the state of the flags but only changes them incidentally, we might do this instead:
        stc
SleazeInFromWherever:
        ;Begin common code
        ...blah, blah
        pushfd
        ; The instruction(s) that alter the flags live here
        popfd
        ...blah, blah
        jnc     short SleazeBackToWherever
        ;End common code
The PUSHFD/POPFD in USE32 code (and PUSHF/POPF in USE16 code) are one byte instructions. We'd still be ahead by 3 bytes if we use PUSHFD/POPFD to keep the common sequence from changing the flags in USE32 code. Note that this technique would be a net loser by one byte in a USE16 code segment where near CALL instructions are only 3 bytes.
There's another consideration about 32 bit code here too. In a USE32 code segment, a conditional JMP (like JNC) can be either 2 bytes or 6 bytes. On a 386, a conditional jump in a USE16 segment will be either 2 bytes or 4 bytes. Here's part of a listing file generated by an assembler for a small test file that demonstrates this:
 1                                  .386
 2  00000000                        jmps32 segment use32 public 'code'
 3                                  assume cs:jmps32,ds:jmps32
 4  00000000  0F 82 00000200        jc foo32
 5  00000006  0200*(??)             db 512 dup(?)
 6  00000206                        foo32 label near
 7  00000206  90                    nop
 8  00000207  72 FD                 jc foo32
 9  00000209                        jmps32 ends
10
11  0000                            jmps16 segment use16 public 'code'
12                                  assume cs:jmps16,ds:jmps16
13  0000      0F 82 0200            jc foo16
14  0004      0200*(??)             db 512 dup(?)
15  0204                            foo16 label near
16  0204      90                    nop
17  0205      72 FD                 jc foo16
18  0207                            jmps16 ends
19
20                                  end
Notice how that 512 byte blob on line #5 caused the JC on line #4 to become one of those fat 6 byte conditional jumps. When you're forced into using one of the 6 byte jumps it's a byte worse than doing one of the 5 byte CALL's!
Similarly, in USE16 code, that 4 byte conditional JC on line #13 is a byte worse than a 3 byte CALL. We'll see more about these jumps on the 386 a little later.
One thing we see all the time in code is two procedures that have a common entry sequence of instructions and then go off to do something different at some point, like these two example functions:
f1      proc    near
(1 byte)        push    ds
(1 byte)        push    es
(1 byte)        push    si
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
        ;
        ; The code gets different here
        ;
ExitMerge:
(1 byte)        pop     di
(1 byte)        pop     si
(1 byte)        pop     es
(1 byte)        pop     ds
(1 byte)        ret
f1      endp

f2      proc    near
(1 byte)        push    ds
(1 byte)        push    es
(1 byte)        push    si
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
        ;
        ; The code gets different here
        ;
(2 byte)        jmp     ExitMerge
f2      endp

(27 bytes total)
We did the obvious tail merge on the exit sequences, but is there some way we might smash the entry sequences together too? Again, the carry flag can come to the rescue in a situation like this one. Unless these functions depend on the state of the carry flag on entry, which is a pretty rare condition, we might do something like this:
f1      proc    near
(1 byte)        clc
EntryMerge:
(1 byte)        push    ds      ; Note that the flags are
(1 byte)        push    es      ; not hit by any of these instructions
(1 byte)        push    si      ; in the common prolog sequence
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
(2 byte)        jc      EntryReturn
        ;
        ; The code gets different here
        ;
ExitMerge:
(1 byte)        pop     di
(1 byte)        pop     si
(1 byte)        pop     es
(1 byte)        pop     ds
(1 byte)        ret
f1      endp

f2      proc    near
(1 byte)        stc
(2 byte)        jmp     short EntryMerge
EntryReturn:
        ;
        ; The code gets different here
        ;
(2 byte)        jmp     ExitMerge
f2      endp

(23 bytes total)
Was 23 bytes for the last example the best we can do? If we can place one added constraint on the functions we can do a little better. What if it were OK for these functions to clobber the CX register on entry? This is a fairly easily met condition in most cases for functions that will be called by a C/C++ compiler.
Consider this dastardly transformation:
f1      proc    near
(1 byte)        db      0B9h    ; The opcode byte for MOV CX,immediate
f2      label   near
(2 byte)        xor     cx,cx   ; 33h C9h
(1 byte)        push    ds
(1 byte)        push    es
(1 byte)        push    si
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
(2 byte)        jcxz    f2MergeEntry
        ;
        ; The code gets different here
        ;
ExitMerge:
(1 byte)        pop     di
(1 byte)        pop     si
(1 byte)        pop     es
(1 byte)        pop     ds
(1 byte)        ret
f1      endp
f2MergeEntry label near
        ;
        ; The code gets different here
        ;
(2 byte)        jmp     ExitMerge
f2      endp

(22 bytes total)
That one is a little tricky and needs some explanation. Here we've got the entry points for f1() and f2() being located within a byte of each other. The reason this works is because the CPU just processes streams of bytes that get interpreted as instructions. Entering at f1 causes a MOV CX,0C933h to be executed, while entering at f2 causes an XOR CX,CX to be executed. Combining that behavior with a JCXZ instruction makes for a pretty sneaky combination.
If you've got two functions with more than 5 bytes of prolog in common and CX is expendable during the common entry sequence, then you can apply this trick. Alas, in USE32 segments, this trick is less valuable because a "MOV ECX,immediate" is a 5 byte instruction. For code in a USE32 segment, we'd have to add two pad bytes after the XOR ECX,ECX we'd be using in 32 bit code. An OR ECX,ECX instruction would do the trick in a case like that.
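Here's a sketch of what that USE32 flavor might look like (hypothetical labels, mirroring the USE16 example above; note that it's JECXZ, not JCXZ, that tests the full ECX):

f1      proc    near
(1 byte)        db      0B9h    ; The opcode byte for MOV ECX,immediate
f2      label   near
(2 byte)        xor     ecx,ecx ; 33h 0C9h
(2 byte)        or      ecx,ecx ; 0Bh 0C9h - pads out the 4 byte immediate
        ; ... common prolog code goes here ...
(2 byte)        jecxz   f2MergeEntry

Entering at f1 executes a MOV ECX,0C90BC933h, so ECX is non-zero. Entering at f2 leaves ECX zero after the XOR and OR, and the JECXZ sorts out which entry point was used.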
Setting one segment register to the value in another one is a common occurrence in USE16 code. It happens a lot during a setup to do some string operations, and when the value in one is unknown at some point in the code.
We see 4 byte code sequences like this all the time:
        mov     ax,ds
        mov     es,ax
PUSH/POP
If the value in AX wasn't going to be used later, then the straightforward transformation to compact that sequence to 2 bytes would be:
        push    ds
        pop     es
Yes, it works for SS too!
For some reason many people seem hesitant to use the PUSH/POP sequence when SS is the target in a sequence like this one:
        mov     ax,ds
        mov     ss,ax
        mov     sp,NewSPval
The PUSH/POP thing does work here though, and there's no reason not to use it when space is the issue.
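For instance, the SS version of the transformation might look like this. (Both MOV SS and POP SS hold off interrupts until after the next instruction completes, which is exactly why the MOV to SP has to come immediately after.)

        push    ds
        pop     ss              ; Interrupts are held off until after...
        mov     sp,NewSPval     ; ...this next instruction finishes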
Using a PUSH/POP to replace a MOV/MOV chops the size of setting one segment register to another from 4 bytes to 2. What's the speed hit though? I used this little Zen Timer program to measure the difference on several different test systems:
;--------------------------------------------------------
; A Zen Timer test to see how well PUSH/POP performs
; compared to MOV/MOV for setting segment registers.
;--------------------------------------------------------
        call    ZtimerOn

        rept    5000
        push    ds              ;; 5000 PUSH/POP's
        pop     es
        endm

        call    ZtimerOff
        call    ZTimerReport

        db      32 dup (90h)

        call    ZtimerOn

        rept    5000
        mov     ax,ds           ;; 5000 MOV/MOV's
        mov     es,ax
        endm

        call    ZtimerOff
We would rightly expect the PUSH/POP sequence to run slower than the MOV/MOV sequence because it's sloshing stuff on and off the stack. How much slower it was on the different CPU types was interesting. All of these timings are in microseconds by the way.
Test CPU          PUSH/POP    MOV/MOV    Ratio
4.77mhz 8088        31057      18102      .58
386SX-25             2029        806      .39
386DX-20             2449       1100      .45
486SX-25             1310       1200      .92
486DX2-66            2235        450      .20
60mhz Pentium         330        249      .75
The effects of the 486DX2-66's clock doubler were really apparent on this test. Keeping things in registers really makes the clock doubling technology shine. Equally notable was the performance of the 486SX-25. It wasn't hurt too badly at all by the PUSH/POP sequence.
So when would we want to trade a 4 byte MOV/MOV for a 2 byte PUSH/POP? As always, a profiler and some common sense should guide our actions. Let's say the segment registers are being set at the beginning of a repeated string operation that's going to be moving several hundred or several thousand bytes around. In a case like this, the added cost of the PUSH/POP is going to be negligible compared to the cost of the string move, so you might as well save the space with a PUSH/POP. One of the advantages of the PUSH/POP sequence is that you don't have to burn a register to do it either. Depending on how the rest of the code is organized, this might be quite handy.
LES/LDS are powerful instructions
We often see code that looks like this sequence in programs:
Thing   dd      ?
        ...
(3 bytes)       mov     ax, word ptr [Thing]
(4 bytes)       mov     dx, word ptr [Thing+2]

(7 bytes total)
If we're in real mode (or V86 mode on a 386), then we don't need to worry about what value gets loaded into a segment register (like we would have to in protected mode, where a GP fault might be generated).
In a case like this, if we can waste the ES register, the code could be transformed to something like this:
(4 bytes)       les     ax, dword ptr [Thing]
(2 bytes)       mov     dx, es

(6 bytes total)
Often we'll see sequences like this in code where a bunch of variables in memory are being set:
(3 bytes)       mov     ax, word ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(3 bytes)       mov     ax, word ptr [Stuff+2]
(3 bytes)       mov     word ptr [StuffSave+2], ax

(12 bytes total)
Again, if we can waste the ES register, that can be transformed into this:
(4 bytes)       les     ax, dword ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(4 bytes)       mov     word ptr [StuffSave+2], es

(11 bytes total)
Now, suppose the sequence is setting more than a couple of variables, but we can't waste the ES register.
;
; We can't alter ES during this sequence
;
(3 bytes)       mov     ax, word ptr [OtherStuff]
(3 bytes)       mov     word ptr [OtherStuffSave], ax
(3 bytes)       mov     ax, word ptr [OtherStuff+2]
(3 bytes)       mov     word ptr [OtherStuffSave+2], ax
(3 bytes)       mov     ax, word ptr [OtherStuff2]
(3 bytes)       mov     word ptr [OtherStuffSave2], ax
(3 bytes)       mov     ax, word ptr [OtherStuff2+2]
(3 bytes)       mov     word ptr [OtherStuffSave2+2], ax
(3 bytes)       mov     ax, word ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(3 bytes)       mov     ax, word ptr [Stuff+2]
(3 bytes)       mov     word ptr [StuffSave+2], ax

(36 bytes total)
Now the sequence is big enough that we could think about pushing/popping ES as a way to preserve it through the sequence. This would allow us to waste it temporarily to produce this sequence:
;
; We can't alter ES during this sequence
;
(1 byte)        push    es
(4 bytes)       les     ax, dword ptr [OtherStuff]
(3 bytes)       mov     word ptr [OtherStuffSave], ax
(4 bytes)       mov     word ptr [OtherStuffSave+2], es
(4 bytes)       les     ax, dword ptr [OtherStuff2]
(3 bytes)       mov     word ptr [OtherStuffSave2], ax
(4 bytes)       mov     word ptr [OtherStuffSave2+2], es
(4 bytes)       les     ax, dword ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(4 bytes)       mov     word ptr [StuffSave+2], es
(1 byte)        pop     es

(35 bytes total)
Occasionally I see code that looks like this next sequence. It's usually the result of someone being in a hurry when it was written, or of a beginning assembler programmer who didn't have the LES and LDS instructions down yet.
SomeDwordPointer dd ?
        ...
(4 bytes)       mov     di, word ptr [SomeDwordPointer]
(4 bytes)       mov     ds, word ptr [SomeDwordPointer+2]
If you ever see code written like this and you're looking to save space, then replace it with:
(4 bytes) lds di, dword ptr [SomeDwordPointer]
There are occasions when space saving is a paramount concern. For instance, when ROM'ing code and everything needs to fit into a small 32K or 64K ROM. Another might be a "pop up" TSR that interacts with the user at human typing speeds. You may even find yourself in a situation where saving a few bytes in an executable is enough to save a diskette in a product, or would pull a component in a system back under a cluster boundary in DOS's FAT file system. In situations like these, exceptional means are called for. One of the easier ways to knock bytes out of code is to look for plump instructions and turn them into one instruction subroutines. In USE16 code, a near CALL is only 3 bytes. However, there are a lot of instructions in the 80x86 family that are bigger than 3 bytes. Moving immediate constants to memory locations is a common example. In USE32 code, the situation gets worse. Moving an immediate constant to a dword in memory is a whopping 10 byte instruction. Near calls in USE32 code are 5 bytes.
Here's an example from a TSR I wrote a few years ago that faked a HIMEM driver for XT's and AT's that only had EMS 4.0 memory and no extended memory. The driver was a good enough ersatz HIMEM that the 4.0 and later versions of the SMARTDRV disk cache could run with it. The complete source for the FAKEHI driver is included in one of the appendixes. It uses several of the techniques discussed in the book to keep its resident size as small as possible.
;------------------------------------------------------------
; Putting this instruction in a proc saves some space.
; Immediate compares to memory are BIG instructions.
;------------------------------------------------------------
CmpAllocFlag0 proc near
        cmp     AllocFlag, 0
        ret
CmpAllocFlag0 endp
If there are enough instances of a large instruction in the code, extracting that instruction into a subroutine of its own can pay off. Realize that this isn't something you'll want to do in a high performance section of code though. The CALL/RET pair is terribly slow compared to inlining the instruction.
When you're looking for opportunities to do a one instruction subroutines, one of the best ways to find them is to look at an assembler listing file. Down the left hand side of the listing will be the generated code. You want to kind of scan the listing looking for instructions that generated 4 or more bytes of code. Plump things really stand out in a listing file.
This one-instruction subroutine trick isn't just for assembler code either. It works well in C/C++ code as well when you're really trying to squeeze something down. Take this function for example.
/* 1INST.C  Example for 1 instruction subroutines in C */

int var1;
int var2;
int var3;

#ifdef CALL
void near ZeroVar1(void)
{
    var1 = 0;
}
#endif

void foo(int x, int y, int z)
{
    var1 &= 255;
    if (var1)
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
    else
    {
        if (x)
#ifdef CALL
            ZeroVar1();
#else
            var1 = 0;
#endif
    }

    if (y)
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
    else
    {
        if (var2)
#ifdef CALL
            ZeroVar1();
#else
            var1 = 0;
#endif
        if (z & var2)
#ifdef CALL
            ZeroVar1();
#else
            var1 = 0;
#endif
    }

    if (var3 == 7)
    {
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
        var2 = 1;
        return;
    }

    if (var1 || var2 || var3)
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
}
I compiled this function with the Microsoft version 8.0 compiler using the /Os /O1 /Fa and /Gs switches. /Gs eliminates stack probes, /Fa generates an assembler listing, and /Os and /O1 tell the compiler to go for max space optimizations.
Without the symbol "CALL" being defined, the compiler generated this assembler code:
; Static Name Aliases
;
        TITLE   1inst.c
        .286p
        .287
        INCLUDELIB      SLIBCE
        INCLUDELIB      OLDNAMES.LIB
_TEXT   SEGMENT WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT WORD PUBLIC 'BSS'
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA
        ASSUME  DS: DGROUP, SS: DGROUP
_BSS    SEGMENT
        COMM NEAR       _var1:  BYTE:   2
        COMM NEAR       _var2:  BYTE:   2
        COMM NEAR       _var3:  BYTE:   2
_BSS    ENDS
_TEXT   SEGMENT
        ASSUME  CS: _TEXT
        PUBLIC  _foo
_foo    PROC NEAR               ; COMDAT
; Line 18
        push    bp
        mov     bp,sp
;       x = 4
;       y = 6
;       z = 8
; Line 20
        and     WORD PTR _var1,255      ;00ffH
        jne     $L121
; Line 28
        cmp     WORD PTR [bp+4],0       ;x
        je      $I112
; Line 32
$L121:
        mov     WORD PTR _var1,0
; Line 34
$I112:
; Line 35
        cmp     WORD PTR [bp+6],0       ;y
        jne     $L122
; Line 43
        cmp     WORD PTR _var2,0
        je      $I116
; Line 47
        mov     WORD PTR _var1,0
; Line 49
$I116:
        mov     ax,WORD PTR _var2
        test    WORD PTR [bp+8],ax      ;z
        je      $I115
; Line 53
$L122:
        mov     WORD PTR _var1,0
; Line 55
$I115:
; Line 56
        cmp     WORD PTR _var3,7
        jne     $I118
; Line 61
        mov     WORD PTR _var1,0
; Line 63
        mov     WORD PTR _var2,1
; Line 64
        leave
        ret
; Line 66
$I118:
        cmp     WORD PTR _var1,0
        jne     $I120
        cmp     WORD PTR _var2,0
        jne     $I120
        cmp     WORD PTR _var3,0
        je      $EX110
$I120:
; Line 70
        mov     WORD PTR _var1,0
; Line 72
$EX110:
        leave
        ret

_foo    ENDP
_TEXT   ENDS
END
Even with all the best space optimizations possible, the compiler still generated 5 MOV instructions to "var1" with immediate constants of zero. The generated code from assembling the ASM listing was 6Ah bytes long.
Compiling the function again, this time with the /DCALL switch to enable the one instruction subroutine generated this listing file:
; Static Name Aliases
;
        TITLE   1inst.c
        .286p
        .287
        INCLUDELIB      SLIBCE
        INCLUDELIB      OLDNAMES.LIB
_TEXT   SEGMENT WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT WORD PUBLIC 'BSS'
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA
        ASSUME  DS: DGROUP, SS: DGROUP
_BSS    SEGMENT
        COMM NEAR       _var1:  BYTE:   2
        COMM NEAR       _var2:  BYTE:   2
        COMM NEAR       _var3:  BYTE:   2
_BSS    ENDS
_TEXT   SEGMENT
        ASSUME  CS: _TEXT
        PUBLIC  _ZeroVar1
_ZeroVar1       PROC NEAR       ; COMDAT
; Line 13
        mov     WORD PTR _var1,0
; Line 14
        ret
_ZeroVar1       ENDP
        PUBLIC  _foo
_foo    PROC NEAR               ; COMDAT
; Line 18
        push    bp
        mov     bp,sp
;       x = 4
;       y = 6
;       z = 8
; Line 20
        and     WORD PTR _var1,255      ;00ffH
        jne     $L123
; Line 28
        cmp     WORD PTR [bp+4],0       ;x
        je      $I114
; Line 30
$L123:
        call    _ZeroVar1
; Line 34
$I114:
; Line 35
        cmp     WORD PTR [bp+6],0       ;y
        jne     $L124
; Line 43
        cmp     WORD PTR _var2,0
        je      $I118
; Line 45
        call    _ZeroVar1
; Line 49
$I118:
        mov     ax,WORD PTR _var2
        test    WORD PTR [bp+8],ax      ;z
        je      $I117
; Line 51
$L124:
        call    _ZeroVar1
; Line 55
$I117:
; Line 56
        cmp     WORD PTR _var3,7
        jne     $I120
; Line 59
        call    _ZeroVar1
; Line 63
        mov     WORD PTR _var2,1
; Line 64
        leave
        ret
; Line 66
$I120:
        cmp     WORD PTR _var1,0
        jne     $I122
        cmp     WORD PTR _var2,0
        jne     $I122
        cmp     WORD PTR _var3,0
        je      $EX112
$I122:
; Line 68
        call    _ZeroVar1
; Line 72
$EX112:
        leave
        ret

_foo    ENDP
_TEXT   ENDS
END
It was good that the Microsoft compiler was smart enough to recognize that the ZeroVar1() function didn't need any sort of function prolog/epilog code. The compiler was also nice enough to recognize that ZeroVar1() didn't need to save or restore SI/DI which are commonly used as register variables. The generated code from assembling the ASM listing was 62h bytes long. By making the assignment of Var1 a function, we saved 8 bytes.
TIP: The Borland 4.52 C++ compiler behaved a little differently on this test program and insisted on pushing/popping SI/DI at the beginning and end of the one instruction function even though those registers weren't altered by the code. To force the Borland compiler to omit the push/pop of SI/DI a #pragma option can be used like this:
#pragma option -r-      // Temporarily turn off register variables
void near ZeroVar1(void)
{
    var1 = 0;
}
#pragma option -r.      // Restore the register variable state
Here's a section of the assembler code produced by the Borland compiler with the #pragma's used as mentioned above:
_TEXT   segment byte public 'CODE'
;
;       void near ZeroVar1(void)
;
        assume  cs:_TEXT,ds:DGROUP
_ZeroVar1       proc    near
;
;       {
;          var1 = 0;
;
        mov     word ptr DGROUP:_var1,0
;
;       }
;
        ret
_ZeroVar1       endp
;
;       void foo(int x, int y, int z)
;
        assume  cs:_TEXT,ds:DGROUP
_foo    proc    near
        push    bp
        mov     bp,sp
        push    si
        push    di
;
;       {
;          var1 &= 255;
;
        and     word ptr DGROUP:_var1,255
Notice how no push/pop of SI/DI was generated within the ZeroVar1() function when the #pragma was used. We can also see that the register variables were made active again when the code for the body of the foo() function was generated because pushes of SI and DI were generated there.
One sequence we'll often see in code is a comparison of a register with the value 1 or -1 like this:
        cmp     edi,-1          ; cmp di,-1 in USE16 code
        je      SomeWhere

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...
In USE16 and USE32 code this will be encoded by most assemblers as a 3 byte instruction. In special situations we can transform this into a one byte INC instruction by recognizing that one added to minus one is zero (i.e. an INC will set the Z flag):
        inc     edi             ; inc di in USE16 code
        je      SomeWhere

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...
This is a nice space saver when we can apply it. When can we apply it though? If the code on the "fall through" path and at the target don't depend on the value of the register after the comparison, then this trick is a winner. There may still be an opportunity when only one of the two paths depends on the register's value. For example, let's say the code on the fall through in the example above depended on the value in EDI remaining intact, but the code at the branch target didn't. We can still save a byte by doing this:
        inc     edi             ; inc di in USE16 code
        je      SomeWhere
        dec     edi

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...
The DEC EDI will restore the value in EDI back to what it was before the test.
When testing for the value 1, the concept is the same as testing for -1 only we'll be doing a DEC instead. Similarly, recovering the register value, if one of the paths needs it, would be done with an INC in this situation rather than a DEC.
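A sketch of the test-for-1 flavor, again assuming only the fall through path needs the register's original value back:

        dec     edi             ; dec di in USE16 code - ZF set if EDI was 1
        je      SomeWhere
        inc     edi             ; Restore EDI on the fall through path

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...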
The thing people typically do to split the nibbles in the AL register into AH and AL might look something like this piece of code:
(2 bytes)       mov     ah,al
(2 bytes)       mov     cl,4
(2 bytes)       shr     ah,cl
(3 bytes)       and     ax,0F0Fh
This is nice and it works OK. However, it is 9 bytes long. There's a way to tighten this up into a single two byte instruction though. In all the Intel CPU documentation prior to the Pentium, the claim was made that the AAM instruction only worked in base 10 decimal. In the Pentium programmer's reference, Intel has finally confessed to how the AAM (and AAD) instructions really work. While this confession occurred in the Pentium programmer's reference, AAM has functioned this way in every 80x86 CPU since the original 8086. Here's a pseudo-C version of how AAM works in reality:
AAM(unsigned char NumericBase)
{
    unsigned char Temp8;

    Temp8 = AL;
    AH = Temp8 / NumericBase;
    AL = Temp8 % NumericBase;
}
In fact, the AAM instruction is capable of dealing with numbers in most any numeric base you might want to work in. One of the more convenient bases we may want to have the AAM work with is base 16 (i.e. hexadecimal).
If you were to just code up an AAM instruction, an assembler is going to encode it like this:
        db      0D4h,0Ah        ; Opcode for AAM is 0D4h.
                                ; Second byte is the numeric base.
                                ; In this case 0Ah = 10 decimal.
If we hand assemble an AAM with a second byte that's different than 0Ah, then we've got an AAM that's going to work in whatever base we encode as the second byte. If we want AAM to work in hexadecimal, then we'd encode the second byte as 10h (16 decimal). You might want to do a macro for convenience, like this.
;
; MAAM (Mutant AAM)
;
MAAM    macro   NumericBase
        db      0D4h, NumericBase
        endm
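With the macro in place, splitting the nibbles of AL takes just the one mutated instruction. For instance (using a value other than the one walked through below):

        mov     al,5Ah
        MAAM    16              ; AX = 050Ah - high nibble to AH, low to AL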
Let's do a mental run of a nibble in AL through the pseudo-C code for AAM using hexadecimal as the numeric base to see how this works. Suppose the AL register contains the value 37h.
37h divided by 16 is 3. The AH register will get the value 03h.
37h mod 16 is 7. The AL register will get the value 07h.
So for our test value of 37h in AL, after a mutated AAM using base 16, the AX register contains the value 0307h. The mutated AAM split those nibbles cleanly into AH and AL.
There's a downside to this AAM trick though...
The performance of AAM is pretty slow on all the CPU's.
The previous 9 byte, 4 instruction sequence that split the nibbles is actually faster than the 2 byte mutated AAM. On a Pentium, the AAM instruction takes 10 cycles. It's one of the most expensive integer instructions you can do on a Pentium. As always, we should let a profiler and some common sense tell us if the AAM trick is appropriate for any given piece of code.
Just as we were able to use a mutated AAM to split nibbles, we can also use another mutated instruction to join them. If the two byte AAD instruction is mutated so the second byte is 10h (16 decimal), it will join the low nibble in AH with the low nibble in AL. The low nibble of AH becomes the high nibble of AL, while the low nibble of AL remains unchanged.
Suppose the AX register contained the value 0109h. Doing one of these hexadecimal mutated AAD instructions will result in AL containing the value 19H. The AH register is always set to zero after any form of the AAD instruction.
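Here's a companion macro sketch for the mutated AAD (the AAD opcode byte is 0D5h; "MAAD" is just a name I'm coining to mirror MAAM):

;
; MAAD (Mutant AAD)
;
MAAD    macro   NumericBase
        db      0D5h, NumericBase
        endm

        mov     ax,0109h
        MAAD    16              ; AX = 0019h - nibbles joined into AL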
WARNING: If the high order nibbles of AL or AH could possibly be non-zero, you'll need to do an "AND AX,0F0Fh" before doing the mutated base 16 AAD to join the nibbles.
WARNING: The mutated AAD trick does not work correctly on some of the NEC V series CPU's.
I've tried it on a V20, and the V20 always behaves as if the second byte of the AAD were 0Ah (10 decimal). If you're writing code that's going to run on a 286 or better CPU, then use the mutated AAD trick and sleep well. If you're writing code that may be run on an XT class machine with one of the NEC V20 or V30 chips installed, then you'd better avoid this trick. I've not tried the AAD trick on a V40 or V50 chip yet. On all the Intel CPU's, and variants thereof, that are 286 class or better, the AAD trick will work OK. By the way, if you're looking for an easy way to detect a V20 chip, this is one way to do it.
Frequently we'll encounter code that tests for a series of error conditions as it flows along, where all the error paths lead to one place. The code might look something like this:
ErrorExit:
        mov     ax,-1
        ret

FooBar  proc    near
        call    Check1
        jc      ErrorExit
        ;
        ; The next JC to ErrorExit is still in range of a
        ; short jump.
        ;
        call    Check2
        jc      ErrorExit
        ;
        ; Some code is here that causes a subsequent JC to
        ; ErrorExit to be out of range of a short jump. This
        ; forced the programmer to do a "jump around the
        ; jump" construct.
        ;
        call    Check3
        jnc     Check3ok        ; (2 bytes)
        jmp     ErrorExit       ; (3 bytes)
Check3ok:
        xor     ax,ax
        ret
FooBar  endp
In a case like this one, if the code after the call to Check3() is within short jump range of one of the other JC instructions, the code can be transformed like this:
FooBar  proc    near
        call    Check1
        jc      ErrorExit
        ;
        ; The next JC to ErrorExit is still
        ; in range of a short jump.
        ;
        call    Check2
JC_to_ErrorExit:
        jc      ErrorExit
        ;
        ; Some code is here that causes a
        ; subsequent JC to ErrorExit to be
        ; out of range of a short jump.
        ; This forced the programmer to do
        ; a "jump around the jump" construct.
        ;
        call    Check3
        jc      JC_to_ErrorExit
        xor     ax,ax
        ret
FooBar  endp
The state of the carry flag isn't going to change after a conditional jump, so introducing the label "JC_to_ErrorExit" allows us to transform a 5 byte "jump around the jump" sequence into a nice small 2 byte jump on carry.
One added benefit of doing this is that the non-error path through the code isn't forced to take a jump now. Falling through conditional jumps in the mainstream code is always a good thing. Jumps not taken are invariably faster than jumps taken.
If the routine in question is sufficiently long, with lots of tests, you'll get a nice space savings out of this trick, and the code will be running faster in the normal case too.
Sometimes there's no way to avoid a jump around the jump in a real long routine. That's life. When it happens though, see if you can apply the jump chaining trick to subsequent checks in the routine that may be within short jump range of that jump around jump you were forced to use.
The 386 and later CPU's are special here by the way. They implement a new form of the conditional jumps that can span the range of a 64K segment in USE16 code, and span a 4G segment in USE32 code. In USE16 code, this form of conditional jump is going to be a 4 byte instruction. In USE32 code it will be a 6 byte instruction.
Here's a little assembler module that demonstrates these new forms of conditional jumps:
        .386
x       segment use32 'code32'
        assume  cs:x
        jc      foo32
        db      200 dup (90h)   ; Force jump out of short range
foo32   label   near
x       ends

y       segment use16 'code16'
        assume  cs:y
        jc      foo16
        db      200 dup (90h)   ; Force jump out of short range
foo16   label   near
y       ends
        end
Assembling this module yielded this listing file that shows how these new form jumps are 6 and 4 bytes (on lines 4 and 11 of the listing).
Turbo Assembler  Version 3.1        11/19/95 22:09:01       Page 1
jumps.ASM

 1                                  .386
 2  00000000                        x segment use32 'code32'
 3                                  assume cs:x
 4  00000000  0F 82 000000C8        jc foo32
 5  00000006  C8*(90)               db 200 dup (90h)
 6  000000CE                        foo32 label near
 7  000000CE                        x ends
 8
 9  0000                            y segment use16 'code16'
10                                  assume cs:y
11  0000      0F 82 00C8            jc foo16
12  0004      C8*(90)               db 200 dup (90h)
13  00CC                            foo16 label near
14  00CC                            y ends
15                                  end
I need to mention a gotcha here regarding these new 386 forms of jumps. Some assemblers are not going to be very smart and may assemble what could be a short jump into one of those longer 4 or 6 byte forms. Some will later realize they could do the short form and will generate that, but pad the extra bytes they initially allocated with NOP's.
Here's a slightly altered version of the above example that illustrates this pitfall:
        .386
x       segment use32 'code32'
        assume  cs:x
        jc      foo32
        db      200 dup (90h)
        jc      foo32
        db      100 dup (90h)
foo32   label   near
x       ends

y       segment use16 'code16'
        assume  cs:y
        jc      foo16
        db      200 dup (90h)
        jc      foo16
        db      100 dup (90h)
foo16   label   near
y       ends
        end
The second JC in each segment here is within 127 bytes of the target label and can rightly be assembled as the 2 byte short form of the conditional jump. However, when I ran this code with no special switches, other than to generate a list file, through the version of TASM that came with the Borland version 3.1 C/C++ compiler, I got this result:
Turbo Assembler  Version 3.1        11/19/95 22:36:17       Page 1
jumps2.ASM

 1                                  .386
 2  00000000                        x segment use32 'code32'
 3                                  assume cs:x
 4  00000000  0F 82 00000132        jc foo32
 5  00000006  C8*(90)               db 200 dup (90h)
 6  000000CE  72 68 90 90 90 90     jc foo32
 7  000000D4  64*(90)               db 100 dup (90h)
 8  00000138                        foo32 label near
 9  00000138                        x ends
10
11  0000                            y segment use16 'code16'
12                                  assume cs:y
13  0000      0F 82 0130            jc foo16
14  0004      C8*(90)               db 200 dup (90h)
15  00CC      72 66 90 90           jc foo16
16  00D0      64*(90)               db 100 dup (90h)
17  0134                            foo16 label near
18  0134                            y ends
19                                  end
Notice how TASM generated the 4 and 6 byte conditional jumps on lines 6 and 15 of this listing. TASM just blindly allocated 4 and 6 bytes for the jumps and then went back and filled in the gaps with NOP's (90h's) when it realized a short version of the jump would work.
Are TASM users doomed to this kind of behavior? No. There are two different ways around this situation. One is to use the /m switch when assembling and tell TASM to use multiple passes to resolve forward references. When I rebuilt the example and added a /m3 switch TASM generated this much better code:
Turbo Assembler  Version 3.1        11/19/95 22:44:56       Page 1
jumps2.ASM

 1                                  .386
 2  00000000                        x segment use32 'code32'
 3                                  assume cs:x
 4  00000000  0F 82 0000012E        jc foo32
 5  00000006  C8*(90)               db 200 dup (90h)
 6  000000CE  72 64                 jc foo32
 7  000000D0  64*(90)               db 100 dup (90h)
 8  00000134                        foo32 label near
 9  00000134                        x ends
10
11  0000                            y segment use16 'code16'
12                                  assume cs:y
13  0000      0F 82 012E            jc foo16
14  0004      C8*(90)               db 200 dup (90h)
15  00CC      72 64                 jc foo16
16  00CE      64*(90)               db 100 dup (90h)
17  0132                            foo16 label near
18  0132                            y ends
19                                  end
See how the jumps on lines 6 and 15 are nice short 2 byte ones and that nasty NOP padding is gone.
One thing that can really pork out the code for DOS device drivers and TSR's is missing an ASSUME directive for the DS register. Here's a section of the assembler listing from the FAKEHI driver I mentioned previously. On closer examination, I realized that I'd blown it when I first wrote the driver.
318                           ;---------------------------------------------------
319                           ; Translate Src pointer to EMS form (if necessary)
320                           ;---------------------------------------------------
321 020C  2E: 80 3E 0094 00       cmp     EMSxbuf.src_type, 0     ; Lowmem?
322 0212  74 12                   je      ChkDstForXlat           ; No xlat in this case...
323
324 0214  2E: C4 3E 0097          les     di, dword ptr EMSxbuf.src_offset
325 0219  E8 FEAE                 call    XlatProc
326 021C  2E: 89 3E 0097          mov     word ptr EMSxbuf.src_offset, di
327 0221  2E: 8C 06 0099          mov     word ptr EMSxbuf.src_offset+2, es
328
329 0226                      ChkDstForXlat:
330                           ;---------------------------------------------------
331                           ; Translate Dst pointer to EMS form (if necessary)
332                           ;---------------------------------------------------
333 0226  2E: 80 3E 009B 00       cmp     EMSxbuf.dst_type, 0     ; Lowmem?
334 022C  74 12                   je      XlatingDone             ; No xlat in this case...
335
336 022E  2E: C4 3E 009E          les     di, dword ptr EMSxbuf.dst_offset
337 0233  E8 FE94                 call    XlatProc
338 0236  2E: 89 3E 009E          mov     word ptr EMSxbuf.dst_offset, di
339 023B  2E: 8C 06 00A0          mov     word ptr EMSxbuf.dst_offset+2, es
340
341 0240                      XlatingDone:
342 0240  0E                      push    cs      ; DS:SI --> EMS xlated info
343 0241  1F                      pop     ds      ;
344 0242  BE 0090                 mov     si, offset EMSxbuf      ;
The function XlatProc won't alter the DS register in this section of code by the way, nor does it depend on the value in DS. In the listing, take a look at the code the assembler generated on lines 321, 324, 326, 327, 333, 336, 338, and 339.
Notice the '2E:'s. Those are CS: segment override bytes on the instructions -- 8 of them.
Now notice the PUSH CS and POP DS down on lines 342 and 343. If I'd moved the push/pop of CS and DS up to the top of this section of code, I would have been able to use an ASSUME on the DS register through this section. That would have allowed the assembler to eliminate those 8 bytes of segment overrides.
Fiddling with the code a little yielded this new listing:
286 020C 0E push cs
287 020D 1F pop ds
288 assume ds:fakehi
289 ;---------------------------------------------------
290 ; Translate Src pointer to EMS form (if necessary)
291 ;---------------------------------------------------
292 020E 80 3E 0094 00 cmp EMSxbufDS.src_type, 0 ; Lowmem?
293 0213 74 0F je ChkDstForXlat ; No xlat in this case...
294
295 0215 C4 3E 0097 les di, dword ptr EMSxbufDS.src_offset
296 0219 E8 FEAE call XlatProc
297 021C 89 3E 0097 mov word ptr EMSxbufDS.src_offset, di
298 0220 8C 06 0099 mov word ptr EMSxbufDS.src_offset+2, es
299
300 0224 ChkDstForXlat:
301 ;---------------------------------------------------
302 ; Translate Dst pointer to EMS form (if necessary)
303 ;---------------------------------------------------
304 0224 80 3E 009B 00 cmp EMSxbufDS.dst_type, 0 ; Lowmem?
305 0229 74 0F je XlatingDone ; No xlat in this case...
306
307 022B C4 3E 009E les di, dword ptr EMSxbufDS.dst_offset
308 022F E8 FE98 call XlatProc
309 0232 89 3E 009E mov word ptr EMSxbufDS.dst_offset, di
310 0236 8C 06 00A0 mov word ptr EMSxbufDS.dst_offset+2, es
311
312 023A XlatingDone:
What appears to be a structure, EMSxbufDS, is actually an equate to a memory location that gets relocated into the program's PSP when it loads, so I had to jiggle that around a little in this section. The previous EMSxbuf symbol had a hard-wired CS: override, so that had to be changed to DS: in recognition that DS was actually usable in this section of code now.
Notice how the CS: override bytes are gone in this section of code now that these adjustments were made. Let the assembler's listing file guide you, like it did in this example from FAKEHI.
If you see a section of code with a bunch of segment override bytes associated with it, there may be a way to eliminate them by making DS available like I did in this new version of FAKEHI.
There's a unique situation that arises occasionally when we'd like to test a particular bit in the AH register and then make a conditional jump based on it. The SAHF (store AH into flags) instruction will store the AH value into the low byte of the flags register for us.
The bits in the low byte of the flags register that correspond to testable bits in the AH register are as follows:
Low flags byte: (msb)  SF  ZF  ##  AF  ##  PF  ##  CF  (lsb)
AH Register:    (msb)  b7  b6  b5  b4  b3  b2  b1  b0  (lsb)
## = this flags bit is reserved; SAHF ignores the corresponding AH bit.
So, knowing this correspondence often allows us to replace code that might look like this:
(3 bytes)       test    ah,00000001b
(2 bytes)       jnz     Somewhere
with something like this:
(1 byte)        sahf
(2 bytes)       jc      Somewhere       ; Jump if bit 0 in AH is set
Similarly, if we wanted to test bit 6 in AH we could do this:
(1 byte)        sahf
(2 bytes)       jz      Somewhere       ; Jump if bit 6 in AH is set
These situations aren't going to occur all the time, but when they do, this trick is usually good for a 2 byte savings.
Whenever you see code that's adding or subtracting one to or from a register or memory location, check and see if the code following the ADD or SUB instruction depends on the carry flag. If the subsequent code isn't dependent on the state of the carry flag, then it's a safe bet that the ADD or SUB of one can be replaced with an INC or DEC instruction.
- In USE16 code doing an INC/DEC on one of the 16 bit registers is always a one byte instruction. In USE32 code, doing an INC/DEC on one of the 32 bit registers is always a one byte instruction.
- ADD'ing or SUB'ing one to a 16 bit register in USE16 code will be 3 byte instructions, so even doing 2 INC's or DEC's will be a byte smaller than adding or subtracting 2 to or from the register.
- In USE32 code, ADD'ing and SUB'ing small values to a 32 bit register are also 3 byte instructions. (The byte counts are sketched out after these notes.)
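Here are those byte counts sketched out in USE16 code:

        add     cx,1            ; 3 bytes (83h 0C1h 01h)
        inc     cx              ; 1 byte (41h)

        add     cx,2            ; 3 bytes (83h 0C1h 02h)
        inc     cx              ; Two INC's total 2 bytes - still a byte
        inc     cx              ; smaller than the ADD of 2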
Often we'll see some code that looks like this:
(2 bytes) inc al
If we could stipulate that the value in AL was known to be less than 255, guaranteeing that no overflow into AH is possible, then this INC AL could be replaced with an INC AX which is a one byte instruction in USE16 code. If the value in AH is expendable, then an INC/DEC of AX may be possible if the resultant flags don't matter.
Sometimes people's minds get stuck thinking about 8 bit values and they don't recognize when AH is in fact expendable and an INC/DEC of a 16 bit register could have been used.
This same concept can be applied in USE32 code. If we know the value in AL is in no danger of overflowing into AH (or an overflow doesn't matter), then the INC AL can be replaced with an INC EAX.
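In byte terms (assuming the overflow into AH either can't happen or doesn't matter):

        inc     al              ; 2 bytes (0FEh 0C0h)
        inc     ax              ; 1 byte (40h) in USE16 code
        inc     eax             ; 1 byte (40h) in USE32 code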
Another possible opportunity exists in 32 bit code too. There, someone might be trying to INC the AX register, or ADD something to it. This occurs a lot in older 8086 or 286 code that's been ported to take some advantage of the 386, but hasn't been totally reworked with the 386 in mind. In cases like this, accessing the AX register from USE32 code incurs the penalty of a 66h size override prefix byte on the instruction. If you can stipulate that the value in AX won't overflow into the high order word of EAX, or that it would be harmless if it did happen, then doing the operation on EAX rather than AX will eliminate the size override prefix byte. (A sketch follows the notes below.)
- One advantage of tweaking situations like this in USE32 code is that the transformed code will run a lot faster on 486's and Pentiums.
- Size override prefixes really mess up the CPU pipelines on the 486 and Pentium.
- Instructions with size override prefixes hurt especially bad on the Pentium as they can't be "paired", which cripples its superscalar dual-pipeline hardware.
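Here's the prefix cost sketched out for USE32 code:

        inc     ax              ; 2 bytes (66h 40h) - the operand size prefix
                                ; also blocks pairing on the Pentium
        inc     eax             ; 1 byte (40h) - no prefix, pairs normally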
Another situation with ADD/INC that crops up frequently is when we're reworking some older 8086 or 286 code to take advantage of a 386's capabilities. Often we'll see sequences of 8088 code that look like this:
foo     dd      ?
        ...blah, blah, blah...
        add     word ptr foo, 1
        adc     word ptr foo+2, 0
The idea in this fragment is to add one to the double word memory variable 'foo'. When doing 386 conversions on older code you'd usually want to replace that sequence with:
inc dword ptr foo
The only exception would be if for some reason the state of the carry flag matters after the sequence. An INC won't alter the carry flag, but the ADD/ADC sequence will. In those odd situations, you'd want to do an ADD of 1 to the variable.
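That exceptional case would be:

        add     dword ptr foo, 1        ; Sets CF on a rollover from
                                        ; 0FFFFFFFFh, where an INC would
                                        ; leave CF untouched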
Suppose we've got a situation where we've got two structures and one is being copied to the other, with the exception of some fields in the middle of the structures. We'll often see code that looks like this example of some "flat" 32 bit code.
        lea     esi, SourceStructure
        lea     edi, DestinationStructure
        mov     ecx,8
        rep     movsd           ; Copy 8 dwords
        add     edi, 4          ; Skip the next dword field (3 byte instruction)
        add     esi, 4          ; Skip the next dword field (3 byte instruction)
        mov     cl,4            ; ECX was zero from REP MOVSD
        rep     movsd           ; Copy 4 more dwords
        add     edi, 2          ; Skip the next word field (3 byte instruction)
        add     esi, 2          ; Skip the next word field (3 byte instruction)
        movsd                   ; Move the last field
If this isn't a particularly speed critical section of code, we could replace it with this sequence:
        lea     esi, SourceStructure
        lea     edi, DestinationStructure
        mov     ecx,8
        rep     movsd           ; Copy 8 dwords
        cmpsd                   ; Skip dword field (1 byte instruction)
        mov     cl,4            ; ECX was zero from REP MOVSD
        rep     movsd           ; Copy 4 more dwords
        cmpsw                   ; Skip word field (2 byte instruction in USE32 code)
        movsd                   ; Move the last field
- Since we know that ESI and EDI are pointing to valid memory locations and won't cause a Trap-D if memory is accessed via them, the CMPSB, CMPSW, and CMPSD instructions will perform the additions to ESI and EDI that we were looking for.
- In the above example, we saved 9 bytes using this trick.
In this example, we can knock off a couple more bytes by realizing that the count of 8 initially being loaded into ECX is a small number. That MOV immediate to ECX is a 5 byte instruction. If we replaced it with:
        push    8
        pop     ecx
we'd have a 3 byte sequence.
When space is at a premium and we see the sequence:
        dec     cx              ; or DEC ECX in 32 bit code
        jnz     SomeLoopTop
it can usually be replaced by a LOOP instruction, which will be a byte smaller. After all, what LOOP does is decrement CX or ECX and jump if the result is non-zero.
The one thing to watch out for when doing this transformation is code that depends on the state of the flags after the DEC instruction. LOOP doesn't alter the flags, but DEC does.
Don't do this transformation in very speed critical code that may be running on 386 or later CPU's. The DEC CX/JNZ sequence is actually faster than a LOOP on many of the newer CPU's.
Here's a little Zen Timer program to see how LOOP versus DEC/JNZ performs:
        call ZtimerOn
        mov cx, 32000
        even
loop1:
        loop loop1
        call ZtimerOff
        call ZTimerReport

        call ZtimerOn
        mov cx, 32000
        even
loop2:
        dec cx
        jnz loop2
        call ZtimerOff
Running this on several different test machines yielded these results (all timings are in microseconds):
                    LOOP        DEC CX/JNZ
Pentium 60mhz        4255          2659
386DX-20            19754         18122
386SX-25            18063         16705
486SX-25             8955          5118
486DX2-66           10574         17302
The timings for the 486DX2-66 looked a little out of place, so I went back and changed the "EVEN" directive in the test to an "ALIGN 16" to make sure the loop tops were paragraph aligned. Curiously, this didn't change anything for the 486SX-25, but it made a big difference on the 486DX2-66! With the paragraph aligned loop top, the 486DX2-66's timing on the DEC CX/JNZ sequence was 9613 microseconds. The timing for the sequence using the LOOP instruction stayed the same.
What's really odd here is that the 486DX2-66 is getting beat by the 486SX-25 on this test - even when everything was paragraph aligned for the 486DX2-66. I know from past benchmarking of the motherboard the 486DX2-66 sits in that it's a fairly slow motherboard compared to other systems in the same class. Its external cache hardware is less effective than that of most other 486DX2-66 systems. Even so, the looping code should be parked in the CPU's internal cache for this test. The 486SX-25 motherboard has no external cache at all, so there's no possibility that it's a factor in its timings.
By the way, Appendix B has some further Zen Timer investigation of the strange performance characteristics of my 486DX2-66. The DX2 is indeed an odd beast.
The XCHG instruction can be a powerful weapon for squeezing bytes. When XCHG is used with AX (or EAX in 32 bit code) and one of the other registers, then XCHG is a one byte instruction, while moving one register to another is a two byte instruction. For example:
        XCHG EAX,EDX    ; This is 1 byte in 32 bit code.
        XCHG SI,AX      ; This is 1 byte in 16 bit code.
        MOV EAX,EDX     ; This is 2 bytes in 32 bit code.
        MOV SI,AX       ; This is 2 bytes in 16 bit code.
In fact, the one byte NOP instruction's encoding (90h) is actually that of an XCHG (E)AX,(E)AX instruction.
Suppose we've got an instruction sequence like this one:
        mov ax,dx
        mov dx,Something
It wants AX to get DX's value, and then DX gets reloaded right afterward. Sequences like that can be transformed with XCHG to good effect like this:
        xchg ax,dx          ; AX=DX, DX=don't care
        mov dx,Something
This sequence is a byte smaller than the one using the MOV we just saw.
When using XCHG as a destructive move like this, I recommend adding a comment similar to that up above. This helps to clarify what the code is really doing. Saying that DX was a "don't care" makes it crystal clear that you're using the XCHG as a destructive one byte move. Comments like this are particularly useful when the instruction that reloads the destroyed register isn't right next to the XCHG instruction like in the examples.
Another stylistic thing that helps when using XCHG as a destructive move is to order the operands as if it were a MOV instruction like in the above examples. That XCHG example above could have been written as:
xchg dx,ax
The code would work perfectly well, but this form just doesn't seem as clear to me.
With operands ordered like a MOV's and with a comment added, there's no question about what's going on in the code.
The LEA instruction is good for loading effective addresses, but there's no requirement that the values it deals with actually be addresses. LEA also works well as a multi component arithmetic instruction.
Suppose we have an instruction sequence in some USE16 code like this one:
(2 bytes)   mov ax, si
(2 bytes)   add ax, bx
(3 bytes)   add ax, 1234h
That sequence can be reduced to a single LEA instruction:
(4 bytes) lea ax, [bx+si+1234h]
TIP: The LEA instruction doesn't alter the flags. This can often be a very convenient behavior.
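As a quick illustration (a made-up fragment, not from any particular program), LEA can bump a pointer between a compare and its conditional jump without disturbing the result of the compare:

        cmp al,[si]         ; The compare sets the flags
        lea si,[si+2]       ; Advance SI - an ADD here would zap the flags
        je  Matched         ; The jump still tests the CMP's result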
The real power of the LEA instruction for compacting code shows up in 386 specific code. With the 386 came the ability to have the EAX, ECX, and EDX registers participate in an LEA. The 386 also added the ability to have a scaling factor of 2, 4, or 8 applied to one of the component registers, like in this next example assembler listing:
 1                                      .386
 2  00000000                        x32 segment use32 para public 'code32'
 3                                      assume cs:x32
 4  00000000  66| 8D 84 88 +            lea ax,[eax+ecx*4+1234h]
 5            00001234
 6
 7  00000008                        x32 ends
 8
 9  0000                            x16 segment use16 para public 'code32'
10                                      assume cs:x16
11  0000      67| 8D 84 88 +            lea ax,[eax+ecx*4+1234h]
12            00001234
13  0008      51                        push cx
14  0009      C1 E1 02                  shl cx,2
15  000C      03 C1                     add ax,cx
16  000E      59                        pop cx
17  000F      05 1234                   add ax,1234h
18  0012                            x16 ends
19                                      end
Here we've got USE32 and USE16 code segments with a couple of LEA instructions that are doing a lot of work. The LEA's on lines 4 and 11 are using the scaling feature of LEA on the 386 to do a multiply by 4 on the value in the ECX register during the additions that LEA does.
If we had to get the value of CX * 4 on a pre-386 CPU, while not destroying CX, we would be forced into doing something like the sequence starting on line 13 that approximates the behavior of the LEA on line 11. If it were stipulated that the flags couldn't be altered in that sequence, then a PUSHF/POPF costing 2 bytes would need to be wrapped around the whole thing too.
Even without the PUSHF/POPF, using the new 386 form of the LEA turned out to be 2 bytes shorter than the sequence starting on line 13.
The previous example's LEA's are kind of worst case versions too. They're causing a 4 byte embedded constant 00001234h to be generated in the instructions (see lines 5 and 12). The LEA instruction can also work with small sign extended constants like in this next listing:
 1                                      .386
 2  00000000                        x32 segment use32 para public 'code32'
 3                                      assume cs:x32
 4  00000000  66| 8D 44 88 33           lea ax,[eax+ecx*4+33h]
 5
 6  00000005                        x32 ends
 7
 8  0000                            x16 segment use16 para public 'code32'
 9                                      assume cs:x16
10  0000      67| 8D 44 88 33           lea ax,[eax+ecx*4+33h]
11  0005      51                        push cx
12  0006      C1 E1 02                  shl cx,2
13  0009      03 C1                     add ax,cx
14  000B      59                        pop cx
15  000C      05 0033                   add ax,33h
16  000F                            x16 ends
17                                      end
In this example listing the LEA's are 3 bytes shorter than in the previous listing because the constant 33h fits in a sign extended byte. Now the LEA in the USE16 segment is half the size of the equivalent sequence that was forced to preserve the value in CX.
In the USE32 version of LEA on line 4 notice the 66h prefix byte on the instruction. This is the cost of specifying AX as a target in USE32 code. If we could stipulate that the high word of EAX was expendable, and just do the LEA using EAX as a target, then the instruction would be a byte smaller.
Another interesting property of the 386's new 32 bit form of LEA is that the "index" and "base" register can be the same thing. This opens up all sorts of possibilities for doing multiply operations on what previously were inconvenient numbers - like, for example, 10. If we wanted to multiply EAX by 10, one straightforward way would be:
(5 bytes)   mov edx,10
(2 bytes)   mul edx             ; EDX:EAX = EAX*10
That's 7 bytes, and kind of porky if we're trying to save space. We could get it down to 5 bytes with the PUSH/POP trick examined in a prior chapter like this:
(2 bytes)   push 10
(1 byte)    pop edx
(2 bytes)   mul edx
5 bytes is better, but there's still some potentially undesirable side effects. EDX is getting clobbered by the multiply because a 64 bit result is being generated. If we know the numbers involved happened to be small and we weren't going to get any significant result in EDX, then we just burned a register for nothing.
Suppose we used an LEA here like this though:
(3 bytes)   lea eax,[eax+eax*4]     ; EAX = EAX*5
(2 bytes)   add eax,eax             ; EAX = EAX*10
This is still 5 bytes, but we've avoided burning the EDX register this time. There's an important performance advantage to this last sequence too. MUL instructions are really expensive, even on the Pentium, where they cost about 10 or 11 clock cycles. On a 486, a MUL can cost from as little as 13 clocks all the way up to 42 clocks for a 32 bit multiply.
486's and Pentiums can knock off an LEA instruction in a single cycle though! The register add of EAX,EAX in the last code fragment only costs a single cycle on a 486 and Pentium too.
So, anytime you can replace an expensive unsigned multiply (MUL) with a short sequence of LEA's and ADD's, you're going to be winning bigtime on the newer CPU's.
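To show the pattern, here are a few more multiplies by inconvenient constants built from nothing but LEA's (sketches of my own, not tuned for any particular program):

        lea eax,[eax+eax*2]     ; EAX = EAX*3
        lea eax,[eax+eax*8]     ; EAX = EAX*9

and chaining them works too:

        lea eax,[eax+eax*4]     ; EAX = EAX*5
        lea eax,[eax+eax*2]     ; EAX = EAX*15 (5*3)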
If you know for certain that you could live with doing a signed multiply, then the way to go for small numbers would be an IMUL instruction. IMUL has a special form that allows multiplying a register by an immediate value. For example:
(3 bytes) imul eax,10
This is 2 bytes better than the LEA/ADD combination, but it does do a signed multiply!
One situation that crops up with annoying regularity in code is needing to do a conditional jump based on a register or memory variable being zero or not, but where the state of the flags needs to be preserved across the conditional jump. This frequently happens when functions give a return value in the ZF or CF flag and the code is running with interrupts disabled. An OR or CMP followed by a JZ/JE or JNZ/JNE would alter the flags, so often we'll see code that does some sort of twiddle involving PUSHF/POPF to preserve the flags like this code fragment:
IFsensitive proc near
            cli
            ;
            ; Some uninterruptable code is here
            ;
(1 byte)    pushf
(2 bytes)   xor ax,ax
(4 bytes)   cmp byte ptr Flag,al
(2 bytes)   je ContinueAX0
ContinueAX1:
(1 byte)    inc ax
ContinueAX0:
(8 bytes)   POPFF
            ;
            ; More uninterruptable code is here
            ;
            sti
            ret
IFsensitive endp
Notice the POPFF macro that was used rather than a straight POPF instruction. The stipulation was that interrupts are disabled, so using a straight POPF in this section of code would have made it vulnerable to the POPF bug that exists on early stepping levels of the 80286 chips. POPFF macros are typically gruesome in terms of performance and size. There's a CALL, a couple of PUSH's, a JMP, and an IRET (a very slow instruction) involved. A typical POPFF macro implementation might look something like this 8 byte sequence:
POPFF macro
        pushf       ;; Get the flags
        push cs     ;; and a far return
        call $+5    ;; address on the stack, then near CALL an IRET
        jmp $+3     ;; Continue on with main flow of the code after the IRET
        iret        ;; This pops the far return address and the flags
endm
Realizing that JCXZ can do a conditional jump without needing to zap the flags would allow us to ditch the POPFF macro and rewrite that code like this:
IFsensitive proc near
            cli
            ;
            ; Some uninterruptable code is here
            ;
(1 byte)    push cx
(3 bytes)   mov ax,0
(2 bytes)   mov cx,ax
(4 bytes)   mov cl,byte ptr Flag
(2 bytes)   jcxz ContinueAX0
ContinueAX1:
(2 bytes)   mov al,1
ContinueAX0:
(1 byte)    pop cx
            ;
            ; More uninterruptable code is here
            ;
            sti
            ret
IFsensitive endp
Even with all the jerking around to setup the CX register and save/restore it, this new sequence beat the one that used the POPFF macro by 3 bytes. It's going to run a lot faster too. Even in real mode or V86 mode on a Pentium, an IRET alone takes 8 clocks.
Suppose we had a couple of C functions that were accessing a global memory variable like this:
int x;

void foo(int a, int b, int c, int d, int e)
{
    if (a) x++;
    if (b) x++;
    if (c) x++;
    if (d) x++;
    if (e) x++;
}

void foo2(int a, int b, int c, int d, int e)
{
    register int tempx = x;

    if (a) tempx++;
    if (b) tempx++;
    if (c) tempx++;
    if (d) tempx++;
    if (e) tempx++;
    x = tempx;
}
Both of these functions have well defined entry and exit points. It would be perfectly valid for a C compiler to transform the code for foo() into the way foo2() is written for the purposes of generating machine code. The semantics of the code would be the same in both cases.
I compiled those functions with the 32 bit Watcom version 10.0 compiler using the -s (omit stack probes) and -os (favor space) switches and got this result though:
Module: D:\TESTS\track.c
Group: 'DGROUP' CONST,CONST2,_DATA,_BSS

Segment: _TEXT BYTE USE32 0000006b bytes
0000  55                    foo_    push ebp
0001  89 e5                         mov ebp,esp
0003  85 c0                         test eax,eax
0005  74 06                         je L1
0007  ff 05 00 00 00 00             inc dword ptr _x
000d  85 d2                 L1      test edx,edx
000f  74 06                         je L2
0011  ff 05 00 00 00 00             inc dword ptr _x
0017  85 db                 L2      test ebx,ebx
0019  74 06                         je L3
001b  ff 05 00 00 00 00             inc dword ptr _x
0021  85 c9                 L3      test ecx,ecx
0023  74 06                         je L4
0025  ff 05 00 00 00 00             inc dword ptr _x
002b  83 7d 08 00           L4      cmp dword ptr +8H[ebp],00000000H
002f  74 06                         je L5
0031  ff 05 00 00 00 00             inc dword ptr _x
0037  5d                    L5      pop ebp
0038  c2 04 00                      ret 0004H

003b  56                    foo2_   push esi
003c  55                            push ebp
003d  89 e5                         mov ebp,esp
003f  89 c6                         mov esi,eax
0041  a1 00 00 00 00                mov eax,_x
0046  85 f6                         test esi,esi
0048  74 01                         je L6
004a  40                            inc eax
004b  85 d2                 L6      test edx,edx
004d  74 01                         je L7
004f  40                            inc eax
0050  85 db                 L7      test ebx,ebx
0052  74 01                         je L8
0054  40                            inc eax
0055  85 c9                 L8      test ecx,ecx
0057  74 01                         je L9
0059  40                            inc eax
005a  83 7d 0c 00           L9      cmp dword ptr +0cH[ebp],00000000H
005e  74 01                         je L10
0060  40                            inc eax
0061  a3 00 00 00 00        L10     mov _x,eax
0066  5d                            pop ebp
0067  5e                            pop esi
0068  c2 04 00                      ret 0004H

No disassembly errors
Remember what I said previously about looking at listing files for "wide" instructions? The generated code for foo() is 3Ah bytes long and has a bunch of awfully wide instructions there doing memory increments on the variable "x".
The generated code for foo2() where we introduced the temp register variable that's tracking the value of "x" through the life of the foo2() code is a lot better. It's only 2Fh bytes long (11 bytes smaller than foo()) and potentially touches memory a lot less depending on what kind of values are passed as parameters to foo2().
The same thing we tricked the Watcom compiler into doing can often be applied to assembler routines. Beginning programmers, or someone who is just in a hurry, will frequently treat the CPU as if it didn't have any registers. The result is fatter and slower code in many cases.
The rule of thumb here is: when you see a routine referencing a memory variable more than two times it may be a candidate for a transformation like the one above if there's a register free. Often it'll pay to force a register to be free by PUSH'ing and POP'ing something at the entry and exit to the routine. Remember, PUSH and POP are one byte instructions for registers, and on the newer CPU's, like the 486's and Pentiums, PUSH/POP run pretty fast too.
Don't hesitate to use PUSH/POP when accesses to memory variables are grouped together in a routine and registers are scarce. You may encounter code that looks something like this:
Var1    dd ?
Var2    dd ?
Var3    dd ?
Var4    dd ?

(10 bytes)  mov Var1,1234
( 6 bytes)  add Var1,ebx
( 7 bytes)  shl Var1,2
(10 bytes)  mov Var2,5678
( 6 bytes)  add Var2,ebx
( 7 bytes)  shl Var2,2
(10 bytes)  mov Var3,4321
( 6 bytes)  add Var3,ebx
( 7 bytes)  shl Var3,2
(10 bytes)  mov Var4,8765
( 6 bytes)  add Var4,ebx
( 7 bytes)  shl Var4,2
( 5 bytes)  mov eax, Var1
( 6 bytes)  add Var2, eax

(103 bytes total)
Even if there's only one register available here for use, say EBP, it'll pay to "Scope" the value of Var1 onto the stack with a PUSH because it's going to be referenced later on where Var1 gets added to Var2 near the end of the sequence.
Tracking Var1, Var2, Var3, and Var4 all in EBP would look like this:
(1 byte)    push ebp            ; Save EBP.
(5 bytes)   mov ebp,1234        ; EBP is tracking Var1 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(1 byte)    push ebp            ; Var1 is on the stack now
(5 bytes)   mov ebp,5678        ; EBP is tracking Var2 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(6 bytes)   mov Var2,ebp        ; Var2 is in memory now.
(5 bytes)   mov ebp,4321        ; EBP is tracking Var3 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(6 bytes)   mov Var3,ebp        ; Var3 is in memory now.
(5 bytes)   mov ebp,8765        ; EBP is tracking Var4 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(6 bytes)   mov Var4,ebp        ; Var4 is in memory now.
(1 byte)    pop ebp             ; EBP is tracking Var1 again
(6 bytes)   add Var2,ebp        ;
(6 bytes)   mov Var1,ebp        ; Var1 is in memory now.
(1 byte)    pop ebp             ; Restore EBP value

(74 bytes total)
Not too bad. We saved 29 bytes by tracking the variables in a register. This code is going to run a lot faster and be easier on the working set of whatever program it lived in.
Is this as small as it can get though? No. We can also apply the concept of variable pooling introduced earlier to get something like this:
VarPool label dword
V1off   equ $-VarPool
Var1    dd ?
V2off   equ $-VarPool
Var2    dd ?
V3off   equ $-VarPool
Var3    dd ?
V4off   equ $-VarPool
Var4    dd ?

(1 byte)    push ebp            ; Save EBP.
(1 byte)    push edi            ; Save EDI.
(6 bytes)   lea edi,VarPool     ; Aim EDI at the variable pool
(5 bytes)   mov ebp,1234        ; EBP is tracking Var1 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(1 byte)    push ebp            ; Var1 is on the stack now
(5 bytes)   mov ebp,5678        ; EBP is tracking Var2 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(3 bytes)   mov [edi+V2off],ebp ; Var2 is in memory now.
(5 bytes)   mov ebp,4321        ; EBP is tracking Var3 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(3 bytes)   mov [edi+V3off],ebp ; Var3 is in memory now.
(5 bytes)   mov ebp,8765        ; EBP is tracking Var4 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(3 bytes)   mov [edi+V4off],ebp ; Var4 is in memory now.
(1 byte)    pop ebp             ; EBP is tracking Var1 again
(3 bytes)   add [edi+V2off],ebp ;
(3 bytes)   mov [edi+V1off],ebp ; Var1 is in memory now.
(1 byte)    pop edi             ; Restore EDI value
(1 byte)    pop ebp             ; Restore EBP value

(67 bytes total)
Using the variable pool technique knocked another 7 bytes off for 67 total. Now this code is starting to get fairly tight for USE32 code, which normally tends toward plumpness. If we could stipulate that the value in ECX could be destroyed in this section of code, we could do CL based shifts rather than the immediate shifts being used now. That would be good for another two byte savings. The SHL EBP,CL instruction will be 2 bytes versus 3 for the immediate shifts. There are 4 shifts for a total of 12 bytes currently. Using CL shifts would be 8 bytes, plus another 2 to load the value 2 into CL with a MOV instruction.
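A sketch of the CL based version of one of those shifts, assuming ECX really is expendable here:

(2 bytes)   mov cl,2            ; Load the shift count once up front
            ...
(2 bytes)   shl ebp,cl          ; 2 bytes apiece versus 3 for SHL EBP,2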
When interfacing C/C++ code with assembler modules, we'll often see the assembler code setup a "standard" stack frame just like a C compiler would to get at any pushed parameters. In some cases this isn't necessary though and just bloats and slows down the code.
Here's a C test program and 4 assembler functions that demonstrate the differences in performance and size of using a "standard" frame to access parameters versus popping them off the stack and returning via a JMP.
/*----------------------------------------------------
   SPTEST.C

   Demonstration of performance costs of different
   calling conventions and ways of accessing
   parameters on the stack.
----------------------------------------------------*/
#include <stdio.h>
#include <time.h>

typedef unsigned short USHORT;

extern USHORT near SplitNibbles(USHORT val);
extern USHORT near SplitNibbles2(USHORT val);
extern USHORT near pascal PasSplitNibbles(USHORT val);
extern USHORT near pascal PasSplitNibbles2(USHORT val);

#define ITERATIONS 10000000L    /* 10M */

int main(void)
{
    unsigned long i;
    time_t start,stop,elapsed;

#define TESTLOOP for (i=0L; i<ITERATIONS; i++)

    /* POP's and JMP using "standard" C conventions */
    start = time(NULL);
    TESTLOOP
    {
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using POP's and JMP          = %lu seconds\n",
           (unsigned long)elapsed);

    /* BP frame access using "standard" C conventions */
    start = time(NULL);
    TESTLOOP
    {
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using standard frame         = %lu seconds\n",
           (unsigned long)elapsed);

    /* POP's and JMP using "pascal" conventions */
    start = time(NULL);
    TESTLOOP
    {
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using POP's and JMP(Pascal)  = %lu seconds\n",
           (unsigned long)elapsed);

    /* BP frame access using "pascal" conventions */
    start = time(NULL);
    TESTLOOP
    {
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using standard frame(Pascal) = %lu seconds\n",
           (unsigned long)elapsed);

    return 0;
}
;--------------------------------------------
; SPLIT.ASM
;
; Support functions to show the performance
; difference with different methods of
; accessing parameters and different calling
; conventions.
;--------------------------------------------
_TEXT segment use16 para public 'code'
;
; This function uses a "standard" stack
; frame to get at its parameters and is
; called using "standard" calling conventions
;
        align 16
_SplitNibbles proc near
        public _SplitNibbles
        push bp             ; 1 byte
        mov bp,sp           ; 2 bytes
        mov ax,[bp+4]       ; 3 bytes
        db 0D4h,10h         ; 2 bytes
        pop bp              ; 1 byte
        ret                 ; 1 byte
_SplitNibbles endp
;
; This function uses a "standard" stack
; frame to get at its parameters and is
; called using "pascal" calling conventions
;
        align 16
PASSPLITNIBBLES proc near
        public PASSPLITNIBBLES
        push bp             ; 1 byte
        mov bp,sp           ; 2 bytes
        mov ax,[bp+4]       ; 3 bytes
        db 0D4h,10h         ; 2 bytes
        pop bp              ; 1 byte
        ret 2               ; 3 bytes
PASSPLITNIBBLES endp
;
; This function uses the POP trick to
; get at its parameters and is called
; using "standard" calling conventions
;
        align 16
_SplitNibbles2 proc near
        public _SplitNibbles2
        pop cx              ; 1 byte
        pop ax              ; 1 byte
        push ax             ; 1 byte
        db 0D4h,10h         ; 2 bytes
        jmp cx              ; 2 bytes
_SplitNibbles2 endp
;
; This function uses the POP trick to
; get at its parameters and is called
; using "pascal" calling conventions
;
        align 16
PASSPLITNIBBLES2 proc near
        public PASSPLITNIBBLES2
        pop cx              ; 1 byte
        pop ax              ; 1 byte
        db 0D4h,10h         ; 2 bytes
        jmp cx              ; 2 bytes
PASSPLITNIBBLES2 endp

_TEXT ends
        end
I built the SPTEST.C program with the Microsoft version 8.0 C/C++ compiler and ran two trials on several test machines. The results were interesting. All timings are in seconds.
                        386SX-25  486SX-25  60mhz P5  486DX2-66  386DX-20
POP/JMP(Std C)          270/271   126/127   47/47     239/239    332/333
Std frame(Std C)        327/326   137/137   51/52     246/246    359/359
POP/JMP(Pascal)         229/229   112/112   48/48     207/206    302/302
Std frame(Pascal)       264/265   128/127   57/57     242/243    343/343
The POP/JMP scheme looks to be a winner when either Pascal or standard C calling conventions are used, with the Pascal calling convention form being the fastest (and the smallest too!).
What's most notable here is the fairly dramatic speed differences between standard C calling convention with standard stack frame access versus POP/JMP with Pascal calling conventions.
The 386SX-25 shows just over a 40% difference between the best and the worst, while the Pentium is only about 6%. The 486's and 386DX are in the middle around 20%.
The 80186 and subsequent processors have the ability to push and pop all the general registers with single byte instructions. The NEC V20/V30 can do this too by the way. When saving space is of paramount importance, this ability can come in very handy.
Often we'll see a code sequence that looks like this:
(1 byte)    push ax
(1 byte)    push si
(1 byte)    push di
            ;
            ; Some code is here that causes AX,SI, and DI
            ; to be destroyed, but they need to be preserved.
            ;
(1 byte)    pop di
(1 byte)    pop si
(1 byte)    pop ax
That sequence works OK, but if we're willing to pay the price for sloshing all the general registers on and off the stack it can be reduced to this:
(1 byte)    pusha
            ;
            ; Some code is here that causes AX,SI, and DI
            ; to be destroyed, but they need to be preserved.
            ;
(1 byte)    popa
This transformation saved 4 bytes over the version that pushed and popped the affected registers individually. Naturally we wouldn't want to employ a change like this one if one or more of the general registers were being used for passing back return values.
While the space savings are excellent with PUSHA/POPA, the performance cost can be pretty hefty if only a few registers need to be saved and restored. This ZenTimer program demonstrates the cost differences:
        .186

; PUSH/POP of 2 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 3 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 4 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 5 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        push dx
        pop dx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 6 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        push dx
        push bp
        pop bp
        pop dx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 7 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        push dx
        push bp
        push ax
        pop ax
        pop bp
        pop dx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSHA/POPA
        align 4
        nop
        call ZtimerOn
        rept 1000
        pusha
        popa
        endm
        call ZtimerOff
Running this program on several test CPU's yielded the following results. All of these timings are in microseconds.
             NEC V20    386SX-25  386DX-20  486SX-25  60mhz Pentium
2 push/pop     11602         714       926       440             70
3 push/pop     17467        1145      1430       743            110
4 push/pop     23313        1533      1894       993            158
5 push/pop     29093        1898      2396      1242(*)         182
6 push/pop     35198(*)     2260      2855      1480            320(*)
7 push/pop     40730        2648(*)   3352(*)   1744            352
pusha/popa     32490        2289      2934      1133            299

(*) - break even point where PUSHA/POPA outperform separate pushes and pops.
It's interesting that on all these CPU's there came a point where a PUSHA/POPA eventually outperformed separate pushes and pops. If performance is paramount in an application, I'd say picking a threshold of about 6 registers is a good compromise when choosing between separate pushes/pops or a pusha/popa. If the code is in a low performance path, then you may want to employ pusha/popa heavily to cut the size of the code. As always, let a profiler be the guide for where to apply transformations like this. Porting older 8086 code to take advantage of the newer CPU's features is where you'll likely see lots of opportunities for using pusha/popa.
Suppose we had a function that wanted to give a return value in one of the general purpose registers? Does this preclude using PUSHA/POPA as a short way to save and restore the other registers? Not necessarily!
There's no reason why we can't insert the value to be returned onto the stack so it will be popped off with the POPA. Let's take EAX as an example in some 32 bit code. We could do something like this:
(4 bytes)   mov [esp+28],eax
            popad
            ret
Even though the MOV costs 4 bytes, there may still be a space payoff if more than 3 registers would have needed to be pushed and popped otherwise.
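The same trick works in 16 bit code too, though since 16 bit addressing can't use SP as a base register, BP has to be pressed into service first. A sketch (PUSHA leaves the saved AX image 14 bytes above the final stack pointer, and POPA restores BP anyway):

        mov bp,sp           ; Can't address off SP directly in 16 bit code
        mov [bp+14],ax      ; Overwrite the AX image PUSHA saved
        popa                ; BP gets restored here too
        ret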
The order in which PUSHA and POPA process registers can sometimes be useful to know. Suppose a function was to return a value in the DI or EDI register? It turns out that DI/EDI is the last register that gets pushed, and the first to get popped. To return a value in DI or EDI we could use sequences like this:
            USE16 code          USE32 code
(1 byte)    pop ax              pop eax         ; Remove saved DI/EDI
(1 byte)    push di             push edi        ; Replace it with return value
            popa                popad
            ret                 ret
In this case, returning a value in DI or EDI cost only two extra bytes. This would pay off if 3 or more registers would have needed to be saved and restored otherwise.
The 80186 and subsequent processors have the ability to use immediate values for shifts, rotates, and signed multiplies. The NEC V20/V30 can do this too. When porting code written for the 8086 to the newer processors, don't forget these new features. They can save a lot of space and make the code run faster.
One common sequence seen in older 8086 code is something like this one:
(2 bytes)   shl cx,1
(2 bytes)   shl cx,1
(2 bytes)   shl cx,1
The 8086 couldn't do a CL based shift when the value to be shifted or rotated happened to be in the CX or CL register. A sequence like the one above should always be converted to use an immediate shift or rotate like this for the newer CPU's:
(3 bytes) shl cx,3
The only time you should use a CL based shift on the newer CPU's is when the shift count is unknown until runtime. One added benefit, besides being faster, is that the immediate shifts and rotates will free up the CL register for other uses. Having an extra register available for use often allows further speed and space optimizations. Even if you just moved a zero into CL and used it for zeroing byte variables in memory, it's going to result in instructions that are a byte smaller than if you had to do a move of an immediate zero to a variable.
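For example, assuming CL is now free and ByteVar is a hypothetical byte memory variable:

(2 bytes)   mov cl,0                ; Or pick the zero up as a side effect
(4 bytes)   mov byte ptr ByteVar,cl ; A byte smaller than the 5 byte MOV ByteVar,0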
One thing programmers miss all the time is realizing that doing some operation may imply that a useful result was generated as a side effect. These serendipitous side effects can often result in a significant size and speed savings if we're alert enough to notice when they exist.
Consider this little code fragment:
(2 bytes)   or ax,ax
(2 bytes)   jnz SomeWhere
(2 bytes)   xor dx,dx
By realizing that AX must have been zero for a fall through to happen on the conditional jump, we can apply the CWD trick to zero DX with a one byte instruction like this:
(2 bytes)   or ax,ax
(2 bytes)   jnz SomeWhere
(1 byte)    cwd             ; DX=0
Another one I see with annoying regularity in code is this little sequence:
(2 bytes)   xor ax,ax       ; Set return code=0
(1 byte)    clc             ; CF=0 for success
(1 byte)    ret             ; Return
By realizing that one of the side effects of XOR'ing a register with itself is that the carry flag gets cleared, we can safely eliminate the CLC and turn that 4 byte sequence into a 3 byte sequence. For an instruction that does nothing but flip a bit in the flags, CLC is pretty expensive. It's a two clock instruction on all the CPU's.
One opportunity that occurs all the time is with repeated string instructions. You'll frequently encounter sequences like this:
(3 bytes)   mov cx,5
(2 bytes)   rep movsb
(5 bytes)   mov [ByteMemVar],0
By noticing that CX was zeroed by the REP MOVSB, that sequence can always be transformed into this:
(3 bytes)   mov cx,5
(2 bytes)   rep movsb
(4 bytes)   mov [ByteMemVar],cl
In any piece of code there's going to be hundreds, or even thousands, of side effect results produced as the code runs. The more you recognize, the tighter and faster the code is going to be.
Another possibility is that we might inject beneficial side effects into some functions we call. Suppose upon examining a program we notice that a variable happens to be set to zero soon after calling a particular function most of the time. Maybe it would be useful to have that function, as a side effect, zero one of the registers on return rather than leaving garbage in it. Then we could be assured that a zero was in that register every time we came back from that function - a zero that could be used to set the memory variable with.
We'll pay a space penalty to zero the register in one place, but enjoy the benefits of knowing the register is zero on return from every place that function is called.
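Here's a sketch of the idea (SomeFunc and Var are hypothetical names). The function pays 2 bytes once; every call site that needed a zero afterward saves 2 bytes:

SomeFunc proc near
        ; ...the function's real work is here...
        xor dx,dx       ; Documented side effect: DX=0 on return
        ret
SomeFunc endp

        call SomeFunc
        mov Var,dx      ; 4 bytes versus 6 for MOV word ptr Var,0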
Frequently we'll see code sequences like this:
        call foo
        ret
If foo() is a near function and the RET is a near RET, or if foo() is a far function and the RET is a far RET, then it's usually a good bet that the sequence can be replaced by a JMP to foo() rather than the CALL. After all, foo() is going to execute a RET to get back to the caller, so foo()'s RET can do double duty and return for the caller as well.
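The transformed sequence is just:

        jmp foo         ; foo's RET now returns straight to our caller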
WARNING: One instance where you won't want to do this is when foo() is looking back onto the stack frame of its caller to grope something off of it at a particular offset. Another is when there's a possibility that foo() may not return, and the place it jumps to expects a particular stack layout.
The benefits of a transformation like this are several. The JMP will be a byte smaller than the CALL/RET sequence. The code will also run faster because fewer return addresses are being sloshed on and off the stack. The pipeline and prefetchers are happier because they only get blown away once by foo()'s RET rather than twice. On the 486's and Pentiums with their on chip caches, the JMP may also result in one less cache line load if the cache line containing the CALL/RET had been purged by the time the foo() function returns. To force a cache line load just because we needed a single RET instruction is pretty tragic for performance when it happens. Remember from the earlier chapters just how sensitive the 486's and Pentiums are regarding their on chip caches. This merging of CALL/RET into a JMP is one of the few tricks that truly is a freebie when it can be applied. The code is always smaller, and it always runs faster. What more could you ask for?
A related topic to translating CALL/RET to JMP is the far call translation to a near call. Suppose we've got some far function like these:
func1 proc far
        ; Blah, blah, blah
        ret
func1 endp

func2 proc far
        ; Blah, blah, blah
        call func1
        ; Blah, blah, blah
        ret
func2 endp
Some assemblers will generate a 5 byte far CALL instruction where func2() calls func1(). We can do better here though since we know func1() and func2() are in the same code segment. We can force a near call by doing this:
func1 proc far
        ; Blah, blah, blah
        ret
func1 endp

func2 proc far
        ; Blah, blah, blah
        push cs
        call near ptr func1
        ; Blah, blah, blah
        ret
func2 endp
The end result of this transformation is the same as if a far CALL had been done, but it's going to be a byte smaller.
By the way, some newer assemblers will recognize that the far call can in fact be transformed into the push of CS followed by a near call. Check the code your assembler is generating to see if this is something you don't need to worry about.
There's significant performance benefits to doing a transformation like this as well. The PUSH CS/near CALL sequence almost always executes faster - on some CPU's it executes a LOT FASTER.
A far call on a 286 takes 26 clock cycles in protected mode, but the 286 can execute the push followed by a near call in 10 clock cycles. In real mode a far call on a 286 takes 13 clock cycles, while the push followed by a near call takes the same 10 clock cycles.
A far call on a 386 takes just over 34 clock cycles in protected mode, but the 386 can execute the push followed by a near call in just over 9 clock cycles. In real mode a far call on a 386 takes just over 17 clock cycles, while the push followed by a near call takes the same 9+ clock cycles.
A far call on a 486 takes 20 clock cycles in protected mode, but the 486 can execute the push followed by a near call in 6 clock cycles. In real mode a far call on a 486 takes 18 clock cycles, while the push followed by a near call takes the same 6 clock cycles.
So, it should be obvious that transforming a far call whenever possible into a push CS followed by a near call is a solid winner.
Many linkers have an option to do a far call translation for you. They'll detect a far call instruction followed by a double word relocation address where the segment part of the relocation is the current segment. Recent versions of Borland's TLINK will do this by default and there's a switch to turn it off. Microsoft's LINK has the /FARCALLTRANSLATION and /NOFARCALLTRANSLATION options.
Having the linker do the translation has one downside though. Since the push CS followed by a near call is a byte shorter than a far call, the linker will typically pad the code out with a NOP to make everything line up.
This last topic is a real sleazy one and should probably be reserved for the most dire of space saving conditions because the performance penalty is really gruesome! The payoff is only one or two bytes saved for each instance it's used too. This is the kind of thing you might try when every other trick imaginable has been tried and you've still come up a little short trying to cram some code into a small ROM chip for an embedded 80186 system.
All the Intel and compatible chips from the 80186 onward implement an illegal opcode exception on Trap #6. All the chips, including the 8086, implement the Int 3 breakpoint exception as well.
An Int 3 is a special form of software interrupt in that the instruction is only one byte, whereas all the other software interrupt instructions are two bytes. Having a single byte instruction that's capable of executing an unconditional control transfer opens up some interesting possibilities.
An Int 3 could be used as a special purpose instruction to CALL a particular function that's CALL'ed from many different places in the code. A near call is 3 bytes, so using the Int 3 would save two bytes per invocation. If 100 instances of CALL's to a particular function existed in a program, making those calls via an Int 3 could save 200 bytes minus the overhead of setting up the Int 3 vector initially. The called routine could return via an IRET if the returned flags value didn't matter, or via a far RET 2 instruction if the flags were important on return. Grabbing the Int 3 vector and aiming it at this special routine during initialization isn't going to cost anywhere near 200 bytes, so there'll be a measurable space savings if there were 100 instances of calls in a program. If the called function happened to be a far function previously, then using an Int 3 to call it can save 4 bytes per invocation.
Here's an example of a Turbo/Borland C program that uses an Int 3 as a one byte CALL. It prints out a value passed to the called function in the AX register.
#pragma inline
/*---------------------------------------------------
   CALL1.C - A demonstration of how to use an Int 3
   for a one byte "call".
---------------------------------------------------*/
#include <stdio.h>
#include <dos.h>

void interrupt Function(void)
{
    printf("Function() called with AX=%04X\n", _AX);
}

int main(void)
{
    int i;
    void interrupt (*OldInt3)(void);

    /* Save old Int 3, set new Int 3 handler */
    OldInt3 = getvect(3);
    setvect(3, Function);

    /* Loop 10 times displaying loop variable in called function */
    for (i=0; i<10; i++)
    {
        /* Set AX register to the value in 'i' */
        _AX = i;

        /* This next Int 3 is a one byte instruction */
        asm int 3
    }

    setvect(3, OldInt3);
    return 0;
}
When run, this program displays the following output:
Function() called with AX=0000
Function() called with AX=0001
Function() called with AX=0002
Function() called with AX=0003
Function() called with AX=0004
Function() called with AX=0005
Function() called with AX=0006
Function() called with AX=0007
Function() called with AX=0008
Function() called with AX=0009
The code for doing calls with an illegal opcode will look very similar to that in the CALL1.C example. The only difference is you'll be hooking Int 6 instead of Int 3 and executing some illegal opcode rather than the Int 3. In the interrupt function itself one alteration will have to be made though. Before returning to the caller, the pushed return IP on the stack will have to be bumped up by however many bytes there are in the illegal instruction you've chosen.
Software interrupts, like Int 3, will push the address of the following instruction on the stack as a return address. An illegal opcode will push the address of the instruction that contains the illegal opcode.
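So an Int 6 handler has to step the stacked IP past the opcode itself before its IRET. Here's a bare-bones assembler sketch of just that adjustment (Int6Handler is a hypothetical name, and the 2 byte length is an assumption about whichever illegal opcode you picked):

Int6Handler proc far
        push bp
        mov bp,sp
        add word ptr [bp+2],2   ; Bump the pushed return IP past the opcode
        pop bp
        ;
        ; The useful work of the "called" function goes here
        ;
        iret
Int6Handler endp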
If you're working on a ROM based system, or something where you could be guaranteed that there wouldn't be any programs running that did instructions expecting a numeric processor, then the 286 and above CPU's also offer the opportunity to do a similar trick using Trap 7. Trap 7 is the NPX (numeric processor extension) not present trap. By turning on the EM bit in the MSW you can enable Trap 7's to occur on any instruction that would normally have been sent to a coprocessor if one would have been present. The FWAIT instruction is also one byte like Int 3 was.
WARNING: The illegal opcode trick won't work in a DOS box under Windows 3.X in "enhanced" mode or Windows 95. Those systems don't reflect the illegal opcode exception back into the DOS box correctly. They'll insist that the app is misbehaving and say it should be terminated.
WARNING: The Trap 7 trick won't work in a DOS box under Windows 3.X in "enhanced" mode or Windows 95. Those systems don't reflect the Trap 7 exception back into the DOS box correctly. They'll insist that the app is misbehaving and say it should be terminated.
Given the two warnings about the illegal opcode and Trap 7 trick, I can't recommend using them in any general commercial application. These ploys are strictly for things like ROM based toaster controllers and alarm monitor boxes where you have total control over the software content of the product and can be certain those caveats won't be violated.
The Int 3 trick works fine in any environment. The only caveat when using it is that it's going to cause debuggers to go berserk if you've got any problems in the program. Reserve tricks like these for last ditch space saving efforts in fairly well debugged code.
Work slowly and with a lot of care when using dastardly tricks like this.
By the way, the Int 3 trick is sometimes used in programs as one of the many and varied anti-hacking techniques people employ to try and defeat someone attempting to take their program apart with a debugger.
If you're working with some code that demands the floating point hardware, or have an application that runs in an environment where the system supplies a floating point emulator, then there's some interesting possibilities for serious space savings in recognizing the capabilities of the chip. Even the lowly old 8088 can directly do 32 or 64 bit integer arithmetic when paired with an 8087. If you've got a program that demands an '87 chip for other things, and/or is dragging around a floating point emulator already, why not use that capability to its fullest?
Opportunities abound when an '87 is present. 32 and 64 bit integer arithmetic are much simplified and smaller. Now you've got the ability to zero out up to 10 bytes of memory with a two instruction sequence like this one:
(2 bytes)   fldz
(4 bytes)   fstp tbyte ptr MemVar
If those 10 bytes were in a variable "pool" as described earlier, and were next to each other in memory, another byte can be shaved off. Say BX was aimed at the pool and the 10 bytes were at offset 18h in the pool. Then this would work:
(2 bytes)   fldz
(3 bytes)   fstp tbyte ptr [bx+18h]
5 bytes! We zeroed out 10 bytes worth of memory with only 5 bytes of code. One really cool side effect here is that we didn't have to burn a single general register accomplishing it! Just setting up to do a REP STOSW would take more than 5 bytes - and it's going to burn CX, DI, AX, and possibly ES if it isn't aimed at the right segment at the time.
A quad word of memory can be zeroed or set to one with similar instruction sequences:
        fldz
        fstp qword ptr MemVar
        fld1
        fistp qword ptr MemVar
The setting of a quad word to one can be handy when there are two "long" (i.e. dword) variables next to each other where one is to be zeroed and the other set to one. If the variables are ordered in a convenient way, then this is a viable trick.
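A sketch of that case, with hypothetical variable names, and assuming the dword to be set to 1 sits at the lower address (the low half of the qword):

CountVar dd ?           ; Ends up 1 - low dword of the qword
FlagVar  dd ?           ; Ends up 0 - high dword of the qword

        fld1
        fistp qword ptr CountVar    ; CountVar=1 and FlagVar=0 in one store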