Chapter 11 - More tricks in C and Assembler code


Tail Merging Intro
The speed cost of JMP's on different CPU's
A typical tail merge opportunity
Rippling up the merges
Add some code to save some code
Reordering PUSH/POP's to create opportunity
Middle merging
Sleazy tricks with the carry flag
Front merging
Putting JCXZ and variable entry points to use
Setting one segment register to another - The speed/space tradeoff
One instruction subroutines
Let the LST file guide you!
Destructive compares with INC/DEC
Splitting nibbles in the AL register
Joining nibbles
Jump chaining
The missed ASSUME
Testing values in AH with LAHF
INC/DEC transformations
String operations can sometimes do more than just "strings"
LOOP isn't always for 'loops'
XCHG as a destructive MOV
LEA isn't just for addresses
Using JCXZ when you can't hit the flags
Tracking a variable in a register
Do we always need a stack frame to access parameters?
PUSHA/POPA on 186 or better
Immediate values on 186 or better CPU's
Recognizing convenient prior results
Merging CALL/RET into JMP
Far call translations
Custom instructions via illegal opcode and Int 3's
That NPX Thang!
Final thoughts for this chapter


Tail merging

Tail merging is a powerful technique that can reduce the size of code quite a bit in some cases. The idea is to find common sequences of instructions at the end of several different functions and replace the duplicated code with a JMP so it only exists in one place.

Tail merging might be a performance hit on older CPU's as jumps are expensive instructions on pre-486 CPU's. The Pentium's branch prediction hardware allows it to do a branch in one cycle if the target is in the I-cache, and a 486 only takes 3 cycles if the target is in its onboard cache. As always, let a profiler guide your actions here. If the code isn't in a hot spot, then the merging is a good thing. It may even allow some code that is speed critical to remain parked in a CPU's internal cache by bringing things a little closer together. As we saw in earlier chapters, keeping heavily used sections of code and data in the 486 and Pentium's onboard cache can provide major speed wins. If things get sufficiently tighter, we may manage to pull some code back enough that a page of working set gets saved in a virtual memory system.

Use a little common sense here too. Tail merging the exits of a function that executes thousands of cycles before returning isn't going to hurt in any visible way, even if the profiler says 99% of the program's time is spent in that function. The cost of the tail merged exit is insignificant compared to the work going on in the rest of the function.

Let the purpose of the code guide your actions. Here's a real life war story:

While working on PC DOS 6.1 I decided to take a look at the PRINTER.SYS device driver to see if it could be slimmed down some. PRINTER.SYS was an important module for many customers in countries that work with code pages other than 437. PRINTER.SYS was fairly big as DOS device drivers go too. The previous DOS 5 version was in the 11.6K range when loaded. What's worse is that the customers most likely to use PRINTER.SYS are the same customers who are probably loading TSR's and drivers like KEYB.COM, NLSFUNC.EXE, and DISPLAY.SYS. These folks are getting hammered on memory compared to most users in the USA who won't need the national language support modules. Any memory relief thrown their way is usually appreciated.

Well, after doing a bunch of tail merging and a few other tricks, the PC DOS 6.1 version of PRINTER.SYS's resident size had dropped by over 900 bytes when compared to the DOS 5 version. Now it was down to around 10.7K resident. It's goodness when 900 bytes can be stripped out of a device driver that consumes that precious DOS memory below the 1M point. Some people would kill for an extra 900 bytes of memory.

Was the slimmer PRINTER.SYS in PC DOS 6.1 (and all subsequent versions of IBM's DOS) a little slower than the DOS 5 version? Probably, but I could never tell the difference and there have been no complaints from customers so far. Printers are so slow compared to CPU's that any difference just isn't going to be noticeable. Would you rather have that 900 bytes of low DOS memory back, or have a print job complete a couple of microseconds faster? I'd have to say that for 99.999% of the users of PRINTER.SYS, the answer to that question is a no-brainer.



The speed cost of JMP's on different CPU's


Let's get an idea of what jumps cost on some different CPU's so we can make rational decisions on the cost/benefit of using merging techniques. Here's a section of code designed to work within the context of Michael Abrash's Zen Timer.


;-------------------------------------------------------------
; A test program that can be run with Michael Abrash's Zen
; Timer to test what JMP instructions cost on different CPU's.
;-------------------------------------------------------------
        jmp     TestJMPs
;-------------------------------------------------------------
; IcacheFlush should cause the 8K internal cache on a 486 and
; the 8K I-cache on a Pentium to be purged of relevant data.
; On a 386, 286 or 8088 this function won't do anything.
;-------------------------------------------------------------
        align   16
IcacheFlush proc near
        db      8192 dup (90h)  ; 8K worth of NOP's
        ret
IcacheFlush endp

REPCOUNT = 800

DOJMP   macro
        local   target
        jmp     short target
        db      38 dup (0)
target:
        endm

TestJMPs label near
;------------------------------------------------------------
; Do jumps in a way that should be friendly to a 486's and
; Pentium's internal caches
;------------------------------------------------------------
        align   16
        call    IcacheFlush     ; Purge internal caches of anything useful
        call    ZtimerOn
        align   16
        rept    REPCOUNT
          jmp   $+2  ; The majority of these JMP's should run
                     ; out of the 486 or Pentium cache due to
                     ; the cache line loading.
        endm
        call    ZtimerOff
        call    ZTimerReport
;--------------------------------------------------------------
; Do jumps in a way that should be hostile to a 486 and
; Pentium's internal caches.
;--------------------------------------------------------------
        align   16
        call    IcacheFlush     ; Purge internal caches of anything useful
        call    ZtimerOn
        align   16
        rept    REPCOUNT
          DOJMP      ; These JMP's should cause a cache line
                     ; load on both 486's and Pentium most times.
        endm
        call    ZtimerOff
        call    ZTimerReport

The program does 800 jump instructions two different ways. The first batch are right next to each other in memory. The second batch is scattered around. I ran the resultant test file on several test machines and got some interesting results. All timings are in microseconds from the Zen Timer.


                   First way  Second way
  4.77mhz 8088          3027        3028
  Uncached 386SX-25      353         387
  Uncached 386DX-20      416         421
  Uncached 486SX-25      165         548
  60mhz Pentium           81         354

Isn't it amazing that the 60mhz Pentium slowed down to the same speed as a 386SX-25 when the jumps were causing misses on its I-cache! Equally amazing is that the 486SX-25 actually ran markedly slower than the clunky old 386SX when it was missing the jumps in its cache too!

LESSON: These newer CPU's are really, really, really dependent on getting hits in their on-chip caches. When you're in the cache you're humming. When you're not, you can be slogging along slower than some of the older generation CPU's without on-chip caches.



A "typical" tail merge opportunity


Suppose we've got two functions f1 and f2 that look like this:


f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        pop     di
        pop     es
        pop     ds
        ret
f2      endp

There's a common 4 byte exit sequence that happens in each of these functions:


        pop     di
        pop     es
        pop     ds
        ret

A standard tail merge transformation on these two functions would look like this:


f1      proc near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc near
        push    ds
        push    es
        push    di
        call    bar
        jmp     MergePoint1
f2      endp



Rippling up the merges


There's nothing that says you can't merge other merges as well. Suppose there's a third function, f3, involved here?


f1      proc near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc near
        push    ds
        push    es
        push    di
        call    bar
        jmp     MergePoint1
f2      endp

f3      proc near
        push    ds
        push    es
        push    di
        call    frob
        call    bar
        jmp     MergePoint1
f3      endp

We did the obvious tail merge on the pops and ret exit sequence, but there's more to be had. Notice that both f2 and f3 have this sequence in common now:


        call    bar
        jmp     MergePoint1

This lets us merge again with a second merge like this:


f1 proc near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1 endp

f2 proc near
        push    ds
        push    es
        push    di
MergePoint2:
        call    bar
        jmp     MergePoint1
f2 endp

f3 proc near
        push    ds
        push    es
        push    di
        call    frob
        jmp     MergePoint2
f3 endp

A straightforward unmerged version of the f1, f2, and f3 functions would take up 33 bytes (18 one byte PUSH/POP's, 3 one byte RET's, and four 3 byte CALL's). The fully merged version of the f1, f2, and f3 functions only takes up 26 bytes. Not bad eh? The size of this code was chopped by around 20% and it does the exact same thing it always did before the merging.



Add some code to save some code


Often we'll see a situation similar to these two little functions:


func1 proc near
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        ret
func1 endp

func2 proc near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        pop     si
        ret
func2 endp

In this case, we can't just do an obvious tail merge with a jump. However, what if we were to change "func1" so that it pushes and pops the SI register just like "func2" does? As long as the extra save/restore of SI is harmless in func1, we've created a situation where these two functions can be tail merged. The transformed code would look like this:


func1 proc near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
MergePoint1:
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        pop     si
        ret
func1 endp

func2 proc near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        jmp     MergePoint1
func2 endp

By adding 2 bytes to "func1", we were able to replace the 6 byte exit sequence in "func2" with a 3 byte JMP if MergePoint1 was more than 127 bytes away, or a 2 byte JMP if it was in range of a short JMP. The net savings would be 1 or 2 bytes.



Reordering PUSH/POP's to create opportunity


It's pretty common to see code where some functions are pushing and popping the same set of registers, just in a different order - like this:

func1 proc near
        push    dx
        push    si
        push    bx
        ; Something
        pop     bx
        pop     si
        pop     dx
        ret
func1 endp

func2 proc near
        push    si
        push    dx
        push    bx
        ; Something
        pop     bx
        pop     dx
        pop     si
        ret
func2 endp

By changing the order of the pushes and pops in "func1" so they match the order of those in "func2" we'll create a situation where we can do a tail merge:

func1 proc near
        push    si
        push    dx
        push    bx
        ; Something
MergePoint1:
        pop     bx
        pop     dx
        pop     si
        ret
func1 endp

func2 proc near
        push    si
        push    dx
        push    bx
        ; Something
        jmp     MergePoint1
func2 endp

Reordering the pushes and pops in a function is usually an OK thing to do, and doesn't affect the code size. In this case it enabled a net win of 2 bytes. The only case where reordering might not be desirable is when the function itself accesses saved register values at specific locations in the stack frame, or calls code that looks back on the stack frame and expects certain register values to be saved at specific spots.



Middle merging


Sometimes there's going to be two functions with a blob of repeated code in the middle. The obvious approach, if the entry and exit points of the repeated blob are one way in and one way out, is to just make the repeated code a function on its own. Suppose the code does things like jump to an error routine though? Extracting the code into a function will cause the stack layout to be altered, and this might cause problems if the code at the multiple exit points isn't prepared to deal with extra return addresses on the stack. There are also situations where the common code may call a routine that depends on parameters on the stack and is frame dependent. Such a situation might look something like this:

f1      proc    near
        call    foo
        mov     bar,1
        call    blat            ; Common middle code starts here
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        call    barf  ;barf() looks for a parameter on the stack and constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:                        ; Common middle code ends here
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat            ; Common middle code starts here
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        call    barf   ; barf() looks for a parameter on the stack and constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:                        ; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp

In this example, we've got both of those conditions mentioned above. There's the JMP outside the common code to "BarfBad", and the barf() function creates a stack frame to get at a parameter. In either situation we'd possibly be altering the semantics of the code by extracting the common stuff into a function of its own. This is a difficult situation for merging all of the common code. One straightforward and safe merge opportunity would be to just make this instruction sequence a function of its own:

        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw

That group of instructions is 14 bytes and doesn't depend on a stack frame. It's also straight line code that doesn't jump anywhere. Making it a function would cost the 14 bytes, plus a byte for a RET, plus 6 bytes for two CALL's to call it for a total of 21 bytes. This gives a 7 byte savings over having the group duplicated in two places. This version of the code would look like this:

CommonStuff proc near
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        ret
CommonStuff endp

f1      proc    near
        call    foo
        mov     bar,1
        call    blat            ; Common middle code starts here
        call    CommonStuff
        call    barf   ; barf() looks for parameter on stack and constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:                        ; Common middle code ends here
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat            ; Common middle code starts here
        call    CommonStuff
        call    barf      ; barf() looks for parameter on stack and constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:                        ; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp



Sleazy tricks with the carry flag


This prior use of the CommonStuff function wasn't too bad, and it would be even nicer if the CommonStuff() function could be applied somewhere else in the code too. However, suppose the common sequence occurred only in two places and we know something special about the common code? Special you say? Yes, special. In this particular example, one thing about the code in the CommonStuff function is that it doesn't alter the flags register. More specifically, it doesn't alter the carry flag.

The carry flag is the only flag on the 80x86 family of CPU's that we can both set and clear directly with a one byte instruction and do a conditional jump on. The direction flag can be set and cleared, but there's no way to jump based on DF's value.
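For reference, the carry flag manipulation instructions are all one byte apiece, with a matching pair of conditional jumps (SomeLabel here is just a placeholder):

        stc                     ; 1 byte - set the carry flag
        clc                     ; 1 byte - clear the carry flag
        cmc                     ; 1 byte - complement the carry flag
        jc      SomeLabel       ; Jump if CF=1
        jnc     SomeLabel       ; Jump if CF=0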

Knowing that the common sequence doesn't alter the carry flag enables us to do a sleazy trick like this one:

f1      proc    near
        call    foo
        mov     bar,1
        call    blat
        stc                     ; CF=1 means entry was from f1(), CF=0 entry was from f2()
SleazeInFromF2:
        mov     si,1234 ; Common middle code starts here
        mov     di,5678 ;
        push    es              ;
        pop     ds              ;
        cld                     ;
        mov     cx,20           ;
        rep     movsw           ; Common middle code ends here
        jnc   short SleazeBackToF2
        call    barf   ; barf() looks for parameter on stack and constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
        clc
        jmp     short SleazeInFromF2    ; JMP to the merged code
SleazeBackToF2:                         ; Now we're back
        call    barf      ; barf() looks for parameter on stack and constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:                    ; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp

What did this buy us here? Well, we traded two 3 byte CALL's and a one byte RET for two 2 byte JMP's plus a one byte CLC and a one byte STC. This version that used the carry flag is a byte shorter than the version that had the common code as a separate function.

If the common code sequence happened to be in 32 bit code in a USE32 segment, the savings are going to be even greater using this carry flag trick. "near" CALL instructions in USE32 type segments are 5 byte instructions as opposed to near CALL's in USE16 segments which are 3 bytes. Doing the separate CommonStuff() function will have 11 bytes worth of overhead to implement in 32 bit code versus the 7 bytes of overhead we had in the USE16 version. In USE32 code, if you can use the carry flag trick, it's going to be 5 bytes shorter than a separate function.

TIP: If a common code sequence uses the INC or DEC instructions, these do not alter the carry flag

Suppose we've got a common blob of code in a USE32 code segment and it does alter the flags (specifically the carry flag)? Does this force us into having to go with the fatter separate function form of middle merging that's 5 bytes bigger than the carry flag trick? Not necessarily. If the entry and exit points are well defined and the common code doesn't depend on the state of the flags but only changes them incidentally, we might do this instead:

        stc
SleazeInFromWherever:
        ;Begin common code
        ...blah, blah
        pushfd
        ; The instruction(s) that alter the flags live here
        popfd
        ...blah, blah
        jnc     short SleazeBackToWherever
        ;End common code

The PUSHFD/POPFD in USE32 code (and PUSHF/POPF in USE16 code) are one byte instructions. We'd still be ahead by 3 bytes if we use PUSHFD/POPFD to keep the common sequence from changing the flags in USE32 code. Note that this technique would be a net loser by one byte in a USE16 code segment where near CALL instructions are only 3 bytes.

There's another consideration about 32 bit code here too. In a USE32 code segment, a conditional JMP (like JNC) can be either 2 bytes or 6 bytes. On a 386, a conditional jump in a USE16 segment will be either 2 bytes or 4 bytes. Here's part of a listing file generated by an assembler for a small test file that demonstrates this:

1                              .386
2     00000000                 jmps32     segment use32 public 'code'
3                                         assume cs:jmps32,ds:jmps32
4     00000000  0F 82 00000200            jc     foo32
5     00000006  0200*(??)                 db     512 dup(?)
6     00000206                   foo32      label near
7     00000206  90                              nop
8     00000207  72 FD                     jc     foo32
9     00000209                 jmps32     ends
10
11        0000                   jmps16     segment use16 public 'code'
12                                        assume cs:jmps16,ds:jmps16
13        0000  0F 82 0200                jc     foo16
14        0004  0200*(??)                 db     512 dup(?)
15        0204                 foo16      label  near
16        0204  90                        nop
17        0205  72 FD                     jc         foo16
18        0207                 jmps16     ends
19
20                             end

Notice how that 512 byte blob on line #5 caused the JC on line #4 to become one of those fat 6 byte conditional jumps. When you're forced into using one of the 6 byte jumps it's a byte worse than doing one of the 5 byte CALL's!

Similarly, in USE16 code, that 4 byte conditional JC on line #13 is a byte worse than a 3 byte CALL. We'll see more about these jumps on the 386 a little later.



Front merging


One thing we see all the time in code is two procedures that have a common entry sequence of instructions and then go off to do something different at some point, like these two example functions:

            f1    proc  near
(1 byte)          push  ds
(1 byte)          push  es
(1 byte)          push  si
(1 byte)          push  di
(6 byte)          mov   word ptr [SomeThing],1234
                  ;
                  ; The code gets different here
                  ;
            ExitMerge:
(1 byte)          pop   di
(1 byte)          pop   si
(1 byte)          pop   es
(1 byte)          pop   ds
(1 byte)          ret
            f1    endp

            f2    proc  near
(1 byte)          push  ds
(1 byte)          push  es
(1 byte)          push  si
(1 byte)          push  di
(6 byte)          mov   word ptr [SomeThing],1234
                  ;
                  ; The code gets different here
                  ;
(2 byte)          jmp   ExitMerge
            f2    endp
(27 bytes total)

We did the obvious tail merge on the exit sequences, but is there some way we might smash the entry sequences together too? Again, the carry flag can come to the rescue in a situation like this one. Unless these functions depend on the state of the carry flag on entry, which is a pretty rare condition, we might do something like this:

            f1    proc  near
(1 byte)          clc
            EntryMerge:
(1 byte)          push  ds      ; Note that the flags are
(1 byte)          push  es      ; not hit by any of these instructions
(1 byte)          push  si      ; in the common prolog sequence
(1 byte)          push  di
(6 byte)          mov   word ptr [SomeThing],1234
(2 byte)          jc    EntryReturn
                  ;
                  ; The code gets different here
                  ;
            ExitMerge:
(1 byte)          pop   di
(1 byte)          pop   si
(1 byte)          pop   es
(1 byte)          pop   ds
(1 byte)          ret
            f1    endp

            f2    proc  near
(1 byte)          stc
(2 byte)          jmp   short EntryMerge
            EntryReturn:
                  ;
                  ; The code gets different here
                  ;
(2 byte)          jmp   ExitMerge
            f2    endp
(23 bytes total)



Putting JCXZ and variable entry points to use


Was 23 bytes for the last example the best we can do? If we can place one added constraint on the functions we can do a little better. What if it were OK for these functions to clobber the CX register on entry? This is a fairly easily met condition in most cases for functions that will be called by a C/C++ compiler.

Consider this dastardly transformation:

            f1    proc  near
(1 byte)          db    0B9h    ; The opcode byte for MOV CX,immediate
            f2    label near
(2 byte)          xor   cx,cx   ; 33h C9h
(1 byte)          push  ds
(1 byte)          push  es
(1 byte)          push  si
(1 byte)          push  di
(6 byte)          mov   word ptr [SomeThing],1234
(2 byte)          jcxz  f2MergeEntry
                  ;
                  ; The code gets different here
                  ;
            ExitMerge:
(1 byte)          pop   di
(1 byte)          pop   si
(1 byte)          pop   es
(1 byte)          pop   ds
(1 byte)          ret
            f1    endp

            f2MergeEntry label near
                  ;
                  ; The code gets different here
                  ;
(2 byte)          jmp   ExitMerge
            f2    endp
(22 bytes total)

That one is a little tricky and needs some explanation. Here we've got the entry points for f1() and f2() being located within a byte of each other. The reason this works is because the CPU just processes streams of bytes that get interpreted as instructions. Entering at f1 causes a MOV CX,0C933h to be executed, while entering at f2 causes an XOR CX,CX to be executed. Combining that behavior with a JCXZ instruction makes for a pretty sneaky combination.

If you've got two functions with more than 5 bytes of prolog in common and CX is expendable during the common entry sequence, then you can apply this trick. Alas, in USE32 segments, this trick is less valuable because a "MOV ECX,immediate" is a 5 byte instruction. For code in a USE32 segment, we'd have to add two pad bytes after the XOR ECX,ECX we'd be using in 32 bit code. An OR ECX,ECX instruction would do the trick in a case like that.
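A sketch of how the USE32 variant might be laid out, assuming ECX is expendable (JECXZ is the 32 bit sibling of JCXZ, and the OR ECX,ECX exists only to pad out the 4 byte immediate as just described):

            f1    proc  near
(1 byte)          db    0B9h    ; The opcode byte for MOV ECX,immediate
            f2    label near
(2 byte)          xor   ecx,ecx ; Entering at f1, these 4 bytes become
(2 byte)          or    ecx,ecx ; the immediate, so ECX ends up nonzero
                  ;
                  ; The common prolog code goes here
                  ;
(2 byte)          jecxz f2MergeEntry
                  ;
                  ; f1's private code continues here
                  ;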



Setting one segment register to another - The speed/space tradeoff


Setting one segment register to the value in another one is a common occurrence in USE16 code. It happens a lot during setup for string operations, and when the value in one of them is unknown at some point in the code.

We see 4 byte code sequences like this all the time:

        mov     ax,ds
        mov     es,ax

PUSH/POP

If the value in AX wasn't going to be used later, then the straightforward transformation to compact that sequence to 2 bytes would be:

        push    ds
        pop     es

Yes, it works for SS too!

For some reason many people seem hesitant to use the PUSH/POP sequence when SS is the target in a sequence like this one:

        mov     ax,ds
        mov     ss,ax
        mov     sp,NewSPval

The PUSH/POP thing does work here though, and there's no reason not to use it when space is the issue.
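For the SS case, the space saving version is simply (NewSPval as before):

        push    ds
        pop     ss
        mov     sp,NewSPval

A load of SS holds off interrupts until the instruction after it completes, so the MOV to SP gets executed before an interrupt can catch the new SS paired with the old SP. (Some very early 8088/8086 steppings reportedly had a bug in that interrupt shielding, which may be where the hesitancy comes from.)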

Using a PUSH/POP to replace a MOV/MOV chops the size of setting one segment register to another from 4 bytes to 2. What's the speed hit though? I used this little Zen Timer program to measure the difference on several different test systems:

;--------------------------------------------------------
; A Zen Timer test to see how well PUSH/POP performs
; compared to MOV/MOV for setting segment registers.
;--------------------------------------------------------
        call    ZtimerOn
        rept    5000
          push    ds     ;; 5000 PUSH/POP's
          pop     es
        endm
        call    ZtimerOff
        call    ZTimerReport

        db      32 dup (90h)
        call    ZtimerOn
        rept    5000
          mov     ax,ds  ;; 5000 MOV/MOV's
          mov     es,ax
        endm
        call    ZtimerOff
        call    ZTimerReport

We would rightly expect the PUSH/POP sequence to run slower than the MOV/MOV sequence because it's sloshing stuff on and off the stack. How much slower it was on the different CPU types was interesting. All of these timings are in microseconds by the way.

Test CPU       PUSH/POP     MOV/MOV   Ratio
4.77mhz 8088      31057       18102     .58
386SX-25           2029         806     .39
386DX-20           2449        1100     .45
486SX-25           1310        1200     .92
486DX2-66          2235         450     .20
60mhz Pentium       330         249     .75

The effects of the 486DX2-66's clock doubler were really apparent on this test. Keeping things in registers really makes the clock doubling technology shine. Equally notable was the performance of the 486SX-25. It wasn't hurt too badly at all by the PUSH/POP sequence.

So when would we want to trade a 4 byte MOV/MOV for a 2 byte PUSH/POP? As always, a profiler and some common sense should guide our actions. Let's say the segment registers are being set at the beginning of a repeated string operation that's going to be moving several hundred or several thousand bytes around. In a case like this, the added cost of the PUSH/POP is going to be negligible compared to the cost of the string move, so you might as well save the space with a PUSH/POP. One of the advantages of the PUSH/POP sequence is that you don't have to burn a register to do it either. Depending on how the rest of the code is organized, this might be quite handy.
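For instance, a sketch of that kind of setup (Src, Dst, and the word count are made up for illustration):

        push    ds              ; ES = DS in 2 bytes instead of 4,
        pop     es              ; and no register gets burned
        mov     si,offset Src   ; Src and Dst are hypothetical buffers
        mov     di,offset Dst
        mov     cx,1024
        cld
        rep     movsw           ; Next to moving 2K of data, the extra
                                ; cost of the PUSH/POP is pure noise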

LES/LDS are powerful instructions

We often see code that looks like this sequence in programs:

          Thing   dd  ?
                      ...
(3 bytes)         mov ax, word ptr [Thing]
(4 bytes)         mov dx, word ptr [Thing+2]
(7 bytes total)

If we're in real mode (or V86 mode on a 386), then we don't need to worry about what value we load into a segment register (like we would have to in protected mode, where a GP fault might be generated).

In a case like this, if we can waste the ES register, the code could be transformed to something like this:

(4 bytes)   les   ax, dword ptr [Thing]
(2 bytes)   mov   dx, es
(6 bytes total)

Often we'll be seeing sequences like this in code where a bunch of variables in memory are being set:

(3 bytes)  mov  ax, word ptr [Stuff]
(3 bytes)  mov  word ptr [StuffSave], ax
(3 bytes)  mov  ax, word ptr [Stuff+2]
(3 bytes)  mov  word ptr [StuffSave+2], ax
(12 bytes total)

Again, if we can waste the ES register, that can be transformed into this:

(4 bytes)  les   ax, dword ptr [Stuff]
(3 bytes)  mov   word ptr [StuffSave], ax
(4 bytes)  mov   word ptr [StuffSave+2], es
(11 bytes total)

Now, suppose the sequence is setting more than a couple of variables, but we can't waste the ES register.

            ;
            ; We can't alter ES during this sequence
            ;
(3 bytes)   mov   ax, word ptr [OtherStuff]
(3 bytes)   mov   word ptr [OtherStuffSave], ax
(3 bytes)   mov   ax, word ptr [OtherStuff+2]
(3 bytes)   mov   word ptr [OtherStuffSave+2], ax
(3 bytes)   mov   ax, word ptr [OtherStuff2]
(3 bytes)   mov   word ptr [OtherStuffSave2], ax
(3 bytes)   mov   ax, word ptr [OtherStuff2+2]
(3 bytes)   mov   word ptr [OtherStuffSave2+2], ax
(3 bytes)   mov   ax, word ptr [Stuff]
(3 bytes)   mov   word ptr [StuffSave], ax
(3 bytes)   mov   ax, word ptr [Stuff+2]
(3 bytes)   mov   word ptr [StuffSave+2], ax
(36 bytes total)

Now the sequence is big enough that we could think about pushing/popping ES as a way to preserve it through the sequence. This would allow us to waste it temporarily to produce this sequence:

            ;
            ; We can't alter ES during this sequence
            ;
(1 byte)    push        es
(4 bytes)   les   ax, dword ptr [OtherStuff]
(3 bytes)   mov   word ptr [OtherStuffSave], ax
(4 bytes)   mov   word ptr [OtherStuffSave+2], es
(4 bytes)   les   ax, dword ptr [OtherStuff2]
(3 bytes)   mov   word ptr [OtherStuffSave2], ax
(4 bytes)   mov   word ptr [OtherStuffSave2+2], es
(4 bytes)   les   ax, dword ptr [Stuff]
(3 bytes)   mov   word ptr [StuffSave], ax
(4 bytes)   mov   word ptr [StuffSave+2], es
(1 byte)    pop   es
(35 bytes total)

Occasionally I see code that looks like this next sequence. It's usually a result of someone who was just in a hurry when it was written, or a beginning assembler programmer who didn't have the LES and LDS instructions down yet.

SomeDwordPointer   dd    ?
                   ...
(4 bytes)          mov   di, word ptr [SomeDwordPointer]
(4 bytes)          mov   ds, word ptr [SomeDwordPointer+2]

If you ever see code written like this and you're looking to save space, then replace it with:

(4 bytes)          lds   di, dword ptr [SomeDwordPointer]



One instruction subroutines


There are occasions when space saving is a paramount concern. For instance, when ROM'ing code and everything needs to fit into a small 32K or 64K ROM. Another might be a "pop up" TSR that interacts with the user at human typing speeds. You may even find yourself in a situation where saving a few bytes in an executable is enough to save a diskette in a product, or would pull a component in a system back under a cluster boundary in DOS's FAT file system. In situations like these, exceptional means are called for. One of the easier ways to knock bytes out of code is to look for plump instructions and turn them into one instruction subroutines. In USE16 code, a near CALL is only 3 bytes. However, there are a lot of instructions in the 80x86 family that are bigger than 3 bytes. Moving immediate constants to memory locations is a common example. In USE32 code, the situation gets worse. Moving an immediate constant to a dword in memory is a whopping 10 byte instruction. Near calls in USE32 code are 5 bytes.

Here's an example from a TSR I wrote a few years ago that faked a HIMEM driver for XT's and AT's that only had EMS 4.0 memory and no extended memory. The driver was a good enough ersatz HIMEM that the 4.0 and later versions of the SMARTDRV disk cache could run with it. The complete source for the FAKEHI driver is included in one of the appendixes. It uses several of the techniques discussed in the book to keep its resident size as small as possible.

;------------------------------------------------------------
; Putting this instruction in a proc saves some space.
; Immediate compares to memory are BIG instructions.
;------------------------------------------------------------
CmpAllocFlag0 proc near
        cmp     AllocFlag, 0
        ret
CmpAllocFlag0 endp

If there are enough instances of a large instruction in the code, extracting that instruction into a subroutine of its own can pay off. Each N byte instruction replaced by a 3 byte near CALL saves N-3 bytes, and the subroutine itself costs N+1 bytes, so it doesn't take many instances before the math works out.

Realize that this isn't something you'll want to do in a high performance section of code though. The CALL/RET are terribly slow compared to inlining the instruction.



Let the LST file guide you!


When you're looking for opportunities to do one instruction subroutines, one of the best ways to find them is to look at an assembler listing file. Down the left hand side of the listing will be the generated code. Scan the listing looking for instructions that generated 4 or more bytes of code. Plump things really stand out in a listing file.

This one-instruction subroutine trick isn't just for assembler code either. It works well in C/C++ code as well when you're really trying to squeeze something down. Take this function for example.

/*
  1INST.C  Example for 1 instruction subroutines in C
*/
int var1;
int var2;
int var3;

#ifdef CALL
void near ZeroVar1(void)
   {
   var1 = 0;
   }
#endif

void foo(int x, int y, int z)
   {
   var1 &= 255;
   if (var1)
#ifdef CALL
      ZeroVar1();
#else
      var1 = 0;
#endif
   else
      {
      if (x)
#ifdef CALL
         ZeroVar1();
#else
         var1 = 0;
#endif
      }
   if (y)
#ifdef CALL
      ZeroVar1();
#else
      var1 = 0;
#endif
   else
      {
      if (var2)
#ifdef CALL
         ZeroVar1();
#else
         var1 = 0;
#endif
      if (z & var2)
#ifdef CALL
         ZeroVar1();
#else
         var1 = 0;
#endif
      }
   if (var3 == 7)
      {
#ifdef CALL
      ZeroVar1();
#else
      var1 = 0;
#endif
      var2 = 1;
      return;
      }
   if (var1 || var2 || var3)
#ifdef CALL
      ZeroVar1();
#else
      var1 = 0;
#endif
   }

I compiled this function with the Microsoft version 8.0 compiler using the /Os /O1 /Fa and /Gs switches. /Gs eliminates stack probes, /Fa generates an assembler listing, and /Os and /O1 tell the compiler to go for max space optimizations.

Without the symbol "CALL" being defined, the compiler generated this assembler code:

;       Static Name Aliases
;
        TITLE   1inst.c
        .286p
        .287
INCLUDELIB      SLIBCE
INCLUDELIB      OLDNAMES.LIB
_TEXT   SEGMENT  WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT  WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT  WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT  WORD PUBLIC 'BSS'
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA

        ASSUME DS: DGROUP, SS: DGROUP
_BSS      SEGMENT
COMM NEAR       _var1:  BYTE:    2
COMM NEAR       _var2:  BYTE:    2
COMM NEAR       _var3:  BYTE:    2
_BSS      ENDS

_TEXT      SEGMENT
        ASSUME  CS: _TEXT
        PUBLIC  _foo
_foo    PROC NEAR       ;  COMDAT
; Line 18
        push    bp
        mov     bp,sp
;       x = 4
;       y = 6
;       z = 8
; Line 20
        and     WORD PTR _var1,255      ;00ffH
        jne     $L121
; Line 28
        cmp     WORD PTR [bp+4],0       ;x
        je      $I112
; Line 32
$L121:
        mov     WORD PTR _var1,0
; Line 34
$I112:
; Line 35
        cmp     WORD PTR [bp+6],0       ;y
        jne     $L122
; Line 43
        cmp     WORD PTR _var2,0
        je      $I116
; Line 47
        mov     WORD PTR _var1,0
; Line 49
$I116:
        mov     ax,WORD PTR _var2
        test    WORD PTR [bp+8],ax      ;z
        je      $I115
; Line 53
$L122:
        mov     WORD PTR _var1,0
; Line 55
$I115:
; Line 56
        cmp     WORD PTR _var3,7
        jne     $I118
; Line 61
        mov     WORD PTR _var1,0
; Line 63
        mov     WORD PTR _var2,1
; Line 64
        leave
        ret
; Line 66
$I118:
        cmp     WORD PTR _var1,0
        jne     $I120
        cmp     WORD PTR _var2,0
        jne     $I120
        cmp     WORD PTR _var3,0
        je      $EX110
$I120:
; Line 70
        mov     WORD PTR _var1,0
; Line 72
$EX110:
        leave
        ret
_foo    ENDP

_TEXT   ENDS
END

Even with all the best space optimizations possible, the compiler still generated 5 MOV instructions to "Var1" with immediate constants of zero. The generated code from assembling the ASM listing was 6Ah bytes long.

Compiling the function again, this time with the /DCALL switch to enable the one instruction subroutine generated this listing file:

;       Static Name Aliases
;
        TITLE   1inst.c
        .286p
        .287
INCLUDELIB      SLIBCE
INCLUDELIB      OLDNAMES.LIB
_TEXT   SEGMENT  WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT  WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT  WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT  WORD PUBLIC 'BSS'
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA
        ASSUME DS: DGROUP, SS: DGROUP
_BSS      SEGMENT
COMM NEAR       _var1:  BYTE:    2
COMM NEAR       _var2:  BYTE:    2
COMM NEAR       _var3:  BYTE:    2
_BSS      ENDS
_TEXT      SEGMENT
        ASSUME  CS: _TEXT
        PUBLIC  _ZeroVar1
_ZeroVar1       PROC NEAR       ;  COMDAT
; Line 13
        mov     WORD PTR _var1,0
; Line 14
        ret

_ZeroVar1       ENDP
        PUBLIC  _foo
_foo    PROC NEAR       ;  COMDAT
; Line 18
        push    bp
        mov     bp,sp
;       x = 4
;       y = 6
;       z = 8
; Line 20
        and     WORD PTR _var1,255      ;00ffH
        jne     $L123
; Line 28
        cmp     WORD PTR [bp+4],0       ;x
        je      $I114
; Line 30
$L123:
        call    _ZeroVar1
; Line 34
$I114:
; Line 35
        cmp     WORD PTR [bp+6],0       ;y
        jne     $L124
; Line 43
        cmp     WORD PTR _var2,0
        je      $I118
; Line 45
        call    _ZeroVar1
; Line 49
$I118:
        mov     ax,WORD PTR _var2
        test    WORD PTR [bp+8],ax      ;z
        je      $I117
; Line 51
$L124:
        call    _ZeroVar1
; Line 55
$I117:
; Line 56
        cmp     WORD PTR _var3,7
        jne     $I120
; Line 59
        call    _ZeroVar1
; Line 63
        mov     WORD PTR _var2,1
; Line 64
        leave
        ret
; Line 66
$I120:
        cmp     WORD PTR _var1,0
        jne     $I122
        cmp     WORD PTR _var2,0
        jne     $I122
        cmp     WORD PTR _var3,0
        je      $EX112
$I122:
; Line 68
        call    _ZeroVar1
; Line 72
$EX112:
        leave
        ret
_foo    ENDP
_TEXT   ENDS
END

It was good that the Microsoft compiler was smart enough to recognize that the ZeroVar1() function didn't need any sort of function prolog/epilog code. The compiler was also nice enough to recognize that ZeroVar1() didn't need to save or restore SI/DI which are commonly used as register variables. The generated code from assembling the ASM listing was 62h bytes long. By making the assignment of Var1 a function, we saved 8 bytes.

TIP: The Borland 4.52 C++ compiler behaved a little differently on this test program and insisted on pushing/popping SI/DI at the beginning and end of the one instruction function even though those registers weren't altered by the code. To force the Borland compiler to omit the push/pop of SI/DI a #pragma option can be used like this:

#pragma option -r-      // Temporarily turn off register variables
void near ZeroVar1(void)
   {
   var1 = 0;
   }
#pragma option -r.      // Restore the register variable state

Here's a section of the assembler code produced by the Borland compiler with the #pragma's used as mentioned above:

_TEXT   segment byte public 'CODE'
   ;
   ;    void near ZeroVar1(void)
   ;
        assume  cs:_TEXT,ds:DGROUP
_ZeroVar1       proc    near
   ;
   ;       {
   ;       var1 = 0;
   ;
        mov     word ptr DGROUP:_var1,0
   ;
   ;       }
   ;
        ret
_ZeroVar1       endp

   ;
   ;    void foo(int x, int y, int z)
   ;
        assume  cs:_TEXT,ds:DGROUP
_foo    proc    near
        push    bp
        mov     bp,sp
        push    si
        push    di
   ;
   ;       {
   ;       var1 &= 255;
   ;
        and     word ptr DGROUP:_var1,255

Notice how no push/pop of SI/DI was generated within the ZeroVar1() function when the #pragma was used. We can also see that the register variables were made active again when the code for the body of the foo() function was generated because pushes of SI and DI were generated there.



Destructive compares with INC/DEC


One sequence we'll often see in code is a comparison of a register with the value 1 or -1 like this:

        cmp     edi,-1  ; cmp di,-1 in USE16 code
        je      SomeWhere
        ; Blah, blah, blah...
SomeWhere:
        ; Blah, blah, blah...

In USE16 and USE32 code this will be encoded by most assemblers as a 3 byte instruction. In special situations we can transform this into a one byte INC instruction by recognizing that one added to minus one is zero (i.e. an INC will set the Z flag):

        inc     edi             ; inc  di in USE16 code
        je      SomeWhere
        ; Blah, blah, blah...
SomeWhere:
        ; Blah, blah, blah...

This is a nice space saver when we can apply it. When can we apply it though? If neither the fall through code nor the code at the jump target depends on the value of the register after the comparison, this trick is a clear winner. There may still be an opportunity when only one of the two paths depends on the register. For example, let's say the code on the fall through in the example above depended on the value in EDI remaining intact, but the code at the branch target didn't. We can still save a byte by doing this:

    inc edi         ; inc di in USE16 code
    je  SomeWhere
    dec edi
    ; Blah, blah, blah...
SomeWhere:
    ; Blah, blah, blah...

The DEC EDI will restore the value in EDI back to what it was before the test.

When testing for the value 1, the concept is the same as testing for -1, only we'll be doing a DEC instead. Similarly, recovering the register value, if one of the paths needs it, would be done with an INC in this situation rather than a DEC.
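A sketch of the test-for-1 form, including the recovery INC for a fall through path that still needs the register:

    dec edi         ; dec di in USE16 code
    je  SomeWhere   ; Z flag gets set if EDI was 1
    inc edi         ; Restore EDI for the fall through code
    ; Blah, blah, blah...
SomeWhere:
    ; Blah, blah, blah...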



Splitting nibbles in the AL register


The thing people typically do to split the nibbles in the AL register into AH and AL might look something like this piece of code:

(2 bytes) mov ah,al
(2 bytes) mov cl,4
(2 bytes) shr ah,cl
(3 bytes) and ax,0F0Fh

This is nice and it works OK. However, it is 9 bytes long. There's a way to tighten this up into a single two byte instruction though. In all the Intel CPU documentation prior to the Pentium, the claim was made that the AAM instruction only worked in base 10 decimal. In the Pentium programmer's reference, Intel has finally confessed to how the AAM (and AAD) instructions really work. While this confession occurred in the Pentium programmer's reference, AAM has functioned this way in every 80x86 CPU since the original 8086. Here's a pseudo-C version of how AAM works in reality:

AAM(unsigned char NumericBase)
   {
   unsigned char Temp8;
   Temp8 = AL;
   AH = Temp8 / NumericBase;
   AL = Temp8 % NumericBase;
   }

In fact, the AAM instruction is capable of dealing with numbers in most any numeric base you might want to work in. One of the more convenient bases we may want to have the AAM work with is base 16 (i.e. hexadecimal).

If you were to just code up an AAM instruction, an assembler is going to encode it like this:

db 0D4h,0Ah       ; Opcode for AAM is 0D4h. The second byte is the
                  ; numeric base. In this case 0Ah = 10 decimal.

If we hand assemble an AAM with a second byte that's different than 0Ah, then we've got an AAM that's going to work in whatever base we encode as the second byte. If we want AAM to work in hexadecimal, then we'd encode the second byte as 10h (16 decimal). You might want to do a macro for convenience, like this.

;
; MAAM (Mutant AAM)
;
MAAM macro NumericBase
   db 0D4h, NumericBase
endm

Let's do a mental run of a value in AL through the pseudo-C code for AAM using hexadecimal as the numeric base to see how this works. Suppose the AL register contains the value 37h.

37h divided by 16 is 3. The AH register will get the value 03h.

37h mod 16 is 7. The AL register will get the value 07h

So for our test value of 37h in AL, after a mutated AAM using base 16, the AX register contains the value 0307h. The mutated AAM split those nibbles cleanly into AH and AL.
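Putting the MAAM macro to work on that test value looks like this:

        mov     al,37h          ; The test value from above
        MAAM    10h             ; Mutated AAM, base 16. AX = 0307h now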

There's a downside to this AAM trick though...

The performance of AAM is pretty slow on all the CPU's.

The previous 9 byte, 4 instruction sequence that split the nibbles is actually faster than the 2 byte mutated AAM. On a Pentium, the AAM instruction takes 10 cycles. It's one of the most expensive integer instructions you can do on a Pentium. As always, we should let a profiler and some common sense tell us if the AAM trick is appropriate for any given piece of code.



Joining nibbles


Just as we were able to use a mutated AAM to split nibbles, we can also use another mutated instruction to join them. If the two byte AAD instruction is mutated so the second byte is 10h (16 decimal), it will join the low nibble in AH with the low nibble in AL. The result will have the low nibble of AH become the high nibble of AL, while the low nibble of AL remains unchanged.

Suppose the AX register contained the value 0109h. Doing one of these hexadecimal mutated AAD instructions will result in AL containing the value 19H. The AH register is always set to zero after any form of the AAD instruction.
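Following the MAAM pattern, a companion macro for the mutated AAD (call it MAAD - the name is made up here) and a quick use of it might look like this:

;
; MAAD (Mutant AAD)
;
MAAD macro NumericBase
   db 0D5h, NumericBase   ; Opcode for AAD is 0D5h
endm

        mov     ax,0109h
        MAAD    10h             ; Join the nibbles: AL = 19h, AH = 0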

WARNING: If the high order nibbles of AL or AH could possibly be non-zero, you'll need to do an "AND AX,0F0Fh" before doing the mutated base 16 AAD to join the nibbles.

WARNING: The mutated AAD trick does not work correctly on some of the NEC V series CPU's.

I've tried it on a V20 and the V20 always behaves as if the second byte of the AAD were 0Ah (10 decimal). If you're writing code that's going to run on a 286 or better CPU, then use the mutated AAD trick and sleep well. If you're writing code that may be run on an XT class machine with one of the NEC V20 or V30 chips installed, then you'd better avoid this trick. I've not tried the AAD trick on a V40 or V50 chip yet. On all the Intel CPU's, and variants thereof, that are 286 class or better, the AAD trick will work OK. By the way, if you're looking for an easy way to detect a V20 chip, this is one way to do it.
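A sketch of such a detection check, leaning on the AAD quirk just described (the label name is made up):

        mov     ax,0101h
        db      0D5h,10h        ; Mutated AAD with base 16
        cmp     al,0Bh          ; Intel parts: AL = 1*16+1 = 11h
        je      ItsAV20         ; A V20 acts like base 10: AL = 1*10+1 = 0Bh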



Jump chaining


Frequently we'll encounter code that tests for a series of error conditions as it flows along, where all the error paths lead to one place. The code might look something like this:

ErrorExit:
   mov  ax,-1
   ret

FooBar  proc near
   call Check1
   jc   ErrorExit
;
;The next JC to ErrorExit is still in range of a
;short jump.
;
   call Check2
   jc   ErrorExit
;
; Some code is here that causes a subsequent JC to
; ErrorExit to be out of range of a short jump. This
; forced the programmer to do a "jump around the
; jump" construct.
;
   call Check3
   jnc  Check3ok  (2 bytes)
   jmp  ErrorExit (3 bytes)
Check3ok:
   xor  ax,ax
   ret
FooBar endp

In a case like this one, if the code after the call to Check3() is within short jump range of one of the other JC instructions, the code can be transformed like this:

FooBar  proc near
   call Check1
   jc   ErrorExit
;
;The next JC to ErrorExit is still
; in range of a short jump.
;
   call Check2
JC_to_ErrorExit:
   jc   ErrorExit
;
; Some code is here that causes a
; subsequent JC to ErrorExit to be
; out of range of a short jump.
; This forced the programmer to do
; a "jump around the jump" construct.
;
   call Check3
   jc   JC_to_ErrorExit
   xor  ax,ax
   ret
FooBar  endp

The state of the carry flag isn't going to change after a conditional jump, so introducing the label "JC_to_ErrorExit" allows us to transform a 5 byte "jump around the jump" sequence into a nice small 2 byte jump on carry.

One added benefit of doing this is that the non-error path through the code isn't forced to take a jump now. Falling through conditional jumps in the mainstream code is always a good thing. Jumps not taken are invariably faster than jumps taken.

If the routine in question is sufficiently long, with lots of tests, you'll get a nice space savings out of this trick, and the code will be running faster in the normal case too.

Sometimes there's no way to avoid a jump around the jump in a real long routine. That's life. When it happens though, see if you can apply the jump chaining trick to subsequent checks in the routine that may be within short jump range of that jump around jump you were forced to use.

The 386 and later CPU's are special here by the way. They implement a new form of the conditional jumps that can span the range of a 64K segment in USE16 code, and span a 4G segment in USE32 code. In USE16 code, this form of conditional jump is going to be a 4 byte instruction. In USE32 code it will be a 6 byte instruction.

Here's a little assembler module that demonstrates these new forms of conditional jumps:

   .386
x  segment use32 'code32'
   assume  cs:x
   jc      foo32
   db      200 dup (90h) ; Force jump out of short range
foo32 label near
x  ends

y  segment use16 'code16'
   assume  cs:y
   jc      foo16
   db      200 dup (90h) ; Force jump out of short range
foo16 label near
y  ends
   end

Assembling this module yielded this listing file that shows how these new form jumps are 6 and 4 bytes (on lines 4 and 11 of the listing).

Turbo Assembler Version 3.1 11/19/95 22:09:01 Page 1
jumps.ASM
1                               .386
2 00000000                x     segment use32 'code32'
3                               assume  cs:x
4 00000000 0F 82 000000C8       jc      foo32
5 00000006 C8*(90)              db      200 dup (90h)
6 000000CE                foo32 label   near
7 000000CE                x     ends
8
9  0000                   y     segment use16 'code16'
10                              assume  cs:y
11 0000 0F 82 00C8              jc      foo16
12 0004 C8*(90)                 db      200 dup (90h)
13 00CC                   foo16 label   near
14 00CC                   y     ends
15                              end

I need to mention a gotcha here regarding these new 386 forms of jumps. Some assemblers are not going to be very smart and may assemble what could be a short jump into one of those longer 4 or 6 byte forms. Some will later realize they could do the short form and will generate that, but pad the extra bytes they initially allocated with NOP's.

Here's a slightly altered version of the above example that illustrates this pitfall:

       .386
x      segment use32 'code32'
       assume  cs:x
       jc      foo32
       db      200 dup (90h)
       jc      foo32
       db      100 dup (90h)
foo32  label   near
x      ends

y      segment use16 'code16'
       assume  cs:y
       jc      foo16
       db      200 dup (90h)
       jc      foo16
       db      100 dup (90h)
foo16  label   near
y      ends
       end

The second JC in each segment here is within 127 bytes of the target label and can rightly be assembled as the 2 byte short form of the conditional jump. However, when I ran this code with no special switches, other than to generate a list file, through the version of TASM that came with the Borland version 3.1 C/C++ compiler, I got this result:

Turbo Assembler Version 3.1 11/19/95 22:36:17 Page 1
jumps2.ASM
1 .386
2 00000000                   x     segment use32 'code32'
3                                  assume  cs:x
4 00000000 0F 82 00000132          jc      foo32
5 00000006 C8*(90)                 db      200 dup (90h)
6 000000CE 72 68 90 90 90 90       jc      foo32
7 000000D4 64*(90)                 db      100 dup (90h)
8 00000138                   foo32 label   near
9 00000138                   x     ends
10
11 0000                      y     segment use16 'code16'
12                                 assume  cs:y
13 0000 0F 82 0130                 jc      foo16
14 0004 C8*(90)                    db      200 dup (90h)
15 00CC 72 66 90 90                jc      foo16
16 00D0 64*(90)                    db      100 dup (90h)
17 0134                      foo16 label   near
18 0134                      y     ends
19                                 end

Notice how TASM generated the 4 and 6 byte conditional jumps on lines 6 and 15 of this listing. TASM just blindly allocated 4 and 6 bytes for the jumps and then went back and filled in the gaps with NOP's (90h's) when it realized a short version of the jump would work.

Are TASM users doomed to this kind of behavior? No. There are two different ways around this situation. One is to use the /m switch when assembling to tell TASM to use multiple passes to resolve forward references. When I rebuilt the example with a /m3 switch, TASM generated this much better code:

Turbo Assembler Version 3.1 11/19/95 22:44:56 Page 1
jumps2.ASM
1                            .386
2 00000000                x  segment use32 'code32'
3                            assume cs:x
4 00000000 0F 82 0000012E    jc     foo32
5 00000006 C8*(90)           db     200 dup (90h)
6 000000CE 72 64             jc     foo32
7 000000D0 64*(90)           db     100 dup (90h)
8 00000134                foo32     label near
9 00000134                x  ends
10
11 0000                   y  segment use16 'code16'
12                           assume cs:y
13 0000 0F 82 012E           jc     foo16
14 0004 C8*(90)              db     200 dup (90h)
15 00CC 72 64                jc     foo16
16 00CE 64*(90)              db     100 dup (90h)
17 0132                   foo16     label near
18 0132                   y  ends
19                           end

See how the jumps on lines 6 and 15 are nice short 2 byte ones, and that nasty NOP padding is gone.

Return to chapter index


The missed ASSUME

Return to chapter index

One thing that can really pork out the code for DOS device drivers and TSR's is missing an ASSUME directive for the DS register. Here's a section of the assembler listing from the FAKEHI driver I mentioned previously. On closer examination, I realized that I'd blown it when I first wrote the driver.

318                    ;---------------------------------------------------
319                    ; Translate Src pointer to EMS form (if necessary)
320                    ;---------------------------------------------------
321 020C 2E: 80 3E 0094 00 cmp  EMSxbuf.src_type, 0 ; Lowmem?
322 0212 74 12             je   ChkDstForXlat ; No xlat in this case...
323
324 0214 2E: C4 3E 0097    les  di, dword ptr EMSxbuf.src_offset
325 0219 E8 FEAE           call XlatProc
326 021C 2E: 89 3E 0097    mov  word ptr EMSxbuf.src_offset, di
327 0221 2E: 8C 06 0099    mov  word ptr EMSxbuf.src_offset+2, es
328
329 0226                ChkDstForXlat:
330                   ;---------------------------------------------------
331                   ; Translate Dst pointer to EMS form (if necessary)
332                   ;---------------------------------------------------
333 0226 2E: 80 3E 009B 00 cmp  EMSxbuf.dst_type, 0 ; Lowmem?
334 022C 74 12             je   XlatingDone ; No xlat in this case...
335
336 022E 2E: C4 3E 009E    les  di, dword ptr EMSxbuf.dst_offset
337 0233 E8 FE94           call XlatProc
338 0236 2E: 89 3E 009E    mov  word ptr EMSxbuf.dst_offset, di
339 023B 2E: 8C 06 00A0    mov  word ptr EMSxbuf.dst_offset+2, es
340
341 0240              XlatingDone:
342 0240 0E                push cs ; DS:SI --> EMS xlated info
343 0241 1F                pop  ds ;
344 0242 BE 0090           mov  si, offset EMSxbuf ;

The function XlatProc won't alter the DS register in this section of code by the way, nor does it depend on the value in DS.  In the listing, take a look at the code the assembler generated on lines 321, 324, 326, 327, 333, 336, 338, and 339.

Notice the '2E:'s. Those are CS: segment override bytes on the instructions -- 8 of them.

Now notice the PUSH CS and POP DS down on lines 342 and 343. If I'd moved the push/pop of CS and DS up to the top of this section of code, I would have been able to use an ASSUME on the DS register through this section. That would have allowed the assembler to eliminate those 8 bytes of segment overrides.

Fiddling with the code a little yielded this new listing:

286 020C 0E                push   cs
287 020D 1F                pop    ds
288                        assume ds:fakehi
289                    ;---------------------------------------------------
290                    ; Translate Src pointer to EMS form (if necessary)
291                    ;---------------------------------------------------
292 020E 80 3E 0094 00     cmp    EMSxbufDS.src_type, 0 ; Lowmem?
293 0213 74 0F             je     ChkDstForXlat ; No xlat in this case...
294
295 0215 C4 3E 0097        les    di, dword ptr EMSxbufDS.src_offset
296 0219 E8 FEAE           call   XlatProc
297 021C 89 3E 0097        mov    word ptr EMSxbufDS.src_offset, di
298 0220 8C 06 0099        mov    word ptr EMSxbufDS.src_offset+2, es
299
300 0224               ChkDstForXlat:
301                    ;---------------------------------------------------
302                    ; Translate Dst pointer to EMS form (if necessary)
303                    ;---------------------------------------------------
304 0224 80 3E 009B 00     cmp    EMSxbufDS.dst_type, 0 ; Lowmem?
305 0229 74 0F             je     XlatingDone ; No xlat in this case...
306
307 022B C4 3E 009E        les    di, dword ptr EMSxbufDS.dst_offset
308 022F E8 FE98           call   XlatProc
309 0232 89 3E 009E        mov    word ptr EMSxbufDS.dst_offset, di
310 0236 8C 06 00A0        mov    word ptr EMSxbufDS.dst_offset+2, es
311
312 023A               XlatingDone:

What appears to be a structure, EMSxbufDS, is actually an equate to a memory location that gets relocated into the program's PSP when it loads, so I had to jiggle that around a little in this section. The previous EMSxbuf symbol had a hard-wired CS: override, so that had to be changed to DS: in recognition that DS was now usable in this section of code.

Notice how the CS: override bytes are gone in this section of code now that these adjustments have been made. Let the assembler's listing file guide you, as in this example from FAKEHI.

If you see a section of code with a bunch of segment override bytes associated with it, there may be a way to eliminate them by making DS available like I did in this new version of FAKEHI.

Return to chapter index


Testing values in AH with SAHF

Return to chapter index

There's a unique situation that arises occasionally when we'd like to test a particular bit in the AH register and then make a conditional jump based on it. The SAHF instruction will store the AH value into the low byte of the flags register for us. (Don't confuse it with LAHF, which goes the other direction and loads AH from the flags.)

The bits in the low byte of the flags register that correspond to testable bits in the AH register are as follows:

Low flags byte: (msb) SF ZF ## AF ## PF ## CF (lsb)
AH register:    (msb) b7 b6 b5 b4 b3 b2 b1 b0 (lsb)

## = this flags bit is indeterminate after a SAHF instruction.

So, knowing this correspondence often allows us to replace code that might look like this:

(3 bytes) test ah,00000001b
(2 bytes) jnz  Somewhere

with something like this:

(1 byte)  sahf
(2 bytes) jc Somewhere ; Jump if bit 0 in AH is set

Similarly, if we wanted to test bit 6 in AH we could do this:

(1 byte)  sahf
(2 bytes) jz Somewhere ; Jump if bit 6 in AH is set
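
Bit 7 of AH lands in SF the same way, so the sign bit can be tested too; a minimal sketch:

(1 byte)  sahf
(2 bytes) js Somewhere ; Jump if bit 7 in AH is set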

These situations aren't going to occur all the time, but when they do, this trick is usually good for a 2 byte savings.

Return to chapter index


INC/DEC transformations

Return to chapter index

Whenever you see code that's adding or subtracting one to a register or memory location, check and see if the code following the ADD or SUB instruction depends on the carry flag. If the subsequent code isn't dependent on the state of the carry flag, then it's a safe bet that the ADD or SUB of one can be replaced with an INC or DEC instruction.

Often we'll see some code that looks like this:

(2 bytes) inc al

If we could stipulate that the value in AL was known to be less than 255, guaranteeing that no carry out into AH is possible, then this INC AL could be replaced with an INC AX, which is a one byte instruction in USE16 code. Even without that guarantee, if the value in AH is expendable, an INC/DEC of AX may still be possible provided the resultant flags don't matter.

Sometimes people's minds get stuck thinking about 8 bit values and they don't recognize when AH is in fact expendable and an INC/DEC of a 16 bit register could have been used.

This same concept can be applied in USE32 code. If we know the value in AL is in no danger of overflowing into AH (or an overflow doesn't matter), then the INC AL can be replaced with an INC EAX.

Another possible opportunity exists in 32 bit code too. There, someone might be trying to INC the AX register, or ADD something to it. This occurs a lot in older 8086 or 286 code that's been ported to take some advantage of the 386 but hasn't been totally reworked with the 386 in mind. In cases like this, accessing the AX register from USE32 code incurs the penalty of a 66h size override prefix byte on the instruction. If you can stipulate that the value in AX won't overflow into the high order word of EAX, or that it would be harmless if it did, then doing the operation on EAX rather than AX eliminates the size override prefix byte.
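
For instance, a minimal sketch assuming the high word of EAX is expendable:

(2 bytes) inc ax  ; 66h operand size prefix in USE32 code
(1 byte)  inc eax ; No prefix needed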

Another situation with ADD/INC that crops up frequently is when we're reworking some older 8086 or 286 code to take advantage of a 386's capabilities. Often we'll see sequences of 8088 code that look like this:

foo    dd    ?
       ...blah, blah, blah...
       add   word ptr foo,   1
       adc   word ptr foo+2, 0

The idea in this fragment is to add one to the double word memory variable 'foo'. When doing 386 conversions on older code you'd usually want to replace that sequence with:

inc dword ptr foo

WARNING: The only exception would be if for some reason the state of the carry flag matters after the sequence. An INC won't alter the carry flag, but the ADD/ADC sequence will. In those odd situations, you'd want to do an ADD of 1 to the variable.
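
On a 386 that can still be a single instruction; a sketch:

add dword ptr foo,1 ; Sets CF on a carry out of bit 31, unlike INC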

Return to chapter index


String operations can sometimes do more than just "strings"

Return to chapter index

Suppose we've got a situation where we've got two structures and one is being copied to the other, with the exception of some fields in the middle of the structures. We'll often see code that looks like this example of some "flat" 32 bit code.

lea esi, SourceStructure
lea edi, DestinationStructure
mov ecx,8
rep movsd  ; Copy 8 dwords
add edi, 4 ; Skip the next dword field (3 byte instruction)
add esi, 4 ; Skip the next dword field (3 byte instruction)
mov cl,4   ; ECX was zero from REP MOVSD
rep movsd  ; Copy 4 more dwords
add edi, 2 ; Skip the next word field (3 byte instruction)
add esi, 2 ; Skip the next word field (3 byte instruction)
movsd      ; Move the last field

If this isn't a particularly speed critical section of code, we could replace it with this sequence:

lea esi, SourceStructure
lea edi, DestinationStructure
mov ecx,8
rep movsd ; Copy 8 dwords
cmpsd     ; Skip dword field (1 byte instruction)
mov cl,4  ; ECX was zero from REP MOVSD
rep movsd ; Copy 4 more dwords
cmpsw     ; Skip word field (2 byte instruction in USE32 code)
movsd     ; Move the last field

In this example, we can knock a couple more bytes off by realizing that the count of 8 initially being loaded into ECX is a small number. That MOV immediate to ECX is a 5 byte instruction. If we replaced it with:

push 8
pop  ecx

we'd have a 3 byte sequence.
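
Putting the pieces together, the whole copy might look like this (a sketch; keep in mind that CMPS reads both operands and alters the flags):

lea esi, SourceStructure
lea edi, DestinationStructure
push 8     ; 2 byte PUSH of an immediate
pop ecx    ; ECX = 8, 3 bytes for the pair
rep movsd  ; Copy 8 dwords
cmpsd      ; Skip dword field
mov cl,4   ; ECX was zero from REP MOVSD
rep movsd  ; Copy 4 more dwords
cmpsw      ; Skip word field
movsd      ; Move the last field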

Return to chapter index


LOOP isn't always for 'loops'

Return to chapter index

When space is at a premium and we see the sequence:

dec cx (or ECX in 32 bit code)
jnz

it can usually be replaced by a LOOP instruction which will be a byte smaller. After all, what LOOP does is decrement CX or ECX and jump if the result is non zero.

WARNING: The one thing to watch out for when doing this transformation is code that depends on the state of the flags after the DEC instruction. LOOP doesn't alter the flags, but DEC does.

WARNING: Don't do this transformation in very speed critical code that may be running on 386 or later CPU's. The DEC CX/JNZ sequence is actually faster than a LOOP on many of the newer CPU's.

Here's a little Zen Timer program to see how LOOP versus DEC/JNZ performs:

       call ZtimerOn
       mov  cx, 32000
       even
loop1: loop loop1
       call ZtimerOff
       call ZTimerReport
       call ZtimerOn
       mov  cx, 32000
       even
loop2: dec  cx
       jnz  loop2
       call ZtimerOff

Running this on several different test machines yielded these results (all timings are in microseconds):

                 LOOP    DEC CX/JNZ
Pentium 60mhz    4255       2659
386DX-20        19754      18122
386SX-25        18063      16705
486SX-25         8955       5118
486DX2-66       10574      17302

The timings for the 486DX2-66 looked a little out of place, so I went back and changed the "EVEN" directive in the test to be an "ALIGN 16" to make sure the loop tops were paragraph aligned. Curiously, this didn't change anything for the 486SX-25, but it made a big difference on the 486DX2-66! With the paragraph aligned loop top, the 486DX2-66's timing on the DEC CX/JNZ sequence was 9613 microseconds. The timing for the sequence using the LOOP instruction stayed the same.

What's real odd here is that the 486DX2-66 is getting beat by the 486SX-25 on this test - even when everything was paragraph aligned for the 486DX2-66. I know from benchmarking the 486DX2-66's motherboard in the past that it's a fairly slow motherboard compared to other systems in the same class. Its external cache hardware is less effective than that of most other 486DX2-66 systems. Even so, the looping code in this test should be parked in the CPU's internal cache. The 486SX-25 motherboard has no external cache at all, so there's no possibility that it's a factor in its timings.

By the way, Appendix B has some further Zen Timer investigation of the strange performance characteristics of my 486DX2-66. The DX2 is indeed an odd beast.

Return to chapter index


XCHG as a destructive MOV

Return to chapter index

The XCHG instruction can be a powerful weapon for squeezing bytes. When XCHG is used with AX (or EAX in 32 bit code) and one of the other registers, then XCHG is a one byte instruction, while moving one register to another is a two byte instruction. For example:

XCHG EAX,EDX ; This is 1 byte  in 32 bit code.
XCHG SI,AX   ; This is 1 byte  in 16 bit code.
MOV  EAX,EDX ; This is 2 bytes in 32 bit code.
MOV  SI,AX   ; This is 2 bytes in 16 bit code.

In fact, the one byte NOP instruction's encoding (90h) is actually that of an XCHG (E)AX,(E)AX instruction.

Suppose we've got an instruction sequence like this one:

mov ax,dx
mov dx,Something

It wants AX to get DX's value, and then DX gets reloaded right after. Sequences like that can be transformed with XCHG to good effect like this:

xchg ax,dx        ; AX=DX, DX=don't care
mov dx,SomeThing

This sequence is a byte smaller than the one using the MOV we just saw.

When using XCHG as a destructive move like this, I recommend adding a comment similar to that up above. This helps to clarify what the code is really doing. Saying that DX was a "don't care" makes it crystal clear that you're using the XCHG as a destructive one byte move. Comments like this are particularly useful when the instruction that reloads the destroyed register isn't right next to the XCHG instruction like in the examples.

Another stylistic thing that helps when using XCHG as a destructive move is to order the operands as if it were a MOV instruction like in the above examples. That XCHG example above could have been written as:

xchg dx,ax

The code would work perfectly well, but this form just doesn't seem as clear to me.

With operands ordered like a MOV's and with a comment added, there's no question about what's going on in the code.

Return to chapter index


LEA isn't just for addresses

Return to chapter index

The LEA instruction is good for loading effective addresses, but there's no requirement that the values it deals with actually be addresses. LEA also works well as a multi component arithmetic instruction.

Suppose we have an instruction sequence in some USE16 code like this one:

(2 bytes) mov ax, si
(2 bytes) add ax, bx
(4 bytes) add ax, 1234h

That sequence can be reduced to a single LEA instruction:

(4 bytes) lea ax, [bx+si+1234h]

TIP: The LEA instruction doesn't alter the flags. This is often a very convenient behavior.

The real power of the LEA instruction for compacting code shows up in 386 specific code. With the 386 came the ability to have registers like EAX, ECX, and EDX participate in an LEA. The 386 also added the ability to apply a scaling factor of 2, 4, or 8 to one of the component registers, like in this next example assembler listing:

1                              .386
2 00000000                x32  segment use32 para public 'code32'
3                              assume cs:x32
4 00000000 66| 8D 84 88 +      lea ax,[eax+ecx*4+1234h]
5          00001234
6
7 00000008                x32  ends
8
9  0000                x16 segment use16 para public 'code32'
10                         assume cs:x16
11 0000 67| 8D 84 88 +     lea ax,[eax+ecx*4+1234h]
12      00001234
13 0008 51                 push cx
14 0009 C1 E1 02           shl cx,2
15 000C 03 C1              add ax,cx
16 000E 59                 pop cx
17 000F 05 1234            add ax,1234h
18 0012                x16 ends
19                         end

Here we've got USE32 and USE16 code segments with a couple of LEA instructions that are doing a lot of work. The LEA's on lines 4 and 11 are using the scaling feature of LEA on the 386 to do a multiply by 4 on the value in the ECX register during the additions that LEA does.

If we had to get the value of CX * 4 on a pre-386 CPU, while not destroying CX, we would be forced into doing something like the sequence starting on line 13 that approximates the behavior of the LEA on line 11. If it were stipulated that the flags couldn't be altered in that sequence, then a PUSHF/POPF costing 2 bytes would need to be wrapped around the whole thing too.

Even without the PUSHF/POPF, using the new 386 form of the LEA turned out to be 2 bytes shorter than the sequence starting on line 13.

The previous example's LEA's are kind of worst case versions too. They're causing a 4 byte embedded constant 00001234h to be generated in the instructions (see lines 5 and 12). The LEA instruction can also work with small sign extended constants like in this next listing:

1 .386
2 00000000                 x32 segment use32 para public 'code32'
3                              assume cs:x32
4 00000000 66| 8D 44 88 33     lea ax,[eax+ecx*4+33h]
5
6 00000005                 x32 ends
7
8  0000                    x16 segment use16 para public 'code32'
9                              assume cs:x16
10 0000 67| 8D 44 88 33        lea ax,[eax+ecx*4+33h]
11 0005 51                     push cx
12 0006 C1 E1 02               shl cx,2
13 0009 03 C1                  add ax,cx
14 000B 59                     pop cx
15 000C 05 0033                add ax,33h
16 000F                    x16 ends
17                             end

In this example listing the LEA's are 3 bytes shorter than in the previous listing because the constant 33h fits in a sign extended byte. Now the LEA in the USE16 segment is half the size of the equivalent sequence that was forced to preserve the value in CX.

In the USE32 version of LEA on line 4 notice the 66h prefix byte on the instruction. This is the cost of specifying AX as a target in USE32 code. If we could stipulate that the high word of EAX was expendable, and just do the LEA using EAX as a target, then the instruction would be a byte smaller.

Another interesting property of the 386's new 32 bit form of LEA is that the "index" and "base" registers can be the same register. This opens up all sorts of possibilities for doing multiply operations on what previously were inconvenient numbers - like, for example, 10. If we wanted to multiply EAX by 10, one straightforward way would be:

(5 bytes) mov edx,10
(2 bytes) mul edx ; EDX:EAX = EAX*10

That's 7 bytes, and kind of porky if we're trying to save space. We could get it down to 5 bytes with the PUSH/POP trick examined in a prior chapter like this:

(2 bytes) push 10
(1 byte)  pop  edx
(2 bytes) mul  edx

5 bytes is better, but there's still some potentially undesirable side effects. EDX is getting clobbered by the multiply because a 64 bit result is being generated. If we know the numbers involved happened to be small and we weren't going to get any significant result in EDX, then we just burned a register for nothing.

Suppose we used an LEA here like this though:

(3 bytes) lea eax,[eax+eax*4] ; EAX = EAX*5
(2 bytes) add eax,eax         ; EAX = EAX*10

This is still 5 bytes, but we've avoided burning the EDX register this time. There's an important performance advantage to this last sequence too. MUL instructions are real expensive, even on the Pentium where they cost about 10/11 clock cycles. On a 486, a MUL can cost as little as 13 clocks all the way up to 42 clocks for a 32 bit multiply.

smile.gif (241 bytes)486's and Pentiums can knock off an LEA instruction in a single cycle though! The register add of EAX,EAX in the last code fragment only costs a single cycle on a 486 and Pentium too.

So, anytime you can replace an expensive unsigned multiply (MUL) with a short sequence of LEA's and ADD's, you're going to be winning bigtime on the newer CPU's.
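
Multiplying by 9, for instance, collapses into a single LEA; a quick sketch:

(3 bytes) lea eax,[eax+eax*8] ; EAX = EAX*9, flags untouched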

If you know for certain that you could live with doing a signed multiply, then the way to go for small numbers would be an IMUL instruction. IMUL has a special form that allows multiplying a register by an immediate value. For example:

(3 bytes) imul eax,10 

This is 2 bytes better than the LEA/ADD combination, but it does do a signed multiply!

Return to chapter index


Using JCXZ when you can't hit the flags

Return to chapter index

One situation that crops up with annoying regularity in code is needing to do a conditional jump based on a register or memory variable being zero or not, while the state of the flags needs to be preserved across the conditional jump. This happens a lot when functions give a return value in the ZF or CF flag and the code is running with interrupts disabled. An OR or CMP followed by a JZ/JE or JNZ/JNE would alter the flags, so often we'll see code that does some sort of twiddle involving PUSHF/POPF to preserve the flags, like this code fragment:

      IFsensitive proc near
          cli
      ;
      ; Some uninterruptable code is here
      ;
(1 byte)  pushf
(2 bytes) xor ax,ax
(4 bytes) cmp byte ptr Flag,al
(2 bytes) je ContinueAX0
      ContinueAX1:
(1 byte)  inc ax
      ContinueAX0:
(7 bytes) POPFF
      ;
      ; More uninterruptable code is here
      ;
          sti
          ret
      IFsensitive endp

Notice the POPFF macro that was used rather than a straight POPF instruction. The stipulation was that interrupts are disabled, so using a straight POPF in this section of code would have made it vulnerable to the POPF bug that exists on early stepping levels of the 80286 chips. POPFF macros are typically gruesome in terms of performance and size. There's a PUSH, a CALL, a JMP, and an IRET (a very slow instruction) involved. A typical POPFF macro implementation, which expects the PUSHF'ed flags image to be sitting on top of the stack, might look something like this 7 byte sequence:

POPFF macro
  push cs  ;; Get a far return
  call $+5 ;; address on the stack above the saved flags, then near CALL an IRET
  jmp  $+3 ;; Continue on with main flow of the code after the IRET
  iret     ;; This pops the far return address and the saved flags
endm

Realizing that JCXZ can do a conditional jump without needing to zap the flags would allow us to ditch the POPFF macro and rewrite that code like this:

      IFsensitive proc near
          cli
      ;
      ; Some uninterruptable code is here
      ;
(1 byte)  push cx
(3 bytes) mov  ax,0
(2 bytes) mov  cx,ax
(4 bytes) mov  cl,byte ptr Flag
(2 bytes) jcxz ContinueAX0
      ContinueAX1:
(2 bytes) mov  al,1
      ContinueAX0:
(1 byte)  pop cx
      ;
      ; More uninterruptable code is here
      ;
          sti
          ret
      IFsensitive endp

Even with all the jerking around to set up the CX register and save/restore it, this new sequence beat the one that used the POPFF macro by 2 bytes. It's going to run a lot faster too. Even in real mode or V86 mode on a Pentium, an IRET alone takes 8 clocks.

Return to chapter index


Tracking a variable in a register

Return to chapter index

Suppose we had a couple of C functions that were accessing a global memory variable like this:

int x;

void foo(int a, int b, int c, int d, int e)
  {
  if (a) x++;
  if (b) x++;
  if (c) x++;
  if (d) x++;
  if (e) x++;
  }
void foo2(int a, int b, int c, int d, int e)
  {
  register int tempx = x;
  if (a) tempx++;
  if (b) tempx++;
  if (c) tempx++;
  if (d) tempx++;
  if (e) tempx++;
  x = tempx;
  }

Both of these functions have well defined entry and exit points. It would be perfectly valid for a C compiler to transform the code for foo() into the form foo2() is written in when generating machine code. The semantics of the code would be the same in both cases.

Yet when I compiled those functions with the 32 bit Watcom version 10.0 compiler using the -s (omit stack probes) and -os (favor space) switches, I got this result:

Module: D:\TESTS\track.c
Group: 'DGROUP' CONST,CONST2,_DATA,_BSS
Segment: _TEXT BYTE USE32 0000006b bytes
0000 55           foo_ push ebp
0001 89 e5             mov ebp,esp
0003 85 c0             test eax,eax
0005 74 06             je L1
0007 ff 05 00 00 00 00 inc dword ptr _x
000d 85 d2 L1          test edx,edx
000f 74 06             je L2
0011 ff 05 00 00 00 00 inc dword ptr _x
0017 85 db L2          test ebx,ebx
0019 74 06             je L3
001b ff 05 00 00 00 00 inc dword ptr _x
0021 85 c9 L3          test ecx,ecx
0023 74 06             je L4
0025 ff 05 00 00 00 00 inc dword ptr _x
002b 83 7d 08 00    L4 cmp dword ptr +8H[ebp],00000000H
002f 74 06             je L5
0031 ff 05 00 00 00 00 inc dword ptr _x
0037 5d L5             pop ebp
0038 c2 04 00          ret 0004H
003b 56          foo2_ push esi
003c 55                push ebp
003d 89 e5             mov ebp,esp
003f 89 c6             mov esi,eax
0041 a1 00 00 00 00    mov eax,_x
0046 85 f6             test esi,esi
0048 74 01             je L6
004a 40                inc eax
004b 85 d2          L6 test edx,edx
004d 74 01             je L7
004f 40                inc eax
0050 85 db          L7 test ebx,ebx
0052 74 01             je L8
0054 40                inc eax
0055 85 c9          L8 test ecx,ecx
0057 74 01             je L9
0059 40                inc eax
005a 83 7d 0c 00    L9 cmp dword ptr +0cH[ebp],00000000H
005e 74 01             je L10
0060 40                inc eax
0061 a3 00 00 00 00 L10 mov _x,eax
0066 5d                pop ebp
0067 5e                pop esi
0068 c2 04 00          ret 0004H
No disassembly errors

Remember what I said previously about looking at listing files for "wide" instructions? The generated code for foo() is 3Bh bytes long and has a bunch of awfully wide instructions in it doing memory increments on the variable "x".

The generated code for foo2(), where we introduced the temp register variable that tracks the value of "x" through the life of the foo2() code, is a lot better. It's only 30h bytes long (11 bytes smaller than foo()) and potentially touches memory a lot less, depending on what kind of values are passed as parameters to foo2().

The same thing we tricked the Watcom compiler into doing can often be applied to assembler routines. Beginning programmers, or someone who is just in a hurry, will frequently treat the CPU as if it didn't have any registers. The result is fatter and slower code in many cases.

The rule of thumb here is: when you see a routine referencing a memory variable more than two times it may be a candidate for a transformation like the one above if there's a register free. Often it'll pay to force a register to be free by PUSH'ing and POP'ing something at the entry and exit to the routine. Remember, PUSH and POP are one byte instructions for registers, and on the newer CPU's, like the 486's and Pentiums, PUSH/POP run pretty fast too.

Don't hesitate to use PUSH/POP when accesses to memory variables are grouped together in a routine and registers are scarce. You may encounter code that looks something like this:

        Var1  dd ?
        Var2  dd ?
        Var3  dd ?
        Var4  dd ?
(10 bytes)      mov Var1,1234
( 6 bytes)      add Var1,ebx
( 7 bytes)      shl Var1,2
(10 bytes)      mov Var2,5678
( 6 bytes)      add Var2,ebx
( 7 bytes)      shl Var2,2
(10 bytes)      mov Var3,4321
( 6 bytes)      add Var3,ebx
( 7 bytes)      shl Var3,2
(10 bytes)      mov Var4,8765
( 6 bytes)      add Var4,ebx
( 7 bytes)      shl Var4,2
( 5 bytes)      mov eax, Var1
( 6 bytes)      add Var2, eax
(103 bytes total)

Even if there's only one register available here for use, say EBP, it'll pay to "scope" the value of Var1 onto the stack with a PUSH, because it's going to be referenced later on where Var1 gets added to Var2 near the end of the sequence.

Tracking Var1, Var2, Var3, and Var4 all in EBP would look like this:

(1 byte)  push ebp ; Save EBP.
(5 bytes) mov ebp,1234 ; EBP is tracking Var1 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(1 byte)  push ebp ; Var1 is on the stack now
(5 bytes) mov ebp,5678 ; EBP is tracking Var2 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(6 bytes) mov Var2,ebp ; Var2 is in memory now.
(5 bytes) mov ebp,4321 ; EBP is tracking Var3 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(6 bytes) mov Var3,ebp ; Var3 is in memory now.
(5 bytes) mov ebp,8765 ; EBP is tracking Var4 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(6 bytes) mov Var4,ebp ; Var4 is in memory now.
(1 byte)  pop ebp ; EBP is tracking Var1 again
(6 bytes) add Var2,ebp ;
(6 bytes) mov Var1,ebp ; Var1 is in memory now.
(1 byte)  pop ebp ; Restore EBP value
(74 bytes total)

Not too bad. We saved 29 bytes by tracking the variables in a register. This code is going to run a lot faster and be easier on the working set of whatever program it lives in.

Is this as small as it can get though? No. We can also apply the concept of variable pooling introduced earlier to get something like this:

        VarPool label dword
        V1off   equ   $-VarPool
        Var1    dd    ?
        V2off   equ   $-VarPool
        Var2    dd    ?
        V3off   equ   $-VarPool
        Var3    dd    ?
        V4off   equ   $-VarPool
        Var4    dd    ?

(1 byte)  push ebp ; Save EBP.
(1 byte)  push edi ; Save EDI.
(6 bytes) lea edi,VarPool ; Aim EDI at the variable pool
(5 bytes) mov ebp,1234 ; EBP is tracking Var1 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(1 byte)  push ebp ; Var1 is on the stack now
(5 bytes) mov ebp,5678 ; EBP is tracking Var2 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(3 bytes) mov [edi+V2off],ebp ; Var2 is in memory now.
(5 bytes) mov ebp,4321 ; EBP is tracking Var3 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(3 bytes) mov [edi+V3off],ebp ; Var3 is in memory now.
(5 bytes) mov ebp,8765 ; EBP is tracking Var4 now
(2 bytes) add ebp,ebx
(3 bytes) shl ebp,2
(3 bytes) mov [edi+V4off],ebp ; Var4 is in memory now.
(1 byte)  pop ebp ; EBP is tracking Var1 again
(3 bytes) add [edi+V2off],ebp ;
(3 bytes) mov [edi+V1off],ebp ; Var1 is in memory now.
(1 byte)  pop edi ; Restore EDI value
(1 byte)  pop ebp ; Restore EBP value
(67 bytes total)

Using the variable pool technique knocked another 7 bytes off for 67 total. Now this code is starting to get fairly tight for USE32 code which normally tends toward plumpness. If we could stipulate that the value in ECX could be destroyed in this section of code, we could do CL based SHL's rather than the immediate shifts being used now. That would be good for another two byte savings. The SHL EBP,CL instruction will be 2 bytes versus 3 for the immediate shifts. There's 4 shifts for a total of 12 bytes currently. Using CL shifts would be 8 bytes, plus another 2 to load the value 2 into CL with a MOV instruction.
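
Here's a sketch of that variation applied to the front of the sequence, assuming CL really is free to be destroyed:

(2 bytes) mov cl,2     ; Shift count parked in CL
(5 bytes) mov ebp,1234 ; EBP is tracking Var1 now
(2 bytes) add ebp,ebx
(2 bytes) shl ebp,cl   ; 2 bytes versus 3 for SHL EBP,2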

Return to chapter index


Do we always need a stack frame to access parameters?

Return to chapter index

When interfacing C/C++ code with assembler modules, we'll often see the assembler code set up a "standard" stack frame just like a C compiler would to get at any pushed parameters. In some cases this isn't necessary though, and it just bloats and slows down the code.

Here's a C test program and 4 assembler functions that demonstrate the differences in performance and size between using a "standard" frame to access parameters and popping them off the stack and returning via a JMP.

/*----------------------------------------------------
  SPTEST.C
  Demonstration of performance costs of different
  calling conventions and ways of accessing parameters
  on the stack.
----------------------------------------------------*/
#include <stdio.h>
#include <time.h>

typedef unsigned short USHORT;

extern USHORT near SplitNibbles(USHORT val);
extern USHORT near SplitNibbles2(USHORT val);
extern USHORT near pascal PasSplitNibbles(USHORT val);
extern USHORT near pascal PasSplitNibbles2(USHORT val);

#define ITERATIONS 10000000L /* 10M */

int main(void)
   {
   unsigned long i;
   time_t start,stop,elapsed;
   #define TESTLOOP for (i=0L; i<ITERATIONS; i++)
   /*
     POP's and JMP using "standard" C conventions
   */
   start = time(NULL);
   TESTLOOP
      {
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      SplitNibbles2(0x0099);
      }
   stop = time(NULL);
   elapsed = stop-start;
   printf("Using POP's and JMP = %lu seconds\n",
       (unsigned long)elapsed);
   /*
     BP frame access using "standard" C conventions
   */
   start = time(NULL);
   TESTLOOP
      {
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      SplitNibbles(0x0099);
      }
   stop = time(NULL);
   elapsed = stop-start;
   printf("Using standard frame = %lu seconds\n",
       (unsigned long)elapsed);
   /*
     POP's and JMP using "pascal" conventions
   */
   start = time(NULL);
   TESTLOOP
      {
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      PasSplitNibbles2(0x0099);
      }
   stop = time(NULL);
   elapsed = stop-start;
   printf("Using POP's and JMP(Pascal) = %lu seconds\n",
       (unsigned long)elapsed);
   /*
     BP frame access using "pascal" conventions
   */
   start = time(NULL);
   TESTLOOP
      {
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      PasSplitNibbles(0x0099);
      }
   stop = time(NULL);
   elapsed = stop-start;
   printf("Using standard frame(Pascal) = %lu seconds\n",
      (unsigned long)elapsed);
   return 0;
   }

;--------------------------------------------
; SPLIT.ASM
;
; Support functions to show the performance
; difference with different methods of
; accessing parameters and different calling
; conventions.
;--------------------------------------------
_TEXT segment use16 para public 'code'
;
; This function uses a "standard" stack
; frame to get at its parameters and is
; called using "standard" calling conventions
;
        align 16
_SplitNibbles proc near
        public _SplitNibbles
        push  bp        ; 1 byte
        mov   bp,sp     ; 2 bytes
        mov   ax,[bp+4] ; 3 bytes
        db    0D4h,10h  ; 2 bytes
        pop   bp        ; 1 byte
        ret             ; 1 byte
_SplitNibbles endp

;
; This function uses a "standard" stack
; frame to get at its parameters and is
; called using "pascal" calling conventions
;
        align 16
PASSPLITNIBBLES proc near
        public PASSPLITNIBBLES
        push bp         ; 1 byte
        mov bp,sp       ; 2 bytes
        mov ax,[bp+4] ; 3 bytes
        db 0D4h,10h ; 2 bytes
        pop bp  ; 1 byte
        ret 2   ; 3 bytes
PASSPLITNIBBLES endp
;
; This function uses the POP trick to
; get at its parameters and is called
; using "standard" calling conventions
;
        align 16
_SplitNibbles2 proc near
        public _SplitNibbles2
        pop cx  ; 1 byte
        pop ax  ; 1 byte
        push ax         ; 1 byte
        db 0D4h,10h ; 2 bytes
        jmp cx  ; 2 bytes
_SplitNibbles2 endp
;
; This function uses the POP trick to
; get at its parameters and is called
; using "pascal" calling conventions
;
        align 16
PASSPLITNIBBLES2 proc near
        public PASSPLITNIBBLES2
        pop cx  ; 1 byte
        pop ax  ; 1 byte
        db 0D4h,10h ; 2 bytes
        jmp cx  ; 2 bytes
PASSPLITNIBBLES2 endp
_TEXT ends
        end

I built the SPTEST.C program with the Microsoft version 8.0 C/C++ compiler and ran two trials on several test machines. The results were interesting. All timings are in seconds.

                   386SX-25 486SX-25 60mhz P5 486DX2-66 386DX-20
POP/JMP(Std C)     270/271  126/127   47/47   239/239   332/333
Std frame(Std C)   327/326  137/137   51/52   246/246   359/359
POP/JMP(Pascal)    229/229  112/112   48/48   207/206   302/302
Std frame(Pascal)  264/265  128/127   57/57   242/243   343/343

The POP/JMP scheme looks to be a winner whether Pascal or standard C calling conventions are used, with the Pascal calling convention form being the fastest (and the smallest too!) everywhere but on the Pentium, where the two POP/JMP forms ran neck and neck.

What's most notable here is the fairly dramatic speed differences between standard C calling convention with standard stack frame access versus POP/JMP with Pascal calling conventions.

The 386SX-25 shows just over a 40% difference between the best and the worst, while the Pentium is only about 6%. The 486's and 386DX are in the middle around 20%.

Return to chapter index


PUSHA/POPA on 186 or better

Return to chapter index

The 80186 and subsequent processors have the ability to push and pop all the general registers with single byte instructions. The NEC V20/V30 can do this too by the way. When saving space is of paramount importance, this ability can come in very handy.

Often we'll see a code sequence that looks like this:

(1 byte) push ax
(1 byte) push si
(1 byte) push di
;
; Some code is here that causes AX,SI, and DI
; to be destroyed, but they need to be preserved.
;
(1 byte) pop di
(1 byte) pop si
(1 byte) pop ax

That sequence works OK, but if we're willing to pay the price for sloshing all the general registers on and off the stack it can be reduced to this:

(1 byte) pusha
;
; Some code is here that causes AX,SI, and DI
; to be destroyed, but they need to be preserved.
;
(1 byte) popa

This transformation saved 4 bytes over the version that pushed and popped the affected registers individually. Naturally, we wouldn't want to employ a change like this one if one or more of the general registers were being used for passing back return values.

While the space savings are excellent with PUSHA/POPA, the performance cost can be pretty hefty if only a few registers need to be saved and restored. This ZenTimer program demonstrates the cost differences:

.186
; PUSH/POP of 2 registers
align 4
nop
call ZtimerOn
rept 1000
push si
push di
pop di
pop si
endm
call ZtimerOff
call ZtimerReport
; PUSH/POP of 3 registers
align 4
nop
call ZtimerOn
rept 1000
push si
push di
push bx
pop bx
pop di
pop si
endm
call ZtimerOff
call ZtimerReport
; PUSH/POP of 4 registers
align 4
nop
call ZtimerOn
rept 1000
push si
push di
push bx
push cx
pop cx
pop bx
pop di
pop si
endm
call ZtimerOff
call ZtimerReport
; PUSH/POP of 5 registers
align 4
nop
call ZtimerOn
rept 1000
push si
push di
push bx
push cx
push dx
pop dx
pop cx
pop bx
pop di
pop si
endm
call ZtimerOff
call ZtimerReport
; PUSH/POP of 6 registers
align 4
nop
call ZtimerOn
rept 1000
push si
push di
push bx
push cx
push dx
push bp
pop bp
pop dx
pop cx
pop bx
pop di
pop si
endm
call ZtimerOff
call ZtimerReport
; PUSH/POP of 7 registers
align 4
nop
call ZtimerOn
rept 1000
push si
push di
push bx
push cx
push dx
push bp
push ax
pop ax
pop bp
pop dx
pop cx
pop bx
pop di
pop si
endm
call ZtimerOff
call ZtimerReport
; PUSHA/POPA
align 4
nop
call ZtimerOn
rept 1000
pusha
popa
endm
call ZtimerOff
call ZtimerReport

Running this program on several test CPU's yielded the following results. All of these timings are in microseconds.

            NEC V20   386SX-25   386DX-20   486SX-25  60mhz Pentium
2 push/pop   11602      714        926        440            70
3 push/pop   17467     1145       1430        743           110
4 push/pop   23313     1533       1894        993           158
5 push/pop   29093     1898       2396       1242(*)        182
6 push/pop   35198(*)  2260       2855       1480           320(*)
7 push/pop   40730     2648(*)    3352(*)    1744           352
pusha/popa   32490     2289       2934       1133           299

(*) - break even point where PUSHA/POPA outperform separate
      pushes and pops.

It's interesting that on all these CPU's there came a point where a PUSHA/POPA eventually outperformed separate pushes and pops. If performance is paramount in an application, I'd say picking a threshold of about 6 registers is a good compromise when choosing between separate pushes/pops and a PUSHA/POPA. If the code is in a low performance path, then you may want to employ PUSHA/POPA heavily to cut the size of the code. As always, let a profiler be the guide for where to apply transformations like this. Porting older 8086 code to take advantage of the newer CPU's features is where you'll likely see lots of opportunities for PUSHA/POPA.

Suppose we had a function that wanted to give a return value in one of the general purpose registers? Does this preclude using PUSHA/POPA as a short way to save and restore the other registers? Not necessarily!

There's no reason why we can't insert the value to be returned onto the stack so it will be popped off with the POPA. Let's take EAX as an example in some 32 bit code. We could do something like this:

(4 bytes)  mov [esp+28],eax
           popad
           ret

Even though the MOV costs 4 bytes, there may still be a space payoff if more than 3 registers would have needed to be pushed and popped otherwise.
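
In context, a hypothetical function (SomeFunc is an assumed name) using the trick might look like this:

SomeFunc proc near
        pushad             ; Save all the general registers
        ; ...work that leaves a result in EAX...
        mov  [esp+28],eax  ; Drop the result into EAX's slot in the PUSHAD frame
        popad              ; EAX comes back holding the return value
        ret
SomeFunc endp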

The order in which PUSHA and POPA process the registers can sometimes be useful to know. Suppose a function was to return a value in the DI or EDI register? It turns out that DI/EDI is the last register pushed and the first popped. To return a value in DI or EDI we could use sequences like this:

USE16 code                  USE32 code
(1 byte) pop  ax            (1 byte) pop  eax  ; Remove saved DI/EDI
(1 byte) push di            (1 byte) push edi  ; Replace it with return value
         popa                        popad
         ret                         ret

In this case, returning a value in DI or EDI cost only two extra bytes. This would pay off if 3 or more registers would have needed to be saved and restored otherwise.

Return to chapter index


Immediate values on 186 or better CPU's

Return to chapter index

The 80186 and subsequent processors have the ability to use immediate values for shifts, rotates, and signed multiplies. The NEC V20/V30 can do this too. When porting code written for the 8086 to the newer processors, don't forget these new features. They can save a lot of space and make the code run faster.

One common sequence seen in older 8086 code is something like this one:

  (2 bytes) shl cx,1
  (2 bytes) shl cx,1
  (2 bytes) shl cx,1

The 8086 couldn't do a CL based shift when the value to be shifted or rotated happened to be in the CX or CL register. A sequence like the one above should always be converted to use an immediate shift or rotate like this for the newer CPU's:

  (3 bytes) shl cx,3

The only time you should use a CL based shift on the newer CPU's is when the shift count is unknown until runtime. One added benefit, besides being faster, is that the immediate shifts and rotates free up the CL register for other uses. Having an extra register available often allows further speed and space optimizations. Even if you just moved a zero into CL and used it for zeroing byte variables in memory, it's going to result in instructions that are a byte smaller than moves of an immediate zero to those variables.
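
The immediate form of the signed multiply is easy to overlook when porting. A sketch, assuming a signed result is acceptable:

(3 bytes) imul ax,10 ; AX = AX * 10, signed, one instruction on 186 or better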

Return to chapter index


Recognizing convenient prior results

Return to chapter index

One thing programmers miss all the time is that doing some operation may leave a useful result behind as a side effect. These serendipitous side effects can often yield significant size and speed savings if we're alert enough to notice them.

Consider this little code fragment:

  (2 bytes) or ax,ax
  (2 bytes) jnz SomeWhere
  (2 bytes) xor dx,dx

By realizing that AX must have been zero for a fall through to happen on the conditional jump, we can apply the CWD trick to zero DX with a one byte instruction like this:

  (2 bytes) or  ax,ax
  (2 bytes) jnz SomeWhere
  (1 byte)  cwd ; DX=0

Another one I see with annoying regularity in code is this little sequence:

  (2 bytes) xor ax,ax ; Set return code=0
  (1 byte)  clc       ; CF=0 for success
  (1 byte)  ret       ; Return

By realizing that one of the side effects of XOR'ing a register with itself is that the carry flag gets cleared, we can safely eliminate the CLC and turn that 4 byte sequence into a 3 byte sequence. For an instruction that does nothing but flip a bit in the flags, CLC is pretty expensive. It's a two clock instruction on all the CPU's.

One opportunity that occurs all the time is with repeated string instructions. You'll encounter sequences like this all the time:

  (3 bytes) mov cx,5
  (2 bytes) rep movsb
  (5 bytes) mov [ByteMemVar],0

By noticing that CX was zeroed after the REP MOV, that sequence can always be transformed into this:

  (3 bytes) mov cx,5
  (2 bytes) rep movsb
  (4 bytes) mov [ByteMemVar],cl

In any piece of code there's going to be hundreds, or even thousands, of side effect results produced as the code runs. The more you recognize, the tighter and faster the code is going to be.

Another possibility is that we might inject beneficial side effects into some functions we call. Suppose upon examining a program we noticed that a variable happens to be set to zero soon after calling a particular function most of the time. Maybe it would be useful to have that function, as a side effect, zero one of the registers on return rather than leaving garbage in it. Then we could be assured that a zero was in that register every time we came back from that function - a zero that could be used to set the memory variable with.

We'll pay a space penalty to zero the register in one place, but enjoy the benefits of knowing the register is zero on return from every place that function is called.
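
Here's a sketch of the idea, with hypothetical names:

SomeFunc proc near
        ; ...the function's real work...
        xor  cx,cx           ; Side effect: CX=0 for every caller
        ret
SomeFunc endp
        ...
        call SomeFunc
        mov  [ByteMemVar],cl ; Use the known zero in CL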

Return to chapter index


Merging CALL/RET into JMP

Return to chapter index

Frequently we'll see code sequences like this:


        call    foo
        ret

If foo() is a near function and the RET is a near RET, or if foo() is a far function and the RET is a far RET, then it's usually a good bet that the sequence can be replaced by a JMP to foo() rather than the CALL. After all, foo() is going to execute a RET to get back to the caller, so foo()'s RET can do double duty and return for the caller as well.
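
The merged version of the fragment above is simply:

        jmp     foo ; foo's RET now returns directly to our caller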

WARNING: One instance where you won't want to do this is when foo() is looking back onto the stack frame of its caller to grope something off of it at a particular offset. Another is when there's a possibility that foo() may not return, and the place it jumps to expects a particular stack layout.

The benefits of a transformation like this are several. The JMP will be a byte smaller than the CALL/RET sequence. The code will also run faster because fewer return addresses are being sloshed on and off the stack. The pipeline and prefetchers are happier because they only get blown away once by foo()'s RET rather than twice. On the 486's and Pentiums with their on chip caches, the JMP may also mean one less cache line load if the cache line containing the CALL/RET had been purged by the time the foo() function returns. Forcing a cache line load just because we needed a single RET instruction is pretty tragic for performance. Remember from the earlier chapters just how sensitive the 486's and Pentiums are about their on chip caches? This merging of CALL/RET into a JMP is one of the few tricks that truly is a freebie when it can be applied. The code is always smaller, and it always runs faster. What more could you ask for?

Return to chapter index


Far call translations

Return to chapter index

A related topic to translating CALL/RET to JMP is the far call translation to a near call. Suppose we've got some far function like these:

func1 proc far
   ; Blah, blah, blah
   ret
func1 endp

func2 proc far
   ; Blah, blah, blah
   call func1
   ; Blah, blah, blah
   ret
func2 endp

Some assemblers will generate a 5 byte far CALL instruction where func2() calls func1(). We can do better here though since we know func1() and func2() are in the same code segment. We can force a near call by doing this:

func1 proc far
   ; Blah, blah, blah
   ret
func1 endp

func2 proc far
   ; Blah, blah, blah
   push cs
   call near ptr func1
   ; Blah, blah, blah
   ret
func2 endp

The end result of this transformation is the same as if a far CALL had been done, but it's going to be a byte smaller.

By the way, some newer assemblers will recognize that the far call can in fact be transformed into the push of CS followed by a near call. Check the code your assembler is generating to see if this is something you don't need to worry about.

There are significant performance benefits to doing a transformation like this as well. The PUSH CS/near CALL sequence almost always executes faster - on some CPU's it executes a LOT FASTER.

Here's how a far call compares with the PUSH CS/near CALL pair on several CPU's (clock cycles; "+" means just over):

                        Far CALL    PUSH CS + near CALL
286, real mode             13               10
286, protected mode        26               10
386, real mode             17+               9+
386, protected mode        34+               9+
486, real mode             18                6
486, protected mode        20                6

So, it should be obvious that transforming a far call whenever possible into a push CS followed by a near call is a solid winner.

Many linkers have an option to do a far call translation for you. They'll detect a far call instruction followed by a double word relocation address where the segment part of the relocation is the current segment. Recent versions of Borland's TLINK will do this by default and there's a switch to turn it off. Microsoft's LINK has the /FARCALLTRANSLATION and /NOFARCALLTRANSLATION options.

Having the linker do the translation has one downside though. Since the push CS followed by a near call is a byte shorter than a far call, the linker will typically pad the code out with a NOP to make everything line up.

Return to chapter index


Custom instructions via illegal opcode and Int 3's

Return to chapter index

This last topic is a real sleazy one and should probably be reserved for the most dire of space saving situations, because the performance penalty is really gruesome! The payoff is only one or two bytes saved for each instance it's used, too. This is the kind of thing you might try when every other trick imaginable has been tried and you've still come up a little short trying to cram some code into a small ROM chip for an embedded 80186 system.

All the Intel and compatible chips from the 80186 onward implement an illegal opcode exception on Trap #6. All the chips, including the 8086, implement the Int 3 breakpoint exception as well.

An Int 3 is a special form of software interrupt in that the instruction is only one byte, whereas all the other software interrupt instructions are two bytes. Having a single byte instruction that's capable of executing an unconditional control transfer opens up some interesting possibilities.

An Int 3 could be used as a special purpose instruction to CALL a particular function that's CALL'ed from many different places in the code. A near call is 3 bytes, so using the Int 3 would save two bytes per invocation. If 100 instances of CALL's to a particular function existed in a program, making those calls via an Int 3 could save 200 bytes minus the overhead of setting up the Int 3 vector initially. The called routine could return via an IRET if the returned flags value didn't matter, or via a far RET 2 instruction if the flags were important on return. Grabbing the Int 3 vector and aiming it at this special routine during initialization isn't going to cost anywhere near 200 bytes, so there'll be a measurable space savings if there were 100 instances of calls in a program. If the called function happened to be a far function previously, then using an Int 3 to call it can save 4 bytes per invocation.

Here's an example of a Turbo/Borland C program that uses an Int 3 as a one byte CALL. It prints out a value passed to the called function in the AX register.

#pragma inline
/*---------------------------------------------------
CALL1.C - A demonstration of how to
use an Int 3 for a one byte "call".
---------------------------------------------------*/
#include <stdio.h>
#include <dos.h>

void interrupt Function(void)
   {
   printf("Function() called with AX=%04X\n", _AX);
   }

int main(void)
   {
   int i;
   void interrupt (*OldInt3)(void);
   /* 
     Save old Int 3, set new Int 3 handler
   */
   OldInt3 = getvect(3);
   setvect(3, Function);

   /*
     Loop 10 times displaying loop variable in called function
   */
   for (i=0; i<10; i++)
      {
      /* Set AX register to the value in 'i' */
     _AX = i;
      /* This next Int 3 is a one byte instruction */
      asm int 3
      }
   setvect(3, OldInt3);
   return 0;
   }

When run, this program displays the following output:

Function() called with AX=0000
Function() called with AX=0001
Function() called with AX=0002
Function() called with AX=0003
Function() called with AX=0004
Function() called with AX=0005
Function() called with AX=0006
Function() called with AX=0007
Function() called with AX=0008
Function() called with AX=0009

The code for doing calls with an illegal opcode will look very similar to that in the CALL1.C example. The only difference is you'll be hooking Int 6 instead of Int 3 and using some illegal opcode rather than the Int 3 instruction. In the interrupt function itself one alteration will have to be made though. Before returning to the caller, the pushed return IP on the stack will have to be bumped up by however many bytes there are in the illegal instruction you've chosen.

Software interrupts, like Int 3, will push the address of the following instruction on the stack as a return address. An illegal opcode will push the address of the instruction that contains the illegal opcode.
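
Here's a sketch of how a hypothetical Int 6 handler (BadOpHandler is an assumed name) might make that adjustment, assuming a 2 byte illegal opcode was chosen:

BadOpHandler proc far
        push bp
        mov  bp,sp
        add  word ptr [bp+2],2 ; Bump the pushed IP past the 2 byte opcode
        pop  bp
        ; ...do the custom instruction's work here...
        iret
BadOpHandler endp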

If you're working on a ROM based system, or something where you could be guaranteed that there wouldn't be any programs running that did instructions expecting a numeric processor, then the 286 and above CPU's also offer the opportunity to do a similar trick using Trap 7. Trap 7 is the NPX (numeric processor extension) not present trap. By turning on the EM bit in the MSW you can enable Trap 7's to occur on any instruction that would normally have been sent to a coprocessor if one would have been present. The FWAIT instruction is also one byte like Int 3 was.

WARNING: The illegal opcode trick won't work in a DOS box under Windows 3.X in "enhanced" mode or Windows 95. Those systems don't reflect the illegal opcode exception back into the DOS box correctly. They'll insist that the app is misbehaving and say it should be terminated.

WARNING: The Trap 7 trick won't work in a DOS box under Windows 3.X in "enhanced" mode or Windows 95. Those systems don't reflect the Trap 7 exception back into the DOS box correctly. They'll insist that the app is misbehaving and say it should be terminated.

Given the two warnings above, I can't recommend using the illegal opcode or Trap 7 tricks in any general commercial application. These ploys are strictly for things like ROM based toaster controllers and alarm monitor boxes where you have total control over the software content of the product and can be certain those caveats won't be violated.

The Int 3 trick works fine in any environment. The only caveat when using it is that it's going to cause debuggers to go berserk if you've got any problems in the program. Reserve tricks like these for last ditch space saving efforts in fairly well debugged code.

Work slowly and with a lot of care when using dastardly tricks like these.

By the way, the Int 3 trick is sometimes used in programs as one of the many and varied anti-hacking techniques people employ to try and defeat someone attempting to take their program apart with a debugger.

Return to chapter index


That NPX Thang!

Return to chapter index

If you're working with some code that demands the floating point hardware, or have an application that runs in an environment where the system supplies a floating point emulator, then there are some interesting possibilities for serious space savings by recognizing the capabilities of the chip. Even the lowly old 8088 can directly do 32 or 64 bit integer arithmetic when paired with an 8087. If you've got a program that demands an '87 chip for other things, and/or is dragging around a floating point emulator already, why not use that capability to its fullest?
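For instance, here's a minimal sketch of a full 64 bit add done through the '87 (VarA, VarB, and Result are hypothetical qword variables; the result is exact because the '87's 64 bit mantissa holds any 64 bit integer without rounding):

fild    qword ptr VarA      ; load 64 bit integer A
fild    qword ptr VarB      ; load 64 bit integer B
faddp   st(1), st           ; add them - after the pop the sum is in ST
fistp   qword ptr Result    ; store the 64 bit integer sum

Compare that to the string of ADD/ADC pairs and register juggling the integer unit would need for the same job.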

Opportunities abound when an '87 is present. Beyond simpler and smaller 32 and 64 bit integer arithmetic, you've now got the ability to zero out up to 10 bytes of memory with a two instruction sequence like this one:

(2 bytes) fldz
(4 bytes) fstp tbyte ptr MemVar

If those 10 bytes were in a variable "pool" as described earlier, and were next to each other in memory, another byte can be shaved off. Say BX was aimed at the pool and the 10 bytes were at offset 18h in the pool. Then this would work:

(2 bytes) fldz
(3 bytes) fstp tbyte ptr [bx+18h]

5 bytes! We zeroed out 10 bytes worth of memory with only 5 bytes of code. One really cool side effect here is that we didn't have to burn a single general register accomplishing it! Just setting up to do a REP STOSW would take more than 5 bytes - and it's going to burn CX, DI, AX, and possibly ES if it isn't aimed at the right segment at the time.
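For comparison, here's roughly what the conventional route costs, assuming the value is built in AX and ES already addresses the right segment:

(2 bytes) xor ax, ax               ; the value to store
(3 bytes) mov di, offset MemVar    ; aim DI at the destination
(3 bytes) mov cx, 5                ; 5 words = 10 bytes
(2 bytes) rep stosw                ; 10 bytes of code in all - double
                                   ; the FLDZ/FSTP sequence - and AX,
                                   ; CX, and DI get trashed doing it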

A quad word of memory can be zeroed or set to one with similar instruction sequences:

fldz
fstp qword ptr MemVar

fld1
fistp qword ptr MemVar

The setting of a quad word to one can be handy when there are two "long" (i.e. dword) variables next to each other where one is to be zeroed and the other set to one. If the variables are ordered in a convenient way, then this is a viable trick, as this sketch shows:
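(Count and Scratch are hypothetical adjacent dword variables; the one receiving the 1 has to sit at the lower address because the CPU stores the quad word little-endian.)

Count    dd   ?                 ; becomes 1 - must come first
Scratch  dd   ?                 ; becomes 0 - must immediately follow

         fld1                   ; push 1.0
         fistp qword ptr Count  ; Count = 1 and Scratch = 0 in one store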

Return to chapter index


Final thoughts for this chapter

Return to chapter index We've covered a lot of ground in this chapter. By no means is this a comprehensive encyclopedia of all possible tricks on the Intel 80x86 chips though! What it is, is a good start that I hope will get you thinking about possibilities. The exceptionally quirky nature of these chips, and their sometimes odd performance characteristics, offers the clever programmer a multitude of opportunities to shrink the size of code and/or make it run faster. The difference between "OK" and "very good" on these CPU's is usually in the 2X, 3X, or even 5X or 10X range. All of the things people love to hate about these CPU's are sometimes hidden blessings - if only we can learn to see them as such. Even the simplest 2 or 3 instruction code sequence may offer startling opportunities, as the FLDZ/FSTP trick does for zeroing 10 bytes in one whack.

Return to chapter index


Copyright © 1998, Tony Ingenoso tonyi@ibm.net - Shut up and jump!
Last modified on Wednesday, Feb 3, 1999
This page produced the old fashioned way - with a text editor