Chapter 10 - Special numbers = special opportunities!)

In computer programs it seems like some numbers just occur more often than others. For instance, numbers like zero, one, two, ten, sixteen, and minus one are pretty heavy hitters in many programs. Programmers the world over are very fond of using zero and minus one as return codes.
Chapter index
Zeroing registers Instruction flipping Zeroing 32 bit 386 registers
Zeroing memory Zeroing memory on 386's Setting memory to -1 on 386's
Setting registers to -1 on 386's Zeroing segment registers 0 and 1 with C compilers
Loading small values in 32 bit regs Speed compromise for 1 & 2 386SX considerations
C compilers & 1/2    

Zeroing a register (index)

What's the most obvious way to zero one of the 80x86 processor's registers, for example say the DX register? A MOV instruction leaps to mind.

mov   dx, 0      (3 bytes)

This is nice, it works, and indeed does put a zero into the DX register. What's wrong with this MOV instruction though? It's too big in many cases, that's what's wrong with it. It takes 3 bytes.

If the code that follows this MOV to the DX register isn't dependent on the state of the CPU flags, then an easy substitution that only takes two bytes is to XOR the register with itself.

xor dx,dx       (2 bytes)

This is better (it's also faster than the MOV in almost every case).

skull WARNING: Don't blindly go replacing MOV's of zero to registers with an XOR! Take a look at the code that follows and assure yourself that its NOT relying on the state of the CPU flags, because the XOR instruction is going to change them.

check TIP: If you write some code where a MOV is really needed because you can't change the flags, then put a comment right on that line of code that says explicitly "Don't XOR here!". The person who maintains the code later will thank you for it. You may even wind up being that person...

check TIP: Use utilities like GREP to locate all the instances of ",0" or ", 0" in the code you're working with. Once located, you can examine the code around them to see if you can apply the XOR transformation safely.

Is the XOR the best that can be done with our DX register here? Maybe not!

DX is a "special" register on the 80x86 CPU's and has close ties with the AX register. Often, this close relationship can be put to very good use to save space in a program.

Suppose we could be guaranteed that the high order bit in the AH register was a zero bit? This is a very convenient situation as it allows us to use the CWD instruction to zero out the DX register, and CWD is a single byte instruction. CWD does a sign-extend of AX into DX. If the high order bit in AX is zero, then CWD zeros DX. If the high order bit of AX is a one, then CWD puts FFFFh into DX.

check TIP: The CWD instruction does NOT affect the CPU flags.

If you see a section of code that looks something like this:

mov   ah,12
<other irrelevent things may occur here>
xor   dx,dx    (or a "mov dx,0")

then you can probably replace the XOR or MOV with a CWD in most cases, and turn a 2 or 3 byte instruction into a 1 byte instruction. The thing to watch out for here if you see the XOR, is if the code depended on settings of the flags after the XOR. In a case like that, you'll need to bite the bullet and pay the price of the 2 bytes for the XOR.


The Flipping instructions Trick (index)

Often you'll encounter this sequence of instructions in code:

xor     dx,dx   (2 bytes)
xor     ax,ax   (2 bytes)

They zero both AX and DX, and do indeed work fine. The problem is that they're too big and take 4 bytes. Let's flip the order of these instructions to be:

xor     ax,ax   (2 bytes)
xor     dx,dx   (2 bytes)

Hey, look what just happened here! Now we're guaranteed that the high order bit of the AX register is zero, and we can apply the CWD trick and turn that sequence into this one and save a byte!

xor     ax,ax   (2 bytes)
cwd             (1 byte )

Flipping the order of those instructions allowed us to do a transformation that cut their size by 25%. Note also in this case that the flags are exactly the same after the transformation as they were before because CWD doesn't hit the flags. Suppose we add a condition to the sequence above and stipulate that we know that the AL register happened to be zero before the sequence. We have a situation like this one:

<something happens here that leaves AL zero>
xor     ax,ax   (2 bytes)
cwd             (1 byte )

In this case we can probably transform that XOR into a CBW instruction to save another byte:

<something happens here that leaves AL zero>
cbw             (1 byte)
cwd             (1 byte)

Here, the CBW instruction sign extends AL into AX (making AX zero), and the CWD instruction sign extends AX into DX, making DX zero too. So, if we luck into the not uncommon situation where we know AL is zero, we can transform that original 4 byte sequence that zeroed AX and DX into a 2 byte sequence that zeros AX and DX and cut the size of the original code by 50%.


Zeroing 32 bit 386 registers (index)

In 16 bit code, the worst space price you're ever going to pay to zero one of the 16 bit general registers is the 3 bytes for a MOV immediate instruction. You're also probably going to only do this when you can't afford to hit the flags register with an XOR instruction.

In 32 bit code, if you're faced with this situation of not being able to use an XOR because it would zap the flags, then the penalty becomes a lot more extreme for an immediate MOV -- In 32 bit "flat" code, immediate MOV's to a register cost 5 bytes each. If your code is using 32 bit registers and is running in a USE16 segment, then they'll cost 6 bytes because of the 66h size override byte that's required.

Is there any other way to get around this 5 byte space penalty? Indeed there is. You can trade off some speed (a lot actually) to get it down to 3 bytes. The 386 and later CPU's have a marvelous variation of the PUSH instruction that can push a sign extended byte constant value on the stack as a 32 bit value. So let's say we need to zero out the EBX register but can't hit the flags, a sequence like this one will fit the bill nicely:

push  dword ptr 0  (2 bytes in "flat" code, 3 in USE16 segment)
pop   ebx          (1 byte in "flat" code,  2 in USE16 segment)
                    3 bytes total "flat"    5 bytes total in USE16

The USE16 variant here is only saving one byte because it's paying the price for two 66h override bytes. Still, there are situations where sometimes every byte counts.


Zeroing memory (index)

check The AX(or EAX on a 386) Register Is Special For Direct Memory Moves!

Moving the AX/EAX register in a MOV that doesn't use an index or base register is 1 byte smaller (and sometimes faster) than MOV'ing one of the other CPU registers to memory.

Suppose we see a sequence of instructions that look like this:

mov     word ptr Var1, 0   (6 bytes)
mov     word ptr Var2, 0   (6 bytes)
mov     word ptr Var3, 0   (6 bytes)
                           (18 bytes total)

That sequence is zeroing out three memory variables and doing it in a porky kind of way. If we have a register to spare (say CX), one straight forward transformation would be:

xor     cx, cx            (2 bytes)
mov     word ptr Var1,cx  (4 bytes)
mov     word ptr Var2,cx  (4 bytes)
mov     word ptr Var3,cx  (4 bytes)
                          (14 bytes total)

That change was good for a 4 byte savings over the original sequence. We can do better though. If AX is available, then we could do this:

xor     ax, ax            (2 bytes)
mov     word ptr Var1,ax  (3 bytes)
mov     word ptr Var2,ax  (3 bytes)
mov     word ptr Var3,ax  (3 bytes)
                          (11 bytes total)

That knocked another 3 bytes off from what we got using CX for the zeros.

Suppose AX wasn't available though because it held some needed value, but SI was available for use and was expendable? Let's FORCE AX to become available using the magic of the XCHG instruction.

In the case of XCHG where AX is one of the participants, register exchanges are a one byte instruction! Knowing this, we can transform our sequence into this:

xchg    ax, si            (1 byte)    Save critical AX into SI
xor     ax, ax            (2 bytes)
mov     word ptr Var1,ax  (3 bytes)
mov     word ptr Var2,ax  (3 bytes)
mov     word ptr Var3,ax  (3 bytes)
xchg    ax, si            (1 byte)    Restore critical AX value
                          (13 bytes total)

This sequence, even with 2 XCHG instructions, beat the version that used CX by a byte!

Suppose we're real hard up for registers, and we can't spare any of them? No problem -- PUSH/POP are ALSO single byte instructions. We can PUSH/POP AX and get this sequence:

push    ax                (1 byte)    Save critical AX on stack
xor     ax, ax            (2 bytes)
mov     word ptr Var1,ax  (3 bytes)
mov     word ptr Var2,ax  (3 bytes)
mov     word ptr Var3,ax  (3 bytes)
pop     ax                (1 byte)    Restore critical AX
                          (13 bytes total)

This sequence is still only 13 bytes, and doesn't hit any of the registers except for the flags that got hit during the XOR. We're doing a lot better than the original 18 byte version that was moving constant zeros to memory.

Now, suppose that we have the added constraint that we CAN'T hit the flags in this particular section of code that's zeroing Var1, Var2, and Var3, AND all the registers are busy as well?

In this case, we're really stuck and have to resort to a genuine MOV of a zero to AX like this:

push    ax                (1 byte)    Save critical AX on stack
mov     ax, 0             (3 bytes)
mov     word ptr Var1,ax  (3 bytes)
mov     word ptr Var2,ax  (3 bytes)
mov     word ptr Var3,ax  (3 bytes)
pop     ax                (1 byte)    Restore critical AX
                          (14 bytes total)

The MOV, PUSH, and POP instructions don't hit the flags, so we've kept the flags intact, and we're still doing 4 bytes better than that original bloated sequence that took 18 bytes -- not too bad really.

There's one more opportunity here that might present itself that could save one byte in this last situation. Suppose we KNEW that the critical value in AX just happened to be one where the AL register was zero? Rather than the MOV AX,0 we could apply the CBW trick to zero out AX without altering the flags, and that would knock it down to 13 bytes again.

check TIP: Pay Attention To Values In AX and AL. Use Them To Your Advantage!

Special opportunities for zeroing memory on the 386 (index)

The 386 and higher processors have some interesting possibilities when it comes to zeroing out double word memory variables.

It's common to see puffy 386 specific code that looks like this:

mov    dword ptr Var1,0

frown It works, but it's a massive instruction. In real(or V86 mode), this is a 9 byte whopper of an instruction. In "flat" 386 code, this is a 10 byte monster instruction (egads, this is a really grim state of affairs).

There is relief available though. Suppose we can waste the EAX register? Then we can do this:

xor eax,eax
mov dword ptr Var1,eax
2 bytes "flat"
5 bytes "flat"
3 bytes in real mode
4 bytes in real mode

That change helped a lot - we knocked 3 bytes off that original MOV instruction! Suppose we can't waste EAX to do the MOV though? All is not lost, there's still hope in the form of the AND instruction. One of the nice things about the Intel architecture is that the AND instruction has a special form where a sign extended byte immediate constant can be used. The only downside to this ploy is that it alters the CPU flags.

By realizing that AND'ing something with zero gives a result that's also zero we can do this and not need a free register to zero out with XOR:

and   dword ptr Var1,0  (7 bytes "flat", 6 bytes in real mode)

The AND variation also alters the CPU flags like the XOR/MOV form did. If preservation of the flags is crucial, consider wrapping a PUSHF/POPF (or PUSHFD/POPFD in "flat" code) around one of the alternate methods above. These are one byte instructions, so you'll still wind up ahead by a byte (it's very slow, but it does save that byte). By the way, all of the popular assemblers that support the 386 are smart enough to recognize that they can generate the shorter sign extended byte form of the AND instruction. Even the AND and XOR/MOV methods for zeroing that double word memory variable are still pretty big at 6 or 7 bytes. There's not much that can be done about this though for a single instance of zeroing that variable - you'll have to pay the price. Later on I'll be discussing a method for cutting these down even further when there are multiple instances of an instruction like the one above.


Special opportunities for setting memory to -1 on the 386 (index)

Minus 1 is also a special number just like zero is. You'll often see code that looks like this:

mov   dword ptr Var1,-1

This is also one of those 9 or 10 byte monster instructions just like the MOV of zero was that we just examined. It too yields to similar space reductions methods like zero did.

Just as the Intel CPU has a sign extended byte constant form of the AND instruction, it has a sign extended byte constant form of the OR instruction too.

Our porky MOV of -1 instruction above can be transformed by doing something like this:

or    dword ptr Var1,-1

Since OR'ing all the bits in something with 1's sets them all to one, we get the same effect as the MOV, only the OR is a lot smaller just like the AND was when setting the memory variable to zero.


Special opportunity for setting registers to -1 on the 386 (index)

Setting memory variables to -1 on the 386 yielded to the OR trick, so does setting a 32 bit register. If you've been working with 386 assembler code for a while, you've undoubtedly encountered something like this next instruction. It's a common occurance when someone is setting up to do a REP SCAS string operation to search for something.

mov   ecx,-1   (5 bytes in "flat" code)

Well, that can usually be tightened up into a 3 byte instruction by using the OR trick so that it looks like this:

or       ecx,-1 (3 bytes in "flat" code)

The assembler will be smart enough to generate the short sign extended byte form of the OR instruction. If the subsequent code isn't sensitive to the flags, then this transformation is a solid win. It also works great with any of the 32 bit registers. ECX isn't special in this regard.

Unlike many of these space saving tricks in this book, this one is a freebie in terms of speed. In fact, in most cases it is going to be faster than the MOV. On 386DX and 486 CPUs, these two instructions both clock out the same according to the Intel documentation, but the advantage of the form using the OR is that it's shorter by two bytes which makes the 386's prefetcher, and the 486's pipeline, both very happy (and 386SX's will be simply euphoric over this change given their 16 bit interface to memory).


Experience Using 0 and -1 with some 32 bit C Compilers (index)

We just got done discussing 0 and -1 on 386's from an assembler perspective, so let's see how this knowlege applies to a couple of examples of a 32 bit C++ compilers, and how you can maybe tighten up your 32 bit C code armed with that knowlege of how the 386 AND and OR instructions can work. Here's a little example C++ program that I compiled using Borland's 32 bit OS/2 C++ compiler (version 1.5), and the Watcom version 10.0 compiler running under Windows NT.

Example for testing and/or behavior
static int foo;         // A 32 bit integer

void bar(void)          // Calls to this function prevent the compiler
        {               // from collapsing the assignment statement
        foo++;          // experiments below into a single assignment.
        }

void main(void)
        {
        foo = 0;        // Set "foo" to zero
        bar();
        foo = -1;       // Set "foo" to -1
        bar();
        foo &= 0;   // Set "foo" to zero
        bar();
        foo |= -1;      // Set "foo to -1
        bar();
        }

This program does four different experiments on assignment statements. Here's the relevent parts of the code the Borland 1.5 OS/2 compiler produced for the above program. I compiled with the option to do "space" optimizations. Notice how the compiler generated those massive 10 byte immediate MOV instructions for the first two straight forward assignments of zero and -1 to the "foo" variable. Notice also how it generated the much smaller 7 byte AND and OR instructions for the second two experiments.

Borland results
;               {
;               foo = 0;
;
        mov     dword ptr foo,0
;
;               bar();
;
        call    near ptr @bar$qv
;
;               foo =  -1;
;
        mov     dword ptr foo,-1
;
;               bar();
;
        call    near ptr @bar$qv
;
;               foo &=0;
;
        and       dword ptr foo,0
;
;               bar();
;
        call    near ptr @bar$qv
;
;               foo |=-1;
;
        or        dword ptr foo,-1
;
;               bar();
;
        call    near ptr @bar$qv

Now let's take a look at what the Watcom compiler generated when a ran the same little program through it on Windows NT. I compiled with the "-os" (optimize favoring space) and "-3r" (386 register calling conventions) switchs for this test. This code list is an excerpt from the output of Watcom's OBJ module disassembler utility.

Watcom results
0011  68 04 00 00 00    main_         push    00000004H
0016  e8 00 00 00 00                  call    __CHK
001b  c7 05 00 00 00 00 00 00 00 00   mov     dword ptr int near foo,00000000H
0025  e8 00 00 00 00                  call    void near bar()
002a  c7 05 00 00 00 00 ff ff ff ff   mov     dword ptr int near foo,0ffffffffH
0034  e8 00 00 00 00                  call    void near bar()
0039  c7 05 00 00 00 00 00 00 00 00   mov     dword ptr int near foo,00000000H
0043  e8 00 00 00 00                  call    void near bar()
0048  c7 05 00 00 00 00 ff ff ff ff   mov     dword ptr int near foo,0ffffffffH
 0052  eb ac                           jmp     void near bar()

Notice how the Watcom generated code differs from the Borland compiler's. We couldn't trick the Watcom compiler (even though I used the -os space optimization switch) into generating the smaller AND and OR instructions like we could with the Borland compiler.

The Watcom compiler's code generator looks to favor speed in this situation even though we're looking for hard core space optimizations. It looks like the Watcom compiler is going to defeat any sort of trickery whereas the Borland compilers code generator is more literal in the way it's compiling code.

Compile that little test program with your 32 bit compiler using it's space optimizations and see what kind of code comes out. If the generated code winds up using the AND and OR instructions for the first two assignments of zero and minus one, then your compiler's space optimizations are pretty darn good. If the code looks more like that from the Watcom compiler, then doing special things for zero and minus one is probably not going to be too easy.

If your compiler's code looks like the Borland compiler's, then you might consider using a scheme like the one described below in your code when you make assignments of zero and minus one to memory variables.

Suppose we create a header file called OPTIMIZE.H that you would include in all of your modules and it defines two macros called ZERO and MINUS1 that you use to set memory variables to zero and minus one. OPTIMIZE.H might look something like this:

SAVESPACE macros
#ifdef SAVESPACE
  #define ZERO(var)    (var &= 0)      // Space optimized zero assignment
  #define MINUS1(var)  (var |= -1)     // Space optimized -1 assignment
#else
  // For sign/magnitude CPU's use this set of #define's
  #define ZERO(var)    (var = 0)       // Compiler default zero assignment
  #define MINUS1(var)  (var = -1)      // Compiler default -1 assignment
#endif

skull WARNING: This scheme won't work if the code is ever ported to a CPU that uses sign/magnitude numbers rather than 2's complement like the Intel processors do. On CPU's with sign/magnitude numbers, you'd want to use the second set of #define's because a -1 on those CPU's won't be all 1 bits!

In your MAKEFILE or project definitions, those modules that need hard core space optimizations would cause the "SAVESPACE" symbol to be defined. Different compilers will do this symbol definition different ways, but almost all will have some way of letting you specify a symbol to be defined when compiling a particular module.

Here's an example of how OPTIMIZE.H might be used in 32 bit C/C++ module:

SAVESPACE example
//-------------------------------------------------------
// An example C/C++ module that uses the scheme outlined
// in OPTIMIZE.H for setting memory variables to zero
// and minus 1.
//-------------------------------------------------------
// If your compiler doesn't allow command line symbol
// definitions, then you can just #define the SAVESPACE
// symbol at the top of each module that needs it.
//-------------------------------------------------------
//#define SAVESPACE // Uncomment this line to force SAVESPACE
#include <stdio.h>
#include <stdlib.h>
#include "OPTIMIZE.H"
#include "whatever.h"

extern  int  Var1, Var2;  // A couple of 32 bit integers

void foo(void)
        {
        ZERO(Var1);    // Equivalent to:    Var1 = 0;
        MINUS1(Var2);  // Equivalent to:    Var2 = -1;
        }

OK, there were some nice space savings for some C compilers with 32 bit code where integers are 32 bit quantities. Can this same trick help on 16 bit C compilers? Yes, it is possible. Even when an integer is 16 bits like on the 8086, if your compiler can be tricked into using the AND and OR instructions, the scheme in OPTIMIZE.H can save one byte per assignment over an immediate move to memory.

The same code we used for the 32 bit example in OPTIMIZE.H should work fine with a 16 bit C/C++ compiler for an Intel processor.


Zeroing segment registers (index)

Segment registers are special in that you can't MOV a constant value into them. You can only move a memory variable or one of the 8 general purpose registers to them.

This poses a problem, for instance, when you want to aim ES or DS at the interrupt vector table or BIOS data area at 0040h:0000h. To get the zero (or 0040h in the case of the BDA) into the segment register you'll often see code that looks like one of these sequences:

   xor     ax,ax      (2 bytes)
   mov     ax,0040h   (3 bytes)
   mov     ds,ax      (2 bytes)
   mov     es,ax      (2 bytes)

or even worse, a sequence like this one:

  BDAseg    db    0040h  (2 byte constant in code segment somewhere)
            <blah, blah, blah...>
            mov     es,cs:BDAseg   (5 bytes)

If we can stipulate that this code is to run on an 80186 or better CPU (which isn't a horrible limitation in a lot of cases these days because the 8088/8086 machines are dropping off the radar scope real fast), then we can do better.

One of the handier new instructions that showed up in the 80186 processor (the NEC V20 chips also have this capability) is it's ability to PUSH an immediate constant value. The 80286, 80386, and all subsequent processors also have this push immediate capability.

This makes zeroing a segment register a 3 byte proposition:

   push    0     (2 bytes)
   pop     ds    (1 byte)

This sequence that uses the immediate PUSH also has the added advantage that it doesn't use any other registers like the XOR/MOV variant did up above.

For values less than 128, the PUSH immediate will typically be encoded by most assemblers as a 2 byte instruction. For values 128 or over, it will be encoded as a 3 byte instruction.

So, for zeros, and things like the BIOS data area (0040h), you'll wind up with a 2 byte PUSH immediate.


Loading small numbers into 32 bit registers (index)

So far we've seen how to load zeros, and minus one into 32 bit registers in a pretty compact way using your basic XOR's (for zeros), and the OR trick (for -1). What about other numbers though? Are we doomed to get whacked with big 5 byte immediate MOV's for everything else? Nope. There's relief in the form of the PUSH immediate instruction.

Previously we had some discussion of the PUSH trick regarding zeroing a 32 bit register in situations where you couldn't hit the flags with an XOR. The push immediate instruction is a powerful ally for things other than zero too.

Let's say we're in some flat code and have an instruction like this one:

mov   edi,10      (5 bytes in "flat" code)

This is one of those fat 5 byte immediate MOV's. The PUSH immediate trick works well here too, so we could transform this into:

push    dword ptr 10  (2 bytes in "flat code)
pop     edi           (1 byte in "flat" code)

Not bad, we saved two bytes on what was a pretty big 5 byte instruction.
check This same trick works with any small numbers that can be represented in a sign extended byte constant (i.e. 0 through 127, and -1 through -128).

Is the one byte sign extended constant the end of the line for what we can accomplish with small numbers though? Nope. We can weasel out 128 and -129 before we "hit the wall" and come up equal to the 5 byte immediate MOV's size. This next trick involves the INC/DEC instructions which are nice in that they're one byte instructions. Suppose we want to load 128 into the EBP register. Here's a 4 byte instruction sequence that accomplishes the task:

push    dword ptr 127   (2 bytes in "flat" code)
pop     ebp             (1 byte in "flat" code)
inc     ebp             (1 byte in "flat" code)

It took 3 instructions, but we did get the value 128 into EBP, and still managed to beat out a 5 byte immediate MOV by a byte. The only downside here is that the INC instruction will alter the flags. If you've got a situation where you can't alter the flags, then you're stuck and have to pay the price for one of those 5 byte immediate MOV's.


A speed compromise for loading 1 or 2 into a 32 bit register (index)

The 32 bit PUSH immediate then POP trick is a good one, but it has one downside in that its very slow on a 386 or 486. Pentiums crank that combination out pretty fast, but the 386 and 486 suffer with it. If you're willing to compromise some space to get a bit of speed back for the values 1 and 2, and your intended target is a 386 or 486, then there's this trick:

xor     ebx,ebx   (2 bytes in "flat" code) ; Zeros EBX
inc     ebx       (1 byte in "flat" code)  ; EBX=1
inc     ebx       (1 byte in "flat" code)  ; EBX=2

This combination is medium speed but still saves a byte over a 5 byte immediate MOV to one of the 32 bit registers. Looking a little closer here, we see that loading the value 1 will only cost 3 bytes, and it's going to run pretty fast as well (well, faster than the 3 byte PUSH/POP trick did at least).

If you're loading a 1, and can afford to zap the flags, the XOR/INC trick is clearly the way to go as it's the same size as the PUSH/POP trick for "flat" code.


The 386SX/SL type processors are special (index)

If you know that your target is a 386SX style CPU with a 16 bit bus (like maybe you're ROM'ing some code for a toaster controller where the target CPU is known up front), then you'll actually want to do some of these "slower" space optimizations even in code that is supposed to be "fast".

Having a 16 bit bus and no onboard cache gives the 386SX some curious performance characteristics compared to the 386DX, 486, and Pentium CPU's. In many ways the 386SX behaves a lot like an 8088 in that the 386SX is very memory bandwidth sensitive. The 16 bit path to memory causes the 386SX execution unit to outrun its prefetcher almost all the time. This means that the 386SX is a lot more sensitive to code size than the other CPU's. Porky code causes the 386SX's execution unit to be idle far more often than a DX style CPU's execution unit would be on the same piece of code.

Let's take for example the case of moving the value 1 into the EAX register. If you're going for speed on a DX type CPU, then the fastest way to accomplish this is to do one of those 5 byte immediate MOV's like this:

mov     eax,1   (5 bytes in flat code)

On the poor old 386SX though, this one instruction takes three 16 bit memory cycles just to fetch! Since most 386SX systems aren't designed with any sort of cache memory, and probably have slower/cheaper main memory as well, we're probably talking about adding a wait state here too.

On the 386SX it is almost always faster to apply the XOR/INC trick for loading the value 1 into one of the 32 bit registers in flat code. So on a 386SX, you'll be looking to do something like this most of the time:

xor     eax,eax   (2 bytes in flat code)
inc     eax       (1 byte in flat code)

check This XOR/INC sequence runs about 30% faster on a 386SX than the MOV immediate does.


32 bit C/C++ compilers and values like 1 and 2 (index)

OK, we've seen how the numbers 1 and 2 (especially 1) can be loaded into 32 bit registers in ways that are smaller than a straight forward 5 byte MOV immediate instruction. How do some 32 bit C/C++ compilers react when we ask them to do a space optimization where small numbers are involved? Here's a little test function I compiled with the Borland V1.5 OS/2 C++ compiler and the Watcom V10.0 32 bit C++ compiler.

extern int frob(int,int,int);

int foo(int bar)
   {
   register int x;
   register int y;

   x = 1;
   y = 2;
   return frob(x,y,bar);
   }

Let's take a look at the code that the Borland compiler generated and see if it managed to notice the XOR/INC trick (or any of the others for that matter). Here's the relevent part of the ASM file the compiler generated. I compiled this function with the -G- and -Os space optimization switches.

Borland results
        push    ebp
        mov     ebp,esp
   ;
   ;       {
   ;       register int x;
   ;       register int y;
   ;
   ;       x = 1;
   ;
        mov     eax,1
   ;
   ;       y = 2;
   ;
        mov     edx,2
   ;
   ;       return frob(x,y,bar);
   ;
        push    dword ptr [ebp+8]
        push    edx
        push    eax
        call    near ptr _frob
@8:
@8@0:
        pop     ebp
        ret     4

No joy here. Even with all the space optimization switches thrown, the Borland compiler insisted on generating those big 5 byte immediate MOV instructions. The Watcom compiler didn't fare much better in terms of space, here's what it produced when I ran the resultant OBJ file through the Watcom object module disassembler.

Watcom results
 Module: E:\watcom\binb\test.c
 Group: 'DGROUP' CONST,CONST2,_DATA,_BSS

 Segment: _TEXT  BYTE USE32  00000020 bytes
  0000  68 0c 00 00 00    foo_    push    0000000cH
  0005  e8 00 00 00 00            call    __CHK
  000a  53                        push    ebx
  000b  52                        push    edx
  000c  89 c3                     mov     ebx,eax
  000e  ba 02 00 00 00            mov     edx,00000002H
  0013  b8 01 00 00 00            mov     eax,00000001H
  0018  e8 00 00 00 00            call    frob_
  001d  5a                        pop     edx
  001e  5b                        pop     ebx
  001f  c3                        ret

Again, no joy. Watcom defaulted to a register pass calling convention, whereas Borland did the stack thing instead, but they're both generating those big 5 byte MOV immediate instructions. Even worse, take a look at that first instruction the Watcom compiler generated to pass a parameter to its stack probe routine __CHK. The Watcom compiler could have generated one of the sign extended 32 bit push immediate instructions that are only 2 bytes, because the value being pushed (000Ch) is less than 128. It chose to generate the fatter 5 byte form instead. In this case, the 5 byte form of the PUSH immediate instruction does nothing but slow down and puff up the code by sloshing 3 wasted bytes worth of zeros through the prefetcher or pipeline.

Think back to the discussion we just had about the 386SX processor and it's 16 bit bus and consider the impact that the Watcom compiler's choice of the 5 byte PUSH immediate has on the poor 386SX. That 5 PUSH takes the 386SX 3 memory cycles to fetch, whereas the short sign extended form of the PUSH immediate would have been fetched in one memory cycle on the 386SX.

Watcom's generated code took up 32 bytes for our test function. How much better might it be if the compiler was smarter about space optimizations? Let's take the compiler's output and apply some of the things we've learned so far about shrinking code.

First off, we'd want to replace that 5 byte PUSH immediate with the two byte form of the instruction because the pushed value fits within a sign extended byte constant. That transformation is going to save us 3 bytes and cost nothing on speed (and it's actually run faster on 386SX's).

The next candidate for a space saving transformation would be the loading of EBX with the parameter that got passed in the EAX register. The Watcom compiler chose to do it with a MOV EBX,EAX instruction which is the fast way, but not the small way.

Since EAX is going to be reloaded with the value 1 just a couple of instructions later, EAX is actually expendable here. That MOV could be replaced with an XCHG instruction. When you're doing an XCHG with EAX and another 32 bit register, and the code lives in a USE32 segment (like flat code does), then the XCHG instruction only takes one byte. OK, the XOR transformation saved us another byte. Now we're beating the compiler by 4 bytes and the size of our space optimized code is down to 28 bytes.

You might think the next candidate for a space optimization is the MOV of the value 2 to the EDX register. A straight forward transformation for that would be this sequence:

xor     edx,edx    (2 bytes)
inc     edx        (1 byte)
inc     edx        (1 byte)

This isn't bad, it does save one byte over what the MOV immediate of the 2 cost. However, remember the section earlier in the book where I talked about Instruction Flipping? Maybe we can apply an instruction flip trick here and save even more space!

Suppose we mentally reordered the loading of EDX and EAX so they're reversed like this:

mov     eax,00000001H
mov     edx,00000002H

Hmmm, a good thing just happened here! Now we know that EAX is 1 before we try to load 2 into EDX.

1? 2? hey... 2 is only one more than 1. Suppose we replaced the MOV of the constant 2 to EDX with this:

mov     edx,eax   (2 bytes)
inc     edx       (1 byte)

This is pretty good, we just knocked two bytes off what it takes to load EDX up with that value 2, and the size of our code is down to 26 bytes versus the 32 bytes the compiler generated.

Now we apply the XOR/INC trick to the loading of the value 1 in EAX to get this:

xor     eax,eax    (2 bytes)
inc     eax        (1 byte)

This knocks off another 2 bytes and we're down to 24 bytes versus the compiler's 32. This isn't too bad at all. With some simple and fairly obvious instruction transformations, we cut the code size by 25%. Here's what it looks like when we put it all together and compare them side by side:


           Watcom generated code                           Our space tuned equivalent

  0000  68 0c 00 00 00    foo_  push    0000000cH           push dword ptr 0Ch      (2 bytes)
  0005  e8 00 00 00 00          call    __CHK               <same>
  000a  53                      push    ebx                 <same>
  000b  52                      push    edx                 <same>
  000c  89 c3                   mov     ebx,eax             xchg   ebx,eax          (1 byte)
  000e  ba 02 00 00 00          mov     edx,00000002H       xor    eax,eax          (2 bytes)
  0013  b8 01 00 00 00          mov     eax,00000001H       inc    eax              (1 byte)
                                                            mov    edx,eax          (2 bytes)
                                                            inc    edx              (1 byte)
  0018  e8 00 00 00 00          call    frob_               <same>
  001d  5a                      pop     edx                 <same>
  001e  5b                      pop     ebx                 <same>
  001f  c3                      ret                         <same>         
                                32 bytes total              24 bytes total

To be fair to the Watcom folks, their compiler will generate the shorter form of the PUSH immediate instruction in all cases other than stack probes. The feeling must be that stack probes will be turned off for "production" code. This is probably true generally, but there are certainly cases where probes might be desired in certain production situations. Suppose we were writing a program that was going to control subway train scheduling, or manage a nuclear reactor? In a case like that, a programmer could provide their own __CHK routine to take early recovery action to avoid a crash due to a low stack situation.


Copyright © 1998, Tony Ingenoso e-mail graphic tonyi@ibm.net - Shut up and jump!
Last modified on Frida, Jan 8, 1999
This page produced the old fashoned way - with a text editor