80x86 Optimization Introduction

MAKING CODE WORK BETTER

(How to minimize the size of 80x86 code and sometimes make it faster)
By Tony Ingenoso e-mail graphic

Why?	Virtual Memory	Bloated=Slow	Assumptions	Real Life
Optimizations	Making Changes	Testing	Handy Stuff	What's Next

Why Bother?

Why bother trying to make programs (or sections of programs) written for the PC small? After all, machines today come with multiple megabytes of memory, and modern operating systems are all equipped to provide virtual memory.

The answer is performance and cost. Even today's machines with many megabytes of virtual memory run into memory related performance problems due to code size. Piggy code doesn't "play well with others". Bloated apps push other programs out of memory and slow the whole system down. They load slower than lean applications and consume more of the customer's disk space.

If a program is to be distributed on diskettes and more than a few thousand copies sold, the vendor will see significant distribution media costs as well. Diskettes aren't free. If millions of copies of a program are expected to be sold, and squeezing the code down a little allows a diskette to be dropped from the package, the vendor can wind up saving millions of dollars in manufacturing costs.

How much is it worth to save a couple of million dollars? For a vendor distributing millions of copies of something (or even a couple of hundred thousand), the effort needed to squeeze things down a little to save a diskette can payoff. It may even make economic sense to dedicate a small team who's sole job is to shrink the product!

Another fairly obvious advantage to keeping things as small as possible is that IT EXPANDS YOUR MARKET. It's a reality in today's software market that the installed base has a lot of clout. The bulk of the installed base is also always a few generations behind whatever the bleeding edge "hot box" of the day happens to be. As this book is being written, Pentium-2 400mhz systems with 64 megabytes of memory are the typical "hot" systems being touted in the trade press. This means there's still a LOT of original 60mhz Pentium class machines with 8 or 16 megabytes of memory chugging away doing their thing. Also, if you expect to sell your code into the government sector, then expect to see even older machines -- 286's, 386's, and 486's with a few megabytes of memory will be common. In these days of tight budgets, government agencies are trying to run lean and make due with what they've got on hand.

Virtual Memory

Virtual memory is nice, and it relieves many of the problems programmers lived with during the "plain DOS" era of PC computing. Overlays, EMS, XMS, etc all caused a lot of people a lot of headaches - and still do today. In fact, we'll be examining a few tricks later on that can help when dealing with the "plain DOS" world where that low 640K of memory is more precious than gold.

However, virtual memory isn't a panacea. It costs dearly when the memory needs for the set of programs running on a machine exceed available physical memory. When the space needed to run the Operating System, device drivers, and apps, exceeds the physical memory available on a system, it forces the OS's memory management code to start swapping or paging chunks of memory to a swap file on physical media - such as a hard disk, or over a LAN if the machine is a diskless workstation.

Local physical memory is fast. Its right in the machine and available in just nano-seconds. When a system is forced to swap in a segment or page of virtual memory that was swapped, we're talking about a trip through the OS's file system as it goes to the external media to retrieve the needed memory. Other bigger delays are associated with the speed of the external media the OS is swapping or paging to.

In many cases, virtual memory systems will perform worse than plain old DOS would running a similar program that used overlays, EMS, and XMS memory techniques. In the "bad old days" a program that used all the classic DOS memory techniques would, by its very nature, have a fairly good idea of how and where data and code were placed and managed. This is and was a pretty efficient scheme most of the time, but it placed all of the burden of managing the program's code and data on the application. With virtual memory, you've generally got less control over where an operating system keeps things in memory. The swapper or pager in a virtual memory environment is like a force of nature with a mind of its own. Parts of your program will be drifting in and out of memory behind your back as the dynamic needs of the system change over time.

Even a virtual memory system that's doing a lot of book keeping and trying to make good "guesses" about how your program is going to behave will only have a very rough idea of what to expect and how to respond in a way that gives the best performance.

All of the fundamental concepts programmers learned under DOS back in the bad old days, apply conceptually to virtual memory systems today. A swapper or pager may be a force of its own rampaging around a machine, but the programs running in that system can do a lot to make the swapper or pager a relatively benign force rather than a raging malevolent beast that brings powerful machines to their knees in fits of page thrashing.

Bloated = Slow

"fast" code that doesn't fit within the a given machine's physical memory isn't "fast" at all -- it's slow. Want to see your PC come to a screeching halt and see your disk drive light up like a Christmas tree? Run a big program on a virtual memory system where there's not enough physical memory to hold everything it needs. It makes a nice disk drive stress test. We'll be examining a sample program in later chapters called MSPIGGY that will drive a virtual memory system to its knees by accessing virtual memory in a particularly evil way.

Running OS/2 or Windows 95 on a 4 megabyte machine will also give you a quick appreciation for what the costs of virtual memory are, and why its not free too. Your programs are impacted by virtual memory the same way OS/2 and Windows 95 are.

Assumptions

This is not an introductory programming book on C/C++ or assembler. There's many fine examples of introductory programming books lining the shelves at your local book store. I'm not going to waste a lot of time traveling down that well beaten path.

In writing this book, I'm assuming you the reader, have some experience as a programmer using C or C++. You don't need to be a total C or C++ guru, but I'm not going to be giving tutorial lessons on structures and basic data types either.

I also assume you have a basic understanding of the 80x86 instruction set and have at least a casual ability to read some assembler code. I'll not be explaining in detail what a segment register or an ADD instruction is.

I assume you've got access to some compilers and assemblers so you can try out some of the various things that will be discussed within. I've tried to work in examples using C/C++ compilers from several major vendors (primarily Borland, Microsoft, and Watcom). This isn't a book about any one particular compiler though. The concepts usually apply to all of them even though a particular vendor's way of enabling some feature or optimization may be different than another vendor. Whenever there's a feature that's specific to vendor's compiler, or version of a compiler from a vendor, I'll mention that in the text.

I assume you have access to a code profiler and can use it to figure out the "critical path" through your code. This is the stuff that NEEDS to be fast no matter what. Many of the space saving tricks and techniques herein won't be applied to critical path code because they will slow it down too much. The good news though, is that the bulk of the code in most programs isn't actually in the critical path. The results from a profiler should guide almost everything you do in tuning code for minimum space, and maximum speed.

Real life

Many of the code reduction or performance techniques in this book are going to tend to obscure your code somewhat in the process of shrinking it or making it faster. I can't recommend using many of them on your first cut at newly developed code while its still being debugged. Get your stuff really debugged before you launch into any serious tweaking efforts.

I can't stress this enough. Get your basic code working before launching on any major tuning efforts. Doing performance work while code is unstable, or a design is still being hammered out, is usually a complete waste of time. This is not to say you shouldn't be thinking about performance goals for a program up front as part of the design work - you should do this if the program is one where performance will be an issue. Choosing good data structures and algorithms during the design phase always pays off better than the most expertly executed tuning of poor code late in the game. Strategic algorighm and data structure choices are likely going to account for orders of magnitude improvements in a program's behavior compared to the 2X-5X that would be more common for "tactical" tuning efforts.later in the game.

Optimization realism

It's nice when a project was well planned and executed from the start -- however this isn't always going to be the case. You may not have been the one to do the original design and coding work.

Often we'll be handed something that's in trouble and tasked with bailing out a behind schedule and over budget disaster -- this happens in real life all the time. In cases like this there just may not be enough time to do it right and redesign the project.

For whatever reason, management usually seems lothe to give up on what exists, and insists on pouring good money after bad. In situations like this we may be reduced to doing damage control just to get the project to a point where it works well enough to consider shipping. When this happens, any performance work falls into a category I call "turd polishing" -- the transformation of that which is truly awful into that which is simply bad.

When you're stuck with a barely functioning house of cards that is going to ship in a couple of weeks come hell or high water, priorities need to be set differently than we would under "ideal" circumstances. In a nightmare scenario like this we may be forced into doing a little bit of performance work before the code is truly stable -- because it's inevitable that the product is going to ship loaded with bugs anyway.

If you're stuck with one of these all too common situations, you're going to have to be real Draconian about triaging garden variety bugs versus performance problems. Suppose this hypothetical product from hell were a word processing package that takes 20 seconds to open up a trivial length file on a 200mhz Pentium Pro machine? The vast majority of potential users are going to consider this an unacceptable level of performance -- something NEEDS to be done about that performance problem before the product ships or the reviewers will throw up all over the product, then sales will be zilch and you'll be out of a job when the company goes bankrupt. At the same time, there's a number of garden variety functional bugs that exist in relatively obscure features -- i.e. things that just don't work right. None of these bugs is serious enough to crash the program or corrupt the user's data. This distasteful situation is one where you're going to be forced into violating the basic principle of not optimizing unstable code because the file opening problem is so severe it must be fixed. There's really only two options when faced with a disaster like this -- you can hunker down and polish that turd, or resign in disgust.

If the end is in sight, I'd grit my teeth and polish that turd. A successful turd polishing effort can actually be pretty rewarding from a perverse technical point of view. You'll learn a lot from doing it, and quite possibly be hailed as the hero who "saved the project"!

One technique I personally use a LOT when doing tactical fine tuning work is to IFDEF or #ifdef various tricks and techniques into the code when I make changes. This makes it a LOT easier for someone else to understand the thought process I went through when applying a particular code reduction technique. They can see the old code right next to the new code. If you were to look at IBM's PC DOS source code you'd see hundreds of conditionalized changes with ifdef's that look something like this:

        ifdef __TONYI__1
                <new smaller/faster stuff>
        else
                <old fat/slow stuff>
        endif

I also tend to do optimizations in several phases. I'll start grabbing the "low hanging fruit" and conditionalize that on an initial ifdef like:

        ifdef __TONYI__1

Then on subsequent and less obvious tuning changes I'll start using:

        ifdef __TONYI__2
        ifdef __TONYI__3
        etc...

Incremental Changes

I can't stress enough the value of incremental changes while doing the type of tuning work this book is about. You will inject bugs on occasion and you will get brain dead at times. The more you do, the better you'll get at it and mistakes will lessen over time, but nothing can replace running the test suite after making a group of changes. Using the phased IFDEF approach will help you narrow down the cause of a particular bug very quickly. I find its often quicker than using a debugger to find a problem!

Suppose you made 10 changes in a module conditionalized with:

        ifdef __PERF__6

Then the unit/function tests showed that this particular batch of changes caused the program to do something horrible - like crash the machine. The IFDEF's make it easy for you to find every change with your editor and then examine the code again. Often you'll find that a simple examination of that batch of changes will reveal a flaw in one of them.

If a quick visual examination doesn't make the bug obvious, then go back and convert 5 of the 10 "ifdef __PERF__6" conditional lines to:

        ifdef __PERF__6_omit

Using this kind of "divide and conquer" approach you can narrow down an area with a bug VERY quickly. Doing a quick rebuild, then rerunning the tests for a given area of code will be a lot easier than groping around with a debugger for hours trying to find out why the program is crashing mysteriously somewhere. Once you locate the exact conditional or few conditionals that are causing the problem, the bug almost invariably becomes obvious.

You won't be able to enjoy this technique if you're trying to tune code that hasn't stabilized yet though! You won't be able to tell if the tuning changes introduced a bug, or if it was there all along. Life will be miserable, and you'll be considering suicide. So, resist the urge to start tweaking up code that's under active development if at all possible.

Test Buckets

Your Test Bucket Is Your Friend, Use It And Abuse It For All Its Worth! (and if its worthless, fix the damn thing!)

If you find some change introduced a bug and it remains a mystery even after it been isolated, then just leave that potential optimization disabled. Move on to something else. Don't spend a day trying to find out why a particularly tricky change that should have saved 8 bytes doesn't work! The name of the game here is to produce safe and tangible results quickly. You'll probably get those 8 bytes back somewhere else for less effort than it takes to debug the mystery.

The only time you'll want to sweat out the stragglers, is if you're all done and still came up short on your goal and need "just a few more bytes" to save a diskette or to fit in a ROM (if you're ROM'ing your code).

This testing thing is real important. I'm not writing a book on software testing, or how to write testable software though. One book that touches on the topic is Steve Maguire's "Writing Solid Code" published by Microsoft Press. Its quite good and I recommend it highly. Steve will probably gag when he sees some of the tricks and swindles described in this book though.

Our goals are different however. Steve talks about making code work in a generic sense. I'm talking about applying the finishing touches to code that already works today. Specifically, I'm going to be talking about tuning efforts aimed at programs targeted for Intel 80x86 and compatible processors.

You should take everything that Steve has written to heart when you're creating a program, he's pitching concepts that can save you untold hours of grief and frustration during the creation effort.

If you conditionalize your tuning tricks like I've just outlined, then you won't be violating too badly the spirit of what Steve is saying in most cases. In other cases, some tuning tricks are going to introduce major dependencies in sections of code. This can be OK if you've defensively programmed around those dependencies so they'll show up instantly if the dependency is ever violated.

For instance, the LES and LDS instructions can often be used to load a two word values in real mode or V86 mode on the Intel processors like this:

        Var1 dw ?
        Var2 dw ?

        ; Get Var1 and Var2 into DX:AX

        les ax, dword ptr Var1
        mov dx, es

That sequence is a few bytes smaller than doing two MOV's from memory into AX and DX. However, we've introduced a dependency - Var1 and Var2 now NEED to be together in memory!

If some maintenance programmer happened to inject another variable between Var1 and Var2 we'd be in trouble!

        Var1 dw ?
        Foo  db ?
        Var2 dw ?

        ; Get Var1 and Var2 into DX:AX

        les ax, dword ptr Var1
        mov dx, es

Now the LES is going to pick up the Foo variable and half of Var2, and the result is going to be a horrible bug introduced into the program.

We can protect against this dire possibility though with a compile time check. Then the maintenance programmer will instantly know something is wrong the first time they try a build of the program.

        Var1 dw ?
        Foo  db ?
        Var2 dw ?

        ; Get Var1 and Var2 into DX:AX

        .ERRNZ (Var2 - Var1 - 2)
        les ax, dword ptr Var1
        mov dx, es

This time, the ".ERRNZ" compile time check will blow out because the relationship between Var1 and Var2 was violated by the introduction of the Foo variable.

Here's another example, using a C function that was written for a 16 bit C compiler this time.

        int SetHighBitOfInt(int value)
                {
                return (value | (int)0x8000);
                }

This looks easy and straight forward, people do it all the time. The problem though is that its assuming that an "int" is always a 16 bit value. If this code is ever ported to a 32 bit C compiler where an 'int' is typically 32 bits rather than 16, its not going to do what you want. It'll trash one of the bits in the middle of the integer rather than setting the high order bit. Ouch! This function can be coded to work with either a 16 or 32 bit compiler, and warn of something else is used on it.

        #include <limits.h>

        int SetHighBitOfInt(int value)
                {
                #if INT_MAX == 32767
                   /* 16 bit compiler */
                   #define MASK 0x8000
                #else
                   #if INT_MAX == 2147483647
                   /* 32 bit compiler */
                   #define MASK 0x80000000
                      #else
                         #error "We're in trouble here"
                      #endif
                   #endif
                return (value | (int)MASK);
                #undef MASK
                }

That version of the function knows exactly what environments it works in, and which it won't. When porting it to some platform that doesn't have 16 or 32 bit integers, the compiler is going to instantly barf and let you know the function needs to be worked on. This kind of stuff is cheap insurance folks - just do it!

While I'm talking about being defensive here, let me mention warning levels. Many C/C++ compilers today default to a minimum warning level to keep older K&R style code from generating a blizzard of warning messages. Get into the habit of cranking the warning levels to the max. You really do want to see a lot of those messages because there's often bugs hiding behind them! Don't treat the compiler like it's an annoyance to be avoided and dismissed. Modern C/C++ compilers all implement a very good set of warning messages. These are your friend, not the enemy.

Handy things

Have you ever bought some program and been unhappy with its performance on your machine when you got it home? You probably have unless you're sitting pretty with the latest Godzilla Pentium 400mhz machine with 128 megabytes of RAM, and a blazing fast 8 gigabyte or larger hard disk.

Programmers tend to favor killer machines for developing on and this is natural because nobody likes to wait around for a build to complete. Killer machines can also accelerate basic testing on programs and save a lot of time. There's nothing wrong with this as long as you realize that the whole world isn't out there running the latest hot box being touted in the trade rags.

So, dust off that old 486-25 from few years ago sitting in the closet and put it to good use as a test bed for your tuning efforts. The sweet spot in the market, as this book is being written, seems to be 233mhz and 200mhz Pentium machines with about 32M or 64M of memory. 16M low end Pentium 133 systems are still selling too. If you can tune up your code so that it runs acceptably well on an older 486-25 with 8M or 16M of memory, then it's going to be a screamer by default on the current crop of machines being sold.

Don't fall into the trap of just testing a program on a Godzilla machine without trying it out on some of the older generation hardware. Sure, it may seem to run OK on a 233mhz Pentium, but it may also be a real pig on a 486-25 running with a couple of wait states using slower memory. A little bit of effort tuning up a program so that it makes better use of that old 486-25's on chip cache may be all it takes to transform the slug into something that runs pretty good. Pentiums also respond very well to efforts aimed at keeping code and data in their on chip cache - this isn't effort that will be wasted! There's 2X, 3X and even larger performance gains to be had on 486's and Pentiums by paying some attention to their cache. All the newer generations of Intel, and compatible, CPU's that have on chip caches are highly dependent on that cache for getting maximum performance. I expect this architectural trend to continue into the foreseeable future.

What's coming

Utilities for locating problem areas in your programs and data files
Working with C/C++ compiler's
Code transformations in assembler code
Code transformations in C code
Tricks for packaging code so that its smaller
Special tactics for DOS device drivers and TSR's
Special numbers
Speed and space tricks for the 386 and better CPU's
"Working set", what it is, and how to chop it down so programs run better.
Sleazy tricks for when you need to get just a few more K to save the cost of a diskette.
A special 32 bit implementation of a near heap that can cut data space
Code generated by some popular C/C++ compilers to see how well they do
The dramatic performance effects the on chip caches on the 486 and Pentium have
Getting the most out of the HMA when DOS=HIGH.

tonyi@ibm.net - Shut up and jump!

Last modified on Monday, Feb 1, 1999