tonyi@ibm.net
| Why? | Virtual Memory | Bloated=Slow | Assumptions | Real Life |
| Optimizations | Making Changes | Testing | Handy Stuff | What's Next |
The answer is performance and cost. Even today's machines with many megabytes of virtual memory run into memory related performance problems due to code size. Piggy code doesn't "play well with others". Bloated apps push other programs out of memory and slow the whole system down. They load slower than lean applications and consume more of the customer's disk space.
If a program is to be distributed on diskettes and more than a few thousand copies sold, the vendor will see significant distribution media costs as well. Diskettes aren't free. If millions of copies of a program are expected to be sold, and squeezing the code down a little allows a diskette to be dropped from the package, the vendor can wind up saving millions of dollars in manufacturing costs.
How much is it worth to save a couple of million dollars? For a vendor distributing millions of copies of something (or even a couple of hundred thousand), the effort needed to squeeze things down a little to save a diskette can payoff. It may even make economic sense to dedicate a small team who's sole job is to shrink the product!
Another fairly obvious advantage to keeping things as small as possible is that IT EXPANDS YOUR MARKET. It's a reality in today's software market that the installed base has a lot of clout. The bulk of the installed base is also always a few generations behind whatever the bleeding edge "hot box" of the day happens to be. As this book is being written, Pentium-2 400mhz systems with 64 megabytes of memory are the typical "hot" systems being touted in the trade press. This means there's still a LOT of original 60mhz Pentium class machines with 8 or 16 megabytes of memory chugging away doing their thing. Also, if you expect to sell your code into the government sector, then expect to see even older machines -- 286's, 386's, and 486's with a few megabytes of memory will be common. In these days of tight budgets, government agencies are trying to run lean and make due with what they've got on hand.
However, virtual memory isn't a panacea. It costs dearly when the
memory needs for the set of programs running on a machine exceed
available physical memory. When the space needed to run the
Operating System, device drivers, and apps, exceeds the physical
memory available on a system, it forces the OS's memory management
code to start swapping or paging chunks of memory to a swap file
on physical media - such as a hard disk, or over a LAN if the
machine is a diskless workstation.
Local physical memory is fast. Its right in the machine and
available in just nano-seconds. When a system is forced to swap
in a segment or page of virtual memory that was swapped, we're
talking about a trip through the OS's file system as it goes to
the external media to retrieve the needed memory. Other bigger
delays are associated with the speed of the external media the
OS is swapping or paging to.
In many cases, virtual memory systems will perform worse than
plain old DOS would running a similar program that used overlays,
EMS, and XMS memory techniques. In the "bad old days" a
program that used all the classic DOS memory techniques would,
by its very nature, have a fairly good idea of how and where data
and code were placed and managed. This is and was a pretty
efficient scheme most of the time, but it placed all of the burden
of managing the program's code and data on the application. With
virtual memory, you've generally got less control over where an
operating system keeps things in memory. The swapper or pager in a
virtual memory environment is like a force of nature with a mind
of its own. Parts of your program will be drifting in and out of
memory behind your back as the dynamic needs of the system change
over time.
Even a virtual memory system that's doing a lot of book keeping
and trying to make good "guesses" about how your program
is going to behave will only have a very rough idea of what to
expect and how to respond in a way that gives the best performance.
All of the fundamental concepts programmers learned under DOS back
in the bad old days, apply conceptually to virtual memory systems
today. A swapper or pager may be a force of its own rampaging
around a machine, but the programs running in that system can do a
lot to make the swapper or pager a relatively benign force rather
than a raging malevolent beast that brings powerful machines to
their knees in fits of page thrashing.
Running OS/2 or Windows 95 on a 4 megabyte machine will also give
you a quick appreciation for what the costs of virtual memory are,
and why its not free too. Your programs are impacted by virtual
memory the same way OS/2 and Windows 95 are.
In writing this book, I'm assuming you the reader, have some
experience as a programmer using C or C++. You don't need to be a
total C or C++ guru, but I'm not going to be giving tutorial lessons
on structures and basic data types either.
I also assume you have a basic understanding of the 80x86 instruction
set and have at least a casual ability to read some assembler code.
I'll not be explaining in detail what a segment register or an ADD
instruction is.
I assume you've got access to some compilers and assemblers so you
can try out some of the various things that will be discussed within.
I've tried to work in examples using C/C++ compilers from several
major vendors (primarily Borland, Microsoft, and Watcom). This isn't
a book about any one particular compiler though. The concepts usually
apply to all of them even though a particular vendor's way of enabling
some feature or optimization may be different than another vendor.
Whenever there's a feature that's specific to vendor's compiler, or
version of a compiler from a vendor, I'll mention that in the text.
I assume you have access to a code profiler and can use it to figure
out the "critical path" through your code. This is the
stuff that NEEDS to be fast no matter what. Many of the space saving
tricks and techniques herein won't be applied to critical path code
because they will slow it down too much. The good news though, is
that the bulk of the code in most programs isn't actually in the
critical path. The results from a profiler should guide almost
everything you do in tuning code for minimum space, and maximum
speed.
I can't stress this enough. Get your basic code working before
launching on any major tuning efforts. Doing performance work while
code is unstable, or a design is still being hammered out, is
usually a complete waste of time. This is not to say you shouldn't
be thinking about performance goals for a program up front as part
of the design work - you should do this if the program is one where
performance will be an issue. Choosing good data structures and
algorithms during the design phase always pays off better than the
most expertly executed tuning of poor code late in the game.
Strategic algorighm and data structure choices are likely going to
account for orders of magnitude improvements in a program's
behavior compared to the 2X-5X that would be more common
for "tactical" tuning efforts.later in the game.
Often we'll be handed something that's in trouble and tasked with
bailing out a behind schedule and over budget disaster -- this
happens in real life all the time. In cases like this there just
may not be enough time to do it right and redesign the project.
For whatever reason, management usually seems lothe to give up
on what exists, and insists on pouring good money after bad. In
situations like this we may be reduced to doing damage control
just to get the project to a point where it works well enough to
consider shipping. When this happens, any performance work falls
into a category I call "turd polishing" -- the
transformation of that which is truly awful into that which is
simply bad.
When you're stuck with a barely functioning house of cards that
is going to ship in a couple of weeks come hell or high water,
priorities need to be set differently than we would
under "ideal" circumstances. In a nightmare scenario
like this we may be forced into doing a little bit of performance
work before the code is truly stable -- because it's inevitable
that the product is going to ship loaded with bugs anyway.
If you're stuck with one of these all too common situations, you're
going to have to be real Draconian about triaging garden variety
bugs versus performance problems. Suppose this hypothetical
product from hell were a word processing package that takes 20
seconds to open up a trivial length file on a 200mhz Pentium
Pro machine? The vast majority of potential users are going to
consider this an unacceptable level of performance -- something
NEEDS to be done about that performance problem before the product
ships or the reviewers will throw up all over the product, then
sales will be zilch and you'll be out of a job when the company
goes bankrupt. At the same time, there's a number of garden variety
functional bugs that exist in relatively obscure features -- i.e.
things that just don't work right. None of these bugs is serious
enough to crash the program or corrupt the user's data. This
distasteful situation is one where you're going to be forced into
violating the basic principle of not optimizing unstable code
because the file opening problem is so severe it must be fixed.
There's really only two options when faced with a disaster like
this -- you can hunker down and polish that turd, or resign in
disgust.
If the end is in sight, I'd grit my teeth and polish that turd.
A successful turd polishing effort can actually be pretty rewarding
from a perverse technical point of view. You'll learn a lot from
doing it, and quite possibly be hailed as the hero who "saved
the project"!
One technique I personally use a LOT when doing tactical fine
tuning work is to IFDEF or #ifdef various tricks and techniques
into the code when I make changes. This makes it a LOT easier
for someone else to understand the thought process I went through
when applying a particular code reduction technique. They can see
the old code right next to the new code. If you were to look at
IBM's PC DOS source code you'd see hundreds of conditionalized
changes with ifdef's that look something like this:
I also tend to do optimizations in several phases. I'll start
grabbing the "low hanging fruit" and conditionalize
that on an initial ifdef like:
Then on subsequent and less obvious tuning changes I'll start using:
Suppose you made 10 changes in a module conditionalized with:
Then the unit/function tests showed that this particular batch of
changes caused the program to do something horrible - like crash
the machine. The IFDEF's make it easy for you to find every change
with your editor and then examine the code again. Often you'll find
that a simple examination of that batch of changes will reveal a
flaw in one of them.
If a quick visual examination doesn't make the bug obvious, then
go back and convert 5 of the 10 "ifdef __PERF__6"
conditional lines to:
Using this kind of "divide and conquer" approach you can
narrow down an area with a bug VERY quickly. Doing a quick rebuild,
then rerunning the tests for a given area of code will be a lot
easier than groping around with a debugger for hours trying to find
out why the program is crashing mysteriously somewhere. Once you
locate the exact conditional or few conditionals that are causing
the problem, the bug almost invariably becomes obvious.
You won't be able to enjoy this technique if you're trying to tune
code that hasn't stabilized yet though! You won't be able to tell
if the tuning changes introduced a bug, or if it was there all along.
Life will be miserable, and you'll be considering suicide. So, resist
the urge to start tweaking up code that's under active development
if at all possible.
If you find some change introduced a bug and it remains a mystery
even after it been isolated, then just leave that potential
optimization disabled. Move on to something else. Don't spend a
day trying to find out why a particularly tricky change that should
have saved 8 bytes doesn't work! The name of the game here is to
produce safe and tangible results quickly. You'll probably get
those 8 bytes back somewhere else for less effort than it takes to
debug the mystery.
The only time you'll want to sweat out the stragglers, is if you're
all done and still came up short on your goal and need "just a
few more bytes" to save a diskette or to fit in a ROM (if
you're ROM'ing your code).
This testing thing is real important. I'm not writing a book on
software testing, or how to write testable software though. One
book that touches on the topic is Steve Maguire's "Writing
Solid Code" published by Microsoft Press. Its quite good
and I recommend it highly. Steve will probably gag when he sees
some of the tricks and swindles described in this book though.
Our goals are different however. Steve talks about making code
work in a generic sense. I'm talking about applying the finishing
touches to code that already works today. Specifically, I'm going
to be talking about tuning efforts aimed at programs targeted for
Intel 80x86 and compatible processors.
You should take everything that Steve has written to heart when
you're creating a program, he's pitching concepts that can save
you untold hours of grief and frustration during the creation
effort.
If you conditionalize your tuning tricks like I've just outlined,
then you won't be violating too badly the spirit of what Steve is
saying in most cases. In other cases, some tuning tricks are going
to introduce major dependencies in sections of code. This can be
OK if you've defensively programmed around those dependencies so
they'll show up instantly if the dependency is ever violated.
For instance, the LES and LDS instructions can often be used to
load a two word values in real mode or V86 mode on the Intel
processors like this:
That sequence is a few bytes smaller than doing two MOV's from
memory into AX and DX. However, we've introduced a
dependency - Var1 and Var2 now NEED to be together in memory!
If some maintenance programmer happened to inject another
variable between Var1 and Var2 we'd be in trouble!
Now the LES is going to pick up the Foo variable and half of Var2,
and the result is going to be a horrible bug introduced into the
program.
We can protect against this dire possibility though with a compile
time check. Then the maintenance programmer will instantly know
something is wrong the first time they try a build of the program.
This time, the ".ERRNZ" compile time check will blow out
because the relationship between Var1 and Var2 was violated by the
introduction of the Foo variable.
Here's another example, using a C function that was written for a
16 bit C compiler this time.
This looks easy and straight forward, people do it all the time.
The problem though is that its assuming that an "int" is
always a 16 bit value. If this code is ever ported to a 32 bit C
compiler where an 'int' is typically 32 bits rather than 16, its
not going to do what you want. It'll trash one of the bits in the
middle of the integer rather than setting the high order bit.
Ouch! This function can be coded to work with either a 16 or 32
bit compiler, and warn of something else is used on it.
That version of the function knows exactly what environments it
works in, and which it won't. When porting it to some platform
that doesn't have 16 or 32 bit integers, the compiler is going
to instantly barf and let you know the function needs to be
worked on. This kind of stuff is cheap insurance folks - just
do it!
While I'm talking about being defensive here, let me mention
warning levels. Many C/C++ compilers today default to a minimum
warning level to keep older K&R style code from generating
a blizzard of warning messages. Get into the habit of cranking
the warning levels to the max. You really do want to see a lot
of those messages because there's often bugs hiding behind them!
Don't treat the compiler like it's an annoyance to be avoided
and dismissed. Modern C/C++ compilers all implement a very good
set of warning messages. These are your friend, not the enemy.
Programmers tend to favor killer machines for developing on and
this is natural because nobody likes to wait around for a build
to complete. Killer machines can also accelerate basic testing
on programs and save a lot of time. There's nothing wrong with
this as long as you realize that the whole world isn't out there
running the latest hot box being touted in the trade rags.
So, dust off that old 486-25 from few years ago sitting in the
closet and put it to good use as a test bed for your tuning
efforts. The sweet spot in the market, as this book is being
written, seems to be 233mhz and 200mhz Pentium machines with
about 32M or 64M of memory. 16M low end Pentium 133 systems
are still selling too. If you can tune up your code so that it
runs acceptably well on an older 486-25 with 8M or 16M of memory,
then it's going to be a screamer by default on the current crop
of machines being sold.
Don't fall into the trap of just testing a program on a Godzilla
machine without trying it out on some of the older generation
hardware. Sure, it may seem to run OK on a 233mhz Pentium,
but it may also be a real pig on a 486-25 running with a couple
of wait states using slower memory. A little bit of effort tuning
up a program so that it makes better use of that old 486-25's on
chip cache may be all it takes to transform the slug into
something that runs pretty good. Pentiums also respond very well
to efforts aimed at keeping code and data in their on chip
cache - this isn't effort that will be wasted! There's 2X, 3X and
even larger performance gains to be had on 486's and Pentiums by
paying some attention to their cache. All the newer generations
of Intel, and compatible, CPU's that have on chip caches are
highly dependent on that cache for getting maximum performance. I
expect this architectural trend to continue into the foreseeable
future.
Virtual Memory
Virtual memory is nice, and it relieves many of the problems
programmers lived with during the "plain DOS" era of PC
computing. Overlays, EMS, XMS, etc all caused a lot of people a lot
of headaches - and still do today. In fact, we'll be examining a
few tricks later on that can help when dealing with the
"plain DOS" world where that low 640K of memory is more
precious than gold.Bloated = Slow
"fast" code that doesn't fit within the a given machine's
physical memory isn't "fast" at all -- it's slow.
Want to see your PC come to a screeching halt and see your disk
drive light up like a Christmas tree? Run a big program on a
virtual memory system where there's not enough physical memory to
hold everything it needs. It makes a nice disk drive stress test.
We'll be examining a sample program in later chapters called MSPIGGY
that will drive a virtual memory system to its knees by accessing
virtual memory in a particularly evil way.Assumptions
This is not an introductory programming book on C/C++ or assembler.
There's many fine examples of introductory programming books lining
the shelves at your local book store. I'm not going to waste a lot
of time traveling down that well beaten path. Real life
Many of the code reduction or performance techniques in this book
are going to tend to obscure your code somewhat in the process of
shrinking it or making it faster. I can't recommend using many of
them on your first cut at newly developed code while its still
being debugged. Get your stuff really debugged before you launch
into any serious tweaking efforts.Optimization realism
It's nice when a
project was well planned and executed from the start -- however
this isn't always going to be the case. You may not have been the
one to do the original design and coding work.
ifdef __TONYI__1
<new smaller/faster stuff>
else
<old fat/slow stuff>
endif
ifdef __TONYI__1
ifdef __TONYI__2
ifdef __TONYI__3
etc...
Incremental Changes
I can't stress enough the value of incremental changes while doing
the type of tuning work this book is about. You will inject bugs on
occasion and you will get brain dead at times. The more you do,
the better you'll get at it and mistakes will lessen over time, but
nothing can replace running the test suite after making a group of
changes. Using the phased IFDEF approach will help you narrow down
the cause of a particular bug very quickly. I find its often quicker
than using a debugger to find a problem!
ifdef __PERF__6
ifdef __PERF__6_omit
Test Buckets
Your Test Bucket Is Your Friend, Use It And Abuse It For All Its
Worth! (and if its worthless, fix the damn thing!)
Var1 dw ?
Var2 dw ?
; Get Var1 and Var2 into DX:AX
les ax, dword ptr Var1
mov dx, es
Var1 dw ?
Foo db ?
Var2 dw ?
; Get Var1 and Var2 into DX:AX
les ax, dword ptr Var1
mov dx, es
Var1 dw ?
Foo db ?
Var2 dw ?
; Get Var1 and Var2 into DX:AX
.ERRNZ (Var2 - Var1 - 2)
les ax, dword ptr Var1
mov dx, es
int SetHighBitOfInt(int value)
{
return (value | (int)0x8000);
}
#include <limits.h>
int SetHighBitOfInt(int value)
{
#if INT_MAX == 32767
/* 16 bit compiler */
#define MASK 0x8000
#else
#if INT_MAX == 2147483647
/* 32 bit compiler */
#define MASK 0x80000000
#else
#error "We're in trouble here"
#endif
#endif
return (value | (int)MASK);
#undef MASK
}
Handy things
Have you ever bought some program and been unhappy with its
performance on your machine when you got it home? You probably
have unless you're sitting pretty with the latest Godzilla
Pentium 400mhz machine with 128 megabytes of RAM, and a blazing
fast 8 gigabyte or larger hard disk.What's coming
tonyi@ibm.net
- Shut up and jump!
Last modified on Monday, Feb 1, 1999