Chapter 12 - Packaging code for size and speed

The way we choose to "package" our code and data into executable programs can have some serious effects on the size of the executables, and the way they'll perform when they are executed.

Have you ever noticed that some applications with very large EXE files seem to load and get to their initial screen almost instantly? Then there are other similarly sized applications that sit there and grind for several seconds before showing their initial screen. Neither of those applications may be in a condition to actually let the user do anything with them for several seconds, but there's the perception of speed in one case and not in the other. The difference may simply be how the code was packaged when the application was compiled and linked.

We're going to cover the mechanics of how applications are constructed in this chapter. This may be something you never paid much attention to in the past. It is pretty easy to ignore the details of this stuff these days and just take all the defaults provided by whatever development tools you happen to be using. Often taking those defaults isn't going to be the best way though. Sometimes, those defaults can screw us really badly when we're looking to squeeze a few more bytes out of a program, or reduce its working set in memory.

Segments, groups, and memory models

There are two notions of what a "segment" is in the 16 bit world on the 80x86 processors. There's the physical notion of "segment" that everyone loves to hate. This is where one of the processor's segment registers is aimed at a chunk of memory that can be up to 64K in size. This memory can be virtual if running in protected mode, or physical if running in real mode.

In real mode, there are no hardware enforced restrictions on the range of bytes that can be accessed in a segment. In protected mode, a segment may have hardware enforced "limits" that restrict the range of valid addresses within the segment's potential 64K range. For the purposes of this discussion, a 386/486/Pentium running programs in V86 mode behaves just like real mode from a program's point of view. V86 mode is a special mode of operation where a 386 running in protected mode can mimic most of the behaviors of a 386 running in real mode. If you've ever run a "DOS session" under Windows 3.X in enhanced mode, or under OS/2, or under Windows 95, or Windows NT, then you were using that processor in V86 mode to run that DOS session. If you've ever used a DOS memory manager like EMM386, Quarterdeck's QEMM, Qualitas's 386Max, etc. then you were also running in V86 mode.

All of those memory managers actually run the CPU in protected mode with the paging hardware enabled. It's the paging hardware that enables them to do their tricks to create upper memory blocks (UMB's) that drivers and TSR's can be "loaded high" in.

Along with the physical manifestation of segments, there is the logical notion of a segment that assembler programmers are familiar with. The logical notion may or may not have a direct correspondence with the physical notion depending on how the code has been organized and packaged by compilers, assemblers, and linkers. A typical example of a "segment" from the logical perspective might look like this in assembler code:

foobar segment para public 'code'
;
; Things in the "foobar" segment go here
;
foobar ends

Logical segments like this always have a name of some sort - in this example it's "foobar". These names will be created by default when using a C/C++ compiler. Commonly used names in current compilers are things like "_TEXT", "_DATA", and "_BSS". If you tell whatever linker or development tool you're using to create a linker MAP file for a C/C++ program, you'll see a whole bunch of these logical segment names in the MAP file listing. For example, I compiled this empty C program using all the default code generation switches for the Microsoft version 8.0 C/C++ compiler. I added the /Fm switch to force a MAP file to be created because one isn't created by default.

/* EMPTY.C */
int main(void)
   {
   return 0;
   }

The first part of the MAP file that resulted looked like this:

Start  Stop   Length Name Class
00000H 00783H 00784H _TEXT CODE
00790H 00791H 00002H EMULATOR_TEXT CODE
00792H 00792H 00000H C_ETEXT ENDCODE
007A0H 007A0H 00000H EMULATOR_DATA FAR_DATA
007A0H 007E1H 00042H NULL BEGDATA
007E2H 00863H 00082H _DATA DATA
00864H 00865H 00002H XIQC DATA
00866H 00871H 0000CH DBDATA DATA
00872H 0087FH 0000EH CDATA DATA
00880H 00880H 00000H XIFB DATA
00880H 00880H 00000H XIF DATA
00880H 00880H 00000H XIFE DATA
00880H 00880H 00000H XIB DATA
00880H 00880H 00000H XI DATA
00880H 00880H 00000H XIE DATA
00880H 00880H 00000H XPB DATA
00880H 00880H 00000H XP DATA
00880H 00880H 00000H XPE DATA
00880H 00880H 00000H XCB DATA
00880H 00880H 00000H XC DATA
00880H 00880H 00000H XCE DATA
00880H 00880H 00000H XCFB DATA
00880H 00880H 00000H XCFCRT DATA
00880H 00880H 00000H XCF DATA
00880H 00880H 00000H XCFE DATA
00880H 00880H 00000H XIFCB DATA
00880H 00880H 00000H XIFU DATA
00880H 00880H 00000H XIFL DATA
00880H 00880H 00000H XIFM DATA
00880H 00880H 00000H XIFCE DATA
00880H 00880H 00000H CONST CONST
00880H 00887H 00008H HDR MSG
00888H 0095DH 000D6H MSG MSG
0095EH 0095FH 00002H PAD MSG
00960H 00960H 00001H EPAD MSG
00962H 00962H 00000H _BSS BSS
00962H 00962H 00000H XOB BSS
00962H 00962H 00000H XO BSS
00962H 00962H 00000H XOE BSS
00962H 00962H 00000H XOFB BSS
00962H 00962H 00000H XOF BSS
00962H 00962H 00000H XOFE BSS
00970H 0116FH 00800H STACK STACK
Origin Group
007A:0 DGROUP

Notice all the segment names under the "Name" column in the listing. There are quite a few of them. That's quite a lot of stuff for a program that does nothing but return a 0! Also notice that quite a few of those "segments" are actually zero bytes long.

There's nothing that says one of these logical segments needs to contain any data or code, and often they don't.

In fact, the following segment appears to be present simply as a marker to delineate the end of all the code in the program:

00792H 00792H 00000H C_ETEXT ENDCODE

See what came right after the "C_ETEXT" segment? It was this one:

007A0H 007A0H 00000H EMULATOR_DATA FAR_DATA

Now take a look at the last line in the listing there under the "Origin" column. We see that a group called "DGROUP" happens to start at that same location and there are all sorts of "segments" in this DGROUP group. This illustrates what the notion of a "group" is.
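The group origin "007A:0" and the start address 007A0H in the listing are the same place written two ways. In real mode a segment:offset pair refers to the linear address formed by multiplying the segment by 16 and adding the offset. Here is a minimal sketch of that arithmetic; the function names are made up for illustration and aren't part of any compiler's library.

```c
#include <assert.h>
#include <stdint.h>

/* Linear address of a real mode segment:offset pair (segment * 16 + offset). */
uint32_t linear_addr(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

/* Segment value whose offset 0 lands on a paragraph aligned linear address. */
uint16_t segment_of(uint32_t linear)
{
    assert((linear & 0xFu) == 0);   /* must sit on a 16 byte boundary */
    assert(linear < 0x100000ul);    /* must be below 1 MB             */
    return (uint16_t)(linear >> 4);
}
```

Calling linear_addr(0x007A, 0) gives 0x7A0, and segment_of(0x7A0) gives 0x007A, which is how the MAP file's two notations line up.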

A "group" is a collection of one or more of these logical "segments". Some of those segments may contain nothing, others can contain actual code or data.

In 16 bit programs, there's one limitation on the size of any given "segment" - it can't be more than 64K in length. The limitation on a group is that the combined size of all the segments in that group can't total more than 64K.

It wouldn't be possible to "group" together two 40K segments. The total size of the group would be more than 64K. It would be permissible to group together two 16K segments like this:

Two 16K segments in a 32K group
SEG1 - 16K SEG2 - 16K

SEG1SEG2 group (size 32K)
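The 64K rule for groups amounts to a simple running total. This is only a sketch: the name group_fits is hypothetical, and it ignores any alignment padding a linker might insert between the segments.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_LIMIT 65536ul   /* 64K ceiling for a 16 bit group */

/* Returns 1 when the segments could share one group, 0 otherwise.
   Alignment padding between segments is ignored in this sketch. */
int group_fits(const unsigned long *seg_sizes, size_t count)
{
    unsigned long total = 0;
    size_t i;

    assert(seg_sizes != NULL || count == 0);
    for (i = 0; i < count; i++) {
        total += seg_sizes[i];
        if (total > GROUP_LIMIT)
            return 0;             /* combined size went past 64K */
    }
    return 1;
}
```

Two 16K segments pass the check, while two 40K segments fail it.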

There's no requirement that groups be exactly 64K.

Typically they're going to be something less than 64K. In fact, all of the segments in the DGROUP group from the EMPTY.C example program totaled far less than 64K. Subtracting the start of the DGROUP at 7A0h from where the "STACK" segment in that MAP file ends (116Fh) gives 2511, so the DGROUP group occupies 2512 bytes once that final byte is counted. The bulk of that is in the STACK segment which is 2K.
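Because the "Stop" column in the MAP file is the address of a segment's last byte, the number of bytes in a span is Stop minus Start plus one. A small sketch of that calculation follows; span_length is a name I made up for the example.

```c
#include <assert.h>

/* Bytes covered by a span in a MAP file, where Start and Stop are both
   inclusive addresses. Empty segments are listed with Start == Stop and
   a Length of 0, so this only applies to spans holding at least one byte. */
unsigned long span_length(unsigned long start, unsigned long stop)
{
    assert(stop >= start);
    return stop - start + 1;
}
```

For DGROUP, span_length(0x7A0, 0x116F) comes to 2512 bytes, and span_length(0x970, 0x116F) gives the 800h (2K) shown for the STACK segment.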

In the previous MAP file listing there was a column titled "Class".

A class name is a way to get segments with different names located next to each other in an executable.

Linkers will always place segments having identical class names next to each other, even when the segment names are different.

If a segment is standalone and not in any group, then the class name doesn't matter much.

Playing with segment names, groups, and class names is how we can make executable files smaller and/or tune them so they'll work better in virtual memory environments. Remember the chapter where the performance impacts of the 486 and Pentium's on chip caches were discussed?

Diddling around with groups, names, and class names is how we're going to help locate critical path code and data to make the most effective use of those CPU caches.

Tiny model

The "tiny" memory model is the oldest and simplest of the 16 bit memory models available for packaging programs. In DOS terms, tiny model corresponds to a program with the ".COM" file extension.

In tiny model, all of a program's initial code and data live within a single 64K physical memory segment.

There may be lots of logical segments in a tiny model program as well and there usually are if the program was generated by a C/C++ compiler. In the previous example, that MAP file was for a "small" model program, but a MAP file for a tiny model version would look very similar.

Tiny model assembler programs will often contain just one logical segment that holds all the program's code and data.

This is an example of the smallest possible tiny model DOS program. It would generate a ".COM" file that's one byte long and does nothing but return to DOS when it's executed.

tiny  segment para public 'code'
      assume  cs:tiny,ds:tiny,es:tiny,ss:tiny
      org     100h
start proc    near
      ret     ; Return to DOS
start endp
tiny  ends
      end     start

In tiny model programs, the CS, DS, ES, and SS registers are normally all aimed at the same physical segment in memory, even though there may be many logical segments in the program.

        +------------ data group & code group are the same
        V
    Codeseg1     <--- DS, ES, CS, SS at program startup
    Dataseg1    
    Dataseg2    
    Codeseg2    
    Dataseg3    
   Empty space  
   Empty space  
   Empty space   <--- Initial SP value

The key thing about the tiny model scheme is there can never be more than 64K-256 bytes of combined code and data in the program's executable, and the segments must all be in the same group.

When DOS loads a ".COM" program, DOS performs the following actions:

Create a 256 byte header for the program called a program segment prefix (PSP).
Load the contents of the ".COM" file right after the PSP. This is done verbatim.
Set the CPU's segment registers to all point to the PSP.
Set AX,BX,CX,DX,SI,DI,BP to zero.
Set SP pointing to the end of the 64K for the .COM program.
JMP to offset 100h from the start of the PSP. This is the first byte of the program.

This scheme implies that the first byte in a genuine ".COM" program MUST BE AN EXECUTABLE INSTRUCTION - it can't be data because DOS is going to JMP to the first byte in the program.
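The loading steps listed above can be mimicked in a few lines of C that lay out a 64K buffer the same way. This is a simulation sketch only; the structure and function names are invented. A couple of details are filled in from general DOS behavior rather than from the list: the PSP begins with an INT 20h instruction (bytes CDh 20h), and DOS pushes a zero word on the stack before jumping to the program, which leaves SP at FFFEh when a full 64K is available. That is why a bare RET in a ".COM" program ends up back in DOS.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SEG_SIZE 65536ul                /* one 64K real mode segment */
#define PSP_SIZE 256u                   /* program segment prefix    */
#define COM_MAX  (SEG_SIZE - PSP_SIZE)  /* largest image that fits   */

struct com_regs {
    uint16_t ip;    /* where execution starts */
    uint16_t sp;    /* initial stack pointer  */
};

/* Lays a .COM image out in a 64K buffer the way the steps describe.
   Returns 0 on success, -1 if the image cannot fit. */
int load_com(uint8_t *seg, const uint8_t *image, size_t len,
             struct com_regs *regs)
{
    assert(seg != NULL && image != NULL && regs != NULL);
    if (len > COM_MAX)
        return -1;

    memset(seg, 0, SEG_SIZE);           /* also leaves a zero word at FFFEh */
    seg[0] = 0xCD;                      /* PSP begins with INT 20h, the     */
    seg[1] = 0x20;                      /* DOS "terminate program" call     */
    memcpy(seg + PSP_SIZE, image, len); /* file contents, verbatim          */

    regs->ip = 0x0100;                  /* first byte of the program        */
    regs->sp = 0xFFFE;                  /* points at the zero return word   */
    return 0;
}
```

Loading the one byte program shown earlier (a single C3h, the RET opcode) puts that byte at offset 100h with IP set to 100h and SP set to FFFEh.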

The first instruction in many tiny model programs is a JMP that jumps around the data for the program - like this:

         jmp initialize ;<-- This is the first instruction in the program
foo1     dw  ?          ; This is the
foo2     dd  ?          ; data for the
bar1     db  ?          ; program
initialize label near
;
; The program's code goes here
;

Tiny model programs are not NORMALLY suitable for execution in protected mode environments.

With some trickery using segment aliases, the concept can be made to work, but no language products currently in existence allow the construction of a single segment protected mode C/C++ program.

If you were to build a DPMI (DOS protected mode interface) compliant program from scratch, not using a DOS extender, then tiny model protected mode programs are possible, and indeed this is basically what happens to a DPMI program when it first flips into protected mode.

For those readers interested in such things, there's a sample tiny model DPMI "hello world" program written in assembler in Appendix B in the back of the book.

Small model

Small model programs are similar to tiny model programs. The main difference this time is that there can be 64K of code and 64K of data in a small model program as opposed to 64K total code and data for a tiny model program. Small model programs also use the ".EXE" format for their executables rather than the ".COM" format. As in tiny model, there can be many code and data segments. The initial entry point for a small model program doesn't need to be the first byte in the program either. A generic layout of a small model program might look like this:

        +------------------- code group
        V
    Codeseg1     <----- CS:
    Codeseg2    
    Codeseg3     <----- Program entry point


        +------------------- data group
        V
    Dataseg1     <----- DS: and SS:
    Dataseg2    
    Stackseg     <----- Initial SP value

There are many different ways to lay out the various data segments and stack segments in a small model program, so this diagram is just a basic conceptual model. Different C/C++ compiler vendors all lay things out a little differently in their implementations.

The common elements though generally are:

All code is packaged together in one 64K or smaller group
All data is packaged together in one 64K or smaller group
The stack lives somewhere within the same group that general data lives in
Pointers to functions are 16 bit near pointers
Pointers to data are 16 bit near pointers

Medium model

Medium model programs are a hybrid memory model where more than 64K of code is allowed, but data is limited to 64K as in small model. Medium model programs always use the ".EXE" format for their executables. A generic layout of a medium model program might look like this:


        +-- code group1
        V
    Codeseg1     <--- CS when running code in this group
    Codeseg2    
    Codeseg3    

        +-- code group2
        V
    Codeseg4     <--- CS when running code in this group
    Codeseg5    
    Codeseg6     <--- Program entry point

        +-- code group3
        V
    Codeseg7     <--- CS when running code in this group
    Codeseg8    
    Codeseg9    

        +-- data group
        V
    Dataseg1     <--- DS and SS
    Dataseg2    
    Stackseg     <--- Initial SP value

There are many different ways to lay out the various data segments and stack segments in a medium model program, so this diagram is just conceptual. Different C/C++ compiler vendors all lay things out a little differently in their implementations. However, the common elements generally are:

Multiple physical code segments.
All data is packaged together in one 64K or smaller group
The stack lives somewhere within the same group that general data lives in
Pointers to functions are 32 bit far pointers
Pointers to data are 16 bit near pointers

Compact model

Compact model is kind of the opposite concept of medium model. Where medium model is big code with small data, compact model is big data with small code.

Compact model for most compilers also allows for a "near" heap as well as a far heap.
When it's practical to use it, near heap items are faster to access than normal data items in compact model.
With some compilers there's going to be a 64K limit on the amount of statically allocated data.

Large model

Large model is an amalgam of the medium and compact memory models where there are multiple physical code segments, and multiple physical data segments allowed.

With some compilers there's going to be a 64K limit on the amount of statically allocated data in a large model program.
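The default pointer sizes for the models covered so far can be collected into a small lookup table. The sizes for tiny, small, and medium come straight from the lists above; the compact and large rows follow from "small code, big data" and "big code, big data". The structure and function names are just for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct model_info {
    const char *name;
    unsigned    code_ptr_bytes;  /* default size of a function pointer */
    unsigned    data_ptr_bytes;  /* default size of a data pointer     */
};

static const struct model_info models[] = {
    { "tiny",    2, 2 },
    { "small",   2, 2 },
    { "medium",  4, 2 },   /* far code, near data */
    { "compact", 2, 4 },   /* near code, far data */
    { "large",   4, 4 }    /* far code, far data  */
};

/* Returns the table entry for a model name, or NULL if it isn't listed. */
const struct model_info *find_model(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < sizeof models / sizeof models[0]; i++)
        if (strcmp(models[i].name, name) == 0)
            return &models[i];
    return NULL;
}
```

Every far pointer in the table costs two extra bytes of storage and, as discussed later, extra work to load a segment register each time it's used.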

Huge model

In all of the previous memory models that allow for big data, there's going to be a restriction that any given item of data has to be less than 64K in length. This is due to the fact that while they'll use 32 bit far pointers for data accesses, the offset part of those pointers is going to wrap around at the 64K limit on segments. In other words, the compiler will generate code that only does 16 bit arithmetic on the 16 bit offset part of the 32 bit pointer. The 16 bit segment will remain untouched.

Huge model lifts this restriction on pointer wrapping by allowing for individual data items to be larger than 64K.

Take this C program for example:

#include <stdio.h>
char big1[500000L] = {0};
int main(void)
   {
   printf("size of char * = %d\n", (int)sizeof(char *));
   printf("size of big1 array = %ld\n", (long)sizeof(big1));
   return 0;
   }

This C program declares 500K of static data in a single array of characters. If segment wrapping were to occur as in the other "big data" memory models, any code trying to access beyond the first 64K of the array would fail.

Huge model solves this wrapping effect problem by doing what's known as pointer normalization. When a pointer to a data item larger than 64K is incremented, the segment part of the pointer gets adjusted so that the offset doesn't wrap around. This is a very slow process though.
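The difference between the two kinds of pointer arithmetic is easy to see in a small simulation. This sketch models real mode only, and the names are invented. The normal form used here, which keeps the offset in the range 0 to 15, is an assumption for illustration; an actual compiler's runtime may adjust the segment in a different way.

```c
#include <assert.h>
#include <stdint.h>

struct far_ptr {
    uint16_t seg;
    uint16_t off;
};

/* Large model style arithmetic: only the offset changes, so it wraps at 64K. */
struct far_ptr far_add(struct far_ptr p, uint32_t bytes)
{
    p.off = (uint16_t)(p.off + bytes);
    return p;
}

/* Huge model style arithmetic (real mode): go through the 20 bit linear
   address, then rebuild a pointer whose offset is in the range 0..15. */
struct far_ptr huge_add(struct far_ptr p, uint32_t bytes)
{
    uint32_t linear = ((uint32_t)p.seg << 4) + p.off + bytes;

    assert(linear < 0x100000ul);   /* stay inside the 1 MB real mode space */
    p.seg = (uint16_t)(linear >> 4);
    p.off = (uint16_t)(linear & 0xF);
    return p;
}
```

Starting from 1000:FFF0 and adding 20h, far_add() wraps around to 1000:0010, which is the wrong byte, while huge_add() moves on to 2001:0000. The extra shifting, masking, and segment reloading on every pointer operation is where the speed goes.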

In protected mode huge model programs, the procedure is conceptually similar to real mode, but the pointer arithmetic procedure is somewhat different. In protected mode, "huge" data items result in a sequential group of selectors being allocated. The pointer arithmetic process in protected mode involves computing what the next selector in the "huge" object will be. Normally, this process is simplified somewhat by having an operating system defined constant value that will be added to, or subtracted from, an existing selector to produce the next one in line.
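A rough sketch of the protected mode version follows. The selector increment is passed in as a parameter because its value is defined by the operating system; the names here are hypothetical and the code only models the arithmetic, not any real allocation call.

```c
#include <assert.h>
#include <stdint.h>

struct sel_ptr {
    uint16_t sel;   /* selector for the 64K piece holding the byte */
    uint16_t off;   /* offset within that piece                    */
};

/* first_sel: selector of the first 64K piece of the huge object.
   sel_incr:  the OS defined distance between consecutive selectors
              (a parameter here because the value is OS specific). */
struct sel_ptr huge_index(uint16_t first_sel, uint32_t byte_index,
                          uint16_t sel_incr)
{
    struct sel_ptr p;

    assert(sel_incr != 0);
    p.sel = (uint16_t)(first_sel + (byte_index >> 16) * sel_incr);
    p.off = (uint16_t)(byte_index & 0xFFFFu);
    return p;
}
```

With an increment of 8 used purely as an example value, byte 70000 of an object whose first selector is 0100h lands in selector 0108h at offset 4464.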

Huge model programs pay a HORRIFIC speed penalty compared to plain large model programs which also pay a steep penalty compared to small data memory model programs. In real mode, constant reloading of segment registers is expensive, but in protected mode, loading a segment register is VERY expensive.

How data is declared and allocated can have a major effect on the size of the executables generated by compilers. In the previous example, the array of 500,000 characters was statically allocated. When this program was compiled and linked with the Microsoft version 8.0 compiler using the huge memory model, an executable over 500K bytes in size resulted.

Suppose that array was allocated at runtime using halloc() though, like this:

#include <stdio.h>
#include <malloc.h>
char *pBig1;
int main(void)
   {
   // use farmalloc(500000L) for Borland compiler
   pBig1 = halloc(500000L,1);
   return 0;
   }

Compiling the program with the array allocated at runtime produced an executable file that was only 3.5K in size! I'd say there's quite a difference between 500K and 3.5K, wouldn't you?

WARNING: Statically allocating a lot of data in programs, when dynamic allocation at runtime would have been sufficient, can cause the executable sizes to bloat out amazingly fast!!! If this practice can be avoided at all, then do so.

32 bit "Flat" model

On the 386 and later CPU's, the architecture has been extended to allow for segments being larger than the 64K the earlier CPU's were limited to. This can be good and it can be bad too. Along with the extended segment length capability, we also get a situation where all "near" pointers to memory are at least 4 bytes long.

Naturally this causes the data space needs of a 32 bit program to grow over what an equivalent tiny, small, or medium model program would have needed, where "near" pointers are 2 byte things.

In its native 32 bit mode, the 386 still has segment registers.

They still do the same things they did for 16 bit code. In fact "far" pointers still exist in the native 32 bit protected mode and they still have a segment part as well as an offset part. The segment part is still 2 bytes, and the offsets are 4 bytes.

A native "far" pointer on a 386 running in protected mode is a 6 byte variable.

The way a "flat" memory model is implemented in the native 32 bit environment is by making the CS, DS, SS, and ES registers all point to the same memory "segment". If this sounds a lot like a 16 bit tiny model program, it's because it is. In the "flat" model case, the "segment" can theoretically be as large as 4 gigabytes though. Segment "limits" as seen in 16 bit protected mode programs still exist and are enforced by the CPU's protection hardware. In the flat model, those limits would normally be quite large - up to 4G. Most operating systems implementing a flat memory model set the segment limits for applications to something less than the full 4G though. This allows them some room to map private code and data into an application's space, but prevents applications from tromping on that data, or directly calling routines the OS doesn't want applications to call. Where there's no memory protection available in tiny model DOS programs, a flat model will typically implement some sort of protection scheme at the 4K page level using the 386's page tables.

         +------------- data & code are the same memory "segment"
         V
+----------------+ <--- DS, ES, CS, SS
|    Codeseg1    |
+----------------+
|    Dataseg1    |
+----------------+ [Addresses are all 32 bit offsets]
|   Empty pages  |
+----------------+
|    Dataseg2    |
+----------------+
|    Codeseg2    |
+----------------+
|   Empty pages  |
+----------------+
|    Dataseg3    |
+----------------+
|      Stack     |
+----------------+ <--- Initial ESP value
|    MoreCode    |
+----------------+
|    MoreData    |
+----------------+
|      ....      |
+----------------+ <--- Theoretical 4G limit(or specific OS limit)

With the flat model, there's no requirement that all the 4K pages in an application's space be allocated. Some may be empty. Touching an unallocated page would normally generate a page fault in an application. Under Windows 3.X, this would typically result in the ubiquitous GPF. Windows NT, and 2.X and later versions of OS/2, allow apps to catch and handle page faults on their own if the app so desires.
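Since everything in the flat model is managed in 4K pages, it helps to be able to turn a 32 bit address into a page number and an offset within that page. This is a minimal sketch with made-up function names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE  4096u
#define PAGE_SHIFT 12

uint32_t page_number(uint32_t addr) { return addr >> PAGE_SHIFT; }
uint32_t page_offset(uint32_t addr) { return addr & (PAGE_SIZE - 1); }

/* Number of 4K pages touched by `len` bytes starting at `addr`. */
uint32_t pages_spanned(uint32_t addr, uint32_t len)
{
    assert(len > 0);
    assert(len - 1 <= UINT32_MAX - addr);   /* range must not wrap past 4G */
    return page_number(addr + (len - 1)) - page_number(addr) + 1;
}
```

A two byte item that starts at the last byte of a page touches two pages, so pages_spanned(0x0FFF, 2) returns 2. The 500,000 byte array from the huge model example would span 123 pages if it started on a page boundary.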

Some OS's implement a scheme called "guard pages". These are unallocated pages placed around other pages containing genuine data. Faulting on a guard page isn't always fatal. Usually an application is allowed to intercept these faults. Guard pages can be useful for implementing things like dynamically expandable stacks. The initial stack for an application may only be a page or two. As the stack grows and guard pages are hit, memory gets "committed" for the new stack growth, and the guard page for the stack gets slid back in memory.

How segment alignment attributes affect executable size

One of the things that can have an effect on the size of the executable files in large projects is the segment alignment type specified when the segments for the object modules are created by a compiler or assembler. Take this little C function for example:

int foo(int bar)
   {
   return bar + 1;
   }

Compiling this with the Microsoft version 8.0 compiler and taking all the defaults puts the code in a segment named "_TEXT". I ran the Borland TDUMP utility on the OBJ file the Microsoft compiler generated and saw this as part of the output:

000077 SEGDEF 1 : _TEXT WORD PUBLIC Class 'CODE' Length: 001a

Here we can see that the compiled code was generated into a segment with a WORD alignment attribute.

WORD alignment on a segment means that the linker will align every OBJ module linked with that segment name on a WORD boundary.

The Borland 4.52 compiler generated a different segment alignment attribute by default:

000081 SEGDEF 1 : _TEXT BYTE PUBLIC Class 'CODE' Length: 000b

The Borland compiler is specifying byte alignment by default for segments it's creating. Now, consider this little C program that calls an assembler function named FOO():

/* FOO.C */
extern void FOO(void);

int main(void)
   {
   FOO();
   return 0;
   }

The assembler code for FOO() looks like this:

_TEXT segment byte public 'CODE'
;_TEXT segment para public 'CODE'
_FOO  proc   near
      public _FOO
      ret        ; Do nothing - just return
_FOO  endp
_TEXT ends
      end

When I built a tiny model version of FOO.C using the Borland 4.52 compiler and linked it with the little assembler module, I got an executable COM file that was 5586 bytes long. Then I switched the segment declarations on the assembler module so they looked like this and rebuilt the program:

;_TEXT segment byte public 'CODE'
_TEXT segment para public 'CODE'

Now the FOO() function lives in a paragraph aligned segment. The executable COM file that resulted from building this version was 5602 bytes long.

The version that included the paragraph aligned segment puffed up by 16 bytes.

If we had a program with a substantial number of paragraph aligned modules being linked in, the size impact on the code could be significant.

On average, 8 bytes of padding would be inserted by the linker for each paragraph aligned module.
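The "8 bytes on average" figure can be checked with a little arithmetic. If the code ahead of a module is equally likely to end at any offset within a paragraph, the padding needed is anywhere from 0 to 15 bytes, which averages out to 7.5. This sketch uses invented function names, and the even distribution of ending offsets is an assumption.

```c
#include <assert.h>

/* Bytes a linker must insert so that `offset` lands on an `align` boundary. */
unsigned padding_for(unsigned long offset, unsigned align)
{
    assert(align != 0);
    return (unsigned)((align - offset % align) % align);
}

/* Average padding when the preceding module is equally likely to end at
   any offset within an alignment unit: (align - 1) / 2. */
double average_padding(unsigned align)
{
    unsigned long total = 0;
    unsigned off;

    assert(align != 0);
    for (off = 0; off < align; off++)
        total += padding_for(off, align);
    return (double)total / align;
}
```

average_padding(16) works out to 7.5 bytes for paragraph alignment, 1.5 bytes for DWORD alignment, 0.5 bytes for WORD alignment, and nothing at all for BYTE alignment.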

Remember back to the chapter where we examined the effects of the 486 and Pentium on chip CPU caches on performance? Getting hits in the on chip caches was a critical aspect of making those CPUs perform well.

Every byte of padding fluff injected by a linker into our code is going to contribute to having more "dead space" in the cache lines on 486's and Pentiums.

This is not to say we should always avoid paragraph or dword alignments in all modules of a program. Sometimes there will be performance gains to be had by paragraph aligning certain performance critical modules and functions. Done judiciously, the occasional paragraph alignment can assist in getting good hits in the on chip caches. Consider this little code fragment that is typical of many C/C++ function entry points:

;----------------------------------------
; The tail end of a low performance path
; function is just above foobar().
;----------------------------------------
foobar proc near
       push ebp    ;(1 byte) This instruction is on one cache line
;
; Paragraph alignment boundary happens to fall here between
; the "push ebp" and the "mov ebp,esp" instructions.
;
       mov ebp,esp ;(2 bytes) This instruction is on a different cache line
;
; The body of the foobar() function goes here.
;

Now, if the foobar() function entry point isn't paragraph aligned, as in this example, bad things happen to a 486's CPU cache every time foobar() is called. It potentially takes two 16 byte cache line loads just to get through the first two instructions of the foobar() function. What's worse, is that 15 bytes of the cache line that the push ebp instruction sits in are probably going to be wasted because they're the tail end of a low performance path function that's not likely to be executed nearly as often as foobar() might be.

In very high performance paths through the code, we'd like the cache lines involved to wind up packed with as much useful code as possible. Padding fluff from a linker hurts, as does missing a critical point where an alignment would help keep mostly dead space lines out of the cache as in the above example.

Remember - cache lines are a very scarce resource on both the 486 and Pentium.

486's have 512 of them, and they're combined instruction and data cache lines. The Pentium has 256 instruction cache lines, and 256 data cache lines. Pentium cache lines are 32 bytes. 486 cache lines are 16 bytes.
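Both the line counts and the cost of a badly placed entry point are simple divisions. The sketch below uses made-up function names, and the 8K cache sizes in the note after it are implied by the line counts above (512 lines of 16 bytes, or 256 lines of 32 bytes) rather than stated in this chapter.

```c
#include <assert.h>

/* How many cache lines a run of `len` bytes starting at `addr` occupies. */
unsigned long lines_touched(unsigned long addr, unsigned long len,
                            unsigned line_size)
{
    assert(len > 0 && line_size > 0);
    return (addr + len - 1) / line_size - addr / line_size + 1;
}

/* Total number of lines in a cache of `cache_bytes` bytes. */
unsigned long line_count(unsigned long cache_bytes, unsigned line_size)
{
    assert(line_size > 0);
    return cache_bytes / line_size;
}
```

The three bytes of "push ebp" and "mov ebp,esp" starting at offset 15 touch two 16 byte lines, where the same three bytes starting at offset 16 touch only one. line_count(8192, 16) gives the 486's 512 lines and line_count(8192, 32) gives 256 lines for each of the Pentium's two caches.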

Be aware that many 32 bit C/C++ compilers are prone to assigning a DWORD segment alignment type to the modules they produce when they're generating 32 bit "flat" code. Many small functions in different source files will cause, on average, about 2 bytes of padding for each module linked.

How packing code and segment alignment affects executable size

Most modern compilers come with a linker that has the ability to "pack" code and data segments together in the resultant executable file. When a linker packs code and data segments, it takes what would have otherwise been separate code and data segments and merges them together. To accomplish this, the linker adjusts offset references in the various modules so that they reference the packed conglomeration of segments rather than an individual one. For example, consider these two code segments:

;-------------------------------------------
; PACK.ASM - a code packing demonstration
;-------------------------------------------
foo   segment para public 'foocode'
      assume  cs:foo
bar   proc    near
      public  bar
      ret
bar   endp
foo   ends

foo2  segment para public 'foo2code'
      assume  cs:foo2
bar2  proc    near
      public  bar2
      ret
bar2  endp
foo2  ends
      end

If the foo and foo2 segments were combined together in a GROUP:

foogroup group foo,foo2

then no segment packing would be necessary. In effect, you have told the assembler to do the equivalent of code packing already. The foo and foo2 segments will be GROUP'ed (in essence "packed") together automatically.

When a linker packs together the foo segment and the foo2 segment, it's going to physically GROUP foo and foo2 together as if you had done this manually yourself.

I assembled the above code with the old IBM 2.0 version of Microsoft's Macro Assembler. This older assembler inserts less extraneous crud in an OBJ than current versions of most assemblers do. For this discussion's purposes, its output will be less confusing when we look at what was produced.

Running Borland's TDUMP utility on the resultant OBJ produced this result:

Turbo Dump Version 4.1 Copyright (c) 1988, 1994 Borland International
Display of File PACK.OBJ
000000 THEADR A
000006 LNAMES
Name 1: ''
Name 2: 'FOO'
Name 3: 'FOO2'
Name 4: 'FOO2CODE'
Name 5: 'FOOCODE'
000025 SEGDEF 1 : FOO PARA PUBLIC Class 'FOOCODE' Length: 0001
00002F SEGDEF 2 : FOO2 PARA PUBLIC Class 'FOO2CODE' Length: 0001
000039 LEDATA Segment: FOO Offset: 0000 Length: 0001
0000: C3 .
000041 LEDATA Segment: FOO2 Offset: 0000 Length: 0001
0000: C3 .
000049 PUBDEF 'BAR' Segment: FOO:0000
000056 PUBDEF 'BAR2' Segment: FOO2:0000
000064 MODEND

Notice the two "SEGDEF" lines in TDUMP's output. The assembler has indeed created two distinct code segments of one byte each.

Now, if we run Borland's TLINK linker on PACK.OBJ and tell it to produce a Windows DLL (this is TLINK's /Twd switch), the resultant DLL is 529 bytes in size:

PACK DLL 529 12-18-95 11:54a

Here's the MAP file that resulted from linking PACK.OBJ:

       Start Length Name Class
-----> 0001:0000 0001H FOO FOOCODE
-----> 0001:0010 0001H FOO2 FOO2CODE
       Warning: No automatic data segment

We can ignore that warning. We're not going to actually run this DLL. We're just using it as a vehicle to explore a linker's behavior.

The interesting thing here is that the FOO and FOO2 segments have been combined into a single segment by default with TLINK. Notice the lines the arrows are pointing to. The start segments are the same, and the FOO2 segment's starting offset isn't zero.
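The offsets in that MAP file are what you'd get by laying the two one byte segments end to end and rounding each start up to a paragraph boundary. Here's a sketch of that bookkeeping; pack_offsets is a hypothetical name, and this only models the arithmetic, not what TLINK actually does internally.

```c
#include <assert.h>
#include <stddef.h>

/* Computes the starting offset of each segment when they are packed into
   one physical segment, each aligned on `align` bytes.
   Returns the total packed length, or 0 if it would exceed 64K. */
unsigned long pack_offsets(const unsigned long *lengths, size_t count,
                           unsigned align, unsigned long *offsets)
{
    unsigned long pos = 0;
    size_t i;

    assert(align != 0);
    assert(count == 0 || (lengths != NULL && offsets != NULL));
    for (i = 0; i < count; i++) {
        pos = (pos + align - 1) / align * align;   /* round up to boundary */
        offsets[i] = pos;
        pos += lengths[i];
        if (pos > 65536ul)
            return 0;                              /* won't fit in 64K */
    }
    return pos;
}
```

Feeding it two segments of length 1 with an alignment of 16 yields offsets 0000h and 0010h, matching the FOO and FOO2 lines in the MAP file, for a packed length of 17 bytes.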

Using the Microsoft LINK linker produced virtually identical results -- it looks like LINK is packing code segments by default as well. LINK requires a DEF file to produce a Windows DLL though. I used this DEF file:

LIBRARY PACK.DLL
EXETYPE WINDOWS

I ran LINK with this command line:

LINK pack,pack.dll,pack,pack,pack.def;

This is the linker MAP file that LINK produced:

pack.dll
Start Length Name Class
0001:0000 00001H FOO FOOCODE
0001:0010 00001H FOO2 FOO2CODE

The Microsoft LINK linker also produced a 529 byte DLL as did Borland's TLINK.

Now let's see how specifying a "don't pack code" switch to the linker affects the size of this DLL and the resultant MAP file. To disable packing of code segments TLINK uses the /P- switch (for Microsoft linkers, this will be the /NOPACKCODE command line option). I linked PACK.OBJ with TLINK using this command line:

TLINK /Twd /P- PACK.OBJ

Some significant changes occurred when TLINK was told not to pack code segments. The DLL's executable size grew significantly.

PACK DLL 1,025 12-18-95 1:55p

The DLL grew by 496 bytes. It looks like TLINK is aligning segments within the DLL on 512 byte boundaries. We can verify this using Borland's TDUMP utility to dump out the contents of PACK.DLL. Here's a partial listing of what TDUMP produced when run on this version of the DLL:

Initial Stack Size 0000h ( 0. )
Segment count 0002h ( 2. )
Module reference count 0000h ( 0. )
Moveable Entry Point Count 0000h ( 0. )
File alignment unit size 0200h ( 512. )

There are two interesting items here. The first is the "Segment count" of 2, which verifies that TLINK did indeed keep the two code segments separate when run with the /P- switch. The second is the "File alignment unit size", which is 512 by default. This means every physical segment described in this DLL is going to be aligned on a 512 byte boundary within the DLL file.

WARNING: Having large alignment sizes can really bloat out the size of an EXE or DLL when multiple physical segments are in that EXE or DLL.

By the way, the Microsoft LINK linker also defaults to 512 byte segment alignment within an executable unless directed otherwise. It too produced a similarly bloated out DLL when run with its /NOPACKCODE switch.

There is a way to get around this horrendous size bloat when using options to not pack code. The file alignment size is adjustable on these linkers. Small powers of 2 are good numbers to start with -- like 16. For TLINK, this is the "/A=nn" switch. For LINK it's the "/A:nn" switch. Note the slight difference -- TLINK uses a "=" character as a separator, LINK uses a ":" character. Other than this minor syntactic difference, they both do the same thing -- alter that File Alignment Size that's causing the DLL to bloat up so badly.

Running TLINK with this command line (don't pack code, and align on 16 byte boundaries in the DLL):

TLINK /Twd /P- /A=16 pack

produced this DLL:

PACK DLL 273 12-18-95 2:19p

That was much nicer, wasn't it! The size of the DLL plunged from over 1K to 273 bytes. The MAP file produced by TLINK was as expected:

     Start Length Name Class
---> 0001:0000 0001H FOO FOOCODE
---> 0002:0000 0001H FOO2 FOO2CODE
     Warning: No automatic data segment

Notice the start points on the lines with the arrows. We do indeed have two physical segments in this DLL, and the first byte of the FOO2 segment is starting at offset 0 which we would expect when it isn't being "packed" in with other segments.

TIP: telling your linker to align segments on boundaries smaller than the default 512 in Windows (or 16 bit OS/2) programs can save a LOT on the size of an EXE or DLL.
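The three DLL sizes seen above (529, 1,025 and 273 bytes) are consistent with a very simple layout model: a header, followed by each physical segment starting on the next file alignment boundary. Here is a minimal Python sketch of that arithmetic. The header size of 250 bytes is an assumption picked to fit the observed sizes (any value from 241 to 256 gives the same answers); it was not measured from the DLLs, and the function names are made up for illustration.

```python
def align_up(offset, unit):
    """Round offset up to the next multiple of unit."""
    return (offset + unit - 1) // unit * unit


def file_size(header_bytes, segment_sizes, unit):
    """File size when every segment starts on a 'unit' byte boundary.

    The file ends at the last byte of the last segment; no trailing
    padding is added.
    """
    pos = header_bytes
    for size in segment_sizes:
        pos = align_up(pos, unit) + size
    return pos


# ASSUMPTION: header size inferred from the observed file sizes, not measured.
HEADER = 250

# Packed: FOO at offset 0 and FOO2 at offset 10h share one 11h byte segment.
packed_512 = file_size(HEADER, [0x11], 512)

# Not packed: two physical segments of 1 byte each.
unpacked_512 = file_size(HEADER, [1, 1], 512)
unpacked_16 = file_size(HEADER, [1, 1], 16)

print(packed_512, unpacked_512, unpacked_16)  # 529 1025 273
```

With 512 byte alignment, each extra physical segment can cost up to 511 bytes of padding. With 16 byte alignment the most it can cost is 15.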

How packing code affects working set

We've just seen how allowing a linker to "pack" code segments together can reduce the size of an executable file -- and that's a generally good thing. However, it can also have a downside. The segments that get packed together become part of a single physical segment after being packed together.

Remember back to an earlier chapter where segment swapping was discussed? When a byte in a segment is touched, the whole segment typically needs to be brought into physical memory. If the code segment packing is done indiscriminately, this can result in significant increases in working set for a program.

Consider this not uncommon scenario.

A program has a few routines (say maybe 10 or 20) where it spends the majority of its execution time according to the execution profiler.
These few routines are not terribly large -- maybe 10K out of perhaps 500K worth of code that comprises the application. The rest of the code is for infrequently used features and error handling code.
The data these few routines are processing is generally very large though -- on the order of hundreds of K or even megabytes.

Suppose these 20 routines, of maybe 500 bytes each, are randomly splattered around in the executable. Having them "packed" together with the less frequently used code could easily result in having one of these critical path routines wind up in each blob of code the linker has "packed" together into a single physical code segment. This means that for our 10K of "hot" code to be in memory and running, we could need virtually all of the 500K that makes up the application present in memory as well!

When the linker packs the 500K worth of application code together, only about 8 physical code segments of roughly 64K each are going to be produced as a result. All it would take is for one of the 20 critical routines to land in each of those roughly 64K sized groupings to cause every one of those blobs to be dragged into memory.

The net result of this scenario is that our hypothetical application could be consuming 400K+ more physical memory at runtime than it really needed to.
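To put rough numbers on this, here is a small Python sketch of the segment swapping case. It is a simplification: it assumes the linker cuts the 500K of code into physical segments at exact 64K boundaries, and the routine offsets are hypothetical values invented for the example, not taken from a real MAP file.

```python
SEGMENT = 64 * 1024          # maximum size of a 16 bit physical segment
CODE_SIZE = 500 * 1024       # total application code, as in the scenario


def resident_bytes(routine_offsets, code_size=CODE_SIZE, seg_size=SEGMENT):
    """Bytes that must be in memory when each touched segment loads whole."""
    touched = {off // seg_size for off in routine_offsets}
    total = 0
    for seg in touched:
        # The last segment may be shorter than 64K.
        total += min(seg_size, code_size - seg * seg_size)
    return total


# HYPOTHETICAL layout 1: 20 hot routines spread evenly, one every 25K.
scattered = [i * 25 * 1024 for i in range(20)]

# HYPOTHETICAL layout 2: the same 20 routines of 500 bytes placed back to back.
clustered = [i * 500 for i in range(20)]

print(resident_bytes(scattered) // 1024, "K resident when scattered")
print(resident_bytes(clustered) // 1024, "K resident when clustered")
```

In the scattered layout every one of the 8 segments gets touched, so the whole 500K has to be resident. Clustered, the same 10K of hot code needs only a single 64K segment.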

That's an extra 400K+ that the user could have used towards a bigger disk cache, or to allow the application to process larger sets of data before heavy swapping sets in and destroys performance (remember the MSPIGGY example from an earlier chapter?).

This scenario is bad, but not all that uncommon. If the OS is one that's capable of paging rather than segment swapping, the situation will get a little better. In a paging environment, the worst that can happen is that each of those twenty 500 byte routines straddles a page boundary. In that case, each routine would touch two 4K pages and cause 8K of physical memory to get burned at runtime. Even though that situation is better than the 400K+ that could happen in a segment swapping system (like with some 16 bit DOS extenders, Windows 3.X in "standard" mode, or OS/2 1.X), 20 times 8K is still 160K worst case. This is still a pretty horrible bloat factor for 20 routines that by rights should fit just fine in three 4K pages if the routines were placed adjacent to each other when the application was linked together.

This hypothetical application could see a working set savings of hundreds of kilobytes of physical memory simply by moving the critical routines around so they're closer together!

Remember, in a paging system, all it takes is to touch a single byte on a page for the whole 4K page to be dragged into physical memory. In a segment swapping system, all it takes is to touch a single byte in a segment to cause the whole segment to be dragged into physical memory.
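The paging arithmetic can be checked the same way. This Python sketch counts the distinct 4K pages touched by a set of routines. The two layouts are hypothetical ones constructed to hit the worst case (every routine straddles its own page boundary) and the best case (all routines adjacent); they are not from a real program.

```python
PAGE = 4096  # page size on the 386 and later


def pages_touched(start, size, page=PAGE):
    """Set of page numbers covered by a routine at 'start' of 'size' bytes."""
    first = start // page
    last = (start + size - 1) // page
    return set(range(first, last + 1))


def working_set(routines, page=PAGE):
    """Bytes of physical memory needed to hold every touched page."""
    touched = set()
    for start, size in routines:
        touched |= pages_touched(start, size, page)
    return len(touched) * page


# HYPOTHETICAL worst case: each 500 byte routine begins 250 bytes before a
# page boundary, and the routines are several pages apart from each other.
straddling = [(i * 6 * PAGE + PAGE - 250, 500) for i in range(20)]

# HYPOTHETICAL best case: the 20 routines are placed back to back.
adjacent = [(i * 500, 500) for i in range(20)]

print(working_set(straddling) // 1024, "K worst case")   # 160 K worst case
print(working_set(adjacent) // 1024, "K when adjacent")  # 12 K when adjacent
```

The gap between those two figures is pure placement. The code itself is identical in both layouts.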

How packing code affects speed

We've just seen how sloppy code packing can cause an application's working set to bloat out fairly quickly. That obviously has a negative effect on overall system performance. Packing code can also have positive effects if the routines and segments being packed together are critical high usage ones identified by a code profiler.

Packing code, when done correctly, can also have a latent advantage in 16 bit programs where FAR calls are being made. Normally a FAR call is going to result in a segment register getting reloaded. This is one of the more expensive operations you can do on an 80x86 processor when it's running in protected mode. These inter-segment FAR call instructions can be translated by many modern linkers into a sequence similar to this one:

push cs
call near WhereEver
nop

When the code segments get packed together, all of a sudden any FAR calls between them can be translated to the PUSH CS, NEAR CALL sequence because the target addresses are now within near call range. Doing this translation prevents the CS register from getting reloaded for the call.
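At the byte level, the translation works because both forms are exactly 5 bytes long. A direct far call is opcode 9Ah followed by a 16 bit offset and a 16 bit segment. The replacement is PUSH CS (0Eh), a near call (E8h plus a 16 bit relative displacement), and a NOP (90h). The Python sketch below patches a byte buffer the way a linker might. It is an illustration only, not how TLINK or LINK is actually implemented; the function name is invented, and it assumes the buffer index equals the offset within the packed segment.

```python
import struct

FAR_CALL = 0x9A   # CALL FAR ptr16:16
PUSH_CS = 0x0E    # PUSH CS
NEAR_CALL = 0xE8  # CALL rel16
NOP = 0x90        # NOP


def translate_far_call(code, site, target):
    """Rewrite the 5 byte far call at code[site] in place.

    'target' is the callee's offset inside the same physical segment.
    ASSUMPTION: index into 'code' == offset within the segment.
    """
    if code[site] != FAR_CALL:
        raise ValueError("no far call opcode at this offset")
    # The near call starts at site+1 and is 3 bytes long, so the
    # displacement is relative to site+4.
    rel16 = (target - (site + 4)) & 0xFFFF
    patch = bytes([PUSH_CS, NEAR_CALL]) + struct.pack("<H", rel16) + bytes([NOP])
    code[site:site + 5] = patch


# Example: a far call at offset 0100h to segment 0001h, offset 0200h.
segment = bytearray(0x300)
segment[0x100:0x105] = bytes([FAR_CALL, 0x00, 0x02, 0x01, 0x00])
translate_far_call(segment, 0x100, 0x0200)
print(segment[0x100:0x105].hex(" "))  # 0e e8 fc 00 90
```

The callee doesn't have to change. Its far return still pops an offset and a segment off the stack, and the PUSH CS supplied the segment.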

Recent versions of Borland's TLINK do this translation automatically for you when linking a program together. TLINK has an option to disable it if needed. Microsoft's LINK linker doesn't do the translations by default but has the /FA switch to enable it.

Allowing the linker to do far call translations is generally a good thing and you should do it if possible.

WARNING: A linker CAN be fooled if some data byte looks like a far call instruction opcode and it's followed by a relocation fixup that looks like a far call address. In a rare situation like this, the linker will destroy the code doing a far call translation. If you enable far call translations and your program stops working correctly, you may have one of these rare situations in your code.

tonyi@ibm.net - Shut up and jump!

Last modified on Sunday, Dec 20, 1998
This page produced the old fashioned way - with a text editor