Have you ever noticed that some applications with very large EXE files seem to load and get to their initial screen almost instantly? Then there are other similarly sized applications that sit there and grind for several seconds before showing their initial screen. Neither of those applications may be in a condition to let the user actually do anything with them for several seconds, but there's the perception of speed in one case and not in the other. The difference may simply be how the code was packaged when the application was compiled and linked.
We're going to cover the mechanics of how applications are constructed in this chapter. This may be something you never paid much attention to in the past. It is pretty easy to ignore the details of this stuff these days and just take all the defaults provided by whatever development tools you happen to be using. Often taking those defaults isn't going to be the best way though. Sometimes, those defaults can screw us really badly when we're looking to squeeze a few more bytes out of a program, or reduce its working set in memory.
In real mode, there are no hardware enforced restrictions on the range of bytes that can be accessed in a segment. In protected mode, a segment may have hardware enforced "limits" that restrict the range of valid addresses within the segment's potential 64K range. For the purposes of this discussion, a 386/486/Pentium running programs in V86 mode behaves just like real mode from a program's point of view. V86 mode is a special mode of operation where a 386 running in protected mode can mimic most of the behaviors of a 386 running in real mode. If you've ever run a "DOS session" under Windows 3.X in enhanced mode, or under OS/2, or under Windows 95, or Windows NT, then you were using that processor in V86 mode to run that DOS session. If you've ever used a DOS memory manager like EMM386, Quarterdeck's QEMM, Qualitas's 386Max, etc. then you were also running in V86 mode.
All of those memory managers actually run the CPU in protected mode with the paging hardware enabled. It's the paging hardware that enables them to do their tricks to create upper memory blocks (UMB's) that drivers and TSR's can be "loaded high" in.
Along with the physical manifestation of segments, there is the logical notion of a segment that assembler programmers are familiar with. The logical notion may or may not have a direct correspondence with the physical notion depending on how the code has been organized and packaged by compilers, assemblers, and linkers. A typical example of a "segment" from the logical perspective might look like this in assembler code:
foobar  segment para public 'code'
;
; Things in the "foobar" segment go here
;
foobar  ends
Logical segments like this always have a name of some sort - in this example it's "foobar". These names will be created by default when using a C/C++ compiler. Commonly used names in current compilers are things like "_TEXT", "_DATA", and "_BSS". If you tell whatever linker or development tool you're using to create a linker MAP file for a C/C++ program, you'll see a whole bunch of these logical segment names in the MAP file listing. For example, I compiled this empty C program using all the default code generation switches for the Microsoft version 8.0 C/C++ compiler. I added the /Fm switch to force a MAP file to be created because one isn't created by default.
/* EMPTY.C */
int main(void)
{
    return 0;
}
The first part of the MAP file that resulted looked like this:
 Start  Stop   Length Name                   Class
 00000H 00783H 00784H _TEXT                  CODE
 00790H 00791H 00002H EMULATOR_TEXT          CODE
 00792H 00792H 00000H C_ETEXT                ENDCODE
 007A0H 007A0H 00000H EMULATOR_DATA          FAR_DATA
 007A0H 007E1H 00042H NULL                   BEGDATA
 007E2H 00863H 00082H _DATA                  DATA
 00864H 00865H 00002H XIQC                   DATA
 00866H 00871H 0000CH DBDATA                 DATA
 00872H 0087FH 0000EH CDATA                  DATA
 00880H 00880H 00000H XIFB                   DATA
 00880H 00880H 00000H XIF                    DATA
 00880H 00880H 00000H XIFE                   DATA
 00880H 00880H 00000H XIB                    DATA
 00880H 00880H 00000H XI                     DATA
 00880H 00880H 00000H XIE                    DATA
 00880H 00880H 00000H XPB                    DATA
 00880H 00880H 00000H XP                     DATA
 00880H 00880H 00000H XPE                    DATA
 00880H 00880H 00000H XCB                    DATA
 00880H 00880H 00000H XC                     DATA
 00880H 00880H 00000H XCE                    DATA
 00880H 00880H 00000H XCFB                   DATA
 00880H 00880H 00000H XCFCRT                 DATA
 00880H 00880H 00000H XCF                    DATA
 00880H 00880H 00000H XCFE                   DATA
 00880H 00880H 00000H XIFCB                  DATA
 00880H 00880H 00000H XIFU                   DATA
 00880H 00880H 00000H XIFL                   DATA
 00880H 00880H 00000H XIFM                   DATA
 00880H 00880H 00000H XIFCE                  DATA
 00880H 00880H 00000H CONST                  CONST
 00880H 00887H 00008H HDR                    MSG
 00888H 0095DH 000D6H MSG                    MSG
 0095EH 0095FH 00002H PAD                    MSG
 00960H 00960H 00001H EPAD                   MSG
 00962H 00962H 00000H _BSS                   BSS
 00962H 00962H 00000H XOB                    BSS
 00962H 00962H 00000H XO                     BSS
 00962H 00962H 00000H XOE                    BSS
 00962H 00962H 00000H XOFB                   BSS
 00962H 00962H 00000H XOF                    BSS
 00962H 00962H 00000H XOFE                   BSS
 00970H 0116FH 00800H STACK                  STACK

 Origin   Group
 007A:0   DGROUP
Notice all the segment names under the "Name" column in the listing. There are quite a few of them. That's quite a lot of stuff for a program that does nothing but return a 0! Also notice that quite a few of those "segments" are actually zero bytes long.
There's nothing that says one of these logical segments needs to contain any data or code, and often they don't.
In fact, the following segment appears to be present simply as a marker to delineate the end of all the code in the program:
00792H 00792H 00000H C_ETEXT ENDCODE
See what came right after the "C_ETEXT" segment? It was this one:
007A0H 007A0H 00000H EMULATOR_DATA FAR_DATA
Now take a look at the last line in the listing there under the "Origin" column. We see that a group called "DGROUP" happens to start at that same location and there are all sorts of "segments" in this DGROUP group. This illustrates what the notion of a "group" is.
A "group" is a collection of one or more of these logical "segments". Some of those segments may contain nothing, others can contain actual code or data.
In 16 bit programs, there's one limitation on the size of any given "segment" - it can't be more than 64K in length. The limitation on a group is that the combined size of all the segments in that group can't total more than 64K.
It wouldn't be possible to "group" together two 40K segments. The total size of the group would be more than 64K. It would be permissible to group together two 16K segments like this:
Two 16K segments in a 32K group:
SEG1 - 16K
SEG2 - 16K
SEG1SEG2 group (size 32K)
There's no requirement that groups be exactly 64K.
Typically they're going to be something less than 64K. In fact, all of the segments in the DGROUP group from the EMPTY.C example program totaled far less than 64K. Subtracting the start of the DGROUP at 7A0h from where the "STACK" segment in that MAP file ends (116Fh) gives 2511, or 2512 bytes for the DGROUP group once the final byte is counted. The bulk of that is in the STACK segment which is 2K.
In the previous MAP file listing there was a column titled "Class".
A class name is a way to get segments with different names located next to each other in an executable.
Linkers will always place segments with identical class names next to each other, even when the segment names are different.
If a segment is standalone and not in any group, then the class name doesn't matter much.
Playing with segment names, groups, and class names is how we can make executable files smaller and/or tune them so they'll work better in virtual memory environments. Remember the chapter where the performance impacts of the 486 and Pentium's on chip caches were discussed?
Diddling around with groups, names, and class names is how we're going to help locate critical path code and data to make the most effective use of those CPU caches.
In tiny model, all of a program's initial code and data live within a single 64K physical memory segment.
There may be lots of logical segments in a tiny model program as well and there usually are if the program was generated by a C/C++ compiler. In the previous example, that MAP file was for a "small" model program, but a MAP file for a tiny model version would look very similar.
Tiny model assembler programs will often contain just one logical segment that holds all the program's code and data.
This is an example of the smallest possible tiny model DOS program. It would generate a ".COM" file that's one byte long and does nothing but return to DOS when it's executed.
tiny    segment para public 'code'
        assume  cs:tiny,ds:tiny,es:tiny,ss:tiny
        org     100h
start   proc    near
        ret                     ; Return to DOS
start   endp
tiny    ends
        end     start
In tiny model programs, the CS, DS, ES, and SS registers are normally all aimed at the same physical segment in memory, even though there may be many logical segments in the program.
+------------ data group & code group are the same
V
Codeseg1 <--- DS, ES, CS, SS at program startup
Dataseg1
Dataseg2
Codeseg2
Dataseg3
Empty space
Empty space
Empty space <--- Initial SP value
The key thing about the tiny model scheme is there can never be more than 64K-256 bytes of combined code and data in the program's executable, and the segments must all be in the same group.
When DOS loads a ".COM" program, DOS performs the following actions:
This scheme implies that the first byte in a genuine ".COM" program MUST BE AN EXECUTABLE INSTRUCTION - it can't be data because DOS is going to JMP to the first byte in the program.
The first instruction in many tiny model programs is a JMP that jumps around the data for the program - like this:
        jmp     initialize      ;<-- This is the first instruction in the program
foo1    dw      ?               ; This is the
foo2    dd      ?               ; data for the
bar1    db      ?               ; program
initialize label near
;
; The program's code goes here
;
Tiny model programs are not NORMALLY suitable for execution in protected mode environments.
With some trickery using segment aliases, the concept can be made to work, but no language products currently in existence allow the construction of a single segment protected mode C/C++ program.
If you were to build a DPMI (DOS protected mode interface) compliant program from scratch, not using a DOS extender, then tiny model protected mode programs are possible, and indeed this is basically what happens to a DPMI program when it first flips into protected mode.
For those readers interested in such things, there's a sample tiny model DPMI "hello world" program written in assembler in Appendix B in the back of the book (appndx_b.htm).
+------------------- code group
V
Codeseg1 <----- CS:
Codeseg2
Codeseg3 <----- Program entry point
+------------------- data group
V
Dataseg1 <----- DS: and SS:
Dataseg2
Stackseg <----- Initial SP value
There are many different ways to lay out the various data segments and stack segments in a small model program, so this diagram is just a basic conceptual model. Different C/C++ compiler vendors all lay things out a little differently in their implementations.
The common elements though generally are:
+-- code group1
V
Codeseg1 <--- CS when running code in this group
Codeseg2
Codeseg3
+-- code group2
V
Codeseg4 <--- CS when running code in this group
Codeseg5
Codeseg6 <--- Program entry point
+-- code group3
V
Codeseg7 <--- CS when running code in this group
Codeseg8
Codeseg9
+-- data group
V
Dataseg1 <--- DS and SS
Dataseg2
Stackseg <--- Initial SP value
There are many different ways to lay out the various data segments and stack
segments in a medium model program, so this diagram is just conceptual.
Different C/C++ compiler vendors all lay things out a little differently in
their implementations. However, the common elements generally are:
Compact model in most compilers also allows for a "near" heap as well as a far heap.
When it's practical to use it, near heap items are faster to access than normal data items in compact model.
With some compilers there's going to be a 64K limit on the amount of statically allocated data.
With some compilers there's going to be a 64K limit on the amount of statically allocated data in a large model program.
Huge model lifts this restriction on pointer wrapping by allowing for individual data items to be larger than 64K.
Take this C program for example:
#include <stdio.h>

char big1[500000L] = {0};

int main(void)
{
    /* sizeof yields a size_t, so cast it to match the %d format */
    printf("size of char * = %d\n", (int)sizeof(char *));
    printf("size of big1 array = %ld\n", (long)sizeof(big1));
    return 0;
}
This C program declares 500K of static data in a single array of characters. If segment wrapping were to occur as in the other "big data" memory models, any code trying to access beyond the first 64K of the array would fail.
Huge model solves this wrapping effect problem by doing what's known as pointer normalizations. When a pointer to a data item larger than 64K is incremented, the segment part of the pointer gets adjusted so that the offset doesn't wrap around. This is a very slow process though.
In protected mode huge model programs, the procedure is conceptually similar to real mode, but the pointer arithmetic procedure is somewhat different. In protected mode, "huge" data items result in a sequential group of selectors being allocated. The pointer arithmetic process in protected mode involves computing what the next selector in the "huge" object will be. Normally, this process is simplified somewhat by having an operating system defined constant value that will be added to, or subtracted from, an existing selector to produce the next one in line.
Huge model programs pay a HORRIFIC speed penalty compared to plain large model programs, which also pay a steep penalty compared to small data memory model programs. In real mode, constant reloading of segment registers is expensive, but in protected mode, loading a segment register is VERY expensive.
How data is declared and allocated can have a major effect on the size of
the executables generated by compilers. In the previous example, the array
of 500,000 characters was statically allocated. When this program was compiled
and linked with the Microsoft version 8.0 compiler using the huge memory
model an executable over 500K bytes in size resulted.

Suppose that array was allocated at runtime using halloc() though, like this:
#include <stdio.h>
#include <malloc.h>

char *pBig1;

int main(void)
{
    // use farmalloc(500000L) for Borland compiler
    pBig1 = halloc(500000L, 1);
    return 0;
}
Compiling the program with the array allocated at runtime produced an
executable file that was only 3.5K in size!
I'd say there's quite a difference between 500K and 3.5K, wouldn't you?
WARNING: Statically allocating a lot of data in programs, when dynamic allocation at runtime would have been sufficient, can cause the executable sizes to bloat out amazingly fast!!! If this practice can be avoided at all, then do so.
Naturally this causes the data space needs of a 32 bit program to grow over what an equivalent tiny/small/medium model program would have needed, where "near" pointers are 2 byte things.
In its native 32 bit mode, the 386 still has segment registers.
They still do the same things they did for 16 bit code. In fact "far" pointers still exist in the native 32 bit protected mode and they still have a segment part as well as an offset part. The segment part is still 2 bytes, and the offsets are 4 bytes.
A native "far" pointer on a 386 running in protected mode is a 6 byte variable.
The way a "flat" memory model is implemented in the native 32 bit environment is by making the CS, DS, SS, and ES registers all point to the same memory "segment". If this sounds a lot like a 16 bit tiny model program, it's because it is. In the "flat" model case, the "segment" can theoretically be as large as 4 gigabytes though. Segment "limits" as seen in 16 bit protected mode programs still exist and are enforced by the CPU's protection hardware. In the flat model, those limits would normally be quite large - up to 4G. Most operating systems implementing a flat memory model set the segment limits for applications to something less than the full 4G though. This allows them some room to map private code and data into an application's space, but prevents applications from tromping on that data, or directly calling routines the OS doesn't want applications to call. Where there's no memory protection available in tiny model DOS programs, a flat model will typically implement some sort of protection scheme at the 4K page level using the 386's page tables.
+------------- data & code are the same memory "segment"
V
+----------------+ <--- DS, ES, CS, SS
| Codeseg1 |
+----------------+
| Dataseg1 |
+----------------+ [Addresses are all 32 bit offsets]
| Empty pages |
+----------------+
| Dataseg2 |
+----------------+
| Codeseg2 |
+----------------+
| Empty pages |
+----------------+
| Dataseg3 |
+----------------+
| Stack |
+----------------+ <--- Initial ESP value
| MoreCode |
+----------------+
| MoreData |
+----------------+
| .... |
+----------------+ <--- Theoretical 4G limit(or specific OS limit)
With the flat model, there's no requirement that all the 4K pages in an application's space be allocated. Some may be empty. Touching an unallocated page would normally generate a page fault in an application. Under Windows 3.X, this would typically result in the ubiquitous GPF. Windows NT, and the 2.X and later versions of OS/2, allow apps to catch and handle page faults on their own if the app so desires.
Some OS's implement a scheme called "guard pages". These are unallocated pages placed around other pages containing genuine data. Faulting on a guard page isn't always fatal. Usually an application is allowed to intercept these faults. Guard pages can be useful for implementing things like dynamically expandable stacks. The initial stack for an application may only be a page or two. As the stack grows and guard pages are hit, memory gets "committed" for the new stack growth, and the guard page for the stack gets slid back in memory.
int foo(int bar) { return bar + 1; }
Compiling this with the Microsoft version 8.0 compiler and taking all the defaults puts the code in a segment named "_TEXT". I ran the Borland TDUMP utility on the OBJ file the Microsoft compiler generated and saw this as part of the output:
000077 SEGDEF 1 : _TEXT WORD PUBLIC Class 'CODE' Length: 001a
Here we can see that the compiled code was generated into a segment with a WORD alignment attribute.
WORD alignment on a segment means that the linker will align every OBJ module linked with that segment name on a WORD boundary.
The Borland 4.52 compiler generated a different segment alignment attribute by default:
000081 SEGDEF 1 : _TEXT BYTE PUBLIC Class 'CODE' Length: 000b
The Borland compiler is specifying byte alignment by default for segments it's creating. Now, consider this little C program that calls an assembler function named FOO():
/* FOO.C */
extern void FOO(void);

int main(void)
{
    FOO();
    return 0;
}
The assembler code for FOO() looks like this:
_TEXT   segment byte public 'CODE'
;_TEXT  segment para public 'CODE'
_FOO    proc    near
        public  _FOO
        ret                     ; Do nothing - just return
_FOO    endp
_TEXT   ends
        end
When I built a tiny model version of FOO.C using the Borland 4.52 compiler and linked it with the little assembler module, I got an executable COM file that was 5586 bytes long. Then I switched the segment declarations on the assembler module so they looked like this and rebuilt the program:
;_TEXT  segment byte public 'CODE'
_TEXT   segment para public 'CODE'
Now the FOO() function lives in a paragraph aligned segment. The executable COM file that resulted from building this version was 5602 bytes long.
The version that included the paragraph aligned segment puffed up by 16 bytes.
If we had a program with a substantial number of paragraph aligned modules being linked in, the size impact on the code could be significant.
On average, 8 bytes of padding would be inserted by the linker for each paragraph aligned module.
Remember back to the chapter where we examined the effects of the 486 and Pentium on chip CPU caches on performance? Getting hits in the on chip caches was a critical aspect of making those CPUs perform well.
Every byte of padding fluff injected by a linker into our code is going to contribute to having more "dead space" in the cache lines on 486's and Pentiums.
This is not to say we should always avoid paragraph or dword alignments in all modules of a program. Sometimes there will be performance gains to be had by paragraph aligning certain performance critical modules and functions. Done judiciously, the occasional paragraph alignment can assist in getting good hits in the on chip caches. Consider this little code fragment that is typical of many C/C++ function entry points:
;----------------------------------------
; The tail end of a low performance path
; function is just above foobar().
;----------------------------------------
foobar  proc    near
        push    ebp             ;(1 byte) This instruction is on one cache line
;
; Paragraph alignment boundary happens to fall here between
; the "push ebp" and the "mov ebp,esp" instructions.
;
        mov     ebp,esp         ;(2 bytes) This instruction is on a different cache line
;
; The body of the foobar() function goes here.
;
Now, if the foobar() function entry point isn't paragraph aligned, as in this example, bad things happen to a 486's CPU cache every time foobar() is called. It potentially takes two 16 byte cache line loads just to get through the first two instructions of the foobar() function. What's worse, is that 15 bytes of the cache line that the push ebp instruction sits in are probably going to be wasted because they're the tail end of a low performance path function that's not likely to be executed nearly as often as foobar() might be.
In very high performance paths through the code, we'd like the cache lines involved to wind up packed with as much useful code as possible. Padding fluff from a linker hurts, as does missing a critical point where an alignment would help keep mostly dead space lines out of the cache as in the above example.
Remember - cache lines are a very scarce resource on both the 486 and Pentium.
486's have 512 of them, and they're combined instruction and data cache lines. The Pentium has 256 instruction cache lines, and 256 data cache lines. Pentium cache lines are 32 bytes. 486 cache lines are 16 bytes.
Be aware that many 32 bit C/C++ compilers are prone to assigning a DWORD segment alignment type to the modules they produce when they're generating 32 bit "flat" code. Many small functions in different source files will cause, on average, a puff of about 2 bytes of padding for each module linked.
;-------------------------------------------
; PACK.ASM - a code packing demonstration
;-------------------------------------------
foo     segment para public 'foocode'
        assume  cs:foo
bar     proc    near
        public  bar
        ret
bar     endp
foo     ends

foo2    segment para public 'foo2code'
        assume  cs:foo2
bar2    proc    near
        public  bar2
        ret
bar2    endp
foo2    ends
        end
If the foo and foo2 segments were combined together in a GROUP:
foogroup group foo,foo2
then no segment packing would be necessary. In effect, you have told the assembler to do the equivalent of code packing already. The foo and foo2 segments will automatically be GROUP'ed (in essence "packed") together.
When a linker packs together the foo segment and the foo2 segment, it's going to physically GROUP foo and foo2 together as if you had done this manually yourself.
I assembled the above code with the old IBM 2.0 version of Microsoft's Macro Assembler. This older assembler inserts less extraneous crud in an OBJ than current versions of most assemblers do. For this discussion's purposes, its output will be less confusing when we look at what was produced.
Running Borland's TDUMP utility on the resultant OBJ produced this result:
Turbo Dump Version 4.1 Copyright (c) 1988, 1994 Borland International
Display of File PACK.OBJ

000000 THEADR A
000006 LNAMES
    Name 1: ''
    Name 2: 'FOO'
    Name 3: 'FOO2'
    Name 4: 'FOO2CODE'
    Name 5: 'FOOCODE'
000025 SEGDEF 1 : FOO PARA PUBLIC Class 'FOOCODE' Length: 0001
00002F SEGDEF 2 : FOO2 PARA PUBLIC Class 'FOO2CODE' Length: 0001
000039 LEDATA Segment: FOO Offset: 0000 Length: 0001
    0000: C3 .
000041 LEDATA Segment: FOO2 Offset: 0000 Length: 0001
    0000: C3 .
000049 PUBDEF 'BAR' Segment: FOO:0000
000056 PUBDEF 'BAR2' Segment: FOO2:0000
000064 MODEND
Notice the two "SEGDEF" lines in TDUMP's output. The assembler has indeed created two distinct code segments of one byte each.
Now, if we run Borland's TLINK linker on PACK.OBJ and tell it to produce a Windows DLL (this is TLINK's /Twd switch), the resultant DLL is 529 bytes in size:
PACK DLL 529 12-18-95 11:54a
Here's the MAP file that resulted from linking PACK.OBJ:
       Start     Length Name  Class
-----> 0001:0000 0001H  FOO   FOOCODE
-----> 0001:0010 0001H  FOO2  FOO2CODE

Warning: No automatic data segment
We can ignore that warning. We're not going to actually run this DLL. We're just using it as a vehicle to explore a linker's behavior.
The interesting thing here is that the FOO and FOO2 segments have been combined into a single segment by default with TLINK. Notice the lines the arrows are pointing to. The start segments are the same, and the FOO2 segment's starting offset isn't zero.
Using the Microsoft LINK linker produced virtually identical results -- it looks like LINK is packing code segments by default as well. LINK requires a DEF file to produce a Windows DLL though. I used this DEF file:
LIBRARY PACK.DLL
EXETYPE WINDOWS
I ran LINK with this command line:
LINK pack,pack.dll,pack,pack,pack.def;
This is the linker MAP file that LINK produced:
pack.dll

 Start     Length Name  Class
 0001:0000 00001H FOO   FOOCODE
 0001:0010 00001H FOO2  FOO2CODE
The Microsoft LINK linker also produced a 529 byte DLL as did Borland's TLINK.
Now let's see how specifying a "don't pack code" switch to the linker affects the size of this DLL and the resultant MAP file. To disable packing of code segments TLINK uses the /P- switch (for Microsoft linkers, this will be the /NOPACKCODE command line option). I linked PACK.OBJ with TLINK using this command line:
TLINK /Twd /P- PACK.OBJ
Some significant changes occurred when TLINK was told not to pack code segments. The DLL's executable size grew significantly.
PACK DLL 1,025 12-18-95 1:55p
The DLL grew by 496 bytes. It looks like TLINK is aligning segments within the DLL on 512 byte boundaries. We can verify this using Borland's TDUMP utility to dump out the contents of PACK.DLL. Here's a partial listing of what TDUMP produced when run on this version of the DLL:
Initial Stack Size          0000h ( 0. )
Segment count               0002h ( 2. )
Module reference count      0000h ( 0. )
Moveable Entry Point Count  0000h ( 0. )
File alignment unit size    0200h ( 512. )
There are two interesting items here. The first is the "Segment count" being 2. This verifies that TLINK did indeed keep the two code segments separate when run with the /P- switch. The other interesting item is the "File alignment unit size" being 512 by default. This means every physical segment described in this DLL is going to be aligned on a 512 byte boundary within the DLL file.
WARNING: Having large alignment sizes can really bloat out the size of an EXE or DLL when multiple physical segments are in that EXE or DLL.
By the way, the Microsoft LINK linker also defaults to 512 byte segment alignment within an executable unless directed otherwise. It too produced a similarly bloated out DLL when run with its /NOPACKCODE switch.
There is a way to get around this horrendous size bloat when using options to not pack code. The file alignment size is adjustable on these linkers. Small powers of 2 are good numbers to start with -- like 16. For TLINK, this is the "/A=nn" switch. For LINK it's the "/A:nn" switch. Note the slight difference -- TLINK uses a "=" character as a separator, LINK uses a ":" character. Other than this minor syntactic difference, they both do the same thing -- alter that File Alignment Size that's causing the DLL to bloat up so badly.
Running TLINK with this command line (don't pack code, and align on 16 byte boundaries in the DLL):
TLINK /Twd /P- /A=16 pack
produced this DLL:
PACK DLL 273 12-18-95 2:19p
That was much nicer wasn't it! The size of the DLL plunged from
over 1K to 273 bytes. The MAP file produced by TLINK was as
expected:
     Start     Length Name  Class
---> 0001:0000 0001H  FOO   FOOCODE
---> 0002:0000 0001H  FOO2  FOO2CODE

Warning: No automatic data segment
Notice the start points on the lines with the arrows. We do indeed have two physical segments in this DLL, and the first byte of the FOO2 segment is starting at offset 0 which we would expect when it isn't being "packed" in with other segments.
TIP: telling your linker to align segments on boundaries smaller than the default 512 in Windows (or 16 bit OS/2) programs can save a LOT on the size of an EXE or DLL.
Remember back to an earlier chapter where segment swapping was discussed? When a byte in a segment is touched, the whole segment typically needs to be brought into physical memory. If the code segment packing is done indiscriminately, this can result in significant increases in working set for a program.
Consider this not uncommon scenario.
Suppose these 20 routines, of maybe 500 bytes each, are randomly splattered around in the executable. Having them "packed" together with the less frequently used code could easily result in having one of these critical path routines wind up in each blob of code the linker has "packed" together into a single physical code segment. This means that for our 10K of "hot" code to be in memory and running, we could need virtually all of the 500K that makes up the application present in memory as well!
When the linker packs the 500K worth of application code together, there's only going to be about 8 approximately 64K sized physical code segments produced as a result. All it would take is for one of the 20 critical routines to be in any one of those approximately 64K sized groupings to cause the whole approximately 64K blob to be dragged into memory.
The net result of this scenario is that our hypothetical application could be consuming 400K+ more physical memory at runtime than it really needed to.
That's an extra 400K+ that the user could have used towards a bigger disk cache, or to allow the application to process larger sets of data before heavy swapping sets in and destroys performance (remember the MSPIGGY example from an earlier chapter?).
This scenario is bad, but not all that uncommon. If the OS is one that's capable of paging rather than segment swapping the situation will get a little better. In a paging environment, the worst that can happen is that each of those twenty 500 byte routines straddles a page boundary. In that case, each routine would cause 8K of physical memory to get burned at runtime. Even though that situation is better than 400K+ that could happen in a segment swapping system (like with some 16 bit DOS extenders, Windows 3.X in "standard" mode, or OS/2 1.X), 20 times 8K is still 160K worst case. This is still a pretty horrible bloat factor for 20 routines that by rights should fit just fine in three 4K pages if the routines were placed adjacent to each other when the application was linked together.
This hypothetical application could see a working set savings of hundreds of kilobytes of physical memory simply by moving the critical routines around so they're closer together!
Remember, in a paging system, all it takes is to touch a single byte on a page for the whole 4K page to be dragged into physical memory. In a segment swapping system, all it takes is to touch a single byte in a segment to cause the whole segment to be dragged into physical memory.
Packing code, when done correctly, can also have a latent advantage in 16 bit programs where FAR calls are being made. Normally a FAR call is going to result in a segment register getting reloaded. This is one of the more expensive operations you can do on an 80x86 processor when it's running in protected mode. These inter-segment FAR call instructions can be translated by many modern linkers into a sequence similar to this one:
        push    cs
        call    near WhereEver
        nop
When the code segments get packed together, all of a sudden any FAR call's between them can be translated to the PUSH CS, NEAR CALL sequence because the target addresses are now within near call range. Doing this translation prevents the CS register from getting reloaded for the call.
Recent versions of Borland's TLINK do this translation automatically for you when linking a program together. TLINK has an option to disable it if needed. Microsoft's LINK linker doesn't do the translations by default but has the /FA switch to enable it.
Allowing the linker to do far call translations is generally a good thing and you should do it if possible.
WARNING: A linker CAN be fooled if some data byte looks like a far call instruction opcode and it's followed by a relocation fixup that looks like a far call address. In a rare situation like this, the linker will destroy the code doing a far call translation. If you enable far call translations and your program stops working correctly, you may have one of these rare situations in your code.
tonyi@ibm.net
- Shut up and jump!