Chapter 2 - Working with C/C++ Compilers

Is newer better?
Know your tools
Use The Force
Compiler Extensions

There are lots of different compilers playing in the PC field, and many different versions of each as well. There's Microsoft, Borland, Watcom, MetaWare, Zortech (now Symantec), Lattice, GNU, etc., ad nauseam.

Which one is the best for your project? The answer may well be several different ones, and possibly not even the most current version of any particular one!

Is newer better? (maybe not in every case)

Let me give you a concrete real life example of what I mean here. I'll use the evolution of the Turbo/Borland C/C++ compilers as a demonstration. Every version of Turbo/Borland C/C++ since Turbo C 1.0 has included a little utility called TOUCH. I'm sure you're all familiar with what TOUCH does - i.e. not much! It just updates the date and time on a file to be whatever the current date/time on the system happens to be.

TOUCH utility sizes

        Compiler                   Size in bytes
        ------------------------   -------------
        Turbo C 1.0/1.5/2.0                 3992
        Turbo C++ 1.0                       5118
        Borland C++ 2.0/3.0                 5124
        Borland C++ 3.1                     5528
        Borland C++ 1.5 (OS/2)             15872
        Watcom C++ 10.0                    44033

The 1987 vintage TOUCH took up only 3992 bytes. Given what TOUCH does, and the almost 4K size, it's pretty safe to assume that the Turbo C TOUCH utility wasn't written in assembler, because an assembler version of TOUCH would only be a few hundred bytes. So, TOUCH is probably a tiny model C program.

The Turbo C version 1.5 and version 2.0 TOUCH's look to be identical to the Turbo C 1.0 version for the most part; their size was identical.

Enter Turbo C++ version 1.0 though. Now we see a big change in the little TOUCH utility! The Turbo C++ 1.0 TOUCH has grown by about 1K. Do you suppose there could be 1K worth of new function or bug fixes in this TOUCH? Not likely. A more plausible scenario is that someone updated the version of the compiler TOUCH was being built with and didn't pay attention to the code growth when it happened.

The Turbo C++ 1.0 version of TOUCH has also managed to blow past the 2K and 4K cluster boundaries that the previous version was sliding under by 104 bytes! Hmmmm....

Now along comes Borland C++ version 2.0, and we see that TOUCH has managed to grow some again. Not too bad this time though, only 6 bytes more for the 1991 vintage TOUCH. The TOUCH that comes with the Borland C++ version 3.0 compiler is the same size as this one too. Then along came Borland C++ version 3.1, and we see a different, bigger TOUCH again. This one has grown by just over 400 bytes. Could there possibly be 400 bytes worth of new function and/or bug fixes in this version of TOUCH? Again, not likely.

The TOUCH that comes with the OS/2 Borland C++ version 1.5 compiler is a real shocker. It makes the DOS/Windows compiler versions all look anemic by comparison.

An even worse shocker is the TOUCH that comes with the Watcom version 10.0 C/C++ compiler -- 44,033 bytes. To be fair to the Watcom folks, their WTOUCH does do a lot more than the Borland DOS or OS/2 versions. It's also a "bound" 16 bit OS/2 1.X compatible executable (which means the same executable can run under OS/2 protected mode or DOS). The penalty for having this dual-mode capability in a program normally runs between 12K and 20K depending on how much of the OS/2 dual mode library is being dragged in.

So, what the heck was going on with this DOS version of the Borland TOUCH utility that caused it to grow from its original 1987 incarnation at 3992 bytes to the 5528 byte 1992 edition? It's rather unlikely that anyone at Borland spent a lot of effort doing major code modifications to the DOS version of TOUCH. After all, TOUCH isn't the kind of utility that requires a lot of maintenance work once it's up and running. It's also unlikely that there could be enough operational code in the source for TOUCH that code generation differences in the different compilers would account for the difference either.

The difference has to be in runtime library and C startup code differences between the different versions of the compiler. Later versions of the Borland compiler are clearly carrying around more baggage than the earlier versions were.

When we're looking to shrink down some program, this knowledge is useful. Blindly using the latest version of some compiler may well hurt rather than help us. If some utility or component of a system is working OK with an older revision of a compiler, and that version is producing smaller executables, then maybe it pays to keep the older version of the compiler around for building those components that will benefit from it.

There are also some testing impacts to consider here. If, by using an older version of a compiler, a particular component of a system stays the same from release to release, the test folks will have a lot warmer and fuzzier feeling about it. Barring bugs and/or behavior changes in new operating systems, that particular component is always going to behave the same for you.

Balancing off the testing advantage is whatever problems are involved in keeping different levels of compilers around in whatever build system you've got. These aren't insurmountable problems, and they're a one-time cost too. Once you get the multiple compiler environment set up, it's not a big deal.

By the way, I've got a whole raft of C/C++ compilers from several vendors on hand, and I often find myself going back to the 1988 vintage Turbo C 2.0 compiler for doing DOS things because it produces the smallest EXE's in many cases.

If you've got code that can be compiled by compilers from several different vendors, or with different versions of a compiler, try them all out and see which one works best for particular components of whatever you're working on.


Know your tools and what they're capable of

Most C/C++ compilers these days have all sorts of code generation options. These can often have a huge impact on the size of the code they're going to generate. Take for example the ability to "inline" some of the string functions like strcpy(), strcmp(), etc.

Inlined string functions are faster, and you should use inlining in modules that are in speed critical areas of your code. However, you may not need it everywhere.

Some compilers offer the ability to "inline" specific functions, and not inline others. For example, later versions of the Borland compilers have a special compiler "pragma" for this.


#pragma intrinsic

This pragma controls the inlining of specific functions within a particular module. Let's analyze a little example using the Borland compiler and see how big a difference this stuff can make when it's applied.

Here's a little C function that uses the strcpy() function in a couple of places. In one place we've got a fast, high performance path, in another we've got a slow path.

/*----------------------------------------------------------
   Sample function to demonstrate the effects of inlining
   of string functions.
----------------------------------------------------------*/
#include <string.h>

char s1[10], s2[10];

void foo(int fast)
        {
        if (fast)
                {
                /* #pragma turns on inlining for strcpy() */
                #pragma intrinsic strcpy
                strcpy(s1,s2);
                }
        else
                {
                /* #pragma turns off inlining for strcpy() */
                #pragma intrinsic -strcpy
                strcpy(s2,s1);
                }
        }

I compiled this test function with the Borland C++ version 3.1 compiler with the -c (compile only) and -S (generate an ASM file) options. Here's the relevant code that was generated. I've added some comments by the generated code that contain the byte counts for the code that was produced for the two different calls to strcpy().

_TEXT   segment byte public 'CODE'
        ;
        ; void foo(int fast)
        ;
        assume  cs:_TEXT
_foo    proc    near
        push    bp
        mov     bp,sp
        push    si
        push    di
        ;
        ;       {
        ;       if (fast)
        ;
        cmp     word ptr [bp+4],0
        je      short @1@86
        ;
        ;       {
        ;#pragma intrinsic strcpy
        ;      strcpy(s1,s2);
        ;
        mov     si,offset DGROUP:_s1    (3 bytes)
        mov     di,offset DGROUP:_s2    (3 bytes)
        push    ds                      (1 byte )
        pop     es                      (1 byte )
        xor     ax,ax                   (2 bytes)
        mov     cx,-1                   (3 bytes)
        repnz   scasb                   (2 bytes)
        not     cx                      (2 bytes)
        sub     di,cx                   (2 bytes)
        shr     cx,1                    (2 bytes)
        xchg    si,di                   (2 bytes)
        mov     ax,ds                   (2 bytes)
        mov     bx,ax                   (2 bytes)
        mov     ax,es                   (2 bytes)
        mov     ds,ax                   (2 bytes)
        mov     es,bx                   (2 bytes)
        rep     movsw                   (2 bytes)
        adc     cx,cx                   (2 bytes)
        rep     movsb                   (2 bytes)
        ;                       Total     (39 bytes)
        ;       }
        ;
        jmp     short @1@114
@1@86:
        ;
        ;   else
        ;      {
        ;#pragma intrinsic -strcpy
        ;      strcpy(s2,s1);
        ;
        mov     ax,offset DGROUP:_s1    (3 bytes )
        push    ax                      (1 byte  )
        mov     ax,offset DGROUP:_s2    (3 bytes )
        push    ax                      (1 byte  )
        call    near ptr _strcpy        (3 bytes )
        pop     cx                      (1 byte  )
        pop     cx                      (1 byte  )
        ;                       Total     (13 bytes)
@1@114:
        ;
        ;       }
        ;       }
        ;
        pop     di
        pop     si
        pop     bp
        ret
_foo    endp
_TEXT   ends

Wow! There was a 26 byte difference between inlining strcpy() and not inlining it. If your code had 100 calls to strcpy() scattered around in various places and they were all inlined, it would cost an extra 2.6K in code size. 2.6K is a lot of code. It might be the difference between being able to keep something as "small" or "tiny" model or being forced into "medium" model with multiple code segments.

So, don't just blindly throw a global switch on your compiler to inline things unless they really do need to be inlined for speed. Doing this kind of stuff globally can really pork out the code.


Use "the force" Luke!

In this case, "the force" is a profiler or performance analysis tool. Run the code under a profiler in what you expect to be common user scenarios and let the tool tell you where the performance hot spots are in the code. In those sections you may want to pay the price for inlining. What the profiler is probably going to tell you is that there's a fairly limited number of routines in the code where all the hot action is occurring. Everywhere else just won't matter in terms of speed.

The profiler may even tell you that the hot spots in the code aren't CPU bound at all; rather, they're I/O bound or dependent on some operating system supplied system call. If the profiling tool reveals that 90% of the program's time is spent in the C compiler's fwrite() routine or in some system call that draws arcs in an OS/2 or Windows window, then inlining things isn't going to buy you much at all because your code isn't the speed bottleneck.
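If you don't have a real profiler handy, even the standard C clock() function can confirm whether a suspect routine matters at all. Here's a hypothetical poor-man's timing harness (the function names are made up for illustration); it's no substitute for a proper profiling tool, which gives a per-function breakdown:

```c
/* Poor-man's timing harness built on the standard C clock()
   function: time a routine over many repetitions.  Useful only
   as a quick sanity check on whether a routine is hot at all. */
#include <time.h>

double seconds_for(void (*fn)(void), long reps)
{
    clock_t start = clock();
    long i;

    for (i = 0; i < reps; i++)
        fn();

    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* stand-in for some routine under suspicion */
static void suspect_routine(void) { }
```

If seconds_for(suspect_routine, ...) barely registers while the whole program is slow, the time is going somewhere else, likely I/O or system calls, and inlining that routine's string functions won't help.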


Sometimes inlining is smaller as well as faster!

I just got done telling you why you might not want to inline strcpy() in many cases. Well, there's a fairly common exception worth noting here.

When the compiler knows the length of the source string, it might generate smaller code inline than it would by calling the C runtime library to do the job.

In the previous example the compiler couldn't know the length of the source string at compile time because it was a variable. Suppose though it's a string constant? Let's check out how the Borland version 3.1 C++ compiler behaves in a case like this one:

        /*----------------------------------------------------------
          Sample function to demonstrate the effects of inlining
          of string functions.
        ----------------------------------------------------------*/
        #include <string.h>

        char s1[10];

        void foo(void)
                {
                #pragma intrinsic strcpy
                strcpy(s1," ");   /* this one is inlined */
                #pragma intrinsic -strcpy
                strcpy(s1," ");   /* this one calls the runtime library */
                }

In this case, the length of the source string can be determined at compile time because the source is a constant value. This allowed the compiler to generate a lot better code for the intrinsic version than in the previous example. Here's the code that got generated for this experiment:

_foo    proc    near
        push    bp
        mov     bp,sp
        push    si
        push    di
   ;
   ;       {
   ;    #pragma intrinsic strcpy
   ;       strcpy(s1," ");
   ;
        mov     di,offset DGROUP:_s1       (3 bytes )
        mov     si,offset DGROUP:s@        (3 bytes )
        push    ds                         (1 byte  )
        pop     es                         (1 byte  )
        mov     cx,1                       (3 bytes )
        rep     movsw                      (2 bytes )
                                     Total (13 bytes)
   ;
   ;    #pragma intrinsic -strcpy
   ;       strcpy(s1," ");
   ;
        mov     ax,offset DGROUP:s@+2      (3 bytes )
        push    ax                         (1 byte  )
        mov     ax,offset DGROUP:_s1       (3 bytes )
        push    ax                         (1 byte  )
        call    near ptr _strcpy           (3 bytes )
        pop     cx                         (1 byte  )
        pop     cx                         (1 byte  )
   ;                                 Total (13 bytes)
   ;       }
   ;
        pop     di
        pop     si
        pop     bp
        ret
_foo    endp

In this case both versions turned out to be the same at 13 bytes. Clearly, the way to go in a situation like this one is to inline the strcpy(). Whenever you have identical sized code sequences where one is faster than the other, go for the speed because it's free.

Suppose we're generating code for an 80186 or better CPU here and can tell the compiler to use the added instructions those CPU's implement? In that case, the compiler generated this code for the call to the runtime library:

;
;    #pragma intrinsic -strcpy
;       strcpy(s1," ");
;
        push    offset DGROUP:s@+2      (3 bytes )
        push    offset DGROUP:_s1       (3 bytes )
        call    near ptr _strcpy        (3 bytes )
        add     sp,4                    (3 bytes )
                                Total   (12 bytes)

Interestingly, here the compiler was smart enough to use the PUSH immediate instruction, but chose to use a 1 byte larger ADD SP,4 to clean the parameters off the stack rather than the two POP CX's that it used when generating 8086 code.

Compiler specific extensions

Many C/C++ compilers have implementation specific extensions that can be valuable for speeding up and shrinking code. Naturally using any of these makes the code less portable than using more standard features.

Is the portability trade off worth it? That depends on the application and its intended use. If you're writing a TSR or DOS device driver, then it's a fair assumption that the code isn't likely to be ported to a mainframe or some machine with a CPU that isn't 80x86 compatible.

One approach to the portability problem is to write and debug the code using fairly portable techniques and then #ifdef sections for a specific compiler. Within the 80x86 compiler world, most vendors implement a set of runtime library functions called int86() and int86x() for low level OS and BIOS interfacing. Among the compilers that do implement these functions, their behavior is usually quite similar -- often being identical. For all practical purposes, these two functions are "standard" among 16 bit C/C++ compilers for the PC. They have no analog outside the PC world of course, but within it we can usually count on them being present.

Inline assembler code

Suppose we had a C program that detects the presence of an XMS driver (like HIMEM.SYS). A "standard" approach to accomplishing this using a 16 bit C/C++ compiler for the PC might go like this:

        /*------------------------------------------
          DETECT.C

          Detect an XMS driver in a "standard" way
          (for the PC world at least).

          This source code will compile and run OK
          using many different C/C++ compilers for
          the PC.  I tried it with the Lattice 3.0
          compiler, Turbo C 2.0, Borland C++ 3.1,
          Borland C++ 4.52, Microsoft C++ 8.0, and
          Watcom C++ 10.0.  All work fine with the
          same source code.
        ------------------------------------------*/
        #include <stdio.h>
        #include <dos.h>

        int main()
           {
           union REGS regs;
           int   rval;

           regs.x.ax = 0x4300;
           int86(0x2F, &regs, &regs);
           if (regs.h.al == (char)0x80)
              {
              puts("XMS driver is present");
              rval = 0;
              }
           else
              {
              puts("No XMS driver present");
              rval = -1;
              }
           return rval;
           }

Compiling the code using small model with the old Borland Turbo C 2.0 compiler yielded an EXE that was 4232 bytes in size. However, the Borland compiler has some nice extensions that can make the code smaller and faster than using the REGS union and the int86() runtime library call. Turbo C has "pseudo register" variables that represent the CPU's registers and the ability to generate an interrupt call inline. Recoding DETECT to take advantage of those implementation specific features would look like this:

#include <stdio.h>
#include <dos.h>

int main(void)
   {
   union REGS regs;
   int   rval;

#ifdef __TURBOC__
   _AX = 0x4300;
   geninterrupt(0x2F);
   if (_AL == (char)0x80)
#else
   regs.x.ax = 0x4300;
   int86(0x2F, &regs, &regs);
   if (regs.h.al == (char)0x80)
#endif
      {
      puts("XMS driver is present");
      rval = 0;
      }
   else
      {
      puts("No XMS driver present");
      rval = -1;
      }
   return rval;
   }

This version of DETECT uses both the register variables and the inline generation of a software interrupt call when it's being compiled by one of the Borland compilers (the Borland compilers all predefine the symbol __TURBOC__).

Some nice things happened when special casing the Borland compiler for this program. The EXE that resulted dropped from 4232 bytes down to 3992 bytes -- a 240 byte decrease. Notice that this happens to be enough of a reduction that disk clusters will be saved.

Running the DSPACE utility (from chapter 1) on the original 4232 byte EXE gave this output:

        File [detect.exe] is 4232 bytes long
        Cutting   136 bytes saves a 4K cluster
        Cutting   136 bytes saves a 2K cluster
        Cutting   136 bytes saves a 1K cluster
        Cutting   136 bytes saves a 512 byte cluster

All we needed to save a 4K, 2K, 1K, and 512 byte cluster on DETECT was 136 bytes. Special casing the Borland compiler got 240 bytes, so in this case making that small change was a real winner when DETECT was compiled with Turbo C 2.0 -- and the code is still going to compile and run OK on all those other various compilers too because the change was #ifdef'd into the code.
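The arithmetic behind DSPACE's report is just a remainder calculation. Here's a sketch of it (the real DSPACE from chapter 1 may do more; this is illustrative and the function name is made up):

```c
/* How many bytes a file pokes into its final cluster -- i.e.
   how many bytes you'd have to cut to free that cluster.  A
   result of 0 means the file already ends exactly on a cluster
   boundary.  Sketch of the arithmetic behind DSPACE's report. */
long bytes_to_free_cluster(long filesize, long cluster_size)
{
    return filesize % cluster_size;
}
```

For the 4232 byte DETECT, the remainder works out to 136 for 512 byte, 1K, 2K, and 4K clusters alike (4232 = 4096 + 136, and 4096 is a multiple of all four cluster sizes), which is exactly what DSPACE reported.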

Obviously the majority of the 240 byte savings from special casing of Turbo C comes from not having the int86() function linked into the program anymore. However, some of it comes from an improvement in the generated code as well.

Using the standard int86() method would require the compiler to generate an assignment to a variable in memory, pass several parameters to int86(), and then generate a comparison with a memory variable to check the result.

The special case version will generate a much simpler set of instructions. The assignment statement "_AX = 0x4300" translates directly into a "MOV AX,4300h" instruction. The "geninterrupt(0x2F)" statement translates into an "INT 2Fh" instruction, and the "if (_AL == 0x80)" test translates to a "CMP AL,80h" instruction.

The older Microsoft C/C++ compilers don't implement pseudo register variables and the ability to inline a software interrupt call. However, they do implement the ability to do inline assembler. Extending the DETECT program to special case the Microsoft product as well as the Borland would look something like this:

#include <stdio.h>
#include <dos.h>

int main(void)
   {
   union REGS regs;
   int   rval;
#ifdef _MSC_VER
   char ALreturn;
#endif

#ifdef __TURBOC__
   _AX = 0x4300;
   geninterrupt(0x2F);
   if (_AL == (char)0x80)
#else
   #ifdef _MSC_VER
      __asm mov   ax,4300h
      __asm int   2Fh
      __asm mov   ALreturn,al
      if (ALreturn == (char)0x80)
   #else
      regs.x.ax = 0x4300;
      int86(0x2F, &regs, &regs);
      if (regs.h.al == (char)0x80)
   #endif
#endif
      {
      puts("XMS driver is present");
      rval = 0;
      }
   else
      {
      puts("No XMS driver present");
      rval = -1;
      }
   return rval;
   }

Current versions of the Microsoft compiler predefine the "_MSC_VER" macro, so I've keyed the special case code off that symbol. Before the special casing for the Microsoft compiler the EXE that the version 8.0 compiler produced was 5939 bytes in size. After the special casing the size of the EXE had dropped to 5795 -- a 144 byte reduction. Note that we had to introduce an intermediate variable here because the Microsoft compiler doesn't implement the pseudo-registers the way the Borland compiler does.

The Watcom C/C++ compiler's approach to inline assembler is quite a bit different from the Microsoft compiler's. Watcom has you define a quasi-function called a "code burst". The code burst is then expanded inline wherever it is "called" in the code.

The advantage of Watcom's code burst scheme is that you get to define the behavior of the inline code with respect to the registers it may destroy, and the register(s) it passes back return values in.

This allows the optimizer to know a lot more about how the section of inline code behaves, so this scheme should allow the optimizer to produce better code. The Microsoft and Borland compilers will disable certain optimizations in functions containing inline assembler code because they don't have any knowledge about the side effects from that code.

Using all the default options and compiling with small model, the Watcom 10.0 compiler produced a 5756 byte executable for the DETECT program. Now here's a version of DETECT that incorporates some conditionalized code to special case the Watcom compiler as well as the Borland and Microsoft compilers:

#include <stdio.h>
#include <dos.h>

#ifdef __WATCOMC__
extern char XMSpresent(void);
#pragma aux XMSpresent = \
        "mov ax,4300h"   \
        "int 2Fh"        \
        value   [al]     \
        modify  [ax];
#endif

int main(void)
   {
   union REGS regs;
   int   rval;
#ifdef _MSC_VER
   char ALreturn;
#endif

#ifdef __TURBOC__
   _AX = 0x4300;
   geninterrupt(0x2F);
   if (_AL == (char)0x80)
#else
   #ifdef _MSC_VER
      __asm mov   ax,4300h
      __asm int   2Fh
      __asm mov   ALreturn,al
      if (ALreturn == (char)0x80)
   #else
      #ifdef __WATCOMC__
         if (XMSpresent() == (char)0x80)
      #else
         regs.x.ax = 0x4300;
         int86(0x2F, &regs, &regs);
         if (regs.h.al == (char)0x80)
      #endif
   #endif
#endif
      {
      puts("XMS driver is present");
      rval = 0;
      }
   else
      {
      puts("No XMS driver present");
      rval = -1;
      }
   return rval;
   }

By the way, the Watcom compiler predefines the macro "__WATCOMC__", so that's a convenient way to detect the Watcom compiler. Compiling this new version of DETECT that special cases the Watcom compiler produced an executable that was only 4620 bytes. That's a savings of 1136 bytes!

If we were writing a TSR or device driver for DOS that had an int86() call embedded in it somewhere, then special casing the Watcom, Borland, and Microsoft compilers could pay handsome dividends in terms of resident code size for the driver or TSR. Obviously anyone writing and distributing a 3rd party library would want to look at special casing the code for these compilers too. The generic int86() way of doing things is indeed more standard, but it comes at a price.

Attention to small details like this can give a library vendor a competitive edge in the market. It can also give your applications an edge against competitors who were too lazy to bother, or unaware of the latent power in their tools.

Some handy Borland specific things

_FLAGS pseudo register

All versions of the Borland compilers since Turbo C 2.0 implement direct access to the CPU's flags register via a pseudo register called _FLAGS. This can be an incredibly handy little device for dealing with functions that give back error conditions in the CPU's carry or zero flags.

For example, suppose we're calling some DOS Int 21 function that returns with carry set when an error occurred. Using _FLAGS to handle a condition like this is trivial:

        #define CARRY_SET (_FLAGS & 0x0001)
        /* blah, blah, blah,... */
        geninterrupt(0x21);
        if (CARRY_SET)
                {
                /* handle the error condition here */
                }

Note that bit #0 in the flags register is the carry flag, so the mask 0x0001 is masking off the value of the carry flag in the test. The Borland compilers are smart enough to notice tests like this one and special case the code they generate for them. The code generated for an "if" statement, like in this example, is going to resolve to a JC or JNC instruction -- you couldn't do any better in a situation like this by writing it in pure assembler code. If you're doing low level interfacing work with the Borland compilers, this is a good one to have in your bag of tricks.

_es _ds _ss _cs pointers

Another handy extension the Borland compilers implement is the ability to declare a near 16 bit pointer type that is referenced via a specific segment register. Suppose we have two functions like these:

        static void foo(int *p)
                {
                *p = 1234;
                }

        void bar(void)
                {
                int  x;

                foo(&x);
                }

Here we've got the function bar() calling foo() and passing the address of an "automatic" variable as a parameter. In a case like this, the variable "x" will be allocated on the stack. If these functions are compiled using large model, the pointer being passed to foo() will be "far". The Borland 4.52 compiler generated this large model code for the two functions:

FOO_TEXT        segment byte public 'CODE'
   ;
   ;            static void foo(int *p)
   ;
        assume  cs:FOO_TEXT,ds:DGROUP
foo     proc    far
        push    bp
        mov     bp,sp
   ;
   ;                    {
   ;                    *p = 1234;
   ;
        les     bx,dword ptr [bp+6]
        mov     word ptr es:[bx],1234
   ;
   ;                    }
   ;
        pop     bp
        ret
foo     endp

   ;
   ;            void bar(void)
   ;

        assume  cs:FOO_TEXT,ds:DGROUP
_bar    proc    far
        enter   2,0
   ;
   ;                    {
   ;                    int  x;
   ;
   ;                    foo(&x);
   ;
        push    ss
        lea     ax,word ptr [bp-2]
        push    ax
        push    cs
        call    near ptr foo
        add     sp,4
   ;
   ;                    }
   ;
        leave
        ret
_bar    endp

Notice the "les bx,dword ptr [bp+6]" instruction that was generated in the foo() function. This is typical of a "far" pointer reference in large model code. This instruction is going to result in a segment register load which is an expensive operation in protected mode code -- like under Windows or a DOS extender. It's also sloshing a double word around in memory which hurts on CPU's with a 16 bit data path like the 386SX chips. For the call to foo() the double word pointer also caused the SS register to be pushed.

If we could be assured that the foo() function would only be called with pointers to variables that live on the stack then we could change the declaration of the foo() function to look like this:

        static void foo(int _ss *p)
                {
                *p = 1234;
                }

This declares the "p" pointer to be a 16 bit near pointer that will always have an SS: override applied to any reference that uses it. Since the parameter passed to foo() in this example does indeed live on the stack, the call to foo() in the bar() function only needs to pass the offset of the "x" variable. The code generated when the _ss pointer change was made looks like this:

FOO_TEXT        segment byte public 'CODE'
   ;
   ;            static void foo(int _ss *p)
   ;
        assume  cs:FOO_TEXT,ds:DGROUP
foo     proc    far
        push    bp
        mov     bp,sp
        push    si
        mov     si,word ptr [bp+6]
   ;
   ;                    {
   ;                    *p = 1234;
   ;
        mov     word ptr ss:[si],1234
   ;
   ;                    }
   ;
        pop     si
        pop     bp
        ret
foo     endp

   ;
   ;            void bar(void)
   ;
        assume  cs:FOO_TEXT,ds:DGROUP
_bar    proc    far
        enter   2,0
   ;
   ;                    {
   ;                    int  x;
   ;
   ;                    foo(&x);
   ;
        lea     ax,word ptr [bp-2]
        push    ax
        push    cs
        call    near ptr foo
        pop     cx
   ;
   ;                    }
   ;
        leave
        ret
_bar    endp

Notice how the costly loading of the segment register is gone in the foo() function now. The code for the call to foo() is also a lot simpler and faster now as well.

_ds, _cs, and _es pointers behave in a similar manner to the _ss pointer we just examined. The only difference is in the type of segment override the compiler is going to apply whenever the pointer is referenced. These special pointer types can be a powerful fine tuning tool for programs being built with the Borland compilers. Having the ability to suppress the reloading of a segment register in protected mode goes a long way towards minimizing the usual speed penalty (about 25% versus the same code running in real mode) associated with protected mode large model programs.

_seg pointers

Another special Borland extension is the _seg pointer type. _seg pointers map directly to a segment or selector value with an implied offset of zero, so an _seg pointer is only a 16 bit variable. These are primarily useful for saving space in situations where you would normally have a "far" pointer whose 16 bit offset part is always going to be zero. Using an _seg pointer in cases like this can save memory because a normal "far" pointer would be 4 bytes.

In a plain DOS program, using an _seg pointer is an easy way to access the program's PSP (program segment prefix). Here's an example of an _seg pointer being used. For output, this program simply echoes whatever parameters it was passed when it was run.

#include <dos.h>
#include <stdio.h>

int main(void)
   {
   char _seg *pPSP;  /* an _seg pointer to our PSP */
   char near *p;
   unsigned char CmdLineLen;

   /*
     _psp is initialized by the Borland startup
     code as being the segment address of this
     program's PSP.
   */
   pPSP = (void _seg *)_psp;
   CmdLineLen = *(pPSP+128);
   p = (char near *)129;

   while (CmdLineLen--)
      {
      putchar(*(pPSP+p));
      p++;
      }
   return 0;
   }

One of the convenient properties of _seg pointers is they can be combined with a "near" pointer when doing pointer arithmetic. The result of adding a near pointer to an _seg pointer is that a "far" pointer is generated. This is being done in the call to putchar() in the example program.

Some handy Microsoft specific things

Based pointers

The P-code interpreter


Copyright © 1998, Tony Ingenoso <tonyi@ibm.net> - Shut up and jump!
Last modified on Sunday, Dec 13, 1998
This page produced the old fashioned way - with a text editor