Chapter 5 - Working Set and Data tuning

Working set and data tuning for virtual memory systems

There are two aspects to keeping programs small:

Executable size
Memory size

Often these two goals will be in conflict and a choice between one or the other will have to be made, or a compromise reached. Previous chapters have largely dwelled on executable size, now we're going to take a look at memory size aspects.

There are two major forms of virtual memory systems in wide use in the world today:

Page swapping
Segment swapping

Segment swapping systems
OS/2 1.X
Windows 3.X(standard mode)
Some 286 DOS extenders
286 Unix systems

Page swapping systems
OS/2 2.X(and Warp)
Windows 95/98
Windows NT
Windows 3.X(enhanced mode)
Some 386 DOS extenders
386 Unix systems

Intel CPU's from the 80286 onward are capable of running segment swapping systems.
Intel 80386 and later CPU's are capable of running paged virtual memory systems and segment swapping systems.

NOTE: For the rest of this discussion, I'll be using "386" as a generic term to refer to the 386, 386SX, 486, 486SX, and Pentium CPU's. If something is specific to a particular CPU, I'll mention it specifically.

The basic concepts presented here apply to both segment swapping systems and paging systems. However, we'll be primarily focusing on paging systems in the discussion. In fact, the basic concepts also apply in the plain DOS world when applications are built using overlays as a way to get around 640K limitations.

[Overlays are really just a swapping mechanism where the swapping is being managed by code in the application rather than code in the operating system.]

A brief look at segment swapping

Segment swapping systems manage memory in units that are variable in size. With Windows in standard mode or OS/2 1.X, segments could be anywhere from 1 byte up to 64K in size. The advantage to a scheme like this is that memory can be (though not always is) managed in a more space efficient manner. If a program allocated a 500 byte segment for a chunk of data, then only 500 bytes of physical memory (plus a little bookkeeping overhead) needs to be allocated when that 500 byte segment is present in memory. A downside occurs when someone allocates a big segment, say 32K, where only a small portion of that 32K is actively being used by the program. A segment swapping system will force the whole 32K to be present in physical memory even though only a fraction of it may actually be getting accessed with any frequency.

How varying length segment might look in memory

In a segment swapping system, the whole segment will be forced into memory when a program loads one of the segment registers with the selector for the segment.
Even if the program intends to touch only a few bytes in a 64K segment, the whole 64K segment will be brought into memory.

This bringing in of the whole segment is a good thing if programs are well designed and their segments contain only the code and data needed to do specific tasks. It's grim when programs are poorly designed and segments contain extraneous code or data to the task at hand because it can force the system's memory manager to "swap out" some other program running in the system to make room for code/data that aren't going to be used right away.

Segment swapping advantages

Because segment swapping systems allocate variable length memory chunks and variable length swap file units, they tend to be more space efficient in their memory usage.
Swap files tend to grow less than those of paging systems where memory allocations and swap file allocations are all the same size. [Paging system behaviors are similar to the DOS FAT cluster discussion in earlier chapters.]

The paged virtual memory model

With paging, all the memory allocations, and swap file allocations are going to be uniform in size.

On the 386, pages will normally be 4K. [The chip actually supports a mode where pages can be 1 byte, but I don't know of any paged virtual memory OS's that have been designed this way.]

A picture of a paged system will look similar to the prior segment swapping picture with one change - all the boxes representing memory and swap file allocations will all be the same size - 4K.

How paging systems look in memory

Contrary to popular misconception, on a 386 running a paging system, "segments" do in fact still exist. [The only difference is that now they can have a maximum theoretical size of 4 gigabytes rather than 64 kilobytes.]

The way operating system designers create the popular "flat" programming model for 32 bit programs is by making a program's CS, DS, ES, and SS segment registers all point to the same "segment" in memory.

The 386's 32 bit "flat" programming model is analogous to a "tiny" model .COM file program under DOS where the DOS loader initializes CS, DS, SS, and ES to all point to the program's PSP (program segment prefix) when the program is started.

Programmer's who are familiar with the way EMS memory works under DOS will notice that a 386 paged virtual memory system works in much the same way. The physical memory in the machine is analogous to the 64K EMS page frame. The various 16K EMS pages are analogous to a 4K virtual memory page. The sum total of available EMS memory is similar to the size of a program's virtual memory space provided by the operating system. Just as a 16K EMS page needs to be mapped somewhere into the 64K EMS frame to be accessed, a 4K page in a 386 virtual memory system needs to be loaded into physical memory before it can be accessed. Let me repeat that again - it's important.

A 4K page in a 386 virtual memory system needs to be present in physical memory before it can be accessed.

Why virtual memory isn't "free".

Getting virtual memory pages into and out of physical memory is usually expensive (i.e. the swapper or loader is going to hit the disk drive). Typically more virtual memory is present than physical memory in a system. The only exception here would be if the system swapper is disabled and the system is running only out of physical memory. Yea, you can do this with some systems. OS/2 and Windows have ways to disable swapping and force the system to run out of physical memory only. You might try doing this as an experiment on a machine just to see how it runs.

For OS/2 you'd specify MEMMAN=NOSWAP in the CONFIG.SYS file. You'd better be prepared to have around 16+M of memory in your system, or it probably won't boot. With Windows 3.X, there's a setting in the control panel under the "386 Enhanced" icon where you can disable the Windows swap file. If there's only 6M or 8M of memory in your machine, you'll start running out of memory real fast opening up a few DOS boxes.

In reality, to run acceptably on machines without massive amounts of memory, most virtual memory systems are going to need to swap sometimes. Not everyone has, or can afford, machines with 32M or 64M of memory where they may have the luxury of running with swapping disabled.

[1998 update: That last paragraph was written sometime in 1994 when memory was relatively expensive, and 16M of RAM was considered a luxury. In 1998, most machines are shipping with 32M or 64M as standard, with 128M or 256M being not uncommon at all. Of course the situation has changed little in reality because the memory requirement growth of software in the last 4 years has been nothing short of incredible! A "mundane" old 120mhz Pentium in 1998 running with 32M of RAM can be brought to its knees very quickly with some of the bloated stuff on the market today. However, if you took that same machine and loaded it up with a 94' vintage operating system and some 94' vintage apps, it would be very responsive and fast with 32M]

Our goal as programmers is to reduce the amount of swapping needed during the normal operation of our programs.

Note the emphasis here on "normal operation". A couple of seconds of swapping in an editor during loading of a spell check module isn't going to upset the user too much. The disk drive will blip a few times and they're on their way. After all, spell checking a document isn't the kind of thing people do every 5 seconds. If however the program is so piggy that the drive is whirring away constantly when the user is just typing normal text, well that's going to a pretty serious annoyance to most people.

Being a good multi-tasking "citizen"

Systems like OS/2, Windows, Windows NT, UNIX, etc. allow more than one program to be active in the machine at the same time. There will probably be several active programs competing for CPU time and memory to run in. Remember that thought I emphasized before? I'll say it again:

A 4K page in a 386 virtual memory system needs to be present in physical memory before it can be accessed.

In a segment swapping virtual memory system (like Windows in "standard" mode, or OS/2 1.X), an analogous statement is that a whole segment needs to be present in physical memory before it can be accessed.

All of the programs running in the system are in competition for the machine's physical memory and the amount of virtual memory in the system is basically unrelated to this fact.

If our programs are being piggy and accessing different pages or segments like crazy, then other programs are going to be pushed out of physical memory by the system's swapper to make room for our stuff. This is going to cause overall system performance to suffer.

How bad can this cause us to "suffer"? Let's do an experiment and find out. Here's a little C program called MSPIGGY.C that allocates 16M of memory and then increments each byte of that memory in two different ways. The first way just bops down the 16M in a linear manner incrementing each byte, the second way skips pages incrementing a byte in a page and then moves to the next page. When it gets to the end of the 16M, it starts incrementing the second byte on a each page, then the third byte, etc. until all the bytes in each page have been incremented once.

/*
  MSPIGGY.C
  Demonstration of how memory accessing effects
  system performance in virtual memory systems.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <process.h>
#define BYTECOUNT (1024L*1024L*16L) /* 16 meg */
int main(void)
   {
   char *nums,*p;
   int i, count;
   time_t start,stop,delta;

   nums = malloc(BYTECOUNT * sizeof(char));
   if (nums == NULL)
      {
      puts("Out of memory!");
      exit(-1);
      }
   puts("Memory allocated OK. Starting linear access");
   fflush(stdout);

   start = time(NULL);
   for (p=nums, i=0, count=0; i<BYTECOUNT; i++)
      {
      *(p+i) += 1;
      count++;
      }
   stop = time(NULL);
   delta = stop-start;
   printf("Linear done. %d seconds elapsed,"
          "%d bytes incremented\n",delta,count);
   fflush(stdout);

   puts("Starting non-linear access");
   fflush(stdout);
   start = time(NULL);
   for (p=nums, i=0,count=0; i<4096L; i++, p++)
      {
      int j;
      for (j=0; j<(BYTECOUNT/4096L); j++)
         {
         *(p+(j*4096)) += 1;
         count++;
         }
      }
   stop = time(NULL);
   delta = stop-start;
   printf("Non-linear done. %d seconds elapsed, "
          "d bytes incremented\n",delta,count);
   fflush(stdout);
   return 0;
   }

On my 60mhz Pentium with 8M of memory, MSPIGGY is certainly going to cause some swapping action in both cases because it'll touch 16M of memory in both cases, but there's only 8M in the box. Clearly something is going to have to get thrown out of memory by the swapper at some point for this program to run in both cases.

When I ran MSPIGGY on the 8M Pentium test machine it ripped through the first phase in about 10 seconds and the disk light was on pretty constantly. Even with all the needed swapping, 10 seconds to increment 16 million bytes of stuff isn't too bad at all. The second phase was a little more disappointing. This one has been grinding away for over a half hour now and no end is in sight. I'm about to head into work, so I'll leave it running and we'll see if it's done when I get back home tonight!

Well, I'm back. MSPIGGY is still grinding away on my hard disk with no end in sight and it's been about 10 hours now. This is getting a little tiresome folks, and I've probably racked up a years worth of normal wear and tear on this drive so far. I'm going to have mercy on my drive and kill MSPIGGY now.

For the sake of this discussion let's assume that MSPIGGY would have completed just as I killed it.

We're talking the difference between 10 seconds, and about 10 hours here!

This is pretty darn serious folks. MSPIGGY ran (at least) 3,600X slower on my 8M machine when it used the second method to access memory that skipped across pages rather than sweeping through them linearly.

Granted, MSPIGGY is a program that's specifically designed to bring a machine with less than 16M of physical memory to its knees with that second method of accessing memory. However, MSPIGGY is also a good example of how bad things can get when code isn't careful about the way its accessing its data. After all, the first method did take only 10 seconds to run!

Suppose we changed the number of bytes MSPIGGY was going to access to be only 4M rather than 16M? On my 8M machine we could probably expect that version to run a lot better when it hits the second method.

Well, I made that change and the results were as expected. The second method didn't hurt nearly as much as it did with the 16M version of MSPIGGY. I did a few runs with the 4M MSPIGGY and they clocked in at about 1 second for the first method, and about 10 seconds for the second method. We don't need to be rocket scientists to realize that 10 seconds is just a tad quicker than 10 hours divided by 4.

How MSPIGGY died such a horrible death

What really happened to MSPIGGY went like this. There was 8M (minus OS overhead) available in the machine. As the program went skipping through pages, it quickly got to a point where every time it touched a byte to increment it, it was causing a page fault and causing the system's swapper to hit the disk drive to do a swap in of the page that byte happened to live on. In other words, every iteration of the inner loop in the second method was causing a disk access. To increment all 16M bytes would have eventually caused somewhere in the range of 16M disk accesses. Given the page skipping the second method used, MSPIGGY was going to die horribly even on a machine with close to 16M because none of the previously touched pages were going to be revisited until a not present page had been swapped in right before it. Visualize having a carrot held in front of you just out of reach. Every time you get close, wham, that page/carrot would be yanked away because the physical memory it was residing in had to be swapped out to bring in something that was just a little closer that was about to be touched.

I was ran MSPIGGY under OS/2 Warp (and Windows NT 3.5) for this experiment. Suppose Warp and NT were capable of swapping out and swapping in 100 pages every second. This is a highly optimistic estimate for my machine by the way. The second MSPIGGY method was going to need about 16M swap in/out operations to run to completion. Given the optimistic swapping rate estimate, MSPIGGY would have needed about 16M/100, or about 160,000 seconds to run to completion. This is about 2,666 minutes or around 44 hours! MSPIGGY wasn't even close to being done when I got home. In fact, MSPIGGY was really just warming up. When I interrupted the run it was a mercy killing!

What about "normal" programs though?

Granted, MSPIGGY's second method is totally pathological and designed to cause horrendous performance problems. However, suppose our programs access their data only 1/20th of 1% as pathologically as MSPIGGY? This isn't a wholly unreasonable possibility now is it?

This would mean a run time of maybe 80 seconds versus 10. We could be 8X slower once swapping starts to happen for "average" code accessing memory in a somewhat sloppy way. Is this really possible? Actually it can be a lot worse depending on how tight memory is in the machine.

Performance in virtual memory systems generally follows what's known as a "knee curve".

A typical "knee curve"
Once you cross a certain memory threshold, performance starts to plunge dramatically. Well written, tuned up programs are going to delay the point at which any given system hits "the knee" on the curve. This means customers will be able to run more apps at any given time, or maybe opt for a larger disk cache to boost disk performance.

**A typical "knee curve"**
	Once you cross a certain memory threshold, performance starts to plunge dramatically. Well written, tuned up programs are going to delay the point at which any given system hits "the knee" on the curve. This means customers will be able to run more apps at any given time, or maybe opt for a larger disk cache to boost disk performance.

Isn't just slapping in more memory the answer?

To a degree, yes - more memory shoves the knee of the performance curve out farther making it harder to hit. However, every time we ratchet up the memory requirements for a program, we ratchet down our prospective audience for that program too. Producing a program that demands 4M for itself probably catches quite a large number of machines in the installed base. If the same program needs 16M for itself, the audience becomes more select. If it needs 32M, then we're starting to sell our program to a very select audience.

Being sloppy about memory accessing, even when there's plenty of memory hurts too though! All memory is not the same folks.

You see, there's this small matter of caches. When a machine gets an on-chip CPU cache hit for a memory access, it's going to be dramatically faster than a miss where the CPU has to wait for a main memory from slower external DRAM's.

All those nifty 1 cycle instruction timings for the 486 and Pentium assume on-chip cache hits! When you go outside the cache, you're going to start playing the "Slug City Blues".

Let's go back to the MSPIGGY example to see how the cache plays a big part. Remember the timings for the 4M version of MSPIGGY on my Pentium with 8M? 1 second for the first method and about 10 seconds for the second method. Could the additional bit of pointer arithmetic needed for the second form have caused the 10X slowdown? Not likely, that seems a bit much. Maybe 3X or 4X would be believable, but 10X is a bit much to swallow. Even the worst C compiler isn't that bad. The outer loop code in the second method is only executed 4096 times too. On any Pentium, the time to execute the outer loop is going to be virtually nil. Heck, even a clunky old 4.77mhz 8088 machine could execute the outer loop instructions in under 1 second.

The 4M version of MSPIGGY is definitely showing the effects of my Pentium's cache in action, and here's why...

The Pentium has two different caches.

There's an 8K I-cache that caches fetched instructions.
There's an 8K D-cache for data references.

[1998 update: The Intel Pentium MMX CPU's have doubled the size of their on-chip caches (which is why they're a tad faster than non-MMX CPU's running at the same clock speed, even when not doing MMX specific operations). Also, a whole bunch of Pentium clone chips have appeared and all sorts of cache architecture variants along with them -- for example the Centaur C6 CPU's which have a 64K monolithic (i.e. combined I & D) cache.]

In MSPIGGY, the 8K I-cache is more than enough to hold those loops where all the action is occurring as the bytes get incremented, so clearly the I-cache isn't the problem here. The 8K D-cache has to be the major player in this program.

On the Pentium, a "cache line" is 32 bytes.

8K / 32 = 256. This means there's 256 cache lines of 32 bytes each in the I and D caches. When the Pentium accesses a chunk of memory that's not cached, it's going to cause at least one, and maybe two, cache lines to be loaded from external (i.e. slower) memory. A two line load will occur when a data item spans a 32 byte boundary in memory. In this case, the Pentium will load a cache line for each of the 32 byte areas the item is spanning.

The relevant fact here is that there's only 256 lines in the Pentium's I and D caches.

On chip CPU cache lines are a very scarce resource!

Let's think about this for a bit. The situation is kind of similar to the notion of page swapping to a hard disk. Consider those 256 cache lines as if they were "physical memory", and think about the main memory in the machine as if it were "virtual memory". Physical memory is typically a scarce resource compared to virtual memory. If we don't find a page in physical memory, then it needs to be swapped in. If the Pentium doesn't find data in one of the cache lines, then the line (or lines) containing that data need to be brought in!

OK, now that we've got a basic understanding of cache lines on the Pentium, we can start to understand how the 4M version of MSPIGGY is behaving with regard to the D-cache.

4M of memory takes up 1024 4K pages right? The second method in MSPIGGY goes bopping along memory in 4K increments whereas the first method bops along in 1 byte increments. Hmmm, 1024 is more than 256. There's only 256 D-cache lines available, and we're having to reload one for virtually every iteration through that inner loop the second method uses.

The second MSPIGGY method was churning the heck out of the Pentiums D-cache!

The first method is a lot more "cache friendly" to the Pentium's D-cache because it accesses memory in a linear manner with those one byte increments. Once the first method touches a byte in a cache line and causes it to be brought in, we're going to see 31 hits in that D-cache line before another line might need to be brought in!

Is this stuff about the Pentium D-cache for real?

I wasn't intending to write a book on how to make Pentiums go fast, but this is pretty important stuff. What the heck, it's interesting too! MSPIGGY wasn't really designed to flush out performance issues in the Pentium D-cache. The code in the loops for the two methods is somewhat different making actual comparisons kind of hard to interpret. What we really need is a program designed to show the D-cache in action. This next example is my first crack at hunting down the D-cache.

**Test to explore Pentium D-cache performance**
/* DCACHE.C Demonstration of how the Pentium D-cache can effect performance. / #include <stdio.h> #include <stdlib.h> #include <time.h> #include <process.h> #define REPS (6553610) #define IN_D_CACHE ((128+64+32)32) /256 lines available / char stuff[1000000]; char c; time_t start,stop,delta; volatile int incr, halt; / Volatile should force the compiler / / to access rather than enregister / int main(void) { register int i,j,count; puts("Starting linear access"); fflush(stdout); / This loop's data accesses should sit nicely in the Pentium's D-cache. / incr = 1; halt = (IN_D_CACHEincr)/16; count = 0; start = time(NULL); for (j=0; j<REPS; j++) for (i=0; i<halt; ) { c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; count += 16; } stop = time(NULL); delta = stop-start; printf("Linear done. %d seconds %d touches\n", delta, count, c); fflush(stdout); puts("Starting non-linear access"); fflush(stdout); /* This loop's data accesses will thrash the Pentium's D-cache. / incr = 32; halt = (IN_D_CACHEincr)/16; count = 0; start = time(NULL); for (j=0; j<REPS; j++) for (i=0; i<halt; ) { c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; c = stuff[i]; i+=incr; count += 16; } stop = time(NULL); delta = stop-start; printf("Non-linear done. %d seconds %d touches\n", delta, count, c); fflush(stdout); return 0; }

Test to explore Pentium D-cache performance

/*
  DCACHE.C
  Demonstration of how the Pentium D-cache
  can effect performance.
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <process.h>
#define REPS (65536*10)
#define IN_D_CACHE ((128+64+32)*32) /*256 lines available */

char stuff[1000000];
char c;
time_t start,stop,delta;
volatile int incr, halt; /* Volatile should force the compiler */
                         /* to access rather than enregister   */

int main(void)
    {
    register int i,j,count;

    puts("Starting linear access");
    fflush(stdout);
    /*
     This loop's data accesses should sit nicely in
     the Pentium's D-cache.
    */
    incr = 1;
    halt = (IN_D_CACHE*incr)/16;
    count = 0;
    start = time(NULL);
    for (j=0; j<REPS; j++)
        for (i=0; i<halt; )
            {
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            count += 16;
            }
    stop = time(NULL);
    delta = stop-start;
    printf("Linear done. %d seconds %d touches\n",
           delta, count, c);
    fflush(stdout);

    puts("Starting non-linear access");
    fflush(stdout);
    /*
      This loop's data accesses will thrash the
      Pentium's D-cache.
    */
    incr = 32;
    halt = (IN_D_CACHE*incr)/16;
    count = 0;
    start = time(NULL);
    for (j=0; j<REPS; j++)
        for (i=0; i<halt; )
            {
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            c = stuff[i]; i+=incr; c = stuff[i]; i+=incr;
            count += 16;
            }
    stop = time(NULL);
    delta = stop-start;
    printf("Non-linear done. %d seconds %d touches\n",
           delta, count, c);
    fflush(stdout);
    return 0;
    }

The basic idea behind DCACHE.C is to have two identical loops that touch bytes in memory. The first loop skips through memory a byte at a time. In theory this should work well with the D-cache because we'll be getting lots of hits on subsequent accesses due to the 32 byte cache line loads. The second loop is more hostile and skips through memory in 32 byte increments. In theory this will cause the vast majority of a cache line data, when a byte is touched, to be wasted.

All of the iteration counts in DCACHE.C are arbitrary by the way. I picked them to be big enough so the program would run long enough to generate some timings that would make for useful comparisons on my 60mhz Pentium. The test machine is running OS/2 Warp with 8M. There's no disk swapping going on with this configuration when DCACHE runs, so we can rule out disk I/O as an influence on the results. I compiled DCACHE.C using the Borland 1.5 OS/2 compiler taking all the defaults for options (i.e. "bcc dcache.c" as a command line). From prior experiments with this particular compiler we know it's going to dword align variables.

**DCACHE results on 60mhz Pentium(3 trials)**
Starting linear access Linear done. 22 seconds 293601280 touches Starting non-linear access Non-linear done. 48 seconds 293601280 touches
Starting linear access Linear done. 22 seconds 293601280 touches Starting non-linear access Non-linear done. 49 seconds 293601280 touches
Starting linear access Linear done. 22 seconds 293601280 touches Starting non-linear access Non-linear done. 48 seconds 293601280 touches

DCACHE results on 60mhz Pentium(3 trials)

Starting linear access
Linear done. 22 seconds 293601280 touches
Starting non-linear access
Non-linear done. 48 seconds 293601280 touches

Starting linear access
Linear done. 22 seconds 293601280 touches
Starting non-linear access
Non-linear done. 49 seconds 293601280 touches

Starting linear access
Linear done. 22 seconds 293601280 touches
Starting non-linear access
Non-linear done. 48 seconds 293601280 touches

These results look pretty consistent - let's call it about a 2X speed difference due to the effects of the Pentium's D-cache. Unrolling the critical inner loops in the DCACHE code 16 times helps to ensure we're not just measuring the looping activity here.

The real D-cache effect is actually going to be a lot more than 2X because there's a constant cost for looping and pointer incrementing factored into the timings on those loops.

If we did a calibration run on the loops and took out the memory touch we could get a better idea of what the true gain is. It's 4:00am though and I'm kind of tired, so I'll leave that up to you, gentle reader, as one of those proverbial "exercises".

I'm content to know that the Pentium's D-cache is a serious item and that "data locality" is a very important thing for the Pentium. It means that we'll need to pay attention to getting related data items onto the same pages in the virtual memory space (lest our programs behave like MSPIGGY did and turn people's machines into disk exercisers). We will also need to pay attention to the way that data is laid out within a particular page if we'd like to really scream on a Pentium.

What about 486's and their cache?

486 has an 8K internal cache
486 cache is combined I-cache and D-cache
486 cache lines are 16 bytes

486's should respond well to attempts to create locality in programs just like the Pentium did. Given its smaller cache, we can expect the 486 to be less tolerant of goofs on our part though. It's going to be easier to force things out of the 486's cache since it's only 8K and is caching both code and data.

How much less tolerant can a 486 be? I ran DCACHE on a 486SX-25 machine I've got that doesn't have any external cache (i.e. its got only the 8K internal cache). The results from two trials on the 486SX were:

DCACHE results on 486SX-25
Starting linear access Linear done. 86 seconds 293601280 touches Starting non-linear access Non-linear done. 327 seconds 293601280 touches
Starting linear access Linear done. 87 seconds 293601280 touches Starting non-linear access Non-linear done. 340 seconds 293601280 touches

DCACHE results on 486SX-25

Starting linear access
Linear done. 86 seconds 293601280 touches
Starting non-linear access
Non-linear done. 327 seconds 293601280 touches

Starting linear access
Linear done. 87 seconds 293601280 touches
Starting non-linear access
Non-linear done. 340 seconds 293601280 touches

Here, the difference is somewhere between 3X and 4X. Pretty dramatic eh?

By the way, it turns out that this version of DCACHE was treating the 486's cache pretty good for the linear access part of the run. The total code size for the whole DCACHE.C module was just over 1K (041A bytes according to the Borland TDUMP utility):

000183 SEGDEF 1 : _TEXT DWORD PUBLIC USE32 Class 'CODE' Length 041a

The size of the data being whacked on in the Pentium's D-cache, was 7K because of this #define:

#define IN_D_CACHE ((128+64+32)*32)

This means the unrolled loops, and the data were parked nice and snug in the 486's 8K cache.

How about some experimental controls?

Yes, I did a couple of control experiments too. I've got an uncached 386SX-25 and a 386DX-20 that I ran DCACHE on. The results on these were predictable. Having no internal, or external cache, those machines were pretty close for both parts of a DCACHE run.

DCACHE results on 386DX-20
[C:\]dcache Starting linear access Linear done. 474 seconds 293601280 touches Starting non-linear access Non-linear done. 469 seconds 293601280 touches

Nothing dramatic here, those numbers are fairly close. I expect the slight variance is due to some minor instruction dword alignment effects in the loops. 386's can be sensitive to that kind of code positioning.

DCACHE results on 386SX-25
[D:\]dcache Starting linear access Linear done. 735 seconds 293601280 touches Starting non-linear access Non-linear done. 695 seconds 293601280 touches

Again, nothing too dramatic in the way of differences.

Location, location, location!

When running code and data are sitting in the on-chip cache, you're gonna win big. When they're not, you lose big frown

. Clearly it would be very tedious to design a program where everything was sitting in the CPU's internal cache all the time. After all, 8K or 16K isn't exactly a huge amount of memory.

The next best thing though is if we can get hits in an external cache.

External caches typically run from 64K up to a meg. Even a small 64K external cache is a pretty serious weapon if the code and data aren't too sloppy. There are 16 4K pages that can fit in even a small 64K external cache. In reality, 64K is quite a bit of code and a fair amount of data. Even with the worst C/C++ compiler, you'll have to be writing code for a little while to generate 64K worth of executable code. With assembler code, you'll really be working for a while to generate 64K worth. Think back to the days of IBM's original PC. The whole BIOS for the system, plus the embedded Microsoft BASIC interpreter, all fit into well under 64K worth of ROM.

Remember when I said we should strive to be good multitasking citizens?

One of the fallout's of this cache discussion is that being a good citizen is going to almost be automatic if we watch out for the CPU's cache.
Getting code and data locality where things are parked in the CPU cache also implies that we're not out whacking pages willy-nilly in a random way all over the place causing the system's swapper to go berserk (remember MSPIGGY?).

tonyi@ibm.net - Shut up and jump!

Last modified on Thursday, Jan 14, 1999
This page produced the old fashoned way - with a text editor