Tail merging might be a performance hit on older CPU's, as jumps are expensive instructions on pre-486 CPU's. The Pentium's branch prediction hardware allows it to do a branch in one cycle if the target is in the I-cache, and a 486 only takes 3 cycles if the target is in its onboard cache. As always, let a profiler guide your actions here. If the code isn't in a hot spot, then the merging is a good thing. It may even allow some speed critical code to remain parked in a CPU's internal cache by bringing things a little closer together. As we saw in earlier chapters, keeping heavily used sections of code and data in the 486 and Pentium's onboard cache can provide major speed wins. If things get sufficiently tight, we may manage to pull some code back enough that a page of working set gets saved in a virtual memory system.
Use a little common sense here too. Tail merging the exits of a function that executes thousands of cycles before returning isn't going to hurt in any visible way, even if the profiler says 99% of the program's time is spent in that function. The cost of the tail merged exit is insignificant compared to the work going on in the rest of the function.
Let the purpose of the code guide your actions. Here's a real life war story:
While working on PC DOS 6.1 I decided to take a look at the PRINTER.SYS device driver to see if it could be slimmed down some. PRINTER.SYS was an important module for many customers in countries that work with code pages other than 437. PRINTER.SYS was fairly big as DOS device drivers go too. The previous DOS 5 version was in the 11.6K range when loaded. What's worse is that the customers most likely to use PRINTER.SYS are the same customers who are probably loading TSR's and drivers like KEYB.COM, NLSFUNC.EXE, and DISPLAY.SYS. These folks are getting hammered on memory compared to most users in the USA who won't need the national language support modules. Any memory relief thrown their way is usually appreciated.
Well, after doing a bunch of tail merging and a few other tricks, the PC DOS 6.1 version of PRINTER.SYS's resident size had dropped by over 900 bytes when compared to the DOS 5 version. Now it was down to around 10.7K resident. It's goodness when 900 bytes can be stripped out of a device driver that consumes that precious DOS memory below the 1M point. Some people would kill for an extra 900 bytes of memory.
Was the slimmer PRINTER.SYS in PC DOS 6.1 (and all subsequent versions of IBM's DOS) a little slower than the DOS 5 version? Probably, but I could never tell the difference and there have been no complaints from customers so far. Printers are so slow compared to CPU's that any difference just isn't going to be noticeable. Would you rather have that 900 bytes of low DOS memory back, or have a print job complete a couple of microseconds faster? I'd have to say that for 99.999% of the users of PRINTER.SYS, the answer to that question is a no brainer.
Let's get an idea of what jumps cost on some different CPU's so we can make rational decisions on the cost/benefit of using merging techniques. Here's a section of code designed to work within the context of Michael Abrash's Zen Timer.
;-------------------------------------------------------------
; A test program that can be run with Michael Abrash's Zen
; Timer to test what JMP instructions cost on different CPU's.
;-------------------------------------------------------------
        jmp     TestJMPs

;-------------------------------------------------------------
; IcacheFlush should cause the 8K internal cache on a 486 and
; the 8K I-cache on a Pentium to be purged of relevant data.
; On a 386, 286 or 8088 this function won't do anything.
;-------------------------------------------------------------
        align   16
IcacheFlush proc near
        db      8192 dup (90h)          ; 8K worth of NOP's
        ret
IcacheFlush endp

REPCOUNT = 800

DOJMP   macro
        local   target
        jmp     short target
        db      38 dup (0)
target:
        endm

TestJMPs label near
;------------------------------------------------------------
; Do jumps in a way that should be friendly to a 486's and
; Pentium's internal caches
;------------------------------------------------------------
        align   16
        call    IcacheFlush     ; Purge internal caches of anything useful
        call    ZtimerOn
        align   16
        rept    REPCOUNT
        jmp     $+2             ; The majority of these JMP's should run
                                ; out of the 486 or Pentium cache due to
                                ; the cache line loading.
        endm
        call    ZtimerOff
        call    ZTimerReport

;--------------------------------------------------------------
; Do jumps in a way that should be hostile to a 486 and
; Pentium's internal caches.
;--------------------------------------------------------------
        align   16
        call    IcacheFlush     ; Purge internal caches of anything useful
        call    ZtimerOn
        align   16
        rept    REPCOUNT
        DOJMP                   ; These JMP's should cause a cache line
                                ; load on both 486's and Pentium most times.
        endm
        call    ZtimerOff
The program does 800 jump instructions two different ways. The first batch of jumps is packed tightly together in memory. The second batch is scattered around. I ran the resultant test file on several test machines and got some interesting results. All timings are in microseconds from the Zen Timer.
                    First way    Second way
4.77mhz 8088          3027         3028
Uncached 386SX-25      353          387
Uncached 386DX-20      416          421
Uncached 486SX-25      165          548
60mhz Pentium           81          354
Isn't it amazing that the 60mhz Pentium slowed down to the same speed as a 386SX-25 when the jumps were causing misses on its I-cache! Equally amazing is that the 486SX-25 actually ran markedly slower than the clunky old 386SX when it was missing the jumps in its cache too!
LESSON: These newer CPU's are really, really, really dependent on getting hits in their on-chip caches. When you're in the cache you're humming. When you're not, you can be slogging along slower than some of the older generation CPU's without on-chip caches.
Suppose we've got two functions f1 and f2 that look like this:
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        pop     di
        pop     es
        pop     ds
        ret
f2      endp
There's a common 4 byte exit sequence that happens in each of these functions:
        pop     di
        pop     es
        pop     ds
        ret
A standard tail merge transformation on these two functions would look like this:
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        jmp     MergePoint1
f2      endp
There's nothing that says you can't merge into other merges as well. Suppose there's a third function, f3, involved here?
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
        call    bar
        jmp     MergePoint1
f2      endp

f3      proc    near
        push    ds
        push    es
        push    di
        call    frob
        call    bar
        jmp     MergePoint1
f3      endp
We did the obvious tail merge on the pops and ret exit sequence, but there's more to be had. Notice that both f2 and f3 have this sequence in common now:
        call    bar
        jmp     MergePoint1
This lets us merge again with a second merge like this:
f1      proc    near
        push    ds
        push    es
        push    di
        call    foo
MergePoint1:
        pop     di
        pop     es
        pop     ds
        ret
f1      endp

f2      proc    near
        push    ds
        push    es
        push    di
MergePoint2:
        call    bar
        jmp     MergePoint1
f2      endp

f3      proc    near
        push    ds
        push    es
        push    di
        call    frob
        jmp     MergePoint2
f3      endp
A straightforward unmerged version of the f1, f2, f3 functions would take up 33 bytes (18 one byte PUSH/POP's, 3 one byte RET's, and four 3 byte CALL's). The fully merged version of the f1, f2, f3 functions only takes up 26 bytes. Not bad, eh? The size of this code was chopped by around 20% and it does the exact same thing it did before the merging.
Often we'll see a situation similar to these two little functions:
func1   proc    near
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        ret
func1   endp

func2   proc    near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        pop     si
        ret
func2   endp
In this case, we can't just do an obvious tail merge with a jump. However, what if we were to change "func1" so that it pushes and pops the SI register just like "func2" does? Now we've created a situation where tail merging these two functions is possible, provided that saving/restoring SI is harmless in func1. The transformed code would look like this:
func1   proc    near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
MergePoint1:
        pop     ds
        pop     dx
        pop     bx
        pop     ax
        pop     si
        ret
func1   endp

func2   proc    near
        push    si
        push    ax
        push    bx
        push    dx
        push    ds
        ; Some action happens here
        jmp     MergePoint1
func2   endp
By adding 2 bytes in "func1", we were able to replace the 6 byte exit sequence in "func2" with a JMP: a 3 byte JMP if MergePoint1 was more than 127 bytes away, or a 2 byte JMP if it was in range of a short JMP. The net savings would be 1 or 2 bytes.
It's pretty common to see code where some functions are pushing and popping the same set of registers, just in a different order - like this:
func1   proc    near
        push    dx
        push    si
        push    bx
        ; Something
        pop     bx
        pop     si
        pop     dx
        ret
func1   endp

func2   proc    near
        push    si
        push    dx
        push    bx
        ; Something
        pop     bx
        pop     dx
        pop     si
        ret
func2   endp
By changing the order of the pushes and pops in "func1" so they match the order of those in "func2" we'll create a situation where we can do a tail merge:
func1   proc    near
        push    si
        push    dx
        push    bx
        ; Something
MergePoint1:
        pop     bx
        pop     dx
        pop     si
        ret
func1   endp

func2   proc    near
        push    si
        push    dx
        push    bx
        ; Something
        jmp     MergePoint1
func2   endp
Reordering the pushes and pops in a function is usually an OK thing to do, and doesn't affect the code size. In this case it enabled a net win of 2 bytes. The only case where this might not be a desirable thing to do is if the function you'd like to reorder the pushes and pops for is one that accesses, or calls code that looks back at, the stack frame and expects certain register values to be saved at specific locations in the stack frame.
Sometimes there's going to be two functions with a blob of repeated code in the middle. The obvious approach, if the entry and exit points of the repeated blob are one way in and one way out, is to just make the repeated code a function on its own. Suppose the code does things like jump to an error routine though? Extracting the code into a function will cause the stack layout to be altered, and this might cause problems if the code at the multiple exit points isn't prepared to deal with extra return addresses on the stack. There are also situations where the common code may call a routine that depends on parameters on the stack and is frame dependent. Such a situation might look something like this:
f1      proc    near
        call    foo
        mov     bar,1
        call    blat
; Common middle code starts here
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        call    barf            ; barf() looks for a parameter on the stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
; Common middle code ends here
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
; Common middle code starts here
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        call    barf            ; barf() looks for a parameter on the stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:
; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp
In this example, we've got both of those conditions mentioned above. There's the JMP outside the common code to "BarfBad", and the barf() function creates a stack frame to get at a parameter. In a situation like this we'd possibly be altering the semantics of the code by extracting the common stuff into a function of its own. This is a difficult situation for merging all of the common code. One straightforward and safe merge opportunity would be to just make this instruction sequence a function of its own:
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
That group of instructions is 14 bytes and doesn't depend on a stack frame. It's also straight line code that doesn't jump anywhere. Making it a function would cost the 14 bytes, plus a byte for a RET, plus 6 bytes for two CALL's to call it for a total of 21 bytes. This gives a 7 byte savings over having the group duplicated in two places. This version of the code would look like this:
CommonStuff proc near
        mov     si,1234
        mov     di,5678
        push    es
        pop     ds
        cld
        mov     cx,20
        rep     movsw
        ret
CommonStuff endp

f1      proc    near
        call    foo
        mov     bar,1
        call    blat
; Common middle code starts here
        call    CommonStuff
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
; Common middle code ends here
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
; Common middle code starts here
        call    CommonStuff
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:
; Common middle code ends here
        call    Foobie2
        call    Bletch2
        ret
f2      endp
This prior use of the CommonStuff function wasn't too bad, and it would be even better if the CommonStuff() function could be applied somewhere else in the code too. However, suppose the common sequence occurred only in two places and we know something special about the common code? Special you say? Yes, special. In this particular example, one notable thing about the code in the CommonStuff function is that it doesn't alter the arithmetic flags. More specifically, it doesn't alter the carry flag.
The carry flag is the only flag on the 80x86 family of CPU's that we can both set and clear directly with a one byte instruction and do a conditional jump on. The direction flag can be set and cleared, but there's no way to jump based on DF's value.
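For reference, here's what that boils down to in bytes (opcode values per Intel's documentation):

        stc                     ; 1 byte (0F9h) - set the carry flag
        clc                     ; 1 byte (0F8h) - clear the carry flag
        cmc                     ; 1 byte (0F5h) - complement the carry flag
        jc      SomeWhere       ; 2 bytes in the short form
        jnc     SomeWhere       ; 2 bytes in the short form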
Knowing that the common sequence doesn't alter the carry flag enables us to do a sleazy trick like this one:
f1      proc    near
        call    foo
        mov     bar,1
        call    blat
        stc                     ; CF=1 means entry was from f1(),
                                ; CF=0 means entry was from f2()
SleazeInFromF2:
        mov     si,1234         ; Common middle code starts here
        mov     di,5678         ;
        push    es              ;
        pop     ds              ;
        cld                     ;
        mov     cx,20           ;
        rep     movsw           ; Common middle code ends here
        jnc     short SleazeBackToF2
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK1
        jmp     BarfBad
BarfOK1:
        call    Foobie
        call    Bletch
        ret
f1      endp

f2      proc    near
        call    frobotz
        mov     retch, 256
        call    blat
        clc
        jmp     short SleazeInFromF2    ; JMP to the merged code
SleazeBackToF2:                         ; Now we're back
        call    barf            ; barf() looks for parameter on stack and
                                ; constructs a frame to get at it.
        or      ax,ax
        jz      BarfOK2
        jmp     BarfBad
BarfOK2:
        call    Foobie2
        call    Bletch2
        ret
f2      endp
What did this buy us here? Well, we traded two 3 byte CALL's and a one byte RET for two 2 byte JMP's plus a one byte CLC and a one byte STC. This version that used the carry flag is a byte shorter than the version that had the common code as a separate function.
If the common code sequence happened to be in 32 bit code in a USE32 segment, the savings are going to be even greater using this carry flag trick. "near" CALL instructions in USE32 type segments are 5 byte instructions as opposed to near CALL's in USE16 segments which are 3 bytes. Doing the separate CommonStuff() function will have 11 bytes worth of overhead to implement in 32 bit code versus the 7 bytes of overhead we had in the USE16 version. In USE32 code, if you can use the carry flag trick, it's going to be 5 bytes shorter than a separate function.
TIP: The INC and DEC instructions do not alter the carry flag, so a common code sequence that uses them can still work with this carry flag trick.
Suppose we've got a common blob of code in a USE32 code segment and it does alter the flags (specifically the carry flag)? Does this force us into having to go with the fatter separate function form of middle merging that's 5 bytes bigger than the carry flag trick? Not necessarily. If the entry and exit points are well defined and the common code doesn't depend on the state of the flags but only changes them incidentally, we might do this instead:
        stc
SleazeInFromWherever:
        ;Begin common code
        ...blah, blah
        pushfd
        ; The instruction(s) that alter the flags live here
        popfd
        ...blah, blah
        jnc     short SleazeBackToWherever
        ;End common code
The PUSHFD/POPFD in USE32 code (and PUSHF/POPF in USE16 code) are one byte instructions. We'd still be ahead by 3 bytes if we use PUSHFD/POPFD to keep the common sequence from changing the flags in USE32 code. Note that this technique would be a net loser by one byte in a USE16 code segment where near CALL instructions are only 3 bytes.
There's another consideration about 32 bit code here too. In a USE32 code segment, a conditional JMP (like JNC) can be either 2 bytes or 6 bytes. On a 386, a conditional jump in a USE16 segment will be either 2 bytes or 4 bytes. Here's part of a listing file generated by an assembler for a small test file that demonstrates this:
 1                                  .386
 2  00000000                        jmps32 segment use32 public 'code'
 3                                  assume cs:jmps32,ds:jmps32
 4  00000000  0F 82 00000200        jc foo32
 5  00000006  0200*(??)             db 512 dup(?)
 6  00000206                        foo32 label near
 7  00000206  90                    nop
 8  00000207  72 FD                 jc foo32
 9  00000209                        jmps32 ends
10
11  0000                            jmps16 segment use16 public 'code'
12                                  assume cs:jmps16,ds:jmps16
13  0000      0F 82 0200            jc foo16
14  0004      0200*(??)             db 512 dup(?)
15  0204                            foo16 label near
16  0204      90                    nop
17  0205      72 FD                 jc foo16
18  0207                            jmps16 ends
19
20                                  end
Notice how that 512 byte blob on line #5 caused the JC on line #4 to become one of those fat 6 byte conditional jumps. When you're forced into using one of the 6 byte jumps it's a byte worse than doing one of the 5 byte CALL's!
Similarly, in USE16 code, that 4 byte conditional JC on line #13 is a byte worse than a 3 byte CALL. We'll see more about these jumps on the 386 a little later.
One thing we see all the time in code is two procedures that have a common entry sequence of instructions and then go off to do something different at some point, like these two example functions:
f1      proc    near
(1 byte)        push    ds
(1 byte)        push    es
(1 byte)        push    si
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
        ;
        ; The code gets different here
        ;
ExitMerge:
(1 byte)        pop     di
(1 byte)        pop     si
(1 byte)        pop     es
(1 byte)        pop     ds
(1 byte)        ret
f1      endp

f2      proc    near
(1 byte)        push    ds
(1 byte)        push    es
(1 byte)        push    si
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
        ;
        ; The code gets different here
        ;
(2 byte)        jmp     ExitMerge
f2      endp

(27 bytes total)
We did the obvious tail merge on the exit sequences, but is there some way we might smash the entry sequences together too? Again, the carry flag can come to the rescue in a situation like this one. Unless these functions depend on the state of the carry flag on entry, which is a pretty rare condition, we might do something like this:
f1      proc    near
(1 byte)        clc
EntryMerge:
(1 byte)        push    ds      ; Note that the flags are
(1 byte)        push    es      ; not hit by any of these instructions
(1 byte)        push    si      ; in the common prolog sequence
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
(2 byte)        jc      EntryReturn
        ;
        ; The code gets different here
        ;
ExitMerge:
(1 byte)        pop     di
(1 byte)        pop     si
(1 byte)        pop     es
(1 byte)        pop     ds
(1 byte)        ret
f1      endp

f2      proc    near
(1 byte)        stc
(2 byte)        jmp     short EntryMerge
EntryReturn:
        ;
        ; The code gets different here
        ;
(2 byte)        jmp     ExitMerge
f2      endp

(23 bytes total)
Was 23 bytes for the last example the best we can do? If we can place one added constraint on the functions we can do a little better. What if it were OK for these functions to clobber the CX register on entry? This is a fairly easily met condition in most cases for functions that will be called by a C/C++ compiler.
Consider this dastardly transformation:
f1      proc    near
(1 byte)        db      0B9h    ; The opcode byte for MOV CX,immediate
f2      label   near
(2 byte)        xor     cx,cx   ; 33h C9h
(1 byte)        push    ds
(1 byte)        push    es
(1 byte)        push    si
(1 byte)        push    di
(6 byte)        mov     word ptr [SomeThing],1234
(2 byte)        jcxz    f2MergeEntry
        ;
        ; The code gets different here
        ;
ExitMerge:
(1 byte)        pop     di
(1 byte)        pop     si
(1 byte)        pop     es
(1 byte)        pop     ds
(1 byte)        ret
f1      endp
f2MergeEntry label near
        ;
        ; The code gets different here
        ;
(2 byte)        jmp     ExitMerge
f2      endp

(22 bytes total)
That one is a little tricky and needs some explanation. Here we've got the entry points for f1() and f2() being located within a byte of each other. The reason this works is because the CPU just processes streams of bytes that get interpreted as instructions. Entering at f1 causes a MOV CX,0C933h to be executed, while entering at f2 causes an XOR CX,CX to be executed. Combining that behavior with a JCXZ instruction makes for a pretty sneaky combination.
If you've got two functions with more than 5 bytes of prolog in common and CX is expendable during the common entry sequence, then you can apply this trick. Alas, in USE32 segments, this trick is less valuable because a "MOV ECX,immediate" is a 5 byte instruction. For code in a USE32 segment, we'd have to add two pad bytes after the XOR ECX,ECX we'd be using in 32 bit code. An OR ECX,ECX instruction would do the trick in a case like that.
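Here's a sketch of what that USE32 flavor might look like (hypothetical labels, mirroring the USE16 example above; note that it's JECXZ, not JCXZ, that tests the full ECX):

f1      proc    near
(1 byte)        db      0B9h    ; The opcode byte for MOV ECX,immediate
f2      label   near
(2 byte)        xor     ecx,ecx ; 33h 0C9h
(2 byte)        or      ecx,ecx ; 0Bh 0C9h - pads out the 4 byte immediate
        ; ... common prolog code goes here ...
(2 byte)        jecxz   f2MergeEntry

Entering at f1 executes a MOV ECX,0C90BC933h, so ECX is non-zero. Entering at f2 leaves ECX zero after the XOR and OR, and the JECXZ sorts out which entry point was used.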
Setting one segment register to the value in another one is a common occurrence in USE16 code. It happens a lot during a setup to do some string operations, and when the value in one is unknown at some point in the code.
We see 4 byte code sequences like this all the time:
        mov     ax,ds
        mov     es,ax
PUSH/POP
If the value in AX wasn't going to be used later, then the straightforward transformation to compact that sequence to 2 bytes would be:
        push    ds
        pop     es
Yes, it works for SS too!
For some reason many people seem hesitant to use the PUSH/POP sequence when SS is the target in a sequence like this one:
        mov     ax,ds
        mov     ss,ax
        mov     sp,NewSPval
The PUSH/POP thing does work here though, and there's no reason not to use it when space is the issue.
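For instance, the SS version of the transformation might look like this. (Both MOV SS and POP SS hold off interrupts until after the next instruction completes, which is exactly why the MOV to SP has to come immediately after.)

        push    ds
        pop     ss              ; Interrupts are held off until after...
        mov     sp,NewSPval     ; ...this next instruction finishes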
Using a PUSH/POP to replace a MOV/MOV chops the size of setting one segment register to another from 4 bytes to 2. What's the speed hit though? I used this little Zen Timer program to measure the difference on several different test systems:
;--------------------------------------------------------
; A Zen Timer test to see how well PUSH/POP performs
; compared to MOV/MOV for setting segment registers.
;--------------------------------------------------------
        call    ZtimerOn

        rept    5000
        push    ds              ;; 5000 PUSH/POP's
        pop     es
        endm

        call    ZtimerOff
        call    ZTimerReport

        db      32 dup (90h)

        call    ZtimerOn

        rept    5000
        mov     ax,ds           ;; 5000 MOV/MOV's
        mov     es,ax
        endm

        call    ZtimerOff
We would rightly expect the PUSH/POP sequence to run slower than the MOV/MOV sequence because it's sloshing stuff on and off the stack. How much slower it was on the different CPU types was interesting. All of these timings are in microseconds by the way.
Test CPU          PUSH/POP    MOV/MOV    Ratio
4.77mhz 8088        31057      18102      .58
386SX-25             2029        806      .39
386DX-20             2449       1100      .45
486SX-25             1310       1200      .92
486DX2-66            2235        450      .20
60mhz Pentium         330        249      .75
The effects of the 486DX2-66's clock doubler were really apparent on this test. Keeping things in registers really makes the clock doubling technology shine. Equally notable was the performance of the 486SX-25. It wasn't hurt too badly at all by the PUSH/POP sequence.
So when would we want to trade a 4 byte MOV/MOV for a 2 byte PUSH/POP? As always, a profiler and some common sense should guide our actions. Let's say the segment registers are being set at the beginning of a repeated string operation that's going to be moving several hundred or several thousand bytes around. In a case like this, the added cost of the PUSH/POP is going to be negligible compared to the cost of the string move, so you might as well save the space with a PUSH/POP. One of the advantages of the PUSH/POP sequence is that you don't have to burn a register to do it either. Depending on how the rest of the code is organized, this might be quite handy.
LES/LDS are powerful instructions
We often see code that looks like this sequence in programs:
Thing   dd      ?
        ...
(3 bytes)       mov     ax, word ptr [Thing]
(4 bytes)       mov     dx, word ptr [Thing+2]

(7 bytes total)
If we're in real mode (or V86 mode on a 386), then we don't need to worry about what value gets loaded into a segment register (like we would have to in protected mode, where a GP fault might be generated).
In a case like this, if we can waste the ES register, the code could be transformed to something like this:
(4 bytes)       les     ax, dword ptr [Thing]
(2 bytes)       mov     dx, es

(6 bytes total)
Often we'll see sequences like this in code where a bunch of variables in memory are being set:
(3 bytes)       mov     ax, word ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(3 bytes)       mov     ax, word ptr [Stuff+2]
(3 bytes)       mov     word ptr [StuffSave+2], ax

(12 bytes total)
Again, if we can waste the ES register, that can be transformed into this:
(4 bytes)       les     ax, dword ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(4 bytes)       mov     word ptr [StuffSave+2], es

(11 bytes total)
Now, suppose the sequence is setting more than a couple of variables, but we can't waste the ES register.
;
; We can't alter ES during this sequence
;
(3 bytes)       mov     ax, word ptr [OtherStuff]
(3 bytes)       mov     word ptr [OtherStuffSave], ax
(3 bytes)       mov     ax, word ptr [OtherStuff+2]
(3 bytes)       mov     word ptr [OtherStuffSave+2], ax
(3 bytes)       mov     ax, word ptr [OtherStuff2]
(3 bytes)       mov     word ptr [OtherStuffSave2], ax
(3 bytes)       mov     ax, word ptr [OtherStuff2+2]
(3 bytes)       mov     word ptr [OtherStuffSave2+2], ax
(3 bytes)       mov     ax, word ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(3 bytes)       mov     ax, word ptr [Stuff+2]
(3 bytes)       mov     word ptr [StuffSave+2], ax

(36 bytes total)
Now the sequence is big enough that we could think about pushing/popping ES as a way to preserve it through the sequence. This would allow us to waste it temporarily to produce this sequence:
;
; We can't alter ES during this sequence
;
(1 byte)        push    es
(4 bytes)       les     ax, dword ptr [OtherStuff]
(3 bytes)       mov     word ptr [OtherStuffSave], ax
(4 bytes)       mov     word ptr [OtherStuffSave+2], es
(4 bytes)       les     ax, dword ptr [OtherStuff2]
(3 bytes)       mov     word ptr [OtherStuffSave2], ax
(4 bytes)       mov     word ptr [OtherStuffSave2+2], es
(4 bytes)       les     ax, dword ptr [Stuff]
(3 bytes)       mov     word ptr [StuffSave], ax
(4 bytes)       mov     word ptr [StuffSave+2], es
(1 byte)        pop     es

(35 bytes total)
Occasionally I see code that looks like this next sequence. It's usually the result of someone being in a hurry when it was written, or of a beginning assembler programmer who didn't have the LES and LDS instructions down yet.
SomeDwordPointer dd ?
        ...
(4 bytes)       mov     di, word ptr [SomeDwordPointer]
(4 bytes)       mov     ds, word ptr [SomeDwordPointer+2]
If you ever see code written like this and you're looking to save space, then replace it with:
(4 bytes) lds di, dword ptr [SomeDwordPointer]
There are occasions when space saving is a paramount concern. For instance, when ROM'ing code and everything needs to fit into a small 32K or 64K ROM. Another might be a "pop up" TSR that interacts with the user at human typing speeds. You may even find yourself in a situation where saving a few bytes in an executable is enough to save a diskette in a product, or would pull a component in a system back under a cluster boundary in DOS's FAT file system. In situations like these, exceptional means are called for. One of the easier ways to knock bytes out of code is to look for plump instructions and turn them into one instruction subroutines. In USE16 code, a near CALL is only 3 bytes. However, there are a lot of instructions in the 80x86 family that are bigger than 3 bytes. Moving immediate constants to memory locations is a common example. In USE32 code, the situation gets worse. Moving an immediate constant to a dword in memory is a whopping 10 byte instruction. Near calls in USE32 code are 5 bytes.
Here's an example from a TSR I wrote a few years ago that faked a HIMEM driver for XT's and AT's that only had EMS 4.0 memory and no extended memory. The driver was a good enough ersatz HIMEM that the 4.0 and later versions of the SMARTDRV disk cache could run with it. The complete source for the FAKEHI driver is included in one of the appendixes. It uses several of the techniques discussed in the book to keep its resident size as small as possible.
;------------------------------------------------------------
; Putting this instruction in a proc saves some space.
; Immediate compares to memory are BIG instructions.
;------------------------------------------------------------
CmpAllocFlag0 proc near
        cmp     AllocFlag, 0
        ret
CmpAllocFlag0 endp
If there are enough instances of a large instruction in the code, extracting that instruction into a subroutine of its own can pay off. Realize that this isn't something you'll want to do in a high performance section of code though. The CALL/RET pair is terribly slow compared to inlining the instruction.
When you're looking for opportunities to do a one instruction subroutines, one of the best ways to find them is to look at an assembler listing file. Down the left hand side of the listing will be the generated code. You want to kind of scan the listing looking for instructions that generated 4 or more bytes of code. Plump things really stand out in a listing file.
This one-instruction subroutine trick isn't just for assembler code either. It works well in C/C++ code as well when you're really trying to squeeze something down. Take this function for example.
/* 1INST.C  Example for 1 instruction subroutines in C */

int var1;
int var2;
int var3;

#ifdef CALL
void near ZeroVar1(void)
{
    var1 = 0;
}
#endif

void foo(int x, int y, int z)
{
    var1 &= 255;
    if (var1)
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
    else
    {
        if (x)
#ifdef CALL
            ZeroVar1();
#else
            var1 = 0;
#endif
    }

    if (y)
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
    else
    {
        if (var2)
#ifdef CALL
            ZeroVar1();
#else
            var1 = 0;
#endif
        if (z & var2)
#ifdef CALL
            ZeroVar1();
#else
            var1 = 0;
#endif
    }

    if (var3 == 7)
    {
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
        var2 = 1;
        return;
    }

    if (var1 || var2 || var3)
#ifdef CALL
        ZeroVar1();
#else
        var1 = 0;
#endif
}
I compiled this function with the Microsoft version 8.0 compiler using the /Os /O1 /Fa and /Gs switches. /Gs eliminates stack probes, /Fa generates an assembler listing, and /Os and /O1 tell the compiler to go for max space optimizations.
Without the symbol "CALL" being defined, the compiler generated this assembler code:
; Static Name Aliases
;
        TITLE   1inst.c
        .286p
        .287
        INCLUDELIB      SLIBCE
        INCLUDELIB      OLDNAMES.LIB
_TEXT   SEGMENT WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT WORD PUBLIC 'BSS'
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA
        ASSUME  DS: DGROUP, SS: DGROUP
_BSS    SEGMENT
        COMM NEAR       _var1:  BYTE:   2
        COMM NEAR       _var2:  BYTE:   2
        COMM NEAR       _var3:  BYTE:   2
_BSS    ENDS
_TEXT   SEGMENT
        ASSUME  CS: _TEXT
        PUBLIC  _foo
_foo    PROC NEAR               ; COMDAT
; Line 18
        push    bp
        mov     bp,sp
;       x = 4
;       y = 6
;       z = 8
; Line 20
        and     WORD PTR _var1,255      ;00ffH
        jne     $L121
; Line 28
        cmp     WORD PTR [bp+4],0       ;x
        je      $I112
; Line 32
$L121:
        mov     WORD PTR _var1,0
; Line 34
$I112:
; Line 35
        cmp     WORD PTR [bp+6],0       ;y
        jne     $L122
; Line 43
        cmp     WORD PTR _var2,0
        je      $I116
; Line 47
        mov     WORD PTR _var1,0
; Line 49
$I116:
        mov     ax,WORD PTR _var2
        test    WORD PTR [bp+8],ax      ;z
        je      $I115
; Line 53
$L122:
        mov     WORD PTR _var1,0
; Line 55
$I115:
; Line 56
        cmp     WORD PTR _var3,7
        jne     $I118
; Line 61
        mov     WORD PTR _var1,0
; Line 63
        mov     WORD PTR _var2,1
; Line 64
        leave
        ret
; Line 66
$I118:
        cmp     WORD PTR _var1,0
        jne     $I120
        cmp     WORD PTR _var2,0
        jne     $I120
        cmp     WORD PTR _var3,0
        je      $EX110
$I120:
; Line 70
        mov     WORD PTR _var1,0
; Line 72
$EX110:
        leave
        ret

_foo    ENDP
_TEXT   ENDS
END
Even with all the best space optimizations possible, the compiler still generated 5 MOV instructions to "var1" with immediate constants of zero. The generated code from assembling the ASM listing was 6Ah bytes long.
Compiling the function again, this time with the /DCALL switch to enable the one instruction subroutine generated this listing file:
; Static Name Aliases
;
        TITLE   1inst.c
        .286p
        .287
        INCLUDELIB      SLIBCE
        INCLUDELIB      OLDNAMES.LIB
_TEXT   SEGMENT WORD PUBLIC 'CODE'
_TEXT   ENDS
_DATA   SEGMENT WORD PUBLIC 'DATA'
_DATA   ENDS
CONST   SEGMENT WORD PUBLIC 'CONST'
CONST   ENDS
_BSS    SEGMENT WORD PUBLIC 'BSS'
_BSS    ENDS
DGROUP  GROUP   CONST, _BSS, _DATA
        ASSUME  DS: DGROUP, SS: DGROUP
_BSS    SEGMENT
        COMM NEAR       _var1:  BYTE:   2
        COMM NEAR       _var2:  BYTE:   2
        COMM NEAR       _var3:  BYTE:   2
_BSS    ENDS
_TEXT   SEGMENT
        ASSUME  CS: _TEXT
        PUBLIC  _ZeroVar1
_ZeroVar1       PROC NEAR       ; COMDAT
; Line 13
        mov     WORD PTR _var1,0
; Line 14
        ret
_ZeroVar1       ENDP
        PUBLIC  _foo
_foo    PROC NEAR               ; COMDAT
; Line 18
        push    bp
        mov     bp,sp
;       x = 4
;       y = 6
;       z = 8
; Line 20
        and     WORD PTR _var1,255      ;00ffH
        jne     $L123
; Line 28
        cmp     WORD PTR [bp+4],0       ;x
        je      $I114
; Line 30
$L123:
        call    _ZeroVar1
; Line 34
$I114:
; Line 35
        cmp     WORD PTR [bp+6],0       ;y
        jne     $L124
; Line 43
        cmp     WORD PTR _var2,0
        je      $I118
; Line 45
        call    _ZeroVar1
; Line 49
$I118:
        mov     ax,WORD PTR _var2
        test    WORD PTR [bp+8],ax      ;z
        je      $I117
; Line 51
$L124:
        call    _ZeroVar1
; Line 55
$I117:
; Line 56
        cmp     WORD PTR _var3,7
        jne     $I120
; Line 59
        call    _ZeroVar1
; Line 63
        mov     WORD PTR _var2,1
; Line 64
        leave
        ret
; Line 66
$I120:
        cmp     WORD PTR _var1,0
        jne     $I122
        cmp     WORD PTR _var2,0
        jne     $I122
        cmp     WORD PTR _var3,0
        je      $EX112
$I122:
; Line 68
        call    _ZeroVar1
; Line 72
$EX112:
        leave
        ret

_foo    ENDP
_TEXT   ENDS
END
It was good that the Microsoft compiler was smart enough to recognize that the ZeroVar1() function didn't need any sort of function prolog/epilog code. The compiler was also nice enough to recognize that ZeroVar1() didn't need to save or restore SI/DI which are commonly used as register variables. The generated code from assembling the ASM listing was 62h bytes long. By making the assignment of Var1 a function, we saved 8 bytes.
TIP: The Borland 4.52 C++ compiler behaved a little differently on this test program and insisted on pushing/popping SI/DI at the beginning and end of the one instruction function even though those registers weren't altered by the code. To force the Borland compiler to omit the push/pop of SI/DI a #pragma option can be used like this:
#pragma option -r-      // Temporarily turn off register variables
void near ZeroVar1(void)
{
    var1 = 0;
}
#pragma option -r.      // Restore the register variable state
Here's a section of the assembler code produced by the Borland compiler with the #pragma's used as mentioned above:
_TEXT   segment byte public 'CODE'
;
;       void near ZeroVar1(void)
;
        assume  cs:_TEXT,ds:DGROUP
_ZeroVar1       proc    near
;
;       {
;          var1 = 0;
;
        mov     word ptr DGROUP:_var1,0
;
;       }
;
        ret
_ZeroVar1       endp
;
;       void foo(int x, int y, int z)
;
        assume  cs:_TEXT,ds:DGROUP
_foo    proc    near
        push    bp
        mov     bp,sp
        push    si
        push    di
;
;       {
;          var1 &= 255;
;
        and     word ptr DGROUP:_var1,255
Notice how no push/pop of SI/DI was generated within the ZeroVar1() function when the #pragma was used. We can also see that the register variables were made active again when the code for the body of the foo() function was generated because pushes of SI and DI were generated there.
One sequence we'll often see in code is a comparison of a register with the value 1 or -1 like this:
        cmp     edi,-1          ; cmp di,-1 in USE16 code
        je      SomeWhere

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...
In USE16 and USE32 code this will be encoded by most assemblers as a 3 byte instruction. In special situations we can transform this into a one byte INC instruction by recognizing that one added to minus one is zero (i.e. an INC will set the Z flag):
        inc     edi             ; inc di in USE16 code
        je      SomeWhere

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...
This is a nice space saver when we can apply it. When can we apply it though? If the code on the "fall through" path and at the target don't depend on the value of the register after the comparison, then this trick is a winner. There may still be an opportunity when only one of the two paths depends on the register's value. For example, let's say the code on the fall through in the example above depended on the value in EDI remaining intact, but the code at the branch target didn't. We can still save a byte by doing this:
        inc     edi             ; inc di in USE16 code
        je      SomeWhere
        dec     edi

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...
The DEC EDI will restore the value in EDI back to what it was before the test.
When testing for the value 1, the concept is the same as testing for -1 only we'll be doing a DEC instead. Similarly, recovering the register value, if one of the paths needs it, would be done with an INC in this situation rather than a DEC.
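A sketch of the test-for-1 flavor, again assuming only the fall through path needs the register's original value back:

        dec     edi             ; dec di in USE16 code - ZF set if EDI was 1
        je      SomeWhere
        inc     edi             ; Restore EDI on the fall through path

        ; Blah, blah, blah...

SomeWhere:
        ; Blah, blah, blah...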
The thing people typically do to split the nibbles in the AL register into AH and AL might look something like this piece of code:
(2 bytes)       mov     ah,al
(2 bytes)       mov     cl,4
(2 bytes)       shr     ah,cl
(3 bytes)       and     ax,0F0Fh
This is nice and it works OK. However, it is 9 bytes long. There's a way to tighten this up into a single two byte instruction though. In all the Intel CPU documentation prior to the Pentium, the claim was made that the AAM instruction only worked in base 10 decimal. In the Pentium programmer's reference, Intel has finally confessed to how the AAM (and AAD) instructions really work. While this confession occurred in the Pentium programmer's reference, AAM has functioned this way in every 80x86 CPU since the original 8086. Here's a pseudo-C version of how AAM works in reality:
AAM(unsigned char NumericBase)
{
    unsigned char Temp8;

    Temp8 = AL;
    AH = Temp8 / NumericBase;
    AL = Temp8 % NumericBase;
}
In fact, the AAM instruction is capable of dealing with numbers in most any numeric base you might want to work in. One of the more convenient bases we may want to have the AAM work with is base 16 (i.e. hexadecimal).
If you were to just code up an AAM instruction, an assembler is going to encode it like this:
        db      0D4h,0Ah        ; Opcode for AAM is 0D4h.
                                ; Second byte is the numeric base.
                                ; In this case 0Ah = 10 decimal.
If we hand assemble an AAM with a second byte that's different than 0Ah, then we've got an AAM that's going to work in whatever base we encode as the second byte. If we want AAM to work in hexadecimal, then we'd encode the second byte as 10h (16 decimal). You might want to do a macro for convenience, like this.
;
; MAAM (Mutant AAM)
;
MAAM    macro   NumericBase
        db      0D4h, NumericBase
        endm
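With the macro in place, splitting the nibbles of AL takes just the one mutated instruction. For instance (using a value other than the one walked through below):

        mov     al,5Ah
        MAAM    16              ; AX = 050Ah - high nibble to AH, low to AL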
Let's do a mental run of a nibble in AL through the pseudo-C code for AAM using hexadecimal as the numeric base to see how this works. Suppose the AL register contains the value 37h.
37h divided by 16 is 3. The AH register will get the value 03h.
37h mod 16 is 7. The AL register will get the value 07h.
So for our test value of 37h in AL, after a mutated AAM using base 16, the AX register contains the value 0307h. The mutated AAM split those nibbles cleanly into AH and AL.
There's a downside to this AAM trick though...
The performance of AAM is pretty slow on all the CPU's.
The previous 9 byte, 4 instruction sequence that split the nibbles is actually faster than the 2 byte mutated AAM. On a Pentium, the AAM instruction takes 10 cycles. It's one of the most expensive integer instructions you can do on a Pentium. As always, we should let a profiler and some common sense tell us if the AAM trick is appropriate for any given piece of code.
Just as we were able to use a mutated AAM to split nibbles, we can also use another mutated instruction to join them. If the two byte AAD instruction is mutated so the second byte is 10h (16 decimal), it will join the low nibble in AH with the low nibble in AL. The low nibble of AH becomes the high nibble of AL, while the low nibble of AL remains unchanged.
Suppose the AX register contained the value 0109h. Doing one of these hexadecimal mutated AAD instructions will result in AL containing the value 19H. The AH register is always set to zero after any form of the AAD instruction.
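Here's a companion macro sketch for the mutated AAD (the AAD opcode byte is 0D5h; "MAAD" is just a name I'm coining to mirror MAAM):

;
; MAAD (Mutant AAD)
;
MAAD    macro   NumericBase
        db      0D5h, NumericBase
        endm

        mov     ax,0109h
        MAAD    16              ; AX = 0019h - nibbles joined into AL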
WARNING: If the high order nibbles of AL or AH could possibly be non-zero, you'll need to do an "AND AX,0F0Fh" before doing the mutated base 16 AAD to join the nibbles.
WARNING: The mutated AAD trick does not work correctly on some of the NEC V series CPU's.
I've tried it on a V20, and the V20 always behaves as if the second byte of the AAD were 0Ah (10 decimal). If you're writing code that's going to run on a 286 or better CPU, then use the mutated AAD trick and sleep well. If you're writing code that may be run on an XT class machine with one of the NEC V20 or V30 chips installed, then you'd better avoid this trick. I've not tried the AAD trick on a V40 or V50 chip yet. On all the Intel CPU's, and variants thereof, that are 286 class or better, the AAD trick will work OK. By the way, if you're looking for an easy way to detect a V20 chip, this is one way to do it.
Frequently we'll encounter code that tests for a series of error conditions as it flows along, where all the error paths lead to one place. The code might look something like this:
ErrorExit:
        mov     ax,-1
        ret

FooBar  proc    near
        call    Check1
        jc      ErrorExit
        ;
        ; The next JC to ErrorExit is still in range of a
        ; short jump.
        ;
        call    Check2
        jc      ErrorExit
        ;
        ; Some code is here that causes a subsequent JC to
        ; ErrorExit to be out of range of a short jump. This
        ; forced the programmer to do a "jump around the
        ; jump" construct.
        ;
        call    Check3
        jnc     Check3ok        ; (2 bytes)
        jmp     ErrorExit       ; (3 bytes)
Check3ok:
        xor     ax,ax
        ret
FooBar  endp
In a case like this one, if the code after the call to Check3() is within short jump range of one of the other JC instructions, the code can be transformed like this:
FooBar  proc    near
        call    Check1
        jc      ErrorExit
        ;
        ; The next JC to ErrorExit is still
        ; in range of a short jump.
        ;
        call    Check2
JC_to_ErrorExit:
        jc      ErrorExit
        ;
        ; Some code is here that causes a
        ; subsequent JC to ErrorExit to be
        ; out of range of a short jump.
        ; This forced the programmer to do
        ; a "jump around the jump" construct.
        ;
        call    Check3
        jc      JC_to_ErrorExit
        xor     ax,ax
        ret
FooBar  endp
The state of the carry flag isn't going to change after a conditional jump, so introducing the label "JC_to_ErrorExit" allows us to transform a 5 byte "jump around the jump" sequence into a nice small 2 byte jump on carry.
One added benefit of doing this is that the non-error path through the code isn't forced to take a jump now. Falling through conditional jumps in the mainstream code is always a good thing. Jumps not taken are invariably faster than jumps taken.
If the routine in question is sufficiently long, with lots of tests, you'll get a nice space savings out of this trick, and the code will be running faster in the normal case too.
Sometimes there's no way to avoid a jump around the jump in a real long routine. That's life. When it happens though, see if you can apply the jump chaining trick to subsequent checks in the routine that may be within short jump range of that jump around jump you were forced to use.
The 386 and later CPU's are special here by the way. They implement a new form of the conditional jumps that can span the range of a 64K segment in USE16 code, and span a 4G segment in USE32 code. In USE16 code, this form of conditional jump is going to be a 4 byte instruction. In USE32 code it will be a 6 byte instruction.
Here's a little assembler module that demonstrates these new forms of conditional jumps:
        .386
x       segment use32 'code32'
        assume  cs:x
        jc      foo32
        db      200 dup (90h)   ; Force jump out of short range
foo32   label   near
x       ends

y       segment use16 'code16'
        assume  cs:y
        jc      foo16
        db      200 dup (90h)   ; Force jump out of short range
foo16   label   near
y       ends
        end
Assembling this module yielded this listing file that shows how these new form jumps are 6 and 4 bytes (on lines 4 and 11 of the listing).
Turbo Assembler  Version 3.1        11/19/95 22:09:01       Page 1
jumps.ASM

 1                                  .386
 2  00000000                        x segment use32 'code32'
 3                                  assume cs:x
 4  00000000  0F 82 000000C8        jc foo32
 5  00000006  C8*(90)               db 200 dup (90h)
 6  000000CE                        foo32 label near
 7  000000CE                        x ends
 8
 9  0000                            y segment use16 'code16'
10                                  assume cs:y
11  0000      0F 82 00C8            jc foo16
12  0004      C8*(90)               db 200 dup (90h)
13  00CC                            foo16 label near
14  00CC                            y ends
15                                  end
I need to mention a gotcha here regarding these new 386 forms of jumps. Some assemblers are not going to be very smart and may assemble what could be a short jump into one of those longer 4 or 6 byte forms. Some will later realize they could do the short form and will generate that, but pad the extra bytes they initially allocated with NOP's.
Here's a slightly altered version of the above example that illustrates this pitfall:
        .386
x       segment use32 'code32'
        assume  cs:x
        jc      foo32
        db      200 dup (90h)
        jc      foo32
        db      100 dup (90h)
foo32   label   near
x       ends

y       segment use16 'code16'
        assume  cs:y
        jc      foo16
        db      200 dup (90h)
        jc      foo16
        db      100 dup (90h)
foo16   label   near
y       ends
        end
The second JC in each segment here is within 127 bytes of the target label and can rightly be assembled as the 2 byte short form of the conditional jump. However, when I ran this code with no special switches, other than to generate a list file, through the version of TASM that came with the Borland version 3.1 C/C++ compiler, I got this result:
Turbo Assembler  Version 3.1        11/19/95 22:36:17       Page 1
jumps2.ASM

 1                                  .386
 2  00000000                        x segment use32 'code32'
 3                                  assume cs:x
 4  00000000  0F 82 00000132        jc foo32
 5  00000006  C8*(90)               db 200 dup (90h)
 6  000000CE  72 68 90 90 90 90     jc foo32
 7  000000D4  64*(90)               db 100 dup (90h)
 8  00000138                        foo32 label near
 9  00000138                        x ends
10
11  0000                            y segment use16 'code16'
12                                  assume cs:y
13  0000      0F 82 0130            jc foo16
14  0004      C8*(90)               db 200 dup (90h)
15  00CC      72 66 90 90           jc foo16
16  00D0      64*(90)               db 100 dup (90h)
17  0134                            foo16 label near
18  0134                            y ends
19                                  end
Notice how TASM generated the 4 and 6 byte conditional jumps on lines 6 and 15 of this listing. TASM just blindly allocated 4 and 6 bytes for the jumps and then went back and filled in the gaps with NOP's (90h's) when it realized a short version of the jump would work.
Are TASM users doomed to this kind of behavior? No. There are two different ways around this situation. One is to use the /m switch when assembling and tell TASM to use multiple passes to resolve forward references. When I rebuilt the example and added a /m3 switch TASM generated this much better code:
Turbo Assembler  Version 3.1        11/19/95 22:44:56       Page 1
jumps2.ASM

 1                                  .386
 2  00000000                        x segment use32 'code32'
 3                                  assume cs:x
 4  00000000  0F 82 0000012E        jc foo32
 5  00000006  C8*(90)               db 200 dup (90h)
 6  000000CE  72 64                 jc foo32
 7  000000D0  64*(90)               db 100 dup (90h)
 8  00000134                        foo32 label near
 9  00000134                        x ends
10
11  0000                            y segment use16 'code16'
12                                  assume cs:y
13  0000      0F 82 012E            jc foo16
14  0004      C8*(90)               db 200 dup (90h)
15  00CC      72 64                 jc foo16
16  00CE      64*(90)               db 100 dup (90h)
17  0132                            foo16 label near
18  0132                            y ends
19                                  end
See how the jumps on lines 6 and 15 are nice short 2 byte ones and that nasty NOP padding is gone.
One thing that can really pork out the code for DOS device drivers and TSR's is missing an ASSUME directive for the DS register. Here's a section of the assembler listing from the FAKEHI driver I mentioned previously. On closer examination, I realized that I'd blown it when I first wrote the driver.
318                           ;---------------------------------------------------
319                           ; Translate Src pointer to EMS form (if necessary)
320                           ;---------------------------------------------------
321 020C  2E: 80 3E 0094 00       cmp     EMSxbuf.src_type, 0     ; Lowmem?
322 0212  74 12                   je      ChkDstForXlat           ; No xlat in this case...
323
324 0214  2E: C4 3E 0097          les     di, dword ptr EMSxbuf.src_offset
325 0219  E8 FEAE                 call    XlatProc
326 021C  2E: 89 3E 0097          mov     word ptr EMSxbuf.src_offset, di
327 0221  2E: 8C 06 0099          mov     word ptr EMSxbuf.src_offset+2, es
328
329 0226                      ChkDstForXlat:
330                           ;---------------------------------------------------
331                           ; Translate Dst pointer to EMS form (if necessary)
332                           ;---------------------------------------------------
333 0226  2E: 80 3E 009B 00       cmp     EMSxbuf.dst_type, 0     ; Lowmem?
334 022C  74 12                   je      XlatingDone             ; No xlat in this case...
335
336 022E  2E: C4 3E 009E          les     di, dword ptr EMSxbuf.dst_offset
337 0233  E8 FE94                 call    XlatProc
338 0236  2E: 89 3E 009E          mov     word ptr EMSxbuf.dst_offset, di
339 023B  2E: 8C 06 00A0          mov     word ptr EMSxbuf.dst_offset+2, es
340
341 0240                      XlatingDone:
342 0240  0E                      push    cs      ; DS:SI --> EMS xlated info
343 0241  1F                      pop     ds      ;
344 0242  BE 0090                 mov     si, offset EMSxbuf      ;
The function XlatProc won't alter the DS register in this section of code by the way, nor does it depend on the value in DS. In the listing, take a look at the code the assembler generated on lines 321, 324, 326, 327, 333, 336, 338, and 339.
Notice the '2E:'s. Those are CS: segment override bytes on the instructions -- 8 of them.
Now notice the PUSH CS and POP DS down on lines 342 and 343. If I'd moved the push/pop of CS and DS up to the top of this section of code, I would have been able to use an ASSUME on the DS register through this section. That would have allowed the assembler to eliminate those 8 bytes of segment overrides.
Fiddling with the code a little yielded this new listing:
286 020C 0E push cs
287 020D 1F pop ds
288 assume ds:fakehi
289 ;---------------------------------------------------
290 ; Translate Src pointer to EMS form (if necessary)
291 ;---------------------------------------------------
292 020E 80 3E 0094 00 cmp EMSxbufDS.src_type, 0 ; Lowmem?
293 0213 74 0F je ChkDstForXlat ; No xlat in this case...
294
295 0215 C4 3E 0097 les di, dword ptr EMSxbufDS.src_offset
296 0219 E8 FEAE call XlatProc
297 021C 89 3E 0097 mov word ptr EMSxbufDS.src_offset, di
298 0220 8C 06 0099 mov word ptr EMSxbufDS.src_offset+2, es
299
300 0224 ChkDstForXlat:
301 ;---------------------------------------------------
302 ; Translate Dst pointer to EMS form (if necessary)
303 ;---------------------------------------------------
304 0224 80 3E 009B 00 cmp EMSxbufDS.dst_type, 0 ; Lowmem?
305 0229 74 0F je XlatingDone ; No xlat in this case...
306
307 022B C4 3E 009E les di, dword ptr EMSxbufDS.dst_offset
308 022F E8 FE98 call XlatProc
309 0232 89 3E 009E mov word ptr EMSxbufDS.dst_offset, di
310 0236 8C 06 00A0 mov word ptr EMSxbufDS.dst_offset+2, es
311
312 023A XlatingDone:
What appears to be a structure, EMSxbufDS, is actually an equate to a memory location that gets relocated into the program's PSP when it loads, so I had to jiggle that around a little in this section. The previous EMSxbuf symbol had a hard-wired CS: override, so that had to be changed to DS: in recognition that DS was actually usable in this section of code now.
Notice how the CS: override bytes are gone in this section of code now that these adjustments were made. Let the assembler's listing file guide you, like it did in this example from FAKEHI.
If you see a section of code with a bunch of segment override bytes associated with it, there may be a way to eliminate them by making DS available like I did in this new version of FAKEHI.
There's a unique situation that arises occasionally when we'd like to test a particular bit in the AH register and then make a conditional jump based on it. The SAHF (store AH into flags) instruction will store the AH value into the low byte of the flags register for us.
The bits in the low byte of the flags register that correspond to testable bits in the AH register are as follows:
Low flags byte: (msb)  SF  ZF  ##  AF  ##  PF  ##  CF  (lsb)
AH Register:    (msb)  b7  b6  b5  b4  b3  b2  b1  b0  (lsb)
## = this flags bit is reserved; SAHF ignores the corresponding AH bit.
So, knowing this correspondence often allows us to replace code that might look like this:
(3 bytes)       test    ah,00000001b
(2 bytes)       jnz     Somewhere
with something like this:
(1 byte)        sahf
(2 bytes)       jc      Somewhere       ; Jump if bit 0 in AH is set
Similarly, if we wanted to test bit 6 in AH we could do this:
(1 byte)        sahf
(2 bytes)       jz      Somewhere       ; Jump if bit 6 in AH is set
These situations aren't going to occur all the time, but when they do, this trick is usually good for a 2 byte savings.
Whenever you see code that's adding or subtracting one to or from a register or memory location, check and see if the code following the ADD or SUB instruction depends on the carry flag. If the subsequent code isn't dependent on the state of the carry flag, then it's a safe bet that the ADD or SUB of one can be replaced with an INC or DEC instruction.
- In USE16 code doing an INC/DEC on one of the 16 bit registers is always a one byte instruction. In USE32 code, doing an INC/DEC on one of the 32 bit registers is always a one byte instruction.
- ADD'ing or SUB'ing one to a 16 bit register in USE16 code will be 3 byte instructions, so even doing 2 INC's or DEC's will be a byte smaller than adding or subtracting 2 to or from the register.
- In USE32 code, ADD'ing and SUB'ing small values to a 32 bit register are also 3 byte instructions. (The byte counts are sketched out after these notes.)
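Here are those byte counts sketched out in USE16 code:

        add     cx,1            ; 3 bytes (83h 0C1h 01h)
        inc     cx              ; 1 byte (41h)

        add     cx,2            ; 3 bytes (83h 0C1h 02h)
        inc     cx              ; Two INC's total 2 bytes - still a byte
        inc     cx              ; smaller than the ADD of 2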
Often we'll see some code that looks like this:
(2 bytes) inc al
If we could stipulate that the value in AL was known to be less than 255, guaranteeing that no overflow into AH is possible, then this INC AL could be replaced with an INC AX which is a one byte instruction in USE16 code. If the value in AH is expendable, then an INC/DEC of AX may be possible if the resultant flags don't matter.
Sometimes people's minds get stuck thinking about 8 bit values and they don't recognize when AH is in fact expendable and an INC/DEC of a 16 bit register could have been used.
This same concept can be applied in USE32 code. If we know the value in AL is in no danger of overflowing into AH (or an overflow doesn't matter), then the INC AL can be replaced with an INC EAX.
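In byte terms (assuming the overflow into AH either can't happen or doesn't matter):

        inc     al              ; 2 bytes (0FEh 0C0h)
        inc     ax              ; 1 byte (40h) in USE16 code
        inc     eax             ; 1 byte (40h) in USE32 code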
Another possible opportunity exists in 32 bit code too. There, someone might be trying to INC the AX register, or ADD something to it. This occurs a lot in older 8086 or 286 code that's been ported to take some advantage of the 386, but hasn't been totally reworked with the 386 in mind. In cases like this, accessing the AX register from USE32 code incurs the penalty of a 66h size override prefix byte on the instruction. If you can stipulate that the value in AX won't overflow into the high order word of EAX, or that it would be harmless if it did happen, then doing the operation on EAX rather than AX will eliminate the size override prefix byte. (A sketch follows the notes below.)
- One advantage of tweaking situations like this in USE32 code is that the transformed code will run a lot faster on 486's and Pentiums.
- Size override prefixes really mess up the CPU pipelines on the 486 and Pentium.
- Instructions with size override prefixes hurt especially bad on the Pentium as they can't be "paired", which cripples its superscalar dual-pipeline hardware.
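Here's the prefix cost sketched out for USE32 code:

        inc     ax              ; 2 bytes (66h 40h) - the operand size prefix
                                ; also blocks pairing on the Pentium
        inc     eax             ; 1 byte (40h) - no prefix, pairs normally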
Another situation with ADD/INC that crops up frequently is when we're reworking some older 8086 or 286 code to take advantage of a 386's capabilities. Often we'll see sequences of 8088 code that look like this:
foo     dd      ?
        ...blah, blah, blah...
        add     word ptr foo, 1
        adc     word ptr foo+2, 0
The idea in this fragment is to add one to the double word memory variable 'foo'. When doing 386 conversions on older code you'd usually want to replace that sequence with:
inc dword ptr foo
The only exception would be if for some reason the state of the carry flag matters after the sequence. An INC won't alter the carry flag, but the ADD/ADC sequence will. In those odd situations, you'd want to do an ADD of 1 to the variable.
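That exceptional case would be:

        add     dword ptr foo, 1        ; Sets CF on a rollover from
                                        ; 0FFFFFFFFh, where an INC would
                                        ; leave CF untouched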
Suppose we've got a situation where we've got two structures and one is being copied to the other, with the exception of some fields in the middle of the structures. We'll often see code that looks like this example of some "flat" 32 bit code.
        lea     esi, SourceStructure
        lea     edi, DestinationStructure
        mov     ecx,8
        rep     movsd           ; Copy 8 dwords
        add     edi, 4          ; Skip the next dword field (3 byte instruction)
        add     esi, 4          ; Skip the next dword field (3 byte instruction)
        mov     cl,4            ; ECX was zero from REP MOVSD
        rep     movsd           ; Copy 4 more dwords
        add     edi, 2          ; Skip the next word field (3 byte instruction)
        add     esi, 2          ; Skip the next word field (3 byte instruction)
        movsd                   ; Move the last field
If this isn't a particularly speed critical section of code, we could replace it with this sequence:
        lea     esi, SourceStructure
        lea     edi, DestinationStructure
        mov     ecx,8
        rep     movsd           ; Copy 8 dwords
        cmpsd                   ; Skip dword field (1 byte instruction)
        mov     cl,4            ; ECX was zero from REP MOVSD
        rep     movsd           ; Copy 4 more dwords
        cmpsw                   ; Skip word field (2 byte instruction in USE32 code)
        movsd                   ; Move the last field
- Since we know that ESI and EDI are pointing to valid memory locations and won't cause a Trap-D if memory is accessed via them, the CMPSB, CMPSW, and CMPSD instructions will perform the additions to ESI and EDI that we were looking for.
- In the above example, we saved 9 bytes using this trick.
In this example, we can knock off a couple more bytes by realizing that the count of 8 initially being loaded into ECX is a small number. That MOV immediate to ECX is a 5 byte instruction. If we replaced it with:
        push    8
        pop     ecx
we'd have a 3 byte sequence.
When space is at a premium and we see the sequence:
        dec     cx              ; or DEC ECX in 32 bit code
        jnz     SomeLoopTop
it can usually be replaced by a LOOP instruction, which will be a byte smaller. After all, what LOOP does is decrement CX or ECX and jump if the result is non-zero.
The one thing to watch out for when doing this transformation is code that depends on the state of the flags after the DEC instruction. LOOP doesn't alter the flags, but DEC does.
Don't do this transformation in very speed critical code that may be running on 386 or later CPU's. The DEC CX/JNZ sequence is actually faster than a LOOP on many of the newer CPU's.
Here's a little Zen Timer program to see how LOOP versus DEC/JNZ performs:
        call ZtimerOn
        mov cx, 32000
        even
loop1:
        loop loop1
        call ZtimerOff
        call ZTimerReport

        call ZtimerOn
        mov cx, 32000
        even
loop2:
        dec cx
        jnz loop2
        call ZtimerOff
Running this on several different test machines yielded these results (all timings are in microseconds):
                    LOOP        DEC CX/JNZ
Pentium 60mhz        4255          2659
386DX-20            19754         18122
386SX-25            18063         16705
486SX-25             8955          5118
486DX2-66           10574         17302
The timings for the 486DX2-66 looked a little out of place, so I went back and changed the "EVEN" directive in the test to an "ALIGN 16" to make sure the loop tops were paragraph aligned. Curiously, this didn't change anything for the 486SX-25, but it made a big difference on the 486DX2-66! With the paragraph aligned loop top, the 486DX2-66's timing on the DEC CX/JNZ sequence was 9613 microseconds. The timing for the sequence using the LOOP instruction stayed the same.
What's really odd here is that the 486DX2-66 is getting beat by the 486SX-25 on this test - even when everything was paragraph aligned for the 486DX2-66. I know from past benchmarking of the motherboard the 486DX2-66 sits in that it's a fairly slow motherboard compared to other systems in the same class. Its external cache hardware is less effective than that of most other 486DX2-66 systems. Even so, the looping code should be parked in the CPU's internal cache for this test. The 486SX-25 motherboard has no external cache at all, so there's no possibility that it's a factor in its timings.
By the way, Appendix B has some further Zen Timer investigation of the strange performance characteristics of my 486DX2-66. The DX2 is indeed an odd beast.
The XCHG instruction can be a powerful weapon for squeezing bytes. When XCHG is used with AX (or EAX in 32 bit code) and one of the other registers, then XCHG is a one byte instruction, while moving one register to another is a two byte instruction. For example:
        XCHG EAX,EDX    ; This is 1 byte in 32 bit code.
        XCHG SI,AX      ; This is 1 byte in 16 bit code.
        MOV EAX,EDX     ; This is 2 bytes in 32 bit code.
        MOV SI,AX       ; This is 2 bytes in 16 bit code.
In fact, the one byte NOP instruction's encoding (90h) is actually that of an XCHG (E)AX,(E)AX instruction.
Suppose we've got an instruction sequence like this one:
        mov ax,dx
        mov dx,Something
It wants AX to get DX's value, and then DX gets reloaded right afterward. Sequences like that can be transformed with XCHG to good effect like this:
        xchg ax,dx          ; AX=DX, DX=don't care
        mov dx,Something
This sequence is a byte smaller than the one using the MOV we just saw.
When using XCHG as a destructive move like this, I recommend adding a comment similar to that up above. This helps to clarify what the code is really doing. Saying that DX was a "don't care" makes it crystal clear that you're using the XCHG as a destructive one byte move. Comments like this are particularly useful when the instruction that reloads the destroyed register isn't right next to the XCHG instruction like in the examples.
Another stylistic thing that helps when using XCHG as a destructive move is to order the operands as if it were a MOV instruction like in the above examples. That XCHG example above could have been written as:
xchg dx,ax
The code would work perfectly well, but this form just doesn't seem as clear to me.
With operands ordered like a MOV's and with a comment added, there's no question about what's going on in the code.
The LEA instruction is good for loading effective addresses, but there's no requirement that the values it deals with actually be addresses. LEA also works well as a multi component arithmetic instruction.
Suppose we have an instruction sequence in some USE16 code like this one:
(2 bytes)   mov ax, si
(2 bytes)   add ax, bx
(3 bytes)   add ax, 1234h
That sequence can be reduced to a single LEA instruction:
(4 bytes) lea ax, [bx+si+1234h]
TIP: The LEA instruction doesn't alter the flags. This can often be a very convenient behavior.
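As a quick illustration (a made-up fragment, not from any particular program), LEA can bump a pointer between a compare and its conditional jump without disturbing the result of the compare:

        cmp al,[si]         ; The compare sets the flags
        lea si,[si+2]       ; Advance SI - an ADD here would zap the flags
        je  Matched         ; The jump still tests the CMP's result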
The real power of the LEA instruction for compacting code shows up in 386 specific code. With the 386 came the ability to have the EAX, ECX, and EDX registers participate in an LEA. The 386 also added the ability to have a scaling factor of 2, 4, or 8 applied to one of the component registers, like in this next example assembler listing:
 1                                      .386
 2  00000000                        x32 segment use32 para public 'code32'
 3                                      assume cs:x32
 4  00000000  66| 8D 84 88 +            lea ax,[eax+ecx*4+1234h]
 5            00001234
 6
 7  00000008                        x32 ends
 8
 9  0000                            x16 segment use16 para public 'code32'
10                                      assume cs:x16
11  0000      67| 8D 84 88 +            lea ax,[eax+ecx*4+1234h]
12            00001234
13  0008      51                        push cx
14  0009      C1 E1 02                  shl cx,2
15  000C      03 C1                     add ax,cx
16  000E      59                        pop cx
17  000F      05 1234                   add ax,1234h
18  0012                            x16 ends
19                                      end
Here we've got USE32 and USE16 code segments with a couple of LEA instructions that are doing a lot of work. The LEA's on lines 4 and 11 are using the scaling feature of LEA on the 386 to do a multiply by 4 on the value in the ECX register during the additions that LEA does.
If we had to get the value of CX * 4 on a pre-386 CPU, while not destroying CX, we would be forced into doing something like the sequence starting on line 13 that approximates the behavior of the LEA on line 11. If it were stipulated that the flags couldn't be altered in that sequence, then a PUSHF/POPF costing 2 bytes would need to be wrapped around the whole thing too.
Even without the PUSHF/POPF, using the new 386 form of the LEA turned out to be 2 bytes shorter than the sequence starting on line 13.
The previous example's LEA's are kind of worst case versions too. They're causing a 4 byte embedded constant 00001234h to be generated in the instructions (see lines 5 and 12). The LEA instruction can also work with small sign extended constants like in this next listing:
 1                                      .386
 2  00000000                        x32 segment use32 para public 'code32'
 3                                      assume cs:x32
 4  00000000  66| 8D 44 88 33           lea ax,[eax+ecx*4+33h]
 5
 6  00000005                        x32 ends
 7
 8  0000                            x16 segment use16 para public 'code32'
 9                                      assume cs:x16
10  0000      67| 8D 44 88 33           lea ax,[eax+ecx*4+33h]
11  0005      51                        push cx
12  0006      C1 E1 02                  shl cx,2
13  0009      03 C1                     add ax,cx
14  000B      59                        pop cx
15  000C      05 0033                   add ax,33h
16  000F                            x16 ends
17                                      end
In this example listing the LEA's are 3 bytes shorter than in the previous listing because the constant 33h fits in a sign extended byte. Now the LEA in the USE16 segment is half the size of the equivalent sequence that was forced to preserve the value in CX.
In the USE32 version of LEA on line 4 notice the 66h prefix byte on the instruction. This is the cost of specifying AX as a target in USE32 code. If we could stipulate that the high word of EAX was expendable, and just do the LEA using EAX as a target, then the instruction would be a byte smaller.
Another interesting property of the 386's new 32 bit form of LEA is that the "index" and "base" register can be the same thing. This opens up all sorts of possibilities for doing multiply operations on what previously were inconvenient numbers - like, for example, 10. If we wanted to multiply EAX by 10, one straightforward way would be:
(5 bytes)   mov edx,10
(2 bytes)   mul edx             ; EDX:EAX = EAX*10
That's 7 bytes, and kind of porky if we're trying to save space. We could get it down to 5 bytes with the PUSH/POP trick examined in a prior chapter like this:
(2 bytes)   push 10
(1 byte)    pop edx
(2 bytes)   mul edx
5 bytes is better, but there's still some potentially undesirable side effects. EDX is getting clobbered by the multiply because a 64 bit result is being generated. If we know the numbers involved happened to be small and we weren't going to get any significant result in EDX, then we just burned a register for nothing.
Suppose we used an LEA here like this though:
(3 bytes)   lea eax,[eax+eax*4]     ; EAX = EAX*5
(2 bytes)   add eax,eax             ; EAX = EAX*10
This is still 5 bytes, but we've avoided burning the EDX register this time. There's an important performance advantage to this last sequence too. MUL instructions are really expensive, even on the Pentium, where they cost about 10 or 11 clock cycles. On a 486, a MUL can cost from as little as 13 clocks all the way up to 42 clocks for a 32 bit multiply.
486's and Pentiums can knock off an LEA instruction in a single cycle though! The register add of EAX,EAX in the last code fragment only costs a single cycle on a 486 and Pentium too.
So, anytime you can replace an expensive unsigned multiply (MUL) with a short sequence of LEA's and ADD's, you're going to be winning bigtime on the newer CPU's.
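To show the pattern, here are a few more multiplies by inconvenient constants built from nothing but LEA's (sketches of my own, not tuned for any particular program):

        lea eax,[eax+eax*2]     ; EAX = EAX*3
        lea eax,[eax+eax*8]     ; EAX = EAX*9

and chaining them works too:

        lea eax,[eax+eax*4]     ; EAX = EAX*5
        lea eax,[eax+eax*2]     ; EAX = EAX*15 (5*3)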
If you know for certain that you could live with doing a signed multiply, then the way to go for small numbers would be an IMUL instruction. IMUL has a special form that allows multiplying a register by an immediate value. For example:
(3 bytes) imul eax,10
This is 2 bytes better than the LEA/ADD combination, but it does do a signed multiply!
One situation that crops up with annoying regularity in code is needing to do a conditional jump based on a register or memory variable being zero or not, but where the state of the flags needs to be preserved across the conditional jump. This frequently happens when functions give a return value in the ZF or CF flag and the code is running with interrupts disabled. An OR or CMP followed by a JZ/JE or JNZ/JNE would alter the flags, so often we'll see code that does some sort of twiddle involving PUSHF/POPF to preserve the flags like this code fragment:
IFsensitive proc near
            cli
            ;
            ; Some uninterruptable code is here
            ;
(1 byte)    pushf
(2 bytes)   xor ax,ax
(4 bytes)   cmp byte ptr Flag,al
(2 bytes)   je ContinueAX0
ContinueAX1:
(1 byte)    inc ax
ContinueAX0:
(8 bytes)   POPFF
            ;
            ; More uninterruptable code is here
            ;
            sti
            ret
IFsensitive endp
Notice the POPFF macro that was used rather than a straight POPF instruction. The stipulation was that interrupts are disabled, so using a straight POPF in this section of code would have made it vulnerable to the POPF bug that exists on early stepping levels of the 80286 chips. POPFF macros are typically gruesome in terms of performance and size. There's a CALL, a couple of PUSH's, a JMP, and an IRET (a very slow instruction) involved. A typical POPFF macro implementation might look something like this 8 byte sequence:
POPFF macro
        pushf       ;; Get the flags
        push cs     ;; and a far return
        call $+5    ;; address on the stack, then near CALL an IRET
        jmp $+3     ;; Continue on with main flow of the code after the IRET
        iret        ;; This pops the far return address and the flags
endm
Realizing that JCXZ can do a conditional jump without needing to zap the flags would allow us to ditch the POPFF macro and rewrite that code like this:
IFsensitive proc near
            cli
            ;
            ; Some uninterruptable code is here
            ;
(1 byte)    push cx
(3 bytes)   mov ax,0
(2 bytes)   mov cx,ax
(4 bytes)   mov cl,byte ptr Flag
(2 bytes)   jcxz ContinueAX0
ContinueAX1:
(2 bytes)   mov al,1
ContinueAX0:
(1 byte)    pop cx
            ;
            ; More uninterruptable code is here
            ;
            sti
            ret
IFsensitive endp
Even with all the jerking around to setup the CX register and save/restore it, this new sequence beat the one that used the POPFF macro by 3 bytes. It's going to run a lot faster too. Even in real mode or V86 mode on a Pentium, an IRET alone takes 8 clocks.
Suppose we had a couple of C functions that were accessing a global memory variable like this:
int x;

void foo(int a, int b, int c, int d, int e)
{
    if (a) x++;
    if (b) x++;
    if (c) x++;
    if (d) x++;
    if (e) x++;
}

void foo2(int a, int b, int c, int d, int e)
{
    register int tempx = x;

    if (a) tempx++;
    if (b) tempx++;
    if (c) tempx++;
    if (d) tempx++;
    if (e) tempx++;
    x = tempx;
}
Both of these functions have well defined entry and exit points. It would be perfectly valid for a C compiler to transform the code for foo() into the way foo2() is written for the purposes of generating machine code. The semantics of the code would be the same in both cases.
I compiled those functions with the 32 bit Watcom version 10.0 compiler using the -s (omit stack probes) and -os (favor space) switches and got this result though:
Module: D:\TESTS\track.c
Group: 'DGROUP' CONST,CONST2,_DATA,_BSS

Segment: _TEXT BYTE USE32 0000006b bytes
0000  55                    foo_    push ebp
0001  89 e5                         mov ebp,esp
0003  85 c0                         test eax,eax
0005  74 06                         je L1
0007  ff 05 00 00 00 00             inc dword ptr _x
000d  85 d2                 L1      test edx,edx
000f  74 06                         je L2
0011  ff 05 00 00 00 00             inc dword ptr _x
0017  85 db                 L2      test ebx,ebx
0019  74 06                         je L3
001b  ff 05 00 00 00 00             inc dword ptr _x
0021  85 c9                 L3      test ecx,ecx
0023  74 06                         je L4
0025  ff 05 00 00 00 00             inc dword ptr _x
002b  83 7d 08 00           L4      cmp dword ptr +8H[ebp],00000000H
002f  74 06                         je L5
0031  ff 05 00 00 00 00             inc dword ptr _x
0037  5d                    L5      pop ebp
0038  c2 04 00                      ret 0004H

003b  56                    foo2_   push esi
003c  55                            push ebp
003d  89 e5                         mov ebp,esp
003f  89 c6                         mov esi,eax
0041  a1 00 00 00 00                mov eax,_x
0046  85 f6                         test esi,esi
0048  74 01                         je L6
004a  40                            inc eax
004b  85 d2                 L6      test edx,edx
004d  74 01                         je L7
004f  40                            inc eax
0050  85 db                 L7      test ebx,ebx
0052  74 01                         je L8
0054  40                            inc eax
0055  85 c9                 L8      test ecx,ecx
0057  74 01                         je L9
0059  40                            inc eax
005a  83 7d 0c 00           L9      cmp dword ptr +0cH[ebp],00000000H
005e  74 01                         je L10
0060  40                            inc eax
0061  a3 00 00 00 00        L10     mov _x,eax
0066  5d                            pop ebp
0067  5e                            pop esi
0068  c2 04 00                      ret 0004H

No disassembly errors
Remember what I said previously about looking at listing files for "wide" instructions? The generated code for foo() is 3Ah bytes long and has a bunch of awfully wide instructions there doing memory increments on the variable "x".
The generated code for foo2() where we introduced the temp register variable that's tracking the value of "x" through the life of the foo2() code is a lot better. It's only 2Fh bytes long (11 bytes smaller than foo()) and potentially touches memory a lot less depending on what kind of values are passed as parameters to foo2().
The same thing we tricked the Watcom compiler into doing can often be applied to assembler routines. Beginning programmers, or someone who is just in a hurry, will frequently treat the CPU as if it didn't have any registers. The result is fatter and slower code in many cases.
The rule of thumb here is: when you see a routine referencing a memory variable more than two times it may be a candidate for a transformation like the one above if there's a register free. Often it'll pay to force a register to be free by PUSH'ing and POP'ing something at the entry and exit to the routine. Remember, PUSH and POP are one byte instructions for registers, and on the newer CPU's, like the 486's and Pentiums, PUSH/POP run pretty fast too.
Don't hesitate to use PUSH/POP when accesses to memory variables are grouped together in a routine and registers are scarce. You may encounter code that looks something like this:
Var1    dd ?
Var2    dd ?
Var3    dd ?
Var4    dd ?

(10 bytes)  mov Var1,1234
( 6 bytes)  add Var1,ebx
( 7 bytes)  shl Var1,2
(10 bytes)  mov Var2,5678
( 6 bytes)  add Var2,ebx
( 7 bytes)  shl Var2,2
(10 bytes)  mov Var3,4321
( 6 bytes)  add Var3,ebx
( 7 bytes)  shl Var3,2
(10 bytes)  mov Var4,8765
( 6 bytes)  add Var4,ebx
( 7 bytes)  shl Var4,2
( 5 bytes)  mov eax, Var1
( 6 bytes)  add Var2, eax

(103 bytes total)
Even if there's only one register available here for use, say EBP, it'll pay to "Scope" the value of Var1 onto the stack with a PUSH because it's going to be referenced later on where Var1 gets added to Var2 near the end of the sequence.
Tracking Var1, Var2, Var3, and Var4 all in EBP would look like this:
(1 byte)    push ebp            ; Save EBP.
(5 bytes)   mov ebp,1234        ; EBP is tracking Var1 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(1 byte)    push ebp            ; Var1 is on the stack now
(5 bytes)   mov ebp,5678        ; EBP is tracking Var2 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(6 bytes)   mov Var2,ebp        ; Var2 is in memory now.
(5 bytes)   mov ebp,4321        ; EBP is tracking Var3 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(6 bytes)   mov Var3,ebp        ; Var3 is in memory now.
(5 bytes)   mov ebp,8765        ; EBP is tracking Var4 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(6 bytes)   mov Var4,ebp        ; Var4 is in memory now.
(1 byte)    pop ebp             ; EBP is tracking Var1 again
(6 bytes)   add Var2,ebp        ;
(6 bytes)   mov Var1,ebp        ; Var1 is in memory now.
(1 byte)    pop ebp             ; Restore EBP value

(74 bytes total)
Not too bad. We saved 29 bytes by tracking the variables in a register. This code is going to run a lot faster and be easier on the working set of whatever program it lived in.
Is this as small as it can get though? No. We can also apply the concept of variable pooling introduced earlier to get something like this:
VarPool label dword
V1off   equ $-VarPool
Var1    dd ?
V2off   equ $-VarPool
Var2    dd ?
V3off   equ $-VarPool
Var3    dd ?
V4off   equ $-VarPool
Var4    dd ?

(1 byte)    push ebp            ; Save EBP.
(1 byte)    push edi            ; Save EDI.
(6 bytes)   lea edi,VarPool     ; Aim EDI at the variable pool
(5 bytes)   mov ebp,1234        ; EBP is tracking Var1 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(1 byte)    push ebp            ; Var1 is on the stack now
(5 bytes)   mov ebp,5678        ; EBP is tracking Var2 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(3 bytes)   mov [edi+V2off],ebp ; Var2 is in memory now.
(5 bytes)   mov ebp,4321        ; EBP is tracking Var3 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(3 bytes)   mov [edi+V3off],ebp ; Var3 is in memory now.
(5 bytes)   mov ebp,8765        ; EBP is tracking Var4 now
(2 bytes)   add ebp,ebx
(3 bytes)   shl ebp,2
(3 bytes)   mov [edi+V4off],ebp ; Var4 is in memory now.
(1 byte)    pop ebp             ; EBP is tracking Var1 again
(3 bytes)   add [edi+V2off],ebp ;
(3 bytes)   mov [edi+V1off],ebp ; Var1 is in memory now.
(1 byte)    pop edi             ; Restore EDI value
(1 byte)    pop ebp             ; Restore EBP value

(67 bytes total)
Using the variable pool technique knocked another 7 bytes off for 67 total. Now this code is starting to get fairly tight for USE32 code, which normally tends toward plumpness. If we could stipulate that the value in ECX could be destroyed in this section of code, we could do CL based shifts rather than the immediate shifts being used now. That would be good for another two byte savings. The SHL EBP,CL instruction will be 2 bytes versus 3 for the immediate shifts. There are 4 shifts for a total of 12 bytes currently. Using CL shifts would be 8 bytes, plus another 2 to load the value 2 into CL with a MOV instruction.
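A sketch of the CL based version of one of those shifts, assuming ECX really is expendable here:

(2 bytes)   mov cl,2            ; Load the shift count once up front
            ...
(2 bytes)   shl ebp,cl          ; 2 bytes apiece versus 3 for SHL EBP,2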
When interfacing C/C++ code with assembler modules, we'll often see the assembler code setup a "standard" stack frame just like a C compiler would to get at any pushed parameters. In some cases this isn't necessary though and just bloats and slows down the code.
Here's a C test program and 4 assembler functions that demonstrate the differences in performance and size of using a "standard" frame to access parameters versus popping them off the stack and returning via a JMP.
/*----------------------------------------------------
   SPTEST.C

   Demonstration of performance costs of different
   calling conventions and ways of accessing
   parameters on the stack.
----------------------------------------------------*/
#include <stdio.h>
#include <time.h>

typedef unsigned short USHORT;

extern USHORT near SplitNibbles(USHORT val);
extern USHORT near SplitNibbles2(USHORT val);
extern USHORT near pascal PasSplitNibbles(USHORT val);
extern USHORT near pascal PasSplitNibbles2(USHORT val);

#define ITERATIONS 10000000L    /* 10M */

int main(void)
{
    unsigned long i;
    time_t start,stop,elapsed;

#define TESTLOOP for (i=0L; i<ITERATIONS; i++)

    /* POP's and JMP using "standard" C conventions */
    start = time(NULL);
    TESTLOOP
    {
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
        SplitNibbles2(0x0099); SplitNibbles2(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using POP's and JMP          = %lu seconds\n",
           (unsigned long)elapsed);

    /* BP frame access using "standard" C conventions */
    start = time(NULL);
    TESTLOOP
    {
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
        SplitNibbles(0x0099); SplitNibbles(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using standard frame         = %lu seconds\n",
           (unsigned long)elapsed);

    /* POP's and JMP using "pascal" conventions */
    start = time(NULL);
    TESTLOOP
    {
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
        PasSplitNibbles2(0x0099); PasSplitNibbles2(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using POP's and JMP(Pascal)  = %lu seconds\n",
           (unsigned long)elapsed);

    /* BP frame access using "pascal" conventions */
    start = time(NULL);
    TESTLOOP
    {
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
        PasSplitNibbles(0x0099); PasSplitNibbles(0x0099);
    }
    stop = time(NULL);
    elapsed = stop-start;
    printf("Using standard frame(Pascal) = %lu seconds\n",
           (unsigned long)elapsed);

    return 0;
}
;--------------------------------------------
; SPLIT.ASM
;
; Support functions to show the performance
; difference with different methods of
; accessing parameters and different calling
; conventions.
;--------------------------------------------
_TEXT segment use16 para public 'code'
;
; This function uses a "standard" stack
; frame to get at its parameters and is
; called using "standard" calling conventions
;
        align 16
_SplitNibbles proc near
        public _SplitNibbles
        push bp             ; 1 byte
        mov bp,sp           ; 2 bytes
        mov ax,[bp+4]       ; 3 bytes
        db 0D4h,10h         ; 2 bytes
        pop bp              ; 1 byte
        ret                 ; 1 byte
_SplitNibbles endp
;
; This function uses a "standard" stack
; frame to get at its parameters and is
; called using "pascal" calling conventions
;
        align 16
PASSPLITNIBBLES proc near
        public PASSPLITNIBBLES
        push bp             ; 1 byte
        mov bp,sp           ; 2 bytes
        mov ax,[bp+4]       ; 3 bytes
        db 0D4h,10h         ; 2 bytes
        pop bp              ; 1 byte
        ret 2               ; 3 bytes
PASSPLITNIBBLES endp
;
; This function uses the POP trick to
; get at its parameters and is called
; using "standard" calling conventions
;
        align 16
_SplitNibbles2 proc near
        public _SplitNibbles2
        pop cx              ; 1 byte
        pop ax              ; 1 byte
        push ax             ; 1 byte
        db 0D4h,10h         ; 2 bytes
        jmp cx              ; 2 bytes
_SplitNibbles2 endp
;
; This function uses the POP trick to
; get at its parameters and is called
; using "pascal" calling conventions
;
        align 16
PASSPLITNIBBLES2 proc near
        public PASSPLITNIBBLES2
        pop cx              ; 1 byte
        pop ax              ; 1 byte
        db 0D4h,10h         ; 2 bytes
        jmp cx              ; 2 bytes
PASSPLITNIBBLES2 endp

_TEXT ends
        end
I built the SPTEST.C program with the Microsoft version 8.0 C/C++ compiler and ran two trials on several test machines. The results were interesting. All timings are in seconds.
                        386SX-25  486SX-25  60mhz P5  486DX2-66  386DX-20
POP/JMP(Std C)          270/271   126/127   47/47     239/239    332/333
Std frame(Std C)        327/326   137/137   51/52     246/246    359/359
POP/JMP(Pascal)         229/229   112/112   48/48     207/206    302/302
Std frame(Pascal)       264/265   128/127   57/57     242/243    343/343
The POP/JMP scheme looks to be a winner when either Pascal or standard C calling conventions are used, with the Pascal calling convention form being the fastest (and the smallest too!).
What's most notable here is the fairly dramatic speed differences between standard C calling convention with standard stack frame access versus POP/JMP with Pascal calling conventions.
The 386SX-25 shows just over a 40% difference between the best and the worst, while the Pentium is only about 6%. The 486's and 386DX are in the middle around 20%.
The 80186 and subsequent processors have the ability to push and pop all the general registers with single byte instructions. The NEC V20/V30 can do this too by the way. When saving space is of paramount importance, this ability can come in very handy.
Often we'll see a code sequence that looks like this:
(1 byte)    push ax
(1 byte)    push si
(1 byte)    push di
            ;
            ; Some code is here that causes AX,SI, and DI
            ; to be destroyed, but they need to be preserved.
            ;
(1 byte)    pop di
(1 byte)    pop si
(1 byte)    pop ax
That sequence works OK, but if we're willing to pay the price for sloshing all the general registers on and off the stack it can be reduced to this:
(1 byte)    pusha
            ;
            ; Some code is here that causes AX,SI, and DI
            ; to be destroyed, but they need to be preserved.
            ;
(1 byte)    popa
This transformation saved 4 bytes over the version that pushed and popped the affected registers individually. Naturally we wouldn't want to employ a change like this one if one or more of the general registers were being used for passing back return values.
While the space savings are excellent with PUSHA/POPA, the performance cost can be pretty hefty if only a few registers need to be saved and restored. This ZenTimer program demonstrates the cost differences:
        .186

; PUSH/POP of 2 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 3 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 4 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 5 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        push dx
        pop dx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 6 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        push dx
        push bp
        pop bp
        pop dx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSH/POP of 7 registers
        align 4
        nop
        call ZtimerOn
        rept 1000
        push si
        push di
        push bx
        push cx
        push dx
        push bp
        push ax
        pop ax
        pop bp
        pop dx
        pop cx
        pop bx
        pop di
        pop si
        endm
        call ZtimerOff
        call ZtimerReport

; PUSHA/POPA
        align 4
        nop
        call ZtimerOn
        rept 1000
        pusha
        popa
        endm
        call ZtimerOff
Running this program on several test CPU's yielded the following results. All of these timings are in microseconds.
             NEC V20    386SX-25  386DX-20  486SX-25  60mhz Pentium
2 push/pop     11602         714       926       440             70
3 push/pop     17467        1145      1430       743            110
4 push/pop     23313        1533      1894       993            158
5 push/pop     29093        1898      2396      1242(*)         182
6 push/pop     35198(*)     2260      2855      1480            320(*)
7 push/pop     40730        2648(*)   3352(*)   1744            352
pusha/popa     32490        2289      2934      1133            299

(*) - break even point where PUSHA/POPA outperform separate pushes and pops.
It's interesting that on all these CPU's there came a point where a PUSHA/POPA eventually outperformed separate pushes and pops. If performance is paramount in an application, I'd say picking a threshold of about 6 registers is a good compromise when choosing between separate pushes/pops or a pusha/popa. If the code is in a low performance path, then you may want to employ pusha/popa heavily to cut the size of the code. As always, let a profiler be the guide for where to apply transformations like this. Porting older 8086 code to take advantage of the newer CPU's features is where you'll likely see lots of opportunities for using pusha/popa.
Suppose we had a function that wanted to give a return value in one of the general purpose registers? Does this preclude using PUSHA/POPA as a short way to save and restore the other registers? Not necessarily!
There's no reason why we can't insert the value to be returned onto the stack so it will be popped off with the POPA. Let's take EAX as an example in some 32 bit code. We could do something like this:
(4 bytes)   mov [esp+28],eax
            popad
            ret
Even though the MOV costs 4 bytes, there may still be a space payoff if more than 3 registers would have needed to be pushed and popped otherwise.
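The same trick works in 16 bit code too, though since 16 bit addressing can't use SP as a base register, BP has to be pressed into service first. A sketch (PUSHA leaves the saved AX image 14 bytes above the final stack pointer, and POPA restores BP anyway):

        mov bp,sp           ; Can't address off SP directly in 16 bit code
        mov [bp+14],ax      ; Overwrite the AX image PUSHA saved
        popa                ; BP gets restored here too
        ret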
The order in which PUSHA and POPA process registers can sometimes be useful to know. Suppose a function was to return a value in the DI or EDI register? It turns out that DI/EDI is the last register that gets pushed, and the first to get popped. To return a value in DI or EDI we could use sequences like this:
            USE16 code          USE32 code
(1 byte)    pop ax              pop eax         ; Remove saved DI/EDI
(1 byte)    push di             push edi        ; Replace it with return value
            popa                popad
            ret                 ret
In this case, returning a value in DI or EDI cost only two extra bytes. This would pay off if 3 or more registers would have needed to be saved and restored otherwise.
The 80186 and subsequent processors have the ability to use immediate values for shifts, rotates, and signed multiplies. The NEC V20/V30 can do this too. When porting code written for the 8086 to the newer processors, don't forget these new features. They can save a lot of space and make the code run faster.
One common sequence seen in older 8086 code is something like this one:
(2 bytes)   shl cx,1
(2 bytes)   shl cx,1
(2 bytes)   shl cx,1
The 8086 couldn't do a CL based shift when the value to be shifted or rotated happened to be in the CX or CL register. A sequence like the one above should always be converted to use an immediate shift or rotate like this for the newer CPU's:
(3 bytes) shl cx,3
The only time you should use a CL based shift on the newer CPU's is when the shift count is unknown until runtime. One added benefit, besides being faster, is that the immediate shifts and rotates will free up the CL register for other uses. Having an extra register available for use often allows further speed and space optimizations. Even if you just moved a zero into CL and used it for zeroing byte variables in memory, it's going to result in instructions that are a byte smaller than if you had to do a move of an immediate zero to a variable.
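For example, assuming CL is now free and ByteVar is a hypothetical byte memory variable:

(2 bytes)   mov cl,0                ; Or pick the zero up as a side effect
(4 bytes)   mov byte ptr ByteVar,cl ; A byte smaller than the 5 byte MOV ByteVar,0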
One thing programmers miss all the time is realizing that doing some operation may imply that a useful result was generated as a side effect. These serendipitous side effects can often result in a significant size and speed savings if we're alert enough to notice when they exist.
Consider this little code fragment:
(2 bytes)   or ax,ax
(2 bytes)   jnz SomeWhere
(2 bytes)   xor dx,dx
By realizing that AX must have been zero for a fall through to happen on the conditional jump, we can apply the CWD trick to zero DX with a one byte instruction like this:
(2 bytes)   or ax,ax
(2 bytes)   jnz SomeWhere
(1 byte)    cwd             ; DX=0
Another one I see with annoying regularity in code is this little sequence:
(2 bytes)   xor ax,ax       ; Set return code=0
(1 byte)    clc             ; CF=0 for success
(1 byte)    ret             ; Return
By realizing that one of the side effects of XOR'ing a register with itself is that the carry flag gets cleared, we can safely eliminate the CLC and turn that 4 byte sequence into a 3 byte sequence. For an instruction that does nothing but flip a bit in the flags, CLC is pretty expensive. It's a two clock instruction on all the CPU's.
One opportunity that occurs all the time is with repeated string instructions. You'll frequently encounter sequences like this:
(3 bytes)   mov cx,5
(2 bytes)   rep movsb
(5 bytes)   mov [ByteMemVar],0
By noticing that CX was zeroed by the REP MOVSB, that sequence can always be transformed into this:
(3 bytes)   mov cx,5
(2 bytes)   rep movsb
(4 bytes)   mov [ByteMemVar],cl
In any piece of code there's going to be hundreds, or even thousands, of side effect results produced as the code runs. The more you recognize, the tighter and faster the code is going to be.
Another possibility is that we might inject beneficial side effects into some functions we call. Suppose upon examining a program we notice that a variable happens to be set to zero soon after calling a particular function most of the time. Maybe it would be useful to have that function, as a side effect, zero one of the registers on return rather than leaving garbage in it. Then we could be assured that a zero was in that register every time we came back from that function - a zero that could be used to set the memory variable with.
We'll pay a space penalty to zero the register in one place, but enjoy the benefits of knowing the register is zero on return from every place that function is called.
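Here's a sketch of the idea (SomeFunc and Var are hypothetical names). The function pays 2 bytes once; every call site that needed a zero afterward saves 2 bytes:

SomeFunc proc near
        ; ...the function's real work is here...
        xor dx,dx       ; Documented side effect: DX=0 on return
        ret
SomeFunc endp

        call SomeFunc
        mov Var,dx      ; 4 bytes versus 6 for MOV word ptr Var,0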
Frequently we'll see code sequences like this:
        call foo
        ret
If foo() is a near function and the RET is a near RET, or if foo() is a far function and the RET is a far RET, then it's usually a good bet that the sequence can be replaced by a JMP to foo() rather than the CALL. After all, foo() is going to execute a RET to get back to the caller, so foo()'s RET can do double duty and return for the caller as well.
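The transformed sequence is just:

        jmp foo         ; foo's RET now returns straight to our caller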
WARNING: One instance where you won't want to do this is when foo() is looking back onto the stack frame of its caller to grope something off of it at a particular offset. Another is when there's a possibility that foo() may not return, and the place it jumps to expects a particular stack layout.
The benefits of a transformation like this are several. The JMP will be a byte smaller than the CALL/RET sequence. The code will also run faster because fewer return addresses are being sloshed on and off the stack. The pipeline and prefetchers are happier because they only get blown away once by foo()'s RET rather than twice. On the 486's and Pentiums with their on chip caches, the JMP may also result in one less cache line load if the cache line containing the CALL/RET had been purged by the time the foo() function returns. To force a cache line load just because we needed a single RET instruction is pretty tragic for performance when it happens. Remember from the earlier chapters just how sensitive the 486's and Pentiums are regarding their on chip caches. This merging of CALL/RET into a JMP is one of the few tricks that truly is a freebie when it can be applied. The code is always smaller, and it always runs faster. What more could you ask for?
A related topic to translating CALL/RET to JMP is the far call translation to a near call. Suppose we've got some far function like these:
func1 proc far
        ; Blah, blah, blah
        ret
func1 endp

func2 proc far
        ; Blah, blah, blah
        call func1
        ; Blah, blah, blah
        ret
func2 endp
Some assemblers will generate a 5 byte far CALL instruction where func2() calls func1(). We can do better here though since we know func1() and func2() are in the same code segment. We can force a near call by doing this:
func1 proc far
        ; Blah, blah, blah
        ret
func1 endp

func2 proc far
        ; Blah, blah, blah
        push cs
        call near ptr func1
        ; Blah, blah, blah
        ret
func2 endp
The end result of this transformation is the same as if a far CALL had been done, but it's going to be a byte smaller.
By the way, some newer assemblers will recognize that the far call can in fact be transformed into the push of CS followed by a near call. Check the code your assembler is generating to see if this is something you don't need to worry about.
There's significant performance benefits to doing a transformation like this as well. The PUSH CS/near CALL sequence almost always executes faster - on some CPU's it executes a LOT FASTER.
A far call on a 286 takes 26 clock cycles in protected mode, but the 286 can execute the push followed by a near call in 10 clock cycles. In real mode a far call on a 286 takes 13 clock cycles, while the push followed by a near call takes the same 10 clock cycles.
A far call on a 386 takes just over 34 clock cycles in protected mode, but the 386 can execute the push followed by a near call in just over 9 clock cycles. In real mode a far call on a 386 takes just over 17 clock cycles, while the push followed by a near call takes the same 9+ clock cycles.
A far call on a 486 takes 20 clock cycles in protected mode, but the 486 can execute the push followed by a near call in 6 clock cycles. In real mode a far call on a 486 takes 18 clock cycles, while the push followed by a near call takes the same 6 clock cycles.
So, it should be obvious that transforming a far call whenever possible into a push CS followed by a near call is a solid winner.
Many linkers have an option to do a far call translation for you. They'll detect a far call instruction followed by a double word relocation address where the segment part of the relocation is the current segment. Recent versions of Borland's TLINK will do this by default and there's a switch to turn it off. Microsoft's LINK has the /FARCALLTRANSLATION and /NOFARCALLTRANSLATION options.
Having the linker do the translation has one downside though. Since the push CS followed by a near call is a byte shorter than a far call, the linker will typically pad the code out with a NOP to make everything line up.
This last topic is a real sleazy one and should probably be reserved for the most dire of space saving conditions because the performance penalty is really gruesome! The payoff is only one or two bytes saved for each instance it's used too. This is the kind of thing you might try when every other trick imaginable has been tried and you've still come up a little short trying to cram some code into a small ROM chip for an embedded 80186 system.
All the Intel and compatible chips from the 80186 onward implement an illegal opcode exception on Trap #6. All the chips, including the 8086, implement the Int 3 breakpoint exception as well.
An Int 3 is a special form of software interrupt in that the instruction is only one byte, whereas all the other software interrupt instructions are two bytes. Having a single byte instruction that's capable of executing an unconditional control transfer opens up some interesting possibilities.
An Int 3 could be used as a special purpose instruction to CALL a particular function that's CALL'ed from many different places in the code. A near call is 3 bytes, so using the Int 3 would save two bytes per invocation. If 100 instances of CALL's to a particular function existed in a program, making those calls via an Int 3 could save 200 bytes minus the overhead of setting up the Int 3 vector initially. The called routine could return via an IRET if the returned flags value didn't matter, or via a far RET 2 instruction if the flags were important on return. Grabbing the Int 3 vector and aiming it at this special routine during initialization isn't going to cost anywhere near 200 bytes, so there'll be a measurable space savings if there were 100 instances of calls in a program. If the called function happened to be a far function previously, then using an Int 3 to call it can save 4 bytes per invocation.
Here's an example of a Turbo/Borland C program that uses an Int 3 as a one byte CALL. It prints out a value passed to the called function in the AX register.
#pragma inline
/*---------------------------------------------------
   CALL1.C - A demonstration of how to use an Int 3
   for a one byte "call".
---------------------------------------------------*/
#include <stdio.h>
#include <dos.h>

void interrupt Function(void)
{
    printf("Function() called with AX=%04X\n", _AX);
}

int main(void)
{
    int i;
    void interrupt (*OldInt3)(void);

    /* Save old Int 3, set new Int 3 handler */
    OldInt3 = getvect(3);
    setvect(3, Function);

    /* Loop 10 times displaying loop variable in called function */
    for (i=0; i<10; i++)
    {
        /* Set AX register to the value in 'i' */
        _AX = i;

        /* This next Int 3 is a one byte instruction */
        asm int 3
    }

    setvect(3, OldInt3);
    return 0;
}
When run, this program displays the following output:
Function() called with AX=0000
Function() called with AX=0001
Function() called with AX=0002
Function() called with AX=0003
Function() called with AX=0004
Function() called with AX=0005
Function() called with AX=0006
Function() called with AX=0007
Function() called with AX=0008
Function() called with AX=0009
The code for doing calls with an illegal opcode will look very similar to that in the CALL1.C example. The only difference is you'll be hooking Int 6 instead of Int 3 and executing some illegal opcode rather than the Int 3. In the interrupt function itself one alteration will have to be made though. Before returning to the caller, the pushed return IP on the stack will have to be bumped up by however many bytes there are in the illegal instruction you've chosen.
Software interrupts, like Int 3, will push the address of the following instruction on the stack as a return address. An illegal opcode will push the address of the instruction that contains the illegal opcode.
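So an Int 6 handler has to step the stacked IP past the opcode itself before its IRET. Here's a bare-bones assembler sketch of just that adjustment (Int6Handler is a hypothetical name, and the 2 byte length is an assumption about whichever illegal opcode you picked):

Int6Handler proc far
        push bp
        mov bp,sp
        add word ptr [bp+2],2   ; Bump the pushed return IP past the opcode
        pop bp
        ;
        ; The useful work of the "called" function goes here
        ;
        iret
Int6Handler endp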
If you're working on a ROM based system, or something where you could be guaranteed that there wouldn't be any programs running that did instructions expecting a numeric processor, then the 286 and above CPU's also offer the opportunity to do a similar trick using Trap 7. Trap 7 is the NPX (numeric processor extension) not present trap. By turning on the EM bit in the MSW you can enable Trap 7's to occur on any instruction that would normally have been sent to a coprocessor if one would have been present. The FWAIT instruction is also one byte like Int 3 was.
WARNING: The illegal opcode trick won't work in a DOS box under Windows 3.X in "enhanced" mode or Windows 95. Those systems don't reflect the illegal opcode exception back into the DOS box correctly. They'll insist that the app is misbehaving and say it should be terminated.
WARNING: The Trap 7 trick won't work in a DOS box under Windows 3.X in "enhanced" mode or Windows 95. Those systems don't reflect the Trap 7 exception back into the DOS box correctly. They'll insist that the app is misbehaving and say it should be terminated.
Given the two warnings about the illegal opcode and Trap 7 trick, I can't recommend using them in any general commercial application. These ploys are strictly for things like ROM based toaster controllers and alarm monitor boxes where you have total control over the software content of the product and can be certain those caveats won't be violated.
The Int 3 trick works fine in any environment. The only caveat when using it is that it's going to cause debuggers to go berserk if you've got any problems in the program. Reserve tricks like these for last ditch space saving efforts in fairly well debugged code.
Work slowly and with a lot of care when using dastardly tricks like this.
By the way, the Int 3 trick is sometimes used in programs as one of the many and varied anti-hacking techniques people employ to try and defeat someone attempting to take their program apart with a debugger.
If you're working with some code that demands the floating point hardware, or have an application that runs in an environment where the system supplies a floating point emulator, then there's some interesting possibilities for serious space savings in recognizing the capabilities of the chip. Even the lowly old 8088 can directly do 32 or 64 bit integer arithmetic when paired with an 8087. If you've got a program that demands an '87 chip for other things, and/or is dragging around a floating point emulator already, why not use that capability to its fullest?
Opportunities abound when an '87 is present. 32 and 64 bit integer arithmetic are much simplified and smaller. Now you've got the ability to zero out up to 10 bytes of memory with a two instruction sequence like this one:
(2 bytes)   fldz
(4 bytes)   fstp tbyte ptr MemVar
If those 10 bytes were in a variable "pool" as described earlier, and were next to each other in memory, another byte can be shaved off. Say BX was aimed at the pool and the 10 bytes were at offset 18h in the pool. Then this would work:
(2 bytes)   fldz
(3 bytes)   fstp tbyte ptr [bx+18h]
5 bytes! We zeroed out 10 bytes worth of memory with only 5 bytes of code. One really cool side effect here is that we didn't have to burn a single general register accomplishing it! Just setting up to do a REP STOSW would take more than 5 bytes - and it's going to burn CX, DI, AX, and possibly ES if it isn't aimed at the right segment at the time.
A quad word of memory can be zeroed or set to one with similar instruction sequences:
        fldz
        fstp qword ptr MemVar
        fld1
        fistp qword ptr MemVar
The setting of a quad word to one can be handy when there are two "long" (i.e. dword) variables next to each other where one is to be zeroed and the other set to one. If the variables are ordered in a convenient way, then this is a viable trick.
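A sketch of that case, with hypothetical variable names, and assuming the dword to be set to 1 sits at the lower address (the low half of the qword):

CountVar dd ?           ; Ends up 1 - low dword of the qword
FlagVar  dd ?           ; Ends up 0 - high dword of the qword

        fld1
        fistp qword ptr CountVar    ; CountVar=1 and FlagVar=0 in one store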