
4. Performance Tuning

Here are the places where you can trade off spending against the performance level you want to buy and your expected job mix.

4.1. How To Pick Your Processor

Right now, the chips to consider for running Unix are the Pentium IIs and Pentium IIIs and their clone equivalents from AMD or Cyrix. Life used to be more complicated, but with Pentium prices plunging as they have been and the PCI bus having taken over the world, nothing else makes much sense for a new Intel-based system.

Brands don't matter much, so don't feel you need to pay Intel's premiums if you see an attractive Cyrix, AMD or other chip-clone system offered.

To compare the performance of different Intel-based systems with each other and with machines from other manufacturers, you can take a look at the SPECmark Table at ftp://ftp.cdf.toronto.edu/pub/spectable. That document recommends (and I do too) that you read the SPEC FAQ at http://www.specbench.org/spec/specfaq.html to get background before browsing the table.

Good current advice about chipsets can be found at The Cheap /Linux/ Box.

4.2. Of Memory In...

Buy lots of RAM; it's the cheapest way to improve real performance on any virtual-memory system. 64MB now comes standard on most clone configurations. This is good enough for X.

Tuning is simple. Watch your job mix with top(1) and add memory until you're not swapping to disk any more.
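If you want to put a number on "not swapping", the kernel will tell you directly. Below is a minimal Python sketch, assuming a Linux-style /proc/meminfo with the usual SwapTotal/SwapFree fields; if this figure stays at zero under your normal job mix, more RAM won't buy you much:

    #!/usr/bin/env python
    # Report swap in use by parsing /proc/meminfo (Linux-specific).

    def meminfo_kb(field):
        """Return a /proc/meminfo field's value, in kilobytes."""
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith(field + ":"):
                    return int(line.split()[1])
        raise KeyError(field)

    swap_used = meminfo_kb("SwapTotal") - meminfo_kb("SwapFree")
    print("swap in use: %d KB" % swap_used)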

4.3. Cache Flow

4.3.1. Overview

The most obscure of the important factors in the performance of a clone system is the motherboard's memory subsystem design, consisting of the primary and secondary caches and the DRAM behind them. The two questions performance-minded buyers have to deal with are: (1) does the cache design of a given motherboard work with Unix, and (2) how much cache SRAM should my system have, and how should it be organized?

Before normal clock speeds hit two digits in MHz, cache design wasn't a big issue. But DRAM's memory-cycle times just aren't fast enough to keep up with today's processors. Thus, your machine's memory controller caches memory references in faster static RAM (SRAM), reading from main memory in chunks that the board designer hopes will be large enough to keep the CPU continuously fed under a typical job load. If the cache system fails to do its job, the processor is slowed down to the memory's real access speed -- which, given typical 70ns DRAM parts, is about 7MHz (the 70ns access time, plus roughly as much again of precharge and refresh overhead, yields an effective cycle time near 140ns).

You'll sometimes hear the terms L1, L2, and L3 cache. These refer to Level 1, Level 2, and Level 3 cache. L1, today, is always on the CPU (well, unless you're HP). L2 is off-chip cache. L3 is a second-level off-chip cache. Anything that fits in L1 can run at full CPU speed, as there is no need to go off chip. Anything too large to fit in L1 will try to run in L2; if it fits in L2, you still won't have to deal with the bus. If you're familiar with virtual memory, think of it this way: when you run out of L1, you swap to L2; when you run out of L2, you swap to main memory; when you run out of main memory, you swap to disk. Each stage is slower, and more prone to conflicts with other parts of the system, than the one before it.

Cache is like memory, but is faster, more expensive, and smaller. L1 is generally faster and smaller than L2, which is generally faster and smaller than L3, which is faster and smaller than memory. Some PCs will not have L2 or L3. Most workstation-class machines have L1 and L2. L3 is rare, even on big expensive Unix servers, but will become more common when CPUs start coming with L1 and L2 on-chip.

Because most L1 caches are on the CPU chip, there isn't much room for them, so they tend to be small. The Pentium has two L1 caches, one for instructions (I-cache) and one for data (D-cache); each is 8KB. Since this size is fixed by the chip design, every Pentium laptop you look at will have the same amount.

The size of the L2 cache you get will depend on what brand and model of laptop you buy, since Compaq and Fujitsu and NEC can decide independently how much L2 cache to put on their motherboards (within a range defined by the CPU chip). It's usually decided by the marketing people, not the technical people, based on what chips are available at what prices and what price they intend to sell the computer for. It looks like most benchmark results you'll see are with 256 or 512 KB of L2 cache; AT&T makes one Pentium-based server with 4 MB of L2 cache.

There are other cache-related buzzwords you may encounter.

"Write-back" means that when you update something in "memory" the cache doesn't actually push the new value out to the memory chips (or to L2, if it's an L1 write-back) until the "line" gets replaced in the cache. ("line" is the chunk-size caches act on, usually a small number of bytes like 8, 16, 32, or 64 for L1, or 32, 64, 128, or 256 for L2)

"Write-through" means that when you update "memory" the cache updates its value as well as sending an immediate update to physical memory (or to L2 if it's an L1 write-through). Write-back is generally faster if your application fits in the cache.

"Non-blocking, out of order" means that the CPU looks at the next N instructions it's about to execute. It executes the first and finds that the data isn't in cache. Since it's boring to just wait around for the data to come back from memory, it looks at the next instruction. If that 2nd instruction doesn't need the data the 1st instruction is waiting on, the CPU goes ahead and executes that instruction. If the 3rd instruction does need the data, it remembers it needs to execute that one after the data comes in and goes on to the 4th instruction. Depending on how many outstanding requests are allowed, if the 4th one causes a cache miss on a different line it may put that one on hold as well and go on to the 5th instruction. The Pentium Pro can do this, but I don't think the Pentium can.

"Set-associative" means the cache is split into 2 or more mini-caches. Because of the way things are accessed in a cache, this can help a program that has some "badly behaving" code mixed with some "good" code. Other terms that go with it are "LRU" (the mini-cache picked for replacement is the one Least Recently Used) or "random" (the line picked is selected randomly).

These design choices can make a big difference in how happy you are with your system's performance. There are enough variables that you probably aren't going to be able to predict how happy you'll be with a configuration unless you sit down in front of the machine and run whatever it is you plan to run on it. Make up your own benchmark floppy with your primary application to take with you to showrooms. (Throw it away after all your test drives, since it will probably have collected a virus or three.)

Bigger or faster isn't always better. Speed is usually a tradeoff with size, and you have to match L2 cache size/speed to CPU speed. A system with a faster MHz CPU could perform worse than a system with a slower chip because the CPU<-->L2 speed match might be such that the faster CPU requires a different, slower mode on the L2 connection.

If all you want to do is run MyLittleSpreadsheet, and the code and data all fit in 400 KB, a system with 512 KB of L2 cache will likely run more than twice as fast as a system with 256 KB of L2. If MLS fits in 600 KB and has a very sequential access pattern (a "cache-buster"), the 128 KB and 256 KB systems will perform about the same -- like a dog; if the pattern is random rather than sequential, the 512KB system will probably do some fractional amount better than the 128 KB system. This is why it's so important to try out your application and ignore impressive numbers for programs you're never going to run.

Also, you may find the Doom benchmark page useful :-).

4.3.2. How Caching Works

Caches get their leverage from exploiting two kinds of locality in memory references. Temporal locality means "what was accessed just now will probably be accessed again soon", while spatial locality means "if byte N was asked for, bytes N+1, N+2, N+3 will probably be wanted too". Because Unix multitasks, every context switch tends to violate both kinds of locality. Spatial locality suffers because different contexts may not be located close together; temporal locality suffers because a process has to wait until all the other ready contexts get their chance to run before it can run again.

One side-effect of what's today considered "good programming practice", with high-level languages using a lot of subroutine calls, is that the program counter of a typical process hops around like crazy. You might think that this, together with the periodic clock interrupts for multitasking, would make spatial locality very poor.

However, the clock interrupt only fires about 60 times per second. This is very low overhead if you consider how many instructions a 60MHz processor can execute in 1/60th of a second (for a rough estimate, something like 30 MIPS * 1/60 sec = half a million instructions -- at 16 bits each, roughly a megabyte of memory has been walked through!). That leaves lots of opportunity to take advantage of temporal locality -- and most programs are not so large that their time-critical parts won't fit inside a megabyte.

(Thanks to Michael T. Pins and Joan Eslinger for much of this section.)

4.3.3. A Short Primer on Cache Design

Before we go further in discussing specifics of the Intel processors we'll need some basic cache-design theory. (Some of this repeats and extends the Overview.)

Modern system designs have two levels of caching; a primary or internal cache right on the chip, and a secondary or external cache in high-speed memory (typically static RAM) off-chip. The internal cache feeds the processor directly; the external cache feeds the internal cache.

A cache is said to hit when the processor, or a higher-level cache, calls for a particular memory location and gets it. Otherwise, it misses and has to go to main memory (or at least the next lower level of cache) to fetch the contents of the location. A cache's hit rate is the percentage of time, considered as a moving average, that it hits.

The external cache is added to reduce the cost of an internal cache miss. To speed the whole process up, it must serve the internal cache faster than main memory would be able to do (to hide the slowness of main memory). Thus, we desire a very high hit rate in the secondary cache as well as very high bandwidth to the processor.
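To see how the two hit rates and the speed gap combine, it helps to work the expected-access-time arithmetic. Here is a small Python sketch; the cycle costs are made-up illustrative numbers, not measurements of any particular chipset:

    # Average memory access time for a two-level cache, in CPU cycles.
    # The latency figures below are illustrative guesses, not vendor data.
    L1_HIT = 1    # internal cache hit
    L2_HIT = 4    # internal miss, external cache hit
    DRAM   = 30   # both caches miss, go to main memory

    def amat(l1_rate, l2_rate):
        """Expected cycles per access, given L1 and L2 hit rates."""
        return (l1_rate * L1_HIT
                + (1 - l1_rate) * l2_rate * L2_HIT
                + (1 - l1_rate) * (1 - l2_rate) * DRAM)

    # A few points off the secondary hit rate is quite visible:
    for l2 in (0.95, 0.85, 0.70):
        print("L2 hit rate %.2f -> %.2f cycles/access" % (l2, amat(0.90, l2)))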

Obviously, secondary cache hit rate can be improved by making it bigger. It can also be increased by increasing the associativity factor (more on this later, but for now note that too much associativity can cost a big penalty).

A cache is divided up into lines. Typically, in an i486 system, each line is 4 to 16 bytes long (the i486 internal cache uses 16-byte lines; external line size varies). When the processor reads from an external-cache address that is not in the internal cache, that address and the surrounding 16 bytes are read into a line.

Each cache line has a tag associated with it. The tag stores the address in memory that the data in that cache line came from, plus a bit to indicate that the line contains valid data.

Some more important terms describing how caches interact with memory:

write-through

It wouldn't do to let your cache get out of sync with main memory. The simplest way to handle this is to arrange that every write to cache generates the corresponding write to main store. In the simplest kind of "write-through" cache, then, you only get cache speedup on reads. But if the cache system includes write posting, writes will usually be fast too.

write posting

Most write-through cache designs have a few `write posting' buffers that allow the cache to accept write data as fast as a write-back cache would, for later writing to memory. So, as far as the processor is concerned, most writes happen at zero wait states (the full cache speed), unless the processor issues enough writes in a short interval to fill the write posting buffers faster than they can be emptied.

write-back

For each cache address range in DRAM, writes are done to cache only until a new block has to be fetched due to out-of-range access (at which point the old one is flushed to DRAM). This is much faster, because you get cache speedup on writes as well as reads. It's also more expensive and trickier to get right.

Write-back secondary caches are generally not a good idea. Beyond what was said in the write-through paragraph above, recall that the goal of the secondary cache is to have a high hit rate and high bandwidth to the processor's internal cache. When a cache-miss occurs in the secondary cache, often the line being replaced is dirty and must be written to main memory first. The total time to service the secondary cache miss nearly doubles.

Even when the secondary cache line being replaced is not dirty, the service time goes up because the dirty bit must first be examined before accessing main memory. Write-through caches have the advantage of being able to look up data in the secondary cache and in main memory in parallel (in the case where the secondary cache misses, some of the delay of looking in main memory has already been absorbed). Write-back caches cannot do this, because they might have to write back the cache line before doing the main memory read.

For these reasons, write-back caches are generally regarded as being inferior to write-posting buffers. They cost too much silicon and more often than not perform worse.
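To make the write-policy trade-off concrete, here is a toy Python model that counts main-memory writes for the same store trace under each policy. The cache geometry is contrived purely for illustration:

    # Count main-memory writes under write-through vs. write-back.
    # One direct-mapped cache: 4 lines of 16 bytes (illustrative only).
    LINES, LINE_SIZE = 4, 16

    def memory_writes(addrs, write_back):
        tags = [None] * LINES        # which memory line each slot holds
        dirty = [False] * LINES
        writes = 0
        for addr in addrs:
            line = addr // LINE_SIZE
            slot = line % LINES
            if tags[slot] != line:   # miss: evict whatever is there
                if write_back and dirty[slot]:
                    writes += 1      # flush the dirty victim line
                tags[slot], dirty[slot] = line, False
            if write_back:
                dirty[slot] = True   # update stays in the cache for now
            else:
                writes += 1          # write-through: every store goes out
        return writes

    trace = [0, 4, 8, 12] * 20       # stores that all fit in the cache
    print("write-through:", memory_writes(trace, write_back=False))  # 80
    print("write-back:   ", memory_writes(trace, write_back=True))   # 0

As the overview said: when the working set fits in the cache, write-back generates almost no memory traffic -- which is why it wins there, and why write posting buffers recover most of the difference for write-through designs.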

Now some terms that describe cache organization. To understand these, you need to think of main RAM as being divided into consecutive, non-overlapping segments we'll call "regions". A typical region size is 64K. Each region is mapped to a cache line, 4 to 128 bytes in size (a typical size is 16). When the processor reads from an address in a given region, and that address is not already in the cache, the location and others near it are read into a line.

direct-mapped

Describes a cache system in which each region has exactly one corresponding line (also called "one-way cache").

two-way set-associative

Each region has *two* possible slots; thus your odds of not having to fetch from DRAM (and, hence, your effective speed) go up.

There are also "four-way" caches. In general, an n-way cache has n slots per region and improves your effective speed by some factor proportional to n. However, multiset caches become very costly in terms of silicon real estate, so one does not commonly see five-way or higher caches.
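Here is a toy Python illustration of why associativity helps the "badly behaving" case: two hot addresses that collide in a direct-mapped cache can coexist in one two-way set. Both caches hold eight lines; the ping-pong trace is contrived for the demonstration:

    # Miss counts: direct-mapped vs. two-way set-associative with LRU.

    def direct_mapped_misses(nlines, trace):
        slots = [None] * nlines
        misses = 0
        for line in trace:
            slot = line % nlines
            if slots[slot] != line:
                misses += 1
                slots[slot] = line
        return misses

    def two_way_lru_misses(nlines, trace):
        sets = [[] for _ in range(nlines // 2)]  # each set: 2 lines, LRU order
        misses = 0
        for line in trace:
            s = sets[line % len(sets)]
            if line in s:
                s.remove(line)       # hit: refresh its LRU position
            else:
                misses += 1
                if len(s) == 2:
                    s.pop(0)         # evict the least recently used line
            s.append(line)
        return misses

    trace = [0, 8] * 50              # two lines that map to the same place
    print("direct-mapped:", direct_mapped_misses(8, trace))  # 100 misses
    print("two-way LRU:  ", two_way_lru_misses(8, trace))    #   2 misses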

Because set-associative caches make better use of SRAM, they typically require less SRAM than a direct-mapped cache for equivalent performance. They're also less vulnerable to Unix's heavy memory usage. Andy Glew of USENET's comp.arch group says "the usual rule of thumb is that a 4-way set-associative cache is equivalent to a direct-mapped cache of twice the size". On the other hand, some claim that as cache size gets larger, two-way associativity becomes less useful; according to this school of thought it actually becomes a net loss over a direct-mapped cache at cache sizes over 256K.

So, typically, you see multi-set cache designs on internal caches, but direct-mapped designs for external caches.

The larger you make a cache line, the cheaper the design will be, as you save on expensive tag RAM -- but the worse the miss performance will be, since you pay to reload the whole line on every miss. Reloading 32 bytes on each cache miss, for example, is not reasonable.

In the presence of interleaved DRAM memory, a cache line should not be larger than a whole DRAM line (double-interleaved: 2*4 = 8 bytes; quadruple-interleaved: 4*4 = 16 bytes). Otherwise, memory fetches to the external cache get slow.

An external cache which can support the i486 burst mode can increase bandwidth to a much higher level than one which doesn't, and can significantly reduce the cost of an internal cache miss.

4.4. Suggestions for Buying

The best advice your humble editor can give is a collection of rules of thumb. Your mileage may vary...

4.4.1. Rule 1: Buy only motherboards that have been tested with Unix

One of DOS's many sins is that it licenses poor hardware design; it's too brain-dead to stretch the cache system much. Thus, bad cache designs that will run DOS can completely hose Unix, slowing the machine to a crawl or even (in extreme cases) causing frequent random panics. Make sure your motherboard or system has been tested with some Unix variant.

4.4.2. Rule 2: Be sure you get enough cache.

If your motherboard offers multiple cache sizes, make sure you know how much is required to service the DRAM you plan to install.

Bela Lubkin writes: "Excess RAM [over what your cache can support] is a very bad idea: most designs prevent memory outside the external cache's cachable range from being cached by the 486 internal cache either. Code running from this memory runs up to 11 times slower than code running out of fully cached memory."

4.4.3. Rule 3: "Enough cache" is at least 64K per 16MB of DRAM

Hardware caches are usually designed to achieve effective zero-wait-state operation, rather than to perform any significant buffering of data. As a general rule, 64K of cache handles up to 16MB of memory; more is redundant.

A more sophisticated way of determining cache size is to estimate the number of processes you expect to be running simultaneously (i.e., 1 + expected load average; call this value N). Your external cache should then be about N * 32K in size. The justification is as follows: upon a context switch, it is a good idea to be able to hold the entire i486 internal cache in the secondary cache. Each process needs something less than 8K * 4 (the i486 internal cache is 8K, 4-way set-associative), so 32K covers all the potentially conflicting cache lines, and the cache left over should be plenty to improve the hit rate of the secondary cache when the internal cache misses. The number of main memory accesses caused by context switching should thereby be reduced.
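Both rules of thumb reduce to trivial arithmetic. Here's a sketch in Python, using the figures from this section (64K per 16MB of DRAM, and 32K per expected process); the example machine is hypothetical:

    # Two external-cache sizing rules of thumb from this section.

    def cache_for_dram_kb(dram_mb):
        """Rule 3: at least 64K of cache per 16MB of DRAM."""
        return 64 * -(-dram_mb // 16)   # ceiling division

    def cache_for_load_kb(expected_load_average):
        """Per-process rule: N = 1 + load average, 32K each."""
        return (1 + expected_load_average) * 32

    dram_mb, load = 64, 3               # a hypothetical box
    print("DRAM rule:    %dK" % cache_for_dram_kb(dram_mb))   # 256K
    print("load rule:    %dK" % cache_for_load_kb(load))      # 128K
    print("buy at least: %dK" % max(cache_for_dram_kb(dram_mb),
                                    cache_for_load_kb(load)))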

Of course, if you are going to be running programs with large memory requirements (especially for data), then a huge secondary cache would probably be a big win. But most programs in the run queue will be small (ls, cat, more, and the like).

4.4.4. Rule 4: If possible, max out the board's cache when you first buy

Bela continues: "Get the largest cache size your motherboard supports, even if you're not fully populating it with RAM. The motherboard manufacturer buys cache chips in quantity, knows how to install them correctly, and you won't end up throwing out the small chips later when you upgrade your main RAM."

Gerard Lemieux qualifies this by observing that if adding SRAM increases the external cache line size, rather than increasing the number of cache lines, it's a lose. If this is the case, then an external cache miss could cost you dearly; imagine how long the processor would have to wait if the line size grew to 1024 bytes. If the cache has a poor hit rate (likely true, since the number of lines has not changed), performance would deteriorate.

Also (he observes), spending an additional $250 for cache chips might buy you 2-3% in performance (even in Unix). You must ask yourself if it is really worth it.

4.4.5. Caveat

A lot of fast chips are held back by poor cache systems and slow memory. To avoid trouble, cloners often insert wait states at the cache, slowing down the chip to the effective speed of a much slower chip.

(Thanks to Guy Gerard Lemieux <lemieux@eecg.toronto.edu> for insightful comments on an earlier version of this section.)

4.5. Bus Wars

This is yet another area in which progress has simplified your choices a lot. There used to be no fewer than five competing bus standards out there (ISA, EISA, VESA/VLB, PCI, and PCMCIA). Now there are effectively just two -- PCI for desktop/tower machines and PCMCIA for laptops.

4.5.1. Bus Types

PCI is Intel's fast bus for the Pentium (32 bits wide in its common form, with a 64-bit extension). Many PCI boards are actually PCI/ISA hybrids that support both standards, so you can use less expensive ISA peripherals and controllers. Beware, though; dual-bus boards lose about 10% of their performance relative to single-bus PCI boards.

In the laptop market everything is PCMCIA. PCMCIA peripherals are about the size of credit cards (85x54mm) and vary in thickness between 5 and 10mm. They have the interesting feature that they can be hot-swapped (unplugged and plugged back in) while the computer is on. However, they are seldom seen in desktop machines. They require a special daemon to handle the swapping; free versions are now standard under Linux.

4.5.2. Plug And Play

Many PCI cards have a feature called ``Plug and Play''. These cards negotiate with the operating system at boot time for things like IRQs and DMA channels -- they have no jumpers. Beware of these! Linux doesn't yet have full support for Plug and Play, though there are support utilities available. (Of the Microsoft OSs, only Windows 95 and later support Plug and Play fully -- DOS can't handle it at all, and Windows 3.1 requires manual intervention.)

4.5.3. Historical Note

There used to be two ISA buses, the original 8-bit IBM PC and a 16-bit compatible extension sometimes called "AT bus". The term ISA didn't come into use until well into the lifetime of the latter. Here's a more complete list:

ISA

The original IBM PC bus architecture. The 8-bit version is completely extinct. The 16-bit AT version is still alive but has been declared obsolete by Intel in the PC99 specification.

MCA

Micro-Channel Architecture. A ``standard'' that IBM attempted to promulgate, especially in the PS/2 series of machines. While they tried to claim it was faster and more efficient, it really was only marginally better than ISA. Its real advantage -- to IBM -- was that it was a closed architecture; they didn't publish the details of implementation as they had on virtually everything for the XT/AT. It failed horribly; customers didn't want to walk back into that trap.

EISA

Extended ISA. Required motherboard setup and manufacturer-provided descriptions that got loaded into flash ROM. Manually. By the user. Superseded by VLB years ago and now extinct.

VLB

VESA Local Bus. Usually seen on video cards providing high-speed graphics data transfer, it was also touted for other cards. A VLB slot was essentially an ISA slot with an extra high-speed connector defined by the VESA local-bus specification, so it accepted plain ISA cards as well. Supplanted by PCI.

PCI

Peripheral Component Interconnect. The winner of the bus wars.

4.6. Disk Wars: IDE vs. SCSI

4.6.1. Overview

Another basic decision is IDE vs. SCSI. Either kind of disk costs about the same, but the premium for a SCSI card varies all over the lot, partly because of price differences between VLB and PCI SCSI cards and especially because many motherboard vendors bundle an IDE chipset right on the system board. SCSI gives you better speed and throughput and loads the processor less, a win for larger disks and an especially significant consideration in a multi-user environment; also it's more expandable.

In terms of pure disk speed, IDE will always be faster, as they use the same underlying disks, and IDE has less overhead. As fast as disks are getting today, the difference is effectively noise. The real advantage of SCSI comes from its extra brains. IDE uses polled I/O, which means that when you are accessing the disk, the CPU isn't doing anything else. Most SCSI systems, on the other hand, are DMA based, freeing up the system to do other things at the same time. Hence, in terms of full system performance, SCSI is indeed faster if you have good hardware and an intelligent OS.

Another important win for SCSI is that it handles multiple devices much more efficiently. You can have at most two IDE devices; four for EIDE. SCSI permits up to 7 (15 for Wide SCSI).

If you have two IDE (or ST506 or ESDI) drives, only one can transfer between memory and disk at once. In fact, you have to program them at such a low level that one drive might actually be blocked from seeking while you're talking to the other drive. SCSI drives are mostly autonomous and can do everything at once; and current SCSI drives are not quite fast enough to flood more than half the SCSI bus bandwidth, so you can have at least two drives on a single bus pumping full speed without using it up. In reality, you don't keep drives running full speed all the time, so you should be able to have 3-4 drives on a bus before you really start feeling bandwidth crunch.

Of course, IDE is cheaper. Many motherboards have IDE right on board now; if not, you'll pay maybe $15 for an IDE adapter board, as opposed to $200+ for the leading SCSI controller. Also, the cheap SCSI cabling most vendors ship can be flaky. You have to use expensive high-class cables for consistently good results. See Mark Sutton's horror story.

4.6.2. Enhanced IDE

These days you seldom see plain IDE; souped-up variants are more usual. These are "Enhanced IDE" (E-IDE) and "Fast AT Attachment" (usually ATA for short). ATA is Seagate's subset of E-IDE, excluding some features designed to permit chaining with CD-ROMs and tape drives using the new "ATAPI" interface (an E-IDE extension; so far only the CD-ROMs exist); in practice, ATA and E-IDE are identical.

You'll need to be careful about chaining in CD-ROMs and tape drives when using IDE/ATA. The IDE bus sends all commands to all disks; they're supposed to latch, and each drive then checks to see whether it is the intended target. The problem is that badly-written drivers for CD-ROMs and tapes can collide with the disk command set. It takes expertise to match these peripherals.

Neither ATA nor E-IDE has the sustained throughput capacity of SCSI (they're not designed to) but they are 60-90% faster than plain old IDE. E-IDE's new ``mode 3'' boosts the IDE transfer rate from IDE's 3.3MB/sec to 13.3MB/sec. The new interface supports up to 4 drives of up to 8.4 gigabytes capacity.

E-IDE and ATA are advertised as being completely compatible with old IDE. Theoretically, you can mix IDE, E-IDE and ATA drives and controllers any way you like, and the worst result you'll get is conventional IDE performance if the enhancements don't match up (the controller picks the lowest latch speed). In practice, some IDE controllers (notably the BusLogic) choke on enhanced IDE.

Accordingly, I recommend against trying to mix device types on an E-IDE/ATA bus. Unfortunately, this removes much of E-IDE/ATA's usefulness!

E-IDE on drives above 540MB does automatic block mapping to fool the BIOS about the drive geometry (avoiding limits in the BIOS type tables). They don't require special Unix drivers.
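Incidentally, the 540MB boundary falls out of BIOS geometry arithmetic: the old interface limits allowed at most 1024 cylinders, 16 heads, and 63 sectors of 512 bytes each. A quick check:

    # The classic BIOS/IDE geometry ceiling.
    limit = 1024 * 16 * 63 * 512        # cylinders * heads * sectors * bytes
    print("%d bytes = %.0f MB" % (limit, limit / 1e6))   # ~528 MB

    # E-IDE block mapping presents a fake geometry (e.g. twice the
    # heads, half the cylinders) so bigger drives fit under the limits.

(The exact figure is about 528 million bytes; it gets quoted as anything from 504MB on up, depending on whose megabytes you use.)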

Many motherboards now support ``dual EIDE'' channels, i.e. two separate [E]IDE interfaces each of which can, theoretically, support two IDE disks or ATA-style devices.

4.6.3. SCSI Terminology

The following, by Ashok Singhal <ashoks@duckjibe.eng.sun.com> of Sun Microsystems with additions by your humble editor, is a valiant attempt to demystify SCSI terminology.

The terms ``SCSI'', ``SCSI-2'', and ``SCSI-3'' refer to three different specifications. Each specification has a number of options. Many of these options are independent of each other. I like to think of the main options (there are others that I'll skip over because I don't know enough about them to talk about them on the net) by classifying them into five categories:

4.6.3.1. Logical: SCSI-1, SCSI-2, SCSI-3

This refers to the commands that the controllers understand. Shortly after SCSI first came out, the vendors agreed on a spec for a common command set called CCS. CCS was made a required part of the SCSI-2 standard. You should be able to use a SCSI disk with a SCSI-2 card and vice-versa as long as they both support CCS. Non-CCS SCSI devices aren't worth considering.

``SCSI-3'' is a superset of SCSI-2 including commands intended for CD-R and streaming multimedia devices.

4.6.3.2. Electrical Interface

  • single-ended (max cable length 6 meters)

  • differential (max cable length 25 meters)

This option is independent of command set, speed, and path width. Differential is less common but allows better noise immunity and longer cables. It's rare in SCSI-1 controllers.

For a PC you will probably always see single-ended SCSI controllers but if you're shopping around for disks you might run across differential disks. They will likely be more expensive than single-ended ones and will not work on your single-ended bus.

4.6.3.3. Handshake

  • Asynchronous (each transferred word (8, 16 or 32 bits) is individually acknowledged).

  • Synchronous (multiple-word transfers permitted between ACKs).

Synchronous is faster. This mode is negotiated between controller and device; modes may be mixed on the same bus. This is independent of command set, data width, and electrical interface.

4.6.3.4. Synchronous Speed (does not apply for asynchronous option)

Normal transfer speed is 5 megabytes/sec. The ``fast'' option (10MB/sec) is defined only in SCSI-2 and SCSI-3. Fast-20 (or ``Ultra'') is 20MB/sec; Fast-40 (or ``Ultra-2'') is 40MB/sec. The fast options basically define shorter timing parameters such as the assertion period and hold time.

The parameters of the synchronous transfer are negotiated between each target and initiator so different speed transfers can occur over the same bus.

4.6.3.5. Path width

The standard SCSI data path is 8 bits wide. The ``wide'' option exploits a 16- or 32-bit data path (using 68-pin rather than 50-pin data cables). You also get 4-bit rather than 3-bit device IDs, so you can have up to 16 devices. The wide option doubles or quadruples your transfer rate; so, for example, a fast-20/wide SCSI link using 16 bits transfers 40MB/sec.
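The speed grades and width options compose by straight multiplication, so a quick sanity-check function covers all the named variants (transfer clock in megatransfers/sec times path width in bytes):

    # SCSI bus bandwidth = transfer clock * path width.
    def scsi_bandwidth_mb(clock_mhz, width_bytes):
        return clock_mhz * width_bytes

    print(scsi_bandwidth_mb(5, 1))    # plain SCSI, narrow:      5 MB/sec
    print(scsi_bandwidth_mb(10, 1))   # fast, narrow:           10 MB/sec
    print(scsi_bandwidth_mb(20, 2))   # fast-20 (Ultra), wide:  40 MB/sec
    print(scsi_bandwidth_mb(40, 2))   # fast-40 (Ultra-2), wide: 80 MB/sec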

What are those ``LUN'' numbers you see when you boot up? Think of them as sub-addresses on the SCSI bus. Most SCSI devices have only one ``logical'' device inside them, so they're LUN zero. Some SCSI devices can, however, present more than one separate logical unit to the bus master, with different LUNs (0 through 7). The only context in which you'll normally use LUNs is with CD-ROM juke boxes. Some have been marketed that offer up to 7 CD-ROMs with one read head; these use the LUN to differentiate which disk to select.

(There's history behind this. Back in the days of EISA, drives were supposed to be under the control of a separate SCSI controller, which could handle up to 7 such devices (15 for wide SCSI). These drives were to be the Logical Units; hence the LUN, or Logical Unit Number. Then, up to 7 of these SCSI controllers would be run by the controller that we today consider the SCSI controller. In practice, hardware cost dropped so rapidly, and capability increased so rapidly, it became more logical to embed the controller on the drive.)

4.6.4. Avoiding Pitfalls

Here are a couple of rules and heuristics to follow:

Rule 1: Total SCSI cable length (both external and internal devices) must not exceed six meters. For modern Ultra SCSI (with its higher speed) cut that to three feet!

It's probably not a good idea to cable 20MB/s or faster SCSI devices externally at all. If you must, one of our informants advises using a Granite Digital ``perfect impedance'' teflon cable (or equivalent); these cables basically provide a near-perfect electrical environment for a decent price, and can be ordered in custom configurations if needed.

A common error is to forget the length of the ribbon cable used for internal devices when adding external ones (that is, devices chained to the SCSI board's external connector).

Rule 2: Both ends of the bus have to be electrically terminated.

On older devices this is done with removable resistor packs — typically 8-pin-inline widgets, yellow or blue, that are plugged into a plastic connector somewhere near the edge of the PCB board on your device. Peripherals commonly come with resistor packs plugged in; you must remove the packs on all devices except the two end ones in the physical chain.

Newer devices advertised as having "internal termination" have a jumper or switch on the PCB board that enables termination. These devices are preferable, because the resistor packs are easy to lose or damage.

Rule 3: No more than seven devices per chain (fifteen for Wide SCSI).

There are eight SCSI IDs per controller (sixteen for Wide SCSI). The controller itself claims ID 7 (or 15), so your devices can use IDs 0 through 6 (or 0 through 14, wide). No two devices can share an ID; if this happens by accident, neither will work.

The conventional ID assignments are: Primary hard disk = ID 0, Secondary hard disk = ID 1, Tape = ID 2. Some Unixes (notably SCO) have these wired in. You select a device's ID with jumpers on the PCB or a thumbwheel.

SCSI IDs are completely independent of physical device chain position.

Heuristic 1: Stick with controllers and devices that use the Centronics-style 50-pin connector. Internally these are ribbon-cable headers much like diskette cables; externally they use a D50 shell. This "standard" connector is common in the desktop/tower/rackmount-PC world, but you'll find lots of funky DIN and mini-DIN plugs on devices designed for Macintosh boxes and some laptops. Ask in advance and don't get burned.

Heuristic 2: For now, when buying a controller, go with an Adaptec xx42 or one of its clones such as the BusLogic 542. (I like the BusLogic 946 and 956, two particularly fast Adaptec clones well-supported under Linux.) The Adaptec is the card everybody supports and the de-facto standard. Occasional integration problems have been reported with Unix under Future Domain and UltraStor cards, apparently due to command-set incompatibilities. At least, before you buy these, make sure your OS explicitly supports them.

However: Beware the combination of an Adaptec 1542 with a PCI Mach32 video card. Older (1.1) Linux kernels handled it OK, but all current ones choke. Your editor had to replace his 1542 because of this, swearing sulphurously the while.

Heuristic 3: You'll have fewer hassles if all your cables are made by the same outfit. (This is due to impedance reflections from minor mismatches. You can get situations where cable A will work with B, and cable B will work with C, but A and C aren't happy together. It's also non-commutative: the fact that `computer to A to B' works doesn't mean that `computer to B to A' will work.)

Heuristic 4: Beware cheap SCSI cables!

Mark Sutton tells the following instructive horror story in a note dated 5 Apr 1997:

I recently added an additional SCSI hard drive to my home machine. I bought an OEM packaged Quantum Fireball 2 gig SCSI drive (meaning, I bought a drive in shrinkwrap, without so much as mounting hardware or a manual. Thank God for Quantum's web page or I would have had no idea how to disable termination or set the SCSI ID on this sucker. Anyway, I digress...). I stuck the drive in an external mounting kit that I found in a pile of discarded computer parts at work and that my boss said I could have. (All 5 of my internal bays were full of devices.)

Anyway, I had my drive, and my external SCSI mounting kit, I needed a cable.

I went into my friendly local CompUSA in search of a SCSI cable, and side-by-side, on two hooks, were two "identical" SCSI cables. Both were 3 feet. Both had centronics to centronics connectors, both were made by the same manufacturer. They had slightly different model numbers. One was $16.00, one was $30.00. Of course, I bought the $16 cable.

Bad, I say, BAD BAD MISTAKE. I hooked this sucker up like so:

 +--------+  +-------+   +-----------+   +-------+
 |Internal|--|Adaptec|---|New Quantum|---|UMAX   |
 |Devices |  |1542CF | ^ |  Disk     | ^ |Scanner|
 +--------+  +-------+ | +-----------+ | +-------+
                       |               |
                   New $16 cable   Cable that came
                                     with scanner.

Shortly after booting, I found that data all over my old internal hard drive was being hosed. This was happening in DOS as well as in Linux. Any disk access on either disk was hosing data on both disks, attempts to scan were resulting in corrupted scans *and* hosing files on the hard disks. By the time I finished swapping cables around, and checking terminations and settings, I had to restore both Linux and DOS from backups.

I went back to CompUSA, exchanged the $16 cable for the $30 one, hooked it up and had no more problems.

I carefully examined the cables and discovered that the $30 cable contained 24 individual twisted pairs. Each data line was twisted with a ground line. The $16 cable was 24 data wires with one overall grounded shield. Yet, both of these cables (from the same manufacturer) were being sold as SCSI cables!

You get what you pay for.

(Another correspondent guesses that the cheap cable probably said ``Macintosh'' on it. The Mac connector is missing most of its ground pins.)

4.6.5. Trends to Watch For

Disks of less than 2GB capacity simply aren't being manufactured anymore; there's no margin in them. Our spies tell us that all major disk makers retooled their lines a while back to produce 540MB unit platters, which are simply being stacked 2N per spindle to produce ranges of drives with roughly 1GB increments of capacity. The largest reasonably-priced drives are still 9GB (16 platters per drive), but you can get 23GB or even 45GB capacities (these are probably packing 2.4GB per platter).

Average drive latency is inversely proportional to the disk's rotational speed. For years, most disks spun at 3600 rpm; most high-performance disks now spin at 7,200 rpm, and high-end disks like the Seagate Cheetah line are moving to 10,000 rpm. These fast-spin disks run extremely hot; expect cooling to become a critical constraint in drive design.

Drive densities have reached the point at which standard inductive read/write heads are a bottleneck. In newer designs, expect to see magnetoresistive head assemblies with separate read and write elements.

4.6.6. More Resources

There's a USENET SCSI FAQ. Also see the home page of the T10 committee that writes SCSI standards.

There is a large searchable database of hard disk and controller information at the PC DISK Hardware Database.

4.7. Other Disk Decisions

Look at seek times and transfer rates for your disk; under Unix disk speed and throughput are so important that a 1-millisecond difference in average seek time can be noticeable.

4.7.1. Disk Brands

An industry insider (a man who buys hard drives for systems integration) has passed us some interesting tips about drive brands. He says the absolute best-quality drives are the Hewlett-Packards but you will pay a hefty premium for that quality.

The other top-tier manufacturers are Quantum and Seagate; these drives combine cutting-edge technology with very aggressive pricing.

The second tier consists of Maxtor, Conner, and Western Digital.

Maxtor often leads in capacity and speed, but at some cost in other quality measures. For example, many of the high-capacity Maxtor drives have serious RFI emission problems which can cause high error rates. SCSI has built-in ECC, so SCSI drives only take a performance hit from this; but it can lead to actual errors on IDE drives.

Western Digital sells most of its output to Gateway at sweetheart prices; WD drives are thus not widely available elsewhere.

The third tier consists of Fujitsu, Toshiba, and everyone else. My friend observes that the Japanese, despite their reputation for process engineering savvy, are notably poor at drive manufacturing; they've never spent the money and engineering time needed to get really good at making the media.

If you see JTS drives on offer, run away. It is reliably reported that they are horrible.

Just as a matter of interest, he also says that hard drives typically start their life cycle at an OEM price around $400 each. When the price erodes to around $180, the product gets turfed — there's no margin any more.

I've found a good cheap source for reconditioned SCSI disks at Uptime Computer Support Services.

4.7.2. To Cache Or Not To Cache?

Previous issues said "Disk cacheing is good, but there can be too much of a good thing. Excessively large caches will slow the system because the overhead for cache fills swamps the real accesses (this is especially a trap for databases and other applications that do non-sequential I/O). More than 100K of cache is probably a bad idea for a general-purpose Unix box; watch out for manufacturers who inflate cache size because memory is cheap and they think customers will be impressed by big numbers." This may no longer be true on current hardware; in particular, most controllers will interrupt a cache-fill to fulfill a `real' read request.

In any case, a large cache on a hard drive (particularly on IDE drives) often does not translate to better performance. For example, Quantum makes a 210MB IDE drive which comes with a 256KB cache. Conner and Maxtor also have 210MB drives, but with only 64KB caches. The transfer rates on the drives, however, show that the Quantum comes in at 890KB/sec, while the Maxtor and Conner fly away at 1200KB/sec. Clearly, the Conner and Maxtor make much better use of their smaller caches.

However, it may be that any hardware disk cacheing is a lose for Unix! Scott Bennett <bennett@mp.cs.niu.edu> reports a discussion on comp.unix.wizards: "nobody found the hardware disk caches to be as effective in terms of performance as the file system buffer cache...In many cases, disabling the hardware cache improved system performance substantially. The interpretation of these results was that the cacheing algorithm in the kernel was superior to, or at least better tuned to Unix accesses than, the hardware cacheing algorithms."

On the other hand, Stuart Lynne <sl@mimsey.com> writes:

Ok. What I did was to use the iozone program.

What this showed was that on my root disk in single user mode I could get about 500KB/sec writing and 1000KB/sec reading a 10MB file. With the disk cache disabled I was able to get the same for writing but only about 500KB/sec for reading. I.e., it appears the cache is a win for reading, at least if you have nothing else happening.

Next I used a script which started up iozone in parallel on all four disks, two to each of the big disks (three) and one on the smaller disk. A total of seven iozones competing with each other.

This showed several interesting results. First it was apparent that higher numbered drives did get priority on the SCSI bus. They consistently got better throughput when competing against lower numbered drives. Specifically, drive 1 got better results than drive 0 on controller 0. Drive 4 got better results than drive 3 on controller 1. All of the drives are high end Seagate and have similar characteristics.

In general, with cache enabled the results were better for reading than writing. When the cache was disabled the write speed in some cases went up a bit and the read speed dropped. It would seem that the readahead in some cases can compete with the writes and slow them down.

My conclusions are that we'll see better performance with the cache. First, the tendency is to do more reading than writing in your average Unix system, so we probably want to optimize that. Second, if we assume an adequate system cache, slow writes shouldn't affect an individual process much. When we write we are filling the cache, and we don't usually care how long it takes to get flushed. Of course we would notice it when writing very large files.

Thus (this is your humble editor again), I can only recommend experiment. Try disabling the cache. Your throughput may go up!

4.8. Tuning Your I/O Subsystem

(This section comes to us courtesy of Perry The Cynic, <perry@sutr.cynic.org>. My own experience agrees pretty completely with his.)

Building a good I/O subsystem boils down to two major points: pick matched components so you don't over-build any piece without benefit, and construct the whole pipe such that it can feed what your OS/application combo needs.

It's important to recognize that ``balance'' is with respect to not only a particular processor/memory subsystem, but also to a particular OS and application mix. A Unix server machine running the whole TCP/IP server suite has radically different I/O requirements than a video-editing workstation. For the ``big boys'' a good consultant will sample the I/O mix (by reading existing system performance logs or taking new measurements) and figure out how big the I/O system needs to be to satisfy that app mix. This is not something your typical Linux buyer will want to do; for one, the application mix is not static and will change over time. So what you'll do instead is design an I/O subsystem that is internally matched and provides maximum potential I/O performance for the money you're willing to spend. Then you look at the price points and compare them with those for the memory subsystem. That's the most important trade-off inside the box.

So the job now is to design and buy an I/O subsystem that is well matched to provide the best bang for your buck. The two major performance numbers for disk I/O are latency and bandwidth. Latency is how long a program has to wait to get a little piece of random data it asked for. Bandwidth is how much contiguous data can be sent to/from the disk once you've done the first piece. Latency is measured in milliseconds (ms); bandwidth in megabytes per second (MB/s). Obviously, a third number of interest is how big all of your disks are together (how much storage you've got), in Gigabytes (GB).

Within a rather big envelope, minimizing latency is the cat's meow. Every millisecond you shave off effective latency will make your system feel significantly faster. Bandwidth, on the other hand, only helps you if you suck a big chunk of contiguous data off the disk, which happens rarely to most programs. You have to keep bandwidth in mind to avoid mis-matching pieces, because (obviously) the lowest usable bandwidth in a pipe constrains everything else.

I'm going to ignore IDE. IDE is no good for multi-processing systems, period. You may use an IDE CD-ROM if you don't care about its performance, but if you care about your I/O performance, go SCSI.

Let's look at the disks first. Whenever you seriously look at a disk, get its data sheet. Every reputable manufacturer has them on their website; just read off the product code and follow the bouncing lights. Beware of numbers (`<12ms fast!') you may see in ads; these folks often look for the lowest/highest numbers on the data sheet and stick them into the ad copy. Not dishonest (usually), but ignorant.

What you need to find out for a disk is:

  1. What kind of SCSI interface does it have? Look for "fast", "ultra", and "wide". Ignore disks that say "fiber" or "differential" (these are specialty physical layers not appropriate for the insides of small computers). Note that you'll often find the same disk with different interfaces.

  2. What is the "typical seek" time (ms)? Make sure you get "typical", not "track-to-track" or "maximum" or some other measure (these don't relate in obvious ways, due to things like head-settling time).

  3. What is the rotational speed? This is typically 4500, 5400, 7200, or 10000 rpm (rotations per minute). Also look for "rotational latency" (in ms). (In a pinch, average rotational latency is approx. 30000/rpm in milliseconds.)

  4. What is the `media transfer rate' or speed (in MB/s)? Many disks will have a range of numbers (say, 7.2-10.8MB/s). Don't confuse this with the "interface transfer rate" which is always a round number (10 or 20 or 40MB/s) and is the speed of the SCSI bus itself.

These numbers will let you do apple-with-apples comparisons of disks. Beware that they will differ on different-size models of the same disk; typically, bigger disks have slower seek times.

Now what does it all mean? Bandwidth first: the `media transfer rate' is how much data you can, under ideal conditions, get off the disk per second. This is a function mostly of rotation speed; the faster the disk rotates, the more data passes under the heads per time unit. This constrains the sustained bandwidth of this disk.

More interestingly, your effective latency is the sum of typical seek time and rotational latency. So for a disk with 8.5ms seek time and 4ms rotational latency, you can expect to spend about 12.5ms between the moment the disk `wants' to read your data and the moment when it actually starts reading it. This is the one number you are trying to make small. Thus, you're looking for a disk with low seek times and high rotation (RPM) rates.

For comparison purposes, the first hard drive I ever bought was a 20MB drive with 65ms seek time and about 3000rpm rotation. A floppy drive has about 100-200ms seek time. A CD-ROM drive can be anywhere between 120ms (fast) and 400ms (slow). The best IDE hard drives have about 10-12ms seek times at 5400 rpm. The best SCSI hard drive I know (the Seagate Cheetah) runs 7.8ms at 10,000rpm.
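Worked as code, using the 30000/rpm approximation and roughly the figures just quoted (a back-of-the-envelope sketch, not vendor data):

    # Effective latency = typical seek + average rotational latency,
    # where average rotational latency ~= 30000/rpm milliseconds.
    def effective_latency_ms(seek_ms, rpm):
        return seek_ms + 30000.0 / rpm

    for name, seek, rpm in [("good IDE drive ", 11.0,  5400),
                            ("7200rpm SCSI   ",  8.5,  7200),
                            ("Seagate Cheetah",  7.8, 10000)]:
        print("%s  %.1f ms" % (name, effective_latency_ms(seek, rpm)))
    # ~16.6, ~12.7, and ~10.8 ms respectively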

Fast, big drives are expensive. Really big drives are very expensive (that's 20GB+ drives as of this writing in August 1998). Really fast drives are pretty expensive (that's about < 8ms right now). On the other end, really slow, small drives are cheap but not cost effective, because it doesn't cost any less to make the cases, ship the drives, and sell them.

In between is a `sweet spot' where moving in either direction (cheaper or more expensive) will cost you more than you get out of it. The sweet spot moves (towards better value) with time. Right now (August 1998), it's at about 4GB drives, 8-10ms, 5400-7200rpm, fast or ultra SCSI. If you can make the effort, go to your local computer superstore and write down a dozen or so drives they sell `naked'. (If they don't sell at least a dozen hard drives naked, find yourself a better store. Use the Web, Luke!) Plot cost against size, seek time, and rotational speed, and it will usually become pretty obvious which ones to get for your budget.

Do look for specials in stores; many superstores buy overstock from manufacturers. If this is near the `sweet spot', it's often surprisingly cheaper than comparable drives. Just make sure you understand the warranty procedures.

Note that if you need a lot of capacity, you may be better off with two (or more) drives than a single, bigger one. Not only can it be cheaper (2x4GB is often cheaper than 1x9GB), but you end up with two separate head assemblies that move independently, which can cut down on latency quite a bit (see below).

Once you've decided which kind of drive(s) you want, you must decide how to distribute them over one or more SCSI buses. Yes, you may want more than one SCSI bus. (My current desktop machine has three.) Essentially, the trick is to make sure that all the disks on one bus, talking at the same time, don't exceed the capacity of that bus. At this time, I can't recommend anything but an Ultra/Wide SCSI controller. This means that the attached SCSI bus can transfer data at up to 40MB/s for an Ultra/Wide disk, 20MB/s for an Ultra/narrow disk, and 10MB/s for a `fast SCSI' disk. These numbers allow you to do your math: an 8MB/s disk will eat nearly an entire bus on its own if the bus is `fast' (10MB/s). Three 6MB/s ultra/narrow disks fit onto one bus (3*6 = 18MB/s < 20MB/s), but just barely. Two ultra/wide Cheetahs (12.8MB/s each) will share an ultra/wide bus (25.6 < 40), but they would collide on an ultra/narrow bus, and a single Cheetah would be bandwidth-constrained on a (non-ultra) `fast' bus (12.8 > 10).
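The same back-of-the-envelope check, as a few lines of code (the disk and bus figures are the ones from the examples above):

    # Will these disks, all streaming at once, saturate the bus?
    # All rates in MB/sec.
    def bus_check(bus_mb, disk_rates_mb):
        total = sum(disk_rates_mb)
        verdict = "ok" if total <= bus_mb else "OVER"
        return "%s (%.1f of %d MB/s)" % (verdict, total, bus_mb)

    print(bus_check(20, [6, 6, 6]))       # 3 ultra/narrow disks: ok, barely
    print(bus_check(40, [12.8, 12.8]))    # 2 Cheetahs, ultra/wide: ok
    print(bus_check(10, [12.8]))          # 1 Cheetah on a fast bus: OVER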

If you find that you need two SCSI buses, you can go with `dual channel' versions of many popular SCSI controller cards (including the Adaptec). These are simply two controllers on one card (thus taking only one PCI slot). This is cheaper and more compact than two cards; however, on some motherboards with more than 3 PCI slots, using two cards may be somewhat faster (ask me what a PCI bridge is :-).

How do you deal with slow SCSI devices -- CD-ROMs, scanners, tape drives, etc.? If you stick these onto a SCSI bus with fast disks, they will slow things down a bit. You can either accept that (as in ``I hardly ever use my scanner anyway''), or stick them onto a separate SCSI bus off a cheap controller card. Or you can (try to) get an ATA version to stick onto that inevitable IDE interface on your motherboard. The same logic applies to disks you won't normally use, such as removables for data exchange.

If you find yourself at the high end of the bandwidth game, be aware that the theoretical maximum of the PCI bus itself is 132MB/s. That means that a dual ultra/wide SCSI controller (2 x 40MB/s) can fill more than half of the PCI bus's bandwidth, and it is not advisable to add another fast controller to that mix. As it is, your device driver had better be well written, or your entire system will melt down (figuratively speaking).

Incidentally, all of the numbers I used are `optimal' bandwidth numbers. The real scoop is usually somewhere between 50-70% of nominal, but things tend to cancel out - the drives don't quite transfer as fast as they might, but the SCSI bus has overhead too, as does the controller card.

Whether you have a single disk or multiple ones, on one or several SCSI buses, you should give careful thought to their partition layout. Given a set of disks and controllers, this is the most crucial performance decision you'll make.

A partition is a contiguous group of sectors on the disk. Partitioning typically starts at the outside and proceeds inwards. All partitions on one disk share a single head assembly. That means that if you try to overlap I/O on the first and last partition of a disk, the heads must move full stroke back and forth over the disk, which can radically increase seek time delays. A partition that is in the middle of a partition stack is likely to have best seek performance, since at worst the heads only have to move half-way to get there (and they're likely to be around the area anyway).
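You can quantify the middle-partition advantage with a little expected-value arithmetic. If the heads are equally likely to start anywhere on the disk, the expected travel to a partition centered at fraction p of full stroke is p^2 - p + 1/2 (the integral of |x - p| for x uniform on [0, 1]). A sketch:

    # Expected head travel, as a fraction of full stroke, to reach a
    # partition centered at fraction p of the way across the disk.
    def expected_travel(p):
        return p * p - p + 0.5

    for label, p in [("outermost", 0.0), ("middle", 0.5), ("innermost", 1.0)]:
        print("%-9s partition: %.2f of full stroke" % (label, expected_travel(p)))
    # The middle partition averages 0.25 -- half the travel of either edge.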

Whenever possible, split partitions that compete onto different disks. For example, /usr and the swap should be on different disks if at all possible (unless you have outrageous amounts of RAM).

Another wrinkle is that most modern disks use `zone sectoring'. The upshot is that outside partitions will have higher bandwidth than inner ones (there is more data under the heads per revolution). So if you need a work area for data streaming (say, a CD-R pre-image to record), it should go on an outside (early numbered) partition of a fast-rotating disk. Conversely, it's a good convention to put rarely-used, performance-uncritical partitions on the inside (last).

Another note concerns SCSI mode pages. Each (modern) SCSI disk has a small part of its disk (or a dedicated EEPROM) reserved for persistent configuration information. These parameters are called `mode pages', after the mechanism (in the SCSI protocol) for accessing them. Mode page parameters determine, among other things, how the disk will write-cache, what forms of error recovery it uses, how its RAM cache is organized, etc. Very few configuration utilities allow access to mode page parameters (I use FWB Toolkit on a Mac -- it's simply the best tool I know for that task), and the settings are usually factory preset for, uh, Windows 95 environments with marginal hardware and single-user operation. The cache organization and disconnect/reconnect pages in particular can make a tremendous difference in actual performance. Unfortunately there's really no easy lunch here -- if you set mode page parameters wrong, you can screw up your data in ways you won't notice until months later, so this is definitely `no playing with the pushbuttons' territory.

Ah yes, caches. There are three major points where you could cache I/O buffers: the OS, the SCSI controller, and the on-disk controller. Intelligent OS caching is by far the biggest win, for many reasons. RAM caches on SCSI controller cards are pretty pointless these days; you shouldn't pay extra for them, and experiment with disabling them if you're into tinkering.

RAM caches on the drives themselves are a mixed bag. At moderate size (1-2MB), they are a potential big win for Windows 95/98, because Windows has stupid VM and I/O drivers. If you run a true multi-tasking OS like Linux, having unified RAM caches on the disk is a significant loss, since the overlapping I/O threads kick each other out of the cache, and the disk ends up performing work for nothing.

Most high-performance disks can be reconfigured (using mode page parameters, see above) to have `segmented' caches (sort of like a set-associative memory cache). With that configured properly, the RAM caches can be a moderate win, not because caching is so great on the disk (it's much better in the OS), but because it allows the disk controller more flexibility to reschedule its I/O request queue. You won't really notice it unless you routinely have >2 I/O requests pending at the SCSI level. The conventional wisdom (try it both ways) applies.

And finally I do have to make a disclaimer. Much of the stuff above is shameless simplification. In reality, high-performance SCSI disks are very complicated beasties. They run little mini-operating systems that are most comfortable when they have 10-20 I/O requests pending at the same time. Under those circumstances the amortized global latencies are much reduced, though any single request may experience longer latencies than if it were the only one pending. The only really valid analyses are stochastic-process models, which we really don't want to get into here. :-)

4.9. Souping Up X Performance

If you care about X performance, be sure you get a graphics card with a dedicated blitter and a high-speed local-bus connection. If it says "AGP" you have this; AGP is a cross-vendor standard for a local bus optimized for graphics.

These cards speed up X in two ways. First, they offload some common screen-painting operations from the main processor onto specialized processors on the card itself. Second, by using a local bus, they make it possible to send commands to the card faster than the ISA bus would allow. The combined effect can be eye-poppingly fast screen updates even at very high resolutions.

There's no longer much reason to bother with any of the commercial X servers like MetroLink or X/Inside. XFree86 now supports most of the high-end cards that used to be the special preserve of the commercial X versions.

If you're feeling really flush, plump for a 15", 17", or even 20" monitor. The larger size can make a major difference in viewing comfort. Also, you'll be set for 1600x1200, which many cards can support these days. In the meantime, the bigger screen will allow you to use fonts in smaller pixel sizes so that your text windows can be larger, giving you a substantial part of the benefit you'd get from higher pixel resolutions.