City College of San Francisco - CS270
Computer Architecture

Module: Background

Chapter 1 Notes

These notes are on Chapter 1 of Patterson and Hennessy, 5th Edition. They highlight certain areas and indicate the importance of others in our classwork. Some material is added for background.

Section 1.1

Computer performance is not simply a function of the speed of the processor or of the number of cores as will be apparent later in the chapter. As pointed out in this section, performance improvements can come from many sources.

Section 1.2

Be familiar with the Eight Great Ideas in Computer Architecture in this section:

- Design for Moore's Law
- Use abstraction to simplify design
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy

Section 1.3

Know the difference between a machine language and an assembly language. You should also understand the job of a compiler compared to that of an assembler. Many modern compilers have built-in assemblers to save a step.

All data is binary. Some of it is actual data [values] and some is encoded instructions. Which is which is determined by the interpretation placed on the data. You can interpret data values as instructions but it will not make much sense. In fact, errant programs often attempt to do this, by branching to an address which is not part of the program, for example.

The fact that a single read-write random-access memory is used to store instructions and data is the stored-program concept. It seems obvious today, but in the early days, using a single storage area for both instructions and data was a novel idea. Computers that implemented this concept are [incorrectly] referred to as von Neumann machines, as opposed to Harvard machines, which had separate storage areas for instructions and for data. The true von Neumann architecture describes the basic machine more completely than simply how it stores data and instructions - in it a computer is divided into a processing unit, control unit, memory, mass storage, and input/output mechanisms. As we will see, these are close to the five classic components of a computer, which will be covered beginning in Chapter 2.

Section 1.4

Memory is designed as a hierarchy of different memory types because there is an inverse relationship between the cost of the memory type and its speed - how quickly data in that memory can be used by a program. Register memory is faster than cache memory, which is faster than main memory (RAM), which is faster than disk storage.

This section has many important definitions. Of particular importance are the definitions of the ISA (instruction set architecture) and the ABI (application binary interface).

Often the history of computer development is divided into Generations. We usually think of four generations of computer development:

- First generation - vacuum tubes
- Second generation - transistors
- Third generation - integrated circuits
- Fourth generation - very large scale integration (VLSI) and the microprocessor

Interestingly, the first of these generations is the origin of a popular computer term of today. At one point in a program's development, its failure became inexplicable, and the programmers concluded that the hardware was to blame. When the system was disassembled, it was discovered that moths had gotten inside the machine and their remains were interrupting electrical contacts. To this day, 'bugs' are blamed for program malfunctions.
Here is a somewhat scary archival picture showing examples of circuit boards built with each of the first three generations' technologies. Anyone want to guess how large the implementation of a similar function using today's technology would be?

Section 1.5

You should understand the [simplified] wafer-to-working-chip progression in this section and be able to use its equations for cost per die, dies per wafer, and yield.
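
As a rough sketch, here are those simplified relationships expressed in Python. The wafer cost, diameter, die area, and defect density below are made-up numbers purely for illustration, and the dies-per-wafer figure uses only the first approximation (wafer area divided by die area, ignoring edge losses):

# Simplified cost relationships from this section (a sketch, not the book's
# exact presentation).  All numbers below are invented for illustration.
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # First approximation: wafer area / die area (ignores edge losses).
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    return wafer_area / die_area_cm2

def die_yield(defects_per_cm2, die_area_cm2):
    # Simplified yield model: 1 / (1 + defects_per_area * die_area / 2)^2
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    # Cost per good die = cost per wafer / (dies per wafer * yield)
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                         * die_yield(defects_per_cm2, die_area_cm2))

# Example: a $5000 wafer, 30 cm diameter, 1 cm^2 dies, 0.5 defects per cm^2
print(round(cost_per_die(5000.0, 30.0, 1.0, 0.5), 2))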

Section 1.6

Recently the definition of performance has changed. It was always presumed that performance meant response time, the time it takes for a single task to complete. Response time was historically improved by further miniaturization. Because increased miniaturization results in increased heat per unit area, the limit on heat production (set by how fast the heat can be removed) was avoided by decreasing the operating voltage of the chip. Voltage, however, cannot be decreased much further. This means the gains that came from miniaturization have reached a practical limit (at least with current technology), and improvements in response time are now limited.

Instead, performance "improvements" now mean improvements in throughput, which is the total work that can be done by the processor. This improvement is provided by adding cores. Unless a program is rewritten to take advantage of more than one core, this improvement does not speed up a single program. If the user is running multiple programs at a time (i.e., multitasking such as using a browser while performing some other task), the increase in throughput will feel like an improvement in response time.

A side note

We should pause for a moment to explain a drawback in the book's discussion of [response-time] performance and the context in which it must be taken.

The book's discussion of performance analyzes instruction timings only, and implies that these are instruction timings of a user program only. In real life this is not the whole story. If we are to speak of the elapsed ("real" or "wall-clock") time taken by an actual program, it is divided into three categories:

- user time - the time spent executing the program's own instructions
- system time - the time spent in the operating system doing work on the program's behalf (system calls, copying data, and so on)
- waiting time - the time spent waiting for I/O to complete or waiting for the CPU while other programs run

These last two time factors may easily dwarf the user time in an actual program. Remember, it is the "real" time that the user experiences.

The related times reported by Unix - "real" (elapsed), "user", and "sys" (system) - can be seen if a standard command is "timed":

$ time ls -lR ~ > /dev/null 2>&1
real    0m0.246s
user    0m0.015s
sys     0m0.108s

In the output above, you can see the I/O latency by comparing the real time to the sum of the user and sys times. The latency is (0.246 - 0.123)/0.246, or 50% of the total. This can be seen even more clearly by running the same command again immediately:

$ time ls -lR ~ > /dev/null 2>&1
real    0m0.055s
user    0m0.008s
sys     0m0.047s

In this second run, there is zero latency. This is because Unix keeps directory information in memory for as long as it can, and no actual I/O was necessary. Although you could expect the system time to decrease (due to less time spent copying data), the decrease in the user time is harder to explain. It is probably an artifact of this small test case, and it shows the amount of noise in such timing samples.

This side note does not discount the book's approach, but you should realize that very few real programs are such pure "user" programs. Such programs are called "compute-bound". Most programs have a significant I/O component.

When considering performance as a function of instruction timings alone, as the book does, it is not sufficient to talk solely about instruction count, as the time required to perform an instruction may differ between instructions.

To illustrate this, let's consider an addition operation. On most modern machines, an addition instruction works only on operands in registers. This makes it very fast - there is no latency to "fetch" the operands from memory. Hence, addition is much faster than an instruction which accesses memory. In contrast, our Simple Machine's ADD instruction has one operand in the accumulator and one in memory. This requirement to go to memory means that the ADD instruction in the Simple Machine is just as slow as the LOAD instruction.

These differences are highlighted by designating instruction classes, as the book does. A benchmark program is then divided into the percentages of its instructions that fall into each class, and a "global" CPI is calculated from those fractions and the CPI of each class. Remember, this global CPI only applies to the program in question, although an estimate may be derived by considering average instruction frequencies for "typical" programs.
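
As a small illustration of the weighted-average calculation (the class names, per-class CPIs, and mix percentages below are invented for the example, not taken from the book or any real benchmark):

# Weighted-average ("global") CPI from instruction classes.
# All numbers here are made up for illustration.
classes = {
    # class name: (CPI for the class, fraction of the instruction count)
    "ALU":        (1, 0.50),
    "load/store": (2, 0.30),
    "branch":     (3, 0.20),
}

global_cpi = sum(cpi * frac for cpi, frac in classes.values())
print(global_cpi)                      # 0.5*1 + 0.3*2 + 0.2*3 = 1.7

# CPU time for this mix, given an instruction count and a clock rate:
instruction_count = 1_000_000
clock_rate_hz = 1_000_000_000          # 1 GHz
print(instruction_count * global_cpi / clock_rate_hz)   # seconds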

The attentive reader may have noticed that branches comprise a separate instruction class, and may have wondered why they are slower than other instructions. This is because of their effect on the instructions after them and on instruction caches. We will see this effect later in the course when we discuss instruction pipelining and how branches affect it. This effect branches have on the instructions that come after them is one topic of speculative execution, which was discussed in Section 1.2.

Section 1.7

Read this section for background only. We do not have the time to spend on power equations. You should be aware of the concepts, however.

Section 1.8

As we know, many of the improvements in computers today come from increasing parallelization. This section makes the important point that, although increasing the parallel capability of a processor by adding more "cores" increases throughput, it does not necessarily improve response time. The reason is that programs must be rewritten to take advantage of multiple CPUs. You should also note that multi-threading is used mainly to give a program a second thread of control - not to execute in parallel. For a program to really use multiple CPUs it must be written to divide up its calculations so they can execute on several CPUs at once. This requires separating out data dependencies, which can be very difficult. The only "parallel"-type speedups available to programs without rewriting them are items such as pipelining, speculative execution, and instruction-level parallelism that are provided by the hardware.
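
As a minimal sketch of what "dividing up the calculations" means (this example is mine, not the book's, and the data size and worker count are arbitrary):

# Summing a list by splitting it into independent chunks that run on
# separate cores.  There are no data dependencies between the chunks,
# which is what makes the parallel version possible.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each worker sums its own slice of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    step = len(data) // n_workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]

    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(data))   # same answer, but the work ran on several cores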

Multiple cores have a more immediate effect on throughput than on response time. For many applications, such as a datacenter, this performance metric may very well be more important.

Section 1.9

Read the SPEC benchmarking information for background. As discussed, it is very difficult to compare the speed of different systems, or even of the same system doing different tasks. There are a huge number of variables. For this purpose, great pains have been taken by some organizations to write benchmark software that can be used for these types of comparisons. This software is written to try to mimic the types of tasks users are interested in. The most commonly quoted benchmark sets are the SPEC benchmarks. Many manufacturers will run SPEC benchmarks on their hardware, and quote the results of those benchmarks they performed well on in ads. It may be more interesting to know the results of the benchmarks they do not quote.

Section 1.10

Amdahl's law is very important, and is often applied to the use of multiple cores on a modern processor. In effect it is a statement of a diminishing-returns law: a program is divided into two parts - the part that can be parallelized and the part that cannot. The second part's execution time is fixed, and it limits the percentage improvement of the overall execution time that can be achieved by parallelization.

This can also be seen in other systems. Returning to our example of a long recursive listing (above), where the I/O latency was 50% of the elapsed time - this places an upper limit on the amount of improvement of elapsed time (which is, of course, the time that the user 'feels') that can be achieved by increasing instruction speed. For example, if we executed instructions twice as fast, the elapsed time (by Amdahl) would only decrease by 25%. This is where diminishing-returns is apparent - the next doubling of instruction speed only results in a 12.5% improvement (compared to the original) - the next, only 6.25%.
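
Here is a small sketch of that calculation in Python. The 50% I/O fraction comes from the listing example above, and the speedup function is the standard Amdahl's-law form:

# Amdahl's law: overall speedup = 1 / ((1 - f) + f / s), where f is the
# fraction of the time that benefits and s is the speedup of that fraction.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

f = 0.5                      # half the elapsed time was I/O latency (untouched)
for s in (2, 4, 8):          # double the instruction speed each time
    new_time = 1.0 / amdahl_speedup(f, s)
    improvement = (1.0 - new_time) * 100
    print(f"{s}x instructions -> elapsed time {new_time:.4f} "
          f"({improvement:.2f}% better than the original)")

# Prints 0.7500 (25%), 0.6250 (37.5%), 0.5625 (43.75%): each doubling adds
# only 12.5% and then 6.25% of the original time - diminishing returns.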

Section 1.11

Note the overall equation for the execution time of a program. You can also see the list of the five classic components of a computer in the Roadmap.


Copyright 2014 Greg Boyd - All Rights Reserved.