City College of San Francisco - CS270
Computer Architecture

Module: Background

Chapter 1 Notes

These notes are on Chapter 1 of Patterson and Hennessy, 5th Edition. They highlight certain areas and indicate the importance of others in our classwork. Some material is added for background.

Section 1.1

Computer performance is not simply a function of the speed of the processor or of the number of cores as will be apparent later in the chapter. As pointed out in this section, performance improvements can come from many sources.

Section 1.2

Be familiar with the Eight Great Ideas in Computer Architecture in this section:

- Design for Moore's Law
- Use abstraction to simplify design
- Make the common case fast
- Performance via parallelism
- Performance via pipelining
- Performance via prediction
- Hierarchy of memories
- Dependability via redundancy

Section 1.3

Know the difference between a machine language and an assembly language. You should also understand the job of a compiler compared to that of an assembler. Many modern compilers have built-in assemblers to save a step.

All data is binary. Some of it is actual data [values] and some is encoded instructions. Which is which is determined by the interpretation placed on the data. You can interpret data values as instructions but it will not make much sense. In fact, errant programs often attempt to do this, by branching to an address which is not part of the program, for example.

The fact that a single read-write random-access memory is used to store instructions and data is the stored-program concept. It seems obvious today, but in the early days, using a single storage area for both instructions and data was a novel idea. Computers that implemented this concept are [incorrectly] referred to as von Neumann machines, as opposed to Harvard machines, which had separate storage areas for instructions and for data. The true von Neumann architecture describes the basic machine more completely than simply how it stores data and instructions - in it a computer is divided into a processing unit, control unit, memory, mass storage, and input/output mechanisms. As we will see, these are close to the five classic components of a computer, which will be covered beginning in Chapter 2.

Section 1.4

Memory is designed as a hierarchy of different memory types because there is an inverse relationship between the cost of the memory type and its speed - how quickly data in that memory can be used by a program. Register memory is faster than cache memory, which is faster than main memory (RAM), which is faster than disk storage.

This section has many important definitions. Of particular importance are the definitions of the ISA (instruction set architecture) and the ABI (application binary interface).

Often the history of computer development is divided into Generations. We usually think of four generations of computer development:

- First generation - vacuum tubes
- Second generation - transistors
- Third generation - integrated circuits
- Fourth generation - very large scale integration (VLSI) and the microprocessor

Interestingly, the first of these generations is the origin of a popular computer term of today. At one point in a program's development, its failure became inexplicable, and the programmers concluded that the hardware was to blame. When the system was disassembled, it was discovered that moths had gotten inside the machine and their remains were interrupting electrical contacts. To this day, 'bugs' are blamed for program malfunctions.
Here is a somewhat scary archival picture showing examples of circuit boards built with each of the first three generations' technologies. Anyone want to guess how large the implementation of a similar function using today's technology would be?

Section 1.5

You should understand the [simplified] wafer-to-working-chip progression in this section and be able to use its equations for cost per die, dies per wafer, and yield.
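
As a rough sketch, here are those simplified relationships expressed in Python. The wafer cost, diameter, die area, and defect density below are made-up numbers purely for illustration, and the dies-per-wafer figure uses only the first approximation (wafer area divided by die area, ignoring edge losses):

# Simplified cost relationships from this section (a sketch, not the book's
# exact presentation).  All numbers below are invented for illustration.
import math

def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    # First approximation: wafer area / die area (ignores edge losses).
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    return wafer_area / die_area_cm2

def die_yield(defects_per_cm2, die_area_cm2):
    # Simplified yield model: 1 / (1 + defects_per_area * die_area / 2)^2
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2.0) ** 2

def cost_per_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    # Cost per good die = cost per wafer / (dies per wafer * yield)
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2)
                         * die_yield(defects_per_cm2, die_area_cm2))

# Example: a $5000 wafer, 30 cm diameter, 1 cm^2 dies, 0.5 defects per cm^2
print(round(cost_per_die(5000.0, 30.0, 1.0, 0.5), 2))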

Section 1.6

Recently the definition of performance has changed. It was always presumed that performance meant response time, the time it takes for a single task to complete. Response time was historically improved by further miniaturization. Because increased miniaturization results in increased heat per unit area, the limit on heat production (set by how fast the heat can be removed) was avoided by decreasing the operating voltage of the chip. Voltage, however, cannot be decreased much further. This means the gains that came from miniaturization have reached a practical limit (at least with current technology), and improvements in response time are now limited.

Instead, performance "improvements" now mean improvements in throughput, which is the total work that can be done by the processor. This improvement is provided by adding cores. Unless a program is rewritten to take advantage of more than one core, this improvement does not speed up a single program. If the user is running multiple programs at a time (i.e., multitasking such as using a browser while performing some other task), the increase in throughput will feel like an improvement in response time.

A side note

We should pause for a moment to explain a drawback in the book's discussion of [response-time] performance and the context in which it must be taken.

The book's discussion of performance analyzes instruction timings only, and implies that these are instruction timings of a user program only. In real life this is not the whole story. If we are to speak of the elapsed ("real" or "wall-clock") time taken by an actual program, it is divided into three categories:

- user time - the time spent executing the program's own instructions
- system time - the time spent in the operating system doing work on the program's behalf (system calls, copying data, and so on)
- waiting time - the time spent waiting for I/O to complete or waiting for the CPU while other programs run

These last two time factors may easily dwarf the user time in an actual program. Remember, it is the "real" time that the user experiences.

The related times reported by Unix - "real" (elapsed), "user", and "sys" (system) - can be seen if a standard command is "timed":

$ time ls -lR ~ > /dev/null 2>&1
real    0m0.246s
user    0m0.015s
sys     0m0.108s

In the output above, you can see the I/O latency by comparing the real time to the sum of the user and sys times. The latency is (0.246 - 0.123)/0.246, or 50% of the total. This can be seen even more clearly by running the same command again immediately:

$ time ls -lR ~ > /dev/null 2>&1
real    0m0.055s
user    0m0.008s
sys     0m0.047s

In this second run, there is zero latency. This is because Unix keeps directory information in memory for as long as it can, and no actual I/O was necessary. Although you could expect the system time to decrease (due to less time spent copying data), the decrease in the user time is harder to explain. It is probably an artifact of this small test case, and it shows the amount of noise in such timing samples.

This side note does not discount the book's approach, but you should realize that very few real programs are such pure "user" programs. Such programs are called "compute-bound". Most programs have a significant I/O component.

When considering performance as a function of instruction timings alone, as the book does, it is not sufficient to talk solely about instruction count, as the time required to perform an instruction may differ between instructions.

To illustrate this, let's consider an addition operation. On most modern machines, an addition instruction works only on operands in registers. This makes it very fast - there is no latency to "fetch" the operands from memory. Hence, addition is much faster than an instruction which accesses memory. In contrast, our Simple Machine's ADD instruction has one operand in the accumulator and one in memory. This requirement to go to memory means that the ADD instruction in the Simple Machine is just as slow as the LOAD instruction.

These differences are highlighted by designating instruction classes, as the book does. A benchmark program is then divided into the percentages of its instructions that fall into each class, and a "global" CPI is calculated from those fractions and the CPI of each class. Remember, this global CPI only applies to the program in question, although an estimate may be derived by considering average instruction frequencies for "typical" programs.
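
As a small illustration of the weighted-average calculation (the class names, per-class CPIs, and mix percentages below are invented for the example, not taken from the book or any real benchmark):

# Weighted-average ("global") CPI from instruction classes.
# All numbers here are made up for illustration.
classes = {
    # class name: (CPI for the class, fraction of the instruction count)
    "ALU":        (1, 0.50),
    "load/store": (2, 0.30),
    "branch":     (3, 0.20),
}

global_cpi = sum(cpi * frac for cpi, frac in classes.values())
print(global_cpi)                      # 0.5*1 + 0.3*2 + 0.2*3 = 1.7

# CPU time for this mix, given an instruction count and a clock rate:
instruction_count = 1_000_000
clock_rate_hz = 1_000_000_000          # 1 GHz
print(instruction_count * global_cpi / clock_rate_hz)   # seconds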

The attentive reader may have noticed that branches comprise a separate instruction class, and may have wondered why they are slower than other instructions. This is because of their effect on the instructions after them and on instruction caches. We will see this effect later in the course when we discuss instruction pipelining and how branches affect it. This effect branches have on the instructions that come after them is one topic of speculative execution, which was discussed in Section 1.2.

Section 1.7

Read this section for background only. We do not have the time to spend on power equations. You should be aware of the concepts, however.

Section 1.8

As we know, many of the improvements in computers today come from increasing parallelization. This section makes the important point that, although increasing the parallel capability of a processor by adding more "cores" increases throughput, it does not necessarily improve response time. The reason is that programs must be rewritten to take advantage of multiple CPUs. You should also note that multi-threading is used mainly to give a program a second thread of control - not to execute in parallel. For a program to really use multiple CPUs it must be written to divide up its calculations so they can execute on several CPUs at once. This requires separating out data dependencies, which can be very difficult. The only "parallel"-type speedups available to programs without rewriting them are items such as pipelining, speculative execution, and instruction-level parallelism that are provided by the hardware.
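
As a minimal sketch of what "dividing up the calculations" means (this example is mine, not the book's, and the data size and worker count are arbitrary):

# Summing a list by splitting it into independent chunks that run on
# separate cores.  There are no data dependencies between the chunks,
# which is what makes the parallel version possible.
from concurrent.futures import ProcessPoolExecutor

def partial_sum(chunk):
    # Each worker sums its own slice of the data independently.
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = 4
    step = len(data) // n_workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]

    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(data))   # same answer, but the work ran on several cores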

Multiple cores have a more immediate effect on throughput than on response time. For many applications, such as a datacenter, this performance metric may very well be more important.

Section 1.9

Read the SPEC benchmarking information for background. As discussed, it is very difficult to compare the speed of different systems, or even of the same system doing different tasks. There are a huge number of variables. For this purpose, great pains have been taken by some organizations to write benchmark software that can be used for these types of comparisons. This software is written to try to mimic the types of tasks users are interested in. The most commonly quoted benchmark sets are the SPEC benchmarks. Many manufacturers will run SPEC benchmarks on their hardware, and quote the results of those benchmarks they performed well on in ads. It may be more interesting to know the results of the benchmarks they do not quote.

Section 1.10

Amdahl's law is very important, and is often applied to the use of multiple cores on a modern processor. In effect it is a statement of a diminishing-returns law: a program is divided into two parts - the part that can be parallelized and the part that cannot. The second part's execution time is fixed, and it limits the percentage improvement of the overall execution time that can be achieved by parallelization.

This can also be seen in other systems. Returning to our example of a long recursive listing (above), where the I/O latency was 50% of the elapsed time - this places an upper limit on the amount of improvement of elapsed time (which is, of course, the time that the user 'feels') that can be achieved by increasing instruction speed. For example, if we executed instructions twice as fast, the elapsed time (by Amdahl) would only decrease by 25%. This is where diminishing-returns is apparent - the next doubling of instruction speed only results in a 12.5% improvement (compared to the original) - the next, only 6.25%.
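
Here is a small sketch of that calculation in Python. The 50% I/O fraction comes from the listing example above, and the speedup function is the standard Amdahl's-law form:

# Amdahl's law: overall speedup = 1 / ((1 - f) + f / s), where f is the
# fraction of the time that benefits and s is the speedup of that fraction.
def amdahl_speedup(f, s):
    return 1.0 / ((1.0 - f) + f / s)

f = 0.5                      # half the elapsed time was I/O latency (untouched)
for s in (2, 4, 8):          # double the instruction speed each time
    new_time = 1.0 / amdahl_speedup(f, s)
    improvement = (1.0 - new_time) * 100
    print(f"{s}x instructions -> elapsed time {new_time:.4f} "
          f"({improvement:.2f}% better than the original)")

# Prints 0.7500 (25%), 0.6250 (37.5%), 0.5625 (43.75%): each doubling adds
# only 12.5% and then 6.25% of the original time - diminishing returns.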

Section 1.11

Note the overall equation for the execution time of a program. You can also see the list of the five classic components of a computer in the Roadmap.


Copyright 2014 Greg Boyd - All Rights Reserved.