Chapter 1 Notes
These notes cover Chapter 1 of Patterson and Hennessy, 5th Edition. They highlight certain areas and indicate the importance of others in our classwork. Some material is added for background.
Section 1.1
Computer performance is not simply a function of the speed of the processor or of the number of cores, as will become apparent later in the chapter. As pointed out in this section, performance improvements can come from many sources.
Section 1.2
Be familiar with the Eight Great Ideas in Computer Architecture presented in this section.
Section 1.3
Know the difference between a machine language and an assembly language. You should also understand the job of a compiler compared to that of an assembler. Many modern compilers have built-in assemblers to save a step.
All data is binary. Some of it is actual data [values] and some is encoded instructions. Which is which is determined by the interpretation placed on the data. You can interpret data values as instructions, but the result will not make much sense. In fact, errant programs often do exactly this, for example by branching to an address that is not part of the program.
The fact that a single read-write random-access memory is used to store both instructions and data is the stored-program concept. It seems obvious today, but in the early days, using a single storage area for both instructions and data was a novel idea. Computers that implemented this concept are [incorrectly] referred to as von Neumann machines, as opposed to Harvard machines, which had separate storage areas for instructions and for data. The true von Neumann architecture describes the basic machine more completely than simply how it stores data and instructions: in it, a computer is divided into a processing unit, control unit, memory, mass storage, and input/output mechanisms. As we will see, these are close to the five classic components of a computer, which will be covered beginning in Chapter 2.
Section 1.4
Memory is designed as a hierarchy of different memory types because there is an inverse relationship between the cost of the memory type and its speed - how quickly data in that memory can be used by a program. Register memory is faster than cache memory, which is faster than main memory (RAM), which is faster than disk storage.
This section has many important definitions. Of particular importance are the ISA (instruction set architecture) and the ABI (application binary interface).
The history of computer development is often divided into generations; we usually speak of four of them.
The second generation of development was heralded by the use of discrete transistors as switches. These handy devices were the first giant step on the path to component miniaturization (and to the accompanying power reduction). However, the fact that discrete transistors were used meant that commercial production, though now possible, was still very difficult. Often a functioning computer required the hand wire-wrapping of hundreds of thousands of transistors along with their accompanying resistors and other circuit elements. Computers at this time had main RAM storage of up to an amazing 128 kB of core memory, and this was so efficiently utilized that it could support a bank of thirty or so timesharing users on text terminals. The CPU and memory of these systems were eventually shrunk to the size of a walk-in closet, fulfilling an early computer professional's prediction that "in his lifetime a general-purpose computer would fit in a single room".
Section 1.5
You should understand the [simplified] wafer-to-working-chip progression in this section and be able to use its equations for cost per die, dies per wafer, and yield.
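As a quick sketch of how those equations fit together (the formulas below are written from memory of the text's simplified versions, so check them against the section; the wafer cost, diameter, and defect density are made-up illustrative numbers, not real process data):

import math

# Dies per wafer: wafer area divided by die area, minus an edge-loss term.
def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss

# Yield: fraction of dies that work, from the simplified defect model.
def die_yield(defects_per_cm2, die_area_cm2):
    return 1.0 / (1.0 + defects_per_cm2 * die_area_cm2 / 2) ** 2

# Cost per good die: the wafer cost spread over the working dies.
def cost_per_die(wafer_cost, wafer_diameter_cm, die_area_cm2, defects_per_cm2):
    good_dies = dies_per_wafer(wafer_diameter_cm, die_area_cm2) * \
                die_yield(defects_per_cm2, die_area_cm2)
    return wafer_cost / good_dies

# Illustrative numbers only: a $5000, 30 cm wafer, 1 cm^2 dies, 0.5 defects/cm^2.
print(round(cost_per_die(5000.0, 30.0, 1.0, 0.5), 2))

Note how quickly yield, and therefore cost per die, worsens as die area grows.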
Section 1.6
Recently the definition of performance has changed. It was always presumed that performance meant response time, the time it takes for a single task to complete. Response time was historically improved by further miniaturization. Because increased miniaturization results in more heat per unit area, the limit on heat production, which is set by how fast the heat can be removed, was avoided by decreasing the chip's operating voltage. Voltage, however, has now reached a limit and cannot be decreased much further. This means miniaturization has met its practical limit (at least with current technology), and improvements in response time are now limited.
Instead, performance "improvements" now mean improvements in throughput, which is the total work that can be done by the processor. This improvement is provided by adding cores. Unless a program is rewritten to take advantage of more than one core, this improvement cannot affect a single program. If the user is running multiple programs at a time (i.e., multitasking such as using a browser while performing some other task), the increase in throughput will feel like an improvement in response time.
A side note
We should pause for a moment and explain a drawback in the book's discussion of [response time] performance, and the context in which its conclusions must be taken.
The book's discussion of performance analyzes instruction timings only, and implies that these are the instruction timings of the user's program alone. In real life this is not the whole story. The elapsed ("real" or "wall-clock") time taken by an actual program is divided into three categories: the time spent executing the program's own instructions ("user" time), the time spent in the operating system working on the program's behalf ("system" time), and the time spent waiting, whether for I/O to complete or for other processes to run.
These last two time factors may easily
dwarf the user time in an actual program. Remember, it is the
"real" time that the user experiences.
These three times ("real", "system", and "user") can be seen if a standard Unix command is "timed":
$ time ls -lR ~ > /dev/null 2>&1

real    0m0.246s
user    0m0.015s
sys     0m0.108s
In the output above, you can see the I/O latency by comparing the real time to the sum of the user and sys times. The latency is (0.246 - 0.123) / 0.246, or 50% of the total. This can be seen even more clearly by running the same command again immediately:
$ time ls -lR ~ > /dev/null 2>&1

real    0m0.055s
user    0m0.008s
sys     0m0.047s
In this second run, there is zero latency. This is because Unix keeps directory information in memory for as long as it can, so no actual I/O was necessary. Although you could expect the system time to decrease (due to less time spent copying data), the decrease in the user time is difficult to explain. It is probably an artifact of a limited test case, and it shows the amount of noise in a single timing sample.
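To make the arithmetic in this side note concrete, here is a tiny sketch that computes the non-CPU ("latency") fraction of the elapsed time from the numbers shown above:

# Fraction of elapsed (real) time not accounted for by CPU time (user + sys).
def latency_fraction(real, user, sys):
    return (real - (user + sys)) / real

print(latency_fraction(0.246, 0.015, 0.108))   # first run: 0.5 (50% waiting)
print(latency_fraction(0.055, 0.008, 0.047))   # second run: 0.0 (no waiting)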
This side note does not discount the book's approach, but you should realize that very few real programs are pure "user" programs of this kind. Such programs are called "compute-bound". Most programs have a significant I/O component.
When considering performance as a function of instruction timings alone, as the book does, it is not sufficient to talk solely about instruction count, as the time required to perform an instruction may differ between instructions.
To illustrate this, let's consider an addition operation. In most modern machines, an addition instruction works only on operands in registers. This makes it very fast: there is no latency to "fetch" the operands from memory. Hence, addition is much faster than an instruction that accesses memory. As a contrasting example, our Simple Machine's ADD instruction has one operand in the accumulator and one in memory. This requirement to go to memory means that the ADD instruction in the Simple Machine is just as slow as the LOAD instruction.
These differences are highlighted by designating instruction classes, as the book does. A benchmark program is then divided into the percentages of its instructions that fall into each class, and a "global" CPI is calculated from those fractions and the CPI of each class. Remember, this global CPI applies only to the program in question, although an estimate may be derived by considering average instruction frequencies for "typical" programs.
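Here is a small sketch of that calculation; the instruction classes, fractions, and per-class CPIs are invented for illustration and are not taken from the book or from any real benchmark:

# Global CPI as a weighted average over instruction classes, then CPU time.
# Class fractions and per-class CPIs below are invented for illustration.
classes = {
    "ALU":        (0.50, 1),   # (fraction of instructions, CPI of the class)
    "load/store": (0.30, 2),
    "branch":     (0.20, 3),
}

global_cpi = sum(fraction * cpi for fraction, cpi in classes.values())

instruction_count = 1_000_000
clock_rate_hz = 1e9                     # a hypothetical 1 GHz clock
cpu_time = instruction_count * global_cpi / clock_rate_hz

print(round(global_cpi, 3))             # 1.7 cycles per instruction for this mix
print(round(cpu_time, 6))               # 0.0017 seconds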
The attentive reader may have noticed that branches comprise a separate instruction class, and may have wondered why they are slower than other instructions. This is because of their effect on the instructions after them and on instruction caches. We will see this effect later in the course when we discuss instruction pipelining and how branches affect it. The effect branches have on the instructions that come after them is also one of the motivations for speculative execution, which was discussed in section 1.2.
Section 1.7
Read this section for background only. We do not have the time to spend on power equations. You should be aware of the concepts, however.
Section 1.8
As we know, many of the improvements in computers today come from increasing parallelization. This section makes the important point that, although increasing the parallel capability of a processor by adding more "cores" increases throughput, it does not necessarily improve response time. The reason is that programs must be rewritten to take advantage of multiple CPUs. You should also note that multi-threading is used mainly to give a program a second thread of control, not to execute in parallel. For a program to really use multiple CPUs it must be written to divide up its calculations to execute on several CPUs at once. This requires separating out data dependencies, which can be very difficult. The only "parallel"-type speedups available to programs without rewriting them are features such as pipelining, speculative execution, and instruction-level parallelism that are provided by the hardware.
Multiple cores have a more immediate effect on throughput than on response time. For many workloads, such as those run in a datacenter, throughput may very well be the more important performance metric.
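To make concrete what "dividing up its calculations" looks like, here is a minimal sketch that splits an independent summation across worker processes. It is an idealized case with no data dependencies, which is exactly why it is easy to parallelize:

# Minimal sketch: an embarrassingly parallel sum split across four processes.
# Real programs usually have data dependencies that make this much harder.
from multiprocessing import Pool

def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

if __name__ == "__main__":
    n = 10_000_000
    step = n // 4
    chunks = [(i, min(i + step, n)) for i in range(0, n, step)]
    with Pool(processes=4) as pool:              # one worker per chunk
        total = sum(pool.map(partial_sum, chunks))
    print(total == sum(range(n)))                # True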
Section 1.9
Read the SPEC
benchmarking information for background. As discussed, it is very
difficult to compare the speed of different systems, or even of the
same system doing different tasks. There are a huge number of
variables. For this purpose, great pains have been taken by some
organizations to write benchmark software that can be used for these
types of comparisons. This software is written to try to mimic the types of tasks users are interested in. The most commonly quoted benchmark sets are the SPEC benchmarks. Many manufacturers will run SPEC benchmarks on their hardware and quote, in their ads, the results of the benchmarks they performed well on. It may be more interesting to know the results of the benchmarks they do not quote.
Section 1.10
Amdahl's law is very important, and is often applied to using multiple cores on a modern processor. In effect it is a statement of diminishing returns: a program is divided into two parts, the part that can be parallelized and the part that cannot. The second part's execution time is fixed, and it limits the improvement in overall execution time that can be achieved by parallelization.
This can also be seen in other systems. Returning to our example of a long recursive listing (above), where the I/O latency was 50% of the elapsed time: this places an upper limit on the improvement in elapsed time (which is, of course, the time the user 'feels') that can be achieved by increasing instruction speed. For example, if we executed instructions twice as fast, the elapsed time would (by Amdahl) only decrease by 25%. This is where the diminishing returns become apparent: the next doubling of instruction speed yields only a 12.5% improvement (relative to the original time), and the next, only 6.25%.
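In the form used here, if a fraction f of the elapsed time benefits from a speedup of s, the new elapsed time is (1 - f) + f/s of the original. A quick sketch reproducing the diminishing returns described above (f = 0.5 for the listing example):

# Amdahl's law: only the improvable fraction f benefits from the speedup s.
def remaining_time(f, s):
    return (1 - f) + f / s       # new elapsed time as a fraction of the original

f = 0.5                          # half the elapsed time was instruction execution
previous = 1.0
for s in (2, 4, 8):
    t = remaining_time(f, s)
    print(s, previous - t)       # incremental gains: 0.25, 0.125, 0.0625
    previous = t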
Section 1.11
Note the overall equation for the execution time of a program.
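For reference, the equation referred to is the chapter's classic CPU performance equation:

CPU time = (instructions / program) x (clock cycles / instruction) x (seconds / clock cycle)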
You can also see the list of the five classic components of
a computer in the Roadmap.