City College of San Francisco - CS270 Computer Architecture
Module: MIPS-III
Data Areas
By now you should be used to using the .text and .data directives in MARS to
control whether you are defining code (.text) or data (.data) for the
program. You have probably noticed that instructions are placed
in memory at significantly different addresses than data. There
are several reasons for this separation.
To ensure that these regions do not overlap and that sufficient room
is allowed for each, systems adopt a convention for the starting
address of each. These blocks of addresses are typically called
segments. On MARS (and on
the traditional 32-bit MIPS processor) the .text segment starts at
0x400000 and the .data segment starts at 0x10000000. This gives us
a contiguous region of about 250MB of .text space - enough for a fairly large
program. As an indication, the biggest memory hog on hills today
(by a significant multiple!) - oracle - has a .text size of 183MB.
Of course, if a program's .text size is larger, there is no reason
it cannot be expanded into the much larger data region.
The data region is a bit more interesting. It starts at 0x10000000 and ends at 0x80000000 - an area of 1.75GB. But there are several kinds of data. The first (and most important) division is between static data and stack data.
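As a quick check on the sizes quoted above, the subtraction can be written out in C. This is only a sketch: the macro and function names are made up for illustration, and the addresses are simply the ones given in the text.

```c
#include <assert.h>
#include <stdint.h>

/* Segment boundaries as given in the text (names are illustrative only). */
#define TEXT_BASE 0x00400000u   /* start of .text */
#define DATA_BASE 0x10000000u   /* start of .data, so also the limit of .text */
#define DATA_END  0x80000000u   /* end of the data region */

/* Size in bytes of the half-open address range [base, end). */
uint32_t region_bytes(uint32_t base, uint32_t end)
{
    assert(end >= base);
    return end - base;
}

/* Convert a byte count to whole mebibytes (1 MiB = 2^20 bytes). */
uint32_t bytes_to_mib(uint32_t bytes)
{
    return bytes >> 20;
}
```

With these values, the .text region from TEXT_BASE to DATA_BASE comes to 252 MiB (the "about 250MB" above), and the data region from DATA_BASE to DATA_END comes to 1792 MiB, which is 1.75GB.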
Stack data is used for temporary data and arguments to support a
chain of procedure calls. Without a mechanism such as the stack,
many forms of abstraction in computer programs would not be
possible. The stack must be allowed to 'grow' as necessary to
support longer call chains of possibly more complex procedures.
Data on the stack exists as long as the procedure that defined it
is active. Thus the address of an item on the stack is determined
when the procedure is called, and will [probably] be different if
the same procedure is called a second time.
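To see that a stack item's address is fixed only when its procedure is called, we can record the address of the same local variable at each level of a chain of nested calls. The function below is a hypothetical helper written for this illustration; it assumes nothing about which direction the stack grows, only that each active call has its own copy of the variable.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Record the address of a local variable at each level of a chain of
 * nested calls. addrs[] must have room for depth entries.
 */
void record_local_addresses(uintptr_t *addrs, int depth)
{
    volatile int local = depth;   /* lives in this call's stack frame */

    assert(depth > 0);
    addrs[depth - 1] = (uintptr_t)&local;
    if (depth > 1) {
        /* Our frame is still active while the nested call runs. */
        record_local_addresses(addrs, depth - 1);
    }
    local = 0;   /* keep the variable in use until the nested call returns */
}
```

After calling record_local_addresses(addrs, 3), the three recorded addresses are all different, even though every one of them is "the same" variable in the source code.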
Static data is used for data that resides at a fixed address and is independent of active procedures. This consists of global variables, string data (such as that used for messages), and dynamic data such as that allocated by malloc (read: new()). This last type of data is still termed static because it is valid for as long as the programmer wants and is independent of procedures: in other words, this data does not go out of scope. It must be freed either explicitly or by garbage collection (once all references to it are gone). Just like stack data, static data must be allowed to grow.
Support for two types of data that must grow using a single block of memory is implemented by starting each of the types on opposite ends of the address range. Then, one type of data grows from lower to higher addresses (the static data) and the other (the stack data) grows from higher to lower addresses:
0x10000000                 .................                 0x7ffffffc
static data grows ----->                       <----- stack data grows
The static data area (which is usually just called the data area) is subdivided into three smaller segments:
initialized data, uninitialized data, and heap data.
initialized data - this
data is defined in program modules, space is allocated explicitly
for it, and it is often given an initial value. Examples of
initialized data are an array that is initialized, and a message
that has predefined textual data. When using MARS it also contains
variables whose size is known but which are not initialized (i.e.
using a .space directive). This latter category would normally be
part of uninitialized data. The size of the initialized data
region is known when the program is created (compiled and linked,
or, in the case of MARS, just assembled).
uninitialized data - this data
follows initialized data and usually includes arrays that have a
size but that are not initialized (which are part of initialized
data in MARS). This portion of uninitialized data is called bss. Just like initialized
data, the size of bss is
known when the program is created. On MARS, both initialized and
uninitialized data are in the .data segment.
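In C terms, the difference between the first two kinds of static data looks roughly like the sketch below. The variable names are invented for the example, and whether a particular toolchain really places scratch in a section named bss is up to that toolchain; what C itself guarantees is that such an object starts out as zero.

```c
#include <assert.h>
#include <stddef.h>

/* Initialized data: space is reserved and the initial values are
   stored in the program image. */
int  primes[4] = { 2, 3, 5, 7 };
char prompt[]  = "Enter the integer to convert to hex:";

/* Uninitialized data: only the size is known when the program is
   built. C guarantees that such objects start out as zero. */
int scratch[256];

/* Return 1 if every element of a[0..n-1] is zero, otherwise 0. */
int all_zero(const int *a, size_t n)
{
    assert(a != NULL);
    for (size_t i = 0; i < n; i++) {
        if (a[i] != 0) {
            return 0;
        }
    }
    return 1;
}
```

In MARS the closest equivalents would be .word and .asciiz for the first two definitions and .space for the third.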
The end of the (initialized + uninitialized) data area is constant and is the highest legal absolute address available to the programmer when the program starts. (Remember, stack addresses are usually allocated outside of the programmer's control and never use absolute addresses). It is illegal to reference addresses past the end of the known data area. The addresses that follow this will be used to allocate our third type of static data - heap data. To make heap addresses legal, the size of the program must be changed by extending the data area. This is done using a system call named sbrk() (read: s-break).
In a normal program, the memory allocator package malloc() calls sbrk() to
get a block of memory addresses 'made legal' (i.e., mapped into
the address space), then manages the block internally, doling out
chunks of it as requested. We will call sbrk() directly each time we need a piece of
heap memory. We give sbrk() a size, and it returns the address of a
piece of memory of that size, which is allocated by simply
extending the data area by that amount. We will do this briefly using the MARS sbrk syscall;
later we will call our own version of the function malloc, which will
call sbrk for us and ensure the memory it returns is not zeroed.
Suppose we need room for an array of N integers, named result. We would simply define result as a pointer to an int, and the C code to allocate the array would be as follows:
int *result = sbrk(N<<2);
(For us, this is the same as malloc(N<<2).)
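The idea of "extending the data area by that amount" can be sketched without touching the real system call. Below, my_sbrk and alloc_ints are hypothetical stand-ins that hand out pieces of a fixed array; the arena size is arbitrary, and the sketch assumes a 4-byte int (as on 32-bit MIPS) and requests that are multiples of four bytes. The real sbrk reports failure differently.

```c
#include <assert.h>
#include <stddef.h>

#define ARENA_INTS 1024u        /* arbitrary arena size for this sketch */

int    arena[ARENA_INTS];       /* stands in for the addresses past the data area */
size_t brk_offset = 0;          /* bytes handed out so far */

/*
 * Hypothetical stand-in for sbrk(): extend the "data area" by nbytes and
 * return the address of the newly legal block, or NULL if the arena is
 * exhausted.
 */
void *my_sbrk(size_t nbytes)
{
    const size_t arena_bytes = sizeof arena;
    void *block;

    assert(brk_offset <= arena_bytes);
    if (nbytes > arena_bytes - brk_offset) {
        return NULL;
    }
    block = (unsigned char *)arena + brk_offset;
    brk_offset += nbytes;       /* the next block starts where this one ends */
    return block;
}

/* Allocate room for n ints, as in the text: the size is n << 2 bytes. */
int *alloc_ints(size_t n)
{
    return my_sbrk(n << 2);
}
```

Two calls in a row return adjacent blocks: if alloc_ints(10) returns address A, the next call returns A plus 40 bytes. Nothing is ever given back, which is exactly why a real program puts malloc() on top of this.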
Byte Order
You should have been examining the hexadecimal memory areas in MARS by now and many of you have noticed that MARS displays the characters in strings strangely. (In fact, all scalars whose base type is shorter than four bytes are displayed strangely.) This is due to the byte order of the underlying machine and to the fact that all MARS segment displays are shown using four-byte quantities.
In short, there are two ways to number bytes in a four-byte
quantity (a 32-bit word). In each of the two drawings below, a
word is shown as it would appear in a register. The byte farthest
to the left is the most-significant
byte. The byte farthest to the right is the least-significant byte. In
the first drawing, we have numbered the bytes 0-3, where 0 is the
most-significant byte.
| 0 | 1 | 2 | 3 |
If the bytes in our word were numbered like this, and the address
of the word when it is placed in memory was 0x1000, then the
most-significant byte would appear at address 0x1000 and the
least-significant byte would appear at address 0x1003. This is how
most people (at least most of those of us who grew up in a
left-to-right reading world) would think of byte-ordering, and, in
fact, when a 32-bit value in the header of an Internet packet is
transferred across the Internet, the bytes that make up the value
must be transferred in this order - most-significant-byte first or MSB-first for short. (The
format is also called big-endian
because the big part of the value (the most-significant part) is
transferred first.) The reason this format was chosen for the
Internet was that nearly all machines running the Internet when it
was created used MSB-first order.
The other way to number bytes in our word is to assign 0 to the least-significant byte, like this:
| 3 | 2 | 1 | 0 |
Thus, if the address of this word when stored in memory was 0x1000, the byte stored at 0x1000 would be the least-significant byte. This format is called least-significant-byte first (LSB-first or little-endian).
Traditionally, all machines used one of the two byte-orders exclusively. Today, some machines (including MIPS and some Intel machines) can select which byte order they use. Of course, changing this selection may have dire consequences if the software that runs on those machines is not written correctly.
By default, however, Intel machines use LSB-first byte ordering. MARS uses the byte order of the underlying machine (actually, I'm not sure about this - I've never run MARS on a big-endian processor). It is the use of LSB-first byte order that causes the confusion in the memory dumps.
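The two numberings can be written out as code. The functions below are illustrative helpers, not part of MARS or any library: the first two place a 32-bit word into four consecutive bytes in each order, and the third asks the machine running the code which order it uses for its own words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store word at p[0..3] most-significant byte first (big-endian). */
void store_msb_first(uint8_t *p, uint32_t word)
{
    assert(p != NULL);
    p[0] = (uint8_t)(word >> 24);
    p[1] = (uint8_t)(word >> 16);
    p[2] = (uint8_t)(word >> 8);
    p[3] = (uint8_t)word;
}

/* Store word at p[0..3] least-significant byte first (little-endian). */
void store_lsb_first(uint8_t *p, uint32_t word)
{
    assert(p != NULL);
    p[0] = (uint8_t)word;
    p[1] = (uint8_t)(word >> 8);
    p[2] = (uint8_t)(word >> 16);
    p[3] = (uint8_t)(word >> 24);
}

/* Return 1 if this machine stores words LSB-first, otherwise 0. */
int host_is_lsb_first(void)
{
    uint32_t probe = 1;
    /* Look at the byte stored at the word's lowest address. */
    return *(const unsigned char *)&probe == 1;
}
```

For the word 0x12345678, store_msb_first puts 0x12 at the lowest address and 0x78 at the highest, while store_lsb_first does the reverse.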
Let's look at the bytes comprising a string of data.
prompt: .asciiz "Enter the integer to convert to hex:"
ASCII strings, whose underlying type is one byte long, are stored as you would expect. The first byte of the string is stored at address 0, the second at address 1, etc. We can see this if the string is placed in a file and a character dump of that file is shown using Linux:
(Here, botest refers to byte-order test. Tch.)
If we dump this file character-by-character we see that the characters' addresses simply increase as we would expect:
$ od -A x -tc botest
000000   E   n   t   e   r       t   h   e       i   n   t   e   g   e
000010   r       t   o       c   o   n   v   e   r   t       t   o
000020   h   e   x   :  \n
(For those of you familiar with the output of od, I have done us the favor of displaying the address (on the left) in hexadecimal.)
Let's look at the data character by character where the
characters are displayed in hexadecimal:
$ od -A x -tx1 botest
000000 45 6e 74 65 72 20 74 68 65 20 69 6e 74 65 67 65
000010 72 20 74 6f 20 63 6f 6e 76 65 72 74 20 74 6f 20
000020 68 65 78 3a 0a
If we output the file word-by-word, we will see that the words are read in LSB-first order. This means that the first character ('E', whose value is 0x45) is assumed to be the least-significant byte of the first word. When the words are redisplayed as 32-bit quantities, the characters appear swapped:
$ od -A x -tx4 botest
Let's compare each of these so you can see how they line up:
The last line here is the way you will see the data displayed in the memory dump, although the characters will appear in correct order if four bytes of this data are loaded as a word into a register.
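The swapping can be reproduced by hand. The function below is a small sketch (the name word_lsb_first is invented for the example) that builds the 32-bit value a little-endian machine sees when it reads four consecutive bytes as one word; bytes past the end of the buffer are treated as zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Assemble the 32-bit word that starts at byte offset off of
 * buf[0..len-1], treating the byte at the lowest address as the
 * least-significant byte (LSB-first). Bytes past the end of the
 * buffer are taken as zero.
 */
uint32_t word_lsb_first(const unsigned char *buf, size_t len, size_t off)
{
    uint32_t word = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < 4; i++) {
        if (off + i < len) {
            /* Byte i of the word occupies bits 8*i .. 8*i+7. */
            word |= (uint32_t)buf[off + i] << (8 * i);
        }
    }
    return word;
}
```

For the first four bytes of our file (45 6e 74 65, the characters E n t e), this gives 0x65746e45. Read from left to right as characters, that word spells "etnE", which is the reversal you see in the memory dump.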
This page was made entirely with free software on Linux: Kompozer, the Mozilla Project and Openoffice.org