Data Areas

City College of San Francisco - CS270
Computer Architecture
Module: MIPS-III

By now you should be used to using the .text and .data directives in MARS to control whether you are defining code (.text) or data (.data) for the program. You have probably noticed that instructions are placed in memory at significantly different addresses than data is. There are several reasons for this separation:

We want the addresses for data, and especially for text, to be contiguous, and we don't want to risk 'running out' of addresses as our program grows.
The protections set on these areas are different. Simply put, we need to execute code and to read and write data. Conversely, we don't generally need (or want) to execute data, and we want to limit the ability to read and write code.

To ensure that these regions do not overlap and that sufficient room is reserved for each, systems adopt a convention for the starting address of each. These blocks of addresses are typically called segments. On MARS (and on the traditional 32-bit MIPS processor) the .text segment starts at 0x00400000 and the .data segment starts at 0x10000000. This gives us a contiguous region of about 250MB of .text space (0x10000000 - 0x00400000 = 0x0FC00000 bytes, or 252MB) - a fairly large program. As an indication, the biggest memory hog on hills today (by a significant multiple!) - oracle - has a .text size of 183MB. Of course, if a program's .text size is larger, there is no reason it cannot be expanded into the much larger data region.

The data region is a bit more interesting. It starts at 0x10000000 and ends at 0x80000000 - an area of 1.75GB. But there are several kinds of data. The first (and most important) division is between static data and stack data.
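
As a quick check of the sizes quoted above, the arithmetic can be sketched in C. This is only a sketch: the names TEXT_BASE, DATA_BASE, DATA_END, region_bytes and bytes_to_mib are made up for illustration, and DATA_END is assumed to be the first address past the data region.

```c
#include <assert.h>
#include <stdint.h>

/* Segment boundaries quoted in the text (MARS / traditional 32-bit MIPS).
   The macro names are illustrative, not part of MARS. */
#define TEXT_BASE 0x00400000u
#define DATA_BASE 0x10000000u
#define DATA_END  0x80000000u   /* assumed: first address past the data region */

/* Size in bytes of the half-open address range [start, end). */
uint32_t region_bytes(uint32_t start, uint32_t end)
{
    assert(end >= start);
    return end - start;
}

/* Convert a byte count to whole megabytes (1MB = 2^20 bytes here). */
uint32_t bytes_to_mib(uint32_t bytes)
{
    return bytes >> 20;
}
```

With these definitions, bytes_to_mib(region_bytes(TEXT_BASE, DATA_BASE)) gives 252 and bytes_to_mib(region_bytes(DATA_BASE, DATA_END)) gives 1792, which is 1.75GB.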

Stack data is used for temporary data and arguments to support a chain of procedure calls. Without a mechanism such as the stack, many forms of abstraction in computer programs would not be possible. The stack must be allowed to 'grow' as necessary to support longer call chains of possibly more complex procedures. Data on the stack exists as long as the procedure that defined it is active. Thus the address of an item on the stack is determined when the procedure is called, and will [probably] be different if the same procedure is called a second time.
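
To see that a stack address is only decided when a procedure is called, we can record the address of the same local variable in several active calls. This is a minimal sketch in C rather than MIPS; the function name record_frames is hypothetical, and nothing here depends on which direction the stack grows.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helper (not from the course code): store the address of one
   local variable per active call. addrs must have room for depth + 1 entries. */
void record_frames(int depth, uintptr_t *addrs)
{
    volatile int local = depth;        /* lives in this call's stack frame */

    assert(depth >= 0);
    addrs[depth] = (uintptr_t)&local;  /* address is known only once we are called */
    if (depth > 0)
        record_frames(depth - 1, addrs);
    local = 0;                         /* keeps 'local' alive across the recursive call */
}
```

After uintptr_t addrs[3]; record_frames(2, addrs); the three entries are all different, because each active call has its own copy of local. On most systems the deeper calls get the lower addresses, but the sketch does not rely on that.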

Static data is data that resides at a fixed address and is independent of active procedures. This consists of global variables, string data (such as that used for messages), and dynamic data such as that allocated by malloc (read: new). This last type of data is still termed static because it is valid as long as the programmer wants, and is independent of procedures: in other words, this data does not go out of scope. This data must be freed either explicitly or by garbage collection (once all references to it are gone). Just like stack data, static data must be allowed to grow.

Support for two types of data that must grow using a single block of memory is implemented by starting each of the types on opposite ends of the address range. Then, one type of data grows from lower to higher addresses (the static data) and the other (the stack data) grows from higher to lower addresses:

0x10000000 ........................................ 0x7ffffffc
static data grows ----->               <----- stack data grows
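
The idea of two areas sharing one block of addresses can be modeled with two cursors that move toward each other. The following is a toy model only, not how MARS or an operating system actually tracks memory; the struct and function names are invented for this sketch, and the two starting addresses are the ones in the diagram.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model (hypothetical names): two cursors inside one address range. */
struct data_region {
    uint32_t static_end;  /* first free address above the static data */
    uint32_t stack_ptr;   /* current stack pointer; moves toward lower addresses */
};

/* Start with both areas empty, using the two addresses from the diagram. */
struct data_region region_init(void)
{
    struct data_region r = { 0x10000000u, 0x7ffffffcu };
    return r;
}

/* Bytes still unclaimed between the two areas. */
uint32_t region_free(const struct data_region *r)
{
    assert(r != NULL);
    return r->stack_ptr - r->static_end;
}

/* Static data grows from lower to higher addresses. */
bool grow_static(struct data_region *r, uint32_t nbytes)
{
    if (nbytes > region_free(r))
        return false;          /* would run into the stack */
    r->static_end += nbytes;
    return true;
}

/* Stack data grows from higher to lower addresses. */
bool grow_stack(struct data_region *r, uint32_t nbytes)
{
    if (nbytes > region_free(r))
        return false;          /* would run into the static data */
    r->stack_ptr -= nbytes;
    return true;
}
```

Either area may use as much of the range as it needs; a request fails only when the two cursors would cross.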

The static data area (which is usually just called the data area) is subdivided into three smaller segments: initialized data, uninitialized data, and heap data.

initialized data - this data is defined in program modules, space is allocated explicitly for it, and it is often given an initial value. Examples of initialized data are an array that is initialized, and a message that has predefined textual data. When using MARS it also contains variables whose size is known but which are not initialized (i.e. using a .space directive). This latter category would normally be part of uninitialized data. The size of the initialized data region is known when the program is created (compiled and linked, or, in the case of MARS, just assembled).

uninitialized data - this data follows initialized data and usually includes arrays that have a size but no initial values (in MARS these are part of initialized data). This portion of uninitialized data is called bss. Just like initialized data, the size of bss is known when the program is created. On MARS, both initialized and uninitialized data are in the .data segment.
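
In C terms, the difference between the two kinds of data looks roughly like the sketch below. The variable names are made up for illustration; the comments show the MARS directives that would typically reserve the same space (in MARS, all three land in the .data segment).

```c
#include <assert.h>

/* Initialized data: the values are stored in the program image.
   (Names are illustrative only.) */
int primes[4] = {2, 3, 5, 7};                         /* MARS: .word 2, 3, 5, 7 */
const char prompt[] = "Enter the integer to convert to hex:";   /* MARS: .asciiz */

/* Uninitialized data (bss): only the size is recorded. C guarantees that
   such objects start out as zeros.                     MARS: .space 4096 */
int scratch[1024];

/* Add up n integers starting at v. */
int sum_ints(const int *v, int n)
{
    int total = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        total += v[i];
    return total;
}
```

Here sum_ints(primes, 4) is 17, while sum_ints(scratch, 1024) is 0, since scratch occupies space but was never given initial values.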

The end of the (initialized + uninitialized) data area is constant and is the highest legal absolute address available to the programmer when the program starts. (Remember, stack addresses are usually allocated outside of the programmer's control and never use absolute addresses.) It is illegal to reference addresses past the end of the known data area. The addresses that follow this will be used to allocate our third type of static data - heap data. To make heap addresses legal, the size of the program must be changed by extending the data area. This is done using a system call named sbrk() (read: s-break).

In a normal program, the memory allocator package malloc() calls sbrk() to get a block of memory addresses 'made legal' (i.e., mapped into the address space), then manages the block internally, doling out chunks of it as requested. We will call sbrk() directly each time we need a piece of heap memory. We give sbrk() a size, and it returns the address of a piece of memory of that size, which is allocated by simply extending the data area by that amount. We will do this briefly using the MARS sbrk syscall; later, we will call our version of the function malloc, which will call sbrk for us and ensure the memory it returns is not zeroed.

Suppose we need room for an array of N integers, named result. We would simply define result as a pointer to an int, and the C code to allocate the array would be as follows:

int *result = sbrk(N<<2);
(N<<2 is N*4, the number of bytes in N four-byte integers; for us, this is the same as malloc(N<<2).)
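
The "extend the data area and hand back the old end" behavior can be imitated with a fixed array standing in for the addresses past bss. This is a toy model, not the real sbrk: toy_sbrk, alloc_ints and ARENA_BYTES are hypothetical names, and the toy returns NULL on failure, whereas the Unix sbrk reports failure as (void *)-1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ARENA_BYTES 4096u   /* assumed size of the pretend heap */

static uint32_t arena[ARENA_BYTES / 4];  /* word-aligned stand-in for the heap addresses */
static size_t   brk_offset = 0;          /* current end of the "data area", in bytes */

/* Toy sbrk: extend the data area by nbytes and return its old end,
   or NULL when the arena is exhausted. */
void *toy_sbrk(size_t nbytes)
{
    assert(brk_offset <= ARENA_BYTES);
    if (nbytes > ARENA_BYTES - brk_offset)
        return NULL;
    void *old_end = (unsigned char *)arena + brk_offset;
    brk_offset += nbytes;
    return old_end;
}

/* Room for n four-byte integers, in the style of sbrk(N<<2). */
int32_t *alloc_ints(size_t n)
{
    if (n > ARENA_BYTES / 4)   /* also keeps n << 2 from overflowing */
        return NULL;
    return toy_sbrk(n << 2);   /* n << 2 == n * 4 */
}
```

Two successive calls, alloc_ints(10) and then alloc_ints(5), return addresses exactly 40 bytes apart: each request simply moves the end of the data area forward.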

Byte Order

You should have been examining the hexadecimal memory areas in MARS by now and many of you have noticed that MARS displays the characters in strings strangely. (In fact, all scalars whose base type is shorter than four bytes are displayed strangely.) This is due to the byte order of the underlying machine and to the fact that all MARS segment displays are shown using four-byte quantities.

In short, there are two ways to number bytes in a four-byte quantity (a 32-bit word). In each of the two drawings below, a word is shown as it would appear in a register. The byte farthest to the left is the most-significant byte. The byte farthest to the right is the least-significant byte. In the first drawing, we have numbered the bytes 0-3, where 0 is the most-significant byte.

    MSB                             LSB
    +-------+-------+-------+-------+
    |   0   |   1   |   2   |   3   |
    +-------+-------+-------+-------+

If the bytes in our word were numbered like this, and the address of the word when it is placed in memory was 0x1000, then the most-significant byte would appear at address 0x1000 and the least-significant byte would appear at address 0x1003. This is how most people (at least most of those of us who grew up in a left-to-right reading world) would think of byte-ordering, and, in fact, when a 32-bit value in the header of an Internet packet is transferred across the Internet, the bytes that make up the value must be transferred in this order - most-significant-byte first or MSB-first for short. (The format is also called big-endian because the big part of the value (the most-significant part) is transferred first.) The reason this format was chosen for the Internet was that nearly all machines running the Internet when it was created used MSB-first order.

The other way to number bytes in our word is to assign 0 to the least-significant byte, like this:

    MSB                             LSB
    +-------+-------+-------+-------+
    |   3   |   2   |   1   |   0   |
    +-------+-------+-------+-------+

Thus, if the address of this word when stored in memory was 0x1000, the byte stored at 0x1000 would be the least-significant byte. This format is called least-significant-byte first (LSB-first or little-endian).
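
The two numbering schemes can be written as two small C functions that answer the question "which byte of the word ends up at the word's address plus offset?". This is a sketch with made-up function names; it uses shifts only, so the answer does not depend on the byte order of the machine that runs it.

```c
#include <assert.h>
#include <stdint.h>

/* Byte stored at (word address + offset) when bytes are numbered MSB-first
   (big-endian). Illustrative helper, not a library function. */
uint8_t byte_at_msb_first(uint32_t word, unsigned offset)
{
    assert(offset < 4);
    return (uint8_t)(word >> (8 * (3 - offset)));
}

/* Byte stored at (word address + offset) when bytes are numbered LSB-first
   (little-endian). Illustrative helper, not a library function. */
uint8_t byte_at_lsb_first(uint32_t word, unsigned offset)
{
    assert(offset < 4);
    return (uint8_t)(word >> (8 * offset));
}
```

For the word 0x12345678 stored at 0x1000, byte_at_msb_first(0x12345678, 0) is 0x12 (the byte at 0x1000 in MSB-first order), while byte_at_lsb_first(0x12345678, 0) is 0x78.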

Traditionally, all machines used one of the two byte-orders exclusively. Today, some machines (including MIPS and some Intel machines) can select which byte order they use. Of course, changing this selection may have dire consequences if the software that runs on those machines is not written correctly.

By default, however, Intel machines use LSB-first byte ordering. MARS uses the byte order of the underlying machine (actually, I'm not sure about this - I've never run MARS on a big-endian processor). It is the use of LSB-first byte order that causes the confusion in the memory dumps.

Let's look at the bytes comprising a string of data.

prompt: .asciiz "Enter the integer to convert to hex:"

ASCII strings, whose underlying type is one byte long, are stored as you would expect. The first byte of the string is stored at address 0, the second at address 1, etc. We can see this if the string is placed in a file and a character dump of that file is shown using Linux:

$ cat botest
Enter the integer to convert to hex:
$

(Here, botest refers to byte-order test. Tch.)

If we dump this file character-by-character we see that the characters' addresses simply increase as we would expect:

$ od -A x -tc botest
000000   E   n   t   e   r       t   h   e       i   n   t   e   g   e
000010   r       t   o       c   o   n   v   e   r   t       t   o
000020   h   e   x   :  \n

(For those of you familiar with the output of od, I have done us the favor of displaying the address (on the left) in hexadecimal.)

Let's look at the data character by character where the characters are displayed in hexadecimal:

$ od -A x -tx1 botest
000000 45 6e 74 65 72 20 74 68 65 20 69 6e 74 65 67 65
000010 72 20 74 6f 20 63 6f 6e 76 65 72 74 20 74 6f 20
000020 68 65 78 3a 0a

If we output the file word-by-word, we will see that the words are read in LSB-first order. This means that the first character ('E', whose value is 0x45) is assumed to be the least-significant byte of the first word. When the words are redisplayed as 32-bit quantities, the characters appear swapped:

$ od -A x -tx4 botest
000000   65746e45   68742072   6e692065   65676574
000010   6f742072   6e6f6320   74726576   206f7420
000020   3a786568   0000000a
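
The word values in this dump can be reproduced by hand: take four bytes in address order and treat the first as the least-significant. A minimal sketch in C follows; word_lsb_first is a hypothetical helper, and the rule that missing bytes at the end count as zero is inferred from the last word shown (0000000a).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper: build one 32-bit word from up to four bytes,
   treating p[0] as the least-significant byte (LSB-first).
   'avail' is the number of bytes left in the data; missing bytes count as 0. */
uint32_t word_lsb_first(const unsigned char *p, size_t avail)
{
    uint32_t word = 0;
    size_t n = avail < 4 ? avail : 4;

    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        word |= (uint32_t)p[i] << (8 * i);   /* byte i lands 8*i bits up */
    return word;
}
```

Applied to the first four bytes of the file (45 6e 74 65), this gives 0x65746e45, the first word in the dump; applied to the single trailing newline, it gives 0x0000000a.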

Let's compare each of these so you can see how they line up:

000000  E  n  t  e  r     t  h  e     i  n  t  e  g  e
000000 45 6e 74 65 72 20 74 68 65 20 69 6e 74 65 67 65
000000     65746e45    68742072    6e692065    65676574

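The last line here is the way you will see the data displayed in the memory dump. Only the display is affected: the bytes are still stored in string order, so loading the byte at the first address (with lb) yields 'E' (0x45), the next address yields 'n', and so on. If the first four bytes are loaded as a word (with lw), the register holds 0x65746e45, with 'E' in the least-significant byte.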
