City College of San Francisco - CS270
Computer Architecture

Module: MIPS-III (Procedures)

Character Data

MARS limits us to the use of ASCII characters. As most of us know, ASCII is a 7-bit code used to encode the characters used in the English alphabet, plus a number of special control characters. You can translate hexadecimal bytes to characters (or the reverse) by referring to the ascii man page on Linux. Here is a simple example:

Suppose we wanted to encode the sequence 'M' 'o' 'm' into a sequence of bytes. This is simply an exercise in the use of the ascii man page. We would then like to use the bytes to create a message that reads

Mom

Of course, the simple way to do this in MARS is to use the .asciiz directive. For practice, however, we should do it the hard way. We can look up the character values, but what is needed to create the string?

There are two common ways to represent a character string consisting of ASCII characters. The one used by MARS is the same as the one used in C - a character string is a sequence of characters followed by a null byte (a byte with value 0). The drawback of this encoding is that you cannot know the length of a string without counting the characters. This leads to the second encoding of a character string, used in Pascal (and, conceptually, in Java, where the length is stored alongside the characters): the first byte of the string indicates its length.
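The two representations can be sketched in Python bytes (a sketch for illustration only; MARS itself uses the nul-terminated form):

```python
# Two common representations of the string "Mom".
null_terminated = b"Mom\x00"           # C/MARS style: characters + nul byte
length_prefixed = bytes([3]) + b"Mom"  # Pascal style: length byte + characters

# Finding the length requires a scan in the first case...
assert null_terminated.index(0) == 3
# ...but is a single byte read in the second.
assert length_prefixed[0] == 3
```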

Besides the use of the null byte (which is also called a 'nul') to end the string, we also need to encode the line terminator character. On Linux, lines are terminated with a newline ('\n').

Putting this all together, we can encode our string. Let's look up the values:

M    0x4d
o    0x6f
m    0x6d
\n   0x0a

If we entered this into a MIPS program and displayed it using a syscall, we would need to add the terminating nul. The more interesting thing to do is to examine the memory region where the string is stored, which is easy since it is at the start of the data region. You can do that by using cut and paste with the program below:

    .data
mom: .byte 0x4d,0x6f,0x6d,0x0a,0x0   # 'M','o','m','\n', terminating nul
    .align 2
    .text
    .globl main
main:
    li    $v0,55        # syscall 55: MessageDialog
    li    $a1,1         # message type 1: information
    la    $a0,mom       # $a0 = address of the nul-terminated string
    syscall             # display the message
    jr    $ra           # return to the MARS runtime

If you cut and paste this into a new MARS file and assemble it, you will get a big surprise when you look at the memory in the .data region. Assuming your label mom is word-aligned, you will find a word with this value

0x0a6d6f4d

in other words, in the reverse of the order you would expect. As we discussed in a previous section, this is due to the byte order of the machine: MARS simulates a little-endian (LSB-first) machine, just like the Intel hosts it typically runs on, so when memory is displayed in 4-byte quantities the bytes of a character string appear reversed. Whenever you are examining character data in MARS, remember this byte reversal to avoid getting confused.
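As a quick check, we can reproduce the word MARS displays by packing the same four bytes LSB-first (a Python sketch, not MARS itself):

```python
import struct

data = bytes([0x4D, 0x6F, 0x6D, 0x0A])  # 'M' 'o' 'm' '\n'
word = struct.unpack("<I", data)[0]     # read as one little-endian word
print(hex(word))                        # 0xa6d6f4d
```

The `<I` format reads the four bytes as a single unsigned 32-bit integer with the least-significant byte first, exactly as the simulated machine does.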

You can test our byte sequence using the Linux command line as well. At the command line, simply output the characters by encoding their hexadecimal representation into a string and using echo like this:

$ echo -e "\x4d\x6f\x6d\x0a"
Mom

Try it yourself by copying and pasting the echo command onto the Linux commandline.

International Characters

ASCII is easy. But most of us use languages with characters that are not part of the ASCII subset. This issue has been resolved by the long work of the Unicode consortium, who have gradually built up a standardized encoding for most of the planet's character sets.

Studying Unicode can be a bit confusing, since there is a difference between the characters themselves and how they are transmitted. The problem is this: the computers of the world use a byte stream to transmit character data. The base Unicode character set, which encodes all of the most commonly used languages of the world, is a 16-bit code. Inside a program, these 16-bit Unicode characters are referred to as wide characters. This base Unicode character set has the ASCII characters as a subset. But how can this new 16-bit code be transmitted as a sequence of bytes so that its users and the large number of ASCII users can easily coexist?

The solution is to encode non-ASCII Unicode characters as a sequence of two or three bytes. This encoding is called UTF-8. When transmitting UTF-8, ASCII Unicode characters are transmitted as single-byte ASCII values. When a non-ASCII character must be transmitted, the unused most-significant bit in the byte is used to signal a change in encoding, and the encoded character is sent. The number of leading 1-bits in the first byte indicates how many bytes are used for the encoded character.

Let's take a simple example. In keeping with our 'Mom' message, we can translate the message to a second Western language - French. Here, 'Mom' in English becomes 'Mère' in French. In this word, three of the four characters are ASCII. These do not require any special encoding - ASCII simply remains ASCII in UTF-8. The last character, è, has the value 0xe8 in Unicode. Since this is greater than 0x7f, it must be encoded as a sequence of bytes. The UTF-8 rules indicate that any value up to 0x7ff can be encoded in a sequence of two bytes. The most-significant bits of the first byte are 110 and the most-significant bits of the second byte are 10. The value of the character is encoded using the remaining bits (the x's below):

first byte: 110xxxxx   (the number of leading 1-bits indicates the number of bytes used to encode the character)

second byte: 10xxxxxx

If we encode our character in binary as an eleven-bit number abcdefghijk (with leading zeros to maintain the correct value), the bits are transmitted like this

first byte: 110abcde

second byte: 10fghijk

Let's apply this to our special character è. Its value is 0xe8, which is 00011101000 expressed in 11 bits. Thus, encoded using UTF-8, the bit patterns are

11000011  10101000  or  0xc3 0xa8
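We can verify this bit shuffling directly (a Python sketch following the 110xxxxx/10xxxxxx layout above):

```python
cp = 0xE8                                # code point of è
first  = 0b11000000 | (cp >> 6)          # 110 + top five of the 11 bits
second = 0b10000000 | (cp & 0b00111111)  # 10 + low six bits
print(hex(first), hex(second))           # 0xc3 0xa8

# Python's own UTF-8 encoder agrees.
assert bytes([first, second]) == "\u00e8".encode("utf-8")
```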

Feel free to try it on Linux. Here we will leave the ASCII characters as they are normally and encode the single non-ASCII character in hexadecimal:

$ echo -e "M\xc3\xa8re\n"
Mère

Just for fun, let's try it in another language. In another popular language, one character that can be used as an expression of Mom has the value 0x5988. Under UTF-8, any 16-bit value from 0x800 through 0xffff is encoded using three bytes. The bits used to carry the value are the x's in

first byte: 1110xxxx  (again, the number of leading 1-bits indicates the number of bytes used to encode the character) 

second and third bytes: 10xxxxxx

Our character has the 16-bit bit pattern 0101100110001000. This is encoded as

11100101 10100110 10001000

or 0xe5 0xa6 0x88
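The three-byte case can be checked the same way (a Python sketch following the 1110xxxx/10xxxxxx layout above):

```python
cp = 0x5988                                 # code point from the example
b1 = 0b11100000 | (cp >> 12)                # 1110 + top four bits
b2 = 0b10000000 | ((cp >> 6) & 0b00111111)  # 10 + middle six bits
b3 = 0b10000000 | (cp & 0b00111111)         # 10 + low six bits
print(hex(b1), hex(b2), hex(b3))            # 0xe5 0xa6 0x88

# Python's UTF-8 encoder produces the same three bytes.
assert bytes([b1, b2, b3]) == chr(cp).encode("utf-8")
```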

Go ahead and try it: echo -e "\xe5\xa6\x88\n". Perhaps someone can tell me if it is accurate.

Although UTF-8 is often used to transmit and display Unicode characters, it is awkward for storing and manipulating them internally. Programs therefore often work with wide characters internally, stored in 'wide character strings'. A set of library routines converts between the UTF-8 encoding (called 'multi-byte characters') and wide characters, and classifies wide characters (for example, as a digit or an alphabetic character) just as is done with ASCII characters.

Unicode also has a 32-bit variant, which easily fits all the characters known in all the languages of the world. Even allowing for the private use areas of the Unicode value range, every code point fits in 21 bits. This variant can also be encoded using UTF-8, which then requires a maximum of four bytes per character.

If a program understands Unicode it will convert incoming UTF-8 to wide characters for internal manipulation. When it must send the data elsewhere, it will be encoded as UTF-8 again for transmission. Since UTF-8 is an eight-bit encoding, there is never an issue with byte-order.
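This UTF-8-in, wide-characters-inside, UTF-8-out flow can be sketched in Python, where str holds decoded (wide) characters and bytes holds the encoded stream:

```python
incoming = b"M\xc3\xa8re"        # UTF-8 byte stream off the wire (5 bytes)
text = incoming.decode("utf-8")  # decode to wide characters internally
assert len(text) == 4            # four characters: M è r e
assert len(incoming) == 5        # but five bytes in the stream
outgoing = text.encode("utf-8")  # re-encode as UTF-8 for transmission
assert outgoing == incoming
```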

If a program does not understand Unicode it will not be able to translate any encoded Unicode characters in the input stream. These characters may appear as ? or some other funny character in the output if the data is later displayed.

More on ASCII

Although this is not possible with general Unicode characters, classifying an ASCII character directly from its numeric value is easy, since several groups of characters form contiguous, increasing runs of values: the upper-case letters, the lower-case letters, and the digits. This means the numeric value of the character '2' is one more than that of the character '1', and so on.

This makes it easy to convert the character '5' to the number 5, for example. Just subtract the value of the character '0' from it. And it makes it fairly simple to convert hexadecimal expressed as ASCII to an integer.
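The same subtraction trick can be sketched in Python using ord (the helper name hex_to_int is our own invention, not from the course materials):

```python
# Convert the character '5' to the number 5 by subtracting '0'.
assert ord("5") - ord("0") == 5

def hex_to_int(s):
    """Convert a string of ASCII hexadecimal digits to an integer."""
    value = 0
    for c in s:
        if "0" <= c <= "9":
            digit = ord(c) - ord("0")
        elif "a" <= c <= "f":
            digit = ord(c) - ord("a") + 10
        elif "A" <= c <= "F":
            digit = ord(c) - ord("A") + 10
        else:
            raise ValueError("not a hex digit: " + c)
        value = value * 16 + digit  # shift in the next 4 bits
    return value

assert hex_to_int("4d") == 0x4D
```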

Other Character Sets

Since most languages can be represented in 128 or 256 characters, many languages (or language groups) have their own eight-bit code, possibly with ASCII as a subset. There is a standard set of these codes (character sets) named ISO8859-X where X indicates the specific encoding. Of course, data written in one character encoding must be read using the same encoding. You cannot write some data encoded in ISO8859-X and expect to read it as UTF-8.

This page was made entirely with free software on linux:
Kompozer, the Mozilla Project
and Openoffice.org

Copyright 2015 Greg Boyd - All Rights Reserved.