City College of San Francisco - CS270 Computer Architecture Module: MIPS-III (Procedures)
Character Data
MARS limits us to the use of ASCII characters. As most of us know, ASCII is a 7-bit code used to encode the characters used in the English alphabet, plus a number of special control characters. You can translate hexadecimal bytes to characters (or the reverse) by referring to the ascii man page on Linux. Here is a simple example:
Suppose we wanted to encode the sequence 'M' 'o' 'm' into a sequence of bytes. This is simply an exercise in the use of the ascii man page. We would then like to use the bytes to create a message to read
Mom
Of course, the simple way to do this in MARS is to use the .asciiz directive. For practice, however, we should do it the hard way. We can look up the character values, but what is needed to create the string?
There are two common ways to represent a character string consisting of ASCII characters. The one used by MARS is the same as the one used in C: a character string is a sequence of characters followed by a null byte (a byte with value 0). The drawback of this encoding is that you cannot know the length of a string without counting the characters. This gives way to the second encoding of a character string, which was used in Pascal: the first byte of the string indicates its length. Java strings likewise carry an explicit length instead of a terminator, although the length is stored alongside the characters, not in a leading byte.
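The two layouts can be compared side by side. The following is a minimal Python sketch; the helper names c_string, pascal_string and c_strlen are made up for illustration and are not part of MARS or any library.

```python
def c_string(text):
    """Null-terminated layout: the characters followed by a single 0 byte."""
    return text.encode("ascii") + b"\x00"


def pascal_string(text):
    """Length-prefixed layout: one length byte, then the characters."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("a one-byte length prefix cannot exceed 255")
    return bytes([len(data)]) + data


def c_strlen(buf):
    """The length of a null-terminated string is found by scanning for the 0 byte."""
    n = 0
    while buf[n] != 0:
        n += 1
    return n


print(c_string("Mom").hex(" "))       # 4d 6f 6d 00
print(pascal_string("Mom").hex(" "))  # 03 4d 6f 6d
print(c_strlen(c_string("Mom")))      # 3
```

With the length-prefixed layout the length is read directly from the first byte, at the cost of limiting the string to 255 characters.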
Besides the use of the null byte (which is also called a 'nul') to end the string, we also need to encode the line terminator character. On Linux, lines are terminated with a newline ('\n').
Putting this all together, we can encode our string. Let's look up the values:
M | o | m | \n |
0x4d | 0x6f | 0x6d | 0x0a |
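If you would rather not read the values off the man page, the same lookup can be checked in a few lines of Python. This is only a convenience sketch; the variable names are arbitrary.

```python
message = "Mom\n"

# ord() gives the numeric value of each character
values = [ord(ch) for ch in message]
print([hex(v) for v in values])  # ['0x4d', '0x6f', '0x6d', '0xa']

# every value fits in 7 bits, so the message is pure ASCII
print(all(v < 128 for v in values))  # True
```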
If we entered this into a MIPS program and displayed it using a syscall, we would need to add the terminating nul. The more interesting thing to do is to examine the memory region where the string is stored, which is easy since it is at the start of the data region. You can do that by using cut and paste with the program below:
.data
If you cut and paste this into a new MARS file and assemble it, you will get a big surprise when you look at the memory in the .data region. Assuming your label mom is word-aligned, you will find a word with this value
0x0a6d6f4d
in other words, in the reverse of the order you would expect. As we discussed in a previous section, this is due to the byte order of the machine used.
Since Intel machines are LSB-first, the bytes of a character string
such as this appear reversed when the data is displayed in 4-byte
quantities. To avoid confusion when examining character data in MARS,
remember that the bytes within each word are displayed in reverse order.
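The reversal can be reproduced outside the simulator. The sketch below uses Python's struct module to read the same four bytes as one 32-bit word, first LSB-first and then MSB-first. It only illustrates the byte-order effect and says nothing about how MARS is implemented.

```python
import struct

# the four bytes in the order they sit in memory: 'M' 'o' 'm' '\n'
data = bytes([0x4d, 0x6f, 0x6d, 0x0a])

# "<I" reads an unsigned 32-bit word LSB-first (little-endian)
(word_le,) = struct.unpack("<I", data)
# ">I" reads the same bytes MSB-first (big-endian)
(word_be,) = struct.unpack(">I", data)

print(f"0x{word_le:08x}")  # 0x0a6d6f4d
print(f"0x{word_be:08x}")  # 0x4d6f6d0a
```

The bytes in memory are identical in both cases; only the interpretation as a word differs.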
You can test our byte sequence using the Linux command line as well. At the command line, simply output the characters by encoding their hexadecimal representation into a string and using echo like this:
$ echo -e "\x4d\x6f\x6d\x0a"
Mom
Try it yourself by copying and pasting the echo command onto the
Linux command line.
International Characters
ASCII is easy. But most of us use languages with characters that are not part of the ASCII subset. This issue has been resolved by the long work of the Unicode Consortium, which has gradually built up a standardized encoding for most of the planet's character sets.
Studying Unicode can be a bit confusing, since there is a
difference between the characters themselves and how they are
transmitted. The problem is this - the computers of the world use
a byte stream to transmit character data. The base Unicode
character set, which encodes all of the most commonly-used
languages of the world, is a 16-bit code. Inside a program, these
16-bit Unicode characters are referred to as wide characters. This base
Unicode character set has the ASCII characters as a subset. But
how can this new 16-bit code be transmitted as a sequence of bytes
so that its users and the large number of ASCII users can easily
coexist?
The solution is to encode each Unicode character as a
sequence of one to three bytes. This encoding is called UTF-8.
When transmitting UTF-8, ASCII characters are transmitted
as single-byte ASCII values. When a non-ASCII character must be
transmitted, the most-significant bit of the byte, which ASCII leaves
unused, is used to signal a change in encoding, and the encoded character is
sent as two or three bytes. The number of leading 1-bits in the first byte indicates how many bytes are
used for the encoded character.
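The leading-bit rule is easy to express in code. Here is a minimal Python sketch; the function name utf8_sequence_length is hypothetical, and it also accepts the four-byte form that UTF-8 uses for values beyond 16 bits.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if (first_byte & 0x80) == 0:
        return 1  # most-significant bit clear: plain ASCII
    ones = 0
    mask = 0x80
    while first_byte & mask:  # count the leading 1-bits
        ones += 1
        mask >>= 1
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a first byte")
    if ones > 4:
        raise ValueError("not a valid UTF-8 first byte")
    return ones


print(utf8_sequence_length(0x4d))  # 1
print(utf8_sequence_length(0xc3))  # 2
print(utf8_sequence_length(0xe5))  # 3
```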
Let's take a simple example. In keeping with our 'Mom' message,
we can translate the message to a second Western language -
French. Here, 'Mom' in English becomes 'Mère' in French. In
this word, three of four characters are ASCII. These do not
require any special encoding - ASCII simply remains ASCII in
UTF-8. The remaining character, è, has the value 0xe8 in
Unicode. Since this is greater than 0x7f it must be encoded as a
sequence of bytes. The UTF-8 encoding rules indicate that any
value up to 0x7ff can be encoded in a sequence of two bytes.
The most-significant bits of the first byte are 110 and the
most-significant bits of the second byte are 10. The value of the
character is encoded using the remaining bits (as indicated by the
x's below):
first byte: 110xxxxx (the number of leading 1-bits indicates the number of bytes used to encode the character)
second byte: 10xxxxxx
If we encode our character in binary as an eleven-bit number abcdefghijk (with leading zeros to maintain the correct value), the bits are transmitted like this
first byte: 110abcde
second byte: 10fghijk
Let's apply this to our special character è. Its value is 0xe8, which is 00011101000 expressed in 11 bits. Thus, encoded using UTF-8 the bit patterns are
11000011 10101000 or 0xc3 0xa8
Feel free to try it on Linux. Here we will leave the ASCII
characters as they are normally and encode the single non-ASCII
character in hexadecimal:
echo -e "M\xc3\xa8re\n"
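The two-byte packing can also be written as shifts and masks. The following Python sketch uses a made-up helper name, encode_two_bytes, and compares its result against Python's built-in UTF-8 encoder.

```python
def encode_two_bytes(code_point):
    """Pack an 11-bit value abcdefghijk as 110abcde 10fghijk."""
    if not 0x80 <= code_point <= 0x7ff:
        raise ValueError("the two-byte form covers 0x80 through 0x7ff only")
    first = 0b11000000 | (code_point >> 6)         # top five bits: abcde
    second = 0b10000000 | (code_point & 0b111111)  # low six bits: fghijk
    return bytes([first, second])


encoded = encode_two_bytes(0xe8)
print(encoded.hex(" "))                      # c3 a8
print(encoded == "\u00e8".encode("utf-8"))   # True
```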
Just for fun, let's try it in another language. In another popular language one character that can be used as an expression of Mom has the value 0x5988. According to UTF-8, any 16-bit value can be encoded using three bytes. The bits used to carry the value are the x's in
first byte: 1110xxxx (again, the number of leading 1-bits indicates the number of bytes used to encode the character)
second and third bytes: 10xxxxxx
Our character has the 16-bit bit pattern 0101100110001000. This is encoded as 11100101 10100110 10001000 or 0xe5 0xa6 0x88
Go ahead and try it: echo -e "\xe5\xa6\x88\n". Perhaps someone can tell me if it is accurate.
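The three-byte form follows the same pattern with one more continuation byte. A short Python sketch, again with a hypothetical helper name (encode_three_bytes) and a check against the built-in encoder:

```python
def encode_three_bytes(code_point):
    """Pack a 16-bit value as 1110xxxx 10xxxxxx 10xxxxxx."""
    if not 0x800 <= code_point <= 0xffff:
        raise ValueError("the three-byte form covers 0x800 through 0xffff only")
    if 0xd800 <= code_point <= 0xdfff:
        # this range is reserved by Unicode and is not a valid character
        raise ValueError("surrogate values cannot be encoded")
    first = 0b11100000 | (code_point >> 12)                # top four bits
    second = 0b10000000 | ((code_point >> 6) & 0b111111)   # middle six bits
    third = 0b10000000 | (code_point & 0b111111)           # low six bits
    return bytes([first, second, third])


encoded3 = encode_three_bytes(0x5988)
print(encoded3.hex(" "))                      # e5 a6 88
print(encoded3 == "\u5988".encode("utf-8"))   # True
```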
Unicode also has a 32-bit variant, which easily fits all the characters known in all languages in the world. Even allowing for the private use area of the Unicode value range, all characters fit in 21 bits (the largest value is 0x10ffff). It can also be encoded using UTF-8 and requires a maximum of four bytes.
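Python's built-in encoder can be used to see how the encoded length grows with the value. This is an illustrative sketch; the sample value 0x1f600 is an arbitrary character beyond the 16-bit range, chosen here only as an example.

```python
import sys

largest = sys.maxunicode  # the largest Unicode value Python accepts
print(hex(largest), largest.bit_length())  # 0x10ffff 21

samples = (0x4d, 0xe8, 0x5988, 0x1f600, largest)
lengths = []
for cp in samples:
    utf8 = chr(cp).encode("utf-8")
    lengths.append(len(utf8))
    print(f"U+{cp:04X}: {len(utf8)} byte(s): {utf8.hex(' ')}")

print(lengths)  # [1, 2, 3, 4, 4]
```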
If a program understands Unicode it will convert incoming UTF-8 to wide characters for internal manipulation. When it must send the data elsewhere, it will be encoded as UTF-8 again for transmission. Since UTF-8 is an eight-bit encoding, there is never an issue with byte-order.
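The round trip can be sketched in Python, where the str type stands in for the internal wide-character form (an analogy for illustration, not a statement about how any particular program stores text). The 16-bit encodings are shown only to contrast their two byte orders with the single byte order of UTF-8.

```python
raw = b"M\xc3\xa8re"            # five bytes as received
text = raw.decode("utf-8")      # four characters for internal use
print(len(raw), len(text))      # 5 4

# encoding again reproduces the original byte stream exactly
print(text.encode("utf-8") == raw)  # True

# a 16-bit encoding has two possible byte orders; UTF-8 has only one
print(text.encode("utf-16-le").hex(" "))  # 4d 00 e8 00 72 00 65 00
print(text.encode("utf-16-be").hex(" "))  # 00 4d 00 e8 00 72 00 65
```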
If a program does not understand Unicode it will not be able to
translate any encoded Unicode characters in the input stream.
These characters may appear as ? or some other funny character in
the output if the data is later displayed.
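This effect can be imitated in Python by treating the UTF-8 bytes as if they were pure ASCII. In this sketch, errors="replace" is Python's way of substituting a placeholder; other programs may choose a different one.

```python
raw = b"M\xc3\xa8re"

# an ASCII-only reader cannot interpret the two bytes of the encoded character
as_ascii = raw.decode("ascii", errors="replace")
print(len(as_ascii))  # 5 (each unreadable byte became one placeholder)

# going the other way, an ASCII-only writer has to substitute a '?'
print("M\u00e8re".encode("ascii", errors="replace"))  # b'M?re'
```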
More on ASCII
Although not possible with general Unicode characters, the direct classification of an ASCII character using its numeric value is easy since several groups of characters occupy consecutive, increasing numeric values: upper-case letters, lower-case letters, and digits. This means the numeric value of the character '2' is one more than that for the character '1', etc.
This makes it easy to convert the character '5' to the number 5, for example. Just subtract the value of the character '0' from it. And it makes it fairly simple to convert hexadecimal expressed as ASCII to an integer.
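Both conversions can be sketched in a few lines of Python. The function names digit_value, hex_digit_value and parse_hex are made up for illustration; the same subtractions carry over directly to MIPS.

```python
def digit_value(ch):
    """Convert one character '0'..'9' to the number 0..9."""
    if len(ch) != 1 or not "0" <= ch <= "9":
        raise ValueError("not a decimal digit")
    return ord(ch) - ord("0")


def hex_digit_value(ch):
    """Convert one hexadecimal digit character to the number 0..15."""
    if len(ch) != 1:
        raise ValueError("expected a single character")
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "a" <= ch <= "f":
        return ord(ch) - ord("a") + 10
    if "A" <= ch <= "F":
        return ord(ch) - ord("A") + 10
    raise ValueError("not a hexadecimal digit")


def parse_hex(text):
    """Convert hexadecimal expressed as ASCII (no 0x prefix) to an integer."""
    value = 0
    for ch in text:
        value = value * 16 + hex_digit_value(ch)
    return value


print(digit_value("5"))  # 5
print(parse_hex("6d"))   # 109
```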
Other Character Sets
Since most languages can be represented in 128 or 256 characters,
many languages (or language groups) have their own eight-bit code,
possibly with ASCII as a subset. There is a standard set of these
codes (character sets) named ISO8859-X where X indicates the
specific encoding. Of course, data written in one character
encoding must be read using the same encoding. You cannot write
some data encoded in ISO8859-X and expect to read it as UTF-8.
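A short Python sketch shows what goes wrong in each direction. It assumes ISO8859-1 as the eight-bit character set, chosen only because it contains the è used earlier.

```python
text = "M\u00e8re"

latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # the non-ASCII character takes two bytes
print(latin1_bytes.hex(" "))  # 4d e8 72 65
print(utf8_bytes.hex(" "))    # 4d c3 a8 72 65

# reading ISO8859-1 data as UTF-8 fails: 0xe8 announces a multi-byte
# character, but the byte after it is not a continuation byte
try:
    latin1_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False
print(decoded_ok)  # False

# reading UTF-8 data as ISO8859-1 does not fail, but gives the wrong characters
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))  # 5 instead of 4
```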
This page was made entirely with free software on Linux: Kompozer, the Mozilla Project and Openoffice.org