<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
  PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="How Unicode relates to prior standards such as ASCII and EBCDIC" />
<meta name="abstract" content="This topic provides a historical perspective on the Unicode standard, and explains how it can reduce the complexity of handling character data in globalized applications." />
<meta name="description" content="This topic provides a historical perspective on the Unicode standard, and explains how it can reduce the complexity of handling character data in globalized applications." />
<meta name="DC.Relation" scheme="URI" content="rbagsunicodeucs2.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rbagsunicodeandprior" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>How Unicode relates to prior standards such as ASCII and EBCDIC</title>
</head>
<body id="rbagsunicodeandprior"><a name="rbagsunicodeandprior"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">How Unicode relates to prior standards such as ASCII and EBCDIC</h1>
<div><p>This topic provides a historical perspective on the Unicode standard,
and explains how it can reduce the complexity of handling character data in
globalized applications.</p>
<div class="section"><h4 class="sectiontitle">Evolving standards based on limited platforms</h4><p>The
representation of character data in modern computer systems can be fairly
complicated, depending on the needs of your globalized application. One
reason for this complexity is that today's methods for handling character data
evolved from early methods that served simpler environments
and hardware platforms.</p>
<p>In fact, many early decisions about how to encode
characters on a system were guided by the functional requirements of specific
devices, such as the early teletype (TTY) terminals and punch card technologies.
For example, the Delete character (with an ASCII value of x'7F', all seven bits set) was needed
to punch out all of the holes in one position of a punched paper tape to signify
that the position should be ignored. The limited storage capacities of these early computing
systems placed additional constraints on system and application designers.</p>
<p>The
character encoding schemes that have grown out of these early systems were
built on this historical foundation:</p>
<ul><li>The ASCII (American Standard Code for Information Interchange) character
set uses 7-bit units, with a trivial encoding designed for 7-bit bytes. It
is the most important character set in use today, despite its limitation to
very few characters, because its design is the foundation for most modern
character sets. ASCII provides only 128 numeric values, and 33 of those are
reserved for control functions rather than printable characters.</li>
<li>The EBCDIC (Extended Binary-Coded Decimal Interchange Code) character
set and a number of associated character sets, designed by IBM<sup>®</sup> for its mainframes,
use 8-bit bytes. EBCDIC was developed at about the same time as ASCII, shares
the same set of base characters, and has other similar properties. Unlike ASCII,
EBCDIC does not arrange the Latin letters in two contiguous blocks for upper and lower case.
Instead, the letters are arranged so that their hexadecimal values have second
digits of 1 through 9 (a punch card-friendly design).</li>
</ul>
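<p>The two layouts described in the list can be inspected with a short Python sketch.
It assumes that the <samp>cp037</samp> code page shipped with Python is an acceptable
stand-in for EBCDIC in general; other EBCDIC code pages differ in detail.</p>

```python
import unicodedata

# Count the ASCII values that are control codes (Unicode category "Cc").
ascii_controls = [i for i in range(128) if unicodedata.category(chr(i)) == "Cc"]
print(len(ascii_controls))  # values 0-31 plus 127 (Delete)

letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

# In ASCII, the upper-case letters form one contiguous run of values.
upper_ascii = letters.encode("ascii")
print(list(upper_ascii) == list(range(0x41, 0x5B)))

# Assumption: cp037 is used here as a representative EBCDIC code page.
upper_ebcdic = letters.encode("cp037")
print(" ".join(f"{b:02x}" for b in upper_ebcdic))  # three separate runs of values

# Collect the second hexadecimal digit of every letter.
second_digits = sorted({b % 16 for b in upper_ebcdic})
print(second_digits)
```

<p>The last line lists which second hexadecimal digits the letters occupy, which makes
the gaps between the three runs of letters visible.</p>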
</div>
<div class="section"><h4 class="sectiontitle">Historical simplicity creates modern complexity</h4><p>The
physical and functional limitations of the early character sets gave way to
rapidly expanding hardware and functional capabilities. Character representation
on computing systems became less dependent on hardware; instead, software
designers used the existing encoding schemes to accommodate the needs of an
increasingly global community of computer users.</p>
</div>
<div class="section"><h4 class="sectiontitle">Character sets for many characters</h4><p>The most common
encodings (character encoding schemes) use a single byte per character and
are often called single-byte character sets (SBCS). Each is limited
to 256 characters, so no single SBCS can cover even all of the
accented letters of the Western European languages. Consequently, many
such encodings were created over time to meet the needs of different user
communities. The most widely used SBCS encoding today, after ASCII, is ISO-8859-1.
It is an 8-bit superset of ASCII and provides most of the characters necessary
for Western Europe.</p>
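<p>As a rough illustration of these limits, the following Python sketch decodes every
possible byte value as ISO-8859-1 and then checks a few accented letters. The helper
name <samp>fits_latin1</samp> and the sample letters are chosen for this example only.</p>

```python
def fits_latin1(ch):
    """Return True if the character has a byte value in ISO-8859-1."""
    try:
        ch.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False

# One byte can take 256 values, so a single-byte set holds at most 256 characters.
all_bytes = bytes(range(256))
latin1_text = all_bytes.decode("iso-8859-1")
print(len(latin1_text))

# The first 128 values mean the same thing in ASCII and in ISO-8859-1.
print(all_bytes[:128].decode("ascii") == latin1_text[:128])

# e-acute, n-tilde, u-umlaut, and the oe ligature used in French.
for ch in ["\u00e9", "\u00f1", "\u00fc", "\u0153"]:
    print(f"U+{ord(ch):04X}", fits_latin1(ch))
```

<p>The oe ligature is the one sample letter without an ISO-8859-1 byte value, which is
one example of a Western European letter that this encoding does not cover.</p>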
<p>East Asian writing systems, however, need a way
to store over 10,000 characters, so double-byte character sets (DBCS)
were developed to provide enough space for their thousands of ideographic characters.
Here, the encoding is still byte-based, but
each pair of bytes together represents a single character.</p>
<p>Even in East Asia,
text contains letters from small alphabets like Latin or Katakana. These are
represented more efficiently with single bytes. Multi-byte character sets
(MBCS) provide for this by using a variable number of bytes per character,
which distinguishes them from the DBCS encodings. MBCSs are often compatible
with ASCII; that is, the Latin letters are represented in such encodings with
the same bytes that ASCII uses. Some less frequently used characters might be encoded
using three or even four bytes.</p>
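<p>The variable length can be observed with the Shift-JIS and EUC-JP codecs that ship
with Python. This is only a sketch: the three sample characters are picked for
illustration, and the dictionary name <samp>lengths</samp> is not part of any API.</p>

```python
samples = {
    "Latin letter": "A",
    "half-width Katakana": "\uff71",
    "Kanji": "\u6f22",
}

# Record how many bytes each character needs in two Japanese MBCS encodings.
lengths = {}
for label, ch in samples.items():
    sjis_len = len(ch.encode("shift_jis"))
    eucjp_len = len(ch.encode("euc_jp"))
    lengths[label] = (sjis_len, eucjp_len)
    print(f"{label}: Shift-JIS {sjis_len} byte(s), EUC-JP {eucjp_len} byte(s)")

# ASCII compatibility: the Latin letter keeps its ASCII byte value.
print("A".encode("shift_jis") == "A".encode("ascii"))
```

<p>Note that the same character can need a different number of bytes in different
MBCS encodings, so byte counts cannot be carried over from one encoding to another.</p>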
<p>An important feature of MBCSs is that
they have byte value ranges that are dedicated to lead bytes and trail bytes.
Special ranges for lead bytes, the first bytes in multibyte sequences, make
it possible to determine how many bytes belong together to encode a single character.
Traditional MBCS encodings are designed so that it is easy to move forward
through a stream of bytes and read characters. Moving backward through text, however,
is often complicated and depends heavily on the properties of the encoding:
it is often hard to tell how many bytes
represent a single character, and sometimes it is necessary to scan forward
from the beginning of the text to find out.</p>
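<p>A forward scan driven by lead-byte ranges might look like the following Python sketch.
The ranges shown are the lead-byte ranges commonly documented for Shift-JIS and are an
assumption here, as are the function names; a production decoder would also validate
trail bytes.</p>

```python
# Assumed Shift-JIS lead-byte ranges: x'81'-x'9F' and x'E0'-x'FC'.
LEAD_RANGES = (range(0x81, 0xA0), range(0xE0, 0xFD))

def is_lead_byte(value):
    """Return True if the byte starts a two-byte sequence."""
    return any(value in r for r in LEAD_RANGES)

def split_forward(data):
    """Walk forward through the bytes and cut them into one chunk per character."""
    chunks = []
    pos = 0
    while len(data) > pos:
        step = 2 if is_lead_byte(data[pos]) else 1
        chunks.append(data[pos:pos + step])
        pos += step
    return chunks

data = "A\uff71\u6f22B".encode("shift_jis")
print(split_forward(data))

# Why going backward is hard: a trail byte can equal an ASCII byte value.
katakana_so = "\u30bd".encode("shift_jis")
print(katakana_so[1] == ord("\\"))
```

<p>Because the trail byte of the last example has the same value as an ASCII backslash,
a program that looks at that byte in isolation cannot tell whether it is a complete
character or the second half of one.</p>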
<p>Examples
of commonly used MBCS encodings are Shift-JIS and EUC-JP (for Japanese), with
up to 2 and 3 bytes per character, respectively.</p>
</div>
<div class="section"><h4 class="sectiontitle">Stateful encodings</h4><p>Some encodings are stateful;
they have bytes or byte sequences that switch the meanings of the following
bytes. Simple encodings, like mixed-byte EBCDIC, use Shift-In and Shift-Out
control characters (bytes) to switch between two states. Typically, the bytes
after a Shift-In are interpreted as a certain SBCS encoding, and the bytes
after a Shift-Out as a certain DBCS encoding. This is very different from
an MBCS encoding, where the bytes for each character indicate the length of
the byte sequence.</p>
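<p>The two-state switching can be sketched in Python as follows. This is a hand-written
illustration, not a real EBCDIC converter: it assumes that Shift-Out is x'0E' and Shift-In
is x'0F', that a stream starts in the single-byte state, and the double-byte values in the
sample stream are made up.</p>

```python
SHIFT_OUT = 0x0E  # assumed value: switches to the double-byte state
SHIFT_IN = 0x0F   # assumed value: switches back to the single-byte state

def split_by_state(data):
    """Split a mixed-byte stream into (state, bytes) runs."""
    runs = []
    state = "SBCS"  # assumption: streams start in the single-byte state
    current = bytearray()
    for value in data:
        if value in (SHIFT_OUT, SHIFT_IN):
            # A shift byte ends the current run and selects the next state.
            if current:
                runs.append((state, bytes(current)))
                current = bytearray()
            state = "DBCS" if value == SHIFT_OUT else "SBCS"
        else:
            current.append(value)
    if current:
        runs.append((state, bytes(current)))
    return runs

# Hypothetical stream: two single bytes, two double-byte pairs, one single byte.
stream = bytes([0xC1, 0xC2, 0x0E, 0x42, 0xC1, 0x42, 0xC2, 0x0F, 0xC3])
runs = split_by_state(stream)
print(runs)
```

<p>The byte x'C1' appears in both a single-byte run and a double-byte run. Its meaning
depends entirely on the most recent shift byte, not on the byte itself.</p>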
<p>The most common stateful encoding is ISO 2022 and
its language-specific variations. It uses Escape sequences (byte sequences
starting with an ASCII Escape character, byte value 27 or x'1B') to switch between
many different embedded encodings. It can also <em>announce</em> encodings that
are to be used with special shifting characters in the embedded byte stream.
Language-specific variants like ISO-2022-JP limit the set of embeddable encodings
and specify only a small set of acceptable Escape sequences for them.</p>
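<p>The Escape sequences become visible when text is encoded with the
<samp>iso2022_jp</samp> codec that ships with Python. The sample string is chosen for
illustration only.</p>

```python
# A Latin letter, two Kanji, and another Latin letter.
text = "A\u6f22\u5b57B"
encoded = text.encode("iso2022_jp")
print(encoded)

# The second byte is the ASCII Escape character that starts the first sequence.
print(encoded[1])

# The encoder switches to a double-byte set and later back to ASCII.
print(encoded.count(b"\x1b"))

# Without the Escape sequences, the same payload bytes are plain ASCII text.
print(b"4A;z".decode("iso2022_jp"))
```

<p>The four payload bytes between the two Escape sequences are ordinary printable
ASCII values. Only the preceding Escape sequence tells a decoder to read them as
two double-byte characters.</p>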
<p>Such
encodings are very powerful for data exchange but hard to use in an application.
Their flexibility allows you to embed many other encodings, but direct use
in programs and conversions to and from other encodings are complicated. For
direct use, a program has to keep track not only of the current position in
the text, but also of the state (which embeddable encoding is currently active), or it
must be able to determine the state for a position from considerable context.
For conversions to other encodings, converting software might
need to have mappings for many embeddable encodings, and for conversions from
other encodings, special code must figure out which embeddable encoding to
choose for each character.</p>
</div>
<div class="section"><h4 class="sectiontitle">Why Unicode?</h4><p>Hundreds of encodings have been developed,
each for small groups of languages and special purposes. As a result, the
interpretation, input, sorting, display, and storage of text depend on
knowledge of all the different types of character sets and their encodings.
Programs are written either to handle a single encoding at a time and switch
between them, or to convert between external and internal encodings.</p>
<p>Part
of the problem is that there is no single, authoritative source of precise
definitions for many of the encodings and their names. Transferring text
from one machine to another often causes some loss of information. Also,
a program that has the code and the data to perform conversion between a significant
subset of traditional encodings carries several megabytes of data
around.</p>
<p>Unicode provides a single character set that covers the languages
of the world, and a small number of machine-friendly encoding forms and schemes
to fit the needs of existing applications and protocols. It is designed for
best interoperability with both ASCII and ISO-8859-1, the most widely used
character sets, to make it easier for Unicode to be used in applications and
protocols.</p>
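<p>One way to see this interoperability is to compare byte values with code points in
Python. The sketch below is illustrative; the variable names and the choice of sample
character are assumptions of this example.</p>

```python
# Every ISO-8859-1 byte value equals the Unicode code point of its character.
latin1_bytes = bytes(range(256))
code_points = [ord(ch) for ch in latin1_bytes.decode("iso-8859-1")]
print(code_points == list(range(256)))

# Text that is pure ASCII has identical bytes in ASCII and in UTF-8.
ascii_text = "Unicode"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))

# One character (e-acute) in three Unicode encoding schemes.
sizes = {}
for scheme in ("utf-8", "utf-16-be", "utf-32-be"):
    sizes[scheme] = len("\u00e9".encode(scheme))
print(sizes)
```

<p>The character keeps the same code point in every scheme; only the number and layout
of the bytes that carry it change.</p>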
<p>Unicode is in use today, and it is the preferred character
set for the Internet, especially for HTML and XML. It is slowly being adopted
for use in e-mail, too. Its most attractive property is that it covers all
the characters of the world (with some exceptions, which are to be added in the future).
Unicode makes it possible to access and manipulate characters by unique numbers
(that is, their Unicode code points) and use older encodings only for input
and output, if at all.</p>
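<p>The pattern of using older encodings only at the boundaries can be sketched in Python
as follows. The choice of Shift-JIS for input and EUC-JP for output is an assumption
made for this example, as are the variable names.</p>

```python
# Input boundary: bytes arrive in an older encoding and are decoded once.
input_bytes = "A\uff71\u6f22".encode("shift_jis")  # stands in for data read from a file
text = input_bytes.decode("shift_jis")

# Inside the program, each character is addressed by its code point.
code_points = [f"U+{ord(ch):04X}" for ch in text]
print(code_points)

# Stepping backward is simple because no byte-level state is involved.
print(f"U+{ord(text[-1]):04X}")

# Output boundary: encode once, in whatever encoding the receiver expects.
output_bytes = text.encode("euc_jp")
print(len(input_bytes), len(output_bytes))
```

<p>The input and output byte counts differ even though the text is the same, which is
why the sketch works on code points between the two boundaries rather than on bytes.</p>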
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="rbagsunicodeucs2.htm" title="Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application to work for a global audience.">Work with Unicode</a></div>
</div>
</div>
</body>
</html>