<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en-us" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow" />
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="DC.Type" content="concept" />
<meta name="DC.Title" content="How Unicode relates to prior standards such as ASCII and EBCDIC" />
<meta name="abstract" content="This topic provides a historical perspective on the Unicode standard, and explains how it can reduce the complexity of handling character data in globalized applications." />
<meta name="description" content="This topic provides a historical perspective on the Unicode standard, and explains how it can reduce the complexity of handling character data in globalized applications." />
<meta name="DC.Relation" scheme="URI" content="rbagsunicodeucs2.htm" />
<meta name="copyright" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Rights.Owner" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="DC.Format" content="XHTML" />
<meta name="DC.Identifier" content="rbagsunicodeandprior" />
<meta name="DC.Language" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<link rel="stylesheet" type="text/css" href="./ibmdita.css" />
<link rel="stylesheet" type="text/css" href="./ic.css" />
<title>How Unicode relates to prior standards such as ASCII and EBCDIC</title>
</head>
<body id="rbagsunicodeandprior"><a name="rbagsunicodeandprior"><!-- --></a>
<!-- Java sync-link --><script language="Javascript" src="../rzahg/synch.js" type="text/javascript"></script>
<h1 class="topictitle1">How Unicode relates to prior standards such as ASCII and EBCDIC</h1>
<div><p>This topic provides a historical perspective on the Unicode standard,
and explains how it can reduce the complexity of handling character data in
globalized applications.</p>
<div class="section"><h4 class="sectiontitle">Evolving standards based on limited platforms</h4><p>The
representation of character data in modern computer systems can be fairly
complicated, depending on the needs of your globalized application. One of
the reasons for this complexity is that the methods for handling this data
have evolved from early methods that served less complicated environments
and hardware platforms.</p>
<p>In fact, many early decisions about how to encode
characters on a system were guided by the functional requirements of specific
devices, such as the early teletype (TTY) terminals and punched card and paper tape technologies.
For example, the Delete character (with an ASCII value of x'7F', all seven bits set) was required
in order to punch out all of the holes in a row of a paper tape to signify
that the row should be ignored. The storage capacities of these early computing
systems placed additional limitations on system and application designers.</p>
<p>The
character encoding schemes that have grown out of these early systems were
built on this historical foundation:</p>
<ul><li>The ASCII (American Standard Code for Information Interchange) character
set uses 7-bit units, with a trivial encoding designed for 7-bit bytes. It
is the most important character set in use today, despite its limitation to
very few characters, because its design is the foundation for most modern
character sets. ASCII provides only 128 numeric values, and 33 of those are
reserved for special functions.</li>
<li>The EBCDIC (Extended Binary-Coded Decimal Interchange Code) character
set and a number of associated character sets, designed by IBM<sup>®</sup> for its mainframes,
use 8-bit bytes. EBCDIC was developed at around the same time as ASCII, shares
the same set of base characters, and has other similar properties. Unlike in ASCII,
the Latin letters are not arranged in two contiguous blocks for uppercase and lowercase.
Instead, the letters are arranged so that their hexadecimal values have second
digits of 1 through 9 (another punch card-friendly design).</li>
</ul>
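<p>The properties listed above can be checked with a short Python sketch. It
assumes Python's built-in <code>cp037</code> codec (EBCDIC for US and Canada) as a
representative EBCDIC code page, because this topic does not name a specific
one. The helper names are illustrative only.</p>

```python
import string

# ASCII defines 128 values (0 through 127). The values reserved for special
# functions are the control characters 0-31 plus Delete (127).
ascii_controls = [v for v in range(128) if v in range(0, 32) or v == 127]

def ebcdic_second_digits(letters: str, codec: str = "cp037") -> set[str]:
    """Return the set of second hexadecimal digits used by the given letters."""
    return {f"{byte:02X}"[1] for byte in letters.encode(codec)}

def is_contiguous(letters: str, codec: str) -> bool:
    """True if the encoded letters occupy one unbroken run of byte values."""
    values = list(letters.encode(codec))
    return values == list(range(values[0], values[0] + len(values)))

if __name__ == "__main__":
    print(len(ascii_controls))
    print(sorted(ebcdic_second_digits(string.ascii_letters)))
    print(is_contiguous(string.ascii_lowercase, "ascii"))
    print(is_contiguous(string.ascii_lowercase, "cp037"))
```

<p>In this code page the lowercase letters fall into three separate runs, which is
why the contiguity check succeeds for ASCII but not for EBCDIC.</p>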
</div>
<div class="section"><h4 class="sectiontitle">Historical simplicity creates modern complexity</h4><p>The
physical and functional limitations of the early character sets gave way to
rapidly expanding hardware and functional capabilities. Character representation
on computing systems became less dependent on hardware; instead, software
designers used the existing encoding schemes to accommodate the needs of an
increasingly global community of computer users.</p>
</div>
<div class="section"><h4 class="sectiontitle">Character sets for many characters</h4><p>The most common
encodings (character encoding schemes) use a single byte per character, and
they are often called single-byte character sets (SBCS). They are all limited
to 256 characters. Because of this, none of them can even cover all of the
accented letters for the Western European languages. Consequently, many different
encodings of this kind were created over time to fulfill the needs of different user
communities. The most widely used SBCS encoding today, after ASCII, is ISO-8859-1.
It is an 8-bit superset of ASCII and provides most of the characters necessary
for Western Europe.</p>
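<p>As a minimal sketch of the 256-character limit, the following Python code
compares ISO-8859-1 with ISO-8859-15, a later Western European SBCS that is not
discussed in this topic and is used here only as an illustration. The sample
characters and the <code>fits</code> helper are assumptions chosen for the example.</p>

```python
def fits(text: str, codec: str) -> bool:
    """True if every character of text has a byte value in the given SBCS."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# ISO-8859-1 is an 8-bit superset of ASCII: ASCII text yields the same bytes.
ascii_compatible = "Unicode".encode("latin-1") == "Unicode".encode("ascii")

oe_ligature = "\u0153"   # French oe ligature
one_half = "\u00bd"      # vulgar fraction one half

if __name__ == "__main__":
    # With only 256 positions, adding one character means dropping another.
    print(fits(oe_ligature, "latin-1"), fits(one_half, "latin-1"))
    print(fits(oe_ligature, "iso8859_15"), fits(one_half, "iso8859_15"))
```

<p>Each encoding covers one of the two characters but not both, which is the kind
of trade-off that led to so many different single-byte encodings.</p>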
<p>However, East Asian writing systems require more than 10,000 characters,
and so double-byte character sets (DBCS)
were developed to provide enough space for the thousands of ideographic characters
that these writing systems use. Here, the encoding is still byte-based, but
each pair of bytes represents a single character.</p>
<p>Even in East Asia,
text contains letters from small alphabets like Latin or Katakana. These are
represented more efficiently with single bytes. Multi-byte character sets
(MBCS) provide for this by using a variable number of bytes per character,
which distinguishes them from the DBCS encodings. MBCSs are often compatible
with ASCII; that is, the Latin letters are represented in such encodings with
the same bytes that ASCII uses. Some less frequently used characters may be encoded
using three or even four bytes.</p>
<p>An important feature of MBCSs is that
they have byte value ranges that are dedicated for lead bytes and trail bytes.
Special ranges for lead bytes, the first bytes in multibyte sequences, make
it possible to decide how many bytes belong together to encode a single character.
Traditional MBCS encodings are designed so that it is easy to go forwards
through a stream of bytes and read characters. Going backwards, however, is often
complicated and very dependent on the properties of the encoding, because a
trail byte can have the same value as a single-byte character. It is therefore
hard to tell how many of the preceding bytes belong to one character, and sometimes
it is necessary to go forward from the beginning of the text to find out.</p>
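<p>The role of lead bytes can be sketched in Python as follows. The lead-byte
ranges are an assumption based on the commonly described layout of Shift-JIS
(vendor variants differ), and <code>split_characters</code> is a hypothetical helper,
not part of any library.</p>

```python
# Assumed Shift-JIS lead-byte ranges (first byte of a 2-byte character).
LEAD_BYTE_RANGES = (range(0x81, 0xA0), range(0xE0, 0xFD))

def is_lead_byte(value: int) -> bool:
    """True if the byte value announces a 2-byte character."""
    return any(value in r for r in LEAD_BYTE_RANGES)

def split_characters(data: bytes) -> list[bytes]:
    """Walk forward through the bytes, using lead bytes to size each character."""
    chunks = []
    pos = 0
    while pos in range(len(data)):
        width = 2 if is_lead_byte(data[pos]) else 1
        chunks.append(data[pos:pos + width])
        pos += width
    return chunks

if __name__ == "__main__":
    # Latin A, full-width Katakana A, half-width Katakana A, one kanji.
    text = "A\u30a2\uff71\u65e5"
    for chunk in split_characters(text.encode("shift_jis")):
        print(chunk.hex(), chunk.decode("shift_jis"))
```

<p>In this sample, the trail byte of the full-width Katakana character has the
same value as the ASCII letter A, so a program that steps backwards onto that
byte cannot tell from the byte alone whether it is a whole character.</p>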
<p>Examples
of commonly used MBCS encodings are Shift-JIS and EUC-JP (for Japanese), with
up to 2 and 3 bytes per character, respectively.</p>
</div>
<div class="section"><h4 class="sectiontitle">Stateful encodings</h4><p>Some encodings are stateful;
they have bytes or byte sequences that switch the meanings of the following
bytes. Simple encodings, like mixed-byte EBCDIC, use Shift-In and Shift-Out
control characters (bytes) to switch between two states. In such encodings, the bytes
after a Shift-In are interpreted as a certain SBCS encoding, and the bytes
after a Shift-Out as a certain DBCS encoding. This is very different from
an MBCS encoding, where the bytes for each character indicate the length of
the byte sequence.</p>
<p>The most common stateful encoding is ISO 2022 and
its language-specific variations. It uses Escape sequences (byte sequences
starting with an ASCII Escape character, byte value 27 or x'1B') to switch between
many different embedded encodings. It can also <em>announce</em> encodings that
are to be used with special shifting characters in the embedded byte stream.
Language-specific variants like ISO-2022-JP limit the set of embeddable encodings
and specify only a small set of acceptable Escape sequences for them.</p>
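<p>The effect of Escape sequences can be observed with Python's built-in
<code>iso2022_jp</code> codec. This is a sketch under the assumption that the codec's
output is representative of ISO-2022-JP in general; the variable names are
illustrative.</p>

```python
ESC = b"\x1b"

text = "A\u65e5\u672cB"              # Latin A, two kanji, Latin B
encoded = text.encode("iso2022_jp")

# The encoder inserts one Escape sequence to switch to the kanji character
# set and another one to switch back to ASCII.
escape_count = encoded.count(ESC)

# The two bytes that carry a kanji are ordinary 7-bit values. Only the
# state set by the preceding Escape sequence gives them their kanji meaning.
kanji = "\u65e5".encode("iso2022_jp")
payload = kanji[3:-3]                 # drop the leading and trailing 3-byte Escape sequences
as_ascii = payload.decode("ascii")
as_kanji = kanji.decode("iso2022_jp")

if __name__ == "__main__":
    print(encoded)
    print(escape_count, repr(as_ascii), as_kanji)
```

<p>The same two payload bytes decode to two ASCII characters in one state and to a
single kanji in the other, which is what makes the encoding stateful.</p>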
<p>Such
encodings are very powerful for data exchange but hard to use in an application.
Their flexibility allows you to embed many other encodings, but direct use
in programs and conversions to and from other encodings are complicated. For
direct use, a program has to keep track not only of the current position in
the text, but also of the state--which embeddable encoding is currently active--or
must be able to determine the state for a position from considerable context.
For conversions to other encodings, converting software might
need to have mappings for many embeddable encodings, and for conversions from
other encodings, special code must figure out which embeddable encoding to
choose for each character.</p>
</div>
<div class="section"><h4 class="sectiontitle">Why Unicode?</h4><p>Hundreds of encodings have been developed,
each for small groups of languages and special purposes. As a result, the
interpretation of text, input, sorting, display, and storage depends on the
knowledge of all the different types of character sets and their encodings.
Programs are written either to handle a single encoding at a time and switch
between encodings, or to convert between external and internal encodings.</p>
<p>Part
of the problem is that there is no single, authoritative source of precise
definitions of many of the encodings and their names. Transferring text
from one machine to another often causes some loss of information. Also,
if a program has the code and the data to perform conversion between a significant
subset of traditional encodings, then it has to carry several megabytes of
conversion data with it.</p>
<p>Unicode provides a single character set that covers the languages
of the world, and a small number of machine-friendly encoding forms and schemes
to fit the needs of existing applications and protocols. It is designed for
best interoperability with both ASCII and ISO-8859-1, the most widely used
character sets, to make it easier for Unicode to be used in applications and
protocols.</p>
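<p>The interoperability with ASCII and ISO-8859-1 can be illustrated with a few
lines of Python. UTF-8 is used here as one example of a Unicode encoding
scheme; this topic does not name it, so treat that choice as an assumption of
the sketch.</p>

```python
# The first 128 Unicode code points match ASCII, and the first 256 match
# ISO-8859-1, so decoding either one maps each byte to the same-valued code point.
all_bytes = bytes(range(256))
latin1_code_points = [ord(ch) for ch in all_bytes.decode("latin-1")]
ascii_code_points = [ord(ch) for ch in all_bytes[:128].decode("ascii")]

# UTF-8 keeps ASCII text byte-for-byte identical.
sample = "Work with Unicode"
same_bytes = sample.encode("utf-8") == sample.encode("ascii")

if __name__ == "__main__":
    print(latin1_code_points == list(range(256)))
    print(ascii_code_points == list(range(128)))
    print(same_bytes)
```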
<p>Unicode is in use today, and it is the preferred character
set for the Internet, especially for HTML and XML. It is slowly being adopted
for use in e-mail, too. Its most attractive property is that it covers nearly all
the characters of the world (the remaining exceptions are to be added in future versions).
Unicode makes it possible to access and manipulate characters by unique numbers
(that is, their Unicode code points) and use older encodings only for input
and output, if at all.</p>
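<p>A minimal sketch of this approach in Python: older encodings appear only at the
input and output boundaries, and the program works with code points in between.
The functions <code>read_legacy</code> and <code>write_legacy</code> are hypothetical helpers,
and the code pages <code>cp037</code> and <code>latin-1</code> are assumed examples.</p>

```python
def read_legacy(data: bytes, encoding: str) -> str:
    """Input boundary: convert bytes in an older encoding to Unicode text."""
    return data.decode(encoding)

def write_legacy(text: str, encoding: str) -> bytes:
    """Output boundary: convert Unicode text to an older encoding."""
    return text.encode(encoding)

word = "Gr\u00fc\u00dfe"

# The same word as it might arrive from two systems with different encodings.
ebcdic_bytes = write_legacy(word, "cp037")
latin1_bytes = write_legacy(word, "latin-1")

# After decoding, both inputs are the same sequence of code points.
from_ebcdic = read_legacy(ebcdic_bytes, "cp037")
from_latin1 = read_legacy(latin1_bytes, "latin-1")
code_points = [f"U+{ord(ch):04X}" for ch in from_ebcdic]

if __name__ == "__main__":
    print(ebcdic_bytes.hex(), latin1_bytes.hex())
    print(from_ebcdic == from_latin1, code_points)
```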
</div>
</div>
<div>
<div class="familylinks">
<div class="parentlink"><strong>Parent topic:</strong> <a href="rbagsunicodeucs2.htm" title="Unicode is a standard that precisely defines a character set as well as a small number of encodings for it. It enables you to handle text in any language efficiently. It allows a single application to work for a global audience.">Work with Unicode</a></div>
</div>
</div>
</body>
</html>