<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="dc.language" scheme="rfc1766" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<meta name="dc.date" scheme="iso8601" content="2005-09-19" />
<meta name="copyright" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow"/>
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="keywords" content="character conversion, character set, code page,
code point, coded character set, encoding scheme, substitution character,
Unicode, combining characters, normalization, surrogates,
CDRA (Character Data Representation Architecture),
Character Data Representation Architecture (CDRA), definition,
CCSID (coded character set identifier), default" />
<title>Character conversion</title>
<link rel="stylesheet" type="text/css" href="ibmidwb.css" />
<link rel="stylesheet" type="text/css" href="ic.css" />
</head>
<body>
<a id="Top_Of_Page" name="Top_Of_Page"></a><!-- Java sync-link -->
<script language = "Javascript" src = "../rzahg/synch.js" type="text/javascript"></script>
<a name="ccseta"></a>
<h2 id="ccseta"><a href="rbafzmst02.htm#ToC_60">Character conversion</a></h2><a id="idx168" name="idx168"></a>
<p>A <span class="italic">string</span> is a sequence of bytes that may represent
characters. Within a string, all the characters are represented by a common
coding representation. In some cases, it might be necessary to convert these
characters to a different coding representation. The process of conversion
is known as <span class="italic">character conversion</span>.<sup class="fn"><a id="wq39" name="wq39" href="rbafzmstccseta.htm#wq40">6</a></sup></p>
<p>Character conversion can occur when an SQL statement is executed remotely.
Consider, for example, these two cases: </p>
<ul>
<li>The values of variables sent from the application requester to the current
server.</li>
<li>The values of result columns sent from the current server to the application
requester.</li></ul><p class="indatacontent"> In either case, the string could have a different representation at
the sending and receiving systems. Conversion can also occur during string
operations on the same system.</p>
<p>Note that SQL statements are strings and are therefore subject to character
conversion.</p>
<p>The following list defines some of the terms used when discussing character
conversion. </p>
<dl>
<dt class="bold">character set</dt><a id="idx169" name="idx169"></a><a id="idx170" name="idx170"></a>
<dd>A defined set of characters. For example, the following character set
appears in several code pages:
<ul>
<li>26 non-accented letters A through Z</li>
<li>26 non-accented letters a through z</li>
<li>digits 0 through 9</li>
<li>. , : ; ? ( ) ' " / - _ &amp; + % * = &lt; &gt;</li></ul>
</dd>
<dt class="bold">code page</dt><a id="idx171" name="idx171"></a><a id="idx172" name="idx172"></a>
<dd>A set of assignments of characters to code points. In EBCDIC, for example, <span>"A"</span> is assigned code point <span class="hex">X'C1'</span> and <span>"B"</span> is assigned
code point <span class="hex">X'C2'</span>. Within a code page, each code point has only one
specific meaning.
</dd>
<dt class="bold">code point</dt><a id="idx173" name="idx173"></a><a id="idx174" name="idx174"></a>
<dd>A unique bit pattern that represents a character within a code page.
</dd>
<dt class="bold">coded character set</dt><a id="idx175" name="idx175"></a>
<dd>A set of unambiguous rules that establish a character set and the one-to-one
relationships between the characters of the set and their coded representations.
</dd>
<dt class="bold">encoding scheme</dt><a id="idx176" name="idx176"></a><a id="idx177" name="idx177"></a>
<dd>A set of rules used to represent character data. For example:
<ul>
<li>Single-byte EBCDIC</li>
<li>Single-byte ASCII</li>
<li>Double-byte EBCDIC</li>
<li>Mixed single- and double-byte ASCII<sup class="fn"><a id="wq41" name="wq41" href="rbafzmstccseta.htm#wq42">7</a></sup></li>
<li>Unicode (UTF-8, UCS-2, and UTF-16 universal coded character sets).</li></ul>
</dd>
<dt class="bold">substitution character</dt><a id="idx178" name="idx178"></a><a id="idx179" name="idx179"></a>
<dd>A unique character that is substituted during character conversion for
any characters in the source coding representation that do not have a match
in the target coding representation.
</dd>
<dt class="bold">Unicode</dt><a id="idx180" name="idx180"></a><a id="idx181" name="idx181"></a>
<dd>A universal encoding scheme for written characters and text that enables
the exchange of data internationally. It provides a character set standard
that can be used all over the world. It uses a 16-bit encoding form that provides
code points for more than 65,000 characters and an extension called UTF-16
that allows for encoding as many as a million more characters. It provides
the ability to encode all characters used for the written languages of the
world and treats alphabetic characters, ideographic characters, and symbols
equivalently because it specifies a numeric value and a name for each of its
characters. It includes punctuation marks, mathematical symbols, technical
symbols, geometric shapes, and dingbats. Three encoding forms are supported:
<ul>
<li>UTF-8: Unicode Transformation Format, an 8-bit encoding form designed for
ease of use with existing ASCII-based systems. UTF-8 data is stored in character
data types. The CCSID value for data in UTF-8 format is 1208.
<p>A UTF-8 character
can be 1, 2, 3, or 4 bytes in length. A UTF-8 data string can contain any combination
of SBCS and DBCS data, including surrogates and combining characters.</p></li>
<li>UCS-2: Universal Character Set coded in 2 octets, which means that characters
are represented in 16 bits per character. UCS-2 data is stored in graphic
data types. The CCSID value for data in UCS-2 format is 13488.
<p>UCS-2 is
a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports
combining characters and surrogates. Because UCS-2 is a simpler form of UTF-16,
UCS-2 data will typically perform better than UTF-16 data.<sup class="fn"><a id="wq43" name="wq43" href="rbafzmstccseta.htm#wq44">8</a></sup></p></li>
<li>UTF-16: Unicode Transformation Format, a 16-bit encoding form designed
to provide code values for over a million characters. UTF-16 data is stored
in graphic data types. The CCSID value for data in UTF-16 format is 1200.<a id="idx182" name="idx182"></a><a id="idx183" name="idx183"></a><a id="idx184" name="idx184"></a><a id="idx185" name="idx185"></a>
<p>Both UTF-8 and UTF-16 data can contain <var class="pv">combining characters</var>. Combining character support allows a resulting
character to be composed of more than one character. After the first character,
hundreds of different non-spacing accent characters (umlauts, accents, and so on)
can follow in the data string. The resulting character may already be defined
in the character set. In this case, there are multiple representations for
the same character. For example, in UTF-16, an <span class="italic">é</span> can
be represented either by X'00E9' (the normalized representation) or X'00650301'
(the non-normalized combining character representation).</p>
<p>Because multiple
representations of the same character will not compare equal, it is usually
not a good idea to store both forms of the characters in the database. <var class="pv">Normalization</var> is a process that replaces the string of combining characters
with equivalent characters that do not include combining characters. After
normalization has occurred, only one representation of any specific character
will exist in the data. For example, in UTF-16, any instances of X'00650301'
(the non-normalized combining character representation of <span class="italic">é</span>) will be converted to X'00E9' (the normalized representation of <span class="italic">é</span>).<sup class="fn"><a id="wq45" name="wq45" href="rbafzmstccseta.htm#wq46">9</a></sup></p><a id="idx186" name="idx186"></a><a id="idx187" name="idx187"></a>
<p>Both UTF-8 and UTF-16 can contain 4-byte characters called <var class="pv">surrogates</var>. Surrogates are 4-byte sequences that can address one million more characters
than would be available in a 2-byte character set.</p></li></ul>
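<p>The two representations of <span class="italic">é</span> and the 4-byte form of a surrogate can be reproduced with a minimal Python sketch. It assumes that Unicode normalization form NFC from Python's unicodedata module is an acceptable stand-in for the normalization described above, and it picks U+1D11E (a musical symbol) as an arbitrary character outside the 16-bit range. The helper name utf16_hex is hypothetical.</p>

```python
import unicodedata

# Two representations of the same character, e with an acute accent.
composed = "\u00e9"      # X'00E9', the normalized representation
combining = "e\u0301"    # X'00650301', base letter plus combining accent


def utf16_hex(text):
    """Return the big-endian UTF-16 bytes of text as uppercase hex."""
    return text.encode("utf-16-be").hex().upper()


print(utf16_hex(composed))     # 00E9
print(utf16_hex(combining))    # 00650301
print(composed == combining)   # False: the two forms do not compare equal

# NFC (an assumption for this sketch) replaces the combining sequence
# with the single equivalent code point.
normalized = unicodedata.normalize("NFC", combining)
print(normalized == composed)  # True

# A character outside the 16-bit range occupies 4 bytes in both forms.
clef = "\U0001D11E"
print(utf16_hex(clef))                       # D834DD1E
print(clef.encode("utf-8").hex().upper())    # F09D849E
```

<p>Comparing the strings before and after normalization shows why storing both forms in the same column leads to unexpected mismatches.</p>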
</dd>
</dl>
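<p>The code page and substitution character definitions above can be illustrated with a minimal Python sketch. It assumes Python's cp037 codec as a stand-in for an EBCDIC code page, and it uses the codec's "replace" error handler, which substitutes a question mark. The substitution character that a database manager actually uses depends on the target CCSID and is not shown here.</p>

```python
# Code page: each character is assigned a code point.
for ch in "AB":
    ebcdic = ch.encode("cp037")   # EBCDIC code page (assumed for illustration)
    ascii_bytes = ch.encode("ascii")
    print(ch, ebcdic.hex().upper(), ascii_bytes.hex().upper())
# A C1 41
# B C2 42

# Substitution: characters with no match in the target representation
# are replaced during conversion. Python's stand-in is "?".
source = "caf\u00e9 \u20ac5"      # contains e-acute and the euro sign
converted = source.encode("ascii", errors="replace")
print(converted)                  # b'caf? ?5'
```

<p>Substitution loses information: converting the result back cannot recover the original characters.</p>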
<a name="wq47"></a>
<h3 id="wq47"><a href="rbafzmst02.htm#ToC_61">Character sets and code pages</a></h3>
<p>The following example shows how a typical character set might map to different
code points in two different code pages.</p>
<a name="wq48"></a>
<div class="fignone" id="wq48">
<div class="mmobj">
<img src="rv2f976.gif" alt="How a character set might map to different code points in two different code pages. Graphic described in text." /></div></div>
<p>Even with the same encoding scheme there are many different coded character
sets, and the same code point can represent a different character in different
coded character sets. Furthermore, a byte in a character string does not necessarily
represent a character from a single-byte character set (SBCS). Character strings
are also used for mixed data (a mixture of single-byte characters and double-byte
characters) and for data that is not associated with any character set (called
bit data). This is not the case with graphic strings; the database manager
assumes that every pair of bytes in every graphic string represents a character
from a double-byte character set (DBCS) or universal coded character set (UCS-2
or UTF-16).</p>
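<p>As a rough illustration of the paragraph above, the following Python sketch decodes one code point under several code pages and then splits a graphic string into byte pairs. The codec names are Python's, chosen here as approximations of EBCDIC and ASCII-based code pages; they are assumptions for this sketch, not identifiers used by the database manager.</p>

```python
# The same code point, interpreted under different code pages.
code_point = bytes([0xC1])

for code_page in ("cp037", "cp500", "latin-1", "cp1252"):
    print(code_page, code_point.decode(code_page))
# The two EBCDIC code pages give "A"; the two ASCII-based ones give
# an accented capital A.

# A graphic string: every pair of bytes represents one character.
graphic = "SQL".encode("utf-16-be")
pairs = [graphic[i:i + 2] for i in range(0, len(graphic), 2)]
pair_hex = [pair.hex().upper() for pair in pairs]
print(pair_hex)   # ['0053', '0051', '004C']
```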
<p>A coded character set identifier (CCSID) in a native encoding scheme is
one of the coded character sets in which data may be stored at that site.
A CCSID in a foreign encoding scheme is one of the coded character sets in
which data cannot be stored at that site. For example, DB2 UDB for iSeries can store data
in a CCSID with an EBCDIC encoding scheme, but not in an ASCII encoding scheme.</p>
<p>A variable containing data in a foreign encoding scheme is always converted
to a CCSID in the native encoding scheme when the variable is used in a function
or in the <span class="italic">select-list</span>. A variable containing data
in a foreign encoding scheme is also effectively converted to a CCSID in the
native encoding scheme when used in a comparison or in an operation that combines
strings. The native CCSID to which the data is converted
is based on the foreign CCSID and the default CCSID.</p>
<p>For details on character conversion, see: </p>
<ul>
<li><a href="rbafzmstch2bas.htm#craj">Conversion rules for assignments</a></li>
<li><a href="rbafzmstch2bas.htm#crcj">Conversion rules for comparison</a></li>
<li><a href="rbafzmstuuall.htm#uuall">Conversion rules for operations that combine strings</a></li>
<li><a href="rbafzmstch2drda.htm#drconsider">Data representation considerations</a>.</li></ul>
<p>If CCSID conversion is necessary to evaluate the result set
of a query, the query cannot contain:</p>
<ul>
<li>EXCEPT or INTERSECT operations,</li>
<li>OLAP specifications,</li>
<li>recursive common table expressions,</li>
<li>ORDER OF, or</li>
<li>scalar fullselects (scalar subselects are supported).</li></ul>
<a name="conccsid"></a>
<h3 id="conccsid"><a href="rbafzmst02.htm#ToC_62">Coded character sets and CCSIDs</a></h3><a id="idx188" name="idx188"></a><a id="idx189" name="idx189"></a><a id="idx190" name="idx190"></a>
<p>IBM®'s Character Data Representation Architecture (CDRA) deals with
the differences in string representation and encoding. The <span class="italic">Coded Character Set Identifier (CCSID)</span> is a key element of this architecture.
A CCSID is a 2-byte (unsigned) binary number that uniquely identifies an encoding
scheme and one or more pairs of character sets and code pages.</p>
<p>A CCSID is an attribute of strings, just as length is an attribute of strings.
All values of the same string column have the same CCSID.</p>
<p>In each database manager, character conversion involves the use of a <span class="italic">CCSID Conversion Selection Table</span>. The Conversion Selection
Table contains a list of valid source and target combinations. For each pair
of CCSIDs, the Conversion Selection Table contains information used to perform
the conversion from one coded character set to the other. This information
includes an indication of whether conversion is required. (In some cases,
no conversion is necessary even though the strings involved have different
CCSIDs.)</p>
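<p>The idea of a selection table keyed by source and target CCSID can be sketched in Python as follows. Everything in this sketch is hypothetical: the names CODECS, SELECTION_TABLE, and convert, the mapping of CCSIDs to Python codecs, and the choice of which pairs need no conversion are assumptions made for illustration. The actual table and conversion logic belong to the database manager and are not described in this topic.</p>

```python
# Hypothetical mapping from CCSIDs to Python codec names (an assumption
# for this sketch; the database manager does not use Python codecs).
CODECS = {
    37: "cp037",
    819: "latin-1",
    1208: "utf-8",
    1200: "utf-16-be",
    13488: "utf-16-be",
}

# Toy selection table: (source CCSID, target CCSID) mapped to a flag
# that indicates whether conversion is required.
SELECTION_TABLE = {
    (37, 1208): True,
    (1208, 37): True,
    (819, 1208): True,
    (13488, 1200): False,   # assumed: UCS-2 data is already valid UTF-16
}


def convert(data, source, target):
    """Convert bytes between CCSIDs, or raise if the pair is not listed."""
    if source == target:
        return data
    try:
        required = SELECTION_TABLE[(source, target)]
    except KeyError:
        raise ValueError(
            f"no conversion defined from CCSID {source} to {target}"
        ) from None
    if not required:
        return data
    text = data.decode(CODECS[source])
    # Characters with no match in the target are substituted.
    return text.encode(CODECS[target], errors="replace")


print(convert(b"\xc1\xc2", 37, 1208))   # b'AB'
```

<p>Keeping the "conversion required" flag in the table lets a pair of different CCSIDs pass data through unchanged, which mirrors the parenthetical remark above.</p>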
<p>Different types of conversions may be supported by the database manager.
Round-trip conversions attempt to preserve characters in one CCSID that are
not defined in the target CCSID so that if the data is subsequently converted
back to the original CCSID, the same original characters result. Enforced
subset match conversions do not attempt to preserve such characters. For more
information, see IBM's Character Data Representation Architecture (CDRA).</p>
<a name="wq49"></a>
<h3 id="wq49"><a href="rbafzmst02.htm#ToC_63">Default CCSID</a></h3><a id="idx191" name="idx191"></a>
<p>Every application server and application requester has a default CCSID (or default
CCSIDs in installations that support DBCS data). The CCSID of the following
types of strings is determined at the current server: </p>
<ul>
<li>String constants (including string constants that represent datetime values)
when the CCSID of the source is in a foreign encoding scheme</li>
<li>Special registers with string values (such as USER and CURRENT SERVER)</li>
<li>CAST specifications where the result is a character or graphic string</li>
<li>Results of CHAR, DATAPARTITIONNAME, DAYNAME, DBPARTITIONNAME, DIGITS,
HEX, MONTHNAME, SOUNDEX, and SPACE scalar functions</li>
<li>Results of DECRYPT_CHAR, DECRYPT_DB, CHAR, GRAPHIC, VARCHAR, and VARGRAPHIC
scalar functions when a CCSID is not specified as an argument</li>
<li>Results of the CLOB and DBCLOB scalar functions when a CCSID is not specified
as an argument<sup class="fn"><a href="rbafzmstccseta.htm#no65535">10</a></sup></li>
<li>String columns defined by the CREATE TABLE or ALTER TABLE statements when
an explicit CCSID is not specified for the column<sup class="fn"><a href="rbafzmstccseta.htm#no65535">10</a></sup></li>
<li>String parameters defined by CREATE FUNCTION or CREATE PROCEDURE
statements when an explicit CCSID is not specified for the parameter<sup class="fn"><a href="rbafzmstccseta.htm#no65535">10</a></sup></li></ul>
<p>If one of the types of strings above is used in a CREATE VIEW
statement, the default CCSID is determined at the time the view is created.</p>
<p>In a distributed application, the default CCSID of variables is determined
by the application requester. In a non-distributed application, the default
CCSID of variables is determined by the application server. On i5/OS, the default
CCSID is determined by the CCSID job attribute. For more information about
CCSIDs, see the <a href="../nls/rbagscdra.htm">Work with CCSIDs</a> topic in the Globalization
section of the iSeries Information Center.</p>
<hr /><div class="fnnum"><a id="wq40" name="wq40" href="rbafzmstccseta.htm#wq39">6</a>.</div>
<div class="fntext">Character conversion,
when required, is automatic and is transparent to the application when it
is successful. A knowledge of conversion is, therefore, unnecessary when all
the strings involved in a statement's execution are represented in the
same way. Thus, for many readers, character conversion may be irrelevant.</div><div class="fnnum"><a id="wq42" name="wq42" href="rbafzmstccseta.htm#wq41">7</a>.</div>
<div class="fntext">UTF-8 Unicode data is also mixed
data. In this book, however, mixed data refers to mixed single- and double-byte
data.</div><div class="fnnum"><a id="wq44" name="wq44" href="rbafzmstccseta.htm#wq43">8</a>.</div>
<div class="fntext">UCS-2 can contain
surrogates and combining characters; however, they are not recognized as such.
Each 16 bits is considered to be a character.</div><div class="fnnum"><a id="wq46" name="wq46" href="rbafzmstccseta.htm#wq45">9</a>.</div>
<div class="fntext">Because normalization can significantly affect
performance (from 2.5 to 25 percent extra CPU), the default in column definitions
is NOT NORMALIZED.</div><div class="fnnum"><a id="no65535" name="no65535">10</a>.</div>
<div class="fntext">If
the default CCSID is 65535, the CCSID used will be the value of the DFTCCSID
job attribute (or an associated CCSID of the DFTCCSID).</div>
<br />
<hr /><br />
[ <a href="#Top_Of_Page">Top of Page</a> | <a href="rbafzmststoragestruc.htm">Previous Page</a> | <a href="rbafzmstsortsequence.htm">Next Page</a> | <a href="rbafzmst02.htm#wq1">Contents</a> |
<a href="rbafzmstindex.htm#index">Index</a> ]
<a id="Bot_Of_Page" name="Bot_Of_Page"></a>
</body>
</html>