<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-us">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="dc.language" scheme="rfc1766" content="en-us" />
<!-- All rights reserved. Licensed Materials Property of IBM -->
<!-- US Government Users Restricted Rights -->
<!-- Use, duplication or disclosure restricted by -->
<!-- GSA ADP Schedule Contract with IBM Corp. -->
<meta name="dc.date" scheme="iso8601" content="2005-09-19" />
<meta name="copyright" content="(C) Copyright IBM Corporation 1998, 2006" />
<meta name="security" content="public" />
<meta name="Robots" content="index,follow"/>
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.icra.org/ratingsv02.html" l gen true r (cz 1 lz 1 nz 1 oz 1 vz 1) "http://www.rsac.org/ratingsv01.html" l gen true r (n 0 s 0 v 0 l 0) "http://www.classify.org/safesurf/" l gen true r (SS~~000 1))' />
<meta name="keywords" content="character conversion, character set, code page,
code point, coded character set, encoding scheme, substitution character,
Unicode, combining characters, normalization, surrogates,
CDRA (Character Data Representation Architecture),
Character Data Representation Architecture (CDRA), definition,
CCSID (coded character set identifier), default" />
<title>Character conversion</title>
<link rel="stylesheet" type="text/css" href="ibmidwb.css" />
<link rel="stylesheet" type="text/css" href="ic.css" />
</head>
<body>
<a id="Top_Of_Page" name="Top_Of_Page"></a><!-- Java sync-link -->
<script language = "Javascript" src = "../rzahg/synch.js" type="text/javascript"></script>
<a name="ccseta"></a>
<h2 id="ccseta"><a href="rbafzmst02.htm#ToC_60">Character conversion</a></h2><a id="idx168" name="idx168"></a>
<p>A <span class="italic">string</span> is a sequence of bytes that may represent
characters. Within a string, all the characters are represented by a common
coding representation. In some cases, it might be necessary to convert these
characters to a different coding representation. The process of conversion
is known as <span class="italic">character conversion</span>.<sup class="fn"><a id="wq39" name="wq39" href="rbafzmstccseta.htm#wq40">6</a></sup></p>
<p>Character conversion can occur when an SQL statement is executed remotely.
Consider, for example, these two cases: </p>
<ul>
<li>The values of variables sent from the application requester to the current
server.</li>
<li>The values of result columns sent from the current server to the application
requester.</li></ul><p class="indatacontent"> In either case, the string could have a different representation at
the sending and receiving systems. Conversion can also occur during string
operations on the same system.</p>
<p>Note that SQL statements are strings and are therefore subject to character
conversion.</p>
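<p>As a minimal sketch of what such a conversion involves, the following Python example re-encodes a value from one coding representation to another. The code pages chosen (Windows-1252 for the application requester, EBCDIC code page 37 for the current server) and the convert helper are assumptions made for illustration; they are not part of the database manager.</p>

```python
# Assumed code pages for illustration only: the requester uses Windows-1252
# (an ASCII-based code page) and the server uses EBCDIC code page 37.
REQUESTER_CODEC = "cp1252"
SERVER_CODEC = "cp037"


def convert(data, source_codec, target_codec):
    """Re-encode a byte string from one coding representation to another."""
    return data.decode(source_codec).encode(target_codec)


# A value as the requester holds it.
sent = "SELECT".encode(REQUESTER_CODEC)
# The same value after conversion for the server.
received = convert(sent, REQUESTER_CODEC, SERVER_CODEC)

print(sent.hex().upper())      # 53454C454354
print(received.hex().upper())  # E2C5D3C5C3E3
```

<p>The characters are the same on both sides; only the bytes that represent them differ. Converting in the opposite direction restores the original bytes.</p>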
<p>The following list defines some of the terms used when discussing character
conversion. </p>
<dl>
<dt class="bold">character set</dt><a id="idx169" name="idx169"></a><a id="idx170" name="idx170"></a>
<dd>A defined set of characters. For example, the following character set
appears in several code pages:
<ul>
<li>26 non-accented letters A through Z</li>
<li>26 non-accented letters a through z</li>
<li>digits 0 through 9</li>
<li>. , : ; ? ( ) ' " / - _ &amp; + % * = &lt; &gt;</li></ul>
</dd>
<dt class="bold">code page</dt><a id="idx171" name="idx171"></a><a id="idx172" name="idx172"></a>
<dd>A set of assignments of characters to code points. In EBCDIC, for example, <span>"A"</span> is assigned code point <span class="hex">X'C1'</span> and <span>"B"</span> is assigned
code point <span class="hex">X'C2'</span>. Within a code page, each code point has only one
specific meaning.
</dd>
<dt class="bold">code point</dt><a id="idx173" name="idx173"></a><a id="idx174" name="idx174"></a>
<dd>A unique bit pattern that represents a character within a code page.
</dd>
<dt class="bold">coded character set</dt><a id="idx175" name="idx175"></a>
<dd>A set of unambiguous rules that establish a character set and the one-to-one
relationships between the characters of the set and their coded representations.
</dd>
<dt class="bold">encoding scheme</dt><a id="idx176" name="idx176"></a><a id="idx177" name="idx177"></a>
<dd>A set of rules used to represent character data. For example:
<ul>
<li>Single-byte EBCDIC</li>
<li>Single-byte ASCII</li>
<li>Double-byte EBCDIC</li>
<li>Mixed single- and double-byte ASCII<sup class="fn"><a id="wq41" name="wq41" href="rbafzmstccseta.htm#wq42">7</a></sup></li>
<li>Unicode (UTF-8, UCS-2, and UTF-16 universal coded character sets).</li></ul>
</dd>
<dt class="bold">substitution character</dt><a id="idx178" name="idx178"></a><a id="idx179" name="idx179"></a>
<dd>A unique character that is substituted during character conversion for
any characters in the source coding representation that do not have a match
in the target coding representation.
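<p>The effect of a substitution character can be sketched in Python, assuming ASCII as the target coding representation. The "?" that Python inserts is Python's own choice of substitute and stands in here for whatever substitution character the target coded character set defines.</p>

```python
# The euro sign (U+20AC) has no code point in ASCII.
source = "Price: 5\u20ac"

# errors="replace" substitutes "?" for every character that has no match in
# the target encoding. The "?" is Python's choice of substitute, used here
# only as a stand-in for a substitution character.
converted = source.encode("ascii", errors="replace")

print(converted)  # b'Price: 5?'

# The substitution loses information: converting back does not restore the source.
print(converted.decode("ascii") == source)  # False
```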
</dd>
<dt class="bold">Unicode</dt><a id="idx180" name="idx180"></a><a id="idx181" name="idx181"></a>
<dd>A universal encoding scheme for written characters and text that enables
the exchange of data internationally. It provides a character set standard
that can be used all over the world. It uses a 16-bit encoding form that provides
code points for more than 65,000 characters and an extension called UTF-16
that allows for encoding as many as a million more characters. It provides
the ability to encode all characters used for the written languages of the
world and treats alphabetic characters, ideographic characters, and symbols
equivalently because it specifies a numeric value and a name for each of its
characters. It includes punctuation marks, mathematical symbols, technical
symbols, geometric shapes, and dingbats. Three encoding forms are supported:
<ul>
<li>UTF-8: Unicode Transformation Format, an 8-bit encoding form designed for
ease of use with existing ASCII-based systems. UTF-8 data is stored in character
data types. The CCSID value for data in UTF-8 format is 1208.
<p>A UTF-8 character
can be 1, 2, 3, or 4 bytes in length. A UTF-8 data string can contain any combination
of SBCS and DBCS data, including surrogates and combining characters.</p></li>
<li>UCS-2: Universal Character Set coded in 2 octets, which means that characters
are represented in 16 bits per character. UCS-2 data is stored in graphic
data types. The CCSID value for data in UCS-2 format is 13488.
<p>UCS-2 is
a subset of UTF-16. UCS-2 is identical to UTF-16 except that UTF-16 also supports
combining characters and surrogates. Since UCS-2 is a simpler form of UTF-16,
UCS-2 data will typically perform better than UTF-16.<sup class="fn"><a id="wq43" name="wq43" href="rbafzmstccseta.htm#wq44">8</a></sup></p></li>
<li>UTF-16: Unicode Transformation Format, a 16-bit encoding form designed
to provide code values for over a million characters. UTF-16 data is stored
in graphic data types. The CCSID value for data in UTF-16 format is 1200.<a id="idx182" name="idx182"></a><a id="idx183" name="idx183"></a><a id="idx184" name="idx184"></a><a id="idx185" name="idx185"></a>
<p>Both UTF-8 and UTF-16 data can contain <var class="pv">combining characters</var>. Combining character support allows a resulting
character to be composed of more than one character. After the first character,
hundreds of different non-spacing accent characters (umlauts, accents, etc.)
can follow in the data string. The resulting character may already be defined
in the character set. In this case, there are multiple representations for
the same character. For example, in UTF-16, an <span class="italic">&eacute;</span> can
be represented either by X'00E9' (the normalized representation) or X'00650301'
(the non-normalized combining character representation).</p>
<p>Since multiple
representations of the same character will not compare equal, it is usually
not a good idea to store both forms of the characters in the database. <var class="pv">Normalization</var> is a process that replaces the string of combining characters
with equivalent characters that do not include combining characters. After
normalization has occurred, only one representation of any specific character
will exist in the data. For example, in UTF-16, any instances of X'00650301'
(the non-normalized combining character representation of <span class="italic">&eacute;</span>) will be converted to X'00E9' (the normalized representation of <span class="italic">&eacute;</span>).<sup class="fn"><a id="wq45" name="wq45" href="rbafzmstccseta.htm#wq46">9</a></sup></p><a id="idx186" name="idx186"></a><a id="idx187" name="idx187"></a>
<p>Both UTF-8 and UTF-16 can contain 4-byte characters called <var class="pv">surrogates</var>. Surrogates are 4-byte sequences that can address one million more characters
than would be available in a 2-byte character set.</p></li></ul>
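<p>The byte lengths and surrogates described above can be observed with a short Python sketch. The sample characters are arbitrary choices for illustration, and big-endian UTF-16 without a byte order mark is assumed so that the output matches the X'...' notation used in this topic.</p>

```python
samples = {
    "A": "A",                # basic Latin letter
    "e-acute": "\u00e9",     # precomposed e with acute accent
    "euro": "\u20ac",        # euro sign
    "G clef": "\U0001d11e",  # musical symbol outside the 16-bit range
}

for name, ch in samples.items():
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    print(f"{name:8} UTF-8: {len(utf8)} byte(s)  UTF-16: X'{utf16.hex().upper()}'")

# The G clef (U+1D11E) takes 4 bytes in UTF-8 and is stored in UTF-16 as the
# surrogate pair X'D834DD1E'.
```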
</dd>
</dl>
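<p>The normalization described under UTF-16 above can be sketched with Python's unicodedata module. The sketch assumes Unicode Normalization Form C (NFC) as the normalization that replaces combining sequences with precomposed characters; this topic does not name the form that the database manager uses.</p>

```python
import unicodedata

composed = "\u00e9"    # X'00E9', the normalized representation
combining = "e\u0301"  # X'00650301', e followed by a combining acute accent

print(combining.encode("utf-16-be").hex().upper())  # 00650301
print(composed == combining)                        # False

# NFC replaces combining sequences with precomposed characters where they exist.
normalized = unicodedata.normalize("NFC", combining)
print(normalized == composed)                       # True
```

<p>Before normalization the two strings do not compare equal even though they display as the same character; after normalization only one representation remains.</p>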
<a name="wq47"></a>
<h3 id="wq47"><a href="rbafzmst02.htm#ToC_61">Character sets and code pages</a></h3>
<p>The following example shows how a typical character set might map to different
code points in two different code pages.</p>
<a name="wq48"></a>
<div class="fignone" id="wq48">
<div class="mmobj">
<img src="rv2f976.gif" alt="How a character set might map to different code points in two different code pages. Graphic described in text." /></div></div>
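<p>A comparable mapping can be printed with Python's built-in codecs. The two EBCDIC code pages used here (37 and 500) and the code_points helper are assumptions chosen for illustration and are not necessarily the code pages shown in the figure.</p>

```python
CODE_PAGES = ("cp037", "cp500")
CHARACTERS = ("A", "!", "[", "]")


def code_points(ch):
    """Return the code point assigned to ch in each code page, as hex text."""
    return {page: ch.encode(page).hex().upper() for page in CODE_PAGES}


for ch in CHARACTERS:
    points = ", ".join(f"{page}: X'{value}'" for page, value in code_points(ch).items())
    print(f"{ch!r}: {points}")
```

<p>The letter A is assigned the same code point in both code pages, while the brackets and the exclamation mark are not.</p>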
<p>Even with the same encoding scheme there are many different coded character
sets, and the same code point can represent a different character in different
coded character sets. Furthermore, a byte in a character string does not necessarily
represent a character from a single-byte character set (SBCS). Character strings
are also used for mixed data (a mixture of single-byte characters and double-byte
characters) and for data that is not associated with any character set (called
bit data). This is not the case with graphic strings; the database manager
assumes that every pair of bytes in every graphic string represents a character
from a double-byte character set (DBCS) or universal coded character set (UCS-2
or UTF-16).</p>
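<p>Both points can be sketched in Python. The first part decodes the same byte under two EBCDIC code pages (37 and 500, chosen as assumptions for illustration). The second part reads a graphic string two bytes at a time, approximating UCS-2 with big-endian UTF-16.</p>

```python
# Same code point, different characters: X'5A' in two EBCDIC code pages.
decoded = {codec: b"\x5a".decode(codec) for codec in ("cp037", "cp500")}
print(decoded)  # cp037 gives '!', cp500 gives ']'

# A graphic string is interpreted two bytes at a time. UCS-2 is approximated
# here by big-endian UTF-16 restricted to characters without surrogates.
graphic = "DB2".encode("utf-16-be")
pairs = [graphic[i:i + 2].hex().upper() for i in range(0, len(graphic), 2)]
print(pairs)  # ['0044', '0042', '0032']
```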
<p>A coded character set identifier (CCSID) in a native encoding scheme is
one of the coded character sets in which data may be stored at that site.
A CCSID in a foreign encoding scheme is one of the coded character sets in
which data cannot be stored at that site. For example, DB2 UDB for iSeries can store data
in a CCSID with an EBCDIC encoding scheme, but not in an ASCII encoding scheme.</p>
<p>A variable containing data in a foreign encoding scheme is always converted
to a CCSID in the native encoding scheme when the variable is used in a function
or in the <span class="italic">select-list</span>. A variable containing data
in a foreign encoding scheme is also effectively converted to a CCSID in the
native encoding scheme when used in a comparison or in an operation that combines
strings. Which CCSID in the native encoding scheme the data is converted to
is based on the foreign CCSID and the default CCSID.</p>
<p>For details on character conversion, see: </p>
<ul>
<li><a href="rbafzmstch2bas.htm#craj">Conversion rules for assignments</a></li>
<li><a href="rbafzmstch2bas.htm#crcj">Conversion rules for comparison</a></li>
<li><a href="rbafzmstuuall.htm#uuall">Conversion rules for operations that combine strings</a></li>
<li><a href="rbafzmstch2drda.htm#drconsider">Data representation considerations</a></li></ul>
<p>If CCSID conversion is necessary to evaluate the result set
of a query, the query cannot contain:</p>
<ul>
<li>EXCEPT or INTERSECT operations,</li>
<li>OLAP specifications,</li>
<li>recursive common table expressions,</li>
<li>ORDER OF, or</li>
<li>scalar fullselects (scalar subselects are supported).</li></ul>
<a name="conccsid"></a>
<h3 id="conccsid"><a href="rbafzmst02.htm#ToC_62">Coded character sets and CCSIDs</a></h3><a id="idx188" name="idx188"></a><a id="idx189" name="idx189"></a><a id="idx190" name="idx190"></a>
<p>IBM&reg;'s Character Data Representation Architecture (CDRA) deals with
the differences in string representation and encoding. The <span class="italic">Coded Character Set Identifier (CCSID)</span> is a key element of this architecture.
A CCSID is a 2-byte (unsigned) binary number that uniquely identifies an encoding
scheme and one or more pairs of character sets and code pages.</p>
<p>A CCSID is an attribute of strings, just as length is an attribute of strings.
All values of the same string column have the same CCSID.</p>
<p>In each database manager, character conversion involves the use of a <span class="italic">CCSID Conversion Selection Table</span>. The Conversion Selection
Table contains a list of valid source and target combinations. For each pair
of CCSIDs, the Conversion Selection Table contains information used to perform
the conversion from one coded character set to the other. This information
includes an indication of whether conversion is required. (In some cases,
no conversion is necessary even though the strings involved have different
CCSIDs.)</p>
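<p>The idea of a selection table can be sketched as a lookup keyed by source and target CCSID. Everything in this sketch is hypothetical: the table contents, the mapping from CCSIDs to Python codecs, the convert function, and the assumption that UCS-2 data (CCSID 13488) can be used as UTF-16 (CCSID 1200) without conversion. It does not describe the actual table used by the database manager.</p>

```python
# Hypothetical mapping from CCSIDs to Python codecs (an assumption for this
# sketch; UCS-2 is approximated by big-endian UTF-16).
CODECS = {37: "cp037", 1208: "utf-8", 1200: "utf-16-be", 13488: "utf-16-be"}

# Hypothetical selection table. Key: (source CCSID, target CCSID).
# Value: whether conversion is required for that pair.
SELECTION_TABLE = {
    (37, 1208): True,
    (1208, 37): True,
    (37, 1200): True,
    (13488, 1200): False,  # assumed here: UCS-2 data is already valid UTF-16
}


def convert(data, source, target):
    """Convert data between CCSIDs if the pair is listed in the table."""
    try:
        required = SELECTION_TABLE[(source, target)]
    except KeyError:
        raise ValueError(f"no conversion defined from CCSID {source} to {target}") from None
    if not required:
        return data  # different CCSIDs, but the bytes can be used as they are
    return data.decode(CODECS[source]).encode(CODECS[target])


print(convert(b"\xc1\xc2", 37, 1208))  # b'AB'
```

<p>A pair that is missing from the table is rejected, which mirrors the statement that the table lists the valid source and target combinations.</p>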
<p>Different types of conversions may be supported by the database manager.
Round-trip conversions attempt to preserve characters in one CCSID that are
not defined in the target CCSID so that if the data is subsequently converted
back to the original CCSID, the same original characters result. Enforced
subset match conversions do not attempt to preserve such characters. For more
information, see IBM's Character Data Representation Architecture (CDRA).</p>
<a name="wq49"></a>
<h3 id="wq49"><a href="rbafzmst02.htm#ToC_63">Default CCSID</a></h3><a id="idx191" name="idx191"></a>
<p>Every application server and application requester has a default CCSID (or default
CCSIDs in installations that support DBCS data). The CCSID of the following
types of strings is determined at the current server: </p>
<ul>
<li>String constants (including string constants that represent datetime values)
when the CCSID of the source is in a foreign encoding scheme</li>
<li>Special registers with string values (such as USER and CURRENT SERVER)</li>
<li>CAST specifications where the result is a character or graphic string</li>
<li>Results of CHAR, DATAPARTITIONNAME, DAYNAME, DBPARTITIONNAME, DIGITS,
HEX, MONTHNAME, SOUNDEX, and SPACE scalar functions</li>
<li>Results of DECRYPT_CHAR, DECRYPT_DB, CHAR, GRAPHIC, VARCHAR, and VARGRAPHIC
scalar functions when a CCSID is not specified as an argument</li>
<li>Results of the CLOB and DBCLOB scalar functions when a CCSID is not specified
as an argument<sup class="fn"><a href="rbafzmstccseta.htm#no65535">10</a></sup></li>
<li>String columns defined by the CREATE TABLE or ALTER TABLE statements when
an explicit CCSID is not specified for the column<sup class="fn"><a href="rbafzmstccseta.htm#no65535">10</a></sup></li>
<li>String parameters defined by CREATE FUNCTION or CREATE PROCEDURE
statements when an explicit CCSID is not specified for the parameter<sup class="fn"><a href="rbafzmstccseta.htm#no65535">10</a></sup></li></ul>
<p>If one of the types of strings above is used in a CREATE VIEW
statement, the default CCSID is determined at the time the view is created.</p>
<p>In a distributed application, the default CCSID of variables is determined
by the application requester. In a non-distributed application, the default
CCSID of variables is determined by the application server. On i5/OS, the default
CCSID is determined by the CCSID job attribute. For more information about
CCSIDs, see the <a href="../nls/rbagscdra.htm">Work with CCSIDs</a> topic in the Globalization
section of the iSeries Information Center.</p>
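<p>For the string types that carry footnote 10, the fallback from a default CCSID of 65535 to the DFTCCSID job attribute can be sketched as follows. The function and parameter names are hypothetical stand-ins for the job attributes, and the sketch ignores the case where an associated CCSID of the DFTCCSID is used instead.</p>

```python
CCSID_65535 = 65535


def effective_ccsid(job_ccsid, dftccsid):
    """Return the CCSID to use when no explicit CCSID is specified.

    job_ccsid and dftccsid are hypothetical stand-ins for the CCSID and
    DFTCCSID job attributes.
    """
    if job_ccsid == CCSID_65535:
        return dftccsid
    return job_ccsid


print(effective_ccsid(65535, 37))  # 37
print(effective_ccsid(273, 37))    # 273
```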
<hr /><div class="fnnum"><a id="wq40" name="wq40" href="rbafzmstccseta.htm#wq39">6</a>.</div>
<div class="fntext">Character conversion,
when required, is automatic and is transparent to the application when it
is successful. A knowledge of conversion is, therefore, unnecessary when all
the strings involved in a statement's execution are represented in the
same way. Thus, for many readers, character conversion may be irrelevant.</div><div class="fnnum"><a id="wq42" name="wq42" href="rbafzmstccseta.htm#wq41">7</a>.</div>
<div class="fntext">UTF-8 Unicode data is also mixed
data. In this book, however, mixed data refers to mixed single- and double-byte
data.</div><div class="fnnum"><a id="wq44" name="wq44" href="rbafzmstccseta.htm#wq43">8</a>.</div>
<div class="fntext">UCS-2 can contain
surrogates and combining characters; however, they are not recognized as such.
Each 16-bit unit is considered to be a character.</div><div class="fnnum"><a id="wq46" name="wq46" href="rbafzmstccseta.htm#wq45">9</a>.</div>
<div class="fntext">Since normalization can significantly affect
performance (from 2.5 to 25 percent extra CPU), the default in column definitions
is NOT NORMALIZED.</div><div class="fnnum"><a id="no65535" name="no65535">10</a>.</div>
<div class="fntext">If
the default CCSID is 65535, the CCSID used will be the value of the DFTCCSID
job attribute (or an associated CCSID of the DFTCCSID).</div>
<br />
<hr /><br />
[ <a href="#Top_Of_Page">Top of Page</a> | <a href="rbafzmststoragestruc.htm">Previous Page</a> | <a href="rbafzmstsortsequence.htm">Next Page</a> | <a href="rbafzmst02.htm#wq1">Contents</a> |
<a href="rbafzmstindex.htm#index">Index</a> ]
<a id="Bot_Of_Page" name="Bot_Of_Page"></a>
</body>
</html>