Unicode considerations for database files

Unicode is a universal encoding scheme for written characters and text that enables the international exchange of data. This topic describes how to specify DDS positions 30 through 37 and positions 45 through 80 when describing database files. Positions not mentioned have no special considerations for Unicode.

A Unicode field can contain all types of characters used on an IBM® iSeries™ server, including double-byte character set (DBCS) characters. Unicode data is composed of code units, which represent the minimal byte combination that can represent a unit of text.

There are three transformation formats (encoding forms) of Unicode that are supported with physical and logical file DDS:

UTF-8 is an 8-bit encoding form designed for ease of use with existing ASCII-based systems. UTF-8 data is stored in character data types. The CCSID value for data in UTF-8 format is 1208.
A UTF-8 code unit is 1 byte in length. A UTF-8 character can be 1, 2, 3, or 4 code units in length. A UTF-8 data string can contain any character, including surrogates and combining characters.
UTF-16 is a 16-bit encoding form designed to provide code values for over a million characters. It is a superset of UCS-2. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.
A UTF-16 code unit is 2 bytes in length. A UTF-16 character can be 1 or 2 code units (2 or 4 bytes) in length. A UTF-16 data string can contain any character, including UTF-16 surrogates and combining characters.
UCS-2 is the Universal Character Set coded in 2 octets, which means that each character is represented in 16 bits. UCS-2 data is stored in graphic data types. The CCSID value for data in UCS-2 format is 13488.
UCS-2 is a subset of UTF-16 and can no longer represent all of the characters defined by Unicode. UCS-2 is identical to UTF-16, except that UTF-16 also supports combining characters and surrogates. If you do not need support for combining characters and surrogates, you can choose the UCS-2 type, because more database functionality is available for it.
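
The code-unit counts described above can be checked outside of DDS. The following Python sketch is illustrative only and is not part of DDS or the iSeries system; the function name code_units is hypothetical. It counts the UTF-8 and UTF-16 code units that a string needs, and reports whether every character fits in a single 16-bit code unit. That last check only detects characters that need surrogates; it does not detect combining characters.

```python
def code_units(text: str) -> dict:
    """Count the code units that text needs in UTF-8 and UTF-16.

    Hypothetical helper for illustration; not an IBM API.
    """
    # A UTF-8 code unit is 1 byte, so the byte count is the unit count.
    utf8_units = len(text.encode("utf-8"))
    # A UTF-16 code unit is 2 bytes; "utf-16-be" emits no byte order mark.
    utf16_units = len(text.encode("utf-16-be")) // 2
    # Characters above U+FFFF need a surrogate pair in UTF-16.
    no_surrogates = all(ord(ch) <= 0xFFFF for ch in text)
    return {
        "utf8": utf8_units,
        "utf16": utf16_units,
        "no_surrogates": no_surrogates,
    }


if __name__ == "__main__":
    samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
    for sample in samples:
        print(f"U+{ord(sample):04X}", code_units(sample))
```

The four samples cover the 1, 2, 3, and 4 code unit cases for UTF-8. Only the last sample needs 2 code units (4 bytes) in UTF-16.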

Note: In this topic, references to UTF-16 imply UCS-2 as well.

Length (positions 30 through 34)
Specify the length of the field in these positions. The length of a field containing UTF-16 data can range from 1 through 16 383 code units. The length of a field containing UTF-8 data can range from 1 through 32 766 code units.
Data type (position 35)
The valid data types for Unicode data are G (graphic), which is used for UTF-16 and UCS-2 data, and A (character), which is used for UTF-8 data.
Decimal positions (positions 36 and 37)
Leave these positions blank when using Unicode data.
Keyword considerations (positions 45 through 80)
This topic describes how Unicode data is used with some DDS keywords.
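
Because field lengths are expressed in code units rather than characters, it can help to check in advance whether a value fits a planned field. The following Python sketch assumes the limits stated under Length (16 383 code units for UTF-16 and 32 766 for UTF-8). The names MAX_UNITS, units_needed, and fits_field are hypothetical and are not part of DDS.

```python
# Maximum field lengths in code units, taken from the Length section.
MAX_UNITS = {"utf-8": 32766, "utf-16": 16383}


def units_needed(text: str, encoding: str) -> int:
    """Return the number of code units that text needs in the encoding."""
    if encoding == "utf-8":
        return len(text.encode("utf-8"))  # 1 byte per code unit
    if encoding == "utf-16":
        return len(text.encode("utf-16-be")) // 2  # 2 bytes per code unit
    raise ValueError(f"unsupported encoding: {encoding}")


def fits_field(text: str, encoding: str, field_length: int) -> bool:
    """Check whether text fits a field of field_length code units."""
    needed = units_needed(text, encoding)  # also validates the encoding
    if not 1 <= field_length <= MAX_UNITS[encoding]:
        raise ValueError(f"invalid field length for {encoding}: {field_length}")
    return needed <= field_length


if __name__ == "__main__":
    print(fits_field("abc", "utf-8", 3))
    print(fits_field("\u20ac", "utf-8", 2))
    print(fits_field("\U0001F600", "utf-16", 2))
```

A field sized by counting characters can be too short: the euro sign is one character but needs 3 UTF-8 code units, and a character outside the 16-bit range needs 2 UTF-16 code units.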
