Unicode considerations for display files

Unicode is a universal encoding scheme for written characters and text that enables the international exchange of data. DDS supports two transformation formats of Unicode: UTF-16 and UCS-2.

A Unicode field in a display file can contain UCS-2 or UTF-16 data. Unicode data is composed of code units. A code unit is the smallest byte combination that can represent a unit of text.

There are two transformation formats (encoding forms) of Unicode that are supported with DDS:

UTF-16 is a 16-bit encoding form that is designed to provide code values for over a million characters and is a superset of UCS-2. UTF-16 data is stored in graphic data types. The CCSID value for data in UTF-16 format is 1200.
A UTF-16 code unit is 2 bytes in length. A UTF-16 character can be 1 or 2 code units (2 or 4 bytes) in length. A UTF-16 data string can contain any character including UTF-16 surrogates and combining characters.
UCS-2 is the Universal Character Set coded in 2 octets, which means that each character is represented in 16 bits. In this topic, the size of a UCS-2 character is described as one code unit. UCS-2 data is stored in graphic data types. The CCSID value for data in UCS-2 format is 13488.
UCS-2 is a subset of UTF-16 and can no longer support all of the characters defined by Unicode. UCS-2 is identical to UTF-16 except that UTF-16 also supports combining characters and surrogates. If you do not need support for combining characters and surrogates, you can continue to use the UCS-2 format.
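
The relationship between characters and code units described above can be illustrated with a short sketch. It is written in Python purely for illustration and is not part of DDS or any IBM i API. The function names `utf16_code_units` and `fits_in_ucs2` are hypothetical, and `fits_in_ucs2` checks only whether surrogates would be needed; it does not detect combining characters.

```python
def utf16_code_units(text: str) -> int:
    """Return how many 2-byte UTF-16 code units are needed to encode text."""
    # "utf-16-be" emits big-endian 16-bit code units without a byte order mark.
    return len(text.encode("utf-16-be")) // 2


def fits_in_ucs2(text: str) -> bool:
    """Return True if no character in text requires a surrogate pair.

    Illustrative check only: a character fits in one code unit when its
    code point is at most U+FFFF and is not itself a surrogate value.
    """
    return all(
        ord(ch) <= 0xFFFF and not 0xD800 <= ord(ch) <= 0xDFFF
        for ch in text
    )


if __name__ == "__main__":
    samples = {
        "Latin letter": "A",                  # 1 code unit, 2 bytes
        "supplementary character": "\U0001F600",  # 2 code units, 4 bytes
    }
    for label, text in samples.items():
        print(label, utf16_code_units(text), fits_in_ucs2(text))
```

A character outside the 16-bit range, such as U+1F600, needs two code units (a surrogate pair), which is why it can be stored in a UTF-16 field but not represented in UCS-2.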

Unicode data is not supported on display devices that currently support the 5250 data stream. Therefore, conversions between the Unicode data and EBCDIC are necessary during input and output. On output, the Unicode data is converted to the CCSID of the device. On input, the data is converted from the device CCSID to the Unicode CCSID.
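
The output and input conversions described above can be approximated with the following sketch. It assumes, for illustration only, that the device CCSID is 37 (represented here by Python's `cp037` codec) and that the Unicode data is held as big-endian 16-bit code units. The function names `to_device` and `from_device` are hypothetical, and Python's `?` substitution stands in for whatever replacement character the real conversion would produce.

```python
# Assumptions for illustration: the device CCSID is 37 (EBCDIC, Python codec
# "cp037") and the Unicode field holds big-endian 16-bit code units.
DEVICE_CODEC = "cp037"
UNICODE_CODEC = "utf-16-be"


def to_device(unicode_bytes: bytes) -> bytes:
    """Output direction: convert Unicode field data to the device CCSID."""
    text = unicode_bytes.decode(UNICODE_CODEC)
    # Characters with no mapping in the device CCSID are substituted.
    # Python substitutes "?"; a real device conversion may differ.
    return text.encode(DEVICE_CODEC, errors="replace")


def from_device(device_bytes: bytes) -> bytes:
    """Input direction: convert device data back to the Unicode CCSID."""
    text = device_bytes.decode(DEVICE_CODEC)
    return text.encode(UNICODE_CODEC)


if __name__ == "__main__":
    field = "Hi".encode(UNICODE_CODEC)
    on_screen = to_device(field)
    print(on_screen.hex())                 # EBCDIC bytes sent to the device
    print(from_device(on_screen) == field)  # round trip preserves mappable text
```

Text that maps cleanly to the device CCSID survives the round trip, while a character with no mapping is replaced on output and cannot be recovered on input.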

Because the device CCSID, which is determined from the device configuration, determines what the Unicode data is converted to, the converted data can appear differently on different devices. For example, a Unicode code unit that maps to an SBCS character is displayed as an SBCS character on a DBCS-capable or SBCS-capable device, but as a DBCS replacement character on a graphic-DBCS-capable device. A Unicode code unit that maps to a DBCS character is displayed as a graphic-DBCS character on a graphic-DBCS-capable device, as a bracketed DBCS character (enclosed in a shift-out and shift-in) on a DBCS device, and as an SBCS replacement character on an SBCS device.
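
The device-dependent behavior described above can be summarized as a lookup table. The sketch below simply restates those cases in Python. The names `CharKind`, `Device`, and `displayed_as` are hypothetical and exist only for this illustration.

```python
from enum import Enum


class CharKind(Enum):
    """What the Unicode code unit maps to after conversion."""
    SBCS = "SBCS"
    DBCS = "DBCS"


class Device(Enum):
    """Capability of the display device."""
    SBCS = "SBCS"
    DBCS = "DBCS"
    GRAPHIC_DBCS = "graphic-DBCS"


# One entry per case described in the text above.
RENDERING = {
    (CharKind.SBCS, Device.SBCS): "SBCS character",
    (CharKind.SBCS, Device.DBCS): "SBCS character",
    (CharKind.SBCS, Device.GRAPHIC_DBCS): "DBCS replacement character",
    (CharKind.DBCS, Device.SBCS): "SBCS replacement character",
    (CharKind.DBCS, Device.DBCS): "bracketed DBCS character (shift-out/shift-in)",
    (CharKind.DBCS, Device.GRAPHIC_DBCS): "graphic-DBCS character",
}


def displayed_as(kind: CharKind, device: Device) -> str:
    """Return how a converted code unit appears on the given device."""
    return RENDERING[(kind, device)]


if __name__ == "__main__":
    for (kind, device), result in RENDERING.items():
        print(f"{kind.value} character on {device.value} device: {result}")
```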

It is also recommended that you initialize all Unicode-capable fields in the output buffer before writing them to the screen. Unpredictable results might occur if default initialization is allowed to take place.