WorldServer Encodings

Character encoding determines how the stored bytes of content are interpreted as characters, and therefore how the content is displayed. You select the encoding of content associated with a particular locale. You also select the encoding when exporting and importing a terminology database.

When you create a locale, you choose the encoding from the Encoding list, which presents the most common encodings at the top.

The following table describes the most common encodings used in WorldServer.

Encoding        Description
ISO-8859-1      Western European languages
UTF-8           Unicode – can display any language
UTF-16          Unicode – can display any language
Shift_JIS       Japanese
EUC-JP          Japanese
Big5            Traditional Chinese
GB2312          Simplified Chinese
ASCII           English
Windows-1252    Western European languages
EUC-KR          Korean
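
As a rough illustration of what the table implies, the following Python sketch checks whether a sample string can be represented in a given encoding. It is not WorldServer code: the helper name can_encode is hypothetical, and Python's codec for each name is assumed to behave like the encoding WorldServer lists under the same name.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    japanese = "日本語"
    # Print, for each encoding, whether it can hold the Japanese sample.
    for name in ("ISO-8859-1", "UTF-8", "UTF-16", "Shift_JIS", "EUC-JP", "ASCII"):
        print(name, can_encode(japanese, name))
```

A check like this can help confirm that the encoding chosen for a locale covers the characters its content actually uses.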

Certain other areas of WorldServer are affected by character encodings. The general.properties file contains several settings that control encodings, as described in the following paragraphs. Please note that changing general.properties usually requires access to the filesystem of the WorldServer installation computer.

Whenever WorldServer creates a ZIP file, it encodes the filenames in the ZIP archive using UTF-8. Some localized versions of Windows, however, use a different encoding to read the ZIP file. If the encodings do not match, extracted filenames that contain non-ASCII characters may be corrupted. To avoid this issue, it is best not to use non-ASCII characters in your filenames.

If your filenames must contain non-ASCII characters, you can work around the problem by configuring the zip_encoding setting in the general.properties file. Change this setting from the default (UTF-8) to the encoding of the end users' local systems. For example, use MS932 for Japanese systems. Note that this is a system-wide setting for WorldServer, so you may not want to change it if your end users work with a variety of localized versions of Windows.
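
The mismatch described above can be reproduced at the byte level. The following Python sketch is only an illustration, not how WorldServer or any unzip tool is implemented: the function read_name_as is hypothetical, and Python's cp932 codec is assumed to stand in for MS932.

```python
def read_name_as(name: str, written_as: str, read_as: str) -> str:
    """Encode a filename one way and decode it another, as a mismatched unzip tool would."""
    raw = name.encode(written_as)
    # errors="replace" substitutes U+FFFD for byte sequences the reader cannot decode.
    return raw.decode(read_as, errors="replace")


if __name__ == "__main__":
    original = "報告書.txt"
    # Written as UTF-8 but read as cp932: the name comes back garbled.
    print(read_name_as(original, "utf-8", "cp932"))
    # Written and read with the same encoding: the name survives.
    print(read_name_as(original, "cp932", "cp932"))
```

ASCII-only names come through unchanged in either case, which is why avoiding non-ASCII filenames sidesteps the problem.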

Another property in general.properties, csv_encoding, governs the encoding WorldServer uses for writing Unicode characters when exporting scoping reports in CSV format. When a CSV file is opened in Excel, the text renders properly only if the file uses the encoding that Excel expects. For example, on Japanese Windows, Excel assumes that CSV files are encoded using Windows-932. To export in that encoding, uncomment the line # csv_encoding=Windows-932 and comment out the line csv_encoding=utf-8.
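
To show why the export encoding matters, the following Python sketch serializes the same rows under two encodings. It is an illustration under stated assumptions, not WorldServer's export code: export_csv and the sample rows are hypothetical, and Python's cp932 codec is assumed to correspond to Windows-932.

```python
import csv
import io


def export_csv(rows, encoding: str) -> bytes:
    """Serialize rows as CSV text, then encode the text with the given encoding."""
    out = io.StringIO(newline="")
    csv.writer(out).writerows(rows)
    return out.getvalue().encode(encoding)


if __name__ == "__main__":
    rows = [["ファイル", "語数"], ["a.html", "120"]]
    as_utf8 = export_csv(rows, "utf-8")
    as_cp932 = export_csv(rows, "cp932")
    # The same report yields different bytes, so the reader must know which encoding was used.
    print(as_utf8 != as_cp932)
    print(as_cp932.decode("cp932"))
```

A spreadsheet application that assumes one encoding while the file was written in the other would decode these bytes into the wrong characters.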

The property win1252_conversion in the general.properties file controls whether WorldServer should map character entities in the 128-159 range from Windows-1252 encoding to their proper Unicode counterparts. According to the HTML specification, numeric entities should be coded using Unicode values, and WorldServer normally expects Unicode values as well. However, sometimes entities are coded using Windows-1252, and many Web browsers display them correctly. (Windows-1252 corresponds exactly to Unicode for many characters. In the 128-159 character range, though, the values differ. For example, the Euro character is represented by the numeric value 128 in Windows-1252 and by the numeric value 8364 in Unicode.) If your HTML file uses numeric entities in this range encoded in Windows-1252, WorldServer cannot display them correctly by default. To correct the display, set win1252_conversion to true, so WorldServer converts these characters into their Unicode counterparts during translation.
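
The kind of mapping that win1252_conversion enables can be sketched in Python as follows. This is a minimal illustration, not WorldServer's implementation: the function name convert_win1252_entities is hypothetical, it handles only decimal entities, and it leaves unchanged the five values in the range that Python's cp1252 codec treats as undefined (how WorldServer handles those is not stated here).

```python
import re


def convert_win1252_entities(html: str) -> str:
    """Rewrite decimal entities in the 128-159 range from Windows-1252 to Unicode values."""

    def fix(match: re.Match) -> str:
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                # Interpret the value as a Windows-1252 byte to find the intended character.
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # Undefined in Python's cp1252 table; leave the entity as written.
                return match.group(0)
            return f"&#{ord(char)};"
        return match.group(0)

    return re.sub(r"&#(\d+);", fix, html)


if __name__ == "__main__":
    # The Euro sign: 128 in Windows-1252, 8364 in Unicode.
    print(convert_win1252_entities("Price: &#128;10"))
```

Entities outside the 128-159 range already agree with Unicode and pass through untouched.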

For information about which encodings to use with particular file types, see the topics related to those file types.