Codepages

Introduction to Codepages

What is a Codepage?

Computers work with binary data, which can be represented in hexadecimal codes.
However, in order to represent text and characters, there must be a mapping between the code a computer understands and the representation a human understands.

These mappings are called codepages.
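
As a minimal sketch of this idea, the following Python snippet decodes the same byte value with two different codepages. The codec names `cp500` and `cp1252` are the names Python's standard library uses for these codepages; the byte value C1(H) is chosen only for illustration.

```python
# One byte value, two codepages: the mapping decides which character appears.
raw = bytes([0xC1])

decoded = {codepage: raw.decode(codepage) for codepage in ("cp500", "cp1252")}

for codepage, character in decoded.items():
    print(f"{codepage}: byte C1(H) -> {character!r}")
```

The byte itself never changes; only the codepage used to interpret it does.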

Historically, 8 bits (1 byte) were provided to represent a character with these binary codes. For the alphabets of the western world, the resulting space of 256 characters was sufficient to cover upper- and lower-case letters, numbers and special characters.

This character space of 256 mapping locations also contains control characters that were used for printers in the past, such as LF (Line Feed, which moves the printed line to the next one) and CR (Carriage Return, which moves the printing head to the beginning of the line), and some that control data transmission, such as ETB (End of Transmission Block).

Most of these transmission control characters are obsolete nowadays, but they remain in the codepage mappings for compatibility reasons.
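
To make the control characters more concrete, this short sketch prints the ASCII code values of the three control characters named above. The dictionary name `CONTROL_CHARACTERS` is only an illustrative choice.

```python
# The three control characters named in the text, as Python string escapes.
CONTROL_CHARACTERS = {"LF": "\n", "CR": "\r", "ETB": "\x17"}

# Look up the single byte each one occupies in ASCII.
ascii_codes = {
    name: character.encode("ascii")[0]
    for name, character in CONTROL_CHARACTERS.items()
}

for name, code in ascii_codes.items():
    print(f"{name}: {code:02X}(H)")
```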

Languages

While the characters of the English language, including upper- and lower-case letters, fit easily into the 256-character space, a single mapping is not sufficient to also cover all other languages, such as French, Spanish or German, which use additional special characters.

For this purpose there are additional codepages with different mappings to handle the respective special characters. They are typically referred to by their number.

There are several different groups of codepage sets, but the ones most relevant to AFP are the EBCDIC and ASCII codepages.

EBCDIC

EBCDIC, the Extended Binary Coded Decimal Interchange Code (http://en.wikipedia.org/wiki/EBCDIC), is mainly used in the mainframe world and in some older midrange computer systems. Historically, it was also used in the transmission of punchcard data.

Some of the most typical codepage numbers are Codepage 037 (http://en.wikipedia.org/wiki/EBCDIC_037) for the English character set, Codepage 500 (http://en.wikipedia.org/wiki/EBCDIC_500) for the international character set and Codepage 273 for German characters (Germany and Austria).

http://en.wikipedia.org/wiki/EBCDIC_273
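
The differences between these three codepages can be explored with a small sketch like the one below. It assumes Python's standard codecs `cp037`, `cp500` and `cp273`, and the two sample characters are chosen only because they are known to move between these codepages.

```python
# Characters that sit at different codepoints depending on the EBCDIC codepage.
samples = ["[", "\u00e4"]  # "[" and "ä"
codepages = ["cp037", "cp500", "cp273"]

# For every sample character, record its byte value in each codepage.
table = {
    character: {cp: character.encode(cp)[0] for cp in codepages}
    for character in samples
}

for character, row in table.items():
    cells = ", ".join(f"{cp}={code:02X}(H)" for cp, code in row.items())
    print(f"{character!r}: {cells}")
```

Data written with one of these codepages and read with another will therefore show the wrong character for exactly these positions.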

 

Below is a chart of EBCDIC Codepage 500, the international EBCDIC single-byte codepage. The top edge gives the first hex digit and the left margin gives the second hex digit.

In the example below you will see that at codepoint 40(H) there is the Space character.

Codepage 500
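
The way a codepoint is split into the two hex digits of such a chart can be sketched as follows. The helper name `chart_position` is hypothetical, and `cp500` is Python's codec name for Codepage 500.

```python
def chart_position(character, codepage="cp500"):
    """Return (first hex digit, second hex digit) of a character's codepoint."""
    code = character.encode(codepage)[0]
    # High nibble is the first hex digit, low nibble is the second hex digit.
    return code >> 4, code & 0x0F


first, second = chart_position(" ")
print(f"Space is at {first:X}{second:X}(H)")  # prints "Space is at 40(H)"
```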

EBCDIC codepages have also been extended in recent years, as symbols such as the Euro sign (EUR) had to be included in the list of available characters.

The updated codepages were assigned separate codepage numbers.

http://en.wikipedia.org/wiki/List_of_EBCDIC_code_pages_with_Latin-1_character_set

Examples of such codepages are:

500 -> 1148 International EBCDIC

037 -> 1140 English EBCDIC

273 -> 1141 German EBCDIC
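
As a hedged illustration of such an update, the sketch below compares Codepage 037 with its successor 1140, using Python's `cp037` and `cp1140` codecs. Only this one pair is examined here; the other pairs in the list are not covered by the example.

```python
# Collect every byte value that decodes differently in the two codepages.
changed = [
    value
    for value in range(256)
    if bytes([value]).decode("cp037") != bytes([value]).decode("cp1140")
]

for value in changed:
    old = bytes([value]).decode("cp037")
    new = bytes([value]).decode("cp1140")
    print(f"{value:02X}(H): {old!r} -> {new!r}")
```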

ASCII

While mainframes and older systems, such as IBM computers, use EBCDIC as their codepage standard, the PC and server/minicomputer world typically uses the ASCII standard. ASCII stands for “American Standard Code for Information Interchange”.

http://en.wikipedia.org/wiki/ASCII

Compared to the EBCDIC layout, some control characters are different and the characters are located at different codepoints. The Space character, for example, is at 20(H) in ASCII versus 40(H) in EBCDIC.
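
The difference in codepoints can be seen by encoding the same short text both ways. This is a small sketch that uses Python's `ascii` and `cp500` codecs; the sample text is arbitrary.

```python
text = "A b"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp500")


def as_hex(data):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{value:02X}" for value in data)


print("ASCII :", as_hex(ascii_bytes))   # prints "ASCII : 41 20 62"
print("EBCDIC:", as_hex(ebcdic_bytes))  # prints "EBCDIC: C1 40 82"
```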

Windows

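Microsoft provides its own set of codepages, many of which are based on ISO standards. An example is the Windows-1252 codepage, which is based on ISO 8859-1 (Latin-1) and assigns additional printable characters, such as the Euro sign, to the range 80(H) to 9F(H) that ISO 8859-1 reserves for control characters.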

 

Codepage 1252

http://en.wikipedia.org/wiki/Windows-1252

http://en.wikipedia.org/wiki/ISO/IEC_8859-1
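
The relationship between the two codepages can be checked with a short sketch. It compares how Python's `cp1252` and `latin-1` codecs decode one byte from the range 80(H) to 9F(H) and one byte from the upper range; the two byte values are chosen only for illustration.

```python
import unicodedata

# A byte from the 80(H)-9F(H) range, where the two codepages disagree.
windows_char = bytes([0x80]).decode("cp1252")
iso_char = bytes([0x80]).decode("latin-1")

print("cp1252 :", repr(windows_char), unicodedata.category(windows_char))
print("latin-1:", repr(iso_char), unicodedata.category(iso_char))

# A byte from the upper range, where both codepages agree.
shared = bytes([0xE9])
print("E9(H) matches:", shared.decode("cp1252") == shared.decode("latin-1"))
```

The Unicode category `Cc` marks a control character, while `Sc` marks a currency symbol.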

Double Byte

While Western languages, with their uppercase and lowercase letters, are typically covered by the 256 available characters, this space is not sufficient for Far East languages such as Chinese and Japanese.

The solution was to provide a double-byte codepage, which uses two bytes to address each character and therefore offers many more codepoints. These sets of codepoints are called “planes”.

In order not to “waste” data storage by using two bytes for every character even where a single byte would be sufficient, an indicator shows when to switch to two bytes per character and when to switch back to a single byte.

These indicators are the Shift Out and Shift In characters: typically Hex-0E (Shift Out) switches to double-byte mode and Hex-0F (Shift In) switches back to single-byte mode.

Codepage 937 Section 00
Codepage 937 Section 4C

In the sample above, the byte values 4C 4F are required to address the highlighted character.
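
How a reader of such mixed data might separate single-byte and double-byte characters can be sketched as follows. The function name `split_mixed` is hypothetical, and the sketch assumes the common convention that 0E(H) (Shift Out) starts double-byte mode and 0F(H) (Shift In) ends it. It only splits the byte stream and does not look up the characters themselves.

```python
SHIFT_OUT = 0x0E  # assumed: switch to double-byte mode
SHIFT_IN = 0x0F   # assumed: switch back to single-byte mode


def split_mixed(data):
    """Split mixed data into ('single', byte) and ('double', byte pair) items."""
    items = []
    position = 0
    double_mode = False
    while position < len(data):
        value = data[position]
        if not double_mode and value == SHIFT_OUT:
            double_mode = True
            position += 1
        elif double_mode and value == SHIFT_IN:
            double_mode = False
            position += 1
        elif double_mode:
            pair = data[position:position + 2]
            if len(pair) < 2:
                raise ValueError("truncated double-byte character")
            items.append(("double", pair))
            position += 2
        else:
            items.append(("single", data[position:position + 1]))
            position += 1
    return items


# One single-byte character, the double-byte character 4C 4F, one more single byte.
stream = bytes([0xC1, 0x0E, 0x4C, 0x4F, 0x0F, 0xC2])
for kind, raw in split_mixed(stream):
    print(kind, raw.hex().upper())
```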

 

Unicode

Codepages still have the problem that you need to know which codepage is required in order to interpret the data, and there are different codepages for each language.

Ideally there would be a single codepage that covers all, or at least as many languages as possible.

This is the goal of Unicode. Its encodings are prefixed with the letters UTF, short for Universal Character Set Transformation Format (also expanded as Unicode Transformation Format), for example UTF-8.

http://en.wikipedia.org/wiki/UTF-8

The advantage of this codepage is that it offers far more codepoints than single-byte or double-byte codepages do and that every character is addressed with a variable number of bytes (one to four in UTF-8).

This does require a bit more processing time, as each byte has to be evaluated to determine whether the next byte, and potentially the ones after it, also belong to the same character. In return, there is no need to know the codepage from an external source or to always use a fixed number of bytes for a single character.
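
The evaluation of the first byte can be sketched for UTF-8 as shown below. The helper name `utf8_sequence_length` is hypothetical; it only inspects the leading bits of the first byte and does not validate the bytes that follow.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 character occupies, judged by its first byte."""
    if first_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx: single byte, same as ASCII
    if first_byte >> 5 == 0b110:
        return 2  # 110xxxxx: one continuation byte follows
    if first_byte >> 4 == 0b1110:
        return 3  # 1110xxxx: two continuation bytes follow
    if first_byte >> 3 == 0b11110:
        return 4  # 11110xxx: three continuation bytes follow
    raise ValueError("not a valid first byte of a UTF-8 character")


# "A", "é", the Euro sign and an emoji need 1, 2, 3 and 4 bytes respectively.
for character in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = character.encode("utf-8")
    print(f"U+{ord(character):04X}: {encoded.hex().upper()} "
          f"({utf8_sequence_length(encoded[0])} bytes)")
```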

http://www.i18nguy.com/unicode/codepages.html#unicode

Codepages and AFP

Codepages are also used in AFP output. When the proper font is selected and the data is written, the codepage must match in order for the proper characters to be visible.

Likewise, when reading the variable data for documents that are output in AFP, it is important to know the codepage of that data in order to transform it properly into the output.
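
The transformation step can be sketched in general terms as below. This is not an AFP API: the function name `transcode` is hypothetical, and the codepages `cp1252` (input data) and `cp500` (output) are assumptions chosen only for the example.

```python
def transcode(data, source_codepage, target_codepage):
    """Re-encode bytes from the input codepage into the output codepage."""
    text = data.decode(source_codepage)
    # Raises UnicodeEncodeError if the target codepage lacks a character.
    return text.encode(target_codepage)


# Variable data as it might arrive from a Windows system ("Größe").
source = "Gr\u00f6\u00dfe".encode("cp1252")
converted = transcode(source, "cp1252", "cp500")

print("input :", source.hex().upper())
print("output:", converted.hex().upper())
```

Letting the conversion fail on characters that the target codepage does not contain, rather than silently replacing them, makes a codepage mismatch visible before the document is produced.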