CPI file format

There are various descriptions of the CPI file format around the web; this is my attempt at one. The structure names and definitions used are based on those in Andries Brouwer's format documentation. These, in turn, appear to originate from the MS-DOS Programmer's Reference (my copy is for MS-DOS 5: ISBN 1-55615-329-5).

CPI files store the fonts that allow devices to display text in multiple codepages. They can contain either screen fonts or printer fonts. Screen CPI files can hold one or more fonts per codepage, usually at 8x16, 8x14 and 8x8 sizes. DRDOS screen codepage files also contain an 8x6 font (actually 6x6, but the file headers all say 8x6), which is used by ViewMAX screen drivers.

According to this blog comment by Larry Osterman, one of the developers of MSDOS, NLS functions were ported to PC-DOS by IBM from their mainframe systems. Presumably this included codepages, in which case the CPI file format may be derived from a mainframe file format.

There are three main CPI format variants -- FONT (used by MSDOS, PCDOS and Windows 9x), FONT.NT (used by Windows NT and its successors) and DRFONT (used by DRDOS screen fonts). There is a file format specification in the MS-DOS Programmer's Reference which covers FONT; I know of no formal specification for FONT.NT or DRFONT. Even in the case of FONT, a bit of expansion and clarification wouldn't come amiss in some places.

In this document (on the principle of being conservative in what you generate and liberal in what you accept) emphasized text indicates restrictions on the file format that you should try to follow when generating a CPI file, but which you shouldn't rely on when reading. It is sometimes followed by a footnote [0] saying which utility has this restriction.

Here's one, for instance: CPI files in FONT format should not exceed 64k in size - use FONT.NT or DRFONT if you need to get more codepages in a file than will fit in 64k [1]. If you know that your CPI file will only be parsed by utilities that understand 32-bit file offsets, you can write CPI files bigger than 64k. Just don't try to use them with, in this case, the PC-DOS 3.3 DISPLAY.SYS. And don't assume that all FONT-format CPI files will be 64k or less.

The principal programs which have to parse CPI files - and on which I've based this specification - are:

All numbers are stored in little-endian format. 'short' is 2 bytes, 'long' is 4 bytes.
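Because the fields are little-endian and the structures are unaligned (a 4-byte value can start at an odd offset), it is usually safer to decode fields byte by byte than to overlay a C struct on the file. A minimal sketch, in which the helper names cpi_le16 and cpi_le32 are my own and not taken from any existing utility:

```c
#include <stdint.h>

/* Hypothetical helpers: decode the little-endian 'short' and 'long'
 * fields of a CPI file from a raw byte buffer, independent of the
 * host's byte order and of any struct padding. */
static uint16_t cpi_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t cpi_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

For example, the bytes 17 00 00 00 decode to the 'long' value 0x17.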

Overview

FONT or FONT.NT

FontFileHeader
FontInfoHeader
    CodePageEntryHeader
     |  |
     |  |     either
     |  +---> CodePageInfoHeader    }
     |  .       ScreenFontHeader    }
     |  .       Screen font bitmaps } Code page body
     |  .       ScreenFontHeader    }
     |  .       Screen font bitmaps }
     |  .       ...
     |  .     or
     |  +---> CodePageInfoHeader    }
     |          PrinterFontHeader   } Code page body
     |          Printer font data   }
     v
    CodePageEntryHeader
     |  |
     |  +---> Code page body
    ...         ...

DRFONT

FontFileHeader
DRDOSExtendedFontFileHeader
 |  FontInfoHeader
 |      CodePageEntryHeader
 |       |  |
 |       |  +---> CodePageInfoHeader      }
 |       |          ScreenFontHeader      }
 |       |          ScreenFontHeader      } Code page body
 |       |          ...                   }
 |       |          Character index table }
 |       v
 |      CodePageEntryHeader
 |       |  |
 |       |  +---> Code page body
 |       ...         ...
 v
Screen font bitmaps

FontFileHeader

A CPI file begins with a fixed header. In theory its size could range from 18 bytes to just over 320k, but in practice its length is always 23 bytes, for two reasons:

  1. Some utilities hardcode the 23-byte form, and will break if it is not used.
  2. There are at least two possible ways the header can be expanded beyond 23 bytes - but which one is right?
struct
{
	char  id0;
	char  id[7];
	char  reserved[8];
	short pnum;
	char  ptyp;
	long  fih_offset;
} FontFileHeader;
id0
The first byte of the file is 0xFF for FONT and FONT.NT files, and 0x7F for DRFONT files.
id[]
This is the file format, space padded: "FONT   ", "FONT.NT" or "DRFONT ".
reserved[]
The eight reserved bytes are always zero.
pnum
This is the number of pointers in this header. In all known CPI files this is 1; the MS-DOS 5 Programmer's Reference says that "for current versions of MS-DOS" it should be 1. With the count of pointers set to 1, the total header size is 23 bytes. A value of 0 here would result in a degenerate 18-byte CPI file consisting only of the FontFileHeader.
ptyp
The type of the pointer in the header. In all known CPI files this is 1; the MS-DOS reference says that "for current versions of MS-DOS" it should be 1. Meanings for other values are presumably not defined.
fih_offset
The offset in the file of the FontInfoHeader. In FONT and FONT.NT files, this is usually 0x17, pointing to immediately after the FontFileHeader - though files with other values are known to exist [10]. In DRFONT files, it should point to immediately after the DRDOSExtendedFontFileHeader [2], which for a four-font CPI file puts it at 0x2C.
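The header arithmetic above can be sketched as follows. This is only an illustration, assuming the usual single-pointer layout; the names cpi_identify, cpi_ffh_size and cpi_fih_offset are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum cpi_format { CPI_UNKNOWN, CPI_FONT, CPI_FONT_NT, CPI_DRFONT };

/* Classify a file from id0 and id[] (bytes 0..7 of the FontFileHeader). */
static enum cpi_format cpi_identify(const unsigned char *buf, size_t len)
{
    if (len < 8)
        return CPI_UNKNOWN;
    if (buf[0] == 0xFF && memcmp(buf + 1, "FONT   ", 7) == 0)
        return CPI_FONT;
    if (buf[0] == 0xFF && memcmp(buf + 1, "FONT.NT", 7) == 0)
        return CPI_FONT_NT;
    if (buf[0] == 0x7F && memcmp(buf + 1, "DRFONT ", 7) == 0)
        return CPI_DRFONT;
    return CPI_UNKNOWN;
}

/* Header size implied by pnum: 18 fixed bytes, plus 5 bytes
 * (1-byte ptyp + 4-byte offset) for each pointer. */
static uint32_t cpi_ffh_size(uint16_t pnum)
{
    return 18u + 5u * (uint32_t)pnum;
}

/* fih_offset for the usual pnum == 1 layout: ptyp is byte 18 and the
 * little-endian offset occupies bytes 19..22. The caller must supply
 * at least 23 bytes. */
static uint32_t cpi_fih_offset(const unsigned char *buf)
{
    return (uint32_t)buf[19] | ((uint32_t)buf[20] << 8) |
           ((uint32_t)buf[21] << 16) | ((uint32_t)buf[22] << 24);
}
```

With these definitions cpi_ffh_size(0) is 18, cpi_ffh_size(1) is 23 and cpi_ffh_size(65535) is 327693, which is where the "just over 320k" upper bound comes from.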

DRDOSExtendedFontFileHeader

In a DRFONT font, this immediately follows the FontFileHeader.

	struct
	{
		char num_fonts_per_codepage;
		char font_cellsize[N];
		long dfd_offset[N];
	} DRDOSExtendedFontFileHeader;
num_fonts_per_codepage
The number of fonts defined by each codepage. This is 4 for the codepages distributed with DRDOS. The DRDOS MODE.COM supports values up to 10, and ViewMAX has no limit at all. The length of the DRDOSExtendedFontFileHeader is 1 plus five times the value in this byte.
font_cellsize
This array has num_fonts_per_codepage entries. It lists the size of a character in bytes (in all existing DRFONT files this is equal to the character height) for each font in this file. The original DRDOS EGA.CPI has sizes 6, 8, 14 and 16.
dfd_offset
This array also has num_fonts_per_codepage entries. Each entry is the offset, from the start of the file, of the first character bitmap in the corresponding size.
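As a rough sketch of how the three fields fit together, the functions below compute the length of the extended header and pick out individual entries. The function names are hypothetical, and 'ext' is assumed to point at the first byte of the DRDOSExtendedFontFileHeader (file offset 23 when the FontFileHeader has its usual size).

```c
#include <stdint.h>

/* Length of the extended header: 1 count byte, then N cell sizes
 * (1 byte each), then N bitmap offsets (4 bytes each). */
static uint32_t drfont_ext_size(uint8_t num_fonts)
{
    return 1u + 5u * (uint32_t)num_fonts;
}

/* Where the FontInfoHeader is expected to start, assuming the usual
 * 23-byte FontFileHeader precedes the extended header. */
static uint32_t drfont_expected_fih_offset(uint8_t num_fonts)
{
    return 23u + drfont_ext_size(num_fonts);
}

/* font_cellsize[i]: the cell sizes directly follow the count byte. */
static uint8_t drfont_cellsize(const unsigned char *ext, unsigned i)
{
    return ext[1 + i];
}

/* dfd_offset[i]: the little-endian offsets follow all N cell sizes. */
static uint32_t drfont_dfd_offset(const unsigned char *ext, unsigned i)
{
    const unsigned char *p = ext + 1 + ext[0] + 4 * i;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

For a four-font file, drfont_ext_size(4) is 21 and drfont_expected_fih_offset(4) is 0x2C. Neither accessor checks i against the count byte; a real parser should.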

Notes

FontInfoHeader

	struct
	{
		short num_codepages;
	} FontInfoHeader;
num_codepages
This contains a count of codepages in the file. A value of 0 is possible but very uninteresting.

This should immediately follow the FontFileHeader or DRDOSExtendedFontFileHeader [2].

CodePageEntryHeader

The FontInfoHeader is immediately followed by the first CodePageEntryHeader; these form a linked list of codepages that the CPI file implements.

struct
{
	short cpeh_size;
	long next_cpeh_offset;
	short device_type;
	char device_name[8];
	short codepage;
	char reserved[6];
	long cpih_offset;
} CodePageEntryHeader;
cpeh_size
This is the size of the CodePageEntryHeader structure, i.e. 0x1C bytes. Some CPI files have other values here, most often 0x1A. Some utilities ignore this field and always load 0x1C bytes; others believe it.
next_cpeh_offset
This is the offset of the next CodePageEntryHeader in the file. In FONT and DRFONT files, the address is relative to the start of the file; in FONT.NT files, it is relative to the start of this CodePageEntryHeader. At least one pathological CPI file is known to exist where values above 64k are stored as segment:offset rather than a 32-bit pointer (eg: 0x1000abcd rather than 0x0001abcd). The file EGA.ICE [10] is even worse - all its pointers, even those below 64k, are stored as apparently arbitrary segment:offset combinations.
In the last CodePageEntryHeader, the value of this field has no meaning. Some files set it to 0, some to -1, and some to point at where the next CodePageEntryHeader would be. The MS-DOS 5 Programmer's Reference says it should be 0.
device_type
1 for screen, 2 for printer. Some printer CPI files from early DRDOS versions have device_type=1; a suggested workaround is to check for a device name of
  • "4201    "
  • "4208    "
  • "5202    "
  • "1050    "
and force the device type to 2. Printer codepages should only be present in FONT files, not FONT.NT or DRFONT.
device_name
The ASCII device name. For screens, it refers to the display hardware ("EGA     " for EGA/VGA and "LCD     " for the IBM Convertible LCD). For printers, it is usually one of:
  • "4201    "
  • "4208    "
  • "5202    "
  • "1050    "
  • "EPS     "
  • "PPDS    "
The MS-DOS 5 Programmer's Reference says this should match the filename, but this isn't really practical (and was dropped in later versions with files like EGA2.CPI).
codepage
This is the number of the codepage this header describes. Traditionally, DOS codepages had 3-digit IDs (1-999) but the number can range from 1-65533 - see the "Code Page Global Identifier" section in IBM's Character Data Representation Architecture. IDs 65280-65533 are 'reserved for customer use' - ie, this is the range to use for user-defined codepages.
reserved
The reserved bytes are always zero.
cpih_offset
The offset of the CodePageInfoHeader for this codepage. In FONT and DRFONT files, it is relative to the start of the file; in FONT.NT files it is relative to the start of this CodePageEntryHeader. As with next_cpeh_offset, the field is normally treated as a 32-bit pointer but some programs may instead populate it with segment:offset values.
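The two pointer conventions for next_cpeh_offset and cpih_offset, and the segment:offset oddity, can be sketched like this. The function names are hypothetical, and deciding when a value should be treated as segment:offset is left to the caller, since the file itself does not say.

```c
#include <stdint.h>

/* Turn a stored next_cpeh_offset or cpih_offset into an absolute file
 * offset. 'cpeh_pos' is the file offset of the CodePageEntryHeader
 * that contains the pointer. In FONT.NT files the stored value is
 * relative to that header; in FONT and DRFONT files it is already
 * absolute. Unsigned arithmetic wraps, so a negative relative offset
 * stored in two's complement still resolves to an earlier position. */
static uint32_t cpi_resolve(uint32_t stored, int is_font_nt,
                            uint32_t cpeh_pos)
{
    return is_font_nt ? cpeh_pos + stored : stored;
}

/* Reinterpret a stored pointer as real-mode segment:offset (segment
 * in the high word, offset in the low word) and flatten it to a
 * linear value: segment * 16 + offset. */
static uint32_t cpi_segoff_to_linear(uint32_t stored)
{
    return ((stored >> 16) << 4) + (stored & 0xFFFFu);
}
```

With these, cpi_segoff_to_linear(0x1000abcd) gives 0x1abcd, matching the example under next_cpeh_offset.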

The CodePageInfoHeader for a codepage should immediately follow the CodePageEntryHeader - rather than, for example, all the CodePageEntryHeaders together at the start and then all the CodePageInfoHeaders with their fonts [3]. This is particularly important in a DRFONT file [4].

The fields next_cpeh_offset and cpih_offset should not point to addresses earlier in the file than this CodePageEntryHeader, for the same reason.
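The device_type workaround suggested earlier for early DRDOS printer files might look something like this; cpi_fix_device_type is a hypothetical name, and the list holds only the four names given for that workaround.

```c
#include <string.h>

/* Return the device_type to use for a CodePageEntryHeader: force 2
 * (printer) when the 8-byte, space-padded device_name is one of the
 * known printer names, otherwise keep the value from the file. */
static int cpi_fix_device_type(int device_type, const char *device_name)
{
    static const char *const printers[] = {
        "4201    ", "4208    ", "5202    ", "1050    "
    };
    size_t i;

    for (i = 0; i < sizeof printers / sizeof printers[0]; i++) {
        if (memcmp(device_name, printers[i], 8) == 0)
            return 2;
    }
    return device_type;
}
```

The comparison uses memcmp over exactly 8 bytes because device_name is not NUL-terminated in the file.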

CodePageInfoHeader

At the start of the data block for each codepage is a CodePageInfoHeader:

struct 
{
	short version;
	short num_fonts;
	short size;
} CodePageInfoHeader;
version

This is 1 if the following codepage is in FONT format, 2 if it is in DRFONT format. Putting a DRFONT codepage in a FONT-format file will not work. You shouldn't put a FONT codepage in a DRFONT-format file either [5].

LCD.CPI from Toshiba MS-DOS 3.30 sets this field to 0, which should be treated as 1.

num_fonts
If this is a screen font, it gives the number of font records that follow. For printer fonts, it should be assumed to be 1; some DRDOS printer CPI files have it wrongly set to 2.
size
This is the number of bytes that follow up to the end of this codepage (if version is 1) or up to the character index table (if version is 2).
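A loader that wants to tolerate the known quirks of this header could normalise it after reading. This is a sketch only: the struct and function names are hypothetical, and the fields are assumed to have been decoded from the file already.

```c
#include <stdint.h>

struct cpih {
    uint16_t version;
    uint16_t num_fonts;
    uint16_t size;
};

/* Apply the two tolerances described above: a version of 0 (as in the
 * Toshiba LCD.CPI) is treated as 1, and a printer codepage
 * (device_type 2) is assumed to hold exactly one font whatever
 * num_fonts says. */
static struct cpih cpih_normalise(struct cpih h, int device_type)
{
    if (h.version == 0)
        h.version = 1;
    if (device_type == 2)
        h.num_fonts = 1;
    return h;
}
```

The device_type passed in should be the corrected one if the printer-name workaround is in use.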

Printer Fonts

If the CPI is for a printer, the CodePageInfoHeader is followed by:

struct
{
	short printer_type;
	short escape_length;
} PrinterFontHeader;
printer_type
This is 1 if the character set is downloaded to the printer, 2 if the printer already has the character set and selects it with escape codes.
escape_length
The number of bytes in the escape sequences that follow.

This structure is in turn followed by the printer data. If printer_type is 1, there are two escape sequences; if printer_type is 2, there is one. The first escape sequence selects the builtin code page; the second selects the downloaded codepage. An escape sequence is stored as a Pascal string (the first byte is the length). After the escape sequence(s), any remaining data up to the size given in CodePageInfoHeader are the definition of the font, to be downloaded to the printer.
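Splitting the escape area into its Pascal strings could be done along these lines. The names are hypothetical; 'area' is assumed to point at the first byte after the PrinterFontHeader, and the sketch assumes the count is 2 for printer_type 1 and 1 otherwise.

```c
#include <stddef.h>

struct cpi_escape {
    const unsigned char *data;  /* first byte after the length byte */
    size_t len;                 /* value of the length byte */
};

/* Split the escape_length bytes that follow a PrinterFontHeader into
 * Pascal strings. Returns the number of sequences stored in out[],
 * or -1 if the area is too short to hold them. */
static int cpi_parse_escapes(const unsigned char *area,
                             size_t escape_length, int printer_type,
                             struct cpi_escape out[2])
{
    int want = (printer_type == 1) ? 2 : 1;
    size_t pos = 0;
    int i;

    for (i = 0; i < want; i++) {
        size_t n;

        if (pos >= escape_length)
            return -1;              /* no room for the length byte */
        n = area[pos++];
        if (n > escape_length - pos)
            return -1;              /* string runs past the area */
        out[i].data = area + pos;
        out[i].len = n;
        pos += n;
    }
    return want;
}
```

Whatever lies between the end of the escape area and the size given in the CodePageInfoHeader would then be the font definition to download.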

Screen fonts

If the CPI is for the screen, the CodePageInfoHeader is followed by screen font definitions for each size. In a FONT or FONT.NT file, each entry consists of a ScreenFontHeader followed by the font bitmap; in a DRFONT, just the ScreenFontHeader is provided.

struct
{
	char height;
	char width;
	char yaspect;
	char xaspect;
	short num_chars;
} ScreenFontHeader;
height
This is the character height in pixels.
width
This is the character width in pixels; in all known CPI files it is 8. Values other than 8 can cause trouble in any font format [6], but particularly in DRFONT fonts [7] and FONT.NT fonts [8].
yaspect
Vertical aspect ratio. In all known CPI files this is unused and set to zero.
xaspect
Horizontal aspect ratio. In all known CPI files this is unused and set to zero.
num_chars
Number of characters in the font. In known CPI files this is always 256. Some utilities may assume that it is 256, and malfunction if it is not.

Except in DRFONT fonts, the bitmap follows the ScreenFontHeader; its length is num_chars * height * ((width+7)/8), and it contains glyphs for each character in increasing order. Some loaders calculate the size simply as height * num_chars, and so will miscalculate if the width is greater than 8.
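The length formula translates directly into code; cpi_bitmap_size is a hypothetical name, and the division is integer division, so each row of pixels is rounded up to a whole number of bytes.

```c
#include <stdint.h>

/* Size in bytes of the bitmap that follows a ScreenFontHeader in a
 * FONT or FONT.NT file: one row of a glyph occupies (width+7)/8
 * bytes, rounded down after adding 7, i.e. rounded up to whole bytes. */
static uint32_t cpi_bitmap_size(uint8_t height, uint8_t width,
                                uint16_t num_chars)
{
    uint32_t bytes_per_row = ((uint32_t)width + 7u) / 8u;

    return (uint32_t)num_chars * height * bytes_per_row;
}
```

An 8x16 font with 256 characters comes to 4096 bytes. A hypothetical 9-pixel-wide, 16-pixel-high font would need 8192 bytes, twice what a loader using height * num_chars would expect.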

Character index table

In a DRFONT, after the ScreenFontHeaders, there follows a table describing where the character bitmaps come from.
struct 
{
	short FontIndex[256];
} CharacterIndexTable;

The DRDOS utilities assume that there are always 256 entries in this table; so the character count in a DRFONT ScreenFontHeader should always be 256 [9].

Each entry in FontIndex describes the number of the bitmap for the corresponding character in the bitmap tables pointed to by the DRDOSExtendedFontFileHeader. To find the bitmap for a particular letter, take the FontIndex entry, multiply it by the character length in bytes, and add the dfd_offset for the size in question.

To determine the number of characters in bitmap tables in a DRFONT, a program therefore has to walk all FontIndex entries in the file and take the highest value; since the entries count from zero, the number of bitmaps is that value plus one.
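The lookup and the scan for the highest index can be sketched as below. The names are hypothetical, and font_index is assumed to hold the 256 table entries already converted from little-endian.

```c
#include <stdint.h>

/* File offset of the bitmap for character 'ch' in one font size.
 * 'cellsize' is the font_cellsize entry for that size and
 * 'dfd_offset' the matching entry from the
 * DRDOSExtendedFontFileHeader. */
static uint32_t drfont_glyph_offset(const uint16_t font_index[256],
                                    unsigned ch, uint8_t cellsize,
                                    uint32_t dfd_offset)
{
    return dfd_offset + (uint32_t)font_index[ch & 0xFFu] * cellsize;
}

/* Fold one character index table into a running maximum. Call it for
 * every codepage in the file, starting with max_so_far = 0; the
 * number of bitmaps per size is then the final result plus one. */
static uint16_t drfont_max_index(const uint16_t font_index[256],
                                 uint16_t max_so_far)
{
    unsigned i;

    for (i = 0; i < 256; i++) {
        if (font_index[i] > max_so_far)
            max_so_far = font_index[i];
    }
    return max_so_far;
}
```

Sharing one bitmap table between codepages is what makes this scan necessary: no header in the file records how many bitmaps each table holds.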

Trailing data

Some CPI files don't end immediately after the last font. Usually, what follows is a copyright message (possibly terminated by 0x1A) and/or some zero bytes. The MS-DOS 5 Programmer's Reference says that a CPI file 'always ends with a copyright notice' and that this is at most 0x150 bytes long.

Ambiguities

Among the things that the format seems to support but some or all utilities do not, we find:

FontFileHeader: Multiple pointers

If pnum were to be greater than 1, there are two possibilities for how the extra data would be stored:

	struct					struct
	{					{
		char  id0;				char id0;
		char  id[7];				char id[7];
		char  reserved[8];			char reserved[8];
		short pnum;				short pnum;
		char  ptyp[N];				struct {
		long  fih_offset[N];			  char ptyp;
							  long fih_offset;
							} pointers[N];
	} FontFileHeader;			} FontFileHeader;
-- that is, either all the types come first and then all the pointers, or types and pointers alternate. The second is backward-compatible, in that programs which only understood the 1-pointer format would be able to follow the first pointer as usual.

FontFileHeader: Pointer types other than 1

ptyp is always 1. What might other values mean?

Codepages for multiple devices

Technically, there's no reason why a CPI file shouldn't hold codepages for multiple devices (eg, each codepage appears three times: once for "EGA", once for "LCD", and once for the "4201" printer). How would utilities handle this?

Backwards pointers

Even if a CPI file can't be streamed because of the order of the records, all the pointers in it will almost certainly point forwards - that is, to bytes further from the start of the file than where the pointer is. What happens if the blocks are so perversely arranged that this is not the case?

In this situation, a FONT.NT file would actually have negative values in its offset fields, and this might cause trouble on systems that treated them as unsigned.

Repetition

How should utilities handle the case of the same codepage appearing multiple times for the same device, or the same font size appearing multiple times within a codepage?

Aspect ratio

What was the aspect ratio intended for? Can the same font size appear multiple times in a codepage if the aspect ratio is different?

Footnotes

These explain the reasons for particular recommendations.

[0]
Example footnote
[1]
PC-DOS 3.3 DISPLAY.SYS does not seem to be able to handle CPI files larger than 64k.
[2]
ViewMAX display drivers and DRDOS MODE both assume that the FontInfoHeader immediately follows the DRDOSExtendedFontFileHeader.
[3]
FONT-format CPI files are passed to DISPLAY.SYS using a streaming interface that can seek forward but not back, and therefore objects in a CPI file should be in the order that DISPLAY.SYS would process them.
[4]
DRDOS MODE assumes that the CodePageInfoHeader immediately follows the CodePageEntryHeader.
[5]
ViewMAX display drivers assume that all fonts in a DRFONT-format file will be DRFONT fonts.
[6]
Fonts with a width greater than 8 cause problems with utilities that assume characters are 1 byte wide. Values less than 8 may also cause problems, because it isn't clear whether characters should be left- or right-aligned in the 8-pixel wide character cell. This may be why the 6-pixel fonts in DRDOS describe themselves as 8x6 even though they are actually only 6x6.
[7]
ViewMAX display drivers and DRDOS MODE both assume that the height of a character is equal to the number of bytes in its bitmap. This will not be true for characters wider than 8 pixels.
[8]
The codepage loader in the Windows NT DOS box rejects fonts whose character width is not 8.
[9]
DRDOS MODE assumes that the character index table has 256 entries, but should correctly handle a font with fewer characters. The ViewMAX screen drivers also assume that there are 256 entries, and always try to copy 256 characters. A possible workaround for fonts with fewer than 256 characters is to write the table with 256 entries and set the unused ones to 0.
[10]
EGA.ICE (which I found on a Compaq Concerto laptop, and which was apparently distributed with MS-DOS 6.0) is an unusual codepage file in several ways:
  • The copyright message is not at the end of the file, but the beginning; it is located between the FontFileHeader and the FontInfoHeader.
  • The copyright message reads:
    EXEC-NW.CPI  Version E3
    437 850 860 861 865
    Copyright (c) 1991, AST Europe Ltd. All rights reserved.
    Therefore "EGA.ICE" is not the original filename, and the file was not created by Microsoft.
  • All pointers in the CodePageEntryHeader are stored as segment:offset values, whether or not they are below 64k. By way of example, the first CodePageInfoHeader is at file offset B6h, but its pointer is 00090026h (ie, 0009:0026).
  • There are four font sizes: 8x8, 8x14, 8x16 and 8x19. The last of these is presumably intended to produce a 25-line display in a 640x480 video mode.

John Elliott 2006-10-14