# Character Sets and Collation Sequences - Firebird

The character set chosen for storing text data determines

• The characters that can be used in CHAR, VARCHAR, and BLOB SUB_TYPE 1 (text) columns.
• The number of bytes allocated to each character.
• The default collation sequence (“alphabetical order”) to be used when sorting CHAR and VARCHAR columns. (BLOBs cannot be sorted—hence collation sequence does not apply to them.)

The default character set of the database will be used if you don’t specify one. If no default character set is defined for the database, columns default to CHARACTER SET NONE. If your database is for use in an English-only language environment, you can be tempted to ignore character sets. Don’t be tempted! Character set NONE will accept any single -byte characters without complaining. The problems start when—in a non-U.S.- English or mixed-language environment —you find that you are getting transliterationErrors when you SELECT your text data. What comes out isn’t always what goes in!

Text input from keyboards and other input devices such as bar-code readers— potentially for all characters—is encoded specifically according to a certain standard code page that may be linked to the locale that the input device is set up for. It is fairly typical for input devices to be fitted with adapter software to allow users to switch code pages at will.

In one code page, the numeric code that maps to a certain character image may be quite different from its mapping in another. Broadly, each Firebird character set reflects a certain code page or group of related code pages. Some character sets work with more than one code page; in some cases, a code page will work with more than one character set. Different languages may share one overall character set but map different uppercase/lowercase pairs, currency symbols, and so on.

Beyond character sets, different country, language, or even cultural groups using the same character mappings use different sequences to determine “alphabetical order” for sorting and comparisons. Hence, for most character sets, Firebird provides a variety of collation sequences. Some collation sequences also take in uppercase/ lower-case pairings to resolve case-insensitive orderings. The COLLATE clause is used in several contexts where collation sequence is important, although it is not declared at the database level.

The server needs to know what character set is to be stored, in order to compute the storage space and to assess the collation characteristics for proper ordering, comparison, uppercasing, and so on. Beyond that, it is unconcerned about the character of text input.

Client Character Set

What really matters with regard to character set is the interaction between the server and client. The Firebird client library must be passed a character set attribute as part of the structure of a connection request.

If the engine detects a difference between the client’s character set setting and the character set of the target storage destination, translation—“transliteration”—is performed automatically, on the assumption that the incoming codes are correct for the client’s code page. The incoming codes are exchanged for codes that are correct for the corresponding characters in the character set of the storage target.

That makes it possible to store text in different targets with character sets that are not the same as the default character set of the database.

If the client and target character sets are the same, the engine assumes that the input it is receiving really is codes for the defined character set and stores the input unmodified. Troubles ensue if the input was not what the client said it was. When the data is selected, searched, or restored after a backup, it causes transliteration errors.

For more information about transliteration errors and how to fix them, refer to the section “Transliteration” later in this chapter.

Connecting applications must pass the character set of the database to the API via the database parameter block (DPB) in the parameter isc_dpb_lc_ctype. An ESQL application—including the isql utility—must execute a SET NAMES statement immediately before the CONNECT statement. SET NAMES <character-set> is also the command you use to set the character set in the isql utility. GUI admin tools usually provide an input field of some sort for selecting the client character set.

If you need to cater for a non-English language, spend some time studying the character sets available and choose the one which is most appropriate for most of your text input storage and output requirements. Remember to include that character set in the database attributes when you create the database.

Character Set Overrides

Having set the global default for the database, you can override it further down the food chain, if required. You can include a character set attribute when you create a domain. You can override either the database default or a domain setting in an individual column definition.

Firebird Character Sets

Firebird supports an increasingly broad variety of international character sets, including a number of 2-byte sets and a 3-byte Unicode set. In many cases, choices of collation (sorting) sequence are available. In this section, we look at

• Some background about character sets
• The global default character set for a database
• Alternative character set and collation sequence for character domains and columns
• Collation sequence for
• A text value in a comparison operation
• An ORDER BY or GROUP BY clause
• How to ask the server to translate input data to a particular character set

A character set is a collection of symbols that includes at least one character repertoire. A character repertoire is a set of characters used by a particular culture for its publications, written communication, and —in the context of a database—for computer input and output. For example, ISO Latin_1 is a character set that encompasses the English repertoire (A, B, C . . . Z) and the French repertoire (A, Á, À, B, C, Ç, D . . . Z), making it useful for systems that span both cultural communities.

Naming of Character Sets

Most Firebird character sets are defined by standards, and their names closely reflect the standards that define them. For example, Microsoft defines Windows 1252 and Firebird implements it as WIN1252. Character set ISO8859_1 is “the set of characters defined by ISO Standard 8859-1, encoded by the values defined in ISO standard 8859-1, having each value represented by a single 8-bit byte.”

Aliases

Character-set alias names support naming standard differences from one platform to another. For example, if you find yourself on an operating system that uses the identifier WIN_1252 for the WIN1252 character set, you can use an alias that is defined in the system table RDB$TYPES, as described in the next section. Storage of Character Sets and Aliases Characters sets, currently, are “hard-wired” in a database at creation time. One of the system tables built automatically is RDB$CHARACTER_SETS. For a listing by character set name, including the name of the default collate sequence in each case, execute this query:

Aliases are added as required to RDB$TYPES, another system table that stores enumerated sets of various kinds used by the database engine. To see all of the aliases that are set up at database-creation time, run this query, filtering RDB$TYPES to see just the enumerated set of character set names:

Storage Restrictions

It is important to understand how your choice of character set affects the storage restrictions for the data you plan to accommodate. In the case of the CHAR and VARCHAR columns, Firebird restricts the maximum amount of storage in any field in the column to 32,767 bytes and 32,765 bytes, respectively. The actual number required may be very limited indeed, according to a number of factors regarding its usage.

Non-indexed columns using the default collation sequence can store (number of characters) * (number of bytes per character) up to the byte limit for the data type. For example, a VARCHAR(32765) in character set ISO_8859_1 can store up to 32,765 characters, while in character set UNICODE_FSS (which uses 3 bytes per character) the upper limit is 10,291 characters.

If a column is to be indexed and/or modified with a COLLATE clause, a significant number of “spare” bytes must be allowed. Even the least demanding index—on a single VARCHAR column using a single-byte character set and the default collation sequence— is limited to 252 bytes in Firebird versions up to and including 1.5. For multi-byte character set columns, the character limit is less than (252/(number of bytes per character). Multi-column indexes consume more bytes than do single-column indexes; those using a non-default collation sequence consume still more. For more details about these effects, refer to the section “Collation Sequence and Index Size” later in this chapter.

Storage for BLOB columns, which are non-indexable, is not restricted according to the storage character set.

Default Character Set for the Database

If you do not specify a default character set in the CREATE DATABASE declaration, the character set defaults to NONE. Character set NONE makes no character set assumptions for text columns, storing data exactly as entered. If the client connection specified no character set, then data will also be retrieved exactly as entered. Alphabetical ordering is in strict ASCII code order and uppercase/lowercase mapping is supported only in U.S. ASCII codes 65–90 and 97–102, respectively.

Specify a valid character set code in the DEFAULT CHARACTER SET clause:

Field-Level Character Set Overrides

A character set attribute can be added to the individual definition of a domain, table column, or PSQL variable of type CHAR, VARCHAR, or BLOB SUB_TYPE 1 to override the default character set of the database.

For example, the following script fragment creates a database with a default character set of ISO8859_1 and a table containing different language versions of similar data in separate columns:

Another fragment from the same script creates a domain for storing BLOB data in the Cyrillic character set:

Later in the script, we define a table that stores some text in Cyrillic:

This fragment defines a stored procedure that converts string input to a different character set before storing it in a table:

Statement-Level Character Set Overrides

The character set for text values in a statement is interpreted according to the connection’s character set at runtime (not according to the character set defined for the column when it was created) unless you specify a character set marker (or “introducer”) to indicate a different character set.

Character Set Marker: INTRODUCER

A character set marker—also known as an INTRODUCER—consists of the character set name prefixed by an underscore character. It is required to “introduce” an input string when the client application is connected to the database using a character set that is different from that of the destination column in the database.

Position the marker directly to the left of the text value being marked. For example, the introducer for input to a UNICODE_FSS field is _UNICODE_FSS:

String Literal

A string literal in a test or search condition, for example, in a WHERE clause, is interpreted according to the character set of the client’s connection at the time the condition is tested. An introducer will be required if the database column being searched has a character set which is different from that of the client connection:

Transliteration

Converting characters from one Firebird character set to another—for example, converting from DOS437 to ISO8859 _1 —is transliteration. Firebird’s transliterations preserve character fidelity: by design, they will not substitute any kind of “placeholder” character for an input character that is not represented in the output character set. The purpose of this restriction is to guarantee that it will be possible to transliterate the same passage of text from one character set to another, in either direction, without losing any characters in the process.

Transliteration Errors

Firebird reports a transliteration error if a character in the input set does not have an exact representation in the output set.

An example of where a transliteration error may occur is when an application passes input of some unspecified character set into a column defined with NONE and later tries to select that data for input to another column that has been defined with a different character set. Even though you thought it should work because the character images looked as though they belonged to the character set of the destination column, the entire transliteration fails because a character is encountered which is not represented in the destination character set.

Fixing Transliteration Errors

How can you deal with a bunch of character data that you have stored using the wrong character set? The “trick” is to use character set OCTETS as a “staging post” between the wrong and the right encoding. Because OCTETS is a special character set that blindly stores only what you poke into it—without transliteration—it is ideal for making the character codes neutral with respect to code page.

For example, suppose your problem table has a column COL_ORIGINAL that you accidentally created in character set NONE, when you meant it to be CHARACTER SET ISO8859_2. You have been loading this column with Hungarian data, but every time you try to select from it, you get that darned transliteration error.

Here’s what you can do:

Now you have a temporary column designed to store Hungarian text—and it is storing all of your “lost” text from the unusable COL_ORIGINAL. You can proceed to drop COL_ORIGINAL, and then add a new COL_ORIGINAL having the correct character set. Simply copy the data from the temporary column and, after committing, drop the temporary column:

Character Set for the Client Connection

When a client application, such as isql , connects to a database, it is part of its connection protocol to inform the server of its character set requirements. The character set for a connection is the character set–neutral NONE, unless specified otherwise using

• SET NAMES in embedded applications or in isql.
• The isc_dpb_lc_ctype parameter of the database parameter block (DPB) for the isc_attach_database() function of the API. RAD database connection classes for Delphi, Java, et al. generally surface this parameter as a property.

The client application specifies its character set before it connects to the database. For example, the following ISQL command specifies that isql is using the ISO8859_1 character set. The next command connects to the authors.fdb database of our earlier example:

Special Character Sets

The general rule for character sets is that every byte (or pair or trio, in the case of multibyte sets) is specifically defined by the standard it implements. There are four special exceptions—NONE, OCTETS, ASCII, and UNICODE_FSS. Table explains the special qualities of these sets.

Special Character Sets

ISO8859_1 (LATIN_1) and WIN1252

The Firebird ISO8859_1 character set is often specified to support European languages. ISO8859_1, also known as LATIN_1, is a proper subset of WIN1252: Microsoft added characters in positions that ISO specifically defines as Not a character (not “undefined,” but specifically “not a character”). Firebird supports both WIN1252 and ISO8859_1. You can always transliterate ISO8859_1 to WIN1252, but transliterating WIN1252 to ISO8859_1 can result in errors.

Character Sets for Microsoft Windows

Five character sets support Windows client applications, such as Paradox for Windows. These character sets are WIN1250, WIN1251, WIN1252, WIN1253, and WIN1254.

Due to Borland’s historical association with Paradox and dBase, the names of collation sequences for these character sets that are specific to Paradox for Windows begin “PXW” and correspond to the Paradox/dBase language drivers supplied in the now obsolete Borland Database Engine (BDE).

For more information about Windows character sets and Paradox for Windows collations, see the appropriate BDE documentation and driver books.

For a list of the international character sets and collation sequences that Firebird supports out of the box,

Collation Sequence

Each character set has a default collation sequence that specifies how its symbols are sorted and ordered. Collation sequence determines the rules of precedence that Firebird uses to sort, compare, and transliterate character data.

Because each character set has its own subset of possible collation sequences, the character set that you choose when you define the column limits your choice. You must choose a collation sequence that is supported for the column’s character set.

Collation sequence for a column is specified when the column is created or altered. When set at the column level, it overrides any collation sequence set at the domain level.

Listing Available Collation Sequences

The following query yields a list of character sets with the available collation sequences:

Naming of Collation Sequences

Many Firebird collation names use the naming convention XX_YY, where XX is a two- letter language code and YY is a two-letter country code. For example, DE_DE is the collation name for German as used in Germany; FR_FR is for French as used in France; and FR_CA is for French as used in Canada.

Where a character set offers a choice of collations, the one with the name matching the character set is the default collation sequence, which implements binary collation for the character set. Binary collation sorts a character set by the numeric codes5 used to represent the characters. Some character sets support alternative collation sequences using different rules for determining precedence.

This section explains how to specify collation sequence for character sets in domains and table columns, in string comparisons, and in ORDER BY and GROUP BY clauses.

Collation Sequence for a Column

When a CHAR or VARCHAR column is created for a table, either with CREATE TABLE or ALTER TABLE, the collation sequence for the column can be specified using the COLLATE clause. COLLATE is especially useful for character sets such as ISO8859_1 or DOS437 that support many different collation sequences.

For example, the following dynamic ALTER TABLE statement adds a new column to a table, and specifies both a character set and a collation sequence:

Collation Sequence for String Comparisons

It can be necessary to specify a collation sequence when CHAR or VARCHAR values are compared in a WHERE clause, if the values being compared use different collation sequences and it matters to the result.

To specify the collation sequence to use for a value during a comparison, include a COLLATE clause after the value. For example, the following WHERE clause fragment forces the column value to the left of the equivalence operator to be compared with the input parameter using a specific collation sequence:

In this case, without matching collation sequences, the candidates for “greater than” might be different for each collation sequence.

Collation Sequence in Sort Criteria

When CHAR or VARCHAR columns are ordered in a SELECT statement, it can be necessary to specify a collation order for the ordering, especially if columns used for ordering use different collation sequences.

To specify the collation sequence to use for ordering a column in the ORDER BY clause, include a COLLATE clause after the column name. For example, in the following ORDER BY clause, the collation sequences for two columns are specified:

For the complete syntax of the ORDER BY clause, see Chapter Ordered and Aggregated Sets.

Collation Sequence in a GROUP BY Clause

When CHAR or VARCHAR columns are grouped in a SELECT statement, it can be necessary to specify a collation order for the grouping, especially if columns used for grouping use different collation sequences.

To specify the collation sequence to use for grouping columns in the GROUP BY clause, include a COLLATE clause after the column name. For example, in the following GROUP BY clause, the collation sequences for multiple columns are specified:

Collation Sequence and Index Size

If you specify a non-binary collation (one other than the default collation) for a character set, the index key can become larger than the stored string if the collation includes precedence rules of second, third, or fourth order.

Non-binary collations for ISO8859_1, for example, use full dictionary sorts, with spaces and punctuation of fourth-order importance:

First order: A is different from B. Second order: A is different from À. Third order: A is different from a.

Fourth order: The type of punctuation symbol (hyphen, space, apostrophe) is important.

For example:

Greenfly Green fly Green-fly Greensleeves Green sleeves Green spot

If spaces and punctuation marks are treated instead as a first-order difference, the same list would be sorted as follows:

Greenfly Greensleeves Green fly Green sleeves Green spot Green-fly

How Non-Binary Collations Can Limit Index Size

When an index is created, it uses the collation sequence defined for each text segment in the index. Using ISO8859_1, a single-byte character set, with the default collation, the index structure can hold about 252 characters (fewer if it is a multi-segment index). However, if you choose a non-binary collation for ISO8859_1, the index structure can hold only 84 characters, despite the fact that the characters in the column being indexed occupy only 1 byte each.

Custom Character Sets and Collations

It is possible to write your own character sets and collations and have the Firebird engine load them from a shared library, which should be named fbintl2 in order to be recognized and linked.

It is also possible to implement custom character sets or collations using user-defined functions (UDFs) to transliterate input. The Firebird 1.5 engine automatically uses UDFs with names specially formatted to be recognized as character sets and collations. The name "USER_CHARSET_nnn" indicates a character set, while "USER _TRANSLATE _nnn _nnn" and "USER_TEXTTYPE_nnn" indicate a character set plus collation sequence. (nnn represents three-digit numbers, usually in the range 128 to 254.)

It is an advanced topic, beyond the scope of this guide. The developer of the fbintl2 plug-in for custom character sets, David Brookestone Schnepper, makes a do-it-yourself kit freely available that contains sample C code, mappings, and instructions. Because the kit includes lucid instructions for creating character sets, it is also a useful reference if you plan to use the UDF approach to implement a custom character set.

In an extreme situation—where you are using a non-standard operating system that requires a character set name that Firebird does not already support as an alias—you can add one. It is not very straightforward and requires updating the system tables directly—a practice that, as a general rule, should be avoided at all costs. Before you decide to add a custom alias, be certain that Firebird does not already support the alias you need.

Installing a custom aliasinvolves inserting a row directly into the RDB$TYPES table. Get the character set ID of the character set that you want to alias —in RDB$CHARACTER _SETS it is the value in the RDB$CHARACTER _SET_ID column— and make sure you have the exact literal string that your operating system recognizes for the character set you want supported. Suppose you want to add an alias for the character set ISO8859_1 that your OS recognizes by the literal “LC _ISO88591”. First, get the character set ID by querying RDB$CHARACTER_SETS using isql or another interactive query tool:

The example returns the character set ID “21”. Next, prepare and run an INSERT statement to add your alias to RDB$TYPES: The technique is relatively painless if your custom alias represents a character set needed for a specific column or domain, but is not required as the default character set of the database. Just make sure that the alias definition exists before you create any column or domain that needs to use it. There is a catch-22 problem if your operating system genuinely cannot support an existing character set or alias for the character set you need to use as the default. Your database cannot “know” about your custom alias until after the database is created and an RDB$TYPES table actually exists. The default character set of the database is defined in CREATE DATABASE, at which point only the system-defined aliases are available. By the time RDB\$TYPES exists, it is too late to assign the default character set.

Because Firebird SQL currently provides no way to update the default character set attribute —it is not supported in the ALTER DATABASE syntax —the only workaround is to create the database first and then, before doing anything else, install your alias as described, commit it, and proceed to update the database header record directly: