I18N - Charset: Difference between revisions

From The DarkMod Wiki
Geep (talk | contribs)
Character remapping: Rewrite intro & future extensions, change heading nesting
Geep (talk | contribs)
Rearrange sections
Line 3: Line 3:
The D3 code that handles the GUI bitmap font can only load a specific range of bytes as characters. To get the most out of the available entries, special charsets are used. The fonts (Carleton for the menu f.i.) are build/patched so that the right characters appear in the right place.
The D3 code that handles the GUI bitmap font can only load a specific range of bytes as characters. To get the most out of the available entries, special charsets are used. The fonts (Carleton for the menu f.i.) are build/patched so that the right characters appear in the right place.


== Encodings ==
== LANG File Encodings ==


=== all.lang ===
=== all.lang ===
Line 34: Line 34:
See '''[[I18N - Character mapping|Character mapping]]''' for more information.
See '''[[I18N - Character mapping|Character mapping]]''' for more information.


== Encoding for European Languages ==
== Font Encoding for European Languages ==


This TDM-specific 8-bit encoding applies to all TDM fonts for English/European languages, that is, everything except the Cyrillic-based Russian.
This TDM-specific 8-bit encoding applies to all TDM fonts for English/European languages, that is, everything except the Cyrillic-based Russian. This encoding controls the ordering within a font's binary DAT file.


Note that the double accented characters in Hungarian '''Ő, ő, Ű and ű''' look a bit different from '''Ö, ö, Ü and ü'''!
Note that the double accented characters in Hungarian '''Ő, ő, Ű and ű''' look a bit different from '''Ö, ö, Ü and ü'''!
Line 403: Line 403:
* Carleton font (regular, bold, glow) uses codepoint 0x23, ordinarily "#", to represent "superscript #", which is used in the main menu system to indicate FMs for which a translation pack is available.
* Carleton font (regular, bold, glow) uses codepoint 0x23, ordinarily "#", to represent "superscript #", which is used in the main menu system to indicate FMs for which a translation pack is available.
* Treasure Map font has characters for "pirate skull and crossbones" and "bottle of booze (or poison)".
* Treasure Map font has characters for "pirate skull and crossbones" and "bottle of booze (or poison)".
== Encoding for Russian Cyrillic ==
Characters conform to the [https://en.wikipedia.org/wiki/Win-1251 WIN-1251 native encoding, shown in the Wikipedia article]. Exception: the character '''0xFF''' (я) is mapped to '''0xB6''' upon loading. Therefore any Russian font must contain я at the place 0xB6. (This was to overcome a historical Doom3 bug that is now fixed. But to date "the remapping for Russian is still in place, tho, to avoid having to patch all the russian fonts", says bug report [https://bugs.thedarkmod.com/view.php?id=2812 "0002812: Character 0xFF does not work in fonts"]).
Within any given font, character coverage may be incomplete. A 2025 effort extended the Mason 48pt font to all 256 codepoints.
== Future Extensions to Unicode and/or Asian Languages (Korean, Chinese, Japanese)? ==
It is possible to visualize a transition to a DAT format (and engine code) that supported UCS (16-bit) Unicode, allowing the existing fonts to then be expanded to cover more characters. Then perhaps all.lang would be the only .lang needed.
As for Asian languages, the original D3 had support for them, so it might be possible to add them to TDM, too, but with a heavy burden on font development and translation. Also, writing from right-to-left (Hebrew) or top-down (Japanese) might be tricky or outright impossible in our GUI without more work in the C++ code. Plus, these languages use more than 256 different characters, and an 8 bit table will not hold these.


== European Character Implementation ==
== European Character Implementation ==
Line 697: Line 685:


In 2024, Stone 24pt, important for subtitles and a number of readables, was further extended to cover all 256 codepoints. This was tested with TDM 2.13[betas] and released with 2.14.
In 2024, Stone 24pt, important for subtitles and a number of readables, was further extended to cover all 256 codepoints. This was tested with TDM 2.13[betas] and released with 2.14.
In 2025, the Russian Mason 48pt font was improved and extended to cover all TDM Russian codepoints, and included in TDM 2.14.


In 2026, English/European Carleton 24pt, widely used in the main menu and also some readables, was improved to:
In 2026, English/European Carleton 24pt, widely used in the main menu and also some readables, was improved to:
Line 708: Line 694:


Stone 48pt improvements are under consideration for TDM 2.15.
Stone 48pt improvements are under consideration for TDM 2.15.
== Font Encoding for Russian Cyrillic ==
Characters conform to the [https://en.wikipedia.org/wiki/Win-1251 WIN-1251 native encoding, shown in the Wikipedia article]. Exception: the character '''0xFF''' (я) is mapped to '''0xB6''' upon loading. Therefore any Russian font must contain я at the place 0xB6. (This was to overcome a historical Doom3 bug that is now fixed. But to date "the remapping for Russian is still in place, tho, to avoid having to patch all the russian fonts", says bug report [https://bugs.thedarkmod.com/view.php?id=2812 "0002812: Character 0xFF does not work in fonts"]).
=== Cyrillic Character Implementation ===
Within any given font, character coverage may be incomplete.
In 2025, the Russian Mason 48pt font was improved and extended to cover all TDM Russian codepoints, and included in TDM 2.14.
== Future Extensions to Unicode and/or Asian Languages (Korean, Chinese, Japanese)? ==
It is possible to visualize a transition to a DAT format (and engine code) that supported UCS (16-bit) Unicode, allowing the existing fonts to then be expanded to cover more characters. Then perhaps all.lang would be the only .lang needed.
As for Asian languages, the original D3 had support for them, so it might be possible to add them to TDM, too, but with a heavy burden on font development and translation. Also, writing from right-to-left (Hebrew) or top-down (Japanese) might be tricky or outright impossible in our GUI without more work in the C++ code. Plus, these languages use more than 256 different characters, and an 8 bit table will not hold these.


[[Category:Fonts]]
[[Category:Fonts]]


{{i18n}}
{{i18n}}

Revision as of 20:54, 31 May 2026

Introduction

The D3 code that handles the GUI bitmap font can only load a specific range of bytes as characters. To get the most out of the available entries, special charsets are used. The fonts (for instance, Carleton for the menu) are built/patched so that the right characters appear in the right place.

LANG File Encodings

all.lang

This file is in UTF-8 and is converted into the language-specific files with the help of either the script devel/gen_lang.pl:

perl devel/gen_lang.pl

or, more recently, the Windows C++ utility program gen_lang_plus. Gen_Lang_Programs has details of these.

This process ensures that the generated language files are in their proper encodings (see below).
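As a rough illustration of what "proper encoding" means here, the sketch below converts UTF-8 text to single-byte data using Python's standard codecs. It does not reproduce the logic of gen_lang.pl or gen_lang_plus; the function name `to_lang_bytes` is hypothetical, and ISO-8859-1 and WIN-1251 (Python codec `cp1251`) are used only because they are the encodings named elsewhere on this page.

```python
def to_lang_bytes(text: str, encoding: str) -> bytes:
    """Encode text as single-byte data.

    Raises UnicodeEncodeError if a character has no slot in the target
    encoding, which is a useful early warning for a dictionary author.
    """
    return text.encode(encoding)


# German text fits ISO-8859-1; Russian text fits WIN-1251.
german = to_lang_bytes("Größe", "iso-8859-1")
russian = to_lang_bytes("моя", "cp1251")

print(german.hex(" "))
print(russian.hex(" "))
```

Each character occupies exactly one byte in the result, which is what an 8-bit .lang file requires.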

All other language files

Language-specific files (for instance, german.lang) provide string dictionaries. A system-wide set is provided in tdm_base01.pk4/strings/. A deployed FM may offer similar files in its /strings/ folder, at the very least english.lang. All such files are expected to be in the following 8-bit encodings:

The core dictionaries are automatically generated in the right encoding, but make sure that you use the right encoding for the FM dictionary, too!

Character Remapping

The characters are remapped upon loading the dictionary/readable, from their native encoding to the two special ones (respectively for Latin and Cyrillic) that TDM uses and that are described next. The remapping is driven by mapping files, for instance "strings/czech.map". If a map file for a specific language is not found in "tdm_base01.pk4/strings" (and there is no "default.map" there, which is the case these days), then no remapping takes place. Generally, ISO-8859-1 languages do not require remapping.

See Character mapping for more information.
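The idea of byte remapping can be sketched as follows. This is only an illustration: the real .map file syntax is described on the Character mapping page and is not parsed here, the dictionary `CZECH_REMAP` is a hypothetical two-entry excerpt, and ISO-8859-2 is assumed as the native encoding of the Czech text. The target positions (0xAE for č, 0xF7 for ř) are taken from the TDM font table on this page.

```python
# Hypothetical excerpt: native byte (assumed ISO-8859-2) -> TDM font byte.
CZECH_REMAP = {
    0xE8: 0xAE,  # č
    0xF8: 0xF7,  # ř
}


def build_table(remap: dict[int, int]) -> bytes:
    """Build a 256-entry translation table; unlisted bytes map to themselves."""
    table = bytearray(range(256))
    for src, dst in remap.items():
        table[src] = dst
    return bytes(table)


def remap_bytes(data: bytes, remap: dict[int, int]) -> bytes:
    """Apply all remappings in one pass, so the entries cannot chain into each other."""
    return data.translate(build_table(remap))


raw = "čaj".encode("iso-8859-2")
remapped = remap_bytes(raw, CZECH_REMAP)
print(raw.hex(" "), "->", remapped.hex(" "))
```

With an empty remap dictionary the data passes through unchanged, which mirrors the "no map file, no remapping" behavior described above.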

Font Encoding for European Languages

This TDM-specific 8-bit encoding applies to all TDM fonts for English/European languages, that is, everything except the Cyrillic-based Russian. This encoding controls the ordering within a font's binary DAT file.

Note that the double accented characters in Hungarian Ő, ő, Ű and ű look a bit different from Ö, ö, Ü and ü!

In the table below, the original ISO 8859-1 characters are given in () below the TDM character.

Color code:

Unusable · Unused · Usable in v1.08 · Usable in v2.03 · Changed from ISO 8859-1

…0 …1 …2 …3 …4 …5 …6 …7 …8 …9 …A …B …C …D …E …F
0… 00   01   02   03   04   05   06   07   08   09   0A   0B   0C   0D   0E   0F
1… 10   11   12   13   14   15   16   17   18   19   1A   1B   1C   1D   1E   1F
2… 20 SP   21 !   22 "   23 #   24 $   25 %   26 &   27 '   28 (   29 )   2A *   2B +   2C ,   2D -   2E .   2F /
3… 30 0   31 1   32 2   33 3   34 4   35 5   36 6   37 7   38 8   39 9   3A :   3B ;   3C <   3D =   3E >   3F ?
4… 40 @   41 A   42 B   43 C   44 D   45 E   46 F   47 G   48 H   49 I   4A J   4B K   4C L   4D M   4E N   4F O
5… 50 P   51 Q   52 R   53 S   54 T   55 U   56 V   57 W   58 X   59 Y   5A Z   5B [   5C \   5D ]   5E ^   5F _
6… 60 `   61 a   62 b   63 c   64 d   65 e   66 f   67 g   68 h   69 i   6A j   6B k   6C l   6D m   6E n   6F o
7… 70 p   71 q   72 r   73 s   74 t   75 u   76 v   77 w   78 x   79 y   7A z   7B {   7C |   7D }   7E ~   7F
8… 80 Ň   81 Ś   82 Ć   83 Ż   84 Ź   85 Ŝ   86 Ĉ   87   88 Ô [1]   89 Ŕ   8A Ǔ   8B Ă   8C Ń   8D Ș   8E Ț   8F
9… 90 đ   91 ś   92 ć   93 ż   94 ź   95 ŝ   96 ĉ   97   98 ô [1]   99 ŕ   9A ǔ   9B ă   9C ń   9D ș   9E ț   9F
A… A0 NBSP [2]   A1 ň (¡)   A2 Ű (¢)   A3 ě (£)   A4 ű (¤)   A5 Ě (¥)   A6 Š (¦)   A7 §   A8 š (¨)   A9 Ů (©)   AA Ą (ª)   AB Ę («)   AC Č (¬)   AD SHY [2]   AE č (®)   AF ů (¯)
B… B0 Ő (°)   B1 Ł (±)   B2 Ť (²)   B3 Ď (³)   B4 Ž (´)   B5 ł (µ)   B6 ť (¶)   B7 ď (·)   B8 ž (¸)   B9 ő (¹)   BA ą (º)   BB ę (»)   BC Œ (¼)   BD œ (½)   BE Ÿ (¾)   BF ¿
C… C0 À   C1 Á   C2 Â   C3 Ã   C4 Ä   C5 Å   C6 Æ   C7 Ç   C8 È   C9 É   CA Ê   CB Ë   CC Ì   CD Í   CE Î   CF Ï
D… D0 Ð   D1 Ñ   D2 Ò   D3 Ó   D4 Ô   D5 Õ   D6 Ö   D7 Ř (×)   D8 Ø   D9 Ù   DA Ú   DB Û   DC Ü   DD Ý   DE Þ   DF ß
E… E0 à   E1 á   E2 â   E3 ã   E4 ä   E5 å   E6 æ   E7 ç   E8 è   E9 é   EA ê   EB ë   EC ì   ED í   EE î   EF ï
F… F0 ð   F1 ñ   F2 ò   F3 ó   F4 ô   F5 õ   F6 ö   F7 ř (÷)   F8 ø   F9 ù   FA ú   FB û   FC ü   FD ý   FE þ   FF ÿ

Table Notes

[1] As discussed here, the TDM char set has a redundant treatment of 2 characters:

  • Ô appears at 0x88 and 0xD4
  • ô appears at 0x98 and 0xF4

Starting with TDM 2.13, the redundancy can be removed and these new characters (from ISO-8859-3) introduced:

  • Ğ appears at 0x88
  • ğ appears at 0x98

[2] Avoid using the non-breaking space (NBSP, 0xA0) and the soft hyphen (SHY, 0xAD) in your strings. The TDM engine has no code to respect these during word wrap. Font maintainers should probably map NBSP to the <space> glyph, and SHY to an undefined/hollow box or a zero-sized box.
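A dictionary author could check for these two codepoints before shipping a .lang file. The following is a minimal sketch that scans already-encoded 8-bit data; the function name `find_unwrappable` is hypothetical and is not part of any TDM tool.

```python
NBSP, SHY = 0xA0, 0xAD


def find_unwrappable(data: bytes) -> list[tuple[int, str]]:
    """Return (byte offset, name) for every NBSP or SHY byte in the data."""
    names = {NBSP: "NBSP", SHY: "SHY"}
    return [(i, names[b]) for i, b in enumerate(data) if b in names]


sample = b"Hello\xa0world, co\xadoperate"
hits = find_unwrappable(sample)
for offset, name in hits:
    print(f"offset {offset}: {name}")
```

The check applies to the European encoding only. In WIN-1251 data these byte values carry the same meaning, so the same scan would also work there, but that is an observation about the code page rather than something the engine enforces.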

Corresponding Unicode

For a mapping of these 256 codepoints to Unicode U+NNNN values and formal names, download 'TDM 8859-Style Font Map to Unicode-16.txt':

Each of these files (by Geep, 2024) is in a standardized format so that it can also be imported into font design programs like FontForge as a custom 256-position map. In the comments, there is additional information about:

  • ISO 8859-x sourcing of each character.
  • alternative representations of some European and control characters.
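To show how such a map could be used in a script, here is a small sketch that turns TDM-encoded bytes back into Unicode text. The dictionary `TDM_OVERRIDES` is a hypothetical, deliberately partial excerpt copied from the table above (six of the entries that differ from ISO 8859-1); a complete tool would load all 256 positions from the downloadable map file, whose exact syntax is not reproduced here.

```python
# Partial excerpt of positions where the TDM table differs from ISO 8859-1.
TDM_OVERRIDES = {
    0xAC: "Č",
    0xAE: "č",
    0xB1: "Ł",
    0xB5: "ł",
    0xD7: "Ř",
    0xF7: "ř",
}


def tdm_to_unicode(data: bytes) -> str:
    """Decode TDM font bytes, falling back to ISO 8859-1 for unlisted positions."""
    return "".join(
        TDM_OVERRIDES.get(b, bytes([b]).decode("iso-8859-1")) for b in data
    )


print(tdm_to_unicode(b"\xae\xe1j"))
```

Because the excerpt is incomplete, bytes in the 0x80 to 0xBF range that are not listed will fall back to their ISO 8859-1 meaning, which is wrong for most of that range; the fallback is only correct for the rows the table leaves unchanged.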

Fonts with Special Characters

Certain fonts have a few codepoints that purposefully diverge from the table entry.

  • Carleton font (regular, bold, glow) uses codepoint 0x23, ordinarily "#", to represent "superscript #", which is used in the main menu system to indicate FMs for which a translation pack is available.
  • Treasure Map font has characters for "pirate skull and crossbones" and "bottle of booze (or poison)".

European Character Implementation

Priority Early-On - the "Top 50"

Some of the special characters are used more often than others. The following statistics, gathered over the entire string set of the TDM core as of TDM v1.08, show the top 50 most-used characters (excluding a-z, 0-9 and Russian characters):

Rank Letter Occurrences Remarks Rank Letter Occurrences Remarks
1 í 715 25 ć 67
2 é 674 26 è 65
3 á 524 27 ú 56
4 ø 303 Danish 28 ê 52
5 č 288 29 ö 48 German
6 ó 283 30 É 46
7 ü 270 German 31 ñ 37
8 ł 203 Polish 32 õ 32
9 æ 200 Danish 33 ń 26
10 ě 182 34 Ł 24
11 ř 175 Czech 35 Š 21
12 ã 168 36 â 21
13 ž 148 Czech 37 ź 20
14 ý 142 38 ß 18 German
15 ę 141 39 Ó 18
16 ą 140 40 ň 15
17 ż 119 41 Ú 15
18 å 109 Danish 42 Á 13
19 š 99 43 î 12
20 ś 97 44 ť 11
21 ç 91 45 ô 9
22 ä 86 German 46 Ž 8
23 à 83 47 Ż 7
24 ů 77 48 Č 7
25 ć 67 49 ù 6
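A ranking of this kind could be produced along the following lines. This is a sketch, not the method actually used for the table: the function name `top_special` is hypothetical, and the filter (non-ASCII letters outside the basic Cyrillic block U+0400 to U+04FF) is an assumption about what "special characters" means here.

```python
from collections import Counter


def top_special(text: str, n: int = 50) -> list[tuple[str, int]]:
    """Count non-ASCII, non-Cyrillic letters and return the n most frequent."""

    def is_special(ch: str) -> bool:
        code = ord(ch)
        return ch.isalpha() and code > 0x7F and not (0x0400 <= code <= 0x04FF)

    return Counter(ch for ch in text if is_special(ch)).most_common(n)


ranking = top_special("café é ø я")
for letter, count in ranking:
    print(letter, count)
```

In practice the input would be the concatenated, decoded text of all core string files rather than a short literal.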

Although ö, ä and ü do not appear that often, with only these and Ü, Ö, Ä and ß, the entire German language works. So adding these letters to the fonts is quite important.

Preferably, all foreign letters would be added to the fonts (see Font Patcher or Refont). However, if time permits only adding a few, í would be more important than, say, ô.

It is commonplace for missing accented letters to be redirected in the .dat file to the corresponding unaccented base letter.
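The notion of a "corresponding unaccented base letter" can be illustrated with Unicode decomposition. This sketch only computes the base letter for a character; it does not read or write .dat files, and the function name `base_letter` is hypothetical.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Strip combining marks from a character; return it unchanged if none apply."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch


for ch in "éČůł":
    print(ch, "->", base_letter(ch))
```

Letters such as ł and ø have no canonical decomposition in Unicode, so this approach leaves them as they are; a font maintainer would have to choose their fallbacks (l, o) by hand.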

Coverage Expansion & Remaining Limitations

By 2014, all the system (e.g., main menu) fonts (Carleton, Carleton_condensed, Stone in sizes 24pt and 48pt; Mason and Mason_glow in 48pt) had good coverage of the "Top 50" and beyond, although not all 256 codepoints. This was confirmed more specifically in 2024, as part of an Analysis of 2.12 Fonts. This analysis indicated that, unlike the system fonts, the FM fonts generally did not provide specific glyphs beyond ASCII.

In 2024, Stone 24pt, important for subtitles and a number of readables, was further extended to cover all 256 codepoints. This was tested with TDM 2.13[betas] and released with 2.14.

In 2026, English/European Carleton 24pt, widely used in the main menu and also some readables, was improved to:

  • extend coverage to all codepoints
  • replace sloppy hand-drawn Latin-1 glyphs with fresh glyphs, most of them generated from TTF then edge-darkened
  • for simplicity, re-implement the red "glow" effect as if it were a drop shadow.

Testing of this is planned with TDM 2.15 betas.

Stone 48pt improvements are under consideration for TDM 2.15.

Font Encoding for Russian Cyrillic

Characters conform to the WIN-1251 native encoding, shown in the Wikipedia article. Exception: the character 0xFF (я) is mapped to 0xB6 upon loading. Therefore any Russian font must contain я at position 0xB6. (This was to overcome a historical Doom3 bug that is now fixed. But to date "the remapping for Russian is still in place, tho, to avoid having to patch all the russian fonts", says bug report "0002812: Character 0xFF does not work in fonts").
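The effect of this exception on a string can be sketched as follows. The code mimics the 0xFF to 0xB6 substitution outside the engine and is not the engine's C++ implementation; the function name `encode_russian_for_tdm` is hypothetical.

```python
def encode_russian_for_tdm(text: str) -> bytes:
    """Encode as WIN-1251, then move я from 0xFF to the slot the fonts use (0xB6)."""
    raw = text.encode("cp1251")
    return raw.replace(b"\xff", b"\xb6")


encoded = encode_russian_for_tdm("моя")
print(encoded.hex(" "))
```

One consequence is that the character WIN-1251 normally places at 0xB6 (the pilcrow, ¶) has no glyph slot of its own in a TDM Russian font.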

Cyrillic Character Implementation

Within any given font, character coverage may be incomplete.

In 2025, the Russian Mason 48pt font was improved and extended to cover all TDM Russian codepoints, and included in TDM 2.14.

Future Extensions to Unicode and/or Asian Languages (Korean, Chinese, Japanese)?

It is possible to envision a transition to a DAT format (and engine code) that supports 16-bit (UCS-2) Unicode, allowing the existing fonts to be expanded to cover more characters. Then perhaps all.lang would be the only .lang file needed.

As for Asian languages, the original D3 had support for them, so it might be possible to add them to TDM, too, but with a heavy burden on font development and translation. Also, writing from right-to-left (Hebrew) or top-down (Japanese) might be tricky or outright impossible in our GUI without more work in the C++ code. Plus, these languages use more than 256 different characters, and an 8-bit table cannot hold them.


See Also

Translation resources

Overview of translations

Translation discussions