I18N - Charset: Difference between revisions

Latest revision as of 19:04, 21 June 2024

Introduction

The D3 code that handles the GUI bitmap font can only load a specific range of bytes as characters. To get the most out of the available entries, special charsets are used. The fonts (Carleton for the menu f.i.) are built/patched so that the right characters appear in the right place.

Encodings

all.lang

This file is in UTF-8 and is converted with the help of the script devel/gen_lang.pl:

perl devel/gen_lang.pl

This ensures that the generated language files are in their proper encodings (see below).

All other language files

Note that the language files (f.i. strings/german.lang) as well as the readables and the FM dictionaries are expected to be in the following encodings:

Czech, Hungarian, Slovak, Polish: ISO-8859-2 (not WIN-1250!)
Russian: WIN-1251
French: ISO-8859-15
Romanian: ISO-8859-16
All other languages: ISO-8859-1 (German, Dutch, Danish, Swedish, Portuguese, etc.)

The core dictionaries are automatically generated in the right encoding, but make sure that you use the right encoding for the FM dictionary, too!
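When preparing an FM dictionary by hand, the list above can be checked mechanically. The following is a minimal sketch, not part of TDM: the lowercase language keys and the helper names are assumptions made for illustration, while the codec names are the standard Python names for the encodings listed above.

```python
# Python codec names for the encodings listed above.
# The language keys are illustrative, not taken from TDM itself.
LANG_ENCODINGS = {
    "czech": "iso8859_2",
    "hungarian": "iso8859_2",
    "slovak": "iso8859_2",
    "polish": "iso8859_2",
    "russian": "cp1251",
    "french": "iso8859_15",
    "romanian": "iso8859_16",
}
DEFAULT_ENCODING = "iso8859_1"  # all other languages


def encoding_for(language: str) -> str:
    """Return the expected codec for a language name."""
    return LANG_ENCODINGS.get(language.lower(), DEFAULT_ENCODING)


def encode_for_language(text: str, language: str) -> bytes:
    """Encode text for a language file; raises UnicodeEncodeError on
    characters that the target encoding cannot represent."""
    return text.encode(encoding_for(language), errors="strict")
```

Encoding with strict error handling makes a wrong choice visible early: a Polish string containing ł fails under ISO-8859-1 instead of silently producing a wrong byte.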

Character remapping

The characters are remapped upon loading the dictionary/readable, from their native encoding to the special one that TDM uses and that is described here. The remapping is driven by mapping files, f.i. "strings/czech.map". If a map file for a specific language is not found, "strings/default.map" is used instead. If that file is not found either, no remapping takes place.

See Character mapping for more information.
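The lookup order described above can be sketched as follows. This is only an illustration of the fallback logic, not the actual TDM C++ code; the function name and the `strings_dir` parameter are assumptions.

```python
from pathlib import Path
from typing import Optional


def find_map_file(language: str, strings_dir: Path) -> Optional[Path]:
    """Return the map file to use for a language, or None.

    Order: strings/<language>.map, then strings/default.map.
    None means that no remapping takes place.
    """
    for name in (f"{language}.map", "default.map"):
        candidate = strings_dir / name
        if candidate.is_file():
            return candidate
    return None
```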

European Languages

This mapping is used for European languages, f.i. Czech, French, German, Spanish, Portuguese, Polish. Note that the double accented characters in Hungarian Ő, ő, Ű and ű look a bit different from Ö, ö, Ü and ü!

In the table below, the original ISO 8859-1 characters are given in () below the TDM character.

Color code:

Unusable
Unused
Usable in v1.08
Usable in v2.03
Changed from ISO 8859-1

	…0	…1	…2	…3	…4	…5	…6	…7	…8	…9	…A	…B	…C	…D	…E	…F
0…	00 –	01 –	02 –	03 –	04 –	05 –	06 –	07 –	08 –	09 –	0A –	0B –	0C –	0D –	0E –	0F –
1…	10 –	11 –	12 –	13 –	14 –	15 –	16 –	17 –	18 –	19 –	1A –	1B –	1C –	1D –	1E –	1F –
2…	20	21 !	22 "	23 #	24 $	25 %	26 &	27 '	28 (	29 )	2A *	2B +	2C ,	2D -	2E .	2F /
3…	30 0	31 1	32 2	33 3	34 4	35 5	36 6	37 7	38 8	39 9	3A :	3B ;	3C <	3D =	3E >	3F ?
4…	40 @	41 A	42 B	43 C	44 D	45 E	46 F	47 G	48 H	49 I	4A J	4B K	4C L	4D M	4E N	4F O
5…	50 P	51 Q	52 R	53 S	54 T	55 U	56 V	57 W	58 X	59 Y	5A Z	5B [	5C \	5D ]	5E ^	5F _
6…	60 `	61 a	62 b	63 c	64 d	65 e	66 f	67 g	68 h	69 i	6A j	6B k	6C l	6D m	6E n	6F o
7…	70 p	71 q	72 r	73 s	74 t	75 u	76 v	77 w	78 x	79 y	7A z	7B {	7C |	7D }	7E ~	7F �
8…	80 Ň	81 Ś	82 Ć	83 Ż	84 Ź	85 Ŝ	86 Ĉ	87 Ẑ	88 Ô [note]	89 Ŕ	8A Ǔ	8B Ă	8C Ń	8D Ș	8E Ț	8F �
9…	90 đ	91 ś	92 ć	93 ż	94 ź	95 ŝ	96 ĉ	97 ẑ	98 ô [note]	99 ŕ	9A ǔ	9B ă	9C ń	9D ș	9E ț	9F �
A…	A0 NBSP	A1 ň (¡)	A2 Ű (¢)	A3 ě (£)	A4 ű (¤)	A5 Ě (¥)	A6 Š (¦)	A7 §	A8 š (¨)	A9 Ů (©)	AA Ą (ª)	AB Ę («)	AC Č (¬)	AD SHY	AE č (®)	AF ů (¯)
B…	B0 Ő (°)	B1 Ł (±)	B2 Ť (²)	B3 Ď (³)	B4 Ž (´)	B5 ł (µ)	B6 ť (¶)	B7 ď (·)	B8 ž (¸)	B9 ő (¹)	BA ą (º)	BB ę (»)	BC Œ (¼)	BD œ (½)	BE Ÿ (¾)	BF ¿
C…	C0 À	C1 Á	C2 Â	C3 Ã	C4 Ä	C5 Å	C6 Æ	C7 Ç	C8 È	C9 É	CA Ê	CB Ë	CC Ì	CD Í	CE Î	CF Ï
D…	D0 Ð	D1 Ñ	D2 Ò	D3 Ó	D4 Ô	D5 Õ	D6 Ö	D7 Ř (×)	D8 Ø	D9 Ù	DA Ú	DB Û	DC Ü	DD Ý	DE Þ	DF ß
E…	E0 à	E1 á	E2 â	E3 ã	E4 ä	E5 å	E6 æ	E7 ç	E8 è	E9 é	EA ê	EB ë	EC ì	ED í	EE î	EF ï
F…	F0 ð	F1 ñ	F2 ò	F3 ó	F4 ô	F5 õ	F6 ö	F7 ř (÷)	F8 ø	F9 ù	FA ú	FB û	FC ü	FD ý	FE þ	FF ÿ

[note] As discussed in the forum thread "Analysis of 2.12 Fonts", the TDM charset treats two characters redundantly:

Ô appears at 0x88 and 0xD4
ô appears at 0x98 and 0xF4

For TDM 2.13, the plan is to remove this redundancy and introduce these new characters (from ISO-8859-3) in the freed positions:

Ğ appears at 0x88
ğ appears at 0x98

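For a mapping of these 256 codepoints to Unicode U+NNNN values and formal names, download 'TDM 8859-Style Font Map to Unicode-16.txt':

for TDM 2.13: https://drive.google.com/file/d/1wLEtuFvnrZ-WHK8ign8i3diIXZdriZ4F/view?usp=sharing
for TDM 2.12 and earlier: https://drive.google.com/file/d/1UAz9jSZpT_j33STP3So_Re8JWa8QuAmz/view?usp=sharing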

Each of these files (by Geep, 2024) is in a standardized format so that it can also be imported into font design programs like FontForge as a custom 256-position map. In the comments, there is additional information about:

ISO 8859-x sourcing of each character.
alternative representations of some European and control characters.
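As a rough illustration of the European table in this section, the sketch below decodes TDM-encoded bytes to Unicode. It is hand-transcribed from the table as it stands for TDM 2.12 and earlier (Ô/ô still at 0x88/0x98), falls back to ISO 8859-1 for every position the table does not change, and passes the unused positions (0x7F, 0x8F, 0x9F) through as control characters. The names `TDM_OVERRIDES` and `tdm_decode` are assumptions; the downloadable map files remain the authoritative reference.

```python
import unicodedata


def _row(start: int, chars: str) -> dict:
    """Map consecutive byte values, beginning at start, to the given characters."""
    chars = unicodedata.normalize("NFC", chars)  # one code point per letter
    return {start + i: ch for i, ch in enumerate(chars)}


# Positions where the TDM charset differs from ISO 8859-1.
TDM_OVERRIDES = {
    **_row(0x80, "ŇŚĆŻŹŜĈẐÔŔǓĂŃȘȚ"),   # 0x80..0x8E
    **_row(0x90, "đśćżźŝĉẑôŕǔăńșț"),   # 0x90..0x9E
    **_row(0xA1, "ňŰěűĚŠ"),            # 0xA1..0xA6
    **_row(0xA8, "šŮĄĘČ"),             # 0xA8..0xAC
    **_row(0xAE, "čů"),                # 0xAE..0xAF
    **_row(0xB0, "ŐŁŤĎŽłťďžőąęŒœŸ"),   # 0xB0..0xBE
    0xD7: "Ř",
    0xF7: "ř",
}


def tdm_decode(data: bytes) -> str:
    """Decode TDM-encoded bytes; unchanged positions decode as ISO 8859-1."""
    return "".join(TDM_OVERRIDES.get(b, chr(b)) for b in data)
```

The redundancy mentioned in the note shows up directly: `tdm_decode(b"\x88")` and `tdm_decode(b"\xd4")` both yield Ô.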

Russian

Characters conform to the WIN-1251 native encoding, shown in the Wikipedia article. Exception: the character 0xFF (я) is mapped to 0xB6 upon loading. Therefore any Russian font must contain я at the place 0xB6.
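The single exception can be sketched as a byte translation applied after encoding to WIN-1251. This is a minimal illustration of the rule above, not the actual loader code; the names `RUSSIAN_REMAP` and `russian_to_tdm` are assumptions.

```python
# Translation table that moves byte 0xFF (я in WIN-1251) to 0xB6
# and leaves every other byte unchanged.
RUSSIAN_REMAP = bytes.maketrans(b"\xff", b"\xb6")


def russian_to_tdm(text: str) -> bytes:
    """Encode text as WIN-1251, then apply the 0xFF -> 0xB6 remapping."""
    return text.encode("cp1251").translate(RUSSIAN_REMAP)
```

In WIN-1251, 0xB6 is the pilcrow (¶), so after this step a pilcrow in the source text would be drawn with the я glyph. The sketch does not try to detect that case.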

Asian Languages (Korean, Chinese, Japanese)

The original D3 had support for these languages, so it might be possible to add them to TDM, too. At the moment, however, we lack the fonts and translators. Also, writing from right to left (Hebrew) or top-down (Japanese) might be tricky or outright impossible in our GUI without more work in the C++ code. Plus, these languages use more than 256 different characters, and an 8-bit table cannot hold them.

European Character Implementation Priority - the "Top 50"

Some of the special characters are used more often than others. The following statistic covers the entire string set of the TDM core, from TDM v1.08, and shows the top 50 most-used characters (excluding a-z, 0-9 and Russian characters):

Rank	Letter	Occurrences	Remarks	Rank	Letter	Occurrences	Remarks
1	í	715		25	ć	67
2	é	674		26	è	65
3	á	524		27	ú	56
4	ø	303	Danish	28	ê	52
5	č	288		29	ö	48	German
6	ó	283		30	É	46
7	ü	270	German	31	ñ	37
8	ł	203	Polish	32	õ	32
9	æ	200	Danish	33	ń	26
10	ě	182		34	Ł	24
11	ř	175	Czech	35	Š	21
12	ã	168		36	â	21
13	ž	148	Czech	37	ź	20
14	ý	142		38	ß	18	German
15	ę	141		39	Ó	18
16	ą	140		40	ň	15
17	ż	119		41	Ú	15
18	å	109	Danish	42	Á	13
19	š	99		43	î	12
20	ś	97		44	ť	11
21	ç	91		45	ô	9
22	ä	86	German	46	Ž	8
23	à	83		47	Ż	7
24	ů	77		48	Č	7
25	ć	67		49	ù	6
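A statistic like the one in the table above could be gathered roughly as follows. This is a sketch under assumptions: how the original numbers were produced is not documented here, the function names are made up, and the filter skips all ASCII characters (not just a-z and 0-9) as well as the Unicode Cyrillic block.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def is_counted(ch: str) -> bool:
    """True for non-ASCII letters outside the Cyrillic block."""
    if ch.isascii():
        return False  # assumption: skip all ASCII, not only a-z and 0-9
    if "\u0400" <= ch <= "\u04ff":
        return False  # Russian characters are excluded
    return ch.isalpha()


def top_special_chars(strings: Iterable[str], n: int = 50) -> List[Tuple[str, int]]:
    """Return the n most frequent special letters with their counts."""
    counts = Counter(ch for s in strings for ch in s if is_counted(ch))
    return counts.most_common(n)
```

The input would be the already decoded (Unicode) strings of all languages, for example the values read from all.lang.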

Although ö, ä and ü do not appear that often, these letters together with Ü, Ö, Ä and ß are all that German needs beyond ASCII. Adding them to the fonts is therefore quite important.

Preferably, all foreign letters would be added to the fonts (see Font Patcher or Refont). However, if time only permits adding a few, í would be more important than, say, ô.

It is commonplace for missing accented letters to be redirected in the .dat file to the corresponding unaccented base letter.
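To decide which base letter a missing accented letter should fall back to, Unicode decomposition is one possible aid. The sketch below only computes the base letter; it does not read or write .dat files, and the function name is an assumption.

```python
import unicodedata


def base_letter(ch: str) -> str:
    """Return ch without combining accents, or ch itself if nothing is left
    or the character has no canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", ch)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch
```

Letters such as ł, ø and ß have no canonical decomposition and are returned unchanged, so their fallback (if any) has to be chosen by hand.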

By 2014, all the system (e.g., main menu) fonts (Carleton, Carleton_condensed, Stone in sizes 24pt and 48pt; Mason and Mason_glow in 48pt) had good coverage of the "Top 50" and beyond, although not all 256 codepoints. This was confirmed more specifically in 2024, as part of an Analysis of 2.12 Fonts. This analysis indicated that, unlike the system fonts, the FM fonts generally did not provide specific glyphs beyond ASCII.

In 2024 for TDM 2.13, Stone 24pt, important for subtitles, was further extended to cover all 256 codepoints.


@@ Line 3: / Line 3: @@
 The D3 code that handles the GUI bitmap font can only load a specific range of bytes as characters. To get the most out of the available entries, special charsets are used. The fonts (Carleton for the menu f.i.) are build/patched so that the right characters appear in the right place.
-=== Encodings ===
+== Encodings ==
+=== all.lang ===
+This file is in '''UTF-8''', and converted with the help of the script '''devel/gen_lang.pl''':
+ perl devel/gen_lang.pl
+This ensures that the generated language files are in their proper encodings (see below).
+=== All other language files ===
 Note that the language files (f.i. '''strings/german.lang''') as well as the readables and the FM dictionariaries are expected to be in the following encodings:
-* '''Czech:''' ISO-8859-2
+* '''Czech''', '''Hungarian''', '''Slovak''', '''Polish:''' [https://secure.wikimedia.org/wikipedia/en/wiki/ISO/IEC_8859-2 ISO-8859-2] ('''not WIN-1250!)
-* '''Russian:''' WIN-1251
+* '''Russian:''' [https://secure.wikimedia.org/wikipedia/en/wiki/Win-1251 WIN-1251]
-* '''All other languages:''' ISO-8859-1
+* '''French:''' [https://secure.wikimedia.org/wikipedia/en/wiki/ISO/IEC_8859-15 ISO-8859-15]
+* '''Romanian:''' [https://secure.wikimedia.org/wikipedia/en/wiki/ISO/IEC_8859-16 ISO-8859-16]
+* '''All other languages:''' [https://secure.wikimedia.org/wikipedia/en/wiki/ISO/IEC_8859-1 ISO-8859-1] (German, Dutch, Danish, Swedish, Portuguese, etc.)
-The characters are remapped upon loading the dictionary/readable. Responsible for the remapping are [[I18N - Character mapping|mapping files]], f.i. "strings/czech.map". If a map file for a specific language is not found, "strings/default.map" is used instead.
-== European Languages ==
+{{infobox|The core dictionaries are automatically generated in the right encoding, but make sure that you use the right encoding for the FM dictionary, too!}}
+== Character remapping ==
+The characters are remapped upon loading the dictionary/readable, from their native encoding to the special one that TDM uses and that is described here.  Responsible for the remapping are [[I18N - Character mapping|mapping files]], f.i. "strings/czech.map". If a map file for a specific language is not found, "strings/default.map" is used instead, if this is not found, no remapping takes place.
+See '''[[I18N - Character mapping|Character mapping]]''' for more information.
+=== European Languages ===
 This mapping is used for European languages, f.i. '''Czech''', '''French''', '''German''', '''Spanish''', '''Portuguese''', '''Polish'''. Note that the double accented characters in Hungarian '''Ő, ő, Ű and ű''' look a bit different from '''Ö, ö, Ü and ü'''!
-In the table below, the original IS= 8859-1 characters are given in ''()'' below the TDM character.
+In the table below, the original ISO 8859-1 characters are given in ''()'' below the TDM character.
 '''Color code:'''
-{{box|#f0d0d0|Character not displayed by D3|Unusable}}{{box|#d0d0f0|Changed from the ISO-8859-1 default|Changed}}
+{{box|#f0d0d0|Character not usable by TDM|Unusable}}{{box|#d0e0d0|Character not yet used in TDM|Unused}}{{box|#c0ffc0|Character displayed in v1.08 or newer|Usable in v1.08}}{{box|#80f080|Character displayed in v2.03 or newer|Usable in v2.03}}{{box|#d0d0f0|Changed from the ISO-8859-1 default, usable by TDM 1.0 or newer|Changed from ISO 8859-1}}
 {|class="wikitable" border=1 style="border-collapse: collapse; font-size: 95%" cellspacing=0 cellpadding=2 width=100%
@@ Line 194: / Line 213: @@
 |align='center'|7D<br>'''}'''
 |align='center'|7E<br>'''~'''
-|align='center' style='background: #f0d0d0'|7F<br>'''–'''
+|align='center' style='background: #d0e0d0'|7F<br>'''�'''
 |-
 !8…
-|align='center' style='background: #f0d0d0'|80<br>'''–'''
+|align='center' style='background: #c0ffc0'|80<br>'''Ň'''
-|align='center' style='background: #f0d0d0'|81<br>'''–'''
+|align='center' style='background: #c0ffc0'|81<br>'''Ś'''
-|align='center' style='background: #f0d0d0'|82<br>'''–'''
+|align='center' style='background: #c0ffc0'|82<br>'''Ć'''
-|align='center' style='background: #f0d0d0'|83<br>'''–'''
+|align='center' style='background: #c0ffc0'|83<br>'''Ż'''
-|align='center' style='background: #f0d0d0'|84<br>'''–'''
+|align='center' style='background: #c0ffc0'|84<br>'''Ź'''
-|align='center' style='background: #f0d0d0'|85<br>'''–'''
+|align='center' style='background: #c0ffc0'|85<br>'''Ŝ'''
-|align='center' style='background: #f0d0d0'|86<br>'''–'''
+|align='center' style='background: #c0ffc0'|86<br>'''Ĉ'''
-|align='center' style='background: #f0d0d0'|87<br>'''–'''
+|align='center' style='background: #c0ffc0'|87<br>'''Ẑ'''
-|align='center' style='background: #f0d0d0'|88<br>'''–'''
+|align='center' style='background: #c0ffc0'|88<br>'''Ô''' [note]
-|align='center' style='background: #f0d0d0'|89<br>'''–'''
+|align='center' style='background: #c0ffc0'|89<br>'''Ŕ'''
-|align='center' style='background: #f0d0d0'|8A<br>'''–'''
+|align='center' style='background: #c0ffc0'|8A<br>'''Ǔ'''
-|align='center' style='background: #f0d0d0'|8B<br>'''–'''
+|align='center' style='background: #c0ffc0'|8B<br>'''Ă'''
-|align='center' style='background: #f0d0d0'|8C<br>'''–'''
+|align='center' style='background: #c0ffc0'|8C<br>'''Ń'''
-|align='center' style='background: #f0d0d0'|8D<br>'''–'''
+|align='center' style='background: #80f080'|8D<br>'''Ș'''
-|align='center' style='background: #f0d0d0'|8E<br>'''–'''
+|align='center' style='background: #80f080'|8E<br>'''Ț'''
-|align='center' style='background: #f0d0d0'|8F<br>'''–'''
+|align='center' style='background: #d0e0d0'|8F<br>'''�'''
 |-
 !9…
-|align='center' style='background: #f0d0d0'|90<br>'''–'''
+|align='center' style='background: #80f080'|90<br>'''đ'''
-|align='center' style='background: #f0d0d0'|91<br>'''–'''
+|align='center' style='background: #c0ffc0'|91<br>'''ś'''
-|align='center' style='background: #f0d0d0'|92<br>'''–'''
+|align='center' style='background: #c0ffc0'|92<br>'''ć'''
-|align='center' style='background: #f0d0d0'|93<br>'''–'''
+|align='center' style='background: #c0ffc0'|93<br>'''ż'''
-|align='center' style='background: #f0d0d0'|94<br>'''–'''
+|align='center' style='background: #c0ffc0'|94<br>'''ź'''
-|align='center' style='background: #f0d0d0'|95<br>'''–'''
+|align='center' style='background: #c0ffc0'|95<br>'''ŝ'''
-|align='center' style='background: #f0d0d0'|96<br>'''–'''
+|align='center' style='background: #c0ffc0'|96<br>'''ĉ'''
-|align='center' style='background: #f0d0d0'|97<br>'''–'''
+|align='center' style='background: #c0ffc0'|97<br>'''ẑ'''
-|align='center' style='background: #f0d0d0'|98<br>'''–'''
+|align='center' style='background: #c0ffc0'|98<br>'''ô''' [note]
-|align='center' style='background: #f0d0d0'|99<br>'''–'''
+|align='center' style='background: #c0ffc0'|99<br>'''ŕ'''
-|align='center' style='background: #f0d0d0'|9A<br>'''–'''
+|align='center' style='background: #c0ffc0'|9A<br>'''ǔ'''
-|align='center' style='background: #f0d0d0'|9B<br>'''–'''
+|align='center' style='background: #c0ffc0'|9B<br>'''ă'''
-|align='center' style='background: #f0d0d0'|9C<br>'''–'''
+|align='center' style='background: #c0ffc0'|9C<br>'''ń'''
-|align='center' style='background: #f0d0d0'|9D<br>'''–'''
+|align='center' style='background: #80f080'|9D<br>'''ș'''
-|align='center' style='background: #f0d0d0'|9E<br>'''–'''
+|align='center' style='background: #80f080'|9E<br>'''ț'''
-|align='center' style='background: #f0d0d0'|9F<br>'''–'''
+|align='center' style='background: #d0e0d0'|9F<br>'''�'''
 |-
@@ Line 349: / Line 368: @@
 |}
+[note] As discussed [https://forums.thedarkmod.com/index.php?/topic/22427-analysis-of-212-tdm-fonts/&do=findComment&comment=494855 here], the TDM char set has a redundant treatment of 2 characters:
+* Ô appears at 0x88 and 0xD4
+* ô appears at 0x98 and 0xF4
+Planned for TDM 2.13, the redundancy can be removed and these new characters (from ISO-8859-3) introduced:
+* Ğ appears at 0x88
+* ğ appears at 0x98
+For a mapping of these 256 codepoints to Unicode U+NNNN values and formal names, download 'TDM 8859-Style Font Map to Unicode-16.txt':
+* [https://drive.google.com/file/d/1wLEtuFvnrZ-WHK8ign8i3diIXZdriZ4F/view?usp=sharing for TDM 2.13]
+* [https://drive.google.com/file/d/1UAz9jSZpT_j33STP3So_Re8JWa8QuAmz/view?usp=sharing for TDM 2.12 and earlier]
+Each of these files (by Geep, 2024) is in a standardized format so that it can also be imported into font design programs like FontForge as a custom 256-position map. In the comments, there is additional information about:
+* ISO 8859-x sourcing of each character.
+* alternative representations of some European and control characters.
 === Russian ===
-The character '''0xFF''' (я) is mapped to '''0xB6''' upon loading. Therefore any Russian font must contain я at the place 0xB6.
+Characters conform to the [https://en.wikipedia.org/wiki/Win-1251 WIN-1251 native encoding, shown in the Wikipedia article]. Exception: the character '''0xFF''' (я) is mapped to '''0xB6''' upon loading. Therefore any Russian font must contain я at the place 0xB6.
+=== Asian Languages (Korean, Chinese, Japanese) ===
+The original D3 had support for these languages, so it might be possible to add them to TDM, too. At the moment, however, we lack the fonts and translators. Also, writing from right-to-left (Hebrew) or top-down (Japanese) might be tricky or outright impossible in our GUI without more work in the C++ code. Plus, these languages use more than 256 different characters, and an 8 bit table will not hold these.
+== European Character Implementation Priority - the "Top 50" ==
+Some of the special characters are used more often than others. Here is a statistic over the entire string set of the TDM core, from TDM v1.08, showing the top 50 most-used characters (excluding a-z, 0-9 and russian characters):
+{|class="wikitable" border=1 style="border-collapse: collapse; font-size: 85%" cellspacing=0 cellpadding=2
+|-
+|Rank
+|Occurances
+|Letter
+|Remarks
+|Rank
+|Occurances
+|Letter
+|Remarks
+|-
+|1
+|í
+|715
+|
+|25
+|ć
+|67
+|
+|-
+|2
+|é
+|674
+|
+|26
+|è
+|65
+|
+|-
+|3
+|á
+|524
+|
+|27
+|ú
+|56
+|
+|-
+|4
+|ø
+|303
+|Danish
+|28
+|ê
+|52
+|
+|-
+|5
+|č
+|288
+|
+|29
+|ö
+|48
+|German
+|-
+|6
+|ó
+|283
+|
+|30
+|É
+|46
+|
+|-
+|7
+|ü
+|270
+|German
+|31
+|ñ
+|37
+|
+|-
+|8
+|ł
+|203
+|Polish
+|32
+|õ
+|32
+|
+|-
+|9
+|æ
+|200
+|Danish
+|33
+|ń
+|26
+|
+|-
+|10
+|ě
+|182
+|
+|34
+|Ł
+|24
+|
+|-
+|11
+|ř
+|175
+|Czech
+|35
+|Š
+|21
+|
+|-
+|12
+|ã
+|168
+|
+|36
+|â
+|21
+|
+|-
+|13
+|ž
+|148
+|Czech
+|37
+|ź
+|20
+|
+|-
+|14
+|ý
+|142
+|
+|38
+|ß
+|18
+|German
+|-
+|15
+|ę
+|141
+|
+|39
+|Ó
+|18
+|
+|-
+|16
+|ą
+|140
+|
+|40
+|ň
+|15
+|
+|-
+|17
+|ż
+|119
+|
+|41
+|Ú
+|15
+|
+|-
+|18
+|å
+|109
+|Danish
+|42
+|Á
+|13
+|
+|-
+|19
+|š
+|99
+|
+|43
+|î
+|12
+|
+|-
+|20
+|ś
+|97
+|
+|44
+|ť
+|11
+|
+|-
+|21
+|ç
+|91
+|
+|45
+|ô
+|9
+|
+|-
+|22
+|ä
+|86
+|German
+|46
+|Ž
+|8
+|
+|-
+|23
+|à
+|83
+|
+|47
+|Ż
+|7
+|
+|-
+|24
+|ů
+|77
+|
+|48
+|Č
+|7
+|
+|-
+|25
+|ć
+|67
+|
+|49
+|ù
+|6
+|
+|}
+Although ö, ä and ü do not appear that often, with only these and Ü, Ö, Ä and ß, the entire German language works. So adding these letters to the fonts is quite important.
+Preferably, all foreign letters would be added to the fonts (see [[Font Patcher]] or [[Refont]]). However, if time permits only adding a few, '''í''' would be more important than, say, '''ô'''.
+It is commonplace for missing accented letters to be redirected in the .dat file to the corresponding unaccented base letter.
+By 2014, all the system (e.g., main menu) fonts (Carleton, Carleton_condensed, Stone in sizes 24pt and 48pt; Mason and Mason_glow in 48pt) had good coverage of the "Top 50" and beyond, although not all 256 codepoints.
+This was confirmed more specifically in 2024, as part of an [https://forums.thedarkmod.com/index.php?/topic/22427-analysis-of-212-tdm-fonts/ Analysis of 2.12 Fonts]. This analysis indicated that, unlike the system fonts, the FM fonts generally did not provide specific glyphs beyond ASCII.
+In 2024 for TDM 2.13, Stone 24pt, important for subtitles, was further extended to cover all 256 codepoints.
+[[Category:Fonts]]
 {{i18n}}
-[[Category:Fonts]]