Gen Lang Programs

By Geep, February 2025

Introduction to Gen_lang.pl ("PL") and Gen_lang_plus ("PLUS")

This describes two software implementations of command-line programs that essentially do the same thing:

The Perl program "gen_lang.pl" (dubbed here PL), created by Tels in the 2010-14 timeframe. This is easily deployed under any Linux system that has Perl installed. Version 17 is current.
The C++ program "gen_lang_plus" (dubbed here PLUS), a re-implementation with minor improvements created by Geep in 2025. This is lightly Windows-specific at this time. (Please volunteer if you want to make it run under Linux.) Version 1.0 is current.

The purpose of both is as follows.

Read an "all.lang" file, either that associated with the main-menu-system, or with a specific FM under development. This UTF-8 input file has translations of strings for all 16 TDM-supported languages. There are individual sections for each target language. Each string is of the form

"#str_<id>" "<string content>" // optional comment.

Generate a set of individual "<language>.lang" files, with the appropriate translated strings. Each file is generated in a standard 8-bit encoding (e.g., defined by one of the iso-8859 standards) designated for the target language. If a translator did not provide a given string in the target language, the master English string is automatically substituted.
Finally, as requested by command-line options, various analyses can be performed. For example, a set of "missing_<language>.txt" files can be generated, to suggest where translation efforts are still needed.

This article details both programs, with important differences noted. Running this functionality is intended to be done, infrequently, by the person overseeing translation improvements to all.lang.

For quick overviews of this process, see:

the wiki article Internationalization. This also covers the i18n.pl program, which is beyond the scope here.
the preamble of the all.lang file, found in tdm_base01.pk4's \strings\.

It is helpful to mention what these two programs do not take into account:

the <language>.map file that reroutes European codepoints from a standardized encoding to the TDM-specific custom 8-bit encoding discussed in I18N_-_Charset.
what UTF8 characters are allowed by that TDM-specific coding (as opposed to the ISO-8859 encoding).
the font and fontsize a specific string will use, and what limitations that entails, and what within-DAT glyph-substitutions may be in effect.

Runtime Environment for PLUS

To function in all aspects, PLUS requires Windows with certain UTF8 capabilities, introduced in later versions of Windows 10. Because the PLUS build is told explicitly to use UTF8, in theory what your "system locale" is set to should not matter. In particular, you are not required to set the system locale to the Windows 11 beta UTF8 codepage. (PLUS was developed on a dev box with the English (US) locale.)

You may run it under the traditional Windows console. However, that will not display UTF8 characters correctly on screen, which is mainly an issue with the -charstats and -charpunts options. Instead, use Windows Terminal (or PowerShell with UTF8 encoding). Or redirect the console output to a file (e.g., "> myfile.txt"), which can then be viewed as UTF8 with Notepad and other text editors.

Overview of Command-Line Options

For both PL and PLUS

--csv		Generate CSV file
--missing	Generate files with lists of missing strings for each language.
--charstats	Generate statistics about foreign characters.
--show-ranges	Show occupied string index ranges.
--show-holes	Show holes in the string index ranges.

With PLUS, one may use a single "-" before an option, instead of "--", and/or leave off the "show-" prefix. (Regarding "charstats", for consistency, since this option outputs to screen, PL should have used the "show-" prefix, but didn't. PLUS does, optionally.)
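
The relaxed option spelling described above can be sketched as follows. This is only an illustration, not code from the PLUS source: the function name normalize_option is made up here, and it assumes that any option may carry the "show-" prefix, which PLUS itself may not accept for every option.

```cpp
#include <string>

// Hypothetical helper (not from the PLUS source): reduce an option as typed
// on the command line to a bare name such as "ranges" or "charstats".
// Returns an empty string if the argument does not start with a dash.
std::string normalize_option(const std::string& arg) {
    std::size_t pos = 0;
    // Accept one or two leading dashes.
    while (pos < arg.size() && pos < 2 && arg[pos] == '-') ++pos;
    if (pos == 0) return "";

    std::string name = arg.substr(pos);
    // Assumption: the "show-" prefix is optional on every option.
    const std::string prefix = "show-";
    if (name.compare(0, prefix.size(), prefix) == 0) name.erase(0, prefix.size());
    return name;
}
```

With this sketch, "--show-ranges", "-show-ranges", "--ranges" and "-ranges" all reduce to "ranges".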

For PLUS Only

-old-preamble					Keep <language>.lang preamble just like that of gen_lang.pl. Don't include file generation date or PLUS version number.
-charpunts or --show-charpunts			Reports lines in which "?" was substituted for any character.
-charpunts-extra or --show-charpunts-extra	Like charpunts, but includes "// comment" characters.
-fm						Program used for an fm, not main menu; affects what's considered a valid numeric #str_<id> range.
-csv-unsorted					Like -csv, but row order stays like that of all.lang [English].

Charpunts and Charpunts-Extra Options [PLUS Only]

During generation of a particular <language>.lang file, if a valid data line contains 1 or more characters that can't be rendered in the target encoding, then a 3-line report is issued to console. Specifically:

-charpunts		Considers characters within the line's content (excluding comments).
-charpunts-extra	Same as charpunts, but also considers characters within the line's "// comment".

When entering these options, you may use the "show-" prefix if you'd like, with 1 or 2 leading dashes.

Here "// comment" means the optional comment that starts with "//" at the end of a #str_<id> line. Other comment forms are still excluded, since they won't be present in <language>.lang.

Note – it is assumed that #str_<id> contains only 0-9, A-Z, a-z, or underscore... so is not inspected by these options.

An example report for [Hungarian] encoded as iso-8859-2:

Output line 177...
from:   "#str_02208"    "Küldetési idõ"
  to:   "#str_02208"    "K�ldet�si id?"

Important: in this example, only õ, the last character in the "from:" line, could not be encoded in iso-8859-2 and so is replaced by "?" in the "to:" line. The "to:" line characters indicated by � were successfully encoded, but now are not valid UTF8, so can no longer be shown as such. That is, the "to:" line has the identical string placed into the hungarian.lang file.

Here is a -charpunts-extra example where only the comment has characters that don't encode, in this case into iso-8859-1 for [English]:

Output line 396...
from:   "#str_02482"    "Srpski"// Serbian (Српски - same exception as Russian)
  to:   "#str_02482"    "Srpski"// Serbian (?????? - same exception as Russian)

TIP: If the results of this option are too voluminous, recall that output from a console app can be directed into a file using "... > filename.txt".

These reports give the output line number within <language>.lang, rather than the input line in all.lang, because of difficulties getting the latter value right in the face of, for instance, multiline comments. (Any line numbers that PL reports are usually off.)

Old Preamble Option [PLUS Only]

Each <language>.lang file gets a standard preamble. For PL, the first line looks like this example:

// String table - english (iso-8859-1)

For PLUS, the first line by default also has UTC time and program version, like this example:

// String table - english (iso-8859-1) - created on 2025-02-15T04:00:21Z  with gen_lang_plus v. 1.0

But you can ask PLUS to use PL's format, with the "-old-preamble" option. Why? To simplify comparing a particular <language>.lang file generated with both programs. Since this is more a debugging option, it's not listed in the help, just here.
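
For illustration, the two first-line formats could be produced along these lines. This is a sketch under assumptions, not the PLUS source: the function name and parameters are hypothetical, and the version text is simply copied from the example above.

```cpp
#include <ctime>
#include <string>

// Hypothetical sketch: build the first preamble line of <language>.lang.
// With old_preamble set, the PL-style line is returned; otherwise a UTC
// timestamp and the program version are appended, as in the example above.
std::string preamble_first_line(const std::string& language,
                                const std::string& encoding,
                                bool old_preamble,
                                std::time_t when) {
    std::string line = "// String table - " + language + " (" + encoding + ")";
    if (old_preamble) return line;

    char stamp[32];
    std::tm utc = *std::gmtime(&when);  // break the time down as UTC
    std::strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%SZ", &utc);
    return line + " - created on " + stamp + "  with gen_lang_plus v. 1.0";
}
```

A caller would normally pass std::time(nullptr) as the last argument.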

Missing Option

This option alerts translators to untranslated strings. For each language (other than English, the master) it generates a file named "missing_<language>.txt", with lists of missing UTF8 strings. Each line has the quoted #str_<id> and quoted content, but not comments.

What's Considered "Missing"?

This categorization was established in PL and now mimicked in PLUS.

If for a given master [English] #str_<id>, there's no matching #str_<id> in the target language.

It's "missing", unless there's a (hard-coded) exception:

The master string's content is in the "master_non_translatable" group, namely:
- An empty or whitespaces-only string.
- "TDM" or "The Dark Mod".
- A prefix designating a function key ("F") or port ("AUX" or "JOY") followed by 1 or 2 digits.
- A special numeric format like "320x400;640x480" or "123;345" or "1x;2x".
The master string's content is a font selector, i.e., beginning with "fonts/".
The #str_<id> has a numeric <id> within a "master_id_non_translatable" range. Just one range, 02460 – 02490, is so reserved, for language names.

If the #str_<id> in English and the target language have identical content (not counting comments)

It's "missing", unless there's an exception:

The target line has a "//" comment indicating "stays the same"
The "master_non_translatable", font selector, or "master_id_non_translatable" exclusions apply.

Otherwise, it's not missing.

That also covers the remaining case, where for the matching #str_<id> in English and the target language, the content differs (ignoring comments). Surprisingly, this is considered "not missing" even when the English content exists and the target language content is an empty string.

Rarely, the master content can be empty, but the translated string may or may not be. An example is #str_02313, representing an optional second line needed for some languages.
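
The categorization above can be condensed into a short sketch. It is a reading of the rules as stated in this article, not the PL or PLUS code. All names are hypothetical, the <id> is passed without its "#str_" prefix, the "special numeric format" test is only a guess at the real pattern, and the "stays the same" check is assumed to be a plain substring match.

```cpp
#include <cctype>
#include <string>

static bool all_digits(const std::string& s) {
    if (s.empty()) return false;
    for (unsigned char c : s) if (!std::isdigit(c)) return false;
    return true;
}

// The "master_non_translatable" group, as described in the text.
bool master_non_translatable(const std::string& s) {
    bool blank = true;
    for (unsigned char c : s) if (!std::isspace(c)) { blank = false; break; }
    if (blank) return true;                        // empty or whitespace-only
    if (s == "TDM" || s == "The Dark Mod") return true;

    static const char* const prefixes[] = {"F", "AUX", "JOY"};
    for (const char* prefix : prefixes) {          // e.g. "F12", "JOY3"
        const std::string p(prefix);
        if (s.compare(0, p.size(), p) == 0) {
            const std::string rest = s.substr(p.size());
            if ((rest.size() == 1 || rest.size() == 2) && all_digits(rest)) return true;
        }
    }
    // Guess at the numeric format: only digits, 'x' and ';', with at least
    // one digit and one ';' (covers "320x400;640x480", "123;345", "1x;2x").
    bool has_digit = false, has_semi = false;
    for (unsigned char c : s) {
        if (std::isdigit(c)) has_digit = true;
        else if (c == ';') has_semi = true;
        else if (c != 'x') return false;
    }
    return has_digit && has_semi;
}

// target is nullptr when the language section has no matching #str_<id>.
bool is_missing(const std::string& id, const std::string& master,
                const std::string* target, const std::string& target_comment) {
    bool exempt = master_non_translatable(master)
               || master.compare(0, 6, "fonts/") == 0;   // font selector
    if (!exempt && all_digits(id) && id.size() <= 9) {
        const int n = std::stoi(id);
        exempt = (n >= 2460 && n <= 2490);               // language names
    }
    if (target == nullptr) return !exempt;               // no translation at all
    if (*target == master) {                             // identical content
        if (target_comment.find("stays the same") != std::string::npos) return false;
        return !exempt;
    }
    return false;            // different content, even if the target is empty
}
```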

CSV Options

These options read the contents of all.lang and rearrange them for export as a comma-separated-values "tdm_all_lang.csv" UTF8 file in the "strings" folder. The options have different generated row order:

-csv		Rows sorted by #str_<id>; seriously buggy ordering in PL, works in PLUS.
-csv-unsorted	PLUS only. Keeps row order of [English] in all.lang

The exported CSV file can then be imported into other tools for translators, e.g., Excel spreadsheets. Be aware as you process this data further: maintaining the UTF8 (or other Unicode) encoding ensures character integrity.

Generated CSV File Format

The first line of the file has the field (aka column) headers, with languages in alphabetic order:

ID,czech,danish,dutch,english,french,german,hungarian,italian,polish,portuguese,romanian,russian,slovak,spanish,swedish,turkish,comment

There are 18 comma-separated fields here, so the result is very wide. (Recall in a spreadsheet, you can hide columns not of current interest.)

Broadly, the .csv format comes in different variants. Traditionally, as here, comma is the separator. If a field contains a comma, the field overall is wrapped in double quotes. PL also so wraps if there's a contained double-quote, colon, or semi-colon. PLUS mimics that policy. PL unfortunately only considers the language fields for wrapping. PLUS goes a bit further, to consider the "comment" field.
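
A minimal sketch of that wrapping policy follows. The function name csv_field is hypothetical. The backslash-escaping of embedded double quotes is what this article describes for PLUS's comment field; applying it to every field, as done here, is an assumption.

```cpp
#include <string>

// Hypothetical sketch: wrap a CSV field in double quotes if it contains a
// comma, double quote, colon, or semicolon; otherwise return it unchanged.
std::string csv_field(const std::string& text) {
    if (text.find_first_of(",\":;") == std::string::npos) return text;

    std::string out = "\"";
    char prev = '\0';
    for (char c : text) {
        // Assumption: escape only double quotes that are not already escaped.
        if (c == '"' && prev != '\\') out += '\\';
        out += c;
        prev = c;
    }
    out += '"';
    return out;
}
```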

So, after the header each subsequent line has fields separated by commas, where:

The first field, "ID", has #str_<id>. Unlike in all.lang, there are no enclosing double quotes.
Remaining fields have translations. As expected, language fields are in the alphabetic header order shown, not necessarily the [language] section order in all.lang. English is in field 5. A field might be empty, have translation text, or have a copy of the master [English] text. (see "What's in Each Language Field" below). There are enclosing double quotes as needed.
The last field has any comment from the master [English], beginning with "//". If there was no comment, a trailing comma ends the line.

An example row with some empty fields, for the English word "Error", with no comment:

#str_02000,,Fejl,,Error,Erreur,Fehler,Hiba,Errore,Błąd,Erro,Eroare,Ошибка,Chyba,,,Hata,

Comments on source lines in sections other than [english] are not included.

Sorted and Unsorted Row Orders

With the -csv option:

For PLUS, the row order is sorted by #str_<id>.
While PL intended the row order to be sorted by #str_<id>, the actual sorting is scattershot. (A cause may be the more-recent presence of numerous non-numeric IDs. The sort uses a custom comparator, which knows only about numeric IDs. Other IDs are "tied" in order, not good for many sort algorithms. The Perl code that writes the <language>.lang files uses a better comparator.) The workaround is to import the data into your spreadsheet (keeping UTF8 encoding) and do row sorting there.

With the -csv-unsorted option [PLUS only]:

Sometimes, it is more helpful to have rows in the same order as in the master [english] section of all.lang, e.g., not necessarily sorted; so perhaps meaningfully grouped. (Caution: in all.lang, there is sometimes a preceding full-line comment that describes a given grouping. That will not be present in the .csv file.)

What's in Each Language Field

As discussed above, a field will have enclosing double quotes as needed.

The master "english" field, will simply have the [English] content (which could in rare cases be empty).

The situation for other languages is more complex. PL appears, in its separate csv implementation, to be following a middle course, between:

the "always use translation if available; otherwise English" policy for <language>.lang generation.
the categorization of the Missing Option.

To make the difference clear, we repeat the Missing Option's description, but with lines crossed out below. (PLUS follows what PL does here, but FYI, there is commented-out code that could implement the crossed-out conditions.)

A field considered "missing" will be empty (e.g., between 2 commas).

If for a given master [English] #str_<id>, there's no matching #str_<id> in the target language.

It's "missing", unless there's a (hard-coded) exception:

The master string's content is in the "master_non_translatable" group, namely:
- An empty or whitespaces-only string.
- "TDM" or "The Dark Mod".
- A prefix designating a function key ("F") or port ("AUX" or "JOY") followed by 1 or 2 digits.
- A special numeric format like "320x400;640x480" or "123;345" or "1x;2x".
~~The master string's content is a font selector, i.e., beginning with "fonts/".~~
~~The #str_<id> has a numeric <id> within a "master_id_non_translatable" range. Just one range, 02460 – 02490, is so reserved, for language names.~~

In this case, if it's not missing, the master string's content will be used.

If the #str_<id> in English and the target language have identical content (not counting comments)

It's "missing", unless there's an exception:

~~The target line has a "// comment" indicating "stays the same".~~
The "master_non_translatable", ~~font selector, or "master_id_non_translatable"~~ exclusions apply.

In this case, if it's not missing, the translated string content (same as master string's content) will be used. As indicated, any "stays the same" phrase in the line's comment within the [language] section is ignored.

Otherwise, it's not missing.

The translated string's content will be used. This is the normal "translation provided" case, where for the matching #str_<id> in English and the target language, there's different content (ignoring comments). Note that if English content exists but the target language content is an empty string, while this categorization would call it "not missing", it still has the same result: nothing between the commas.

Differences in Comments between PL and PLUS

As noted above, PLUS will enclose the comment field in double quotes if needed. Also, if the comment field contains any unescaped double-quotes, they are backslash-escaped. PL will do none of that. This is a bug. When the .csv file is imported into a spreadsheet, extra spurious columns may result, requiring manual fix-up. Also, there is a difference in how characters appear in comments. PLUS will preserve UTF8 characters. PL will replace characters not in ISO-8859-1 with "?" (resulting in the same appearance as in english.lang).

Charstats Option

Creating bitmap font glyphs can be a time-consuming process. This option provides information to suggest, among European (i.e., non-ASCII, non-Cyrillic) characters, which ones should take priority in that process, based on what translators have provided to all.lang.

The results go to the console. PL and PLUS formats differ. PL groups compactly by "Top 5", "Top 10", and so on, and is not particularly Unicode-aware. PLUS gives each character its own line, with rank and additional data. (And the output here is in UTF8 form, better viewed in Windows Terminal than the traditional console. Or redirect to a text file that, e.g., Notepad can treat as UTF8.) Example output:

Non-ASCII UTF8 characters in 'all.lang' across all languages except [Russian],
by descending frequency. (Excludes characters in comments.)

Rank, unicode codepoint, utf8 char, frequency (i.e., count out of 13563 total listed)

 1 U+00e1 á 1504
 2 U+00e9 é 1420
 3 U+00ed í 1099
 4 U+0131 ı 1075
 5 U+00fc ü 755
 6 U+010d č 644
 7 U+00f3 ó 498
 8 U+00f6 ö 418
 9 U+00fa ú 407
 ....
 93 U+0152 Œ 1
 94 U+0158 Ř 1
 95 U+0179 Ź 1
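
The kind of tally behind such a report can be sketched as below. This is an illustration only, not the PLUS implementation. The function name is hypothetical, the input is assumed to be well-formed UTF8, and leaving out comments and the [Russian] section is left to the caller. Ties are listed in code-point order, which is also an assumption.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: count non-ASCII code points across a list of UTF8
// strings and return (code point, count) pairs by descending frequency.
std::vector<std::pair<char32_t, int>> char_stats(const std::vector<std::string>& contents) {
    std::map<char32_t, int> freq;
    for (const std::string& s : contents) {
        for (std::size_t i = 0; i < s.size();) {
            const unsigned char b = static_cast<unsigned char>(s[i]);
            // Sequence length from the lead byte (assumes well-formed input).
            const std::size_t len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
            if (len > 1 && i + len <= s.size()) {
                char32_t cp = b & (0xFF >> (len + 1));   // payload bits of lead byte
                for (std::size_t k = 1; k < len; ++k)
                    cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
                ++freq[cp];
            }
            i += len;
        }
    }
    std::vector<std::pair<char32_t, int>> ranked(freq.begin(), freq.end());
    // Stable sort keeps code-point order among equal counts.
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const std::pair<char32_t, int>& a, const std::pair<char32_t, int>& b) {
                         return a.second > b.second;
                     });
    return ranked;
}
```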

Ranges Option and Holes Option

Traditionally, #str_<id>s had only 5-digit numeric values. When a new string needed to be added (i.e., to the [English] master), it was helpful to know what numeric ranges were already fully in use, and conversely, what ranges are unoccupied, dubbed "holes". With these options, that info gets reported to the console.

Note that for PL, "holes" reports only if "ranges" is not given as an option. PLUS removes that restriction. PL processes and reports <id>s in [English] listing order; if that is non-monotonic, this can cause anomalies for the "holes" output. To fix that, PLUS does a pre-sort into monotonic order.

PL and PLUS formats differ slightly. An example PLUS output for "-ranges":

There are 48 occupied numeric #str_ ranges in [English]:

  1000-1019 (with 20 entries)
  1050-1062 (with 13 entries)
  1500-1503 (with 4 entries)
  2000-2016 (with 17 entries)
  ...
  8980-8988 (with 9 entries)
  10000-10162 (with 163 entries)
  10180-10185 (with 6 entries)

Also, there are 54 non-numeric #str_ entries.

In the last line, you see that, unlike PL, PLUS is aware of non-numeric <id>s, but just reports them as a total. An example PLUS output for "-holes":

There are 44 unoccupied numeric #str_ ranges (aka holes) in [English]:

  0-999 (with 1000 openings)
  1020-1049 (with 30 openings)
  1063-1499 (with 437 openings)
  1504-1999 (with 496 openings)
  ....
  8989-9999 (with 1011 openings)
  10163-10179 (with 17 openings)
  10186-19999 (with 9814 openings)

The range from 20000-99999 is reserved for FMs. Using #str_<id> in that range in all.lang will trigger a fatal error (with the -ranges or -holes option). (See also the FM Option.)
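
The range and hole computation can be sketched as follows. This is not the PLUS source, and the names are hypothetical. A std::set stands in for the pre-sort into monotonic order, and all ids are assumed to lie within the valid range [lo, hi], e.g. 0-19999 for the main menu.

```cpp
#include <set>
#include <utility>
#include <vector>

using Range = std::pair<int, int>;   // inclusive first and last id

// Collapse sorted numeric ids into runs of consecutive values.
std::vector<Range> occupied_ranges(const std::set<int>& ids) {
    std::vector<Range> out;
    for (int id : ids) {
        if (!out.empty() && out.back().second + 1 == id) out.back().second = id;
        else out.push_back({id, id});
    }
    return out;
}

// The holes are the gaps between occupied ranges within [lo, hi].
std::vector<Range> holes(const std::vector<Range>& occupied, int lo, int hi) {
    std::vector<Range> out;
    int next = lo;                   // first id not yet accounted for
    for (const Range& r : occupied) {
        if (r.first > next) out.push_back({next, r.first - 1});
        next = r.second + 1;
    }
    if (next <= hi) out.push_back({next, hi});
    return out;
}
```

The number of entries or openings in a range is then second - first + 1, which is how 0-999 comes to 1000 openings.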

FM Option [PLUS only]

By default, when either the -ranges or -holes option is in effect, the valid range for #str_<id> numeric values is enforced by PLUS as 0-19999. If you are using this for FM strings, select this option to make the valid range 20000-99999.

These days, many but not all main menu #str_<id>s contain a substring "_menu_". There could be future enforcement of alphanumeric substrings.

Details about Main File Formats and Processing

All.lang File Input Format

The file begins with a preamble (in C-style comment form) of notes for translators, some of which we restate here. Then an opening "{" by itself, followed by a series of language sections. Each starts with a line with one of these headers (shown in the order currently seen in the main menu all.lang file):

	[English]
	[German]
	[French]
	[Italian]
	[Spanish]
	[Polish]
	[Romanian]
	[Russian]
	[Portuguese]
	[Czech]
	[Hungarian]
	[Slovak]
	[Swedish]
	[Danish]
	[Dutch]
	[Turkish]

The case of the language name doesn't matter, nor does the order in which sections are listed. But it is a convention to list English first. After the header comes a series of convertible lines and comment lines, as discussed in the next sections. The last line of the file has the closing "}".

Converting Each Line

Within any particular language section of all.lang, a convertible line will have in this order:

A #str_<id>, in double quotes, where <id> is an unsigned integer, or a string composed of ASCII characters a-z, A-Z, 0-9, or underscore. Preferred for the latter are all lower case with underscores as word separators. Camel-case is tolerated, but not spaces or dashes. Ignoring the "#str_" prefix, the <id> must be no less than 5 and no more than 63 characters long. Leading zeros should be used to pad a numeric <id> to 5 digits. [Note: program i18n.pl possibly deals only with numeric <id>s.]
The language-specific UTF8 <string content>, in double quotes
An optional comment beginning with "//". The next section covers overall treatment of comments.

These are led and separated by whitespace. A leading tab is normal, with tab separators.
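
As an illustration of the line layout just described, here is a small parser sketch. It is not how PL (regular expressions) or PLUS actually does it. All names are hypothetical, and the handling of backslash-escaped quotes inside the content is an assumption.

```cpp
#include <cctype>
#include <string>

struct LangLine {
    std::string id;       // <id> without the "#str_" prefix
    std::string content;  // text between the second pair of quotes
    std::string comment;  // trailing "// ..." comment, or empty
};

// Read one double-quoted token starting at or after pos; advance pos past it.
static bool read_quoted(const std::string& s, std::size_t& pos, std::string& out) {
    pos = s.find_first_not_of(" \t", pos);
    if (pos == std::string::npos || s[pos] != '"') return false;
    std::size_t i = pos + 1;
    for (; i < s.size(); ++i) {
        if (s[i] == '\\' && i + 1 < s.size()) { ++i; continue; }  // skip escaped char
        if (s[i] == '"') break;
    }
    if (i >= s.size()) return false;   // no closing quote
    out = s.substr(pos + 1, i - pos - 1);
    pos = i + 1;
    return true;
}

// Returns false for anything that is not a convertible line.
bool parse_lang_line(const std::string& line, LangLine& out) {
    std::size_t pos = 0;
    std::string key;
    if (!read_quoted(line, pos, key)) return false;
    if (key.compare(0, 5, "#str_") != 0) return false;
    out.id = key.substr(5);
    if (!read_quoted(line, pos, out.content)) return false;
    const std::size_t c = line.find_first_not_of(" \t", pos);
    out.comment = (c != std::string::npos && line.compare(c, 2, "//") == 0)
                      ? line.substr(c) : "";
    return true;
}

// The <id> rules from the text: 5 to 63 characters of a-z, A-Z, 0-9 or '_'.
bool valid_id(const std::string& id) {
    if (id.size() < 5 || id.size() > 63) return false;
    for (unsigned char ch : id)
        if (!std::isalnum(ch) && ch != '_') return false;
    return true;
}
```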

No particular ordering of #str_<id>s is required, though ordering numeric <id>s (at least within sub-groups) is desirable.

Duplicates Across Lines Within a Section

If a particular <id> value appears more than once in a section, the last value is the effective one. A console warning will generally occur during parsing, in the case of PLUS like:

Warning - #str_02518 redefined within [english]...
 from: malformed
 to:   100

In the [English] section, there might be separate <id>s with the same content. This will generally cause a warning, in the case of PLUS like:

Caution - These strings have the same content ('Master'): #str_08206, #str_03008

To avoid the latter warning, a group of duplicates can be assigned a group number and hard-coded in the "master_double_exceptions" list. Only one group of ten members is so defined currently, those with a required empty string as content. More generally (says a PL comment): "Sometimes it can make sense to have two different strings as when they are translated into different strings in another language. A [theoretical] example would be "Saw" (I saw something, ich hab etwas gesehen) and "Saw" (Saw, Säge)."

Substituting English when No Translation

Ideally, a #str_<id> in the master [English] section has a match in a target language's section. Then the translated content will always be used for encoding and listing in the <language>.lang file. If no match, the English content is substituted. (Note that only a subset of these substituted strings will be categorized as "Missing" if the -missing option is used. Also note that the substitution policy is slightly different with the -csv option.)
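
A minimal sketch of this substitution policy, with hypothetical names: it assumes each section has already been read into a map from <id> to content, and that only <id>s present in [English] are emitted.

```cpp
#include <map>
#include <string>

using StringTable = std::map<std::string, std::string>;  // <id> -> content

// For every master id, use the translation if the target section has one
// (even an empty one); otherwise fall back to the English content.
StringTable effective_strings(const StringTable& english, const StringTable& target) {
    StringTable out;
    for (const auto& entry : english) {
        auto it = target.find(entry.first);
        out[entry.first] = (it != target.end()) ? it->second : entry.second;
    }
    return out;
}
```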

Conversion Differences between PL and PLUS

To do encoding, PL uses the Perl Encode module, that can autoload language-specific encode/decode tables.
In the C++ implementation, source file "UTF8_8859_convert.cpp" includes similar (albeit custom) tables for the codepages TDM supports. The overall approach is to convert from UTF8 to UTF16, then to the target ISO encoding. That process does not use <locale> facets (which are problematic for our purposes), but standard C++ library "codecvt_utf8_utf16" and "wstring_convert" capabilities.
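
To make the idea of "decode, then re-encode with ? for anything that doesn't fit" concrete, here is a sketch for ISO-8859-1 only. It deliberately differs from PLUS: it uses a small hand-written UTF8 decoder instead of codecvt and wide characters, and it needs no table, because ISO-8859-1 bytes map one-to-one to U+0000 through U+00FF. Other codepages would need lookup tables. The function names are hypothetical, and overlong sequences are not rejected.

```cpp
#include <string>
#include <vector>

// Decode UTF8 into code points; malformed bytes become U+FFFD.
std::vector<char32_t> decode_utf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // Number of continuation bytes implied by the lead byte.
        const int extra = b < 0x80 ? 0
                        : (b >> 5) == 0x06 ? 1
                        : (b >> 4) == 0x0E ? 2
                        : (b >> 3) == 0x1E ? 3 : -1;
        if (extra < 0 || i + extra >= s.size()) { out.push_back(0xFFFD); ++i; continue; }
        char32_t cp = extra == 0 ? b : b & (0x3F >> extra);
        bool ok = true;
        for (int k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok) { out.push_back(0xFFFD); ++i; continue; }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Re-encode as ISO-8859-1, substituting '?' for anything above U+00FF.
// punts receives the number of substitutions made.
std::string to_latin1(const std::string& utf8, int& punts) {
    std::string out;
    punts = 0;
    for (char32_t cp : decode_utf8(utf8)) {
        if (cp <= 0xFF) out += static_cast<char>(cp);
        else { out += '?'; ++punts; }
    }
    return out;
}
```

A non-zero punts count is the condition under which a -charpunts style report would be issued for the line.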

The <language>.lang File Output Format

A preamble states the language name and specific ISO encoding.
This is followed by the "#str_<id> ..." lines; sorted in alphabetic order; <id> in numeric form will be listed before those in alphabetic form. Among the latter, an upper case letter will sort before all lower case letters.
Trailing "//" comments on "#str_<id>" lines from all.lang are preserved.
Comments of the form /*...*/ in all.lang are skipped, whether spanning multiple lines or part of a single line.
Blank lines, or lines with only whitespace or a "//" comment, are likewise skipped. The latter comments typically contain headers to organize groups of items in all.lang, but that organization is not retained, replaced by alphanumeric ordering as mentioned.
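
The ordering described above can be expressed as a comparator, sketched here with hypothetical names and with <id>s given without their "#str_" prefix. When numeric <id>s are zero-padded to equal length, plain byte-wise string comparison already yields this order, since ASCII digits sort before upper case letters, which sort before lower case letters. The explicit numeric branch only matters for unpadded ids.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

static bool is_numeric_id(const std::string& id) {
    if (id.empty()) return false;
    for (unsigned char c : id) if (!std::isdigit(c)) return false;
    return true;
}

static std::string strip_zeros(const std::string& s) {
    const std::size_t p = s.find_first_not_of('0');
    return p == std::string::npos ? "0" : s.substr(p);
}

// Numeric ids first (by value), then all other ids in byte order.
bool id_less(const std::string& a, const std::string& b) {
    const bool na = is_numeric_id(a), nb = is_numeric_id(b);
    if (na != nb) return na;
    if (na) {
        const std::string x = strip_zeros(a), y = strip_zeros(b);
        if (x.size() != y.size()) return x.size() < y.size();
        return x < y;
    }
    return a < b;
}

// Convenience wrapper: sort a list of ids in output order.
void sort_ids(std::vector<std::string>& ids) {
    std::sort(ids.begin(), ids.end(), id_less);
}
```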

Output File Differences between PL and PLUS

These differences are not thought to have functional implications:

The first line of the preamble will differ unless PLUS's -old-preamble option is used.
Between a string content and any comment, PLUS preserves tabs and spaces found in all.lang, while PL substitutes two tabs. This means that simple file comparison programs (like Windows "fc", even with /T or /W options) will find lots of trivial differences. It is suggested you use a more sophisticated comparison tool, for instance, Notepad++ with the ComparePlus plugin and "Ignore changed spaces" selected.

Notes on PLUS Internals

A hand-crafted conversion from a static analysis of PL, PLUS preserves many variable names, some reimagined as functions. Befitting a straightforward conversion, it offers a procedural, C-like style, without much in the way of a class hierarchy, overarching object-oriented design pattern, or advanced C++ features. (This may be of benefit if part or all of its functionality is ever integrated into the TDM engine and/or DR.) Not much effort was spent on performance optimization. However, as a compiled language, it will of course run much faster than PL. Also, the C++ implementation avoids use of regular expressions, which are heavily featured in PL and are notoriously slow and also hard for most coders to read.

Platform-independent C++ standard-library structures like vectors, (ordered) sets, maps, and multimaps are employed. However, to do the UTF8 to ISO conversions, a Windows-specific approach is taken: decoding from UTF8 to wide characters and then re-encoding to the target ISO encoding. (While it is possible to do such conversions entirely within 8-bit characters, the resulting code is a puzzle box, best left to third-party libraries.) Instead of using a third-party library, the encode/decode implementation here includes use of the "deprecated" C++ standard "codecvt" feature; while deprecated, the C++ standard committee has not (as of 2025) come up with a viable replacement.

Like PL, PLUS has some DEBUG statements defined (requiring recompile for C++). These reflect specific concerns during development and testing and are of limited general interest.

Downloads

For PLUS v1.0 of 2025, Feb 28:

gen_lang_plus_source_v1.0.zip. This is a Visual Studio project.
gen_lang_plus.exe