grub2: Internationalisation
17 Internationalisation
***********************
17.1 Charset
============
GRUB uses UTF-8 internally, except during rendering, where it uses a
GRUB-specific representation suited to that task. All text files
(including configuration files) are assumed to be encoded in UTF-8.
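
Because every text file is assumed to be UTF-8, it can be useful to
check a file's encoding before GRUB reads it. The following is a
minimal sketch, not part of GRUB: the helper name is_valid_utf8 and the
sample menu entry are hypothetical and exist only for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical config line containing non-ASCII text.
sample = "menuentry 'Gr\u00fc\u00dfe' { true }"

utf8_line = sample.encode("utf-8")
# In Latin-1 the u-diaeresis becomes the single byte 0xFC,
# which can never appear in well-formed UTF-8.
latin1_line = sample.encode("latin-1")

print(is_valid_utf8(utf8_line))    # True
print(is_valid_utf8(latin1_line))  # False
```

A pure-ASCII file passes the check under either encoding, since ASCII
is a subset of both.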
17.2 Filesystems
================
NTFS, JFS, UDF, HFS+, exFAT, long filenames in FAT and the Joliet part
of ISO9660 are treated as UTF-16, as per their specifications. AFS and
BFS are read as UTF-8, again according to specification. BtrFS, cpio,
tar, squash4, minix, minix2, minix3, ROMFS, ReiserFS, XFS, ext2, ext3,
ext4, FAT (short names), the RockRidge part of ISO9660, nilfs2, UFS1,
UFS2 and ZFS are assumed to be UTF-8. This assumption may be wrong on
systems configured with a legacy charset, but as long as that charset
is a superset of ASCII you should still be able to access ASCII-named
files. It is recommended to configure your system to use UTF-8 to
access the filesystem; convmv may help with migration.

Plain ISO9660 filenames are specified as being either ASCII or
described by unspecified escape sequences. GRUB assumes that ISO9660
names are UTF-8 (since any ASCII is valid UTF-8). Some old CD-ROMs use
CP437 in a non-compliant way. You can still access files whose names
contain only ASCII characters on such filesystems, and you can access
any file if the filesystem contains valid Joliet (UTF-16) or RockRidge
(UTF-8) names.

AFFS, SFS and HFS never use Unicode, and GRUB assumes them to be in
Latin1, Latin1 and MacRoman respectively.

GRUB handles filesystem case-insensitivity; however, no attempt is made
at case conversion of international characters, so e.g. a file named
with a lowercase Greek alpha is treated as different from one named
with an uppercase alpha. The filesystems in question are NTFS (except
the POSIX namespace), HFS+ (configurable at mkfs time, default
insensitive), SFS (configurable at mkfs time, default insensitive), JFS
(configurable at mkfs time, default sensitive), HFS, AFFS, FAT, exFAT
and ZFS (configurable on a per-subvolume basis by the property
"casesensitivity", default sensitive). On ZFS subvolumes marked as
case-insensitive, files whose names contain lowercase international
characters are inaccessible.

Also, like all supported filesystems except HFS+ and ZFS (configurable
on a per-subvolume basis by the property "normalization", default none),
GRUB makes no attempt to check canonical equivalence, so a file named
with u-diaeresis (one precomposed character) is treated as distinct
from one named with u followed by a combining diaeresis. This means,
however, that in order to access a file on HFS+ its name must be
specified in normalisation form D. On normalised ZFS subvolumes,
filenames that are not in the configured normalisation form are
inaccessible.
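
The two matching rules above can be illustrated with a short sketch.
This is not GRUB code: the function names are hypothetical, and
Python's standard NFD is used as a stand-in for the form D that HFS+
expects, which may differ from the on-disk decomposition in some
details.

```python
import unicodedata


def ascii_only_casefold(name: str) -> str:
    """Lowercase A-Z only; leave every non-ASCII character untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in name
    )


def same_name_case_insensitive(a: str, b: str) -> bool:
    """Compare two names the way the text describes: ASCII case only."""
    return ascii_only_casefold(a) == ascii_only_casefold(b)


def to_form_d(name: str) -> str:
    """Decompose a name into Unicode normalisation form D (NFD)."""
    return unicodedata.normalize("NFD", name)


# ASCII letters fold; Greek capital and small alpha do not.
print(same_name_case_insensitive("VMLINUZ", "vmlinuz"))  # True
print(same_name_case_insensitive("\u0391", "\u03b1"))    # False

precomposed = "\u00fc"   # u-diaeresis as one code point
decomposed = "u\u0308"   # u followed by combining diaeresis

# Without a canonical equivalence check the two spellings differ.
print(precomposed == decomposed)             # False
# Normalising to form D first makes them identical.
print(to_form_d(precomposed) == decomposed)  # True
```

In practice this suggests normalising a path to form D before typing
it on the GRUB command line or writing it into a config file that
refers to an HFS+ volume.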
17.3 Output terminal
====================
The firmware output console ("console") on ARC and IEEE1275 is limited
to ASCII.
BIOS firmware console and VGA text are limited to ASCII and some
pseudographics.
None of the above is suitable for displaying international text: any
unsupported character is replaced with a question mark, except
pseudographics, which GRUB attempts to approximate with ASCII.
The EFI console, on the other hand, nominally supports UTF-16, but
actual language coverage depends on the firmware and may be very
limited.
The encoding used on a serial line can be chosen with 'terminfo' as
ASCII, UTF-8 or "visual UTF-8". The last one violates the specification
but results in correct rendering of right-to-left text on some readers
that lack their own bidi implementation.
On emu, GRUB checks whether the charset is UTF-8; it uses UTF-8 if so
and ASCII otherwise.
When using gfxterm or gfxmenu, GRUB itself is responsible for
rendering the text. In this case GRUB is limited by the loaded fonts.
If the fonts contain all required characters, then bidirectional text,
cursive variants and combining marks are supported, with the exception
of enclosing, half (e.g. left half tilde or combining overline) and
double combining marks. Ligatures aren't supported, though. This should
cover European, Middle Eastern (if you don't mind the lack of the
lam-alif ligature in Arabic) and East Asian scripts. Notable
unsupported scripts are the Brahmic family and its derivatives, as well
as Mongolian, Tifinagh, Korean Jamo (precomposed characters pose no
problem) and tonal writing (U+02E5 to U+02E9). GRUB also ignores
deprecated (as specified in Unicode) characters (e.g. tags). GRUB also
doesn't handle so-called "annotation characters". If you can complete
either of these two lists or, better, propose a patch to improve
rendering, please contact the developer team.
17.4 Input terminal
===================
The firmware console on BIOS, IEEE1275 and ARC doesn't allow you to
enter non-ASCII characters. The EFI specification allows for it, but
the author is unaware of any actual implementations. Serial input is
currently limited to Latin1 (unlikely to change). GRUB's own keyboard
implementations (at_keyboard and usb_keyboard) support any key but work
on a one-character-per-keystroke basis, so there are no dead keys or
advanced input methods. There is also no hotkey for changing the
keymap. In practice this makes it difficult to enter any text in a
non-Latin alphabet. Moreover, all current input consumers are limited
to ASCII.
17.5 Gettext
============
GRUB supports being translated. For this you need to have the language
*.mo files in $prefix/locale, load the gettext module and set the
"lang" variable.
17.6 Regexp
===========
Regexps work on Unicode characters; however, no attempt at checking
canonical equivalence has been made. Moreover, classes like [:alpha:]
match only the ASCII subset.
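
The two limitations can be sketched outside GRUB as follows. This is
an illustration only: Python's re module has no POSIX classes, so the
explicit range [A-Za-z] is used here as an assumed stand-in for an
ASCII-only [:alpha:], and the helper name is hypothetical.

```python
import re

# Assumed stand-in for an ASCII-only [:alpha:] class.
ASCII_ALPHA = re.compile(r"[A-Za-z]+")


def is_ascii_alpha(text: str) -> bool:
    """Return True if text consists solely of ASCII letters."""
    return ASCII_ALPHA.fullmatch(text) is not None


print(is_ascii_alpha("linux"))       # True
print(is_ascii_alpha("l\u00ednux"))  # False: i-acute is not ASCII

# No canonical equivalence: a pattern written with the precomposed
# u-diaeresis does not match the decomposed spelling of the same letter.
pattern = re.compile("\u00fc")
print(pattern.fullmatch("\u00fc") is not None)   # True
print(pattern.fullmatch("u\u0308") is not None)  # False
```

When a pattern must match non-ASCII letters, listing the characters
explicitly, in the same normalisation form as the text being matched,
avoids both pitfalls.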
17.7 Other
==========
Currently GRUB always uses the YEAR-MONTH-DAY HOUR:MINUTE:SECOND
[WEEKDAY] 24-hour datetime format, but weekdays are translated. GRUB
always uses the decimal number format with [0-9] as digits, . as the
decimal separator and no group separator.

IEEE1275 aliases are matched case-insensitively, except for non-ASCII
characters, which are matched as binary. OSBundleRequired is matched in
a similar way. Since IEEE1275 aliases and OSBundleRequired don't
contain any non-ASCII characters, this should never be a problem in
practice.

Case-sensitive identifiers are matched as raw strings; no canonical
equivalence check is performed. Case-insensitive identifiers are also
matched as raw strings, but additionally [a-z] is equivalent to [A-Z].
GRUB-defined identifiers use only ASCII, and so should user-defined
ones. Identifiers containing non-ASCII characters may work but aren't
supported.

Only the ASCII space characters (space U+0020, tab U+0009, CR U+000d
and LF U+000a) are recognised. Other Unicode space characters aren't
valid field separators.

The <, >, <=, >=, -pgt and -plt tests of the 'test' command compare
strings in the lexicographical order of Unicode codepoints, replicating
the behaviour of test from coreutils. Environment variables and
commands are listed in the same order.
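
Ordering by Unicode codepoint can give results that differ from
dictionary order, as the following sketch shows. It is an illustration
rather than GRUB's implementation; the function name and the sample
names are hypothetical.

```python
def codepoint_less_than(a: str, b: str) -> bool:
    """Compare two strings lexicographically by Unicode codepoint."""
    return [ord(ch) for ch in a] < [ord(ch) for ch in b]


# Every uppercase ASCII letter sorts before every lowercase one,
# and all ASCII sorts before any non-ASCII character.
print(codepoint_less_than("Z", "a"))       # True
print(codepoint_less_than("z", "\u00e9"))  # True

names = ["zeta", "Alpha", "\u00e9clair", "beta"]
print(sorted(names, key=lambda s: [ord(ch) for ch in s]))
# ['Alpha', 'beta', 'zeta', 'éclair']
```

Note that a name starting with e-acute sorts after "zeta", which is
not where a locale-aware sort would place it.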