elisp: Character Properties
32.6 Character Properties
=========================
A “character property” is a named attribute of a character that
specifies how the character behaves and how it should be handled during
text processing and display. Thus, character properties are an
important part of specifying the character’s semantics.
On the whole, Emacs follows the Unicode Standard in its
implementation of character properties. In particular, Emacs supports
the Unicode Character Property Model
(http://www.unicode.org/reports/tr23/), and the Emacs character property
database is derived from the Unicode Character Database (UCD). See the
Character Properties chapter of the Unicode Standard
(http://www.unicode.org/versions/Unicode6.2.0/ch04.pdf), for a detailed
description of Unicode character properties and their meaning. This
section assumes you are already familiar with that chapter of the
Unicode Standard, and want to apply that knowledge to Emacs Lisp
programs.
In Emacs, each property has a name, which is a symbol, and a set of
possible values, whose types depend on the property; if a character does
not have a certain property, the value is ‘nil’. As a general rule, the
names of character properties in Emacs are produced from the
corresponding Unicode properties by downcasing them and replacing each
‘_’ character with a dash ‘-’. For example, ‘Canonical_Combining_Class’
becomes ‘canonical-combining-class’. However, sometimes we shorten the
names to make their use easier.
Some codepoints are left “unassigned” by the UCD—they don’t
correspond to any character. The Unicode Standard defines default
values of properties for such codepoints; they are mentioned below for
each property.
Here is the full list of value types for all the character properties
that Emacs knows about:
‘name’
Corresponds to the ‘Name’ Unicode property. The value is a string
consisting of upper-case Latin letters A to Z, digits, spaces, and
hyphen ‘-’ characters. For unassigned codepoints, the value is
‘nil’.
‘general-category’
Corresponds to the ‘General_Category’ Unicode property. The value
is a symbol whose name is a 2-letter abbreviation of the
character’s classification. For unassigned codepoints, the value
is ‘Cn’.
‘canonical-combining-class’
Corresponds to the ‘Canonical_Combining_Class’ Unicode property.
The value is an integer. For unassigned codepoints, the value is
zero.
‘bidi-class’
Corresponds to the Unicode ‘Bidi_Class’ property. The value is a
symbol whose name is the Unicode “directional type” of the
character. Emacs uses this property when it reorders bidirectional
text for display (Bidirectional Display). For unassigned
codepoints, the value depends on the code blocks to which the
codepoint belongs: most unassigned codepoints get the value of ‘L’
(strong L), but some get values of ‘AL’ (Arabic letter) or ‘R’
(strong R).
‘decomposition’
Corresponds to the Unicode properties ‘Decomposition_Type’ and
‘Decomposition_Value’. The value is a list, whose first element
may be a symbol representing a compatibility formatting tag, such
as ‘small’(1); the other elements are characters that give the
compatibility decomposition sequence of this character. For
characters that don’t have decomposition sequences, and for
unassigned codepoints, the value is a list with a single member,
the character itself.
‘decimal-digit-value’
Corresponds to the Unicode ‘Numeric_Value’ property for characters
whose ‘Numeric_Type’ is ‘Decimal’. The value is an integer, or
‘nil’ if the character has no decimal digit value. For unassigned
codepoints, the value is ‘nil’, which means NaN, or “not a number”.
‘digit-value’
Corresponds to the Unicode ‘Numeric_Value’ property for characters
whose ‘Numeric_Type’ is ‘Digit’. The value is an integer.
Examples of such characters include compatibility subscript and
superscript digits, for which the value is the corresponding
number. For characters that don’t have any numeric value, and for
unassigned codepoints, the value is ‘nil’, which means NaN.
‘numeric-value’
Corresponds to the Unicode ‘Numeric_Value’ property for characters
whose ‘Numeric_Type’ is ‘Numeric’. The value of this property is a
number. Examples of characters that have this property include
fractions, subscripts, superscripts, Roman numerals, currency
numerators, and encircled numbers. For example, the value of this
property for the character ‘U+2155’ (VULGAR FRACTION ONE FIFTH) is
‘0.2’. For characters that don’t have any numeric value, and for
unassigned codepoints, the value is ‘nil’, which means NaN.
‘mirrored’
Corresponds to the Unicode ‘Bidi_Mirrored’ property. The value of
this property is a symbol, either ‘Y’ or ‘N’. For unassigned
codepoints, the value is ‘N’.
‘mirroring’
Corresponds to the Unicode ‘Bidi_Mirroring_Glyph’ property. The
value of this property is a character whose glyph represents the
mirror image of the character’s glyph, or ‘nil’ if there’s no
defined mirroring glyph. All the characters whose ‘mirrored’
property is ‘N’ have ‘nil’ as their ‘mirroring’ property; however,
some characters whose ‘mirrored’ property is ‘Y’ also have ‘nil’
for ‘mirroring’, because no appropriate characters exist with
mirrored glyphs. Emacs uses this property to display mirror images
of characters when appropriate (Bidirectional Display).
For unassigned codepoints, the value is ‘nil’.
‘paired-bracket’
Corresponds to the Unicode ‘Bidi_Paired_Bracket’ property. The
value of this property is the codepoint of a character’s “paired
bracket”, or ‘nil’ if the character is not a bracket character.
This establishes a mapping between characters that are treated as
bracket pairs by the Unicode Bidirectional Algorithm; Emacs uses
this property when it decides how to reorder for display
parentheses, braces, and other similar characters (
Bidirectional Display).
‘bracket-type’
Corresponds to the Unicode ‘Bidi_Paired_Bracket_Type’ property.
For characters whose ‘paired-bracket’ property is non-‘nil’, the
value of this property is a symbol, either ‘o’ (for opening bracket
characters) or ‘c’ (for closing bracket characters). For
characters whose ‘paired-bracket’ property is ‘nil’, the value is
the symbol ‘n’ (None). Like ‘paired-bracket’, this property is
used for bidirectional display.
‘old-name’
Corresponds to the Unicode ‘Unicode_1_Name’ property. The value is
a string. For unassigned codepoints, and characters that have no
value for this property, the value is ‘nil’.
‘iso-10646-comment’
Corresponds to the Unicode ‘ISO_Comment’ property. The value is
either a string or ‘nil’. For unassigned codepoints, the value is
‘nil’.
‘uppercase’
Corresponds to the Unicode ‘Simple_Uppercase_Mapping’ property.
The value of this property is a single character. For unassigned
codepoints, the value is ‘nil’, which means the character itself.
‘lowercase’
Corresponds to the Unicode ‘Simple_Lowercase_Mapping’ property.
The value of this property is a single character. For unassigned
codepoints, the value is ‘nil’, which means the character itself.
‘titlecase’
Corresponds to the Unicode ‘Simple_Titlecase_Mapping’ property.
“Title case” is a special form of a character used when the first
character of a word needs to be capitalized. The value of this
property is a single character. For unassigned codepoints, the
value is ‘nil’, which means the character itself.
-- Function: get-char-code-property char propname
This function returns the value of CHAR’s PROPNAME property.
(get-char-code-property ?\s 'general-category)
⇒ Zs
(get-char-code-property ?1 'general-category)
⇒ Nd
;; U+2084 SUBSCRIPT FOUR
(get-char-code-property ?\u2084 'digit-value)
⇒ 4
;; U+2155 VULGAR FRACTION ONE FIFTH
(get-char-code-property ?\u2155 'numeric-value)
⇒ 0.2
;; U+2163 ROMAN NUMERAL FOUR
(get-char-code-property ?\u2163 'numeric-value)
⇒ 4
(get-char-code-property ?\( 'paired-bracket)
⇒ 41 ;; closing parenthesis
(get-char-code-property ?\) 'bracket-type)
⇒ c
-- Function: char-code-property-description prop value
This function returns the description string of property PROP’s
VALUE, or ‘nil’ if VALUE has no description.
(char-code-property-description 'general-category 'Zs)
⇒ "Separator, Space"
(char-code-property-description 'general-category 'Nd)
⇒ "Number, Decimal Digit"
(char-code-property-description 'numeric-value '1/5)
⇒ nil
-- Function: put-char-code-property char propname value
This function stores VALUE as the value of the property PROPNAME
for the character CHAR.
-- Variable: unicode-category-table
The value of this variable is a char-table (Char-Tables)
that specifies, for each character, its Unicode ‘General_Category’
property as a symbol.
-- Variable: char-script-table
The value of this variable is a char-table that specifies, for each
character, a symbol whose name is the script to which the character
belongs, according to the Unicode Standard classification of the
Unicode code space into script-specific blocks. This char-table
has a single extra slot whose value is the list of all script
symbols.
-- Variable: char-width-table
The value of this variable is a char-table that specifies the width
of each character in columns that it will occupy on the screen.
-- Variable: printable-chars
The value of this variable is a char-table that specifies, for each
character, whether it is printable or not. That is, if evaluating
‘(aref printable-chars char)’ results in ‘t’, the character is
printable, and if it results in ‘nil’, it is not.
---------- Footnotes ----------
(1) The Unicode specification writes these tag names inside ‘<..>’
brackets, but the tag names in Emacs do not include the brackets; e.g.,
Unicode specifies ‘<small>’ where Emacs uses ‘small’.