12.1 The Regular Expression for ‘sentence-end’
==============================================
The symbol ‘sentence-end’ is bound to the pattern that marks the end of
a sentence. What should this regular expression be?
Clearly, a sentence may be ended by a period, a question mark, or an
exclamation mark. Indeed, in English, only clauses that end with one of
those three characters should be considered to end a sentence. This
means that the pattern should include the character set:
[.?!]
However, we do not want ‘forward-sentence’ merely to jump to a
period, a question mark, or an exclamation mark, because such a
character might be used in the middle of a sentence. A period, for
example, is used after abbreviations. So other information is needed.
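The problem can be seen with a minimal sketch. The helper name
‘demo-first-terminator’ is hypothetical and is not part of Emacs; it
merely applies the bare character set to a string.

```elisp
;; `demo-first-terminator' is a hypothetical helper, not part of Emacs.
(defun demo-first-terminator (text)
  "Return the index of the first sentence-ending character in TEXT.
Return nil if TEXT contains no period, question mark, or
exclamation mark."
  (string-match "[.?!]" text))

;; The period after "Dr" is found first, although the sentence
;; does not end there.
(demo-first-terminator "See Dr. Smith today.  Then go home.")
;; ⇒ 6
```

The match lands on the abbreviation, which is why the character set
alone is not enough.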
According to convention, you type two spaces after every sentence,
but only one space after a period, a question mark, or an exclamation
mark in the body of a sentence. So a period, a question mark, or an
exclamation mark followed by two spaces is a good indicator of an end of
sentence. However, in a file, the two spaces may instead be a tab or
the end of a line. This means that the regular expression should
include these three items as alternatives.
This group of alternatives will look like this:
\\($\\| \\|  \\)
       ^   ^^
      TAB  SPC
Here, ‘$’ indicates the end of the line, and I have pointed out where
the tab and two spaces are inserted in the expression. Both are
inserted by putting the actual characters into the expression.
Two backslashes, ‘\\’, are required before the parentheses and
vertical bars: the first backslash quotes the following backslash in
Emacs; and the second indicates that the following character, the
parenthesis or the vertical bar, is special.
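The group of alternatives and the doubled backslashes can be tried out
with a small sketch. The names ‘demo-sentence-gap’ and
‘demo-period-then-gap’ are hypothetical, and the tab is written as ‘\t’
rather than as a literal character only so that it stays visible.

```elisp
;; `demo-sentence-gap' is a hypothetical name.  The tab is written as
;; \t here so that it stays visible; a literal TAB works the same way.
(defconst demo-sentence-gap "\\($\\|\t\\|  \\)"
  "Regexp matching the end of a line, a tab, or two spaces.")

(defun demo-period-then-gap (text)
  "Return the index of the first period in TEXT followed by the gap."
  (string-match (concat "\\." demo-sentence-gap) text))

;; The Lisp reader turns "\\(" into the two characters \( before the
;; regular expression machinery ever sees the string.
(length "\\(")                          ; ⇒ 2

(demo-period-then-gap "one.  two")      ; ⇒ 3   (two spaces)
(demo-period-then-gap "one.\ttwo")      ; ⇒ 3   (tab)
(demo-period-then-gap "one.\ntwo")      ; ⇒ 3   (end of line)
(demo-period-then-gap "one. two")       ; ⇒ nil (one space only)
```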
Also, a sentence may be followed by one or more carriage returns (that
is, newline characters), like this:
[
]*
Like tabs and spaces, a carriage return is inserted into a regular
expression by inserting it literally. The asterisk indicates that the
<RET> is repeated zero or more times.
But a sentence end does not consist only of a period, a question mark
or an exclamation mark followed by appropriate space: a closing
quotation mark or a closing brace of some kind may precede the space.
Indeed, more than one such mark or brace may precede the space. These
require an expression that looks like this:
[]\"')}]*
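In this expression, the first ‘]’ is the first character in the
character set; placing it first is what lets it stand for itself
rather than close the set. The second character is ‘"’, which is
preceded by a ‘\’ to tell Emacs the ‘"’ is _not_ special, that is, that
it does not end the Lisp string. The last three characters are ‘'’,
‘)’, and ‘}’.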
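As an illustration, the following sketch joins the terminator set to
the set of closing marks and returns whatever the two match together.
The helper name ‘demo-terminator-with-closers’ is hypothetical.

```elisp
;; `demo-terminator-with-closers' is a hypothetical helper.
(defun demo-terminator-with-closers (text)
  "Return the first terminator in TEXT plus any closing marks after it.
Return nil if TEXT contains no terminator."
  (when (string-match "[.?!][]\"')}]*" text)
    (match-string 0 text)))

;; The exclamation mark is followed by a quotation mark and a
;; parenthesis, and all three are matched as a unit.
(demo-terminator-with-closers "(He shouted \"Stop!\")  Then he left.")
;; ⇒ "!\")"

;; A closing square bracket is matched too, because `]' comes first
;; in the set.
(demo-terminator-with-closers "[See above.]  Next")
;; ⇒ ".]"
```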
All this suggests what the regular expression pattern for matching
the end of a sentence should be; and, indeed, if we evaluate
‘sentence-end’ we find that it returns the following value:
sentence-end
⇒ "[.?!][]\"')}]*\\($\\| \\|  \\)[
]*"
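The assembled pattern shown above can be tried in a temporary buffer.
This is only a sketch: the names ‘demo-sentence-end’ and
‘demo-sentence-end-positions’ are hypothetical, and the tab and newline
are written as ‘\t’ and ‘\n’ instead of literal characters so that they
stay visible.

```elisp
;; Hypothetical names; the string is the pattern from the text with
;; the literal tab and newline spelled as \t and \n.
(defconst demo-sentence-end
  "[.?!][]\"')}]*\\($\\|\t\\|  \\)[\n]*"
  "The classic sentence-end pattern described in this section.")

(defun demo-sentence-end-positions (text)
  "Return the buffer positions just after each sentence end in TEXT."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let (positions)
      ;; Every match consumes at least the terminator, so point
      ;; always advances and the loop terminates.
      (while (re-search-forward demo-sentence-end nil t)
        (push (point) positions))
      (nreverse positions))))

;; The period after "Dr" is skipped because only one space follows it.
(demo-sentence-end-positions "Dr. Smith left.  He came back.")
;; ⇒ (18 31)
```

The first position lies after the two spaces that follow ‘left.’, and
the second is the end of the buffer, where ‘$’ matches.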
(Well, not in GNU Emacs 22; that is because of an effort to make the
process simpler and to handle more glyphs and languages. There, the
variable ‘sentence-end’ is ‘nil’ by default, and when its value is
‘nil’, Emacs uses the value returned by the function ‘sentence-end’
instead. (Here is a use of the difference between a value and a
function in Emacs Lisp.) The function returns a value constructed from
the variables ‘sentence-end-base’, ‘sentence-end-double-space’,
‘sentence-end-without-period’, and ‘sentence-end-without-space’. The
critical variable is ‘sentence-end-base’; its global value is similar to
the one described above, but it also contains two additional quotation
marks. These have differing degrees of curliness. The
‘sentence-end-without-period’ variable, when true, tells Emacs that a
sentence may end without a period, as in Thai text.)
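The interplay between the variable and the function can be sketched as
follows. The helper name ‘demo-ends-sentence-p’ is hypothetical, and the
results noted in the comments assume that the other ‘sentence-end-…’
variables keep their default values.

```elisp
;; `demo-ends-sentence-p' is a hypothetical helper.  It binds the
;; variable `sentence-end' to nil so that the function `sentence-end'
;; constructs the pattern from the other variables.
(defun demo-ends-sentence-p (text double-space)
  "Return t if TEXT contains a sentence end.
DOUBLE-SPACE is used as the value of `sentence-end-double-space'."
  (let ((sentence-end nil)
        (sentence-end-double-space double-space))
    (and (string-match (sentence-end) text) t)))

;; With default settings elsewhere, a single space after the period
;; counts only when double spacing is not required.
(demo-ends-sentence-p "One. two" t)     ; ⇒ nil
(demo-ends-sentence-p "One. two" nil)   ; ⇒ t
(demo-ends-sentence-p "One.  two" t)    ; ⇒ t
```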