Info: (eintr) Syntax

Info Catalog
eintr: Words and Symbols
eintr: Words in a defun
eintr: count-words-in-defun
eintr: Syntax

 
 14.2 What Constitutes a Word or Symbol?
 =======================================
 
 Emacs treats different characters as belonging to different “syntax
 categories”.  For example, the regular expression, ‘\\w+’, is a pattern
 specifying one or more _word constituent_ characters.  Word constituent
 characters are members of one syntax category.  Other syntax categories
 include the class of punctuation characters, such as the period and the
 comma, and the class of whitespace characters, such as the blank space
 and the tab character.  (For more information, see “Syntax Tables” in
 the GNU Emacs Lisp Reference Manual, (elisp)Syntax Tables.)
 
    Syntax tables specify which characters belong to which categories.
 Usually, a hyphen is not specified as a word constituent character.
 Instead, it is specified as being in the class of characters that are
 part of symbol names but not words.  This means that the
 ‘count-words-example’ function treats it in the same way it treats an
 interword white space, which is why ‘count-words-example’ counts
 ‘multiply-by-seven’ as three words.
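 
    The effect can be checked with a small sketch.  The helper below,
 ‘my-count-word-matches’, is a hypothetical name introduced only for
 illustration, and it assumes a word-matching regexp of the form
 ‘\\w+\\W*’ together with the syntax table of Emacs Lisp mode:

```elisp
;; Hypothetical helper for illustration; not part of the original text.
(defun my-count-word-matches (string)
  "Count word matches in STRING using Emacs Lisp mode syntax."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    ;; In the Emacs Lisp syntax table a hyphen is a symbol
    ;; constituent, not a word constituent.
    (with-syntax-table emacs-lisp-mode-syntax-table
      (let ((count 0))
        ;; One or more word constituents, then any non-word characters.
        (while (re-search-forward "\\w+\\W*" nil t)
          (setq count (1+ count)))
        count))))

(my-count-word-matches "multiply-by-seven")   ; evaluates to 3
```

 Each hyphen ends one match, just as a blank space would, so the
 result is the same as for the string ‘multiply by seven’.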
 
    There are two ways to cause Emacs to count ‘multiply-by-seven’ as one
 symbol: modify the syntax table or modify the regular expression.
 
    We could redefine a hyphen as a word constituent character by
 modifying the syntax table that Emacs keeps for each mode.  This action
 would serve our purpose, except that a hyphen is merely the most common
 character within symbols that is not typically a word constituent
 character; there are others, too.
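 
    As a rough sketch of the syntax table approach, the code below
 changes the hyphen in a private copy of the Emacs Lisp syntax table,
 so the table used by the mode itself is left alone.  The function name
 ‘my-count-with-hyphen-as-word’ and the regexp ‘\\w+\\W*’ are
 assumptions made for this example:

```elisp
;; Hypothetical helper for illustration; not part of the original text.
(defun my-count-with-hyphen-as-word (string)
  "Count word matches in STRING, treating a hyphen as a word constituent."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    ;; Make a child table that inherits from the Emacs Lisp table,
    ;; then change only the entry for the hyphen.
    (let ((table (make-syntax-table emacs-lisp-mode-syntax-table))
          (count 0))
      (modify-syntax-entry ?- "w" table)
      (with-syntax-table table
        (while (re-search-forward "\\w+\\W*" nil t)
          (setq count (1+ count))))
      count)))

(my-count-with-hyphen-as-word "multiply-by-seven")   ; evaluates to 1
(my-count-with-hyphen-as-word "count_words")         ; evaluates to 2
```

 The second call shows the limitation: the underscore is still a
 symbol constituent that is not a word constituent, so each such
 character would need its own entry.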
 
    Alternatively, we can redefine the regexp used in the
 ‘count-words-example’ definition so as to include symbols.  This
 procedure has the merit of clarity, but the task is a little tricky.
 
    The first part is simple enough: the pattern must match at least one
 character that is a word or symbol constituent.  Thus:
 
      "\\(\\w\\|\\s_\\)+"
 
 The ‘\\(’ is the first part of the grouping construct that includes the
 ‘\\w’ and the ‘\\s_’ as alternatives, separated by the ‘\\|’.  The ‘\\w’
 matches any word-constituent character and the ‘\\s_’ matches any
 character that is part of a symbol name but not a word-constituent
 character.  The ‘+’ following the group indicates that the word or
 symbol constituent characters must be matched at least once.
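 
    To see what this first part matches on its own, here is a minimal
 sketch using ‘string-match’.  The function name ‘my-first-symbol’ is
 hypothetical, and the Emacs Lisp syntax table is assumed:

```elisp
;; Hypothetical helper for illustration; not part of the original text.
(defun my-first-symbol (string)
  "Return the first run of word or symbol constituents in STRING, or nil."
  ;; `string-match' consults the current syntax table, so bind the
  ;; Emacs Lisp one for the duration of the match.
  (with-syntax-table emacs-lisp-mode-syntax-table
    (when (string-match "\\(\\w\\|\\s_\\)+" string)
      (match-string 0 string))))

(my-first-symbol "(multiply-by-seven 3)")   ; evaluates to "multiply-by-seven"
```

 The opening parenthesis is skipped because it is neither a word nor a
 symbol constituent, and the match stops at the blank space.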
 
    However, the second part of the regexp is more difficult to design.
 What we want is to follow the first part with zero or more characters
 that are not constituents of a word or symbol.  At first, I
 thought I could define this with the following:
 
      "\\(\\W\\|\\S_\\)*"
 
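 The upper case ‘W’ and ‘S’ match characters that are _not_ word or
 symbol constituents.  Unfortunately, this expression matches any
 character that is either not a word constituent or not a symbol
 constituent.  Since no character belongs to both syntax classes at
 once, every character satisfies at least one of the alternatives.  In
 other words, this matches any character!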
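 
    A short experiment illustrates the problem.  In this sketch the
 function name ‘my-negated-pattern-match’ is hypothetical and the Emacs
 Lisp syntax table is assumed; the function returns whatever the
 negated pattern matches at the start of a string:

```elisp
;; Hypothetical helper for illustration; not part of the original text.
(defun my-negated-pattern-match (string)
  "Return the text that the negated pattern matches at the start of STRING."
  (with-syntax-table emacs-lisp-mode-syntax-table
    ;; The pattern can match the empty string, so `string-match'
    ;; always succeeds at position 0.
    (string-match "\\(\\W\\|\\S_\\)*" string)
    (match-string 0 string)))

(my-negated-pattern-match "multiply-by-seven")   ; evaluates to "multiply-by-seven"
```

 The letters match ‘\\S_’ because they are not in the symbol class,
 and the hyphens match ‘\\W’ because they are not in the word class, so
 the pattern swallows the whole string.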
 
    I then noticed that every word or symbol in my test region was
 followed by white space (blank space, tab, or newline).  So I tried
 placing a pattern to match one or more blank spaces after the pattern
 for one or more word or symbol constituents.  This failed, too.  Words
 and symbols are often separated by whitespace, but in actual code
 parentheses may follow symbols and punctuation may follow words.  So
 finally, I designed a pattern in which the word or symbol constituents
 are followed optionally by characters that are not white space and then
 followed optionally by white space.
 
    Here is the full regular expression:
 
      "\\(\\w\\|\\s_\\)+[^ \t\n]*[ \t\n]*"
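 
    As a sketch of how the full regular expression behaves, the
 following counts its matches in a string.  The function name
 ‘my-count-symbols’ is hypothetical, and the Emacs Lisp syntax table is
 assumed; this is only an illustration, not the definition developed
 elsewhere in this chapter:

```elisp
;; Hypothetical helper for illustration; not part of the original text.
(defun my-count-symbols (string)
  "Count words and symbols in STRING using Emacs Lisp mode syntax."
  (with-temp-buffer
    (insert string)
    (goto-char (point-min))
    (with-syntax-table emacs-lisp-mode-syntax-table
      (let ((count 0))
        ;; Word or symbol constituents, then optional non-whitespace
        ;; (such as closing parentheses), then optional whitespace.
        (while (re-search-forward
                "\\(\\w\\|\\s_\\)+[^ \t\n]*[ \t\n]*" nil t)
          (setq count (1+ count)))
        count))))

(my-count-symbols "multiply-by-seven")   ; evaluates to 1
(my-count-symbols
 "(defun multiply-by-seven (number)\n  (* 7 number))")   ; evaluates to 6
```

 In the second call the six matches begin at ‘defun’,
 ‘multiply-by-seven’, ‘number’, ‘*’, ‘7’, and ‘number’; the closing
 parentheses are absorbed by the middle part of the pattern.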