Create a regex that matches cat in My cat is brown, but not in category or bobcat. Create another regex that matches cat in staccato, but not in any of the three previous subject strings.
Here are sample solutions for the various flavors:
\bcat\b
Regex options: None
\Bcat\B
Regex options: None
The regular expression token \b is called a word boundary. It matches at the start or the end of a word. By itself, it results in a zero-length match. \b is an anchor, just like the tokens introduced in the previous section.
\b matches in these three positions:
- Before the first character in the subject, if the first character is a word character
- After the last character in the subject, if the last character is a word character
- Between two characters in the subject, where one is a word character and the other is not a word character
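The three positions listed above can be observed directly by collecting the offsets of the zero-length matches. The following is a minimal sketch using Python's re module; the subject string and the helper name boundary_positions are illustrative assumptions, not part of the recipe.

```python
import re

def boundary_positions(subject):
    # Offsets at which the word boundary token finds a zero-length match.
    return [m.start() for m in re.finditer(r"\b", subject)]

# "My cat!" has boundaries before M, after y, before c, and after t.
# There is none after "!", because "!" is not a word character.
print(boundary_positions("My cat!"))  # [0, 2, 3, 6]
```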
None of the flavors discussed in this book have separate tokens for matching only before or only after a word. Unless you wanted to create a regex that consists of nothing but a word boundary, these aren’t needed. The tokens before or after the \b in your regular expression will determine where \b can match. The \b in !\b could match only at the start of a word. The \b in \b! could match only at the end of a word. !\b! can never match anywhere.
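A short check of how a neighboring nonword character pins the boundary to one side of a word. This is a sketch in Python; the variable names and subject strings are chosen only for illustration.

```python
import re

# "!" is a nonword character, so the boundary after it needs a word character next.
start_only = re.compile(r"!\b")
# Here the boundary needs a word character before it.
end_only = re.compile(r"\b!")
# A boundary cannot sit between two nonword characters.
never = re.compile(r"!\b!")

print(bool(start_only.search("!cat")))  # True
print(bool(start_only.search("cat!")))  # False
print(bool(end_only.search("cat!")))    # True
print(bool(end_only.search("!cat")))    # False
print(bool(never.search("!!")))         # False
```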
To run a “whole words only” search using a regular expression, simply place the word between two word boundaries, as we did with \bcat\b. The first \b requires the c to occur at the very start of the string, or after a nonword character. The second \b requires the t to occur at the very end of the string, or before a nonword character.
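The “whole words only” search can be tried against the four subject strings from the problem statement. A minimal Python sketch, with variable names that are illustrative only:

```python
import re

whole_word = re.compile(r"\bcat\b")
subjects = ["My cat is brown", "category", "bobcat", "staccato"]

# Keep only the subjects in which cat occurs as a complete word.
matched = [s for s in subjects if whole_word.search(s)]
print(matched)  # ['My cat is brown']
```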
Line break characters are nonword characters. \b will match after a line break if the line break is immediately followed by a word character. It will also match before a line break immediately preceded by a word character. So a word that occupies a whole line by itself will be found by a “whole words only” search. \b is unaffected by “multiline” mode or (?m), which is one of the reasons why this book refers to “multiline” mode as “^ and $ match at line breaks” mode.
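To see that word boundaries ignore the multiline setting while ^ and $ depend on it, one can compare the two in Python, where the mode is selected with re.MULTILINE. The three-line subject below is an assumed example.

```python
import re

subject = "dog\ncat\nbird"

# The word boundary finds the word on its own line with or without the flag.
without_flag = [m.span() for m in re.finditer(r"\bcat\b", subject)]
with_flag = [m.span() for m in re.finditer(r"\bcat\b", subject, re.MULTILINE)]
print(without_flag)  # [(4, 7)]
print(with_flag)     # [(4, 7)]

# By contrast, ^ and $ only match at the line breaks when the flag is set.
anchors_plain = re.search(r"^cat$", subject)
anchors_multiline = re.search(r"^cat$", subject, re.MULTILINE)
print(anchors_plain is None)          # True
print(anchors_multiline is not None)  # True
```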
\B matches at every position in the subject text where \b does not match. \B matches at every position that is not at the start or end of a word.
\B matches in these five positions:
- Before the first character in the subject, if the first character is not a word character
- After the last character in the subject, if the last character is not a word character
- Between two word characters
- Between two nonword characters
- The empty string
\Bcat\B matches cat in staccato, but not in My cat is brown, category, or bobcat.
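The nonboundary positions and the \Bcat\B search can be checked the same way. This Python sketch uses a nonempty subject only, since engines have not always agreed on how \B treats an empty subject; the helper name nonboundary_positions is an assumption made for illustration.

```python
import re

def nonboundary_positions(subject):
    # Offsets at which the nonboundary token finds a zero-length match.
    return [m.start() for m in re.finditer(r"\B", subject)]

# Between M and y, c and a, a and t, and after the trailing "!".
print(nonboundary_positions("My cat!"))  # [1, 4, 5, 7]

inside_word = re.compile(r"\Bcat\B")
subjects = ["My cat is brown", "category", "bobcat", "staccato"]
matched = [s for s in subjects if inside_word.search(s)]
print(matched)  # ['staccato']
```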
To do the opposite of a “whole words only” search (i.e., excluding My cat is brown and including staccato, category, and bobcat), you need to use alternation to combine \Bcat and cat\B into \Bcat|cat\B. \Bcat matches cat in staccato and bobcat. cat\B matches cat in category (and staccato if \Bcat hadn’t already taken care of that).
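The alternation of the two half-anchored patterns, \Bcat|cat\B, can be run against the same four subjects. A minimal Python sketch with illustrative variable names:

```python
import re

# cat with a word character on at least one side.
part_of_word = re.compile(r"\Bcat|cat\B")
subjects = ["My cat is brown", "category", "bobcat", "staccato"]

matched = [s for s in subjects if part_of_word.search(s)]
print(matched)  # ['category', 'bobcat', 'staccato']
```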
All this talk about word boundaries, but no talk about what a word character is. A word character is a character that can occur as part of a word.
Although all the flavors in this book support \b and \B, they differ in which characters are word characters.
Typically, \b matches between two characters where one is matched by \w and the other by \W, and \B always matches between two characters where both are matched by \w or both by \W.
In flavors that view only ASCII characters as word characters, \w is identical to [a-zA-Z0-9_]. With these flavors, you can do a “whole words only” search on words in languages that use only the letters A to Z without diacritics, such as English. But these flavors cannot do “whole words only” searches on words in other languages, such as Spanish or Russian.
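The limitation of ASCII-only word characters shows up as a false “whole word” hit inside a Spanish word. The sketch below uses Python 3's re.ASCII flag to imitate an ASCII-only flavor; that imitation is an assumption made for illustration, not a description of any particular engine.

```python
import re

subject = "ni\u00f1o"  # the Spanish word "niño"

# With ASCII-only word characters, ñ counts as a nonword character,
# so a boundary appears in the middle of the word.
ascii_hit = re.search(r"\bni\b", subject, re.ASCII)

# With Unicode word characters, ñ is a word character and no boundary exists there.
unicode_hit = re.search(r"\bni\b", subject)

print(ascii_hit is not None)    # True
print(unicode_hit is not None)  # False
```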
.NET and Perl treat letters and digits from all scripts as word characters. With these flavors, you can do a “whole words only” search on words in any language, including those that don’t use the Latin alphabet.
Python gives you an option. Non-ASCII characters are included only if you pass the UNICODE or U flag when creating the regex. This flag affects both \b and \w.
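In Python 3 the default is reversed for str patterns: Unicode matching is on unless the re.ASCII flag is passed, and re.UNICODE is accepted but redundant. The sketch below assumes Python 3 and uses a made-up Russian subject string to show that the setting affects both tokens together.

```python
import re

subject = "моя кошка спит"

# Default (Unicode) matching: Cyrillic letters are word characters.
unicode_words = re.findall(r"\b\w+\b", subject)
whole_word_unicode = re.search(r"\bкошка\b", subject)

# ASCII-only matching: no character in the subject is a word character,
# so there are no boundaries and no word matches at all.
ascii_words = re.findall(r"\b\w+\b", subject, re.ASCII)
whole_word_ascii = re.search(r"\bкошка\b", subject, re.ASCII)

print(unicode_words)  # ['моя', 'кошка', 'спит']
print(ascii_words)    # []
print(whole_word_unicode is not None)  # True
print(whole_word_ascii is not None)    # False
```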
Java behaves inconsistently. \w matches only ASCII characters. But \b is Unicode-enabled, supporting any script. In Java, \b\w\b matches a single English letter, digit, or underscore that does not occur as part of a word in any language. \bкошка\b will correctly match the Russian word for cat, because \b supports Unicode. But \w+ will not match any Russian word, because \w is ASCII-only.