regex
Links
- Chomsky hierarchy
- Regular languages (Type 3) are a subset of
- Context-free languages (Type 2) are a subset of
- Context-sensitive languages (Type 1) are a subset of
- Recursively enumerable languages (Type 0).
- FYI, if one is looking for a fairly full featured regex
implementation for playing around, the .NET library's
System.Text.RegularExpressions
has quite a bit (more
that PCRE).
- RE2
- Tools
- Regular Expression Matching with a Trigram Index
- Ragel: State Machine Compiler
- The true power of regular expressions
- On matching context free language
{a^n b^n, n>0}
via the regex
/^(a(?1)?b)$/
.
- Python
re
module
- You can't express that via the built in
re
module.
- Note that backreferences (like
\n
) match the actual matched text and not the expression represented by the group.
- In fact, if you tried to
re.compile(r'^(a\1*b)$')
, you will get the
error sre_constants.error: cannot refer to
open group
.)
- Python regex library.
- This one allows backreferences to
expressions (i.e. recursive regex), so you
can express the regex as
regex.compile(r'(?V1)^(a(?1)*b)$')
that works. (so
regex.compile(r'(?V1)^(a(?1)*b)$').match('aaabbb')
passes.)
- Atomic Grouping
- Syntax
(?>choices)
.
- It specifies a group that won't backtrack once it
has been matched. For example, the regex
(?>ab|a)b
will match abb
but not ab
because
once the atomic group as captured ab
, there's no
backtracking to try if some later part of the regex
failed.
- Possessive Quantifiers
- Syntax:
X*+
.
- It's a shorthand notation for Atomic Grouping.
X*+
→ is the same as (?>X*)
. If X
was a
group, remember to make sure that is is included as
a group inside the Atomic Expression when you do the
transform. (i.e. (?:a|b)*+
→ (?>(?:a|b)*)
).
- Literal Characters and Special Characters
\Q
and \E
is supported by PCRE and some others
but NOT by Python even when using the regex
library.
- Unicode
- Canonical equivalence and matching
- If you have Unicode literal characters in your
pattern and have canonical equivalence turned
on, then they will match any equivalent code
point. For example, if your pattern included
the character
ö
, then it will match that
character whether it appears in the text as a
single precomposed character (e.g. \u00F6
→
ö
) OR appears in the text as a two code
points—\u006F (o)
followed by the combining
character for the umlaut \u0308
. However,
if instead you wrote it in terms of unicode code
points (using the \uXXXX
or \u{XXXX}
syntaxes supported by your regex engine), it
will NOT use Unicode equivalence and match only
the specific code points that you called out.
- Note also that the
.
character can match
individual Unicode codepoints. For instance, if
the text to be matched contains ö
written as
two code points (o
followed by the umlaut
combining character), the .
is free to match
JUST the o
and not consume the combining
character that follows. (Is this really true in
Unicode mode or only when you're using
quantifiers with the .
?)
- Unicode Categories
- Recursion and Subroutine Calls May or May Not Be Atomic
- Flags / Modifiers
- Reference