Regular Expression

Regular expressions are something you will eventually run into at work if you parse data or grep rows from a large text file.

Today at work, one colleague of mine showed me a parser regex like this:

split_regex = re.compile(''',(?=(?:[^"]|'[^']*'|"[^"]*")*$)''')

I had no idea what it meant. Time to review regex again!

Python's re module provides regular expression matching operations similar to those found in Perl.

Regular expressions are made of special characters such as | and ( and ordinary characters such as literal letters and numbers. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted.

One “special” special character

re uses the backslash \ either to escape special characters (so that you can match characters such as $ or \ literally) or to signal a special sequence (simple but hard to remember). However, Python also uses \ as the escape character in string literals. Therefore, to match a literal backslash, one would need to write \\\\ as the pattern string: the regular expression itself needs \\, and each of those two backslashes must be written as \\ inside a regular Python string literal. It is therefore advised to use Python's raw string notation for regular expression patterns: for example, r"\n" is a two-character string containing \ and n, while "\n" is a one-character string containing a newline.
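To make the double escaping concrete, here is a small sketch. The variable names (path, backslash_hits) are made up for illustration.

```python
import re

# A normal string literal turns \n into one newline character,
# while the raw string keeps the backslash and the letter n.
assert len("\n") == 1
assert len(r"\n") == 2

# The regex for one literal backslash is \\ (two characters).
# Written as a normal literal that is "\\\\"; as a raw literal it is r"\\".
assert "\\\\" == r"\\"

path = "C:\\temp"  # the text C:\temp, which contains one backslash
backslash_hits = re.findall(r"\\", path)
print(len(backslash_hits))  # number of backslashes found in path
```

Both spellings compile to the same pattern; the raw string is simply easier to read.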

Repetition qualifiers

Repetition qualifiers come in greedy forms such as *, +, ? and {m,n} (match as much text as possible) and non-greedy forms such as *?, +?, ?? and {m,n}? (match as little text as possible).

Repetition qualifiers cannot be directly nested. To apply a second repetition to an inner repetition, parentheses may be used: for example, (?:a{6})* matches any multiple of six 'a' characters.
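A quick sketch of greedy versus non-greedy matching and of a nested repetition. The sample string and the variable names are my own, chosen only for illustration.

```python
import re

html = "<a><b>"

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.match(r"<.*>", html).group()

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.match(r"<.*?>", html).group()

print(greedy)  # <a><b>
print(lazy)    # <a>

# Nested repetition needs parentheses: a{6} repeated zero or more times.
six = re.compile(r"(?:a{6})*")
twelve_ok = six.fullmatch("a" * 12) is not None  # 12 is a multiple of 6
seven_ok = six.fullmatch("a" * 7) is not None    # 7 is not
print(twelve_ok, seven_ok)  # True False
```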

. any character except newline
^ beginning
$ end of the string, or just before the newline at the end of the string
* 0 or more repetitions of the preceding RE
+ 1 or more repetitions of the preceding RE
? 0 or 1 repetitions of the preceding RE
adding ? after any of the above qualifiers (*?, +?, ??) makes the match non-greedy: it consumes as few characters as possible while still letting the overall pattern match
{m} exactly m copies of the previous RE
{m,n} match from m to n repetitions of the preceding RE
{m,n}? match from m to n repetitions of the preceding RE, but attempt to match as few as possible
[] a set of characters
- usage:
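  - [amk]: any one of a, m, k
  - [a-z], [0-5][0-9], [0-9A-Fa-f]: ranges of characters, written as two characters separated by - (note that a range covers single characters, so [0-10] means only 0 or 1)
  - [a\-z], [a-]: a literal -, either escaped or placed first or last in the set
  - [(+*)]: special characters lose their special meaning inside sets
  - [^...]: matches the complement of the set, e.g. [^5] matches any character except 5
  - [\]] or []]: matches a literal ']'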
| or
(...) matches whatever is inside the parentheses, and indicates the start and end of a group
(?...) extension notation (this is very commonly used)
- (?iLmsux) include the flags as part of the regular expression (instead of passing a flag argument to the re.compile() function)
- (?:...) non-capturing group: the substring matched cannot be retrieved after performing a match or referenced later in the pattern
- (?P<name>...) the substring matched by the group is accessible via the symbolic group name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. Named groups can be referenced in three contexts; the two most common are:
  - (?P=quote) inside the same pattern, e.g. (?P<quote>['"]).*?(?P=quote) matches a string quoted with either single or double quotes; .*? non-greedily matches 0 or more repetitions of any character
  - m.group('quote') on the match object m
- (?#) comment; contents of the parentheses are ignored
- (?=...) positive lookahead assertion: matches if the pattern ... matches next, but does not consume any of the string
- (?!...) negative lookahead assertion: matches if the pattern ... does not match next
- (?<=...) positive lookbehind assertion: matches if the current position in the string is preceded by a match for the fixed-length pattern ... that ends at the current position
- (?<!...) negative lookbehind assertion: matches if the current position in the string is not preceded by a match for ...
- (?(id/name)yes-pattern|no-pattern) try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t.
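The extension notations are easier to digest with a few tiny examples. This is only a sketch: the sample strings and variable names are invented for illustration.

```python
import re

# Named group plus a backreference to it: the closing quote must be
# the same character as the opening one.
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")
m = quoted.search('He said "hi" and left')
print(m.group())         # "hi"  (including the double quotes)
print(m.group("quote"))  # "     (the quote character that was used)

fruits = "apple, pear, plum"
# Positive lookahead: words that are followed by a comma (comma not consumed).
before_comma = re.findall(r"\w+(?=,)", fruits)
# Negative lookahead: whole words that are NOT followed by a comma.
not_before_comma = re.findall(r"\b\w+\b(?!,)", fruits)

prices = "cost $42, qty 7"
# Positive lookbehind: digits directly preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", prices)
# Negative lookbehind: whole numbers NOT preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(before_comma, not_before_comma)  # ['apple', 'pear'] ['plum']
print(after_dollar, not_after_dollar)  # ['42'] ['7']
```

Note that none of the assertions show up in the results: they only check the surroundings of the match.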

Special sequence

Special sequences consist of \ and a character from the list below. They are best written with Python's raw string notation. If the character after \ is not an ASCII letter or digit, the sequence simply matches that character literally; for example, \$ matches the character '$'.

\A matches only at the start of the string
\Z matches only at the end of the string
\b matches the empty string, but only at the beginning or end of a word.
\B matches the empty string, but only when it is not at the beginning or end of a word.
\d equivalent of set [0-9]
\D equivalent of set [^0-9]
\s matches any whitespace character, equivalent of set [ \t\n\r\f\v]
\S matches any non-whitespace character, equivalent of set [^ \t\n\r\f\v]
\w matches any alphanumeric character and the underscore, equivalent of set [a-zA-Z0-9_]
\W matches the complement of \w, equivalent of set [^a-zA-Z0-9_]

Note: the bracketed equivalents for \d, \s and \w hold for ASCII matching (bytes patterns or the re.ASCII flag). For str patterns in Python 3, these sequences also match the corresponding Unicode digits, whitespace and word characters.
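A few of these sequences in action. The sample text is made up; the point is mainly how \b and \B treat word boundaries and how \A differs from ^.

```python
import re

text = "cat concat cat_2 42"

whole_word = re.findall(r"\bcat\b", text)  # 'cat' only as a standalone word
inside_word = re.findall(r"\Bcat", text)   # 'cat' not at the start of a word
digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)
pieces = re.split(r"\s+", text)

print(whole_word)   # ['cat']  (the _ in cat_2 counts as a word character)
print(inside_word)  # ['cat']  (the one at the end of 'concat')
print(digits)       # ['2', '42']
print(words)        # ['cat', 'concat', 'cat_2', '42']

# With re.MULTILINE, ^ matches at every line start, \A only at the very start.
lines = "one\ntwo"
caret = re.findall(r"^\w+", lines, re.MULTILINE)
start_only = re.findall(r"\A\w+", lines, re.MULTILINE)
print(caret, start_only)  # ['one', 'two'] ['one']
```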

The complex regex parser

Let's revisit the absurdly long regex from the beginning: ''',(?=(?:[^"]|'[^']*'|"[^"]*")*$)'''. At the first level, it can be split into:

, the pattern starts with a literal comma
(?=(?:[^"]|'[^']*'|"[^"]*")*$), which can be further split into:
- (?=...) a positive lookahead assertion
- (?:[^"]|'[^']*'|"[^"]*")*$ 0 or more repetitions of (?:[^"]|'[^']*'|"[^"]*"), running up to the end of the string
- (?:[^"]|'[^']*'|"[^"]*") which can be further split into:
  - (?:...) a non-capturing group
  - three alternatives separated by | (or):
    - [^"] any single character other than a double quote
    - '[^']*' a single quote, 0 or more non-single-quote characters, then a closing single quote; in simpler English, a pair of single quotes with zero or more characters in between, e.g. 'single'
    - "[^"]*" a double quote, 0 or more non-double-quote characters, then a closing double quote; in simpler English, a pair of double quotes with zero or more characters in between, e.g. "couple"

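In summary, this regex matches a comma, but only when the text between that comma and the end of the string contains no unpaired ": every double quote after the comma must belong to a complete "..." pair (or sit inside a complete '...' pair). In a well-formed row, a comma inside a double-quoted field is followed by an odd number of ", so it is not matched.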

Why do we need such a complex regex as a parser?

Well, let's look at some tricky cases:

'87332,"City Below, The (Unter dir die Stadt) (2010)",Drama'

In this example, we want to split it into: '87332', '"City Below, The (Unter dir die Stadt) (2010)"', and 'Drama'

However, the second field contains a literal , that must not be treated as a delimiter. There is no general pattern for the text around such a comma, so we cannot write a specific regex that skips just this one. If we simply use , as the delimiter, the string is split into 4 parts, which defeats our purpose. Because [^"] in the lookahead can only consume characters other than ", a comma inside a pair of double quotes fails the assertion (the closing " after it has no partner) and is not used as a delimiter. Since paired ' or " are allowed, the correct delimiters are still recognized.
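Here is a small sketch that runs the regex on the row above and compares it with a plain split on commas. I wrote the pattern as a raw string; the variable names row, naive and fields are just for illustration.

```python
import re

# The regex from the beginning of the post, written as a raw string.
split_regex = re.compile(r''',(?=(?:[^"]|'[^']*'|"[^"]*")*$)''')

row = '87332,"City Below, The (Unter dir die Stadt) (2010)",Drama'

# Plain split: the comma inside the quoted title is also used as a delimiter.
naive = row.split(",")
print(len(naive))  # 4

# Regex split: only commas followed by balanced double quotes are delimiters.
fields = split_regex.split(row)
for field in fields:
    print(field)
# 87332
# "City Below, The (Unter dir die Stadt) (2010)"
# Drama
```

The double quotes stay part of the second field; stripping them would be a separate step.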

"62,Mr. Holland's Opus (1995),Drama"

You may ask why we don't allow a single " but do allow a single '. The above example explains it. In English, it is pretty common to have a single ' within a string, such as 's or city names like O'Fallon and Coeur d'Alene. The regex tolerates this because [^"] also matches a lone '.
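The same check for the apostrophe case, again as a sketch. The second input, with a stray double quote, is a made-up string added only to show the contrast.

```python
import re

split_regex = re.compile(r''',(?=(?:[^"]|'[^']*'|"[^"]*")*$)''')

# A lone apostrophe is matched by [^"], so both commas still split.
opus = split_regex.split("62,Mr. Holland's Opus (1995),Drama")
print(opus)  # ['62', "Mr. Holland's Opus (1995)", 'Drama']

# A lone double quote after the comma makes the lookahead fail,
# so that comma is not treated as a delimiter at all.
stray = split_regex.split('a,b"c')
print(stray)  # ['a,b"c']
```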

Written on October 27, 2017