
Regular Expressions

1. Overview

Regular expressions are useful in text processing for finding and extracting information.

The main idea: write a pattern that matches a specific sequence of characters.

2. Quick Start

  • Letters

a matches a

ab matches ab

[abc] matches only a / b / c (1 character)

[^abc] matches exactly 1 character that is not a / b / c

[a-z] matches only 1 character from a to z
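The letter patterns above can be tried out with a minimal sketch like the one below. It assumes Python's built-in re module; the sample strings and variable names are made up for illustration.

```python
import re

# A literal pattern matches the same characters in the text.
literal_hit = re.search(r"ab", "cab")

# [abc] matches exactly one character out of a, b, c.
in_class = re.findall(r"[abc]", "cabbage")

# [^abc] matches exactly one character that is NOT a, b, or c.
not_in_class = re.findall(r"[^abc]", "cabbage")

# [a-z] matches exactly one lowercase letter from a to z.
lowercase = re.findall(r"[a-z]", "Hi 5!")

print(literal_hit.group())  # ab
print(in_class)             # ['c', 'a', 'b', 'b', 'a']
print(not_in_class)         # ['g', 'e']
print(lowercase)            # ['i']
```

Each class consumes a single character, which is why findall returns one-character strings.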

  • Digits

123 matches 123

\d matches any digit

\D matches any non-digit character
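A short sketch of the digit patterns, assuming Python's re module. The sample text is invented for illustration.

```python
import re

text = "Order 123 shipped in 4 days"

# Literal digits match themselves.
literal = re.search(r"123", text)

# \d matches one digit at a time.
digits = re.findall(r"\d", text)

# \D matches one character that is not a digit.
non_digits = re.findall(r"\D", "a1b2")

print(literal.group())  # 123
print(digits)           # ['1', '2', '3', '4']
print(non_digits)       # ['a', 'b']
```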

  • Wild Card

. matches any single character (in most engines, a newline is excluded by default)

\. matches .

\w matches any alphanumeric character (letter or digit) or an underscore

  • equivalent to [A-Za-z0-9_] for ASCII text

\W matches any character that \w does not match
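The wildcard patterns can be checked with a small sketch such as this one, assuming Python's re module (where . skips newlines unless the re.DOTALL flag is set). The sample strings are illustrative only.

```python
import re

# "." stands for any one character, but not "\n" by default in Python.
any_char = re.findall(r"a.c", "abc a-c a\nc")

# "\." matches a literal dot only.
literal_dot = re.findall(r"\.", "v1.2.3")

# \w matches letters, digits, and the underscore.
word_chars = re.findall(r"\w", "a_1-!")

# \W matches everything that \w does not.
non_word = re.findall(r"\W", "a_1-!")

print(any_char)     # ['abc', 'a-c']
print(literal_dot)  # ['.', '.']
print(word_chars)   # ['a', '_', '1']
print(non_word)     # ['-', '!']
```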

  • Repetitions

a{3} matches aaa

a{1,3} matches a / aa / aaa

a* matches 0 or more repetitions of a

a+ matches 1 or more repetitions of a

a? matches 0 or 1 repetition of a (a is optional)
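The repetition rules above can be verified with a sketch like the following. It assumes Python's re.fullmatch, which succeeds only when the entire string fits the pattern; the helper name matches is hypothetical.

```python
import re

def matches(pattern, text):
    """Return True when the whole of text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

exact = [matches(r"a{3}", s) for s in ("aa", "aaa", "aaaa")]
ranged = [matches(r"a{1,3}", s) for s in ("", "a", "aaa", "aaaa")]
star = [matches(r"a*", s) for s in ("", "a", "aaaa")]
plus = [matches(r"a+", s) for s in ("", "a", "aaaa")]
optional = [matches(r"a?", s) for s in ("", "a", "aa")]

print(exact)     # [False, True, False]
print(ranged)    # [False, True, True, False]
print(star)      # [True, True, True]
print(plus)      # [False, True, True]
print(optional)  # [True, True, False]
```

fullmatch is used here instead of search so that extra characters cause a failure, which makes the repetition limits visible.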

  • Whitespace

There are several common forms of whitespace:

  • space
  • tab
  • new line
  • carriage return

Each of these can be matched by \s, which makes \s very useful when dealing with raw input text
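As a rough illustration of why \s helps with raw input, the sketch below splits a messy string on runs of whitespace. It assumes Python's re module, and the raw string is a made-up example.

```python
import re

# \s matches one whitespace character: space, tab, newline, carriage return.
whitespace_chars = re.findall(r"\s", "a b\tc\nd\re")

# Hypothetical raw input with mixed whitespace between fields.
raw = "name:\tAda \r\n  role:  engineer\n"

# \s+ treats any run of mixed whitespace as a single separator.
tokens = re.split(r"\s+", raw.strip())

print(len(whitespace_chars))  # 4
print(tokens)                 # ['name:', 'Ada', 'role:', 'engineer']
```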

  • ^...$

^ anchors the pattern to the beginning of a line and $ anchors it to the end, so ^...$ describes the whole line
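A minimal sketch of the anchors, assuming Python's re module. By default ^ and $ refer to the start and end of the whole string; the re.MULTILINE flag makes them apply to every line. The log text is invented for illustration.

```python
import re

# ^ requires the match to start at the beginning.
starts = re.search(r"^Error", "Error: disk full") is not None
not_at_start = re.search(r"^Error", "Fatal Error") is not None

# ^...$ together require the whole input to fit the pattern.
whole = re.search(r"^\d+$", "2024") is not None
partial = re.search(r"^\d+$", "2024 AD") is not None

# With re.MULTILINE, ^ and $ are checked against each line.
log = "ok\nError: a\nok\nError: b"
error_lines = re.findall(r"^Error: \w+$", log, flags=re.MULTILINE)

print(starts, not_at_start)  # True False
print(whole, partial)        # True False
print(error_lines)           # ['Error: a', 'Error: b']
```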

  • Group

Use ( ) to extract information for further processing

e.g. ^(IMG\d+)\.png$ matches a whole .png file name but captures only the part before the extension
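The example pattern can be exercised with a sketch like this one, assuming Python's re module. The file names are hypothetical.

```python
import re

pattern = re.compile(r"^(IMG\d+)\.png$")

# Hypothetical file names; only the first two fit the pattern.
filenames = ["IMG001.png", "IMG42.png", "IMG7.jpg", "notes.png"]

names = []
for filename in filenames:
    m = pattern.match(filename)
    if m:
        # group(0) is the full match, group(1) is the first ( ) group.
        names.append(m.group(1))

print(names)  # ['IMG001', 'IMG42']
```

The extension still has to be present for the match to succeed, but it sits outside the parentheses, so it is not part of the captured text.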

  • Nested Group

Use nested ( ) to extract multiple layers of information
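One possible illustration of nesting, assuming Python's re module: the earlier IMG pattern is extended with an inner group (an addition made here for the example) so that both the file name and its number are captured. Groups are numbered by the position of their opening parenthesis.

```python
import re

# Outer group: the name without extension. Inner group: only the digits.
nested = re.compile(r"^(IMG(\d+))\.png$")

m = nested.match("IMG2048.png")  # hypothetical file name

name = m.group(1)    # outer group
number = m.group(2)  # inner group

print(name)        # IMG2048
print(number)      # 2048
print(m.groups())  # ('IMG2048', '2048')
```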

3. Ref
