Regular Expressions

1. Overview

Regular expressions are widely used in text processing to search for and extract information.

The main idea: writing patterns that match a specific sequence of characters

a matches a

ab matches ab

[abc] matches only a / b / c (1 character)

[^abc] matches any 1 character except a / b / c

[a-z] matches only 1 character from a to z

123 matches 123

\d matches any digit

\D matches any non-digit

. matches any character (in most engines, any character except a newline by default)

\. matches .

\w matches any word character (letter, digit, or underscore)

\W matches any non-word character

a{3} matches aaa

a{1,3} matches a / aa / aaa

a* matches 0 or more repetitions of a

a+ matches 1 or more repetitions of a

a? makes a optional, so it matches 0 or 1 occurrence of a
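
The patterns listed above can be checked with Python's re module. This is a minimal sketch; the helper name matches is an assumption made for illustration, and it uses re.fullmatch so that a pattern must cover the entire string to count as a match.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Character classes
print(matches(r"[abc]", "b"))      # True: one character from the set
print(matches(r"[^abc]", "b"))     # False: b is excluded by ^
print(matches(r"[a-z]", "q"))      # True
print(matches(r"\d", "7"))         # True
print(matches(r"\D", "7"))         # False
print(matches(r".", "x"))          # True
print(matches(r"\.", "x"))         # False: the escaped dot matches only "."
print(matches(r"\w", "_"))         # True: underscore is a word character

# Quantifiers
print(matches(r"a{3}", "aaa"))     # True
print(matches(r"a{1,3}", "aaaa"))  # False: more than three a's
print(matches(r"a*", ""))          # True: zero repetitions are allowed
print(matches(r"a+", ""))          # False: at least one a is required
print(matches(r"ab?", "a"))        # True: b is optional
```

Raw strings (the r prefix) keep Python from interpreting the backslashes before the regex engine sees them.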

There are many common forms of whitespace: space, tab (\t), newline (\n), and carriage return (\r)

All of these can be matched by \s, so \s is extremely useful when dealing with raw input text
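
As an illustration of \s on messy input, the following Python sketch normalizes whitespace in a made-up string (the sample text and variable names are assumptions, not taken from any real data set).

```python
import re

# Hypothetical raw input mixing spaces, a tab, and Windows/Unix line breaks.
raw = "  name:\tAlice \r\n\n age:   30  "

# Collapse every run of whitespace into a single space, then trim the ends.
cleaned = re.sub(r"\s+", " ", raw).strip()
print(cleaned)  # name: Alice age: 30

# Split on runs of whitespace to get the individual tokens.
tokens = re.split(r"\s+", raw.strip())
print(tokens)   # ['name:', 'Alice', 'age:', '30']
```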

^ and $ define what should be matched at a line's beginning and end
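
A small Python sketch of line anchors, using ^ for the beginning of a line and $ for the end. The sample text is invented for illustration. Note that in Python these anchors apply to each line only when the re.MULTILINE flag is passed; without it they refer to the whole string.

```python
import re

# Three lines that all contain "cat" in different positions.
text = "cat\nconcat\ncatalog"

# Without anchors, the pattern matches anywhere inside a line.
print(re.findall(r"cat", text))                  # ['cat', 'cat', 'cat']

# With re.MULTILINE, ^ and $ anchor to the start and end of each line.
print(re.findall(r"^cat$", text, re.MULTILINE))  # ['cat'] (line 1 only)
print(re.findall(r"^cat", text, re.MULTILINE))   # ['cat', 'cat'] (lines 1 and 3)
print(re.findall(r"cat$", text, re.MULTILINE))   # ['cat', 'cat'] (lines 1 and 2)
```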

Use ( ) to extract information for further processing

e.g. ^(IMG\d+)\.png$ will match a whole .png filename but will capture only the file's name (without the extension)
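
The capture-group example above can be tried in Python as follows. The list of filenames is hypothetical; group(1) returns the text matched by the first pair of parentheses.

```python
import re

pattern = re.compile(r"^(IMG\d+)\.png$")

# Hypothetical filenames; only the first two fit the pattern.
filenames = ["IMG0042.png", "IMG7.png", "IMG0042.jpg", "notes.png", "IMG12.png.bak"]

names = []
for filename in filenames:
    m = pattern.match(filename)
    if m:
        # group(1) holds only the part inside the parentheses.
        names.append(m.group(1))

print(names)  # ['IMG0042', 'IMG7']
```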

Use nested ( ) to extract multiple layers of information
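
A sketch of nested groups in Python, assuming a pattern that extends the IMG example: the outer group captures the file's name and the inner group captures only its numeric part. Groups are numbered by the position of their opening parenthesis, left to right.

```python
import re

# Illustrative pattern: group 1 = name without extension, group 2 = digits only.
pattern = re.compile(r"^(IMG(\d+))\.png$")

m = pattern.match("IMG0042.png")
print(m.group(0))  # IMG0042.png (the entire match)
print(m.group(1))  # IMG0042 (outer group)
print(m.group(2))  # 0042 (inner group)
print(m.groups())  # ('IMG0042', '0042')
```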