regex

A library for parsing, compiling, and executing regular expressions. Match time is linear in the length of the text and of the regular expression, so it can safely handle input from untrusted users. The syntax is similar to PCRE but lacks a few features that cannot be implemented while keeping the space/time complexity guarantees, e.g. backreferences.

Syntax

Matching one character

.          any character except new line (includes new line with s flag)
\d         digit (\p{Nd})
\D         not digit
\pN        one-letter name Unicode character class
\p{Greek}  Unicode character class (general category or script)
\PN        negated one-letter name Unicode character class
\P{Greek}  negated Unicode character class (general category or script)

Character classes

[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[\[\]]        Escaping in character classes (matching [ or ])

Composites

xy   concatenation (x followed by y)
x|y  alternation (x or y, prefer x)

Repetitions

x*       zero or more of x (greedy)
x+       one or more of x (greedy)
x?       zero or one of x (greedy)
x*?      zero or more of x (ungreedy/lazy)
x+?      one or more of x (ungreedy/lazy)
x??      zero or one of x (ungreedy/lazy)
x{n,m}   at least n x and at most m x (greedy)
x{n,}    at least n x (greedy)
x{n}     exactly n x
x{n,m}?  at least n x and at most m x (ungreedy/lazy)
x{n,}?   at least n x (ungreedy/lazy)
x{n}?    exactly n x
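
The difference between greedy and lazy repetition is easiest to see on the boundaries of a match. The following is a minimal sketch, assuming the package is imported as `regex` and using only the `find` function and `boundaries` field documented further down; the sample strings are made up for illustration.

```nim
import regex

let text = "<a><b>"
var m: RegexMatch

# Greedy: .+ consumes as much as possible, so the match spans both tags.
doAssert text.find(re"<.+>", m)
doAssert text[m.boundaries] == "<a><b>"

# Lazy: .+? stops at the first closing bracket.
doAssert text.find(re"<.+?>", m)
doAssert text[m.boundaries] == "<a>"

# The same distinction applies to counted repetitions.
doAssert "aaaa".find(re"a{2,3}", m)
doAssert m.boundaries == 0 .. 2
doAssert "aaaa".find(re"a{2,3}?", m)
doAssert m.boundaries == 0 .. 1
```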

Empty matches

^   the beginning of text (or start-of-line with multi-line mode)
$   the end of text (or end-of-line with multi-line mode)
\A  only the beginning of text (even with multi-line mode enabled)
\z  only the end of text (even with multi-line mode enabled)
\b  a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B  not a Unicode word boundary
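
As an illustration of the assertions above, the sketch below checks where each one can match in a two-line string. It assumes the package is imported as `regex` and relies on the `contains` function (used through `in`/`notin`) documented further down; the sample text is made up.

```nim
import regex

let text = "first\nsecond"

# Without multi-line mode, ^ and $ only match at the start/end of the text.
doAssert re"^second$" notin text
# With (?m), ^ and $ also match at line boundaries.
doAssert re"(?m)^second$" in text
# \A still means start of text, even in multi-line mode.
doAssert re"(?m)\Asecond" notin text
# \b matches between the new line and the first letter of "second".
doAssert re"\bsecond\b" in text
# \B requires the absence of a word boundary, so it fails at the same spot.
doAssert re"\Bsecond" notin text
```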

Grouping and flags

(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)

Flags are each a single character. For example, (?x) sets the flag x and (?-x) clears the flag x. Multiple flags can be set or cleared at the same time: (?xy) sets both the x and y flags, (?x-y) sets the x flag and clears the y flag, and (?-xy) clears both the x and y flags.

i  case-insensitive: letters match both upper and lower case
m  multi-line mode: ^ and $ match begin/end of line
s  allow . to match \L (new line)
U  swap the meaning of x* and x*? (un-greedy mode)
u  Unicode support (enabled by default)
x  ignore whitespace and allow line comments (starting with `#`)

All flags are disabled by default unless stated otherwise.
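
A short sketch of how flags can be set, scoped, and cleared, assuming the package is imported as `regex` and using the two-argument `match` function documented further down. The inputs are illustrative only.

```nim
import regex

# (?i) makes the rest of the expression case-insensitive.
doAssert match("HELLO", re"(?i)hello")
doAssert not match("HELLO", re"hello")

# (?s) lets . match a new line.
doAssert not match("a\nb", re"a.b")
doAssert match("a\nb", re"(?s)a.b")

# (?flags:exp) limits the flag to exp; the b here stays case-sensitive.
doAssert match("Ab", re"(?i:a)b")
doAssert not match("AB", re"(?i:a)b")

# (?-i) clears a flag that was set earlier in the same group.
doAssert match("Ab", re"(?i)a(?-i)b")
doAssert not match("AB", re"(?i)a(?-i)b")
```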

Escape sequences

\*         literal *, works for any punctuation character: \.+*?()|[]{}^$
\a         bell (\x07)
\f         form feed (\x0C)
\t         horizontal tab
\n         new line (\L)
\r         carriage return
\v         vertical tab (\x0B)
\123       octal character code (up to three digits)
\x7F       hex character code (exactly two digits)
\x{10FFFF} any hex character code corresponding to a Unicode code point
\u007F     hex character code (exactly four digits)
\U0010FFFF hex character code (exactly eight digits)
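
A few of these escapes in use, as a hedged sketch: it assumes the package is imported as `regex` and that the source file is saved as UTF-8 (needed for the Ⓐ literal, which is code point U+24B6).

```nim
import regex

# Escaped punctuation is matched literally.
doAssert match("1+1", re"1\+1")
doAssert not match("11", re"1\+1")

# \t is interpreted by the regex engine, since re"..." is a raw string.
doAssert match("a\tb", re"a\tb")

# Two-digit hex, braced hex, and four-digit \u forms.
doAssert match("A", re"\x41")
doAssert match("Ⓐ", re"\x{24B6}")
doAssert match("Ⓐ", re"\u24B6")
```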

Perl character classes (Unicode friendly)

These classes are based on the definitions provided in UTS#18.

\d  digit (\p{Nd})
\D  not digit
\s  whitespace (\p{White_Space})
\S  not whitespace
\w  word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W  not word character

ASCII character classes

[[:alnum:]]   alphanumeric ([0-9A-Za-z])
[[:alpha:]]   alphabetic ([A-Za-z])
[[:ascii:]]   ASCII ([\x00-\x7F])
[[:blank:]]   blank ([\t ])
[[:cntrl:]]   control ([\x00-\x1F\x7F])
[[:digit:]]   digits ([0-9])
[[:graph:]]   graphical ([!-~])
[[:lower:]]   lower case ([a-z])
[[:print:]]   printable ([ -~])
[[:punct:]]   punctuation ([!-/:-@\[-`{-~])
[[:space:]]   whitespace ([\t\n\v\f\r ])
[[:upper:]]   upper case ([A-Z])
[[:word:]]    word characters ([0-9A-Za-z_])
[[:xdigit:]]  hex digit ([0-9A-Fa-f])
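
The practical difference between the Unicode-friendly Perl classes and the ASCII classes shows up on non-ASCII input. This is a minimal sketch, assuming the package is imported as `regex`, default flags, and a UTF-8 source file; the sample characters are ٣ (ARABIC-INDIC DIGIT THREE), 弢 (a CJK ideograph), and Ϊ (a Greek capital letter).

```nim
import regex

# \d covers every decimal digit (\p{Nd}); [[:digit:]] is limited to 0-9.
doAssert match("٣", re"\d")
doAssert not match("٣", re"[[:digit:]]")

# \w covers alphabetic characters of any script; [[:word:]] is ASCII only.
doAssert match("弢", re"\w")
doAssert not match("弢", re"[[:word:]]")

# Unicode classes by general category and by script.
doAssert match("Ϊ", re"\pL")
doAssert match("Ϊ", re"\p{Lu}")
doAssert match("Ϊ", re"\p{Greek}")
doAssert not match("Ϊ", re"\PL")
```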

Lookaround Assertions

(?=regex)   A positive lookahead assertion
(?!regex)   A negative lookahead assertion
(?<=regex)  A positive lookbehind assertion
(?<!regex)  A negative lookbehind assertion

Any regular expression is a valid lookaround, and groups inside it are captured as well. Beware: lookarounds containing repetitions (*, +, and {n,}) may run in polynomial time.
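
Lookarounds assert on the surrounding text without consuming it, so they do not extend the boundaries of the match. The sketch below illustrates this, assuming the package is imported as `regex` and using the `find` and `contains` functions documented further down; the sample strings are made up.

```nim
import regex

var m: RegexMatch

# Positive lookahead: "bar" must follow, but is not part of the match.
doAssert "foobar".find(re"foo(?=bar)", m)
doAssert m.boundaries == 0 .. 2

# Negative lookahead: "bar" must not follow.
doAssert re"foo(?!bar)" notin "foobar"
doAssert re"foo(?!bar)" in "foobaz"

# Positive lookbehind: the digits must be preceded by a dollar sign.
doAssert "cost: $42".find(re"(?<=\$)\d+", m)
doAssert m.boundaries == 7 .. 8

# Negative lookbehind: b must not be preceded by a.
doAssert re"(?<!a)b" notin "ab"
doAssert re"(?<!a)b" in "cb"
```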

Examples

Multiple captures

Unlike most regex engines, which return only the last capture of a repeated group, this library captures every repetition. The caveat is that every group, even a non-repeated one, is returned as a list of captures instead of a single capture.

let text = "nim c --styleCheck:hint --colors:off regex.nim"
var m: RegexMatch
if match(text, re"nim c (?:--(\w+:\w+) *)+ (\w+)\.nim", m):
  doAssert m.group(0, text) == @["styleCheck:hint", "colors:off"]
  doAssert m.group(1, text) == @["regex"]
else:
  doAssert false, "no match"

Verbose Mode

Verbose mode (?x) makes regexes more readable by allowing comments and line breaks within the regular expression itself. The caveat is that spaces and pound signs must be escaped to be matched.

const exp = re"""(?x)
\#   # the hashtag
\w+  # hashtag words
"""
let text = "#NimLang"
doAssert match(text, exp)

Find All

The findAll function will find all boundaries and captures that match the regular expression.

let text = """
The Continental's email list:
john_wick@continental.com
winston@continental.com
ms_perkins@continental.com
"""
var matches = newSeq[string]()
var captures = newSeq[string]()
for m in findAll(text, re"(\w+)@\w+\.\w+"):
  matches.add text[m.boundaries]
  captures.add m.group(0, text)
doAssert matches == @[
  "john_wick@continental.com",
  "winston@continental.com",
  "ms_perkins@continental.com"
]
doAssert captures == @["john_wick", "winston", "ms_perkins"]

Match Macro

The match macro is sometimes more convenient and faster than the function version. It runs a full match on the whole string, similar to ^regex$.

A matches: seq[string] variable is injected into the scope, and it contains the submatches for every capture group.

var matched = false
let text = "[my link](https://example.com)"
match text, rex"\[([^\]]+)\]\((https?://[^)]+)\)":
  doAssert matches == @["my link", "https://example.com"]
  matched = true
doAssert matched

Procs

func re(s: string): Regex {.raises: [RegexError], tags: [].}
Parse and compile a regular expression at run-time.

Example:

let abcx = re"abc\w"
let abcx2 = re(r"abc\w")
let pat = r"abc\w"
let abcx3 = re(pat)
func re(s: static string): static[Regex] {.inline.}
Parse and compile a regular expression at compile-time.
func toPattern(s: string): Regex {.raises: [RegexError],
                                   deprecated: "Use `re` instead", tags: [].}
Deprecated: Use `re` instead.
func rex(s: string): RegexLit {.raises: [], tags: [].}
Raw regex literal string.
func group(m: RegexMatch; i: int): seq[Slice[int]] {.inline, raises: [],
    tags: [].}
Return the slices for a given group. Use the iterator version if performance matters.
func group(m: RegexMatch; i: int; text: string): seq[string] {.inline,
    raises: [], tags: [].}
Return a seq of the text captured by group number i.

Example:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(hello) (?:([^\s]+)\s?)+", m)
doAssert m.group(0, text) == @["hello"]
doAssert m.group(1, text) == @["beautiful", "world"]
func groupFirstCapture(m: RegexMatch; i: int; text: string): string {.inline,
    raises: [], tags: [].}
Return the first capture for a given capturing group.

Example:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(hello) (?:([^\s]+)\s?)+", m)
doAssert m.groupFirstCapture(0, text) == "hello"
doAssert m.groupFirstCapture(1, text) == "beautiful"
func groupLastCapture(m: RegexMatch; i: int; text: string): string {.inline,
    raises: [], tags: [].}
Return the last capture for a given capturing group.

Example:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(hello) (?:([^\s]+)\s?)+", m)
doAssert m.groupLastCapture(0, text) == "hello"
doAssert m.groupLastCapture(1, text) == "world"
func group(m: RegexMatch; s: string): seq[Slice[int]] {.inline,
    raises: [KeyError], tags: [].}
Return the slices for a given named group. Use the iterator version if performance matters.
func group(m: RegexMatch; groupName: string; text: string): seq[string] {.
    inline, raises: [KeyError], tags: [].}
Return a seq of the text captured by the group named groupName.

Example:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.group("greet", text) == @["hello"]
doAssert m.group("who", text) == @["beautiful", "world"]
func groupFirstCapture(m: RegexMatch; groupName: string; text: string): string {.
    inline, raises: [KeyError], tags: [].}
Return the first capture for a given named capturing group.

Example:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.groupFirstCapture("greet", text) == "hello"
doAssert m.groupFirstCapture("who", text) == "beautiful"
func groupLastCapture(m: RegexMatch; groupName: string; text: string): string {.
    inline, raises: [KeyError], tags: [].}
Return the last capture for a given named capturing group.

Example:

let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.groupLastCapture("greet", text) == "hello"
doAssert m.groupLastCapture("who", text) == "world"
func groupsCount(m: RegexMatch): int {.inline, raises: [], tags: [].}
Return the number of capturing groups.

Example:

var m: RegexMatch
doAssert "ab".match(re"(a)(b)", m)
doAssert m.groupsCount == 2
func groupNames(m: RegexMatch): seq[string] {.inline, raises: [], tags: [].}
Return the names of the capturing groups.

Example:

let text = "hello world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?P<who>world)", m)
doAssert m.groupNames() == @["greet", "who"]
func match(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {.
    inline, raises: [], tags: [RootEffect].}
Return true if the whole string matches the regular expression. This is similar to find(text, re"^regex$", m) but has better performance.

Example:

var m: RegexMatch
doAssert "abcd".match(re"abcd", m)
doAssert not "abcd".match(re"abc", m)
func match(s: string; pattern: Regex): bool {.inline, raises: [],
    tags: [RootEffect].}
func findAll(s: string; pattern: Regex; start = 0): seq[RegexMatch] {.inline,
    raises: [], tags: [RootEffect].}
func findAllBounds(s: string; pattern: Regex; start = 0): seq[Slice[int]] {.
    inline, raises: [], tags: [RootEffect].}
func findAndCaptureAll(s: string; pattern: Regex): seq[string] {.inline,
    raises: [], tags: [RootEffect].}
Search through the string and return a seq with the captures.

Example:

doAssert findAndCaptureAll("a1b2c3d4e5", re"\d") ==
  @["1", "2", "3", "4", "5"]
func contains(s: string; pattern: Regex): bool {.inline, raises: [],
    tags: [RootEffect].}
Search for the pattern anywhere in the string.

Example:

doAssert re"bc" in "abcd"
doAssert re"(23)+" in "23232"
doAssert re"^(23)+$" notin "23232"
func find(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {.
    inline, raises: [], tags: [RootEffect].}
Search through the string looking for the first location where there is a match.

Example:

var m: RegexMatch
doAssert "abcd".find(re"bc", m) and
  m.boundaries == 1 .. 2
doAssert not "abcd".find(re"de", m)
doAssert "2222".find(re"(22)*", m) and
  m.group(0) == @[0 .. 1, 2 .. 3]
func split(s: string; sep: Regex): seq[string] {.inline, raises: [],
    tags: [RootEffect].}
Return the substrings that were not matched.

Example:

doAssert split("11a22Ϊ33Ⓐ44弢55", re"\d+") ==
  @["", "a", "Ϊ", "Ⓐ", "弢", ""]
func splitIncl(s: string; sep: Regex): seq[string] {.inline, raises: [],
    tags: [RootEffect].}
Return the substrings that were not matched, including captured groups.

Example:

let
  parts = splitIncl("a,b", re"(,)")
  expected = @["a", ",", "b"]
doAssert parts == expected
func startsWith(s: string; pattern: Regex; start = 0): bool {.inline,
    raises: [], tags: [RootEffect].}
Return whether the string starts with the pattern.

Example:

doAssert "abc".startsWith(re"\w")
doAssert not "abc".startsWith(re"\d")
func endsWith(s: string; pattern: Regex): bool {.inline, raises: [],
    tags: [RootEffect].}
Return whether the string ends with the pattern.

Example:

doAssert "abc".endsWith(re"\w")
doAssert not "abc".endsWith(re"\d")
func replace(s: string; pattern: Regex; by: string; limit = 0): string {.inline,
    raises: [ValueError], tags: [RootEffect].}

Replace matched substrings.

Matched groups can be accessed with $N notation, where N is the group's index, starting at 1 (1-indexed). $$ means literal $.

If limit is given, at most limit replacements are done. A limit of 0 means there is no limit.

Example:

doAssert "aaa".replace(re"a", "b", 1) == "baa"
doAssert "abc".replace(re"(a(b)c)", "m($1) m($2)") ==
  "m(abc) m(b)"
doAssert "Nim is awesome!".replace(re"(\w\B)", "$1_") ==
  "N_i_m i_s a_w_e_s_o_m_e!"
func replace(s: string; pattern: Regex;
             by: proc (m: RegexMatch; s: string): string; limit = 0): string {.
    inline, raises: [], tags: [RootEffect].}

Replace matched substrings.

If limit is given, at most limit replacements are done. A limit of 0 means there is no limit.

Example:

proc removeEvenWords(m: RegexMatch, s: string): string =
  result = ""
  if m.group(1).len mod 2 != 0:
    result = s[m.group(0)[0]]

let text = "Es macht Spaß, alle geraden Wörter zu entfernen!"
doAssert text.replace(re"((\w)+\s*)", removeEvenWords) ==
  "macht , geraden entfernen!"
func isInitialized(re: Regex): bool {.inline, raises: [], tags: [].}
Check whether the regex has been initialized.

Example:

var re: Regex
doAssert not re.isInitialized
re = re"foo"
doAssert re.isInitialized
func escapeRe(s: string): string {.raises: [], tags: [].}
Escape special regex characters in s so that it can be matched verbatim.
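
A typical use is building a pattern from text that should be matched literally. This is a minimal sketch, assuming the package is imported as `regex`; the `literal` and `dotted` strings are made-up inputs, and the escaped result is passed to the run-time `re` overload.

```nim
import regex

# Hypothetical input containing the regex metacharacter +.
let literal = "1+1=2"
let pattern = re(escapeRe(literal))

# The escaped pattern matches the original text verbatim.
doAssert match(literal, pattern)
# Unescaped, 1+ would mean "one or more 1"; the escaped + must be a literal +.
doAssert not match("11=2", pattern)

# A dot is escaped as well, so it no longer matches any character.
let dotted = "a.b"
doAssert re(escapeRe(dotted)) in "xa.by"
doAssert re(escapeRe(dotted)) notin "xaxby"
```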

Iterators

iterator group(m: RegexMatch; i: int): Slice[int] {.inline, raises: [], tags: [].}
Return the slices for a given group. Slices with start > end are empty matches (e.g. re"(\d?)") and are included, the same as in PCRE.

Example:

let text = "abc"
var m: RegexMatch
doAssert text.match(re"(\w)+", m)
var captures = newSeq[string]()
for bounds in m.group(0):
  captures.add text[bounds]
doAssert captures == @["a", "b", "c"]
iterator group(m: RegexMatch; s: string): Slice[int] {.inline,
    raises: [KeyError], tags: [].}
Return the slices for a given named group.

Example:

let text = "abc"
var m: RegexMatch
doAssert text.match(re"(?P<foo>\w)+", m)
var captures = newSeq[string]()
for bounds in m.group("foo"):
  captures.add text[bounds]
doAssert captures == @["a", "b", "c"]
iterator findAll(s: string; pattern: Regex; start = 0): RegexMatch {.inline,
    raises: [], tags: [RootEffect].}
Search through the string and return each match. Empty matches (start > end) are included.

Example:

let text = "abcabc"
var bounds = newSeq[Slice[int]]()
var found = newSeq[string]()
for m in findAll(text, re"bc"):
  bounds.add m.boundaries
  found.add text[m.boundaries]
doAssert bounds == @[1 .. 2, 4 .. 5]
doAssert found == @["bc", "bc"]
iterator findAllBounds(s: string; pattern: Regex; start = 0): Slice[int] {.
    inline, raises: [], tags: [RootEffect].}
Search through the string and return the boundaries of each match. Empty matches (start > end) are included.

Example:

let text = "abcabc"
var bounds = newSeq[Slice[int]]()
for bd in findAllBounds(text, re"bc"):
  bounds.add bd
doAssert bounds == @[1 .. 2, 4 .. 5]
iterator split(s: string; sep: Regex): string {.inline, raises: [],
    tags: [RootEffect].}
Return the substrings that were not matched.

Example:

var found = newSeq[string]()
for s in split("11a22Ϊ33Ⓐ44弢55", re"\d+"):
  found.add s
doAssert found == @["", "a", "Ϊ", "Ⓐ", "弢", ""]

Macros

macro match(text: string; regex: RegexLit; body: untyped): untyped

Run body if the whole string matches the regular expression. This is similar to the match function, but faster. Note that it requires a raw regex literal string as the second parameter: the regex must be known at compile time and cannot be a var/let/const.

A matches: seq[string] variable is injected into the scope, and it contains the submatches for every capture group. If a group is repeated (e.g. (\w)+), it will contain the last capture for that group.

Note: Only available in Nim 1.1 or later.

Example:

match "abc", rex"(a(b)c)":
  doAssert matches == @["abc", "b"]