A library for parsing, compiling, and executing regular expressions. Match time is linear in the length of the text and of the regular expression, so it can safely handle input from untrusted users. The syntax is similar to PCRE but lacks a few features that cannot be implemented while keeping the space/time complexity guarantees, e.g. backreferences.
Syntax
Matching one character
.          any character except new line (includes new line with s flag)
\d         digit (\p{Nd})
\D         not digit
\pN        One-letter name Unicode character class
\p{Greek}  Unicode character class (general category or script)
\PN        Negated one-letter name Unicode character class
\P{Greek}  negated Unicode character class (general category or script)
Character classes
[xyz]         A character class matching either x, y or z (union).
[^xyz]        A character class matching any character except x, y and z.
[a-z]         A character class matching any character in range a-z.
[[:alpha:]]   ASCII character class ([A-Za-z])
[[:^alpha:]]  Negated ASCII character class ([^A-Za-z])
[\[\]]        Escaping in character classes (matching [ or ])
Composites
xy   concatenation (x followed by y)
x|y  alternation (x or y, prefer x)
Repetitions
x*       zero or more of x (greedy)
x+       one or more of x (greedy)
x?       zero or one of x (greedy)
x*?      zero or more of x (ungreedy/lazy)
x+?      one or more of x (ungreedy/lazy)
x??      zero or one of x (ungreedy/lazy)
x{n,m}   at least n x and at most m x (greedy)
x{n,}    at least n x (greedy)
x{n}     exactly n x
x{n,m}?  at least n x and at most m x (ungreedy/lazy)
x{n,}?   at least n x (ungreedy/lazy)
x{n}?    exactly n x
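To illustrate greedy versus lazy repetition, the following sketch lets two adjacent groups compete for the same text. It assumes the library is imported as `regex` and uses only `match` and `groupFirstCapture` as documented in the Procs section; the variable names are illustrative.

```nim
import regex

let text = "aaaa"
var m: RegexMatch

# Greedy first group: (a*) takes as much as it can,
# leaving nothing for the lazy (a*?).
doAssert text.match(re"(a*)(a*?)", m)
doAssert m.groupFirstCapture(0, text) == "aaaa"
doAssert m.groupFirstCapture(1, text) == ""

# Lazy first group: (a*?) takes as little as it can,
# so the greedy (a*) captures everything.
doAssert text.match(re"(a*?)(a*)", m)
doAssert m.groupFirstCapture(0, text) == ""
doAssert m.groupFirstCapture(1, text) == "aaaa"
```

Both patterns match the whole string; only the way the text is divided between the groups changes.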
Empty matches
^   the beginning of text (or start-of-line with multi-line mode)
$   the end of text (or end-of-line with multi-line mode)
\A  only the beginning of text (even with multi-line mode enabled)
\z  only the end of text (even with multi-line mode enabled)
\b  a Unicode word boundary (\w on one side and \W, \A, or \z on other)
\B  not a Unicode word boundary
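The difference between `^`/`$` and `\A`/`\z` only shows up in multi-line mode. The sketch below assumes the library is imported as `regex` and relies on the `in`/`notin` form of `contains` documented in the Procs section.

```nim
import regex

let text = "foo\nbar"

# By default ^ and $ only match at the start and end of the whole text.
doAssert re"^bar$" notin text

# With the m flag they also match at the start and end of each line.
doAssert re"(?m)^bar$" in text

# \A and \z always refer to the whole text, even in multi-line mode.
doAssert re"(?m)\Abar\z" notin text
doAssert re"(?m)\Afoo" in text
```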
Grouping and flags
(exp)          numbered capture group (indexed by opening parenthesis)
(?P<name>exp)  named (also numbered) capture group (allowed chars: [_0-9a-zA-Z])
(?:exp)        non-capturing group
(?flags)       set flags within current group
(?flags:exp)   set flags for exp (non-capturing)
Flags are each a single character. For example, (?x) sets the flag x and (?-x) clears the flag x. Multiple flags can be set or cleared at the same time: (?xy) sets both the x and y flags, (?x-y) sets the x flag and clears the y flag, and (?-xy) clears both the x and y flags.
i  case-insensitive: letters match both upper and lower case
m  multi-line mode: ^ and $ match begin/end of line
s  allow . to match \L (new line)
U  swap the meaning of x* and x*? (un-greedy mode)
u  Unicode support (enabled by default)
x  ignore whitespace and allow line comments (starting with `#`)
All flags are disabled by default unless stated otherwise
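As a small illustration of the flag syntax, the sketch below sets a flag for the whole pattern and for a single group. It assumes the library is imported as `regex` and uses the boolean `match` overload from the Procs section.

```nim
import regex

# (?i) applies to the rest of the current group, here the whole pattern.
doAssert match("HELLO", re"(?i)hello")
doAssert not match("HELLO", re"hello")

# (?i:exp) limits the flag to exp.
doAssert match("fooBAR", re"foo(?i:bar)")
doAssert not match("FOObar", re"foo(?i:bar)")

# (?s) lets . match a new line.
doAssert match("a\nb", re"(?s)a.b")
doAssert not match("a\nb", re"a.b")
```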
Escape sequences
\*          literal *, works for any punctuation character: \.+*?()|[]{}^$
\a          bell (\x07)
\f          form feed (\x0C)
\t          horizontal tab
\n          new line (\L)
\r          carriage return
\v          vertical tab (\x0B)
\123        octal character code (up to three digits)
\x7F        hex character code (exactly two digits)
\x{10FFFF}  any hex character code corresponding to a Unicode code point
\u007F      hex character code (exactly four digits)
\U0010FFFF  hex character code (exactly eight digits)
Perl character classes (Unicode friendly)
These classes are based on the definitions provided in UTS#18
\d  digit (\p{Nd})
\D  not digit
\s  whitespace (\p{White_Space})
\S  not whitespace
\w  word character (\p{Alphabetic} + \p{M} + \d + \p{Pc} + \p{Join_Control})
\W  not word character
ASCII character classes
[[:alnum:]]   alphanumeric ([0-9A-Za-z])
[[:alpha:]]   alphabetic ([A-Za-z])
[[:ascii:]]   ASCII ([\x00-\x7F])
[[:blank:]]   blank ([\t ])
[[:cntrl:]]   control ([\x00-\x1F\x7F])
[[:digit:]]   digits ([0-9])
[[:graph:]]   graphical ([!-~])
[[:lower:]]   lower case ([a-z])
[[:print:]]   printable ([ -~])
[[:punct:]]   punctuation ([!-/:-@\[-`{-~])
[[:space:]]   whitespace ([\t\n\v\f\r ])
[[:upper:]]   upper case ([A-Z])
[[:word:]]    word characters ([0-9A-Za-z_])
[[:xdigit:]]  hex digit ([0-9A-Fa-f])
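The practical difference between the Perl classes and the ASCII classes is how they treat non-ASCII input. The sketch below assumes the library is imported as `regex` and that Unicode support is left at its default (enabled); the sample characters are chosen for illustration.

```nim
import regex

# "٣" is ARABIC-INDIC DIGIT THREE (U+0663), general category Nd.
doAssert match("٣", re"\d")
doAssert not match("٣", re"[[:digit:]]")
doAssert match("3", re"[[:digit:]]")

# \w covers non-ASCII letters, while [[:word:]] is ASCII only.
doAssert match("ñ", re"\w")
doAssert not match("ñ", re"[[:word:]]")
```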
Lookaround Assertions
(?=regex)   A positive lookahead assertion
(?!regex)   A negative lookahead assertion
(?<=regex)  A positive lookbehind assertion
(?<!regex)  A negative lookbehind assertion
Any regular expression is valid inside a lookaround, and groups inside it are captured as well. Beware: lookarounds containing repetitions (*, +, and {n,}) may run in polynomial time.
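A lookaround checks the text next to the current position without consuming it. The sketch below assumes the library is imported as `regex` and uses the boolean `match` overload; the patterns are minimal illustrations rather than realistic use cases.

```nim
import regex

# Lookahead: "a" must (or must not) be followed by the given character,
# which is then consumed by \w.
doAssert match("ab", re"a(?=b)\w")
doAssert not match("ab", re"a(?=x)\w")
doAssert match("ab", re"a(?!x)\w")

# Lookbehind: the character consumed by \w must (or must not) be "a".
doAssert match("ab", re"\w(?<=a)b")
doAssert not match("ab", re"\w(?<!a)b")
```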
Examples
Multiple captures
Unlike most regex engines, this library supports capturing all repetitions. Most other libraries return only the last capture. The caveat is that even non-repeated groups or characters are returned as a list of captures instead of a single capture.
let text = "nim c --styleCheck:hint --colors:off regex.nim"
var m: RegexMatch
if match(text, re"nim c (?:--(\w+:\w+) *)+ (\w+).nim", m):
  doAssert m.group(0, text) == @["styleCheck:hint", "colors:off"]
  doAssert m.group(1, text) == @["regex"]
else:
  doAssert false, "no match"
Verbose Mode
Verbose mode (?x) makes regexes more readable by allowing comments and multiple lines within the regular expression itself. The caveat is that spaces and pound signs must be escaped to be matched.
const exp = re"""(?x)
\#   # the hashtag
\w+  # hashtag words
"""
let text = "#NimLang"
doAssert match(text, exp)
Find All
The findAll function will find all boundaries and captures that match the regular expression.
let text = """
The Continental's email list:
john_wick@continental.com
winston@continental.com
ms_perkins@continental.com
"""
var matches = newSeq[string]()
var captures = newSeq[string]()
for m in findAll(text, re"(\w+)@\w+\.\w+"):
  matches.add text[m.boundaries]
  captures.add m.group(0, text)
doAssert matches == @[
  "john_wick@continental.com",
  "winston@continental.com",
  "ms_perkins@continental.com"
]
doAssert captures == @["john_wick", "winston", "ms_perkins"]
Match Macro
The match macro is sometimes more convenient, and faster than the function version. It will run a full match on the whole string, similar to ^regex$.
A matches: seq[string] variable is injected into the scope, and it contains the submatches for every capture group.
var matched = false
let text = "[my link](https://example.com)"
match text, rex"\[([^\]]+)\]\((https?://[^)]+)\)":
  doAssert matches == @["my link", "https://example.com"]
  matched = true
doAssert matched
Procs
func re(s: string): Regex {.raises: [RegexError], tags: [].}
- Parse and compile a regular expression at run-time
Example:
let abcx = re"abc\w"
let abcx2 = re(r"abc\w")
let pat = r"abc\w"
let abcx3 = re(pat)
func re(s: static string): static[Regex] {.inline.}
- Parse and compile a regular expression at compile-time
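A minimal sketch of the compile-time form, assuming the library is imported as `regex`. Binding the result to a `const` forces evaluation at compile time, so an invalid pattern should surface as a compile error rather than a `RegexError` at run-time. The name `digits` is illustrative.

```nim
import regex

# The pattern is a string literal, so it can be compiled at compile time.
const digits = re"\d+"

doAssert match("2024", digits)
doAssert not match("20x4", digits)
```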
func toPattern(s: string): Regex {.raises: [RegexError], deprecated: "Use `re` instead", tags: [].}
func rex(s: string): RegexLit {.raises: [], tags: [].}
- Raw regex literal string
func group(m: RegexMatch; i: int): seq[Slice[int]] {.inline, raises: [], tags: [].}
- return slices for a given group. Use the iterator version if you care about performance
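The slice-returning overload can be sketched as follows, assuming the library is imported as `regex`. Each slice holds the bounds of one capture and can be used to index the original text.

```nim
import regex

let text = "abc"
var m: RegexMatch
doAssert text.match(re"(\w)+", m)

# One slice per repetition of the group.
doAssert m.group(0) == @[0 .. 0, 1 .. 1, 2 .. 2]

# Slices index into the original text.
doAssert text[m.group(0)[1]] == "b"
```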
func group(m: RegexMatch; i: int; text: string): seq[string] {.inline, raises: [], tags: [].}
- return seq of captured text by group number i
Example:
let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(hello) (?:([^\s]+)\s?)+", m)
doAssert m.group(0, text) == @["hello"]
doAssert m.group(1, text) == @["beautiful", "world"]
func groupFirstCapture(m: RegexMatch; i: int; text: string): string {.inline, raises: [], tags: [].}
- return first capture for a given capturing group
Example:
let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(hello) (?:([^\s]+)\s?)+", m)
doAssert m.groupFirstCapture(0, text) == "hello"
doAssert m.groupFirstCapture(1, text) == "beautiful"
func groupLastCapture(m: RegexMatch; i: int; text: string): string {.inline, raises: [], tags: [].}
- return last capture for a given capturing group
Example:
let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(hello) (?:([^\s]+)\s?)+", m)
doAssert m.groupLastCapture(0, text) == "hello"
doAssert m.groupLastCapture(1, text) == "world"
func group(m: RegexMatch; s: string): seq[Slice[int]] {.inline, raises: [KeyError], tags: [].}
- return slices for a given named group. Use the iterator version if you care about performance
func group(m: RegexMatch; groupName: string; text: string): seq[string] {.inline, raises: [KeyError], tags: [].}
- return seq of captured text by group groupName
Example:
let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.group("greet", text) == @["hello"]
doAssert m.group("who", text) == @["beautiful", "world"]
func groupFirstCapture(m: RegexMatch; groupName: string; text: string): string {.inline, raises: [KeyError], tags: [].}
- return first capture for a given capturing group
Example:
let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.groupFirstCapture("greet", text) == "hello"
doAssert m.groupFirstCapture("who", text) == "beautiful"
func groupLastCapture(m: RegexMatch; groupName: string; text: string): string {.inline, raises: [KeyError], tags: [].}
- return last capture for a given capturing group
Example:
let text = "hello beautiful world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?:(?P<who>[^\s]+)\s?)+", m)
doAssert m.groupLastCapture("greet", text) == "hello"
doAssert m.groupLastCapture("who", text) == "world"
func groupsCount(m: RegexMatch): int {.inline, raises: [], tags: [].}
- return the number of capturing groups
Example:
var m: RegexMatch
doAssert "ab".match(re"(a)(b)", m)
doAssert m.groupsCount == 2
func groupNames(m: RegexMatch): seq[string] {.inline, raises: [], tags: [].}
- return the names of capturing groups.
Example:
let text = "hello world"
var m: RegexMatch
doAssert text.match(re"(?P<greet>hello) (?P<who>world)", m)
doAssert m.groupNames() == @["greet", "who"]
func match(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {.inline, raises: [], tags: [RootEffect].}
- return true if the whole string matches the regular expression. This is similar to find(text, re"^regex$", m) but has better performance
Example:
var m: RegexMatch
doAssert "abcd".match(re"abcd", m)
doAssert not "abcd".match(re"abc", m)
func match(s: string; pattern: Regex): bool {.inline, raises: [], tags: [RootEffect].}
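The boolean overload is handy for validation, where the captures are not needed. The sketch below assumes the library is imported as `regex`; `isHexColor` is a hypothetical helper written for this example, not part of the library.

```nim
import regex

proc isHexColor(s: string): bool =
  ## Hypothetical helper: true if `s` is exactly "#" plus six hex digits.
  match(s, re"#[0-9A-Fa-f]{6}")

doAssert isHexColor("#1a2B3c")
doAssert not isHexColor("#1a2B3")    # too short
doAssert not isHexColor("x#1a2B3c")  # the whole string must match
```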
func findAll(s: string; pattern: Regex; start = 0): seq[RegexMatch] {.inline, raises: [], tags: [RootEffect].}
func findAllBounds(s: string; pattern: Regex; start = 0): seq[Slice[int]] {.inline, raises: [], tags: [RootEffect].}
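These two procs are the seq-returning counterparts of the iterators with the same names. A minimal sketch, assuming the library is imported as `regex`:

```nim
import regex

let text = "abcabc"

# Outside a for loop the call resolves to the func, which returns a seq.
let ms = findAll(text, re"bc")
doAssert ms.len == 2
doAssert text[ms[0].boundaries] == "bc"
doAssert ms[1].boundaries == 4 .. 5

# findAllBounds returns only the boundaries of each match.
let bounds = findAllBounds(text, re"bc")
doAssert bounds == @[1 .. 2, 4 .. 5]
```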
func findAndCaptureAll(s: string; pattern: Regex): seq[string] {.inline, raises: [], tags: [RootEffect].}
- search through the string and return a seq with captures.
Example:
doAssert findAndCaptureAll("a1b2c3d4e5", re"\d") == @["1", "2", "3", "4", "5"]
func contains(s: string; pattern: Regex): bool {.inline, raises: [], tags: [RootEffect].}
- search for the pattern anywhere in the string
Example:
doAssert re"bc" in "abcd"
doAssert re"(23)+" in "23232"
doAssert re"^(23)+$" notin "23232"
func find(s: string; pattern: Regex; m: var RegexMatch; start = 0): bool {.inline, raises: [], tags: [RootEffect].}
- search through the string looking for the first location where there is a match
Example:
var m: RegexMatch
doAssert "abcd".find(re"bc", m) and m.boundaries == 1 .. 2
doAssert not "abcd".find(re"de", m)
doAssert "2222".find(re"(22)*", m) and m.group(0) == @[0 .. 1, 2 .. 3]
func split(s: string; sep: Regex): seq[string] {.inline, raises: [], tags: [RootEffect].}
- return the substrings that are not matched by sep
Example:
doAssert split("11a22Ϊ33Ⓐ44弢55", re"\d+") == @["", "a", "Ϊ", "Ⓐ", "弢", ""]
func splitIncl(s: string; sep: Regex): seq[string] {.inline, raises: [], tags: [RootEffect].}
- return the substrings that are not matched by sep, including captured groups
Example:
let
  parts = splitIncl("a,b", re"(,)")
  expected = @["a", ",", "b"]
doAssert parts == expected
func startsWith(s: string; pattern: Regex; start = 0): bool {.inline, raises: [], tags: [RootEffect].}
- return whether the string starts with the pattern or not
Example:
doAssert "abc".startsWith(re"\w")
doAssert not "abc".startsWith(re"\d")
func endsWith(s: string; pattern: Regex): bool {.inline, raises: [], tags: [RootEffect].}
- return whether the string ends with the pattern or not
Example:
doAssert "abc".endsWith(re"\w")
doAssert not "abc".endsWith(re"\d")
func replace(s: string; pattern: Regex; by: string; limit = 0): string {.inline, raises: [ValueError], tags: [RootEffect].}
- Replace matched substrings.
Matched groups can be accessed with $N notation, where N is the group's index, starting at 1 (1-indexed). $$ means literal $.
If limit is given, at most limit replacements are done. A limit of 0 means there is no limit.
Example:
doAssert "aaa".replace(re"a", "b", 1) == "baa"
doAssert "abc".replace(re"(a(b)c)", "m($1) m($2)") == "m(abc) m(b)"
doAssert "Nim is awesome!".replace(re"(\w\B)", "$1_") == "N_i_m i_s a_w_e_s_o_m_e!"
func replace(s: string; pattern: Regex; by: proc (m: RegexMatch; s: string): string; limit = 0): string {.inline, raises: [], tags: [RootEffect].}
- Replace matched substrings.
If limit is given, at most limit replacements are done. A limit of 0 means there is no limit.
Example:
proc removeEvenWords(m: RegexMatch, s: string): string =
  result = ""
  if m.group(1).len mod 2 != 0:
    result = s[m.group(0)[0]]
let text = "Es macht Spaß, alle geraden Wörter zu entfernen!"
doAssert text.replace(re"((\w)+\s*)", removeEvenWords) == "macht , geraden entfernen!"
func isInitialized(re: Regex): bool {.inline, raises: [], tags: [].}
- Check whether the regex has been initialized
Example:
var re: Regex
doAssert not re.isInitialized
re = re"foo"
doAssert re.isInitialized
func escapeRe(s: string): string {.raises: [], tags: [].}
- Escape special regex characters in s so that it can be matched verbatim
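A typical use is building a pattern from text that should be taken literally. The sketch below assumes the library is imported as `regex`; it checks only the round-trip behaviour and makes no assumption about the exact escaped string that `escapeRe` produces.

```nim
import regex

let literal = "1+1=2"

# Unescaped, + is a repetition operator, so the pattern matches other text.
doAssert match("11=2", re"1+1=2")

# Escaped, the pattern matches only the literal text.
let pat = re(escapeRe(literal))
doAssert match(literal, pat)
doAssert not match("11=2", pat)
```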
Iterators
iterator group(m: RegexMatch; i: int): Slice[int] {.inline, raises: [], tags: [].}
- return slices for a given group. Slices with start > end are empty matches (e.g. re"(\d?)") and they are included, same as in PCRE.
Example:
let text = "abc"
var m: RegexMatch
doAssert text.match(re"(\w)+", m)
var captures = newSeq[string]()
for bounds in m.group(0):
  captures.add text[bounds]
doAssert captures == @["a", "b", "c"]
iterator group(m: RegexMatch; s: string): Slice[int] {.inline, raises: [KeyError], tags: [].}
- return slices for a given named group
Example:
let text = "abc"
var m: RegexMatch
doAssert text.match(re"(?P<foo>\w)+", m)
var captures = newSeq[string]()
for bounds in m.group("foo"):
  captures.add text[bounds]
doAssert captures == @["a", "b", "c"]
iterator findAll(s: string; pattern: Regex; start = 0): RegexMatch {.inline, raises: [], tags: [RootEffect].}
- search through the string and return each match. Empty matches (start > end) are included
Example:
let text = "abcabc"
var bounds = newSeq[Slice[int]]()
var found = newSeq[string]()
for m in findAll(text, re"bc"):
  bounds.add m.boundaries
  found.add text[m.boundaries]
doAssert bounds == @[1 .. 2, 4 .. 5]
doAssert found == @["bc", "bc"]
iterator findAllBounds(s: string; pattern: Regex; start = 0): Slice[int] {.inline, raises: [], tags: [RootEffect].}
- search through the string and return the boundaries of each match. Empty matches (start > end) are included
Example:
let text = "abcabc"
var bounds = newSeq[Slice[int]]()
for bd in findAllBounds(text, re"bc"):
  bounds.add bd
doAssert bounds == @[1 .. 2, 4 .. 5]
iterator split(s: string; sep: Regex): string {.inline, raises: [], tags: [RootEffect].}
- return the substrings that are not matched by sep
Example:
var found = newSeq[string]()
for s in split("11a22Ϊ33Ⓐ44弢55", re"\d+"):
  found.add s
doAssert found == @["", "a", "Ϊ", "Ⓐ", "弢", ""]
Macros
macro match(text: string; regex: RegexLit; body: untyped): untyped
- run the body if the whole string matches the regular expression. This is similar to the match function, but faster. Note that it requires a raw regex literal string as the second parameter; the regex must be known at compile time, and cannot be a var/let/const
A matches: seq[string] variable is injected into the scope, and it contains the submatches for every capture group. If a group is repeated (e.g. (\w)+), it will contain the last capture for that group.
Note: only available in Nim 1.1+
Example:
match "abc", rex"(a(b)c)":
  doAssert matches == @["abc", "b"]