/Reg(exp){2}lained/

Demystifying regular expressions

By Lea Verou (@LeaVerou)

#funfact

I grew up in Lesbos, Greece

…which technically makes me geographically Lesbian

W3C CSS Working Group Invited Expert

★★★★★ on Amazon

2014 – Haystack Group, MIT CSAIL

“Regular expression”
“Regexp”
“Regex”
“ASCII puke”

/^Reg(exp?|ular expression)$/i

Regular expression: a way to describe a set of strings

`/&(lt|gt|amp);/gi`

`/^Reg(exp?|ular expression)$/i`

"pâté".normalize("NFD").replace(/[\u0300-\u036f]/g, "") // "pate"

'lea@verou.me'.match(/(\w+)@/)[1] // "lea"

code.replace(/ {4}/g, '\t'); // fix broken indentation

<input name="zip" pattern="\d{5}" />

Also: text editors, IDEs, command line tools (grep, etc), databases, and more

What we learned

Regexes match anywhere, unless restricted
Matches cannot intersect
Case sensitive by default, use the i flag to change
Unicode characters by \uXXXX

Demo: \u2665 matches ♥; emoji can be matched literally, and a rainbow emoji pattern also matches inside the rainbow flag emoji

Flag	Name	Purpose	ES version
`g`	Global	Get all matches	ES3
`i`	Case Insensitive	Ignore case when matching	ES3
`m`	Multiline	`^` and `$` match the beginning and end of each line, not of the whole string	ES3
`y`	Sticky	Anchors each match to the end of the previous match (the regex's lastIndex)	ES2015
`u`	Unicode	Treat the pattern as a sequence of Unicode code points	ES2015
`s`	DotAll	`.` matches line breaks as well	ES2018
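A quick sketch of what each flag changes, using a made-up three-line string. The sample text and variable names are illustrative assumptions, not from the slides.

```javascript
const text = "Cat\ncat\nCAT";

// g: return every match instead of only the first one
const all = text.match(/cat/g);          // ["cat"]
// i: ignore case
const anyCase = text.match(/cat/gi);     // ["Cat", "cat", "CAT"]
// m: ^ anchors at the start of every line, not just the string
const withM = text.match(/^c/gim);       // ["C", "c", "C"]
const withoutM = text.match(/^c/gi);     // ["C"]
// s: the dot also matches line breaks
const dotPlain = /Cat.cat/.test(text);   // false
const dotAll = /Cat.cat/s.test(text);    // true
// y: the match must start exactly at lastIndex
const sticky = /cat/y;
sticky.lastIndex = 4;
const stickyHit = sticky.test(text);     // true, "cat" starts at index 4
sticky.lastIndex = 5;
const stickyMiss = sticky.test(text);    // false, nothing starts at index 5
// u: the dot consumes a whole code point, not half a surrogate pair
const dotNoU = /^.$/.test("👩");         // false
const dotU = /^.$/u.test("👩");          // true
```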

What we learned

| for alternatives
Group with parentheses (useful for alternatives or quantifiers)
Alternatives can be empty
Order matters!

Demo: (c|b|f|)at|dog for alternatives and groups; do(|g) to show that order matters (the empty alternative is tried first)
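To see why order matters, here is a small sketch comparing an empty alternative placed first and last. The test strings are assumptions chosen for illustration.

```javascript
// Alternatives are tried left to right; the first one that lets the
// overall match succeed wins, even if a later one would match more.
const emptyFirst = "dog".match(/do(|g)/)[0]; // "do"
const emptyLast = "dog".match(/do(g|)/)[0];  // "dog"

// An empty alternative makes the prefix optional.
const words = "cat bat fat at dog".match(/(c|b|f|)at|dog/g);
// ["cat", "bat", "fat", "at", "dog"]
```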

Test Challenge

fizz buzz fizzbuzz

Fizzbuzz

// Matches everything, including ""
/(fizz|)(buzz|)/

// Correct but inelegant
/(fizzbuzz|fizz|buzz)/

/(fizz(buzz|)|buzz)/
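A possible way to check the candidates. The ^ and $ anchors are an addition here (the slides show the patterns unanchored) so that each sample must match as a whole.

```javascript
const samples = ["fizz", "buzz", "fizzbuzz", "", "foo"];

const tooLax = /^(fizz|)(buzz|)$/;      // both halves optional
const compact = /^(fizz(buzz|)|buzz)$/; // at least one half required

const laxHits = samples.filter(s => tooLax.test(s));
// ["fizz", "buzz", "fizzbuzz", ""]  (the empty string slips through)
const compactHits = samples.filter(s => compact.test(s));
// ["fizz", "buzz", "fizzbuzz"]

// Order matters in the unanchored version: the shorter alternative wins if listed first.
const shortFirst = "fizzbuzz".match(/fizz|buzz|fizzbuzz/)[0]; // "fizz"
const longFirst = "fizzbuzz".match(/fizzbuzz|fizz|buzz/)[0];  // "fizzbuzz"
```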

Emojis with woman

👩 👩‍💼 👪 👨‍👩‍👦‍👦 👩‍👦‍👦 👩‍👧 👩‍👦 👩‍👩‍👦‍👦 👩‍👩‍👦 👩‍👩‍👧 💏 👩‍❤️‍💋‍👩 not 👨‍👧 👨‍👨‍👧 👨‍👨‍👦 👨‍❤️‍👨

Emojis with woman

// Ewww, just stop
/👩|👩‍💼|👪|👩‍👦‍👦|👩‍👧|👩‍👦|👩‍👩‍👦‍👦|👩‍👩‍👦|👩‍👩‍👧|👨‍👩‍👦‍👦|.../

/👩/ // Wrong, doesn’t include 👪 and 💏!

/👩|👪|💏/

Demo: /1.5/ also matches inside 21056, because the dot matches any character; without the u flag, two dots are needed to match one emoji

What we learned

Dot (.) matches anything, except line breaks
The s (dotAll) flag, standardized in ES2018, makes it match line breaks
Escape metacharacters with a backslash

The 12 metacharacters

[ { ( ) \ ^ $ . | ? * +
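When a pattern is built from arbitrary text, those 12 characters need a backslash. The helper below is a hypothetical sketch (the name escapeRegExp is an assumption, not a built-in used in the slides); many real-world helpers also escape ] and }, which matter under the u flag.

```javascript
// Prefix each of the 12 metacharacters with a backslash.
// "$&" in the replacement stands for the matched character.
function escapeRegExp(literal) {
  return literal.replace(/[[{()\\^$.|?*+]/g, "\\$&");
}

const unescaped = /1.5/.test("21056");                          // true, "." matches "0"
const escaped = new RegExp(escapeRegExp("1.5")).test("21056"); // false
const literalHit = new RegExp(escapeRegExp("1.5")).test("v1.5"); // true
```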

What we learned

{n} = n times
{m,n} = at least m times but no more than n times
{m,} = at least m times

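Demo: (ab){n} repeats the whole group; a{2,5} against caaaaaaaaaat; emoji need parentheses before a quantifier (without the u flag, the quantifier only applies to the last code unit)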

What we learned

* = {0,} , + = {1,} , ? = {0,1}
Careful of accidental zero length matches!
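A short sketch of the shorthand quantifiers and of the zero-length trap, on made-up strings.

```javascript
const star = "caaat ct".match(/ca*t/g);        // ["caaat", "ct"]
const braces = "caaat ct".match(/ca{0,}t/g);   // same as above
const plus = "caaat ct".match(/ca+t/g);        // ["caaat"]
const optional = "cat ct caat".match(/ca?t/g); // ["cat", "ct"]

// x* is happy with zero x's, so it "matches" at every position.
const zeroLength = "ab".match(/x*/g);          // ["", "", ""]
```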

What we learned

Quantifiers are greedy
Lazify them by adding a ? after them

Delete last > to show what happens
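The difference is easiest to see on markup-like text. The sample string is an assumption for illustration.

```javascript
const html = "<b>bold</b> and <i>italic</i>";

// Greedy: .+ grabs as much as it can, then backs off to the LAST ">".
const greedy = html.match(/<.+>/)[0];
// "<b>bold</b> and <i>italic</i>"

// Lazy: .+? grabs as little as it can, stopping at the FIRST ">".
const lazy = html.match(/<.+?>/g);
// ["<b>", "</b>", "<i>", "</i>"]
```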

What we learned

Brackets = set, range or a combination of both
Concatenate multiple ranges for a union
Most metacharacters don’t need escaping in []

Ranges follow Unicode code point order

Hex color

#abc, #f00, #BADA55, #C0FFEE

Hex color

/#[a-f0-9]{3,6}/i // Wrong!

/#([a-f0-9]{3}){1,2}/i
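A sketch comparing the two patterns. Anchors are added here (an assumption beyond the slide) so that the whole string has to be a color.

```javascript
const wrong = /^#[a-f0-9]{3,6}$/i;     // also allows 4 or 5 digits
const right = /^#([a-f0-9]{3}){1,2}$/i; // exactly 3 or 6 digits

const colors = ["#abc", "#f00", "#BADA55", "#C0FFEE", "#abcd", "#abcde"];
const wrongHits = colors.filter(c => wrong.test(c)); // all six
const rightHits = colors.filter(c => right.test(c));
// ["#abc", "#f00", "#BADA55", "#C0FFEE"]
```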

What we learned

\w = [a-zA-Z0-9_], \d = [0-9], \s ≈ [\t\r\n ]
Combine to form more complex character classes

Counting words (roughly)

function wordCount(text){
	// match() returns null when nothing matches, so fall back to an empty list
	return (text.match(/\w+/g) || []).length;
}

function wordCount(text){
	return text.split(/\s+/).length;
}

Number

Without exponent or digit separators

-1 .05 +1000 3.1415926535 42.

Number

/^[-+]?[\d.]+$/ // Too lax

/^[-+]?\d*\.?\d+$/ // False negatives: 5.

/^[-+]?\d*\.?\d*$/ // False positives: ., +., + etc

// Accurate, but is it worth it?
/^[-+]?(\d*\.?\d+|\d+\.)$/
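One way to compare the four attempts is to run them over the valid samples plus a few strings that should be rejected. The invalid list is an assumption added for illustration.

```javascript
const valid = ["-1", ".05", "+1000", "3.1415926535", "42."];
const invalid = [".", "+.", "+", "", "1.2.3"];

const lax = /^[-+]?[\d.]+$/;
const noTrailingDot = /^[-+]?\d*\.?\d+$/;
const tooOptional = /^[-+]?\d*\.?\d*$/;
const accurate = /^[-+]?(\d*\.?\d+|\d+\.)$/;

const accepted = (re, list) => list.filter(s => re.test(s));

const laxFalsePositives = accepted(lax, invalid);            // [".", "+.", "1.2.3"]
const missed = valid.filter(s => !noTrailingDot.test(s));    // ["42."]
const optionalFalsePositives = accepted(tooOptional, invalid); // [".", "+.", "+", ""]
const accurateOk = accepted(accurate, valid).length === valid.length
  && accepted(accurate, invalid).length === 0;               // true
```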

What we learned

^ negates a character class
\W = [^\w], \D = [^\d], \S = [^\s]
Even the dot is a character class: . = [^\r\n\u2028\u2029]
DotAll alternatives: [^], [\S\s], [\W\w], [\D\d]

Strip HTML

function stripHTML(str){
    return str.replace(/<.+?>/g, '');
}

function stripHTML(str){
    return str.replace(/<[^>]+>/g, '');
}

Warning: Will fail in edge cases
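A few inputs that show both the happy path and the kind of edge case the warning refers to. The sample strings are assumptions.

```javascript
function stripHTML(str) {
  return str.replace(/<[^>]+>/g, "");
}

const simple = stripHTML("<p>Hello <b>world</b></p>");
// "Hello world"

// A ">" inside an attribute value ends the "tag" too early.
const attribute = stripHTML('<a title="a > b">link</a>');
// ' b">link'

// Plain text with "<" and ">" is treated as a tag.
const math = stripHTML("1 < 2 and 3 > 2");
// "1  2"
```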

Credit card numbers

4060 1234 5678 9000 4060-1234-5678-3457 1230123456789123 4.060/123456-78 90-00 not hello 4060 12345678901234567890

Credit card numbers

// Never do this!
/\d{16}/

// Limited grouping + allows 19 digit numbers
/(\d{4}.?){3}\d{4}/

/(\d\D*){16}/
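A sketch that runs the three patterns over the sample numbers. The anchored variant at the end is an assumption added here, to show one way of rejecting the overlong number that the unanchored patterns accept.

```javascript
const naive = /\d{16}/;
const grouped = /(\d{4}.?){3}\d{4}/;
const flexible = /(\d\D*){16}/;
const anchored = /^(\d\D*){16}$/; // hypothetical variant, not from the slides

const spaced = "4060 1234 5678 9000";
const messy = "4.060/123456-78 90-00";
const tooLong = "4060 12345678901234567890";

const spacedResult = [naive, grouped, flexible].map(re => re.test(spaced));
// [false, true, true]
const messyResult = [naive, grouped, flexible].map(re => re.test(messy));
// [false, false, true]
const tooLongResult = [naive, grouped, flexible, anchored].map(re => re.test(tooLong));
// [true, true, true, false]
const anchoredMessy = anchored.test(messy); // true
```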

What we learned

Syntax: \p{UNICODE_PROPERTY=VALUE}. Needs u flag.
When Unicode Property is General_Category it can be omitted.
Standardized in ES2018; supported in current versions of the major browsers

Example: \p{Script=Cyrillic}
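A minimal sketch of property escapes, assuming an engine with ES2018 support. The sample strings are illustrative.

```javascript
// Script property: runs of Cyrillic letters only
const cyrillic = "Лиа Verou".match(/\p{Script=Cyrillic}+/gu); // ["Лиа"]

// General_Category values can be written without the property name:
// \p{L} is short for \p{General_Category=Letter}
const letters = "Лиа Verou".match(/\p{L}+/gu);                // ["Лиа", "Verou"]
const uppercase = "Lea Verou".match(/\p{Lu}/gu);              // ["L", "V"]
```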

What we learned

Parentheses form capturing groups
Add ?: at the beginning to avoid this
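To illustrate, here is the same pattern with and without ?: on two of its groups. The pattern is an assumption built around the earlier email example.

```javascript
// Every pair of parentheses adds an entry to the match result...
const capturing = Array.from("lea@verou.me".match(/(\w+)@(\w+)\.(\w+)/));
// ["lea@verou.me", "lea", "verou", "me"]

// ...unless the group starts with ?:
const nonCapturing = Array.from("lea@verou.me".match(/(\w+)@(?:\w+)\.(?:\w+)/));
// ["lea@verou.me", "lea"]
```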

What we learned

^ = beginning of string
$ = end of string
beginning/end of lines, with the m flag

ISO 8601 Dates

Just dates, no time or timezone information

2012-12-12, 1986-06-13

ISO 8601 Dates

/^\d{4}-\d{2}-\d{2}$/.test(str)

/^\d{4}-(0\d|1[0-2])-([0-2]\d|3[01])$/.test(s)

Can it be improved further?
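One possible answer is to stop refining the regex and let Date do the calendar arithmetic. This is a sketch, not the talk's solution; isValidISODate is a hypothetical helper name.

```javascript
function isValidISODate(str) {
  // The regex only checks the shape and extracts the parts
  const m = /^(\d{4})-(\d{2})-(\d{2})$/.exec(str);
  if (!m) return false;
  const [year, month, day] = m.slice(1).map(Number);

  // Build the date in UTC; out-of-range parts roll over (Feb 30 becomes Mar 1)
  const date = new Date(0);
  date.setUTCFullYear(year, month - 1, day);

  // If nothing rolled over, the date really exists
  return date.getUTCFullYear() === year
    && date.getUTCMonth() === month - 1
    && date.getUTCDate() === day;
}

const stricterRegex = /^\d{4}-(0\d|1[0-2])-([0-2]\d|3[01])$/;
const regexSaysOk = stricterRegex.test("2012-02-31"); // true, yet no such day
const dateSaysOk = isValidISODate("2012-02-31");      // false
```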

Trimming a string

if (!String.prototype.trim) {
    String.prototype.trim = function(){
        return this.replace(/^\s+|\s+$/g, '');
    }
}

What we learned

Assertions are zero width
\b = word boundary = between \w and \W|^|$
\B = non-word boundary = between \w and \w or \W and \W
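A small sketch of both assertions using replace, which makes the zero-width positions visible. The sample strings are assumptions.

```javascript
const sentence = "concatenate the cat";

// \b...\b only hits the standalone word
const wholeWord = sentence.replace(/\bcat\b/g, "dog");
// "concatenate the dog"

// \B...\B only hits "cat" surrounded by other word characters
const insideWord = sentence.replace(/\Bcat\B/g, "CAT");
// "conCATenate the cat"

// Assertions consume nothing: replacing \b only inserts text
const boundaries = "a b".replace(/\b/g, "|");
// "|a| |b|"
```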

What we learned

(?=a) = followed by a, which can be any regex
(?!a) = NOT followed by a

Lookahead hacks

for fun and profit

Intersection A ∩ B

“Password must be 6 letters or longer, and must contain at least one number, one letter, and one symbol.”

password.length >= 6
		&& /\d/.test(password) // has digit?
		&& /[a-z]/i.test(password) // has letter?
		&& /[\W_]/.test(password) // has symbol? (\W alone misses _)

// Or, with lookaheads…
/^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i
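To check that the single lookahead regex agrees with the step-by-step version, a sketch like this can run both over a few made-up passwords. It assumes "6 or longer" means a length of at least 6 and counts _ as a symbol in both versions.

```javascript
function checkSeparately(password) {
  return password.length >= 6
    && /\d/.test(password)      // has digit?
    && /[a-z]/i.test(password)  // has letter?
    && /[\W_]/.test(password);  // has symbol?
}

// Each lookahead scans from the start without consuming anything,
// so all three conditions must hold for the same string.
const combined = /^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i;

const candidates = ["abc12!", "passw0rd_", "abcdef", "abc1!", "123456!"];
const bySteps = candidates.filter(checkSeparately); // ["abc12!", "passw0rd_"]
const byRegex = candidates.filter(p => combined.test(p)); // same list
```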

Subtraction A – B

“Any integer that’s NOT divisible by 50”

 /^(?!0$|\d*[50]0$)\d+$/

Negation A̅

“Anything that doesn’t contain "foo"”

/^(?!.*foo).+$/

What we learned

(?<=a) = preceded by a, which can be any regex
(?<!a) = NOT preceded by a
Standardized in ES2018; older engines may not support it
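A minimal lookbehind sketch, assuming an engine that supports it. The price string is made up for illustration.

```javascript
const prices = "$42 and €13 but 7 apples";

// Digits that are preceded by a dollar sign
const dollars = prices.match(/(?<=\$)\d+/g);        // ["42"]

// Whole numbers NOT preceded by a dollar sign.
// \b keeps the match from starting in the middle of "42".
const notDollars = prices.match(/(?<!\$)\b\d+/g);   // ["13", "7"]
```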

What we learned

\N = Nth backreference

('|")(\\\1|.)+?\1 for escaped quotes

Regex credit: Steven Levithan
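Here is the quoted-string pattern applied to a made-up input. \1 has to repeat whichever quote opened the string, and \\\1 lets an escaped copy of that quote pass through.

```javascript
const quoted = /('|")(\\\1|.)+?\1/g;

// String.raw keeps the backslashes exactly as typed
const source = String.raw`say "hi \"there\"" or 'it\'s'`;
const strings = source.match(quoted);
// two matches: the double-quoted and the single-quoted string,
// each including its escaped inner quotes
```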

Correctly Nested Parens

(), (())(), (()()()), ((((((())))))) not ()(, ), ((((((()))

Not a regular language

Context Free Grammar

S → (S)
S → SS
S → ε
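Since a classic regular expression cannot count nesting depth, a plain loop with a counter is one simple alternative. This is a sketch; isBalanced is a hypothetical helper that accepts exactly the strings the grammar above generates.

```javascript
function isBalanced(str) {
  let depth = 0;
  for (const ch of str) {
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closed more than we opened
    } else {
      return false;                // only parentheses are in the language
    }
  }
  return depth === 0;              // everything opened was closed
}

const good = ["()", "(())()", "(()()())", "((((((()))))))"].every(isBalanced); // true
const bad = ["()(", ")", "((((((()))"].some(isBalanced);                        // false
```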

Best Practices

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Practicality > Precision

Email

// Don’t ever do this
/^[a-z]+@[a-z]+\.[a-z]{2,5}/

lea@verou.me leaverou@mit.edu foo@google.co.jp Лиа@Веру.рф λία@βέρου.ελ me@lea.moscow

// Yes, it’s ok to be lax!
/^\S+@\S+\.\S+/

False positives vs false negatives

Keep it simple

Even if it means not using regexes at all

str === "foo" instead of /^foo$/.test(str)
str.indexOf("foo") > -1 instead of /foo/.test(str)
str.indexOf("foo") === 0 instead of /^foo/.test(str)

For performance

Avoid greedy quantifiers
Don’t forget anchors (^ and $)
Be as specific as possible
Prefer non-capturing groups (?:)
Minimize backtracking

What we learned

Catastrophic backtracking only applies when there is no match
Avoid nested quantifiers
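A sketch of the classic nested-quantifier trap. Both patterns accept the same strings, but when the match fails, (a+)+ can split a run of a's between the inner and outer quantifier in exponentially many ways before giving up. The inputs are kept short on purpose so the example stays fast.

```javascript
const nested = /^(a+)+$/; // quantifier inside a quantified group
const flat = /^a+$/;      // same language, no nesting

const inputs = ["a", "aaaa", "", "aaab", "a".repeat(12) + "b"];
const sameAnswers = inputs.every(s => nested.test(s) === flat.test(s)); // true

// The failing inputs (ending in "b") are the expensive ones for `nested`:
// each extra "a" roughly doubles the work, while `flat` stays linear.
```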

Video Credits
