/Reg(exp){2}lained/

Demystifying regular expressions

By Lea Verou (@LeaVerou)

Hi, I'm Lea

#funfact

I grew up in Lesbos, Greece

…which technically makes me geographically Lesbian

W3C CSS Working Group Invited Expert

CSS Secrets by O’Reilly ★★★★★ on Amazon

2014 – Haystack Group, MIT CSAIL

/^Reg(exp?|ular expression)$/i

Regular expressions:
A way to describe a set of strings

/ &(lt|gt|amp); / gi

/ ^ Reg(ex p? | ular expression) $ / i

"pâté".normalize("NFD").replace(/[\u0300-\u036f]/g, "") // "pate"
'lea@verou.me'.match(/(\w+)@/)[1] // "lea"
code.replace(/ {4}/g, '\t'); // fix broken indentation
<input name="zip" pattern="\d{5}" /> 

Also: text editors, IDEs, command line tools (grep, etc), databases, and more

What we learned

\u2665 for ♥, emojis, rainbow matches rainbow flag emoji

Flag | Name | Purpose | ES?
g | Global | Get all matches | ES3
i | Case Insensitive | Ignore case when matching | ES3
m | Multiline | ^ and $ match the beginning and end of each line, not of the whole string | ES3
y | Sticky | Anchors each match of a regular expression to the end of the previous match | ES2015
u | Unicode | Treat pattern as a sequence of Unicode code points | ES2015
s | DotAll | . matches newlines as well | ES2018? (Stage 3)
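
A minimal sketch of what each flag changes in practice. The sample strings are illustrative and not from the slides, and the s flag assumes an engine with ES2018 support.

```javascript
// Sample strings are illustrative and not taken from the slides.
const text = "Cat\ncat\nCAT";

const onlyLower = text.match(/cat/g);        // g: every match, still case-sensitive
const anyCase = text.match(/cat/gi);         // i: ignore case
const wholeLines = text.match(/^cat$/gim);   // m: ^ and $ work per line

const dotStopsAtNewline = /Cat.cat/.test(text); // false: . skips \n by default
const dotAll = /Cat.cat/s.test(text);           // true with the s flag

const emoji = "👩";
const oneUnit = /^.$/.test(emoji);    // false: . sees one UTF-16 code unit
const oneUnitU = /^.$/u.test(emoji);  // true: . sees one code point

const sticky = /cat/y;
sticky.lastIndex = 4;                   // index of the lowercase "cat"
const stickyHit = sticky.test(text);    // true, and lastIndex moves to 7
const stickyMiss = sticky.test(text);   // false: no "cat" exactly at index 7

console.log({ onlyLower, anyCase, wholeLines, dotAll, oneUnitU, stickyHit, stickyMiss });
```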

What we learned

(c|b|f|)at|dog, do(|g) for order

Test Challenge

fizz buzz fizzbuzz

Fizzbuzz

// Matches everything, including ""
/(fizz|)(buzz|)/
// Correct but inelegant
/(fizzbuzz|fizz|buzz)/
/(fizz(buzz|)|buzz)/
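
A quick way to compare the three patterns above. The word list is an assumption added for illustration, as is the last line, which shows why the order of alternatives matters.

```javascript
const lax = /(fizz|)(buzz|)/;
const verbose = /(fizzbuzz|fizz|buzz)/;
const nested = /(fizz(buzz|)|buzz)/;

// Illustrative inputs: three that should match, two that should not.
const words = ["fizz", "buzz", "fizzbuzz", "hello", ""];

const results = words.map(word => ({
  word,
  lax: lax.test(word),         // true for everything, even "" and "hello"
  verbose: verbose.test(word),
  nested: nested.test(word),
}));
console.table(results);

// Alternatives are tried left to right, so a shorter option listed first wins.
const firstWins = "fizzbuzz".match(/(fizz|fizzbuzz|buzz)/)[0];
console.log(firstWins);
```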

Emojis with woman

👩 👩‍💼 👪 👨‍👩‍👦‍👦 👩‍👦‍👦 👩‍👧 👩‍👦 👩‍👩‍👦‍👦 👩‍👩‍👦 👩‍👩‍👧 💏 👩‍❤️‍💋‍👩 not 👨‍👧 👨‍👨‍👧 👨‍👨‍👦 👨‍❤️‍👨

Emojis with woman

// Ewww, just stop
/👩|👩‍💼|👪|👩‍👦‍👦|👩‍👧|👩‍👦|👩‍👩‍👦‍👦|👩‍👩‍👦|👩‍👩‍👧|👨‍👩‍👦‍👦|.../
/👩/ // Wrong, doesn’t include 👪 and 💏!
/👩|👪|💏/

/1.5/ also matches 21056 etc., because . matches any character; two dots are needed to match one emoji
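
A small sketch of both points in the note above, assuming no u flag is set. The escaped variant is added for contrast.

```javascript
const unescaped = /1.5/;  // . matches any single character except line breaks
const escaped = /1\.5/;   // \. matches a literal dot only

const hitsDigits = unescaped.test("21056"); // true: "105" fits 1, any char, 5
const missesDigits = escaped.test("21056"); // false
const hitsDecimal = escaped.test("1.5");    // true

// Without the u flag, this emoji is two UTF-16 code units, so one dot is not enough.
const oneDot = /^.$/.test("👩");   // false
const twoDots = /^..$/.test("👩"); // true

console.log({ hitsDigits, missesDigits, hitsDecimal, oneDot, twoDots });
```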

What we learned

The 12 metacharacters

[ { ( ) \ ^ $ . | ? * +

What we learned

(ab){n}, a{2,5} and caaaaaaaaaat, parentheses always needed for emoji
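
The quantifier notes above, sketched as code. The anchors and the u-flag comparison are additions for illustration; the u flag only helps with single-code-point emoji.

```javascript
// A long "ca...at" like the one on the slide, here with ten a's.
const cat = "c" + "a".repeat(10) + "t";

const upToFive = /^ca{2,5}t$/.test(cat); // false: ten a's exceed the upper bound
const twoOrMore = /^ca{2,}t$/.test(cat); // true

const abab = /^(ab){2}$/.test("abab");   // true: the group repeats as a unit
const abb = /^ab{2}$/.test("abb");       // true: without a group only b repeats

// Without the u flag the quantifier binds to the emoji's last code unit only.
const pair = "👩👩";
const bare = /^👩{2}$/.test(pair);      // false
const grouped = /^(👩){2}$/.test(pair); // true
const withU = /^👩{2}$/u.test(pair);    // true

console.log({ upToFive, twoOrMore, abab, abb, bare, grouped, withU });
```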

What we learned

Delete last > to show what happens

What we learned

in Unicode order

Hex color

#abc, #f00, #BADA55, #C0FFEE

Hex color

/#[a-f0-9]{3,6}/i // Wrong!
/#([a-f0-9]{3}){1,2}/i
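
Why the first pattern is wrong, sketched with anchors added so the whole string is validated. This assumes only 3- and 6-digit colors count as valid; the last three samples are invented counterexamples.

```javascript
const loose = /^#[a-f0-9]{3,6}$/i;      // also allows 4 and 5 digits
const exact = /^#([a-f0-9]{3}){1,2}$/i; // one or two groups of three

const samples = ["#abc", "#f00", "#BADA55", "#C0FFEE", "#abcd", "#abcde", "#ab"];

for (const s of samples) {
  console.log(s.padEnd(8), "loose:", loose.test(s), "exact:", exact.test(s));
}

// Strings the loose pattern accepts but the exact one rejects.
const looseOnly = samples.filter(s => loose.test(s) && !exact.test(s));
console.log(looseOnly);
```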

What we learned

Counting words (roughly)

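function wordCount(text){
	// match() returns null when nothing matches, so fall back to an empty array
	return (text.match(/\w+/g) || []).length;
}
// Alternative: count whitespace-separated chunks
function wordCount(text){
	// Trim first so leading or trailing whitespace doesn't add empty entries
	text = text.trim();
	return text ? text.split(/\s+/).length : 0;
}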

Number

Without exponent or digit separators

-1 .05 +1000 3.1415926535 42.

Number

/^[-+]?[\d.]+$/ // Too lax
/^[-+]?\d*\.?\d+$/ // False negatives: 5.
/^[-+]?\d*\.?\d*$/ // False positives: ., +., + etc
// Accurate, but is it worth it?
/^[-+]?(\d*\.?\d+|\d+\.)$/
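
A small harness for checking the comments above. The pattern names and the list of invalid inputs are assumptions made for illustration; the valid inputs are the ones from the challenge slide.

```javascript
const patterns = {
  tooLax: /^[-+]?[\d.]+$/,
  noTrailingDot: /^[-+]?\d*\.?\d+$/,
  tooOptional: /^[-+]?\d*\.?\d*$/,
  accurate: /^[-+]?(\d*\.?\d+|\d+\.)$/,
};

const valid = ["-1", ".05", "+1000", "3.1415926535", "42."];
const invalid = [".", "+", "+.", "", "1.2.3"];

// Lists the inputs a pattern gets wrong in each direction.
function mistakes(regex) {
  return {
    falseNegatives: valid.filter(s => !regex.test(s)),
    falsePositives: invalid.filter(s => regex.test(s)),
  };
}

for (const [name, regex] of Object.entries(patterns)) {
  console.log(name, mistakes(regex));
}
```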

What we learned

Strip HTML

function stripHTML(str){
    return str.replace(/<.+?>/g, '');
}
function stripHTML(str){
    return str.replace(/<[^>]+>/g, '');
}

Warning: Will fail in edge cases
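
Two such edge cases, sketched with the second version of the function. The input strings are invented for illustration.

```javascript
function stripHTML(str) {
  return str.replace(/<[^>]+>/g, '');
}

// Works for simple markup.
const simple = stripHTML('<p>Hello <b>world</b></p>');

// A ">" inside an attribute value ends the "tag" too early.
const attribute = stripHTML('<a title="a > b">link</a>');

// Plain text with comparison signs is mistaken for a tag.
const comparison = stripHTML('1 < 2 and 3 > 2');

console.log([simple, attribute, comparison]);
```

For anything beyond trusted, simple input, an HTML parser is likely the safer tool.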

Credit card numbers

4060 1234 5678 9000 4060-1234-5678-3457 1230123456789123 4.060/123456-78 90-00 not hello 4060 12345678901234567890

Credit card numbers

// Never do this!
/\d{16}/
// Limited grouping + allows 19 digit numbers
/(\d{4}.?){3}\d{4}/
/(\d\D*){16}/
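
One possible way to sidestep the grouping problem: drop the separators first, then count digits. This sketch assumes exactly 16 digits are wanted and reads the last two samples on the challenge slide as the ones that should not match; `looksLikeCardNumber` is a hypothetical helper name.

```javascript
// Hypothetical helper: ignores separators and checks the digit count only.
function looksLikeCardNumber(input) {
  const digits = input.replace(/\D/g, "");
  return digits.length === 16;
}

const shouldMatch = [
  "4060 1234 5678 9000",
  "4060-1234-5678-3457",
  "1230123456789123",
  "4.060/123456-78 90-00",
];
const shouldNotMatch = ["hello", "4060 12345678901234567890"];

console.log(shouldMatch.map(looksLikeCardNumber));
console.log(shouldNotMatch.map(looksLikeCardNumber));

// Unanchored, the last pattern from the slide still accepts the overlong number.
const overlongAccepted = /(\d\D*){16}/.test("4060 12345678901234567890");
console.log(overlongAccepted);
```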

What we learned

Script=Cyrillic
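
A minimal sketch of a Unicode property escape for this script. It assumes an engine with ES2018 support and requires the u flag; the sample strings are illustrative.

```javascript
const cyrillicOnly = /^\p{Script=Cyrillic}+$/u;

const russian = cyrillicOnly.test("Привет"); // true
const english = cyrillicOnly.test("Hello");  // false

// Extract every run of Cyrillic letters from a mixed string.
const words = "Привет, я Лиа (Lea)".match(/\p{Script=Cyrillic}+/gu);

console.log(russian, english, words);
```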

What we learned

ISO 8601 Dates

Just dates, no time or timezone information

2012-12-12, 1986-06-13

ISO 8601 Dates

/^\d{4}-\d{2}-\d{2}$/.test(str)
/^\d{4}-(0\d|1[0-2])-([0-2]\d|3[01])$/.test(str)

Can it be improved further?
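
One possible answer, sketched under the assumption that calendar validity (month lengths, leap years) matters: let the regex check only the shape and let Date check the values. `isValidISODate` is a hypothetical helper name.

```javascript
const isoShape = /^(\d{4})-(\d{2})-(\d{2})$/;

// Hypothetical helper: the regex checks the shape, Date checks the calendar.
function isValidISODate(str) {
  const match = isoShape.exec(str);
  if (!match) return false;
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);

  // Out-of-range parts roll over (Feb 30 becomes Mar 1), so compare after setting.
  const date = new Date(0);
  date.setUTCFullYear(year, month - 1, day);
  return date.getUTCFullYear() === year
    && date.getUTCMonth() === month - 1
    && date.getUTCDate() === day;
}

for (const s of ["2012-12-12", "1986-06-13", "2012-02-29", "2013-02-29", "2012-00-10", "2012-13-01"]) {
  console.log(s, isValidISODate(s));
}
```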

Trimming a string

if (!String.prototype.trim) {
    String.prototype.trim = function(){
        return this.replace(/^\s+|\s+$/g, '');
    }
}

What we learned

Lookahead hacks

for fun and profit

Intersection (A ∩ B)

“Password must be 6 letters or longer, and must contain at least one number, one letter, and one symbol.”

password.length >= 6
		&& /\d/.test(password) // has digit?
		&& /[a-z]/i.test(password) // has letter?
		&& /[\W_]/.test(password) // has symbol? (\W alone misses _)
// Or, with lookaheads…
/^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i
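
A sketch comparing the two approaches. It assumes a minimum length of six inclusive and counts the underscore as a symbol in both versions; the sample passwords are invented.

```javascript
// Step-by-step version: one small test per requirement.
function checkStepByStep(password) {
  return password.length >= 6
    && /\d/.test(password)       // has digit?
    && /[a-z]/i.test(password)   // has letter?
    && /[\W_]/.test(password);   // has symbol?
}

// Lookahead version: each (?=...) asserts one requirement from the start.
const combined = /^(?=.*\d)(?=.*[a-z])(?=.*[\W_]).{6,}$/i;

const samples = ["abc12!", "abc_12", "abcdef", "123456!", "a1!"];
const comparison = samples.map(p => ({
  password: p,
  stepByStep: checkStepByStep(p),
  lookaheads: combined.test(p),
}));
console.table(comparison);
```

The two can disagree on passwords that contain line breaks, because . does not match them without the s flag.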

Subtraction (A − B)

“Any integer that’s NOT divisible by 50”

/^(?!\d*[50]0$)\d+$/
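
A quick check of the subtraction idea against plain arithmetic. This variant assumes the lookahead is anchored with $, so only the last two digits decide; a lone 0 is not handled and is left out of the range.

```javascript
// All integers (\d+) minus those ending in 00 or 50 (the negative lookahead).
const notDivisibleBy50 = /^(?!\d*[50]0$)\d+$/;

// Compare against the modulo operator for 1..1000.
const disagreements = [];
for (let n = 1; n <= 1000; n++) {
  const byRegex = notDivisibleBy50.test(String(n));
  const byMath = n % 50 !== 0;
  if (byRegex !== byMath) disagreements.push(n);
}

console.log(disagreements.length);
console.log(notDivisibleBy50.test("1003"), notDivisibleBy50.test("50"));
```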

Negation

“Anything that doesn’t contain "foo"”

/^(?!.*foo).+$/
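
The same pattern next to a non-regex equivalent, as a rough comparison. The sample strings are invented; note that the regex also rejects the empty string because of .+.

```javascript
const noFoo = /^(?!.*foo).+$/;

const samples = ["bar", "food", "a foo b", "fo o", ""];
const comparison = samples.map(s => ({
  input: s,
  regex: noFoo.test(s),
  includes: !s.includes("foo"), // plain string method, no regex needed
}));
console.table(comparison);
```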

What we learned

('|")(\\\1|.)+?\1 for escaped quotes

Regex credit: Steven Levithan
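
How the backreference pattern behaves on strings with escaped quotes, sketched with invented inputs. The naive pattern is added only for contrast.

```javascript
const quoted = /('|")(\\\1|.)+?\1/; // \1 is whichever quote opened the string
const naive = /('|").+?\1/;         // stops at the first quote, escaped or not

// In source form: say "a \"quoted\" word" now
const doubleQ = 'say "a \\"quoted\\" word" now';
// In source form: it is 'Lea\'s talk' today
const singleQ = "it is 'Lea\\'s talk' today";

const fullDouble = doubleQ.match(quoted)[0];
const fullSingle = singleQ.match(quoted)[0];
const cutShort = doubleQ.match(naive)[0];

console.log(fullDouble);
console.log(fullSingle);
console.log(cutShort);
```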

Correctly Nested Parens

(), (())(), (()()()), ((((((())))))) not ()(, ), ((((((()))

Not a regular language

Context Free Grammar
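
Since nesting depth cannot be tracked by a regular language, a counter does the job. A minimal sketch, assuming the input should contain nothing but parentheses; `isBalanced` is a hypothetical helper name.

```javascript
// Hypothetical helper: tracks nesting depth with a counter.
function isBalanced(str) {
  let depth = 0;
  for (const ch of str) {
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closed more than was opened
    } else {
      return false; // assumption: only parentheses are allowed
    }
  }
  return depth === 0; // everything opened was closed
}

const good = ["()", "(())()", "(()()())", "(".repeat(7) + ")".repeat(7)];
const bad = ["()(", ")", "(".repeat(7) + ")".repeat(3)];

console.log(good.map(isBalanced));
console.log(bad.map(isBalanced));
```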

Best Practices

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

Practicality > Precision

Email

// Don’t ever do this
/^[a-z]+@[a-z]+\.[a-z]{2,5}/

lea@verou.me leaverou@mit.edu foo@google.co.jp Лиа@Веру.рф λία@βέρου.ελ me@lea.moscow

// Yes, it’s ok to be lax!
/^\S+@\S+\.\S+/
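
A sketch that runs the sample addresses through both patterns. `strictAnchored` is an added variant with a trailing $, included because the strict pattern as written has no end anchor and so passes some addresses on a prefix.

```javascript
const strict = /^[a-z]+@[a-z]+\.[a-z]{2,5}/;
const strictAnchored = /^[a-z]+@[a-z]+\.[a-z]{2,5}$/; // added variant, not from the slides
const lax = /^\S+@\S+\.\S+/;

const addresses = [
  "lea@verou.me",
  "leaverou@mit.edu",
  "foo@google.co.jp",
  "Лиа@Веру.рф",
  "λία@βέρου.ελ",
  "me@lea.moscow",
];

// Addresses each pattern turns away.
const rejectedByStrict = addresses.filter(a => !strict.test(a));
const rejectedByAnchored = addresses.filter(a => !strictAnchored.test(a));
const rejectedByLax = addresses.filter(a => !lax.test(a));

console.log({ rejectedByStrict, rejectedByAnchored, rejectedByLax });
```

Every rejection here is a false negative, which is the case against the strict pattern.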

False positives vs false negatives

Keep it simple

Even if it means not using regexes at all

For performance

What we learned

Video Credits