A practical reference of the regular expressions worth memorising — with the greedy-vs-lazy trap explained.
Regular expressions look like line noise until they click, and then they become one of the most useful skills a developer, analyst, or writer can have. This is not a theory lesson — it is a practical cheat sheet of patterns you will reach for again and again, with plain-English explanations of what each piece does and a note on the traps that catch people.
Almost every pattern is assembled from a small vocabulary:
. matches any single character. \d a digit, \w a word character (letter, digit, or underscore), \s any whitespace.
* means “zero or more”, + “one or more”, ? “zero or one”. Add ? after any of them to make it lazy (match as little as possible).
^ anchors to the start of the line, $ to the end.
[abc] is a character class: any one of a, b, or c. [^abc] means anything except those.
(...) groups and captures; (?:...) groups without capturing.
{2,4} means “between two and four times”.
\s+$ matches any run of spaces or tabs at the end of a line. Replace with nothing to clean up sloppy text.
A single space followed by {2,} (that is, “ {2,}”) matches two or more spaces. Replace with a single space to normalise text pasted from PDFs.
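The two whitespace clean-ups can be sketched in Python roughly as follows. The sample text is made up, and the trailing-whitespace pattern is deliberately swapped for [ \t]+$ with the multiline flag, an assumption on my part so that newlines are never consumed when the whole text is processed at once.

```python
import re

# Hypothetical sample: trailing spaces/tabs and runs of spaces.
messy = "first  line   \nsecond\tline\t \nthird line"

# re.MULTILINE lets $ match at the end of every line, not only at the end of the text.
# [ \t]+ is used instead of \s+ so the newline characters themselves are left alone.
no_trailing = re.sub(r"[ \t]+$", "", messy, flags=re.MULTILINE)

# A literal space followed by {2,}: collapse runs of two or more spaces into one.
normalised = re.sub(r" {2,}", " ", no_trailing)

print(repr(normalised))  # 'first line\nsecond\tline\nthird line'
```

In an editor's find-and-replace the same two patterns usually work line by line without any flag; the flag matters here only because Python sees the text as one string.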
[\w.+-]+@[\w-]+\.[\w.-]+ — good enough for finding emails in a blob of text. Do not try to write the “perfect” email regex; the official one is hundreds of characters and still not worth it.
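https?://[^\s]+ matches “http” with an optional “s”, then everything up to the next whitespace character. Simple and effective for pulling links out of documents, though a full stop or bracket that directly follows a link is captured with it.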
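As a rough illustration, here is how the email and URL patterns behave with Python's re module. The sample sentences and addresses are invented, and the rstrip clean-up at the end is just one possible way to deal with punctuation that follows a link.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL = re.compile(r"https?://[^\s]+")

# Hypothetical sample text.
text = "Write to ana.lopez+news@example.org or see https://example.com/docs?page=2 for details."

emails = EMAIL.findall(text)
urls = URL.findall(text)
print(emails)  # ['ana.lopez+news@example.org']
print(urls)    # ['https://example.com/docs?page=2']

# A link at the end of a sentence drags the full stop along with it.
trailing = URL.findall("Read https://example.com/a.")
print(trailing)  # ['https://example.com/a.']

# One simple fix: strip common sentence punctuation from the right-hand end.
cleaned = [u.rstrip(".,;:!?)") for u in trailing]
print(cleaned)   # ['https://example.com/a']
```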
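\d{4}-\d{2}-\d{2} matches four digits, dash, two digits, dash, two digits. Tighten it with \d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01]) if you need to reject out-of-range months and days such as month 13 or day 32. It still accepts calendar impossibilities like 2024-02-30, which need a real date check.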
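A small sketch of the difference between the loose and the tightened date pattern. The helper is_real_date is a hypothetical name, and pairing the regex with datetime.date.fromisoformat is my own suggestion for catching dates the pattern alone cannot rule out.

```python
import re
from datetime import date

LOOSE = re.compile(r"\d{4}-\d{2}-\d{2}")
STRICT = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")

def is_real_date(text: str) -> bool:
    """Hypothetical helper: shape check with the strict regex, then a calendar check."""
    if not STRICT.fullmatch(text):
        return False
    try:
        date.fromisoformat(text)  # raises ValueError for dates like 2024-02-30
    except ValueError:
        return False
    return True

print(bool(LOOSE.fullmatch("2024-13-45")))   # True: right shape, nonsense values
print(bool(STRICT.fullmatch("2024-13-45")))  # False: month 13 is rejected
print(bool(STRICT.fullmatch("2024-02-30")))  # True: the regex knows nothing about February
print(is_real_date("2024-02-30"))            # False
print(is_real_date("2024-02-29"))            # True: 2024 is a leap year
```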
-?\d+(?:\.\d+)? — an optional minus sign, digits, then an optional decimal part. The (?:...) keeps the decimal grouped without creating a separate capture.
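To see why the non-capturing group matters in practice, here is a minimal Python sketch with an invented sample sentence. In Python, findall returns the captured group instead of the whole match as soon as the pattern contains one capturing group, so the two variants give very different results.

```python
import re

# Hypothetical sample text.
text = "Offsets: 12, -3.5, 0.25 and 7."

# Non-capturing group: findall returns each whole match.
numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
print(numbers)  # ['12', '-3.5', '0.25', '7']

# Capturing group: findall returns only what the group captured.
fragments = re.findall(r"-?\d+(\.\d+)?", text)
print(fragments)  # ['', '.5', '.25', '']
```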
^[a-z0-9]+(?:-[a-z0-9]+)*$ — lowercase letters and digits, with single hyphens between words. Rejects leading, trailing, and doubled hyphens.
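A quick way to check the slug pattern against the awkward cases, sketched in Python. The function name is_slug is hypothetical. It uses fullmatch rather than match because Python's $ also matches just before a trailing newline, which would let "my-post\n" through.

```python
import re

SLUG = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")

def is_slug(candidate: str) -> bool:
    """Hypothetical helper: True only if the whole string is a valid slug."""
    return SLUG.fullmatch(candidate) is not None

for candidate in ["my-first-post", "post-2", "-leading", "trailing-",
                  "double--hyphen", "Upper-Case", ""]:
    print(repr(candidate), is_slug(candidate))
```

Only the first two candidates pass; the rest fail on a leading, trailing, or doubled hyphen, an uppercase letter, or being empty.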
"([^"]*)" — a quote, then a captured run of anything that is not a quote, then the closing quote. The negated class is what stops it from greedily swallowing across multiple quoted strings.
The greedy-versus-lazy trap is the single most common regex bug. Suppose you want to match an HTML tag with <.+> in the text <b>hi</b>. Because + is greedy, it matches from the first < all the way to the last >, which is the whole string. Make it lazy with <.+?> and it stops at the first >, matching just <b> (and then </b> as a separate match if you search globally). Whenever a pattern “matches too much”, suspect greediness first.
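The trap is easy to reproduce. The sketch below, in Python, runs the greedy and lazy tag patterns side by side, adds a negated-class variant (<[^>]+>, my own addition in the spirit of the quoted-string pattern), and repeats the comparison for quoted strings using an invented sentence.

```python
import re

html = "<b>hi</b>"

greedy = re.findall(r"<.+>", html)      # + takes as much as it can
lazy = re.findall(r"<.+?>", html)       # +? stops at the first > that works
negated = re.findall(r"<[^>]+>", html)  # the run is never allowed to cross a >

print(greedy)   # ['<b>hi</b>']
print(lazy)     # ['<b>', '</b>']
print(negated)  # ['<b>', '</b>']

# The same effect with quoted strings (hypothetical sample sentence).
quoted = 'She said "hello" and then "goodbye" before leaving.'
print(re.findall(r'"(.*)"', quoted))     # ['hello" and then "goodbye']
print(re.findall(r'"([^"]*)"', quoted))  # ['hello', 'goodbye']
```

The lazy and negated-class versions agree here; the negated class is often preferred because its intent is explicit, but either is a reasonable fix.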
Parentheses do double duty: they group for repetition and they capture for reuse. In a find-and-replace you can refer to captured groups as $1, $2 (or \1, \2 in some tools). To swap “Last, First” into “First Last”, match (\w+),\s*(\w+) and replace with $2 $1. Backreferences inside the pattern itself, such as (\w+)\s+\1, find doubled words like “the the”. Wrap that pattern in word boundaries, \b(\w+)\s+\1\b, so that “this is” does not match on the repeated “is”.
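Both uses of captured groups can be tried in a few lines of Python. The names and sentences are invented, Python's replacement syntax is \1 and \2 rather than $1 and $2, and the \b word boundaries in the doubled-word pattern are an addition to guard against partial-word matches.

```python
import re

# Swap "Last, First" into "First Last". Python writes group references as \2 \1.
swapped = re.sub(r"(\w+),\s*(\w+)", r"\2 \1", "Lovelace, Ada")
print(swapped)  # Ada Lovelace

# Doubled words, with \b added so only whole words are compared.
DOUBLED = re.compile(r"\b(\w+)\s+\1\b")
print(DOUBLED.findall("Paris in the the spring"))  # ['the']
print(DOUBLED.search("this is fine"))              # None

# Without the boundaries, the end of "this" plus "is" counts as a doubled word.
unbounded = re.search(r"(\w+)\s+\1", "this is fine")
print(unbounded.group())  # is is
```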
Regex is unforgiving: a single misplaced . or a forgotten escape changes what matches. Never deploy a pattern you have only read; run it against real samples, including the awkward ones — empty strings, values with commas, lines with trailing spaces. A live regex tester that highlights matches as you type turns an hour of guesswork into a couple of minutes.
Keep patterns as simple as the job allows. A regex that is “clever” today is unreadable in six months. When a pattern grows past a line or two, that is often a sign the problem is better solved with real parsing rather than one heroic expression. And always escape the characters that have special meaning — . * + ? ( ) [ ] { } ^ $ | \ — when you mean them literally. Once these patterns are muscle memory, cleaning data, validating input, and bulk-editing text becomes genuinely fast.
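When the literal text comes from somewhere else (a filename, a price, user input), escaping by hand is error-prone. In Python, re.escape does it for you; the sketch below uses invented strings to show what goes wrong without it.

```python
import re

literal = "a.c"   # we want the dot to mean an actual dot
text = "abc a.c"

# Unescaped, the dot matches any character, so "abc" is found first.
unescaped = re.search(literal, text).group()
print(unescaped)  # abc

# Escaped, only the literal text matches.
escaped = re.search(re.escape(literal), text).group()
print(escaped)    # a.c

# An unescaped $ is an end-of-text anchor, so this pattern can never match.
price_raw = re.search("$4.99", "Total: $4.99")
price_safe = re.search(re.escape("$4.99"), "Total: $4.99")
print(price_raw)           # None
print(price_safe.group())  # $4.99
```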