Regular expressions are a programming language within a programming language, used just for pattern matching, and they're extremely succinct.
But what does "regular expression" really mean? And where did they come from?
Haven't seen regular expressions?
Imagine a special purpose programming language where every single character is a statement and no whitespace or comments are allowed. 😨
Regular expressions are extremely information dense but very helpful for certain types of pattern matching.
Regular expressions are called "regular" because they define a "regular language".
What's a "regular language"?
I'm so glad you asked! This is one of the few factoids from my CS degree that actually stuck with me.
This part of CS is tightly tied to linguistics.
A regular language is a "formal language" (a language with words formed based on a set of rules).
Formal languages are classified in the Chomsky hierarchy, which runs from most to least restrictive: regular, context-free, context-sensitive, and recursively enumerable (Noam Chomsky was a renowned linguist before he got into politics).
Regular languages are the most restrictive class in this hierarchy.
If you've ever seen the weird-looking grammar definitions in Python's documentation, those are based on EBNF, which is a notation for representing a context-free grammar.
Regular expressions are too restrictive to represent Python's grammar, so a context-free grammar is needed.
If you ever hear folks talking about whether a language is "Turing complete" they're talking about "recursively enumerable" languages. That's on the opposite end of the formal language spectrum from regular expressions.
So what IS a regular expression / regular language?
Rule 1: a regular language can consist of one single character.
But also, if 𝗔 and 𝗕 are regular languages, then these are ALSO regular languages:
𝗔 𝗕 (concatenating)
𝗔 | 𝗕 (unioning: one OR the other)
𝗔* (repeating zero or more times)
That's it. Seems too easy, right?
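Here's a rough sketch of those rules using Python's re module. The variable names are just for illustration, and Python's regex syntax is standing in for the theoretical notation:

```python
import re

# Base case: a single character is a regular language.
a = "a"
b = "b"

# The three building rules, spelled with Python's regex syntax.
concatenation = a + b        # "ab": A followed by B
union = f"{a}|{b}"           # "a|b": A or B
repetition = f"(?:{a})*"     # "(?:a)*": zero or more copies of A

print(bool(re.fullmatch(concatenation, "ab")))  # True
print(bool(re.fullmatch(union, "b")))           # True
print(bool(re.fullmatch(repetition, "")))       # True (zero repetitions)
print(bool(re.fullmatch(repetition, "aaaa")))   # True
print(bool(re.fullmatch(repetition, "ab")))     # False
```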
The regular expression syntax you've probably seen was derived from those rules but includes extra shorthands to make common cases shorter to write.
These 3 regexes:
1️⃣ a(bbb|bbbb|bbbbb)c
2️⃣ abb*c
3️⃣ (a|e|i|o|u)
Could instead be written as:
1️⃣ ab{3,5}c
2️⃣ ab+c
3️⃣ [aeiou]
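A quick way to convince yourself the shorthands match the same strings is to compare both forms in Python. This is only a spot check over a handful of sample strings I picked (the same_matches helper is made up for this example), not a proof:

```python
import re

# (longhand, shorthand) pairs from the regexes above
pairs = [
    (r"a(bbb|bbbb|bbbbb)c", r"ab{3,5}c"),
    (r"abb*c", r"ab+c"),
    (r"(a|e|i|o|u)", r"[aeiou]"),
]

# Arbitrary sample strings, chosen to hit the edge cases
samples = [
    "", "ac", "abc", "abbc", "abbbc", "abbbbc", "abbbbbc", "abbbbbbc",
    "a", "e", "u", "x",
]

def same_matches(longhand, shorthand, strings):
    """Return True if both patterns accept exactly the same sample strings."""
    return all(
        bool(re.fullmatch(longhand, s)) == bool(re.fullmatch(shorthand, s))
        for s in strings
    )

for longhand, shorthand in pairs:
    print(longhand, "vs", shorthand, "->", same_matches(longhand, shorthand, samples))
```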
The syntax most programming languages use for regexes today is heavily borrowed from Perl's syntax.
Unlike Python, Perl includes a special syntax JUST to make writing regexes easier.
Perl's love of regexes likely stems from the fact that Larry Wall (Perl's creator) studied linguistics.
So regular expressions are a notation for representing regular languages.
And the syntax we use for them started in theoretical linguistics but evolved through sed, Perl, etc. to today.
But it turns out that regular expressions are also equivalent to finite state machines.
If you can represent your text-to-match with a finite state machine (FSM), you could use a regular expression instead (& vice versa).
FSMs are those diagrams with circles and arrows that connect the circles.
It's neat that regular expressions and FSMs are equivalent: each can describe exactly what the other can.
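To make that concrete, here's a minimal sketch of a hand-built FSM that accepts the same strings as the regex ab+c. The state names and the fsm_accepts function are invented for this example:

```python
import re

# Each (state, character) pair maps to the next state.
# Any pair not listed here means the machine rejects the input.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_b",
    ("seen_b", "b"): "seen_b",   # loop: this is the "+" in ab+c
    ("seen_b", "c"): "done",
}
ACCEPTING = {"done"}

def fsm_accepts(text):
    """Walk the machine one character at a time."""
    state = "start"
    for char in text:
        state = TRANSITIONS.get((state, char))
        if state is None:
            return False
    return state in ACCEPTING

for s in ["abc", "abbbc", "ac", "abcc", ""]:
    print(repr(s), fsm_accepts(s), bool(re.fullmatch(r"ab+c", s)))
```

Both columns agree for every string in the loop.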
Something weird: programming language regexes have extra features that can't be represented in "regular languages".
An example: back references referring to a previously matched group (e.g. \1).
So Python's regular expressions do more than theoretical "regular expressions". 🤔
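For example, this sketch uses a back reference to find an accidentally doubled word. The pattern and sample sentences are my own:

```python
import re

# \1 must match the exact same text that group 1 captured
doubled_word = re.compile(r"\b(\w+) \1\b")

match = doubled_word.search("This is is a typo")
print(match.group(1))  # is

print(doubled_word.search("This is fine"))  # None
```

The matcher has to remember what group 1 captured, and that kind of memory is something a finite state machine doesn't have.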
Back to practical everyday regular expressions.
2 tips for Python regexes:
1. Use a raw string (r"") for every regex you write. You'll thank me when you use \b.
2. Enable the VERBOSE flag and add whitespace & comments to your regexes. Your future self will thank you.
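A small sketch of both tips together. The TIME_RE pattern is a made-up example, not a recommended time parser:

```python
import re

# Without a raw string, "\b" is a single backspace character.
# With one, it's the two characters regex reads as a word boundary.
print(len("\b"), len(r"\b"))  # 1 2

# VERBOSE mode ignores whitespace in the pattern and allows comments.
TIME_RE = re.compile(r"""
    \b
    (\d{1,2})   # hours
    :
    (\d{2})     # minutes
    \b
""", flags=re.VERBOSE)

print(TIME_RE.search("Lunch is at 12:30 today").groups())  # ('12', '30')
```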
Tips for ALL your regexes:
3. Don't overuse them. For example if you can use Python's "in" operator, do that instead.
4. Don't be overly precise: err on the side of accepting invalid input rather than not accepting valid input. Users don't like being told their name is invalid!
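For tip 3, here's the kind of case I mean (the sample text is arbitrary):

```python
import re

text = "the quick brown fox"

# A regex works, but it's more machinery than a fixed substring needs.
found_with_regex = bool(re.search(r"quick", text))

# The "in" operator says the same thing more plainly.
found_with_in = "quick" in text

print(found_with_regex, found_with_in)  # True True
```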
This thread was inspired by all my Python training attendees who've asked "why are they regular if they look so weird?"
Thanks to Dr. Sheila Greibach who taught my formal languages & automata theory class at UCLA. 💖
If you find yourself strangely intrigued by the connections between regular expressions, finite state machines, context-free grammars, context-sensitive grammars, and Turing machines: look up formal language theory. And maybe dive into compiler theory eventually!
If you want to improve your regex skills, try my workshop on regular expressions: regex.training.
Practicing regexes is THE way to learn them.
And if you enjoy that exercise style check out the many regular expression-tagged exercises on @PythonMorsels. 🐍🍪
Usually when we think of an "object" we think of a class instance. For example these are all objects:
>>> numbers = [2, 1, 3, 4, 7] # a list object
>>> colors = {"red", "green", "blue", "yellow"} # a set object
>>> name = "Trey" # a string object
>>> n = 3 # an int object
Need to remove all spaces from a string in #Python? 🌌🐍
Let's take a quick look at:
• removing just space characters
• removing all whitespace
• collapsing consecutive whitespace to 1 space
• removing from the beginning/end
• removing from the ends of every line
Thread🧵
If you just need to remove space characters you could use the string replace method to replace all spaces with an empty string:
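The thread's own snippets aren't shown here, so this is a minimal sketch of one way to handle each case from the list above, using only built-in string methods. The sample text and variable names are my own:

```python
text = "  hello   world \n  bye  "

# Remove just space characters (the newline stays)
no_spaces = text.replace(" ", "")
print(repr(no_spaces))  # 'helloworld\nbye'

# Remove all whitespace: split on any whitespace, then rejoin
no_whitespace = "".join(text.split())
print(repr(no_whitespace))  # 'helloworldbye'

# Collapse consecutive whitespace to 1 space
collapsed = " ".join(text.split())
print(repr(collapsed))  # 'hello world bye'

# Remove whitespace from the beginning and end
stripped = text.strip()
print(repr(stripped))  # 'hello   world \n  bye'

# Remove whitespace from the ends of every line
lines_stripped = "\n".join(line.strip() for line in text.splitlines())
print(repr(lines_stripped))  # 'hello   world\nbye'
```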
Need to split up seconds into hours, minutes, & seconds in #Python?
This is one of the rare instances in which I might reach for Python's built-in divmod function. Though datetime.timedelta might work too, depending on your use case.
Let's compare int, //, divmod, & timedelta🧵
Given a number of seconds:
>>> duration = 4542
You might have thought to use division, modulo, and Python's int function 🤔
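Here's a rough side-by-side of the approaches mentioned above, using the same duration. The variable names are mine and the thread's own code may differ:

```python
from datetime import timedelta

duration = 4542

# Division, modulo, and int
hours = int(duration / 3600)
minutes = int(duration % 3600 / 60)
seconds = duration % 60
print(hours, minutes, seconds)  # 1 15 42

# Floor division (//) avoids the int call
print(duration // 3600, duration % 3600 // 60, duration % 60)  # 1 15 42

# divmod returns the quotient and remainder together
total_minutes, s = divmod(duration, 60)
h, m = divmod(total_minutes, 60)
print(h, m, s)  # 1 15 42

# timedelta handles the formatting for you
print(timedelta(seconds=duration))  # 1:15:42
```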
The classical "callable" is a function, but in #Python classes are also callables.
In many programming languages (e.g. JS, PHP, C++) creating a new "instance" of a class (an object whose type is that class) involves the "new" keyword:
let eol = new Date(2020, 0, 1);  // JavaScript months are zero-indexed, so 0 is January
But in #Python to make a new class instance we just call the class:
from datetime import date
eol = date(2020, 1, 1)
The fact that classes are callables means the distinction between a function and a class is often quite subtle. All of these "functions" are actually implemented as classes:
n = float("4.5")
m = int(n)
s = str(m)
b = bool(m)
t = tuple('abcd')
e = enumerate(t)
r = reversed(t)
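You can check this yourself: every class is an instance of type, while a plain function like len is not. A quick sketch (the loop variable name is arbitrary):

```python
builtins_to_check = (float, int, str, bool, tuple, enumerate, reversed)

for thing in builtins_to_check:
    # Each of these is a class (an instance of type) and is callable
    print(thing.__name__, isinstance(thing, type), callable(thing))

# len is callable too, but it's a function, not a class
print(len.__name__, isinstance(len, type), callable(len))  # len False True
```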