What's a regular expression (a.k.a. regex)? 🤔 #TerminologyTuesday

Yes, they're a programming language within a programming language that's just for pattern matching and they're extremely succinct.

But what does "regular expression" really mean? And where did they come from?
Haven't seen regular expressions?

Imagine a special purpose programming language where every single character is a statement and no whitespace or comments are allowed. 😨

Regular expressions are extremely information dense but very helpful for certain types of pattern matching.
Regular expressions are called "regular" because regular expressions define a "regular language".

What's a "regular language"?

I'm so glad you asked! This is one of the few factoids from my CS degree that actually stuck with me.

This part of CS is tightly tied to linguistics.
A regular language is a "formal language" (a language whose words are formed according to a set of rules).

This diagram is called the Chomsky hierarchy (Noam Chomsky was a renowned linguist before he got into politics).

Regular languages are the most restrictive one in this hierarchy.

[Image: concentric ovals showing recursively enumerable languages in …]
If you've ever seen the weird-looking grammar definitions in Python's documentation, that's based on EBNF, which is a notation for representing a context-free grammar.

Regular expressions are too restrictive to represent Python's grammar so a context-free grammar is needed.
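
One classic illustration (not from the original thread) is arbitrarily deep nesting, like the parentheses in a Python expression. Checking them means counting how many are currently open, and a theoretical regular expression has no way to count without limit. A minimal sketch, where is_balanced is a hypothetical helper name:

```python
def is_balanced(text):
    """Return True if every "(" in text has a matching ")"."""
    depth = 0  # how many parentheses are currently open
    for char in text:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a ")" appeared before its "("
                return False
    return depth == 0


print(is_balanced("f(a, (b + c))"))  # True
print(is_balanced("(()"))            # False
```

The depth counter can grow as large as the input demands, which is exactly the kind of memory a regular language doesn't get.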
If you ever hear folks talking about whether a language is "Turing complete" they're talking about "recursively enumerable" languages. That's on the opposite end of the formal language spectrum from regular expressions.

So what IS a regular expression / regular language?
Rule 1: a regular language can consist of one single character.

But also, if 𝗔 and 𝗕 are regular languages, then these are ALSO regular languages:

𝗔 𝗕 (concatenating)
𝗔 | 𝗕 (unioning: one OR the other)
𝗔* (repeating zero or more times)

That's it. Seems too easy, right?
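
Here's a rough sketch of those three operations using Python's re module. The two tiny languages A and B ("ab" and "cd") are made up purely for illustration.

```python
import re

# Two tiny regular languages, chosen only for illustration.
A = "ab"
B = "cd"

concatenation = f"(?:{A})(?:{B})"  # A followed by B
union = f"(?:{A})|(?:{B})"         # either A or B
star = f"(?:{A})*"                 # zero or more copies of A

print(bool(re.fullmatch(concatenation, "abcd")))  # True
print(bool(re.fullmatch(union, "cd")))            # True
print(bool(re.fullmatch(star, "")))               # True (zero repetitions)
print(bool(re.fullmatch(star, "ababab")))         # True
print(bool(re.fullmatch(star, "aba")))            # False
```

The (?:...) groups are only there to keep each piece together when the pieces are combined; they don't capture anything.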
The regular expression syntax you've probably seen was derived from those rules but includes extra shorthands to make common cases shorter to write.

These 3 regexes:

1️⃣ a(bbb|bbbb|bbbbb)c
2️⃣ abb*c
3️⃣ (a|e|i|o|u)

Could instead be written as:

1️⃣ ab{3,5}c
2️⃣ ab+c
3️⃣ [aeiou]
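
A quick way to convince yourself that each pair above behaves the same is to run both versions over some sample strings. This is only a spot check on arbitrary inputs I picked (not a proof), using Python's re.fullmatch:

```python
import re

# (longhand, shorthand) pairs taken from the two lists above.
pairs = [
    (r"a(bbb|bbbb|bbbbb)c", r"ab{3,5}c"),
    (r"abb*c", r"ab+c"),
    (r"(a|e|i|o|u)", r"[aeiou]"),
]

# Arbitrary sample inputs; these are not from the original thread.
samples = ["ac", "abc", "abbc", "abbbc", "abbbbbc", "abbbbbbc", "a", "e", "x", ""]

for longhand, shorthand in pairs:
    for text in samples:
        long_result = re.fullmatch(longhand, text) is not None
        short_result = re.fullmatch(shorthand, text) is not None
        assert long_result == short_result, (longhand, shorthand, text)

print("Each pair agrees on every sample string")
```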
The syntax most programming languages use for regexes today is heavily borrowed from Perl's syntax.

Unlike Python, Perl includes a special syntax JUST to make writing regexes easier.

Perl's love of regex likely stems from the fact that Larry Wall (Perl's creator) studied linguistics.
So regular expressions are a notation for representing regular languages.

And the syntax we use for them started in theoretical linguistics but evolved through sed, Perl, etc. to today.

But it turns out that regular expressions are also equivalent to finite state machines.
If you can represent your text-to-match with a finite state machine (FSM), you could use a regular expression instead (& vice versa).

FSMs are those diagrams with circles and arrows that connect the circles.

It's neat that regular expressions are equivalent to FSMs.

[Image: 3 node diagram representing regex a*b(b|a(a|b))*. CC BY-SA i…]
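
To make that equivalence concrete, here's a sketch of one possible 3-state machine for the regex a*b(b|a(a|b))* mentioned above, checked against Python's re module. The state names and the fsm_matches helper are my own inventions, and the machine isn't necessarily laid out the same way as the one in the image.

```python
import re
from itertools import product

PATTERN = re.compile(r"a*b(b|a(a|b))*")

# (state, character) -> next state. State names are made up for this sketch.
TRANSITIONS = {
    ("start", "a"): "start",
    ("start", "b"): "accept",
    ("accept", "b"): "accept",
    ("accept", "a"): "pending",
    ("pending", "a"): "accept",
    ("pending", "b"): "accept",
}


def fsm_matches(text):
    """Run the state machine over text; True if it ends in the accepting state."""
    state = "start"
    for char in text:
        state = TRANSITIONS.get((state, char))
        if state is None:  # no arrow for this character
            return False
    return state == "accept"


# Compare both approaches on every string of a's and b's up to length 6.
for length in range(7):
    for chars in product("ab", repeat=length):
        text = "".join(chars)
        assert fsm_matches(text) == bool(PATTERN.fullmatch(text)), text

print("The state machine and the regex agree")
```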
Something weird: programming language regexes have extra features that can't be represented in "regular languages".

An example: back references referring to a previously matched group (e.g. \1).

So Python's regular expressions do more than theoretical "regular expressions". 🤔
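
A small sketch of a back reference in Python (the doubled-word pattern and sample sentences are my own example, not from the thread):

```python
import re

# \1 must match exactly the same text that group 1 captured.
doubled_word = re.compile(r"\b(\w+) \1\b")

match = doubled_word.search("This sentence has has a repeated word.")
print(match.group(1))  # has

print(doubled_word.search("This sentence is fine."))  # None
```

The pattern has to remember whatever text the group matched, however long it was, and that is memory a finite state machine doesn't have.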
Back to practical everyday regular expressions.

2 tips for Python regexes:

1. Use a raw string (r"") for every regex you write. You'll thank me when you use \b.
2. Enable the VERBOSE flag and add whitespace & comments to your regexes. Your future self will thank you.
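
Here's a sketch of both tips together. The time-matching pattern and the sample sentence are hypothetical examples of mine:

```python
import re

# Without the r prefix, "\b" is a single backspace character,
# not the two characters the regex engine needs for a word boundary.
print(len("\b"), len(r"\b"))  # 1 2

# VERBOSE mode ignores whitespace in the pattern and allows # comments.
time_re = re.compile(r"""
    \b
    (\d{1,2})   # hours
    :
    (\d{2})     # minutes
    \b
""", flags=re.VERBOSE)

print(time_re.search("Lunch is at 12:30 today").groups())  # ('12', '30')
```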
Tips for ALL your regexes:

3. Don't overuse them. For example if you can use Python's "in" operator, do that instead.
4. Don't be overly precise: err on the side of accepting invalid input rather than not accepting valid input. Users don't like being told their name is invalid!
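
For tip 3, a tiny sketch of a check that doesn't need a regex at all (the sample text is made up):

```python
import re

text = "Error: disk full"

# For a fixed substring, the "in" operator says the same thing more plainly.
print("disk" in text)                  # True
print(bool(re.search(r"disk", text)))  # True
```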
This thread was inspired by all my Python training attendees who've asked "why are they regular if they look so weird"?

Thanks to Dr. Sheila Greibach who taught my formal languages & automata theory class at UCLA. 💖
If you find yourself strangely intrigued by the connections between regular expressions, finite state machines, context-free grammars, context-sensitive grammars, and Turing machines: look up formal language theory. And maybe dive into compiler theory eventually!
If you want to improve your regex skills, try my workshop on regular expressions: regex.training.

Practicing regexes is THE way to learn them.

And if you enjoy that exercise style check out the many regular expression-tagged exercises on @PythonMorsels. 🐍🍪

Keep Current with Trey Hunner (Python trainer)


More from @treyhunner

Aug 3
I usually recommend the "literal" list & dict syntax in #Python over the built-in list & dict functions.

✅ []
🚫 list()

✅ {}
🚫 dict()

✅ {"name": "Trey", "id": 4}
🚫 dict(name='Trey', id=4)

So what's the purpose of list(...) and dict(...)? 🤔

Copying!

(thread🧵)
Per the Zen of Python:

> there should be one— and preferably only one —obvious way to do it

I see [] and {} as "the one obvious" way to make a new list/dict.

[] and {} are even more common than list() and dict() so are likely more obvious to most Python devs.
What about passing keyword arguments to dict(...)?

This is a neat trick, but its use is limited.

Keys that aren't strings, or that aren't valid Python variable names, don't work:

>>> d = dict(class="yes")
SyntaxError: invalid syntax

I don't find the benefits of dict(...) worth 2 syntaxes.
Read 8 tweets
Aug 2
What's an "object" in Python?

According to the Python glossary, an object is:

> Any data with state (attributes or value) and defined behavior (methods). Also the ultimate base class of any new-style class.

What does that really mean? (thread🧵)

#Python #TerminologyTuesday
Usually when we think of an "object" we think of a class instance. For example these are all objects:

>>> numbers = [2, 1, 3, 4, 7] # a list object
>>> colors = {"red", "green", "blue", "yellow"} # a set object
>>> name = "Trey" # a string object
>>> n = 3 # an int object
Anything that can have attributes is an object.

Anything that has methods is an object.

Anything that you can point a variable to is an object.

Pretty much EVERY THING is an object in Python.
Read 8 tweets
Aug 1
Need to remove all spaces from a string in #Python? 🌌🐍

Let's take a quick look at:

• removing just space characters
• removing all whitespace
• collapsing consecutive whitespace to 1 space
• removing from the beginning/end
• removing from the ends of every line

Thread🧵
If you just need to remove space characters you could use the string replace method to replace all spaces by an empty string:

>>> greeting = " Hello world! "
>>> greeting.replace(" ", "")
'Helloworld!'

But you may also want to remove other whitespace too (e.g. newlines)...
To remove all sorts of whitespace, you could use the string split method along with the string join method:

>>> version = "\tpy 310\n"
>>> "".join(version.split())
'py310'

Or you could use a regular expression:

>>> import re
>>> re.sub(r"\s+", "", version)
'py310'
Read 8 tweets
Jul 11
"Why does Python have a datetime.timedelta object?"

My Intro to #Python students sometimes ask me that question ☝

My answer:

1. Moments in time are not the same as durations of time
2. Sometimes an idea is important enough to warrant a new data type

Let me explain 🧵
First, let's note that datetime != timedelta

These represent moments in time:

>>> nye = datetime(2022, 12, 31)
>>> halloween = datetime(2022, 10, 31)

But this is a *duration* of time:

>>> nye - halloween
datetime.timedelta(days=61)

A datetime can't represent "61 days"
We need some way to represent a duration of time.

Why didn't the core devs use a float representing seconds?

Imagine datetime arithmetic that way 😖

>>> from datetime import timedelta
>>> nye - halloween
5270400.0
>>> halloween + 2*24*60*60
datetime.datetime(2022, 11, 2, 0, 0)
Read 6 tweets
Jun 1
Need to split up seconds into hours, minutes, & seconds in #Python?

This is one of the rare instances in which I might reach for Python's built-in divmod function. Though datetime.timedelta might work too, depending on your use case.

Let's compare int, //, divmod, & timedelta🧵
Given a number of seconds:

>>> duration = 4542

You might have thought to use division, modulo, and Python's int function 🤔

hours = int(duration/60 / 60)
minutes = int(duration/60 % 60)
seconds = duration % 60
But that int function is only there to throw away the fractional part after dividing. Python has an operator for whole-number division (it's called floor division)!

The // operator:

hours = duration // (60*60)
minutes = duration // 60 % 60
seconds = duration % 60

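When x and y are positive ints (and small enough for float division to be exact), x//y gives the same result as int(x/y). With negative numbers they can differ: // rounds down, while int truncates toward zero.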
Read 9 tweets
May 31
Another fundamental Python term for #TerminologyTuesday

callable: an object which can be called

The classical "callable" is a function, but in #Python classes are also callables.
In many programming languages (e.g. JS, PHP, C++) creating a new "instance" of a class (an object whose type is that class) involves the "new" keyword:

let eol = new Date(2020, 1, 1);

But in #Python to make a new class instance we just call the class:

eol = date(2020, 1, 1)
The fact that classes are callables means the distinction between a function and a class is often quite subtle. All of these "functions" are actually implemented as classes:

n = float("4.5")
m = int(n)
s = str(m)
b = bool(m)
t = tuple('abcd')
e = enumerate(t)
r = reversed(t)
Read 11 tweets
