And at least one useful point I don't think I've ever thought to articulate before, on the subject of testing in production ...
You really want the ability to do both, as someone shipping code.
Are you frustrated by too many failed deploys, or days elapsing before bugs are noticed and reverted? Do you struggle with people getting paged and losing time debugging changes that weren't theirs?
What might help: assume your team already wants to do a good job, and build tooling that boosts visibility, creates feedback loops, and gives fine-grained control.
Fix your deploys so that each deploy contains a single changeset. Generate a new, tested artifact after each merge, and deploy those artifacts in order.
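A rough sketch of that loop, assuming Python glue; `build_and_test` and `deploy` here are hypothetical stand-ins for whatever your CI and deploy tooling actually do:

```python
# Sketch: one artifact per merge, deployed strictly in merge order.
# build_and_test() and deploy() are hypothetical stand-ins for your CI/CD tooling.
from collections import deque
from dataclasses import dataclass

@dataclass
class Artifact:
    sha: str       # the single merged changeset this artifact was built from
    version: str   # e.g. "build-abc1234"

def build_and_test(sha: str) -> Artifact:
    """Pretend CI step: build and test an artifact for exactly one merge."""
    return Artifact(sha=sha, version=f"build-{sha[:7]}")

def deploy(artifact: Artifact) -> None:
    """Pretend deploy step: ship one artifact to production."""
    print(f"deploying {artifact.version} (sha {artifact.sha})")

def run_pipeline(merged_shas: list[str]) -> None:
    queue: deque[Artifact] = deque()
    for sha in merged_shas:          # one merge -> one artifact
        queue.append(build_and_test(sha))
    while queue:                     # deploy in the order things merged
        deploy(queue.popleft())

if __name__ == "__main__":
    run_pipeline(["a1b2c3d4e5", "f6a7b8c9d0"])
```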
✨YOU CAN DO IT✨
Struggle with ownership? Modify your paging alerts so that if an alert fires within an hour of a deploy to the complaining service, it pages whoever wrote and merged the diff that just rolled out.
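The routing rule is simple enough to sketch; the names and the `Deploy` record below are illustrative, not any particular pager's API:

```python
# Sketch of the routing rule: alerts that land within an hour of a deploy go to
# the person who wrote and merged the changeset that just shipped; everything
# else goes to the regular on-call.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Deploy:
    service: str
    author: str            # whoever merged the changeset in this deploy
    deployed_at: datetime

RECENT = timedelta(hours=1)

def who_gets_paged(alert_service: str, alert_time: datetime,
                   recent_deploys: list[Deploy], oncall: str) -> str:
    # Most recent deploy first, so the freshest change wins
    for d in sorted(recent_deploys, key=lambda d: d.deployed_at, reverse=True):
        if d.service == alert_service and timedelta(0) <= alert_time - d.deployed_at <= RECENT:
            return d.author    # your change just rolled out -> you get the page
    return oncall              # otherwise, normal on-call rotation

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    deploys = [Deploy("billing", "sam", now - timedelta(minutes=20))]
    print(who_gets_paged("billing", now, deploys, oncall="alex"))  # -> sam
    print(who_gets_paged("search", now, deploys, oncall="alex"))   # -> alex
```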
Likely your team is pretty weak at instrumentation, for starters.
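To make that concrete: the most basic version of the habit is tagging every event your service emits with the build that's currently serving traffic, so problems line up against deploys. The field names and the `BUILD_ID` source below are illustrative:

```python
# Sketch: tag every event with the build that's serving traffic, so issues can
# be correlated with deploys. Field names and BUILD_ID source are illustrative.
import json
import os
import sys
import time

BUILD_ID = os.environ.get("BUILD_ID", "unknown")   # stamped in at deploy time

def emit(event: str, **fields) -> None:
    """Write one structured event per unit of work."""
    record = {"ts": time.time(), "event": event, "build_id": BUILD_ID, **fields}
    sys.stdout.write(json.dumps(record) + "\n")

# e.g. inside a request handler:
emit("handle_request", route="/checkout", status=200, duration_ms=42)
```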
Like, make it the norm for devs to watch their own diffs roll out, in real time.
Maybe it deploys to a canary or 10% of hosts and then requires confirmation to proceed.
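Something like this, as a sketch; `deploy_to` stands in for whatever actually pushes the artifact:

```python
# Sketch: a deploy that stops at the canary stage and waits for the person who
# shipped the change to confirm before it goes any wider. deploy_to() is a
# hypothetical stand-in for the thing that actually pushes the artifact.
def deploy_to(target: str, version: str) -> None:
    print(f"deployed {version} to {target}")

def staged_deploy(version: str) -> None:
    deploy_to("canary (10% of hosts)", version)
    answer = input(f"{version} is live on the canary -- promote to 100%? [y/N] ")
    if answer.strip().lower() == "y":
        deploy_to("remaining 90% of hosts", version)
    else:
        print(f"holding at canary; roll back {version} if it misbehaves")

if __name__ == "__main__":
    staged_deploy("build-abc1234")
```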
Or maybe you have the opposite problem: people ship too recklessly, prod goes down, and deploys fail or get rolled back all day.
* automate the process of deploying each CI/CD artifact to a single canary host
* then monitor a number of health checks and thresholds over the next 30 min
* if ok, promote 10% of hosts at a time to the new version (see the sketch just below)
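Here's a sketch of that whole loop; `deploy_to`, `check_health`, and `rollback` are stand-ins for your deploy tooling and monitoring, and the 30-minute soak and 10% steps come straight from the list above:

```python
# Sketch of the rollout loop: canary first, soak for 30 minutes against health
# checks, then promote 10% of hosts at a time. check_health(), deploy_to() and
# rollback() are hypothetical stand-ins for monitoring and deploy tooling.
import time

SOAK_SECONDS = 30 * 60    # watch the canary for 30 minutes
STEP_PERCENT = 10         # then widen the rollout in 10% increments

def deploy_to(percent: int, version: str) -> None:
    print(f"{version}: now on {percent}% of hosts")

def check_health(version: str) -> bool:
    """Stand-in: query error rate / latency / saturation against thresholds."""
    return True

def rollback(version: str) -> None:
    print(f"{version}: health checks failed, rolling back")

def progressive_rollout(version: str) -> bool:
    deploy_to(1, version)               # a single canary host (~1% here)
    time.sleep(SOAK_SECONDS)            # soak period on the canary
    if not check_health(version):
        rollback(version)
        return False
    for percent in range(STEP_PERCENT, 101, STEP_PERCENT):
        deploy_to(percent, version)
        if not check_health(version):
            rollback(version)
            return False
    return True

# progressive_rollout("build-abc1234") would walk the whole sequence end to end.
```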
Carefully considered changes to deploys can improve the overall function of the team in one fell swoop.
- if you merge to master, your changes will automatically go live in the next 15 minutes
- deploys are almost entirely non-events, because the behavioral changes are gated behind feature flags anyway (sketched below)
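For flavor, a toy sketch of what "gated behind feature flags" looks like at the call site; the in-memory `FLAGS` dict is just a stand-in, any real flag service works the same way:

```python
# Sketch: the new code path ships dark, and the deploy itself changes nothing.
# Behavior only changes when the flag flips. The in-memory flag store here is a
# toy stand-in for whatever flag service you actually use.
FLAGS = {"new_checkout_flow": False}    # flipped at runtime, not at deploy time

def flag_enabled(name: str) -> bool:
    return FLAGS.get(name, False)

def old_checkout(cart) -> str:
    return "old flow"

def new_checkout(cart) -> str:
    return "new flow"

def checkout(cart) -> str:
    if flag_enabled("new_checkout_flow"):
        return new_checkout(cart)       # new code: deployed, but dormant
    return old_checkout(cart)           # current behavior, unchanged by the deploy

if __name__ == "__main__":
    print(checkout(cart=[]))            # "old flow" -- the deploy was a non-event
    FLAGS["new_checkout_flow"] = True   # flip the flag when you're ready
    print(checkout(cart=[]))            # "new flow"
```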