Implementing Distributed Transactions โก [in a gist]
Distributed Transactions are not theoretical; they are very well used in many systems. An example of it is 10-min food/grocery delivery.
๐
โจ The UX we want is: Users should see orders placed only when we have one food item and a delivery agent available to deliver.
A key feature we want from our databases is atomicity. Our storage layer can choose to provide it through atomic operations or transactions.
๐
We will have 3 microservices: Order, Store, and Delivery.
Important decision: Store services have food, and every food has packets that can be purchased and assigned.
Hence, instead of just playing with the count, we will play with the granular food packets while ordering.
๐
โจ Phase 1: Reservation
Order service calls the reservation API exposed on the store and the delivery services.
๐
The individual services reserve the food packet (of the ordered food) and a delivery agent atomically (exclusive lock or atomic operation).
Upon reservation, the food packet and the agent become unavailable for any other transaction.
๐
โจ Phase 2:
Order service then calls the store and delivery services to atomically assign the reserved food packet and the delivery agent to the order.
Upon success assigning both to the order, the order is marked as successful, and the order service returns a 200 OK.
๐
The end-user will only see "Order Placed" when the food packet is assigned, and the delivery agent is assigned to the order.
So, all 4 API calls should succeed for the order to be successfully placed.
๐
Negative cases:
- If any reservation fails, the user will see "Order Not Placed"
- If the reservation is made but assigning fails, the user will see "Order Not Placed"
๐
- If there is any transient issue in any service during the assignment phase, APIs will be retried by the order service.
- To not have a perpetual reservation, every reservation will have an expiration timer that will be large enough to cover transient outages.
๐
Thus, in any case, an end-user will never experience a moment where we say that the order is placed, but it cannot be fulfilled in the backend.
๐
This topic is extensively covered in my YT video; check it out. You can find the notes I used during the video attached ๐โ and the link to the video is right here
Any write operation happening on the Master is logged in the Replication log file as an event. The format in which these events are logged in the Log file is called Replication Format.
The two common Replication formats:
- Statement-based format
- Row-based format
โจ Statement-based Format
The Master records the operation as an event in its log, and when the Replica reads this log, it executes the same operation on its copy of data.
This way, the operation on the Master is executed on the Replica, which keeps it in sync with the Master.
Just wrapped up my 1:1 call with one of my cohort-ian and we ended up building an infinitely scalable Distributed Task Scheduler, AWS CloudWatch Events, DKron, and Quartz Scheduler, in under 30 minutes.
When foundations are clear, no system is harder to design ๐ช
- Infinite task ingestion
- 30 second SLA of execution
- Execution Framework that supports Binaries, Scripts, Remote Executions
- Fault tolerance of Scheduler Nodes
- Repeatability of tasks
- Exactly-once schedule and execution
The design we discussed did not just have random boxes of high-funda components but rather the actual tools and techs that we would be using, along with their pros and limitations. ๐ช