Every write operation on the Master is recorded as an event in the replication log. The format in which these events are written to the log is called the Replication Format.
There are two common Replication formats:
- Statement-based format
- Row-based format
✨ Statement-based Format
The Master records the operation itself (the statement) as an event in its log. When the Replica reads this log, it executes the same operation on its copy of the data.
This way, every operation executed on the Master is also executed on the Replica, which keeps it in sync with the Master. For example:
UPDATE tasks SET is_done = true WHERE user_id = 53;
is logged as
UPDATE tasks SET is_done = true WHERE user_id = 53;
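To make the flow concrete, here is a minimal sketch of statement-based replication using two in-memory SQLite databases as stand-ins for the Master and the Replica. The helper names (`make_node`, `execute_on_master`, `replay_on_replica`) and the Python list used as the log are assumptions for illustration, not how any particular database implements it.

```python
import sqlite3

SCHEMA = "CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, is_done INTEGER)"
SEED_ROWS = [(1, 53, 0), (2, 53, 0), (3, 7, 0)]


def make_node():
    """Create an in-memory database standing in for one node (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", SEED_ROWS)
    conn.commit()
    return conn


def execute_on_master(master, replication_log, statement):
    """Run the write on the Master and log the statement text as an event."""
    master.execute(statement)
    master.commit()
    replication_log.append(statement)


def replay_on_replica(replica, replication_log):
    """Re-execute every logged statement, in order, on the Replica."""
    for statement in replication_log:
        replica.execute(statement)
    replica.commit()


def snapshot(conn):
    """Return all rows in a stable order so two nodes can be compared."""
    return conn.execute("SELECT id, user_id, is_done FROM tasks ORDER BY id").fetchall()


master, replica = make_node(), make_node()
replication_log = []  # the "log file" is just a list in this sketch

# SQLite stores booleans as integers, so 1 plays the role of true here.
execute_on_master(master, replication_log, "UPDATE tasks SET is_done = 1 WHERE user_id = 53;")
replay_on_replica(replica, replication_log)

print(replication_log)
print(snapshot(master) == snapshot(replica))
```

The log holds exactly one entry no matter how many rows the statement touched, which is where the smaller log files come from.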
👉 Advantages of Statement-based Replication:
- Smaller log files
- Log files can be used to audit the database
👉 Disadvantages of Statement-based Replication:
- Non-deterministic operations such as RAND() or UUID() can yield different values on the Master and the Replica
- Replica lag depends on the load on the Replica and on the queries running concurrently with replication, because every statement has to be executed again in full
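The non-determinism problem is easy to reproduce. In this small sketch, the same statement is executed independently on two in-memory SQLite databases, the way a Replica would re-execute a logged statement. SQLite's built-in `randomblob()` stands in for functions like RAND() and UUID(); the `token` column and the `make_node` helper are assumptions made up for the example.

```python
import sqlite3


def make_node():
    """Create an in-memory database standing in for one node (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, token TEXT)")
    conn.executemany("INSERT INTO tasks (id) VALUES (?)", [(1,), (2,), (3,)])
    conn.commit()
    return conn


def tokens(conn):
    """Return the token of every row in a stable order."""
    return [row[0] for row in conn.execute("SELECT token FROM tasks ORDER BY id")]


# The statement is identical on both nodes, but its result is not:
# randomblob(16) produces fresh random bytes every time it is evaluated.
STATEMENT = "UPDATE tasks SET token = hex(randomblob(16))"

master, replica = make_node(), make_node()
master.execute(STATEMENT)   # executed on the Master
replica.execute(STATEMENT)  # re-executed from the log on the Replica

diverged = tokens(master) != tokens(replica)
print(diverged)
```

Both nodes ran the same event, yet they now hold different data, and nothing in the log can reconcile them.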
✨ Row-based Format
The Master logs the updates on the individual data item instead of the operation.
When the Replica reads this log, it updates its copy of the data by applying the changes on its data items. This way the Replica remains in sync with the Master.
UPDATE tasks SET is_done = true WHERE user_id = 53;
is logged as one change record for each row with user_id = 53, not as the statement itself.
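Here is a rough sketch of the row-based flow, again with two in-memory SQLite databases. The event shape (a dict with `table`, `pk`, and `after`) and the helper names are assumptions chosen for readability; real databases use their own compact binary formats.

```python
import sqlite3

SCHEMA = "CREATE TABLE tasks (id INTEGER PRIMARY KEY, user_id INTEGER, is_done INTEGER)"
SEED_ROWS = [(1, 53, 0), (2, 53, 0), (3, 7, 0)]


def make_node():
    """Create an in-memory database standing in for one node (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO tasks VALUES (?, ?, ?)", SEED_ROWS)
    conn.commit()
    return conn


def mark_done_on_master(master, row_log, user_id):
    """Apply the update on the Master and log one event per changed row."""
    affected = master.execute(
        "SELECT id FROM tasks WHERE user_id = ?", (user_id,)
    ).fetchall()
    master.execute("UPDATE tasks SET is_done = 1 WHERE user_id = ?", (user_id,))
    master.commit()
    for (row_id,) in affected:
        # Hypothetical event shape: which row changed and its new values.
        row_log.append({"table": "tasks", "pk": row_id, "after": {"is_done": 1}})


def apply_on_replica(replica, row_log):
    """Write the logged values straight into the matching rows; nothing is re-evaluated."""
    for event in row_log:
        for column, value in event["after"].items():
            replica.execute(
                f"UPDATE {event['table']} SET {column} = ? WHERE id = ?",
                (value, event["pk"]),
            )
    replica.commit()


def snapshot(conn):
    """Return all rows in a stable order so two nodes can be compared."""
    return conn.execute("SELECT id, user_id, is_done FROM tasks ORDER BY id").fetchall()


master, replica = make_node(), make_node()
row_log = []
mark_done_on_master(master, row_log, user_id=53)
apply_on_replica(replica, row_log)

print(len(row_log))
print(snapshot(master) == snapshot(replica))
```

Because the Replica only copies values addressed by primary key, it never has to evaluate the WHERE clause or any function from the original statement.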
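👉 Advantages of Row-based Replication:
- Changes can be safely and predictably applied on the Replica, because the log carries the resulting values rather than an operation to re-run
- Fewer row locks are needed, and they are held for a shorter time
👉 Disadvantages of Row-based Replication:
- If an operation affects 5000 rows, the Master creates 5000 entries in the log file, so log files grow much larger
- Writing that many entries keeps the log locked for longer, which affects throughput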
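To get a feel for the log-size trade-off, the sketch below compares what the two formats would record for one update that touches 5000 rows. The JSON encoding and the event shape are assumptions for illustration only; real replication logs are binary and more compact, so treat the numbers as a relative comparison.

```python
import json

ROWS_AFFECTED = 5000

# Statement-based: one event, regardless of how many rows match.
statement_log = ["UPDATE tasks SET is_done = true WHERE user_id = 53;"]

# Row-based: one event per affected row (hypothetical event shape).
row_log = [
    {"table": "tasks", "pk": row_id, "after": {"is_done": True}}
    for row_id in range(1, ROWS_AFFECTED + 1)
]


def log_size_bytes(log):
    """Approximate the log size by JSON-encoding every event."""
    return sum(len(json.dumps(event).encode("utf-8")) for event in log)


statement_bytes = log_size_bytes(statement_log)
row_bytes = log_size_bytes(row_log)

print("statement-based:", len(statement_log), "event,", statement_bytes, "bytes")
print("row-based:", len(row_log), "events,", row_bytes, "bytes")
```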
Just wrapped up a 1:1 call with one of my cohort members, and we ended up designing an infinitely scalable Distributed Task Scheduler, along the lines of AWS CloudWatch Events, DKron, and Quartz Scheduler, in under 30 minutes.
When foundations are clear, no system is hard to design 💪
The design covers:
- Infinite task ingestion
- 30 second SLA of execution
- Execution Framework that supports Binaries, Scripts, Remote Executions
- Fault tolerance of Scheduler Nodes
- Repeatability of tasks
- Exactly-once schedule and execution
The design we discussed was not just random boxes of fancy-sounding components; it named the actual tools and technologies we would use, along with their pros and limitations. 💪