A database needs to do two things: when given data, it should store the data, and when asked for it again, it should return the data.
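As a rough illustration of those two operations, here is a minimal sketch of an append-only key-value store. The class name `LogStore`, the comma-separated file format, and the file name are assumptions made for this example only, not anything prescribed by the notes.

```python
import os
import tempfile


class LogStore:
    """Toy append-only store: writes append a line, reads scan for the last match."""

    def __init__(self, path):
        self.path = path
        # Create the file if it does not exist yet.
        open(self.path, "a", encoding="utf-8").close()

    def set(self, key, value):
        # Assumption: keys contain no comma and values contain no newline.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{key},{value}\n")

    def get(self, key):
        result = None
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                k, _, v = line.rstrip("\n").partition(",")
                if k == key:
                    result = v  # later entries override earlier ones
        return result


path = os.path.join(tempfile.mkdtemp(), "database.log")
db = LogStore(path)
db.set("42", "San Francisco")
db.set("42", "Oakland")
latest = db.get("42")
```

Writes are cheap because they only append, while reads scan the whole file. Trade-offs of this kind are part of why different storage engines suit different workloads.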
There are big differences between storage engines optimized for transactional workloads, and those optimized for analytics.
Exploring the distinction:
Keep data geographically close to users
Allow the system to continue working even if some of the parts have failed (to increase availability)
Scale out the number of machines that can serve read queries
Maintaining consistency among the shards
Considerations:
Synchronous or async replication
How to handle failed replicas
Replication lag
One of the replicas is designated as the leader. When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.
The other replicas are known as followers. Whenever the leader writes new data to its local storage, it also sends the data change to all its followers as part of a replication log / change stream.
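The leader/follower flow described above can be sketched roughly as follows. This is an in-memory toy, assuming the replication log is just an ordered list of (key, value) changes; the class and method names are hypothetical.

```python
class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def apply(self, entries):
        # Apply changes in the same order the leader wrote them.
        for key, value in entries:
            self.data[key] = value
            self.applied += 1


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # replication log: ordered (key, value) changes
        self.followers = followers

    def write(self, key, value):
        # 1. Write to the leader's local storage first.
        self.data[key] = value
        self.log.append((key, value))
        # 2. Send each follower the log entries it has not applied yet.
        for follower in self.followers:
            follower.apply(self.log[follower.applied:])


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 1)
leader.write("x", 2)
leader.write("y", 3)
```

Because every follower applies the same changes in the same order, each one ends up with the same data as the leader.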
Synchronous replication: the leader waits until the follower has confirmed that it received the write before reporting success to the user and making the write visible to other clients.
Asynchronous replication: the leader sends the change to the follower but does not wait for a response.
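The difference between the two modes can be sketched as follows. This is a single-process simulation, assuming each follower has an inbox of pending changes; `receive`, `process`, and the other names are hypothetical and stand in for network delivery and acknowledgement.

```python
from collections import deque


class Follower:
    def __init__(self):
        self.data = {}
        self.inbox = deque()  # changes sent but not yet applied

    def receive(self, change):
        self.inbox.append(change)

    def process(self):
        # Apply everything waiting in the inbox, in order.
        while self.inbox:
            key, value = self.inbox.popleft()
            self.data[key] = value
        return True  # acknowledgement back to the leader


class Leader:
    def __init__(self, sync_followers, async_followers):
        self.data = {}
        self.sync_followers = sync_followers
        self.async_followers = async_followers

    def write(self, key, value):
        self.data[key] = value
        for follower in self.sync_followers:
            follower.receive((key, value))
            if not follower.process():  # wait for confirmation
                raise RuntimeError("synchronous follower did not confirm")
        for follower in self.async_followers:
            follower.receive((key, value))  # send and move on
        return "ok"


sync_f, async_f = Follower(), Follower()
leader = Leader([sync_f], [async_f])
status = leader.write("x", 1)
lagging_view = dict(async_f.data)  # what the async follower holds right now
async_f.process()                  # the async follower catches up later
```

Right after the write returns, the synchronous follower already holds the new value, while the asynchronous one still has it queued. That gap is the replication lag listed under the considerations above.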
Synchronicity:
The follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader.
If the synchronous follower doesn’t respond, the write cannot be processed.
Hence the proposal of semi-synchronous replication: one follower is synchronous and the others are asynchronous.
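A minimal sketch of the semi-synchronous idea, assuming exactly one follower is synchronous, the rest are asynchronous, and an asynchronous follower is promoted when the synchronous one stops responding. All names are hypothetical. The `up` flag simulates a failure, and a real system would also need the failed follower to catch up on missed writes.

```python
class Follower:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True  # set to False to simulate a crash or network fault

    def replicate(self, key, value):
        if not self.up:
            return False  # no acknowledgement
        self.data[key] = value
        return True


class SemiSyncLeader:
    def __init__(self, followers):
        self.data = {}
        self.sync = followers[0]           # exactly one synchronous follower
        self.others = list(followers[1:])  # the rest are asynchronous

    def write(self, key, value):
        self.data[key] = value
        if not self.sync.replicate(key, value):
            self._promote(key, value)
        for follower in self.others:
            follower.replicate(key, value)  # result ignored: asynchronous
        return "ok"

    def _promote(self, key, value):
        # The first asynchronous follower that confirms becomes synchronous.
        for follower in self.others:
            if follower.replicate(key, value):
                self.others.remove(follower)
                self.others.append(self.sync)  # demote the unresponsive one
                self.sync = follower
                return
        raise RuntimeError("no follower confirmed the write")


a, b = Follower("a"), Follower("b")
leader = SemiSyncLeader([a, b])
leader.write("x", 1)
a.up = False          # the synchronous follower fails
leader.write("x", 2)  # still succeeds: b is promoted
```

With this arrangement a write is always confirmed on at least two nodes (the leader and one follower), yet a single unresponsive follower does not block all writes.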
Clients are constantly writing to the database, so the data is always in flux; a standard file copy would see different parts of the database at different points in time.
Locking the database while copying would give a consistent copy, but it goes against the idea of high availability.
Instead, take a consistent snapshot of the leader’s database at some point in time, then copy the snapshot.
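The snapshot approach can be sketched like this. It is an in-memory toy in which the snapshot is tagged with its position in the leader's replication log, so a new follower knows which later changes it still has to request. The catch-up step and all names here are assumptions added for illustration.

```python
import copy


class Leader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # Deep copy so later writes cannot leak into the snapshot.
        # The log position records exactly which writes it contains.
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        return self.log[position:]


class Follower:
    def __init__(self, snapshot_data, position):
        self.data = snapshot_data
        self.position = position

    def catch_up(self, leader):
        # Apply every change made after the snapshot was taken.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = Leader()
leader.write("a", 1)
snap, pos = leader.snapshot()  # taken without blocking writes
leader.write("b", 2)           # writes continue during the copy
leader.write("a", 3)

follower = Follower(snap, pos)
before = dict(follower.data)
follower.catch_up(leader)
```

The leader never stops accepting writes. The follower starts from a consistent but slightly stale state, then applies the backlog to converge on the leader's data.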
One leader, and all writes must go through it.