Designing Data-Intensive Applications

Updated at: 04 Aug 2019 13:52:42

Storage and Retrieval

A database needs to do two things: when given data, it should store it, and when asked for that data later, it should return it.
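A minimal sketch of those two operations, assuming a toy in-memory append-only log rather than any real storage engine. The names `LogStore`, `put` and `get` are hypothetical and chosen only for illustration.

```python
class LogStore:
    """Toy key-value store backed by an append-only list of records."""

    def __init__(self):
        # Each entry is a (key, value) tuple, oldest first.
        self._log = []

    def put(self, key, value):
        # Storing never overwrites in place: a new record is appended.
        self._log.append((key, value))

    def get(self, key):
        # Scan from newest to oldest so the latest write for a key wins.
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None  # key was never written


store = LogStore()
store.put("user:1", "alice")
store.put("user:1", "bob")   # later write shadows the earlier one
print(store.get("user:1"))
```

Writes are cheap here because they only append, while reads get slower as the log grows. An index would speed up reads at the cost of extra work on every write.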

There are big differences between storage engines optimized for transactional workloads, and those optimized for analytics.

Exploring the distinction:

Replication

Reasons

  1. Keep data geographically close to users

  2. Allow the system to continue working even if some of the parts have failed (to increase availability)

  3. Scale out the number of machines that can serve read queries (to increase read throughput)

Algorithms to handle changes

Maintaining consistency among the replicas

Considerations:

  1. Synchronous or async replication

  2. How to handle failed replicas

  3. Replication lag

Leaders and followers

Leader-based replication

One of the replicas is designated as the leader. When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage.

The other replicas are known as followers. Whenever the leader writes new data to its local storage, it also sends the data change to all of its followers as part of a replication log / change stream.
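The write path described above can be sketched roughly as follows. This is a simplified in-process model, not how any particular database implements it. The class names `Leader` and `Follower` and their methods are assumptions made for the example.

```python
class Follower:
    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def apply(self, entry):
        # Apply one change from the replication log, in the order received.
        key, value = entry
        self.data[key] = value
        self.applied += 1


class Leader:
    def __init__(self, followers):
        self.data = {}
        self.log = []  # replication log: ordered list of (key, value) changes
        self.followers = followers

    def write(self, key, value):
        # 1. Write to the leader's local storage first.
        self.data[key] = value
        # 2. Record the change in the replication log.
        entry = (key, value)
        self.log.append(entry)
        # 3. Send the change to every follower, in log order.
        for follower in self.followers:
            follower.apply(entry)


followers = [Follower(), Follower()]
leader = Leader(followers)
leader.write("x", 1)
leader.write("x", 2)
print(followers[0].data)
```

Reads could be served by the leader or any follower, but writes only go through `Leader.write`. Because followers apply entries in the same order as the leader, they converge on the same state.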

Synchronous versus asynchronous replication

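Synchronous replication:

The follower is guaranteed to have an up-to-date copy of the data that is consistent with the leader.

If the synchronous follower doesn’t respond, the write cannot be processed; the leader has to block writes until that follower is available again.

Hence the proposal of semi-synchronous replication: one follower is synchronous and the rest are asynchronous.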
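A rough sketch of the semi-synchronous idea, assuming one synchronous follower and any number of asynchronous ones. `Replica`, `SemiSyncLeader` and `flush_async` are hypothetical names. As a simplification, the leader here checks the synchronous follower before committing locally, whereas a real leader would write locally first and then wait for the acknowledgement.

```python
class Replica:
    def __init__(self, online=True):
        self.data = {}
        self.online = online

    def apply(self, key, value):
        # An offline replica never acknowledges the change.
        if not self.online:
            raise ConnectionError("replica did not respond")
        self.data[key] = value


class SemiSyncLeader:
    def __init__(self, sync_follower, async_followers):
        self.data = {}
        self.sync_follower = sync_follower
        self.async_followers = async_followers
        self.pending = []  # changes not yet delivered to async followers

    def write(self, key, value):
        # The write succeeds only if the synchronous follower acknowledges it.
        self.sync_follower.apply(key, value)  # raises if the follower is down
        self.data[key] = value
        # Asynchronous followers receive the change later.
        self.pending.append((key, value))
        return "ok"

    def flush_async(self):
        # Stand-in for background delivery; failure handling is omitted.
        for key, value in self.pending:
            for follower in self.async_followers:
                follower.apply(key, value)
        self.pending.clear()


sync, lagging = Replica(), Replica()
leader = SemiSyncLeader(sync, [lagging])
leader.write("k", "v")
print(sync.data, lagging.data)  # the async follower lags until flush_async()
```

The gap between `write` returning and `flush_async` running is the replication lag that asynchronous followers can show.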

Setting up new followers

Clients are constantly writing to the database, so the data is always in flux; a standard file copy would see different parts of the database at different points in time.

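  1. Lock the database and copy it, but this goes against the goal of high availability.

  2. Take a consistent snapshot of the leader’s database at some point in time and copy the snapshot to the new follower. The follower then requests all changes made since the snapshot and applies them to catch up.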
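The snapshot approach can be sketched as below, assuming the snapshot is tagged with its position in the leader's replication log so the new follower knows where to resume. All names (`SnapshotLeader`, `NewFollower`, `changes_since`) are hypothetical, and the catch-up step is an assumption about the usual procedure rather than something a specific database is claimed to do this way.

```python
import copy


class SnapshotLeader:
    def __init__(self):
        self.data = {}
        self.log = []  # ordered list of (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))

    def snapshot(self):
        # A consistent snapshot: a copy of the data plus the log
        # position that the copy corresponds to.
        return copy.deepcopy(self.data), len(self.log)

    def changes_since(self, position):
        # Every change recorded after the given log position.
        return self.log[position:]


class NewFollower:
    def __init__(self):
        self.data = {}
        self.position = 0

    def bootstrap(self, leader):
        # Start from the snapshot instead of copying live files.
        self.data, self.position = leader.snapshot()

    def catch_up(self, leader):
        # Apply the backlog of changes made since the snapshot was taken.
        for key, value in leader.changes_since(self.position):
            self.data[key] = value
            self.position += 1


leader = SnapshotLeader()
leader.write("a", 1)
follower = NewFollower()
follower.bootstrap(leader)
leader.write("b", 2)          # the leader keeps accepting writes meanwhile
follower.catch_up(leader)
print(follower.data == leader.data)
```

The leader never has to stop accepting writes: anything written after the snapshot is picked up from the log during `catch_up`.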

Multi-Leader Replication

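Single-leader replication has one major downside: there is only one leader, and all writes must go through it. Multi-leader replication extends this by allowing more than one node to accept writes.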