/1 Data is cached everywhere, from the front end to the back end!
This diagram illustrates where we cache data in a typical architecture.
/2 There are multiple layers along the flow.
🔹 1. Client apps: HTTP responses can be cached by the browser. The first time we request data over HTTP, the response is stored in the browser cache; when we request the same data again, the client app tries to retrieve it from the browser cache first.
/3
🔹 2. CDN: A CDN caches static web resources. Clients can retrieve data from a nearby CDN node.
🔹 3. Load balancer: The load balancer can cache resources as well.
/4
🔹 4. Messaging infra: Message brokers store messages on disk first, and then consumers retrieve them at their own pace. In Kafka, for example, the data stays in the cluster for a period of time determined by the retention policy.
/5
🔹 5. Services: There are multiple layers of cache in a service. If the data is not in the CPU cache, the service will try to retrieve it from memory. Sometimes the service has a second-level cache that stores data on disk.
/6
🔹 6. Distributed cache: A distributed cache like Redis holds key-value pairs for multiple services in memory. It provides much better read/write performance than the database.
/7
🔹 7. Full-text search: We sometimes need a full-text search engine such as Elasticsearch for document search or log search. A copy of the data is indexed in the search engine as well.
/8
🔹 8. Database: Even in the database, we have different levels of caches:
- WAL (write-ahead log): data is written to the WAL first, before building the B-tree index
- Buffer pool: a memory area allocated to cache data and index pages read from disk
- Materialized View
/9
- Transaction log: records all transactions and database updates
- Replication log: records the replication state in a database cluster
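A rough sketch of how layer 6 is often used: a cache-aside read that checks the cache first and falls back to the database on a miss. `TTLCache` is a toy in-memory stand-in for a distributed cache such as Redis, and `cache_aside_read` and `load_from_db` are hypothetical names for illustration, not a real client API.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Toy in-memory stand-in for a distributed cache (hypothetical helper)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # key -> (expiry timestamp, cached value)
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The entry outlived its TTL: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)

    def delete(self, key: str) -> None:
        self._entries.pop(key, None)


def cache_aside_read(cache: TTLCache, key: str,
                     load_from_db: Callable[[str], str]) -> str:
    """Try the cache first; on a miss, load from the database and cache it."""
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(key)
    cache.set(key, value)
    return value
```

Note that deleting a key here only clears this one layer. Copies held by other layers stay around until they expire or are invalidated separately.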
/10 👉 Over to you: With the data cached at so many levels, how can we guarantee the sensitive user data is completely erased from the systems?
Subscribe to our weekly newsletter to learn something new every week: bit.ly/3FEGliw
🔹 Processes are usually independent, while threads exist as subsets of a process.
🔹 Each process has its own memory space. Threads that belong to the same process share the same memory.
🔹 A process is heavyweight: it takes more time to create and terminate than a thread.
🔹 Context switching is more expensive between processes than between threads.
🔹 Communication between threads is faster than communication between processes.
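A small sketch of the memory-sharing point: a thread mutates the parent's dictionary directly, while a child process only changes its own copy. This assumes a Unix-like system where the `fork` start method is available; `bump` and `compare_thread_and_process` are hypothetical names.

```python
import multiprocessing as mp
import threading
from typing import Dict, Tuple


def bump(state: Dict[str, int]) -> None:
    """Increment the counter in whatever memory space we are running in."""
    state["value"] += 1


def compare_thread_and_process() -> Tuple[int, int]:
    state = {"value": 0}

    # A thread shares the parent's memory, so the parent sees the update.
    worker = threading.Thread(target=bump, args=(state,))
    worker.start()
    worker.join()
    after_thread = state["value"]

    # A forked child gets its own copy of memory (assumption: "fork" exists
    # on this platform). Its update never reaches the parent's dictionary.
    ctx = mp.get_context("fork")
    child = ctx.Process(target=bump, args=(state,))
    child.start()
    child.join()
    after_process = state["value"]

    return after_thread, after_process
```

To get the child's result back, the processes would need an explicit channel such as a pipe or queue, which is part of why inter-process communication costs more.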
/1 What are the most common misconceptions about distributed environments?
About 30 years ago, Peter Deutsch drafted a list of eight fallacies in distributed computing environments, now known as "The 8 fallacies of distributed computing". Many years later, the fallacies remain.
/2 🔹The network is reliable
🔹Latency is zero
🔹Bandwidth is infinite
🔹The network is secure
🔹Topology doesn't change
🔹There is one administrator
🔹Transport cost is zero
🔹The network is homogeneous
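One common way to design against the first fallacy is to assume calls can fail and retry them with exponential backoff. A minimal sketch, where `call_with_retries` and its parameters are hypothetical and `ConnectionError` stands in for whatever error a real client raises:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying on ConnectionError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                # Out of retries: surface the failure to the caller.
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting. Retries are only safe when the operation is idempotent.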
/1 ChatGPT and Copy.ai brought attention to AIGC (AI-generated content). Why is AIGC seeing explosive growth?
The diagram below summarizes the development in this area.
OpenAI has been developing GPT (Generative Pre-trained Transformer) since 2018.
/2 GPT 1 was trained on the BooksCorpus dataset (5GB), with a main focus on language understanding.
On Valentine’s Day 2019, GPT 2 was released with the slogan “too dangerous to release”. It was trained on web pages linked from Reddit posts with at least 3 karma (40GB). The training cost was $43k.
/3 Later, GPT 2 was used to generate music in MuseNet and Jukebox.
/1 Our newsletter ByteByteGo just reached an important milestone, and I wanted to share some of the learnings in this journey.
/2 How did we get here?
Before posting anything about system design on social media, I spent 2.5 years writing 2 system design interview books. Writing a good book is incredibly hard and usually not very rewarding, but this turned out to be my best investment.
/3 It taught me 3 things: 1) How to write technical content people like to read. 2) Good work takes time. Don’t rush it. 3) Follow your intuition.
/1 How do you decide which type of database to use?
There are hundreds or even thousands of databases available today, such as Oracle, MySQL, MariaDB, SQLite, PostgreSQL, Redis, ClickHouse, MongoDB, S3, Ceph, etc. How do you select the architecture for your system?
/2 My short summary is as follows:
🔹Relational database. It can solve almost any problem.
🔹In-memory store. Its speed and limited data size make it ideal for fast operations.
🔹Time-series database. It stores and manages time-stamped data.
/3 🔹Graph database. It is suitable for complex relationships between unstructured objects.
🔹Document store. It is good for large immutable data.
🔹Wide column store. It is usually used for big data, analytics, reporting, etc., which need denormalized data.
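To illustrate the point that a relational database can cover many workloads, here is a small sketch that stores time-stamped readings in SQLite and aggregates them with plain SQL. The `readings` table, its columns, and `average_per_sensor` are hypothetical; a dedicated time-series database would add features such as retention and downsampling that this does not show.

```python
import sqlite3
from typing import Dict, Iterable, Tuple


def average_per_sensor(
    readings: Iterable[Tuple[str, int, float]],
) -> Dict[str, float]:
    """Load (sensor, timestamp, value) rows and return the mean per sensor."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE readings ("
            "sensor TEXT NOT NULL, ts INTEGER NOT NULL, value REAL NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO readings (sensor, ts, value) VALUES (?, ?, ?)",
            readings,
        )
        rows = conn.execute(
            "SELECT sensor, AVG(value) FROM readings "
            "GROUP BY sensor ORDER BY sensor"
        ).fetchall()
        return dict(rows)
    finally:
        conn.close()
```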
Based on the Lucene library, Elasticsearch provides search capabilities. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface. The diagram below shows the outline.
/2 Features of Elasticsearch:
🔹 Real-time full-text search
🔹 Analytics engine
🔹 Distributed Lucene
Elasticsearch use cases:
🔹 Product search on an eCommerce website
🔹 Log analysis
🔹 Autocomplete, spell checker
🔹 Business intelligence analysis
🔹 Full-text search
/3 🔹 Full-text search on Stack Overflow
The core of Elasticsearch lies in its data structures and indexing. It is important to understand how ES builds the term dictionary using an LSM Tree (Log-Structured Merge Tree).
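To make the term dictionary idea concrete, here is a toy inverted index: a sorted mapping from each term to the IDs of the documents that contain it. This is an in-memory illustration only, with hypothetical function names. It does not model Lucene's on-disk format or the merge behavior of an LSM Tree.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (word.strip(".,!?").lower() for word in text.split())
    return [token for token in tokens if token]


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, List[int]]:
    """Map each term to a sorted posting list of document IDs."""
    postings: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    # Keep terms in sorted order, like a dictionary that can be scanned.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


def search(index: Dict[str, List[int]], query: str) -> List[int]:
    """Return IDs of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], []))
    for term in terms[1:]:
        matches &= set(index.get(term, []))
    return sorted(matches)
```

A query only touches the posting lists of its own terms, which is why lookups stay fast as the number of documents grows.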