15 January 2023

This article summarizes my experience with software architecture. Architecture design is essentially driven by philosophies, the engine that generates all the knowledge that follows. From the organizational view, we can see why and how the architecture design process and its skills are required the way they are. Common methodologies and principles, viewed through these philosophies, provide guidance for carrying out architecture design with quality. An architect needs an armory of techniques for different system properties. I categorize reference architectures in each distributed-storage area, summarize architecture design patterns from them, and connect them into technology design spaces.

Table of Contents

Software architecture - A philosophy perspective

Software architecture is a model of the real world, a language, and a creation of the human mind to assist the human mind. Language is an interesting topic. The three are deeply interconnected, pointing to why, what, and how to handle software architecture.

The following chapters cover knowledge in software architecture, but this first chapter is about the engine that generates that knowledge.

Reality, language, and human mind

First, the model of the world is human language. Human language has evolved over thousands of years, enriched by distinctive cultures, polished by daily interaction among entire populations, and tested by the full range of industry usage and creation. Grab a dictionary, and you learn the world and mankind.

Next, the modeling tool is also a model of the modeler itself; i.e., human language is also a model of the human mind. Thinking is carried and organized by language. Language is structured in the way the human mind is capable of perceiving the world, rather than in the way the world itself necessarily has to be. E.g., software is designed with high cohesion and low coupling, which is also the principle by which words are created in language. They reduce software complexity because humans think this way.

We can say human language, the mind, and perceivable reality are isomorphic (of the same structure). The expedition into the outer world follows the same path as the exploration of the deep heart. Losing a culture, a language, is losing a piece of reality. Like two sides of a coin, human language is both the greatest blessing, by which mankind outperforms other creatures, and the eternal cage that bounds how far the human mind can perceive.

About software architecture

Software architecture is a language, a model of the real world, and a creation of the human mind to assist the human mind. The essence of software architecture is to honestly reflect the outer world, to introspect into the inner mind, and to conceptually light up what is dark and missing. The answer is already there, embedded in the structure, waiting to be perceived.

The question is not what software architecture itself is, nor what software architecture contains, but how to understand the landscape of world and mind, where you see the hole that needs “software architecture” to fill. You predict and design what “software architecture” should be, can be, and will be. There could be three thousand parallel worlds, each with a different software architecture book. We pick one of ours.

Besides, knowledge and experience are themselves good designs. They are essentially a domain language, a reusable piece of world modeling, which also explains why they are useful across daily work and can even substitute for design skills. Knowledge is not something to learn, but a place to observe the art of design as tested by human action.

The mapping between human mind, language, and reality

Side notes: Explaining with examples

For “high cohesion low coupling” in human language, imagine an apple on a desk. People name them “apple” and “desk”, rather than a “half apple + half desk” thing. Like selecting what to wrap into an object in Object-Oriented (OO) design, the naming of “apple” and “desk” practices “high cohesion low coupling”.

To drill deeper, “high cohesion” implies “going together”. One underlying axis is time, along which the apple goes with itself as a whole. The edges of the apple and the desk intersect, but they have different curves, and they can be decoupled (separated if you move them). Another underlying axis is space. Humans sense the apple and the desk through basic elements like shape and color. These sense elements grow along the axes of time and space and are processed into human language. The processing principles look like those from software design; or rather, software design principles are crafted to suit the human mind.
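The apple-and-desk naming maps naturally to OO design. Below is a toy sketch (the classes and fields are hypothetical, purely for illustration): each nameable concept becomes one class whose parts change together, while an unnameable “half apple + half desk” would bundle unrelated reasons for change into one unit.

```python
from dataclasses import dataclass

@dataclass
class Apple:
    """High cohesion: color and position change with the apple as a whole."""
    color: str
    x: float = 0.0

    def move(self, dx: float) -> None:
        # The apple moves as one unit along the time axis.
        self.x += dx

@dataclass
class Desk:
    """A separate concept: the desk can change without touching the apple."""
    width: float
    height: float

# Low coupling: the two objects can be created, moved, and
# modified independently, just as the words are used independently.
apple = Apple(color="red")
desk = Desk(width=1.2, height=0.8)
apple.move(2.0)
```

A “HalfAppleHalfDesk” class, by contrast, could not be given an honest name, which is exactly the signal of a low-cohesion grouping.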

An imagined creature could have a totally different language system and thinking mind if it did not rely on visual sight like humans, or did not even live along time and space axes. It might not need “high cohesion low coupling” as a thinking principle either. E.g., it could process information the way organic biology evolves.

As for human language being a cage, remember that language is a model of reality. Modeling implies that “less important” information is dropped to ease the burden of human cognition. Is it really less important? Words reuse the same concept for events that happened at different times, which avoids duplication. But are they really duplicates? The necessity of language is itself a sign that the human mind is unable to process “full” information. Relying on language, that ability is crippled, limited, caged.

Moreover, the human mind can hardly think without the underlying time and space axes. Human words, at the bottom layer of the abstraction tower, can hardly do without “human-organ-oriented” sense elements. People frequently need daily chats to sync drifts in abstract concepts. Language itself is even becoming a bottleneck in human-to-machine and population-to-population information exchange.

As for the “software architecture” hole in the landscape of world and mind, you will see more in the rest of this article. Though most people associate “software architecture” with technology, it is also determined by organization and process needs. Various “needs” from different domains flow into the gap of “software architecture”, crafted to be processed and expressed in a language suitable for the human mind. Together they evolve into the internal meaning of “software architecture”.

As for predicting and designing what “software architecture” should be, it can be explained as a method of learning. The plain way is to learn what it is: the structure, the composition, covering the knowledge points, and practicing their use. The better way is to first understand the driving factors, the landscape, and the dynamics behind it. You can see its source and direction, even to its end and final limitation. You can also see the many different alternatives that could have happened but were eventually not chosen by the real-world industry, for reasons hidden underwater. You should be able to define your own methodology given your local, customized needs. You can forget the knowledge and create your own from scratch.

Cutting the apple and desk

Why need software architecture

There are various aspects of why software architecture is necessary, besides technology. Together these aspects define what software architecture should be, and correspondingly the methodology and knowledge landscape that developed around it.

Technology aspects

  • Handling complexity. Software design is separated into the architecture level, component level, and class level. Each level is popularized with its own techniques: the 4+1 view, design patterns, refactoring. Any challenge can be solved by adding one layer of abstraction.

  • Deciding the key technology stack. Internet companies commonly build services atop open-source stacks across different domains, e.g., database, caching, service mesh. Which stack to use affects the architecture, and the choice is often evaluated against technology goals and organization resources.

  • Cost of faults. The cost of correcting a fault at the early design stage is far lower than at full-fledged implementation, especially for architecture-level faults that require restructuring component interconnects.

Capturing the big

  • Non-functional requirements. Typically availability, scalability, consistency, performance, security, and COGS. More importantly: possible worst cases, how to degrade, and critical paths. Also testability, usability, quality, extensibility, and delivery & rollout. They are not explicit customer functional needs, but they are usually more important, and they touch a wide scope of components to finally implement.

  • Capturing the changing and the non-changing. Architecture design identifies what changes quickly and what can remain stable. The former is usually localized and encapsulated behind abstractions, or outsourced to plugins. The latter is designed with the question “can this architecture run 1/3/5 years without an overhaul?”, which usually maps to components and their interconnections.

  • Issue, strategy, and decision. Architecture is where key issues in the system, technology, and organization are captured. Strategies are developed to cope with them, and an explicit track of design decisions is documented.

  • Clarifying the fuzziness. At the architecture step, it is not uncommon that customer requirements are unclear, problems are complex and clouded, the future is unstable, and the system scope is unknown. The architect role analyzes, defines, designs solutions and alternatives, and builds consensus across teams.

  • Capturing the big. The architect role needs to define what system properties must be kept under tight control throughout the project lifecycle. They map to the project’s success goals and key safety criteria. More importantly, the architect role needs to decide what to give up, which may not be as easy as it looks, and reach consensus across teams.
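The “capturing the changing and the non-changing” bullet above can be made concrete with a small sketch. All class and function names here are hypothetical: the fast-changing choice (which compression codec to use) is pushed behind a stable interface, so the stable core can run for years while plugins are swapped underneath.

```python
from abc import ABC, abstractmethod
import zlib

class Compressor(ABC):
    """Stable interface: designed to survive years without overhaul."""
    @abstractmethod
    def compress(self, data: bytes) -> bytes: ...

class ZlibCompressor(Compressor):
    """One concrete plugin; the choice of codec is the changing part."""
    def compress(self, data: bytes) -> bytes:
        return zlib.compress(data)

class NoopCompressor(Compressor):
    """Another plugin: pass data through unchanged."""
    def compress(self, data: bytes) -> bytes:
        return data

def store(blob: bytes, codec: Compressor) -> bytes:
    # The stable core depends only on the abstraction, never on a
    # concrete codec, so swapping plugins does not disturb it.
    return codec.compress(blob)
```

Swapping `ZlibCompressor` for `NoopCompressor` (or any future codec) requires no change in `store`, which is exactly the localization of change the bullet describes.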

Process & Organization

  • Project management. The architecture step is usually where the cost effort (touching scope, delivery artifacts, development model) and the resources, schedule, and quality can be determined and evaluated. It is also where to work closely with customers to lock down requirements. Project management usually works with the architect role.

  • Review and evaluation. The architecture step is usually where the key designs are reviewed; the key benefits, costs, and risks are evaluated; the throughput capacity breakdown is verified; and all user scenarios and system scenarios are ensured to be addressed. This usually involves stakeholders from different backgrounds and engages senior management.

  • Cross-team collaboration. Architecture touches various external systems and stakeholders. It is when to break barriers and build consensus across teams or BUs, when to ensure support and responses from key stakeholders, and where to drive collaboration. Unlike technology, which involves only oneself, driving collaboration can be a larger challenge.

  • Tracks and lanes. The architect role usually builds the framework, and then the many team members quickly contribute code under the given components. It sets the tracks and lanes where the code can grow and where it cannot, i.e., the basis of intra-team collaboration. Further, the tracks and lanes are visions for the future roadmap, and standards for the team’s daily co-work.

Architect role

Different architecture organization styles

What an architect role does and means in the real-world industry is somewhat puzzling. From my experience, this is because the architecture step is organized differently at different companies. At some, architect is the next job position for every software developer. At others, I have not even seen an explicit architect job position.

Architect the tech lead

Usually seen at Internet companies. The architect role is taken by a senior member of the team who masters the technology stacks and design principles. The architect decides which technology stack to use and builds the framework for the other team members to fill with concrete code. The architect role is in high demand because Internet companies quickly spin up app after app, each needing its own architect, while the underlying open-source infrastructure is relatively stable. Both the business value and the technology stack win traction. The API richness at the upper app level implies more products and components to host new architects, while the infra level generally has simpler APIs and honors vertical depth.

Architecture BU

BU - business unit, i.e., department. Usually seen at telecom companies. Architects work with architects, software developers with software developers; they reside in different BUs. The architecture results are handed off in the middle, following a waterfall / CMMI model. The architecture is designed against more stable, even standardized, requirements, with very strict verification, and delivers complete documentation. Expect strong process and many meetings bouncing across BUs. Employees tend to be separated into a decision-making layer and an execution layer, where the latter expects long hours, limited growth, and early retirement.

Peer-to-peer architect

Usually seen in teams building a dedicated technology. Unlike Internet companies spinning up apps horizontally atop many different technologies, such a team vertically focuses on one, e.g., building a database, a cloud storage, or an infrastructure component; i.e., 2C (the former) vs 2B (the latter) culture. There is no dedicated architect job position; the role is shared by everyone. Anyone can start a design proposal (incremental, a new component, even a new service). The design undergoes a few rounds of review by a group of senior members, not fixed but selected by relevance and interest. Anyone can contribute to the design, and anyone can join freely to set off with project development. Quite organic. Technology is the key traction here, and new architecture can be invented for it (e.g., new NVM media driving storage design).

System analyst

Usually seen at companies selling ERP (enterprise resource planning) or doing outsourcing. The systems are heavily involved with customer-side domain knowledge, and that domain knowledge is invalidated when selling to another customer from a different domain. Because of the new background each time, comprehensive requirement-analysis and architecture procedures have been developed. When the domain can be reused, domain experts are valued; knowledge and experience are themselves good designs. Domain knowledge can win more traction than technology, where the latter leans toward stability and cost management.

Borrow and improve

Usually seen at follower companies. If not cutting edge into no man’s land, a reference architecture (a top product’s architecture) can usually be found to borrow from, customize, and improve. This is also helped by the wide variety of open source. Reference architectures, standing on the shoulders of giants, are widely used in software architecture processes, e.g., for comparing peer works, which is another example of knowledge and experience themselves being good designs. Market and technology investigation and survey are highly demanded skills here.

Key processes in software architecture

How do you successfully drive the software architecture process? It involves various collaborations with upstream and downstream, identifying the scope, and breaking barriers and building consensus. Problem analysis and implementation deployment are connected by data-driven evaluation, composing the continuous feedback loop that drives future evolution.

Knowledge and skills

As preparation, architecture design requires the knowledge and skills below.

  • Downstream, understand your customer. The customer here also includes downstream systems that consume yours. Know the customer to capture the key aspects to prioritize in the architecture, and more importantly what to de-prioritize (e.g., favor latency over cost? Are consistency and HA really needed?). It helps identify risks (e.g., festival burst usage, backup traffic patterns). Besides, well defining the customer space reveals the future directions in which the architecture can evolve.

  • Upstream, understand what your system is built atop. A web app can be built atop a range of server engines, service meshes, databases, caching, monitoring, analytics, etc. Mastering the technology stacks is necessary for designing an architecture that works in the practical world, and for choosing the correct technology stacks that suit the project goals and team capabilities.

  • Externally, understand the prior art. To design a good system, you need to know your position in the industry. Reference architectures can be discovered and borrowed from. Existing technology and experience should be leveraged. E.g., given the richness of open-source databases, designing a new data storage is largely a selection and cropping of existing techniques. Participating in meetups helps exchange industry status, and ensures your design is not drifting into a pitfall.

  • Internally, understand your existing systems. Understand the existing system to make designs that actually work, and to correctly prioritize what helps a lot versus what helps little. Learn from past design history, experience, and pitfalls, to reuse them and take the right path.

  • Organizationally, broaden your scope. Architecture design involves interacting with multiple external systems and stakeholders. Be sure to broaden your scope and get familiar with them. Communicate with more people. Solid soft skills are needed for cross-team / BU collaboration, to break barriers and build consensus, and to convey action-oriented points concisely, with the big picture integrated with detailed analysis.

Carry out the steps

I lean more toward the peer-to-peer architect style mentioned above. Much can be sensed from GXSC’s answer. At each step, be sure to engage in talks with different persons, which significantly improves design robustness. Rather than the design results, it is the problem analysis and the alternative trade-off analysis that weigh most.

  • Firstly, problem analysis. A design proposal starts from innovation. Finding the correct problem to solve is halfway to success. The cost and benefit should be translated into final market money (anti-example: we should do it because the technology is remarkable; good example: we adopt this design because it maps to $$$ of annual COGS savings). The problem scope should be complete; e.g., don’t miss upgrade and rollout scenarios, ripple effects on surrounding systems, or exotic traffic patterns that are rare but do happen in large-scale deployments. Risks should be identified: internally from the technology stacks; externally from cross teams, the market, and the organization. A key of management is to defuse risks, and the same holds for managing the design.

  • One important aspect of problem analysis is prioritization. Architecture design, even the final system, cannot address every problem. You must decide what to discard, what to suppress, what to push down to lower-level design, which choices to make now and which to defer, what to push into abstractions, what to rely on external systems for, and what to push off as future optimization; and you must decide which critical properties to grasp tightly throughout the project lifetime and monitor end-to-end. I.e., the other key of management is to identify the critical path. Prioritization is usually determined by organization goals, key project benefits and costs, and the art of coordinating across teams.

  • Next, find alternatives. To solve one problem, at least two proposals should be developed. Trade-off analysis is carried out to evaluate the pros and cons. Usually, pros have special cases that make them worse, and cons have compensations that make them not so bad. The discussion is carried out across team members, up/downstream teams, and stakeholders, which may in turn discover new alternatives. The process is iterative, and the effort is non-trivial and multiplied, because you are developing not one solution but a rippling tree of them. Eventually you have explored the completeness of the design space and technology space, and reached consensus across the team. Choosing the final alternative can be done by team voting, or with a score matrix for comparison.

  • Review with more people. Firstly, find one or two nearby colleagues for an early review, to build a more solid proposal. Next, find senior and experienced people to review, to make sure no scenarios are missing, everything that can be reused is reused, and the solution uses the best approach. Then involve key upstream people, to ensure the required features, load level, and hidden constraints are actually supported, and that their own feature rollouts won’t impact yours. Involve key downstream people, to ensure the new system addresses what they actually want. It is important to involve key stakeholders early: make sure you gain support from the organization, you deliver visibility, and you align with high-level prioritization.

  • Then comes the evaluation of the architecture design. Make sure the problem analysis, every customer scenario and system scenario, and the project goals are well addressed. Make sure non-functional requirements are addressed. Make sure the key project benefits and costs are verified in a data-driven approach, with actual production numbers as input, using a prototype, simulation tools, or mathematical models. Make sure the system can support the required load level, by breaking down the throughput capacity into each component. Make sure the system handles the worst case and supports graceful throttling and downgrade. Make sure the logic is complete; e.g., when you handle a Yes path, you must also address the No path; when you start a workflow, you must also handle how it ends, goes back, interleaves, and loops. Make sure development and delivery are addressed, e.g., how the infra supports multi-team development, the branching policy, component start/online/maintenance/retire strategies, CI/CD, and rollout safety. Also, make sure hidden assumptions and constraints are explicitly pointed out and addressed.

  • Finally, there is the documentation. In practice, it involves a short “one-pager” document (which can actually be < 20 pages), slides for quick presentation, and spreadsheets for data evaluation. Today’s culture leans toward lightweight documents, the central truth living in the codebase, and prioritizing agile, peer-to-peer communication. Problem analysis and alternative trade-off analysis usually weigh more in the document than the design itself, where defining the problem space is a key ability. It transforms complex, muddy problems into breakdown-able, executable, measurable parts. The architecture design part usually includes key data structures, components, state machines, workflows, interfaces, key scenario walkthroughs, and several detailed issue discussions. Importantly, the document should track the change history of design decisions, i.e., how they reached today, and more specifically the Issue, Strategy, Design Decision chain.

  • Another output of architecture design is the interfaces. Interface design does have principles (see later). Interfaces are the tracks and lanes where the following development starts. They reveal how components are cut and how interactions happen. They also propagate the expectations of your system to external systems, such as how they should co-work and what should be passed.
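The “breaking down throughput capacity into each component” check in the evaluation step can be sketched with back-of-the-envelope arithmetic. All component names and numbers below are hypothetical; the point is that the end-to-end system is bounded by its weakest component on the critical path.

```python
# Hypothetical per-component capacity (requests/sec) on the critical path.
capacity = {
    "load_balancer": 50_000,
    "api_server":    12_000,
    "database":       8_000,
    "cache_miss_db":  3_000,
}

target_qps = 10_000

# The system's limit is the minimum capacity along the chain.
bottleneck = min(capacity, key=capacity.get)
system_limit = capacity[bottleneck]

print(f"bottleneck={bottleneck}, limit={system_limit} rps")
if system_limit < target_qps:
    # The gap is a concrete action item: scale out the bottleneck,
    # shed load, or revise the design before implementation starts.
    print(f"capacity gap: {target_qps - system_limit} rps short")
```

Crude as it is, doing this breakdown at the architecture step is far cheaper than discovering the bottleneck after full implementation.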

Designed to evolve

Architecture is designed to evolve, and prioritized to make it evolve faster. The Ele.me payment system is a good example over a 5-year scope. The competency of today’s software depends on the velocity at which it evolves, rather than on a static function set.

  • Simple is beauty. The initial architecture usually addresses only the key requirements. What changes and what does not change over several years’ scope is identified and addressed with abstraction. An MVP is a viable first deployment, after which the challenge becomes how to “replace the wheels on a racing van”.

  • Highways are important. Functionality in software resembles the tall buildings in a city, where highways and roads are key to how fast they are built. These architecture aspects are less visible and usually under-prioritized, but they are life critical. Inside the system, they can be debuggability, logging, visibility, and monitoring. Have quality standards been defined for them? Does the monitoring have more 9s than the system it keeps reliable? From infrastructure, they can be the tooling, platform, config system, fast rollout, convenience of obtaining data, analytics, and scripting. At the organization level, they can be the team process and culture that facilitate agile moves. Externally, they can be the ecosystem and plugin extensibility. E.g., Chrome designed apps and plugins as first-class citizens. E.g., Minecraft published tools to build third-party mods. E.g., open-source Envoy designed for community engagement from day 1.

  • Build the feedback loop. Eventually, after project rollout and deployment, you should be able to collect data and evaluate the actual benefits and costs. New gaps can be found, which in turn facilitate a new round of design and improvement. How to construct such a data-driven feedback loop should be taken into consideration in the architecture design.
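The feedback loop above can be as simple as diffing the numbers predicted at design time against the numbers measured in production. A minimal sketch, with all figures hypothetical:

```python
# Predicted at the architecture/evaluation step (hypothetical figures).
predicted = {"cogs_saving_usd": 120_000, "p99_latency_ms": 40}

# Measured after rollout, from production telemetry (hypothetical figures).
measured = {"cogs_saving_usd": 95_000, "p99_latency_ms": 55}

# Each gap is a candidate problem statement for the next round of
# problem analysis, closing the design-deploy-evaluate loop.
gaps = {k: measured[k] - predicted[k] for k in predicted}
print(gaps)
```

The exact metrics matter less than wiring the collection into the architecture from day one, so the comparison is possible at all.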

Driving the project

The last point is about driving the project. The architect role usually comes with ownership, being responsible for the progress and final results. Driving spans not only the architecture step but the entire project execution. Much can be sensed from the 道延架构 article.

  • There can be timeline and schedule issues, new technical challenges, new blockers, and more necessary communication with up/downstream; previous assumptions may no longer hold, circumstances can change, and new risks will need engagement; there can be many people joining, many to coordinate, and many items to follow up.

  • Besides knowledge and communication skills, driving involves long-term perseverance, attention, and care. The ability to find real problems, to prioritize and leverage resources, to push, along with experience and the skillset of project management, are all valued. To drive also means to motivate team members to join and innovate. The design becomes more robust, more complete, and improved with more people’s help, and with people looking from different perspectives.

  • More, driving is a mindset. You are not the one who asks questions; people ask questions of you, and you are the final barrier that decides whether a problem is solvable or not. The most difficult problems naturally route to you. If solving a problem needs resources, you make the plan and lobby for support. You make the prioritization, you define, and you eat the dogfood. The team follows you to success (or otherwise).

Architecture design feedback loop

Key methodologies in software architecture

Software architecture is a large topic for which I have not found a canonical structure. I divide it into process (above), methodologies (this chapter), principles, system properties and design patterns, and technology design spaces. The article is organized accordingly.

  • Process. Already covered in the chapters above. It involves how real-world organizations carry out architecture design, and conceptually what should be done for it.

  • Methodologies. The analysis methods, concept frameworks, and general structures for carrying out architecture design. They also interleave with principles and philosophies. Methodologies change with culture trends, organization styles, and technology paradigms, but throughout the years some valuable points have lasted.

  • Principles. The architecture level, component level, and class level each have many principles, common or specific. Essentially, they are designed to reduce the mind’s burden, by letting the code space mimic how the human mind is organized.

  • System properties and design patterns. Distributed systems have non-functional properties like scale-out, consistency, and HA (high availability). Various architectural design patterns have been developed to address each. They are reusable knowledge and a domain language. Best practices can be learned from reference architectures, from more market players, and from historical systems, where architecture archaeology systematically surveys the past history of systems.

  • Technology design spaces. A technology, e.g., the database, can evolve into different architectures after adapting to respective workloads and scenarios, e.g., OLAP vs OLTP, in-memory vs on-disk. Exploring the choices and architectures, and plotting them on the landscape, reveals the design space. With this global picture in mind, the design-space landscape greatly helps in navigating a new round of architecture design.

Managing the complexity

The first and biggest topic in architecture design (or software design) is handling complexity. The essence is to let the code space mimic the human mind, i.e., how human language is organized (if you have read the philosophy chapter). Human language is itself the best model of the complex world, a “design” polished by human history and shared by everyone. Domain knowledge is thus helpful, as it is the language itself. When the code space stays close to the language space (or uses a good metaphor), it naturally saves everyone’s mental burden.

Below are conceptual tools for handling complexity.

  • Abstraction. Any challenge can be solved by adding one layer of abstraction. The tricky part is that you must precisely capture, even predict, what can change and what cannot. It is non-trivial. E.g., for long years people tried to build an abstract interface across the Windows API and the Linux API, but today what we have is “write once, glitch somewhere”. You still need to examine down the abstraction tower to the bottom, because a coding interface cannot constrain all hidden assumptions and non-functional properties, e.g., throughput, latency, and compatibility. Information in the flow can become missing or distorted after passing along the abstraction tower, resulting in incorrect implementation.

  • Information flow. A typical design captures how code objects flow around the system. Instead, you should capture how information described in human language flows around the system. Language information is symmetric at the sender and receiver components, but its implementation and representation vary (e.g., you pass an “apple” across the system, rather than DB records, DAOs, bean objects, etc.). A dependency is essentially a symmetry: there may be no code references, but the semantics are linked (e.g., the apple has the color “red”, which every part of your system must handle correctly). Language information carries the goal, to which the code should align; i.e., the code should align to the human-language model. Human language is consistent, compared to the code objects passed around the system; the latter become a source of bugs when misalignment happens at different layers of the system. This design principle eventually leads to “programming by contract”, the “unbreakable class” (a component should work, with no assumptions about the outside, regardless of what the caller passes in), and semantic analysis; but there is more to learn from it.

  • High cohesion, low coupling. Human concepts, or words in a language, are all constructed following the rule of “high cohesion low coupling”. This is how the human mind works, and by following it, the code design saves mental burden. The topic relates to change and dependency. High cohesion encapsulates changes, which localizes the impact of code modification. Changes pass along the wires of dependency; that is why low coupling works to reduce undesired propagation. Good encapsulation and delegation require predicting future changes, which is usually not easy; instead of adding unnecessary OO complexity, this oppositely leads back to KISS design.

  • Name and responsibility. The most difficult thing in software design is giving names. This is not to say fancy names are hard to find, but that being able to name something means you have already grouped the concept in a high-cohesion way (e.g., you can name “apple” and “desk”, but cannot name “half apple + half desk”), which inherently leads to good design. Next, a name defines what a thing is, what it is not, what it can do, and what it cannot do; that is its responsibility. Saying objects should be called by their names is saying objects should be called by their interfaces and responsibilities. Finally, when you can describe the system in fluent human language, i.e., with good names and information flows, you are naturally doing good design. To do it better, organize the talk at consistent abstraction levels, rather than jumping around; if you can, it means the design’s abstraction levels are consistent and self-contained too. Remember, design is modeling into human language (if you have read the philosophy chapter).

  • Reuse. If a component is easy to reuse, it naturally follows high cohesion and good naming and responsibility. Designing for reuse is recommended, but avoid introducing extra encapsulation and delegation, which result in high OO complexity. Refactoring for reuse is recommended, but refactoring usually requires global knowledge, which contradicts the goal that changes should be localized. Reference architecture is another kind of reuse that reduces mind complexity: find the top products and opensource to learn from, and the popular frameworks which teach good design. Past experience then becomes shared domain knowledge across team members, and the changing points become more predictable.

  • Separation of concerns. Divide and conquer, and decomposition, are the popular concepts. Decouple at the boundary with the fewest dependency links. Make components orthogonal, each in its own space. Make APIs idempotent across the timeline of calls. To truly separate concerns, methodologies such as encapsulation, knowledge hiding, and minimal assumptions naturally follow. In theory, any complexity can be broken down into handy small pieces, but beware of information flow distorted in between, and of holes missed when delegating responsibility.

  • Component boundary. Separating components and sub-components eases the mind’s memory usage. A component boundary should be cut at what changes together. If an upstream service change is frequently coupled with a downstream service change, they should have been put into the same component; violating this is the common way microservices mess up a system. High organizational collaboration cost is another place to cut a component boundary, see Conway’s Law.
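The “unbreakable class” idea above can be sketched in code. Below is a minimal, hypothetical Python example (the `Inventory` class and its method names are my own invention, not from the article): the class validates every input and defends its invariant, so it keeps working regardless of what the caller passes in.

```python
# Hypothetical sketch of an "unbreakable class": it validates every input
# and keeps its invariant (no negative counts), no matter what callers do.
class Inventory:
    """Tracks item counts; the count for any item is never negative."""

    def __init__(self):
        self._counts = {}

    def add(self, item, n):
        # Contract: item is a non-empty string, n is a positive int.
        if not isinstance(item, str) or not item:
            raise ValueError("item must be a non-empty string")
        if not isinstance(n, int) or n <= 0:
            raise ValueError("n must be a positive integer")
        self._counts[item] = self._counts.get(item, 0) + n

    def remove(self, item, n):
        # Contract: cannot remove more than is present.
        if self._counts.get(item, 0) < n:
            raise ValueError("cannot remove more than present")
        self._counts[item] -= n

    def count(self, item):
        return self._counts.get(item, 0)
```

Note the caller interacts only through the named responsibilities (`add`, `remove`, `count`); the contract, not the implementation, is the interface.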

Design complexity can be formulated and evaluated by scoring dependencies. I found D Score interesting, and this Measuring software complexity article lists other measures. These methods are less popular, probably because domain knowledge is more effective at handling complexity. Besides the below bullets, the Domain-Driven Design - Chapter 1 article also has a nice list of code complexity measurements.

  • “D Score” measures software complexity by the number of dependencies. A dependency link inside a component adds to cohesion; one pointing outside adds to coupling. The two types of dependency links are summed up, with a formula, into the final score.

  • “Halstead Metrics” treat software as operators and operands. The total numbers of operators and operands, their unique counts, and the operands per operator are summed up, with a formula, into the final score.

  • “Cyclomatic Complexity” treats software as a control flow graph. The numbers of edges, nodes, and branches are summed up, with a formula, into the final score.

Code complexity measurement
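As a rough illustration of the last metric above, here is a small Python sketch that approximates cyclomatic complexity by counting decision points in a function’s source (a common simplification of the graph formula M = E − N + 2P; the function name and the set of counted node types are my own choices, not a standard tool):

```python
import ast

def cyclomatic_complexity(source):
    """Rough cyclomatic complexity: 1 plus one per decision point.

    This approximates M = E - N + 2P for a single function by counting
    branching constructs instead of building the control flow graph.
    """
    tree = ast.parse(source)
    decisions = (ast.If, ast.For, ast.While, ast.BoolOp, ast.IfExp, ast.Try)
    return 1 + sum(isinstance(node, decisions) for node in ast.walk(tree))
```

For example, a straight-line function scores 1, and each `if` or loop adds 1. Real tools (e.g. radon, lizard) handle many more cases; this only shows the idea.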

Levels of architecture design

Software design is complex. To manage the complexity, I break it into different levels and views. Typical levels are the architecture level, component level, and class level. The abstraction goes from high to low, the scope from big to small, and the uncertainty from fuzzy to clear. Each level has its own methodologies. The levels also map to first-and-next design steps, which in practice can be simplified or mixed, leaning toward the real-world bottleneck.

  • Architecture level focuses on components and their interconnections. Interconnections are abstracted into ports and connectors. A component or connector can hide great complexity and delay technical decisions to lower levels. A component can be a metadata server, a storage pool, or a distributed cache. A connector can be a queue with throttling QoS, a REST service, or event-driven CQRS. The system scope is examined, e.g. input and output flows, how to interact with end users, and the upstream/downstream systems. The infrastructure and technology stack to build atop can be investigated and determined. Non-functional requirements, user scenarios, and system scenarios are captured and addressed at this level. The typical analysis method is the 4+1 View. When people talk about software architecture, they usually refer to this level. It is also what this article covers.

  • Component level follows the architecture level. It focuses on the design inside a component. The scope should also be defined, e.g. interfaces, input and output, execution model, and resources needed. This level usually involves tens of classes; Design Patterns are the popular methodology, and components should be designed Reusable. The architecture may be built on existing systems, where technical debt plays a role: rewrite everything with a better design (high cost), or reuse by inserting new code (high coupling).

  • Class level next focuses on the more fine-grained level, i.e. how to implement one or a few classes well. The definitions are clear and ready for coding. Typical methodologies are Coding Styles, Code Refactoring, and Code Complete (a badly chosen book name). You will also hear about defensive programming and contract-based programming. UML diagrams are vastly useful at this level (and the component level), as a descriptive tool and, more importantly, an analysis tool; e.g. use a state machine diagram/table to ensure all possible system conditions are exhausted and handled. (Similar methods are shared in PSP, which is a subset (Combining CMMI/PSP) of CMMI; the real world today leans more to Agile, while CMMI essentially turns developers into screws in a machine, with heavy documentation and tightly monitored statistics.)
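The state machine table mentioned above can be made executable. Below is a hypothetical Python sketch (the job states and events are invented for illustration): writing down every legal (state, event) pair forces you to decide what happens in each case, instead of leaving corner cases implicit.

```python
# Hypothetical state table for a job lifecycle. Enumerating the pairs makes
# it easy to audit that every reachable (state, event) combination is decided.
TRANSITIONS = {
    ("pending", "start"): "running",
    ("pending", "cancel"): "cancelled",
    ("running", "finish"): "done",
    ("running", "fail"): "failed",
    ("running", "cancel"): "cancelled",
    ("failed", "retry"): "running",
}

def step(state, event):
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        # Unlisted pairs are explicit errors, not silent no-ops.
        raise ValueError(f"illegal event {event!r} in state {state!r}")
```

The table is the analysis artifact: a reviewer can check it for missing transitions without reading any control flow.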

Views of architecture design

Views help understand software design from different perspectives. The methodologies covered below act as descriptive tools for design output, analysis tools to verify correctness, heuristic frameworks for the mind flow, and processes to carry out architecture design.

4+1 View is one of the most popular software architecture methods. It captures the static structure, runtime parallelism, physical deployment, and development lifecycle.

  • Logical View: the components and their interconnections. The diagram captures what the system consists of and how it functions, the dependencies and scope, and the decomposition of responsibility. It’s the most commonly mentioned sense of “what architecture is”.

  • Process View: the Logical View is static, while the Process View captures the runtime. Performance and scalability are considered here. Examples are how multi-node parallelism and multi-threading are organized, and how control flow and data flow co-work along the timeline.

  • Deployment View: e.g. which parts of the system run in the datacenter, the CDN, the cloud, and on client devices; how rollout and upgrade should be managed; and what binary artifacts to deliver.

  • Implementation View: managing the code, repos, modules, DLLs, etc. Logical View components are mapped to concrete code objects that developers and project managers can readily work on. It also covers branching policies and how different versions of the product are maintained.

  • Usecase View: last but the most important view. It captures all user scenarios and system scenarios, and the functional and non-functional requirements. Each of them is walked through across the previous four views to verify it is truly addressed.

Architecture 4+1 view

UML diagrams are the generic software modeling tool, but they can also be understood from the views’ perspective.

  • Structural diagrams capture the static view of the system. From higher abstraction level to lower, there are Component diagram, Class diagram, Object diagram, etc.

  • Behavioral diagrams capture the runtime view of the system. E.g. the Activity diagram for logic workflow, the Sequence diagram for the timeline, the Communication diagram for multi-object interaction, and the State diagram everyone likes.

  • Other diagrams. There is the Usecase diagram to capture user scenarios and more fine-grained cases, and deployment-oriented diagrams to capture how the code and artifacts are grouped for development.

Domain-Driven Design (DDD) views the system from the domain expert’s perspective. It applies to systems with complex business logic and domain knowledge, e.g. ERP, CRM (customer relationship management), or Internet companies with rich business. Compared to traditional OO design, which easily leads to a spider web of objects (a “Big Ball of Mud”), DDD introduces “domains” to tidy it up. Besides the key concepts listed below, the IDDD flu vaccine is also a nice example.

  • Domain. A big complex system (enterprise scale) is cut into multiple domains (e.g. user account system, forum system, ecommerce system, etc), each with its specific domain knowledge, language wording, and domain experts.

  • Bounded Context. The boundary of a domain is called its bounded context. The same conceptual object may exist in two different domains, but it maps to different classes; e.g. an account in the banking context is different from an account in book selling. An object can only interact with objects from the same bounded context. And you should not directly operate on getters/setters; instead you use “business workflow calls”. (Similarly, in OO design, objects should interact with objects at the same abstraction level.) A domain’s object cannot (directly) go outside its bounded context. Bounded contexts are orthogonal to each other.

  • Context Map. But how do two bounded contexts interact? A domain’s object is mapped into another domain via the Context Map. The Context Map can be as simple as “new an object and assign properties”, or as complex as a REST service. An anti-corruption layer (ACL) can be inserted in the middle for isolation.

  • Drive DDD design by language. Domain knowledge is a language, and the knowledge itself is a good design (if you have read the philosophy part). However, language is by nature fuzzy; that’s why contexts are introduced, to bound it with certainty. Language is fluent when you organize the talk at the same abstraction level, which explains why objects should only interact with objects from the same bounded context. DDD is a methodology that operates language into design. It expresses domain knowledge as reusable code, where domain experts are (or are close to) the code developers.

  • Company strategic view. DDD is able to model company-wide. An executive needs to strategically decide what is core to business competency, what supports it, and what the common parts are. This introduces Core domains, Supporting domains, and Generic domains, among which resources are prioritized. In the long term, the domain knowledge, and the DDD model implemented in running code, accumulate into valuable company assets. DDD architecture focuses on lasting domain modeling, where a good design is neutral to the technical architecture implementing it.

IDDD flu vaccine example
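The Bounded Context and Context Map ideas above can be sketched in a few lines. Below is a hypothetical Python example (the two account classes and the mapping function are invented for illustration, echoing the banking vs book-selling example): the same conceptual “account” maps to different classes in different contexts, and a small translation function acts as the context map / anti-corruption layer.

```python
from dataclasses import dataclass

# Banking context: an "account" means money.
@dataclass
class BankAccount:
    iban: str
    balance_cents: int

# Book-selling context: an "account" means a customer profile.
@dataclass
class StoreAccount:
    customer_id: str
    loyalty_points: int

def to_store_account(bank: BankAccount, customer_id: str) -> StoreAccount:
    """Context map (a tiny anti-corruption layer): translate a banking-context
    object into the book-selling context instead of sharing one class.
    Only what the store context needs crosses the boundary."""
    return StoreAccount(customer_id=customer_id, loyalty_points=0)
```

Because each context owns its own class, a change to `BankAccount` cannot silently leak into the store domain; only the mapping function must be revisited.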

There are more general architecture views, used more in customer-facing and sales scenarios. They provide alternative insights into what an architecture should include.

  • This Enterprise architecture consists of the Business architecture, Data architecture, Application architecture, and Technology architecture. It is viewed more from the enterprise business level and does a coarse decomposition. It’s shown in the picture below.

  • The 四横三纵 (“four horizontals, three verticals”) architecture, detailed further in this Alibaba 四横三纵 article. The “four horizontals” are IaaS, DaaS (data as a service), PaaS (platform services), and SaaS. The “three verticals” are Standard Definition & Documentation (标准规范体系), Security Enforcement (安全保障体系), and Operation Support & Safety (运维保障体系).

An enterprise architecture

Besides this section, I also found valuable experience in Kenneth Lee’s blogs/Kenneth Lee’s articles and the remarkable On Designing and Deploying Internet-Scale Services; and in eBay’s 三高 design P1/eBay’s 三高 design P2 articles, Alibaba’s 道延架构 article, and AWS’s 如何软件开发 article.

Common architecture styles

This is an old topic: generic design patterns at the scale of architecture. Recent technologies bring more paradigms, but the essence can be traced back. Company-wide, the architecture may eventually evolve to reflect the organization’s communication structure (Conway’s Law), besides the technical aspects.

  • Layered architecture. Today no architecture can totally discard this.

  • Repository/blackboard architecture. All components are built around the central database (the “repository” or “blackboard”). They use a pull model or get pushed updates.

  • Main program and subroutines. The typical C-lang program architecture, procedure-oriented programming; usually seen in simple tools. The opposite side is object-oriented programming.

  • Dataflow architecture. Still procedure-oriented programming, typically represented by a dataflow diagram. The architecture is useful for data processing, especially in chips/FPGA and image processing. Pipes and filters is a companion architecture.

  • MVC (Model-view-controller). The fundamental architecture to build UI. It separates data, representation, and business logic. It gets richer variants in richer client UI, e.g. React.

  • Client-server. The style where an old-fashioned client device connects to a server. Nowadays it’s usually the Web, REST APIs, or SOA instead. But the architecture is still useful in IoT, as a host agent reporting status to / receiving commands from a central server, or as a rich client lib to speed up system interactions.

  • The mediator. Suppose N components connect to M components; instead of N * M connections, a “mediator” component is introduced in the middle to turn them into N + M connections.

  • Event sourcing. The user sends commands, and every system change is driven by an event. The system stores the chain of events as the central truth. Realtime state can be derived by event replay, and sped up with checkpoints. The system naturally supports auditing, and is append-only and immutable.

  • Functional programming. This is more an ideal methodology than a concrete architecture. Variables are immutable; system states are instead defined by a chain of function calls, i.e. by math formulas, a bit like event sourcing. Functions are thus first-class citizens.

Common architecture styles
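As one concrete instance of the styles above, event sourcing can be sketched in a few lines of Python. This is a minimal, hypothetical example (the bank-balance events are invented for illustration): the append-only event log is the source of truth, and the current state is a fold over it, optionally starting from a checkpoint.

```python
# Minimal event-sourcing sketch: state is derived by replaying events.
def apply(balance, event):
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrew":
        return balance - amount
    raise ValueError(f"unknown event {kind!r}")

def replay(events, checkpoint=0):
    """Fold the append-only event log into the current state.
    A checkpoint lets replay start from a saved intermediate state."""
    state = checkpoint
    for event in events:
        state = apply(state, event)
    return state

log = [("deposited", 100), ("withdrew", 30), ("deposited", 5)]
```

Because the log is immutable and ordered, auditing is the log itself, and a checkpoint plus the log suffix always reproduces the same state as a full replay.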

More recent architectures

More recent architectures are below. You can see that architectures vary on: how to cut boundaries (e.g. fine-grained levels, offloading to cloud); natural structures (e.g. layered, event & streaming, business logic, model-view UI); and the gravity of complexity (e.g. complex structures, performance, consistency, managing data, security & auditing, loose communication channels).

  • Microservice. Complex systems are broken into microservices interacting via REST APIs. Typical examples are Kubernetes and Service Mesh. Yet you need an even more complex container infrastructure to run microservices: an SDN controller and agents for virtual networking, HA load balancers to distribute traffic, circuit breakers to protect from traffic surges, a service registry to manage REST endpoints, a Paxos quorum to manage locking and consistent metadata, persistent storage to provide disk volumes and database services, … Below is the Netflix microservice architecture as an example.

Netflix microservice architecture

  • Stream processing. Upstream and downstream systems, company-wide, are connected via messaging queues or low-latency streaming platforms. Nowadays enterprises are moving from the Lambda architecture (realtime approximate streaming and delayed accurate batching, kept separate) to the Kappa architecture (both combined into streaming, with consistent transactions). A more complex system can comprise online, nearline, and offline parts, as in the picture below.

Netflix online/nearline/offline architecture

  • Cloud native. The system is designed to run exclusively on cloud infrastructure (possibly hybrid cloud). The typical example is the Snowflake database. Key designs are: 1) Disk file persistence is offloaded to S3. 2) Memory caching, query processing, and storage are disaggregated, and can independently scale out and stay elastic to traffic surges. 3) The read path and write path can scale separately; typical users generate write content at steady throughput and read traffic in spikes. 4) Different tiers of resources, being fully disaggregated, can accurately bill for how much a customer actually uses. Serverless is another topic, where all the heavy parts like the database and the programming runtime are shifted to the cloud; programmers focus on writing functions that carry business value, lightweight and elastic to traffic. Below is the Snowflake architecture as an example.

Snowflake architecture

  • DDD onion architecture. The onion (or hexagon) architecture took shape in the context of DDD. The domain model is the central part. The next layer outside is the applications. The outer layer is the adapters that connect to external systems. The onion architecture is neutral to the actual technical architecture implementing it. Domain models can also be wired to test cases to easily validate business logic (rather than the verbosity of preparing a testbed with fake data in databases, fake REST interfaces, etc).

DDD onion architecture

  • React-Redux. The architecture is a more advanced version of MVC. With data pulled from the server side, JavaScript on the client side runs MVC itself. Views are constructed from templates + input properties. User actions generate events, which trigger actions, e.g. service calls. New updates are sent to the reducer, which then maps them to the store. The container uses selectors to fetch state from the store, maps it to properties, and finally renders the new view. The architecture is frequently accompanied by Electron and NodeJS to develop rich client apps with web technologies.

React-Redux architecture
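The reducer-and-store loop above is language-agnostic, so here is a minimal sketch of it in Python for brevity (real Redux is JavaScript; the `increment` action and field names are invented for illustration): a pure reducer maps (state, action) to a new state, and the store holds state and applies dispatched actions.

```python
# Redux-style loop, sketched in Python: a pure reducer and a tiny store.
def reducer(state, action):
    if action["type"] == "increment":
        # Return a NEW state dict; the old state is never mutated.
        return {**state, "count": state["count"] + action.get("by", 1)}
    return state  # unknown actions leave state unchanged

class Store:
    def __init__(self, reducer, initial):
        self._reducer = reducer
        self.state = initial

    def dispatch(self, action):
        self.state = self._reducer(self.state, action)
```

The purity of the reducer is the point: given the same state and action it always produces the same next state, which makes time-travel debugging and replay possible.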

General architecture principles

Architecture level

Most principles are already reflected in the sections above. At the architecture level, the most mentioned principles are the below three (Architecture 3 principles):

  • Keep it simple. There is enough complexity already; simplicity is precious. Related to KISS.

  • Suitable. Enough for the need is better than “industry leading”. An architecture should be suitable: steer it by your concrete requirements and team resources, rather than vainly pursuing new technologies. Be frugal. The benefit of a design should be mapped to financial cost for evaluation.

  • Designed for evolving. Business needs change. Traffic scale increases. Team members come and go. Technologies update. An architecture should be designed to evolve, and the architecture process (and development) carried out with a growth mindset. An example is the Ele.me Payment System, which is quite typical for Internet companies.

Component level

More principles come into play at component-level design. CoolShell design principles lists them well. Below are the ones I find most useful:

  • Keep It Simple, Stupid (KISS), You Ain’t Gonna Need It (YAGNI), Don’t Repeat Yourself (DRY), Principle of Least Knowledge, Separation of Concerns (SoC). In short: make everything simple. If you cannot, divide and conquer.

  • Object-oriented S.O.L.I.D: Single Responsibility Principle (SRP), Open/Closed Principle (OCP), Liskov Substitution Principle (LSP), Interface Segregation Principle (ISP), Dependency Inversion Principle (DIP). Note that though OO principles try to isolate concerns and make changes local, refactoring and maintaining the system in such a state still involves knowledge of the global dependency graph.

  • Idempotent. Not only APIs; system operations should be idempotent when replayed, i.e. reentrant. A distributed system commonly loses messages and retries. Examples of idempotency: sync rather than update (sync a command to a node, which stays consistent after the node fail-recovers and re-executes); propagating info with eventual consistency in one direction; re-executing actions without side effects; goal states, commonly used in deployment and config change.

  • Orthogonality. Component behaviors are totally isolated from each other. They don’t assume any hidden behavior of one another; they work no matter what the others output. Not only the code paths, but also the development process can be orthogonal, given a wise cut of components. Orthogonality greatly saves mind burden, communication cost, and the ripple impact of changes.

  • Hollywood Principle: don’t call us, we’ll call you. A Component doesn’t new components; it’s the Container that manages Component creation and initialization. This is inversion of control, or dependency injection. Examples are Spring IoC/DI and AspectJ AOP. Dependencies should point in the more stable direction.

  • Convention over Configuration (CoC). Properly set default values to save the caller the effort of always passing in comprehensive configurations. This principle is useful for designing opensource libs, e.g. Rails. However, large-scale production services may require explicit and tight control of configuration, and the ability to change it dynamically. Microsoft SDP is an example.

  • Design by Contract (DbC). A component/class should work by its “naming”, i.e. its contract, rather than its implementation. A caller should call a component by its “naming”, without the effort of looking into its internals. The principle maps to: objects should work with objects at the same abstraction level, and respect responsibilities.

  • Acyclic Dependencies Principle (ADP). Try not to create cyclic dependencies among your components. Ideally, yes. In practice, cyclic dependency still happens, e.g. when multiple sub-systems are brokered by a message queue. Essentially, components need interaction, just like people.
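The goal-state flavor of idempotency above can be shown in a few lines. Below is a hypothetical Python sketch (the config fields are invented for illustration): the caller declares the desired state, and applying it once or many times converges to the same result, so retries after lost messages are safe.

```python
# Idempotency via goal states: applying the same goal repeatedly
# converges to the same config, which makes retries harmless.
def apply_goal_state(current, goal):
    """Return a new config equal to current overridden by goal.
    Calling this N times with the same goal yields the same result as once."""
    new = dict(current)
    new.update(goal)
    return new

cfg = {"replicas": 1, "region": "eu"}
goal = {"replicas": 3}
cfg = apply_goal_state(cfg, goal)
cfg = apply_goal_state(cfg, goal)  # a retry: no further change
```

Contrast this with a non-idempotent command like "add 2 replicas", where a retried message would overshoot the target.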

Class level

Coming to the class level or lower component level, the principles can be found in Coding Styles, Code Refactoring, and Code Complete; this article won’t cover them. However, it’s interesting to ask whether a piece of code is a good design, which people frequently argue about at length without agreement. In fact, several distinct design philosophies all apply, as can be seen in diverged opensource codebases and programming language designs. To end the arguing, the practical principles are:

  • Compare concrete benefits/costs to team and daily work, rather than design philosophies.

  • Build the comparison on concrete real usecases, rather than blindly forecasting the future for design extensibility.

About OO design and Simple & direct

Continuing the above discussion about evaluating whether a piece of code is a good design: the key concern should be whether it saves mind burden across the team. There are generally two paradigms, OO design and Simple & direct, and they work in different ways.

  • OO design reduces mind burden because good OO design metaphors (e.g. patterns) are a shared language across the team. The principle fails if they are actually not shared, which should be verified. One person’s natural modeling may not be another’s; code one person feels natural can become another’s mind burden. A top team member can quickly generate code in her natural OO design, but the new code becomes mind burden for others and slows them down. The condition self-enhances and makes the “top” member ever more “top”. Consistency is thus desirable, because it extends what is shared.

  • OO design does increase complexity. It introduces more parts from the beginning. More interactions become hidden and dynamic. A change can force refactoring more parts to maintain high cohesion low coupling. Things become worse with performance considerations: decoupling generally hurts performance, so more parts need to be introduced to compensate, e.g. caching; more moving parts touched means a larger scope to maintain for production safety and correctness. Over-design is the next problem behind it. OO design essentially works by forecasting the future to make changes extensible; however, the forecast is frequently wrong, and the extra code becomes new burden.

  • Simple & direct. Compared to OO design, which is frequently applied in app-level programming, “simple & direct” is used more in system-level and data-plane programming. The interfaces supported by programming languages, which are the core that OO design relies on, are frequently unable to capture all the information to pass. Examples are performance aspects (cache lines, extra calls, memory management, etc), handling worst cases, safety & security concerns, and fragile side effects that touch system data structures (if you are programming an OS).

    • Encapsulation. OO favors encapsulation, but encapsulation doesn’t work well on the performance plane. CPU, cache, memory, and disk characteristics live at the lowest level, yet they propagate to every layer of software design. Another anti-encapsulation example is debugging & troubleshooting: we have to tear down the fine details of every layer to locate a bug, where encapsulation adds more barriers. But OO is effective for business logic design, which is complex and volatile, and pulled less by performance.
  • Thus in the simple & direct paradigm, people frequently need to read all the code in a workflow and grasp every detail to make sure it works correctly. People also need to read the code to make sure every corner case is handled, all scenarios are covered, and worst cases and graceful degradation are cared for. Then: less code, less burden; everything simple, direct, and transparent is better. Mind burden is reduced this way. OO design, however, makes this harder, because the subtle aspects to capture are hidden in layers of encapsulation and linked by dynamic binding, and there is yet more code and more components to introduce.

  • What are the gravity and traction of the project being developed? Apps with rich and varying logic lean toward OO design, while system level and data plane usually have more stable interfaces and feature sets, but stronger traction toward performance and safety. Sometimes the developer is willing to break every OO rule as long as COGS can be saved. Besides, encapsulation hinders developers from having control over the overall perf numbers and calls.

  • Prioritization. Can the new code go to production? Perf under goal: no. Production safety concerns: no. Bad OO design: OK. Thus, the design should first consider perf and safety, and then OO design. However, OO design naturally prioritizes design first, and pushes off goals like performance and worst-case handling to future extension. Besides this priority inversion, extension may turn out to be hard after the interface is already running in production.

About optimized algorithms and robust design

A discussion similar to “OO design vs Simple & direct” is whether to use a fast, low-overhead algorithm or a simple & direct solution.

  • An optimized algorithm usually leverages the most of unique workload characteristics. In other words, it carries the most assumptions and retains the least information. Though it reduces overhead, it tends to specialize the code too early (premature optimization), and thus breaks easily when a new feature is added, the problem scope slightly changes, or more inputs or outputs are considered. Optimized algorithms favor areas where the problem rarely changes.

  • Robust design means tolerating quick, volatile feature changes. The simple & direct solution plays nicely here, because it carries few assumptions and retains the information flow through layers. Without tricks, the code implements the solution just as you would describe it in plain human language. Performance optimization is left to hotspots located with diagnostic tools.

  • Trade off. Optimized algorithms and robust design have fundamental conflicts. As a balanced trade-off, optimized algorithms are usually localized into smaller, specialized scopes, while robust design expands to most parts of the system. Though the former attracts people as being “core”, it implies a smaller adoptable scope, more likelihood of being replaced by reuse, and less chance of cross-domain intersection.

About analytical skills in designing

Proposing a reasonable design solution and making the right decisions require analytical skills. Some can be learned from the consulting field. They help dissect complex problems and navigate through seemingly endless arguments.

Problem to solve

Finding the right problem to solve, and defining it, are non-trivial.

  • Firstly, there should be a base solution to compare the proposed solution with.

  • Secondly, measure the size of the problem in a data-driven way, e.g. X% of data is affected at Y% cost. Match the problem size with the worth of the effort to solve it.

  • Thirdly, pros/cons should trace back to fundamental pillars such as COGS saving, less dev effort, new user features, user SLA improvement, etc. Multi-dimensional trade-offs are translated into market money for comparison. These are the criteria for decision making.

  • Avoid vague words such as “better”, “smarter”, “more efficient”, “easier”, “new technology”, “too slow”, “too complex”, etc.

Linear thinking

Smart people like jumping thoughts, but decision making requires linear thinking.

  • Start from a base solution that is simple & direct, and then move step by step to the proposed solution. The key is to identify hidden jumps; every jump must be justified.

  • Then ask: 1) why choose to move this step (the problem to solve); 2) what new assumptions are added to support the move; 3) what in the information flow is lost or distorted through the step.

  • Systematically, these questions identify all the missing paths. Eventually they compose a MECE analysis tree, ensuring the full spectrum of potential solutions (the design space) is explored. A data-driven approach then leads to the single best choice.

  • These questions also help identify potential trade-offs. Hidden assumptions and distorted information flow are what make adding new features harder. They also make it easier to introduce bugs.

Technology design spaces - Overview

Software architecture shares common system properties, e.g. CAP. To achieve them, different techniques were invented and have evolved into more general architecture design patterns. Plotting them on the map of their driving factors reveals the landscape of the technology design space, which we explore and navigate when building new systems. I’ll focus on distributed storage.

Sources to learn from

Articles, books, and courses teach design patterns and outline the design spaces:

  • The MartinFowler site’s Patterns of Distributed Systems. With the adoption of cloud, patterns like Consistent Core and Replicated Log are gaining popularity. Besides that article, Service Registry, Sidecar, Circuit Breaker, and Shared Nothing are also popular patterns.

  • Cloud Design Patterns from the Azure docs also summarizes common cloud-native app design patterns. They are explained in detail, and fill in the ones missing from the above.

  • The book Designing Data-Intensive Applications shows the challenges, solutions, and techniques in distributed systems. They map to design patterns and combine into a design space.

  • The course CMU 15-721 outlines the key components of database design, e.g. MVCC, data compression, query scheduling, joins. The breakdown reveals the design space to explore. The attached papers further tour established design patterns in depth. Highly valuable.

  • On Designing and Deploying Internet-Scale Services. The article is comprehensive and in-depth, and covers every aspect of best practices for building Internet-scale services. Highly valuable. It reminds me of SteveY’s comments.

Recognized opensource and industry systems become the reference architectures from which to learn prevalent techniques and design patterns. I listed what I quickly recall (it can be incomplete). Reference architectures can be found by searching for top products, comparing vendor alternatives, or from cornerstone papers with high citation counts.

The related-works sections of good papers are useful to compare contemporary works and reveal the design space. For example,

  • The TiDB paper’s and Greenplum paper’s related works show how competing market products support HTAP (Hybrid Transactional and Analytical Processing), coming from either a prior OLTP or OLAP base. They also reveal the techniques employed and their pros/cons.

Good papers and surveys can enlighten the technology landscape and reveal the design space in remarkable depth and breadth:

LSM-based Storage Techniques: A Survey

Scaling Replicated State Machines with Compartmentalization

An Empirical Evaluation of In-Memory Multi-Version Concurrency Control

In-Memory Big Data Management and Processing

Constructing and Analyzing the LSM Compaction Design Space

Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree

Reference architectures in storage areas

(Continued from the previous Sources to learn from section.)

Cache
  • Redis is the de-facto opensource in-memory cache used in most Internet companies. Compared to Memcached, it supports rich data structures. It adds checkpointing and per-operation logging for durability. Data can be sharded to a cluster of primary nodes, then replicated to secondary nodes. Tendis further adds cold data tiering and other optimizations.

  • Kangaroo cache (from the long thread of Facebook work on Scaling Memcached, CacheLib, and RAMP-TAO cache consistency) features an in-memory cache with a cold tier on flash. Big objects and small objects are handled separately. The small-object path combines append-only logging and set-associative caching to achieve the optimal trade-off between DRAM index size and write amplification. Kangaroo also uses a “partitioned index” to further reduce KLog’s in-memory index size.
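As a rough illustration of the set-associative idea (a hypothetical sketch, not Kangaroo’s actual layout): a key may only live in one small hash-chosen set of slots, so a lookup scans a few slots instead of consulting a full per-object DRAM index.

```python
import hashlib

class SetAssociativeCache:
    """Sketch of a set-associative cache: each key hashes to exactly one
    small set of `ways` slots, so no global in-memory index is needed --
    only the chosen set is scanned on lookup."""

    def __init__(self, num_sets=1024, ways=4):
        self.ways = ways
        self.sets = [[] for _ in range(num_sets)]   # each set: [(key, value)]

    def _set_of(self, key):
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return self.sets[h % len(self.sets)]

    def put(self, key, value):
        s = self._set_of(key)
        s[:] = [(k, v) for k, v in s if k != key]   # drop stale entry
        s.insert(0, (key, value))
        if len(s) > self.ways:
            s.pop()                                  # evict FIFO within the set

    def get(self, key):
        for k, v in self._set_of(key):
            if k == key:
                return v
        return None

cache = SetAssociativeCache(num_sets=8, ways=2)
cache.put("a", 1)
cache.put("b", 2)
```

The trade-off: limiting each key to one set shrinks index memory, at the cost of possible premature evictions when a set overflows.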

  • BCache is a popular SSD block cache used in Ceph. Data is allocated in “extents” (like a filesystem), and then organized into bigger buckets. An extent is the unit of compression. A bucket is sequentially appended until full and is the unit of GC reclaim. Values are indexed by a B+-tree (unlike KLog in Kangaroo, which uses hashtables). The B+-tree uses large 256KB nodes. Node internals are modified by log-structured appends. B+-tree structural changes are done by COW and may recursively rewrite every node up to the root. Journaling is not a necessity because of COW, but is used as an optimization to batch and sequentialize small updates.

(Distributed) Filesystem

  • BtrFS is the Linux single-node filesystem that indexes inodes with a B-tree, updates with copy-on-write (COW), and ensures atomicity with shadow paging. Other contemporaries include XFS, which also indexes by B-tree but updates with overwrite; and EXT4, the default Linux filesystem, where a directory inode is a tree index to file inodes, and which employs write-ahead journaling (WAL) to ensure update (overwrite) atomicity.

  • CephFS introduces the MDS to serve filesystem metadata, i.e. directories, inodes, caches; persistence is backed by the object storage data pool and metadata pool. It features dynamic subtree partitioning and Mantle load balancing. Cross-partition transactions are done by MDS journaling. The MDS acquires locks before updates.

  • HopsFS builds a distributed filesystem on HDFS. The Namenode becomes a stateless quorum, with metadata offloaded to an in-memory NewSQL database. Inodes are organized into an entity-relation table, and partitioned to reduce the number of servers touched by an operation. Cross-partition transactions, e.g. rename and rmdir, are backed by the NewSQL database with hierarchical locking. Subtree operations are optimized to run in parallel.

  • HDFS is the distributed filesystem for big data. It relaxes the POSIX protocol, favors large files, and runs primary/backup Namenodes to serialize transactions. HDFS started as the opensource version of Google Filesystem (which, with BigTable and Chubby, started the cloud age), and became so successful that it is now the de-facto shared protocol for big data filesystems, databases (e.g. HBase), SQL (e.g. Hive), stream processing (e.g. Spark), and datalakes (e.g. Hudi), for both opensource and commercial (e.g. Isilon) products.

Object/Block Storage

  • Ceph provides distributed block storage and object storage (and CephFS for distributed filesystem). Ceph made opensource scaleout storage possible, and dominates (per the Ubuntu OpenStack storage survey) the OpenStack ecosystem. It features the CRUSH map, which saves metadata through hash-based placement. It converges all object/block/file serving in one system. Node metadata is managed by a Paxos quorum (Consistent Core) to address CAP. Ceph stripes objects and updates in place, which requires single-node transactions. Ceph later built BlueStore, a customized (Ceph 10-year lessons) local store, optimized for SSD, that solved the double-write problem: metadata is separated out (delegated to RocksDB) from key/value data (like Wisckey); big writes become append-only, and small overwrites are merged into the WAL (write-ahead log).

  • Azure Storage is industry-level public cloud storage infrastructure. It is built on the Stream layer, a distributed append-only filesystem, and uses the Table layer, which implements a scaleout table schema, to support VM disk pages, object storage, and message queues. Append-only simplifies update management but makes Garbage Collection (GC) more challenging. The contemporary AWS S3 seems instead to follow Dynamo, i.e. update in-place and sharding by consistent hashing. For converging object/block/file storage, Nutanix shares a similar thought by running storage and VMs on one node (unlike remotely attached SAN/NAS).

  • Tectonic is similar to Azure Storage. It hash-partitions metadata to scale out. It employs Copyset Placement. It consolidates Facebook Haystack/F4 (object storage) and the Data Warehouse, and introduces extensive multitenancy and resource throttling. Another feature of Tectonic is decoupling common background jobs, e.g. data repair, migration, GC, node health, from the metadata store into background services. TiDB would share a similar thought had it moved the Placement Driver out of the metadata server.

  • XtremIO builds a full-flash block storage array with innovative content-based addressing. Data placement is decided by content hash, so deduplication is naturally supported. Though accesses are randomized, they run on flash. A write is acked after two copies land in memory. Other contemporaries include SolidFire, which is also scaleout; and Pure Storage, which is scale-up and uses dual controllers sharing disks.

Data deduplication

  • Data Domain builds one of the most famous data deduplication appliances. It recognizes mid-file inserts by rolling-hash variable-length chunking. Fingerprint caching is made efficient via Locality Preserved Caching, which works perfectly with backup workloads.

  • Ceph dedup builds a scalable dedup engine on Ceph. Ceph stores deduplicated chunks keyed by hash fingerprint. A new metadata pool is introduced to look up the object-id-to-chunk map. The dedup process runs offline with throttling. The two-level indirection pattern can also be used to merge small files into large chunks.

Archival storage

  • Pelican is rack-scale archival storage (also called cold storage or near-line storage), co-designed with hardware to reduce disk/CPU/cooling power by keeping only 8% of disks spinning. Data is erasure coded and striped across disk groups. Flamingo continues the research from Pelican: it generates the best data layout and IO scheduler config per Pelican environment setup. Archival storage gains adoption from government compliance needs, and with AWS Glacier.

  • Pergamum co-designs hardware, as an appliance, to keep 95% of disks powered off at all times. NVRAM is added per node, holding signatures and metadata, to allow verification without waking up disks. Data is erasure coded intra- and inter-disk. Note that the Tape Library is still an attractive archival storage medium due to improvements in cost per capacity, reliability, and throughput.

OLTP/OLAP database

  • CockroachDB builds a cross-regional SQL database with serializable ACID, an opensource version of Google Spanner. It overcomes the TrueTime dependency by instead using a Hybrid Logical Clock (HLC). CockroachDB maps the SQL schema to key-value pairs (KV) stored in RocksDB. It uses Raft to replicate partition data. It built novel Write Pipelining and Parallel Commit to speed up transactions. Another contemporary is YugabyteDB, which reuses PostgreSQL for the query layer and replaces RocksDB with DocDB, and had an interesting debate with CockroachDB (YugabyteDB challenges CockroachDB, Zhihu YugabyteDB/CockroachDB debate, CockroachDB rebuts YugabyteDB).

  • TiDB is similar to CockroachDB. It focuses on a single region and serializes with a timestamp oracle server. It implements transactions following Percolator. TiDB moved a step further to combine OLTP/OLAP (i.e. HTAP) by Raft-replicating an extra columnar replica (TiFlash) from the baseline row-format data. Among contemporaries (see Greenplum’s related works) supporting both OLTP/OLAP, besides HyPer/MemSQL/Greenplum: Oracle Exadata (OLTP) improves OLAP performance by introducing NVMe flash, RDMA, and an in-memory columnar cache; AWS Aurora (OLTP) offloads OLAP to parallel processing on the cloud; F1 Lightning replicates data from an OLTP database (Spanner, F1 DB) and converts it into columnar format for OLAP, with snapshot consistency.

  • OceanBase is a distributed, MySQL-compatible SQL database supporting both OLTP/OLAP with a hybrid row-column data layout. It uses a central controller (Paxos replicated) to serialize distributed transactions. The contemporary X-Engine is a MySQL-compatible LSM-tree storage engine, used by PolarDB. X-Engine uses FPGAs for compaction. Read/write paths are separated to tackle traffic surges. X-Engine also introduced the Multi-staged Pipeline, where tasks are broken small, executed async, and pipelined, which resembles SeaStar. PolarDB features pushing down queries to Smart SSD (Smart SSD paper), which computes within the disk enclosure to reduce filter output. Later, PolarDB Serverless moved to a disaggregated cloud-native architecture like Snowflake.

  • AnalyticDB is Alibaba’s OLAP database. It stores data on shared Pangu (HDFS++) and schedules jobs via Fuxi (YARN++). Data is organized in a hybrid row-column layout (columnar within row groups). Write nodes and read nodes are separated to scale independently. Updates are first appended as incremental deltas, then merged off the write path, with indexes built on all columns. The baseline + incremental scheme resembles the Lambda architecture.

  • ClickHouse is a recent OLAP database quickly gaining popularity, known as “very fast” (why ClickHouse is fast). Besides the common columnar format, vectorized query execution, and data compression, ClickHouse is made fast by “attention to low-level details”. ClickHouse supports various indexes (besides full scan). It absorbs updates via MergeTree (similar to an LSM-tree). It doesn’t support (full) transactions, given its OLAP scenario.

  • AWS Redshift is the new-generation cloud-native data warehouse based on PostgreSQL. Data is persisted in S3 and cached on local SSD (which is like Snowflake). Query processing nodes are accelerated by the AWS Nitro ASIC. It is equipped with modern DB features: code generation with an external compilation cache, vectorized SIMD scans, AZ64 encoding, Serial Safety Net (SSN) transaction MVCC, machine-learning-backed auto tuning, semi-structured queries, federated queries to datalakes and OLTP systems, etc.

  • Log is database 1 / Log is database 2 / Log is database 3. The philosophy was first seen in AWS Aurora Multi-master. Logs are replicated as the single source of truth, rather than syncing pages. The page server is treated as a cache that replays logs. In parallel, CORFU and Delos build the distributed shared log as a service. Helios Indexing, FoundationDB, and HyderDB build databases atop shared logging.
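A minimal sketch of the “log is the database” pattern (class and record layout invented for illustration): the shared log is the only source of truth, and a page server is just a cache that replays log records up to the requested LSN.

```python
class PageServer:
    """Toy page server for the 'log is the database' pattern: pages are a
    cache materialized by replaying the shared log of (lsn, page_id, value)
    records. Assumes the log is in LSN order and at_lsn is non-decreasing."""

    def __init__(self, log):
        self.log = log              # shared replicated log: [(lsn, page_id, value)]
        self.pages = {}             # materialized page cache
        self.applied_lsn = 0

    def read_page(self, page_id, at_lsn):
        # Replay any not-yet-applied records up to the requested LSN.
        for lsn, pid, value in self.log:
            if self.applied_lsn < lsn <= at_lsn:
                self.pages[pid] = value
                self.applied_lsn = lsn
        return self.pages.get(page_id)

log = [(1, "p1", "A"), (2, "p2", "B"), (3, "p1", "A'")]
server = PageServer(log)
```

A second page server over the same log would converge to the same pages, which is the point: sync the log, not the pages.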

In-memory database

  • The HyPer in-memory database has many recognized publications. It pioneered compiled query execution with code generation, where LLVM is commonly used to compile the IR (intermediate representation); and features Morsel-driven execution scheduling, fork() to create OLAP snapshots from OLTP, and many other aspects. Other contemporaries include SAP HANA, which combines OLTP/OLAP (with a delta structure) and supports rich analytics; MemSQL, which supports OLTP/OLAP by keeping both row/columnar formats; and Greenplum, which extended PostgreSQL to MPP, added GemFire (GemFire is used by 12306.cn) for in-memory processing, and added OLTP after OLAP with performance improvements and resource isolation.

  • Hekaton is the in-memory DB engine for Microsoft SQL Server. It features the lock-free Bw-tree, which works by appending deltas and merging. The Bw-tree needs a Page Mapping Table (LLAMA) for atomic page updates and to avoid propagating page-id changes to parent nodes. The Bw-tree’s SSD component can also be append-only, with “Blind Incremental Update” in DocumentDB. Hekaton also has Project Siberia to tier cold data: adaptive filters tell whether data exists on cold disk, and cold classification is done offline on logged samples of record accesses.

  • The ART tree is one of the popular indexes (e.g. in HyPer) for in-memory databases (and also PMEM). It’s essentially a radix tree with adaptive node sizes. Other contemporaries include Masstree, a trie of B+-trees with a collection of optimization techniques; the Bw-tree; and the Be-tree, which uses per-node buffers to absorb random updates, and is adopted by VMWare for copying files. For filtering, besides the commonly used BloomFilter, SuRF additionally supports range queries but with high update cost.

  • FaRM builds a scaleout in-memory database with fast serializable transactions on RDMA and UPS-protected PMEM. The CPU bottleneck is overcome by reducing message counts, one-sided RDMA reads/writes, and exploiting parallelism. Data is sharded. Distributed transactions are implemented with 2PC; locks are persisted in the logs of each partition’s primary node; reads are lock-free; the coordinator has no persistent state. Zookeeper is used to maintain node membership. Objects are accessed via keys (pointer addresses). The follow-up work A1 builds a graph database atop FaRM, and handles RDMA congestion with DCQCN.

  • Silo builds an OCC serializable transaction commit protocol with epoch-based group commit, indexed by Masstree. Manycore (40+ CPU cores) significantly changes concurrency design in HPC, in-memory, and PMEM systems; e.g. Linux kernel manycore and filesystems manycore. Besides custom latching & fencing, frequently used techniques include Epoch-based Reclamation (e.g. in Masstree), Sloppy Counters, Flat Combining, and Shared Nothing. Epoch-based Reclamation groups frequent memory operations into larger, infrequent epochs; threads work on local memory, except that the GC thread touches all memory after an epoch becomes inactive. RCU is similar: after all transactions pass the low-watermark epoch, older DB record versions can be reclaimed. A Sloppy Counter splits reference counting into a global counter and per-core counters, so most operations happen thread-local. In Flat Combining, worker threads publish requests thread-locally, then compete for a global CAS (compare-and-set), and the single winner batches and executes all requests. Shared Nothing is the silver bullet for high concurrency, as long as the system can be designed this way (e.g. NetApp Waffinity).
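To make the Sloppy Counter idea concrete, here is a minimal Python sketch (class and threshold are illustrative, not from any of the cited systems): threads accumulate thread-local counts and only occasionally flush to the contended global counter.

```python
import threading

class SloppyCounter:
    """Sketch of a sloppy counter: each thread increments a thread-local
    count and flushes to the shared global counter only every `threshold`
    increments, trading read accuracy for far less lock contention."""

    def __init__(self, threshold: int = 1024):
        self.threshold = threshold
        self.global_count = 0
        self.global_lock = threading.Lock()
        self.local = threading.local()      # per-thread storage

    def increment(self):
        count = getattr(self.local, "count", 0) + 1
        if count >= self.threshold:
            with self.global_lock:          # rare contended path
                self.global_count += count
            count = 0
        self.local.count = count            # common uncontended path

    def read(self) -> int:
        """Approximate: misses increments not yet flushed by other threads."""
        with self.global_lock:
            return self.global_count

counter = SloppyCounter(threshold=100)
threads = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
# here 1000 is a multiple of the threshold, so every local count was flushed
```

In a real kernel-style implementation the threshold bounds how stale a read can be: at most `threads * threshold` increments may be unflushed.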

NoSQL database

  • RocksDB is the de-facto LSM-tree implementation for single-node key-value stores. It is used as the KV backend in many systems, e.g. MySQL MyRocks, CockroachDB on RocksDB, TiDB on RocksDB, BlueStore and RocksDB. It is also frequently used at Internet companies (RocksDB FAQ). RocksDB features Universal Compaction, SSD optimizations, and Remote Compaction (offloading compaction to cloud-based shared storage). In the tiering approach, PebblesDB inserts increasingly more SST “Guards” at each LSM-tree level, which work like a skip list to constrain and index SST file key ranges, thus reducing read amplification.
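A toy LSM-tree sketch (illustrative only, far simpler than RocksDB) showing the core mechanics: a memtable flushes into immutable sorted runs, reads check newest data first, and compaction merges runs while keeping only the newest version per key.

```python
class TinyLSM:
    """Toy LSM-tree: an in-memory memtable flushes to immutable sorted
    'SST' runs; reads check the memtable first, then runs newest-to-oldest;
    compaction merges all runs into one."""

    def __init__(self, memtable_limit: int = 4):
        self.memtable = {}
        self.runs = []                  # newest first; each run: sorted [(k, v)]
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.insert(0, sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:           # newer runs shadow older ones
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all runs, keeping only the newest version of each key."""
        merged = {}
        for run in reversed(self.runs): # oldest first, so newer overwrites
            merged.update(run)
        self.runs = [sorted(merged.items())]

db = TinyLSM()
for i in range(8):
    db.put(f"k{i % 5}", i)              # keys k0..k4; later writes overwrite
db.compact()
```

Note what is deliberately missing versus a real engine: tombstones for deletes, Bloom filters, leveled/tiered compaction policies, and a WAL for crash safety.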

  • FoundationDB supports ACID transactions in a distributed KV store. The transaction implementation is backed by a shared logging system. The Control Plane, Transaction, Shared Logging, and Storage subsystems are decoupled. FoundationDB also builds fast recovery leveraging the shared log. Besides, FoundationDB features Deterministic Simulation Testing built with Flow.

  • MongoDB is the de-facto JSON document database, one of the most successful opensource databases, and it went IPO. MongoDB became popular because of its ease of use. It scales out by sharding (range/hash partitioning) and achieves HA (high availability) via replica sets (1 write + N read replicas).

  • HBase is the opensource version of BigTable. The table is range partitioned, with metadata managed by ZooKeeper (the opensource version of Chubby, i.e. Paxos + Replicated State Machine + namespace indexing). The partition server employs an LSM-tree to manage updates, with the usual parts: MemTable, HFile, Compaction. HBase features a variable column schema, retrieving values by timestamp versions, and per-row atomic operations. Cross-partition transactions can be built atop it with Percolator. HBase has become the de-facto big-table-schema database on HDFS, and serves as the backend for higher level systems serving SQL, time-series, block, etc. ByteDance has customized implementations of BigTable and Spanner as ByteSQL and Bytable. Alibaba customized HBase and published Lindorm.

  • Cassandra follows the peer-to-peer (P2P) cluster management from Dynamo, while DynamoDB (paper) is the AWS commercial offering that also follows Dynamo. It has no dedicated metadata quorum; instead it carries metadata on peer nodes and propagates it with a Gossip protocol. It supports a big table schema where a primary key is required. Keys are partitioned and placed by Consistent Hashing to avoid data churn when nodes join/leave. Cassandra employs quorum writes/reads (e.g. write to N replicas, read from N/2+1 replicas) to ensure durability and version consistency. Similar P2P cluster management can be found in Service Fabric, which hosts microservices and has extensive mechanisms for member node ring consistency.
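A minimal consistent-hash ring with virtual nodes, in the Dynamo style (a sketch with invented names, not Cassandra’s implementation): adding a node moves only the keys that fall into the new node’s tokens, roughly 1/N of the total.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (vnodes):
    each physical node owns many tokens, smoothing load distribution."""

    def __init__(self, vnodes_per_node: int = 64):
        self.vnodes = vnodes_per_node
        self.ring = []                       # sorted list of (token, node)

    def add_node(self, node: str):
        for i in range(self.vnodes):
            bisect.insort(self.ring, (_hash(f"{node}#{i}"), node))

    def lookup(self, key: str) -> str:
        """A key belongs to the first token clockwise from its hash."""
        idx = bisect.bisect(self.ring, (_hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing()
for n in ("node-a", "node-b", "node-c"):
    ring.add_node(n)
owner_before = {k: ring.lookup(k) for k in map(str, range(1000))}

ring.add_node("node-d")                      # scale out by one node
owner_after = {k: ring.lookup(k) for k in map(str, range(1000))}
moved = sum(owner_before[k] != owner_after[k] for k in owner_before)
# only keys claimed by node-d move; everything else stays put
```

Contrast with naive `hash(key) % num_nodes` placement, where changing `num_nodes` remaps almost every key.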

  • ElasticSearch originated as a full-text search engine based on Apache Lucene and, being so popular, evolved into a scalable database of JSON documents, logging, time-series, and geospatial data with strong search support. ElasticSearch scaleout is managed with primary-secondary replication and hash sharding. ElasticSearch was previously also known for the ELK stack.

  • InfluxDB is a popular time-series database. Compared to SQL databases, a time-series database exploits fixed data organization and query patterns. Metric dimensions can be aggregated to tackle high ingress volume, and re-sampled to tier data. Another contemporary is OpenTSDB, which supports time-series atop HBase. Time-series databases are frequently used in monitoring (e.g. Prometheus) and IoT (e.g. InfluxDB IoT).
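A tiny sketch of the re-sampling (rollup) idea behind time-series tiering, assuming points as (timestamp, value) pairs (the function is illustrative, not any product’s API): old high-resolution points are collapsed into per-bucket averages.

```python
def downsample(points, bucket_secs):
    """Roll up (timestamp, value) points into per-bucket averages --
    the basic re-sampling a time-series DB applies to tier old data."""
    buckets = {}
    for ts, val in points:
        # align each timestamp down to its bucket start
        buckets.setdefault(ts - ts % bucket_secs, []).append(val)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

raw = [(0, 10.0), (15, 20.0), (30, 30.0), (60, 40.0), (75, 50.0)]
rolled = downsample(raw, bucket_secs=60)
# -> {0: 20.0, 60: 45.0}: 5 points shrink to 2, at 60s resolution
```

Real systems keep several rollup resolutions (raw, 1m, 1h, 1d) and answer a query from the coarsest resolution that satisfies its time range.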

Graph database

  • Graphene establishes the typical patterns for graph databases with semi-external memory. It speeds up queries by co-locating edges and vertices accessed together, and by managing small objects and fine-grained IOs. Earlier work traces back to GraphLab. Other contemporaries include Neo4J, which originates from saving OO graphs in a DB (database); ArangoDB, which features a JSON document graph and multi-model support; and OrientDB, which is also a multi-model database. Graph databases are frequently used in social network mining and iterative machine learning.

  • Facebook TAO is the frugal two-level architecture for the social graph (OLTP). The persistence/capacity layer is MySQL, which in turn uses RocksDB as its engine. The QPS/cache layer is Memcached, with a long thread of improvement works (e.g. CacheLib). For consistency, TAO supports 2PC cross-shard writes, and prevents fractured reads (not ACID, not snapshot isolation). Queries are optimized to fetch associations.

  • FaRM A1 is a general-purpose graph database used by Bing for its knowledge graph, all in-memory. Vertices/edges are organized in linked structure objects, accessed via pointer addresses, with optimistic concurrency control (OCC) transactions and MVCC (multi-version concurrency control) reads built via FaRM. Other contemporaries include AWS Neptune; and CosmosDB, which developed from DocumentDB, is a globally distributed, (optionally) strongly consistent multi-model database, and uses a Bw-tree with “Blind Incremental Update” instead of an LSM-tree to absorb writes.

  • ByteGraph builds a graph database atop RocksDB (TerarkDB) with the widely compatible Gremlin API. A weighted consistent hash ring shards each vertex & its adjacent edges to one node. RocksDB easily represents vertices and edges in KV, supports in-memory/on-disk tiering, and provides single-node transactions. Large edge lists are implemented by an edge-tree (B-tree), which further supports secondary indexes. ByteGraph also supports geo replication (eventual consistency), distributed transactions (2PC), and a cost-based query optimizer.

Datalake
  • Apache Hudi builds a datalake atop HDFS, Kafka, Spark, Hive. Compared to a data warehouse, it allows updating data via CopyOnWrite or MergeOnRead. Other datalake contemporaries are Delta Lake, which brings ACID with Spark, and Apache Iceberg, which features high-performance queries. Datalakes generally emphasize cross-system interoperability. Combining datalake and data warehouse, you get the Lakehouse pattern.

  • F1 Query connects multiple data sources such as Spanner, BigTable, CSV, ColumnIO, Capacitor, and ETL to create a federated query engine. The former F1 was built atop Spanner and serves Google AdWords. F1 Query supports interactive SQL queries with joins, batch queries, and custom UDFs via the UDF Server. A query is executed as a DAG in parallel, where “dynamic range repartitioning” alleviates data skew. F1 Query uses heuristic rules in the query optimizer. Besides, F1 Lightning adds HTAP support by replicating an extra columnar replica, and ensures snapshot consistency by tracking timestamp watermarks.

Stream processing

Persistent memory

  • NOVA sets up the design patterns for building a filesystem on persistent memory (PMEM) with high concurrency. NOVA indexes with a DRAM radix tree, and improves concurrency by per-inode logging and per-core free-lists. NOVA builds atomic file operations with logging, COW, and the clwb instruction on (customized) DAX-mmap. ART and hashtables are also frequently used indexes for PMEM storage.

  • Level Hashing. Though NOVA uses a tree-based PMEM data structure (the filesystem inode tree), another approach explores hashtable data structures on PMEM, favoring O(1) lookup. Level Hashing uses no logs. Resizing is done by two-level rotations. Crash consistency is guaranteed by carefully operating flag bits. However, hash-based PMEM data structures don’t support range queries.

  • Orion further speeds up the PMEM filesystem by directly exposing memory access to clients via RDMA, continuing from Octopus. Remote PMEM becomes a pool; local PMEM is accessed via DAX. Besides, this PMEM guide is useful for programming.

  • SplitFS, continuing from Orion, puts the data path in userspace and metadata operations in the kernel via Ext4-DAX. The data path speeds up by bypassing the kernel, while the kernel still manages critical operations affecting consistency and isolation. In this thread, Kuco introduces Ulib, collaborative indexing, and two-level locking to offload more fine-grained operations to userspace. ZoFS instead uses the MMU to isolate filesystems from different users, while a single user can operate on both metadata/data in userspace (protected by MPK).

Cloud native

  • Snowflake is an OLAP database native to the public cloud. Memory caching, query processing, and storage are disaggregated, reuse public cloud services (e.g. AWS S3), and are independently scalable and billable. Tenant isolation leverages VMs (virtual machines), offloading the classic resource under-utilization problem to the cloud. To avoid reading S3 every time, Snowflake adds a caching layer based on ephemeral storage. Nodes can be pre-warmed for elasticity. Snowflake went IPO very successfully.

  • Service Mesh is a containerized microservice infrastructure, where Sidecar proxies (e.g. Envoy) add traffic routing, Service Registry, Load Balancing, Circuit Breaker, health checks, encryption, etc. to Apps with little code change. The former Spring Cloud can be migrated to K8S and Service Mesh environments with some effort.

  • Dominant Resource Fairness is a typical cloud resource scheduling algorithm, used in YARN, that normalizes multi-dimensional resource allocation to follow the dominant resource. Alternatively, 2DFQ achieves fairness by separating requests to threads according to their sizes; Quasar profiles a workload on a small cluster via machine learning, then extrapolates to the full cluster; Container/CGroup specifies quota/weight per user job, a pattern shared by K8S scheduling; Ceph QoS employs dmClock, which uses weighted reservation tags. Besides, Leaky Bucket is the classic algorithm for throttling, and Heracles isolates resources between latency-sensitive and batch jobs. In general, the cloud introduced Multitenancy to depict a system shared by multiple users (tenants), each assigned a group of virtualization, isolation, access control, and priority/quota policies. For cost estimation, a typical method is request count & size over a smoothing window or an outstanding queue; Cost Modeling in DB query optimizers provides more comprehensive cost modeling methods; examples can be found in the papers Access Path Selection and Optimal Column Layout.
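A small progressive-filling sketch of DRF (my own simplification of the algorithm): repeatedly grant one task to the user with the smallest dominant share. With the classic example of 9 CPUs and 18 GB, user A tasks needing <1 CPU, 4 GB> and user B tasks needing <3 CPUs, 1 GB>, this converges to 3 tasks for A and 2 for B, equalizing both dominant shares at 2/3.

```python
def drf_allocate(capacity, demands):
    """Progressive-filling sketch of Dominant Resource Fairness:
    repeatedly grant one task to the active user with the smallest
    dominant share, until no user's next task fits."""
    n = len(capacity)
    used = [0.0] * n
    alloc = {u: 0 for u in demands}          # tasks granted per user
    shares = {u: 0.0 for u in demands}       # dominant share per user
    active = set(demands)
    while active:
        user = min(active, key=lambda u: shares[u])
        d = demands[user]
        if any(used[i] + d[i] > capacity[i] for i in range(n)):
            active.discard(user)             # this user is saturated
            continue
        for i in range(n):
            used[i] += d[i]
        alloc[user] += 1
        # dominant share = max fractional usage across resource types
        shares[user] = max(alloc[user] * d[j] / capacity[j] for j in range(n))
    return alloc

# capacity: [CPUs, GB of memory]; demands: per-task resource vector per user
alloc = drf_allocate([9, 18], {"A": [1, 4], "B": [3, 1]})
# -> A gets 3 tasks (3 CPU, 12 GB), B gets 2 tasks (6 CPU, 2 GB)
```

The key property: A is memory-dominated and B is CPU-dominated, and DRF equalizes those dominant fractions (12/18 = 6/9 = 2/3) instead of equalizing any single resource.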

  • Akkio, used in Facebook, migrates u-shards across geo-regional datacenters to maintain access locality. U-shards (in MBs), which represent small, actively accessed datasets determined by App-side knowledge, are far smaller than shards (GBs), and thus incur low migration cost. Taiji is another Facebook system that load balances users to datacenters based on SocialHash, i.e. friend groups are likely to access similar content.

Secondary Indexing

  • Helios builds a global-scale secondary index. Updates are ingested into shared logging, the single source of truth, and indexes are then built asynchronously with eventual consistency. The index is built bottom-up by merging logs and rolling up level by level, and is stored on an HDFS-compatible datalake. 3rd-party query engines can leverage the indexes to prune blocks. Hyperspace is another indexing system on the datalake, building indexes with Spark jobs; it publishes fine-grained index states, metadata, data, and logs as plain files (with a spec) on the datalake to achieve good interoperability.

  • SLIK builds a global secondary index for RAMCloud. It partitions the B+-tree index, which is represented as objects in the underlying key-value store. SLIK avoids the cost of distributed transactions by relaxing index consistency in a way that still satisfies common use cases.

  • HBase Secondary Index compares global and local indexes, as mentioned in the LSM-tree survey. A global index needs only one search but incurs high consistency cost upon updates. A local index co-locates with each data partition, so consistency updates stay local, but a search must query all partitions.
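A toy illustration of the local-index trade-off (hypothetical data and names): because the secondary index is co-located per partition, an index update touches only one partition, but a search must fan out to all of them.

```python
# Hypothetical data: rows hash-partitioned by primary key (pk -> color).
partitions = [
    {"u1": "red", "u4": "blue"},      # partition 0
    {"u2": "red", "u3": "green"},     # partition 1
]

# A LOCAL secondary index lives beside each partition: color -> [pk].
# Updating it only ever touches the partition that owns the row.
local_index = [{}, {}]
for pid, part in enumerate(partitions):
    for pk, color in part.items():
        local_index[pid].setdefault(color, []).append(pk)

def query_by_color(color):
    """A search on local indexes must fan out to every partition."""
    hits, partitions_touched = [], 0
    for idx in local_index:
        partitions_touched += 1
        hits.extend(idx.get(color, []))
    return sorted(hits), partitions_touched

hits, touched = query_by_color("red")   # matching rows live in both partitions
```

A global index would invert the cost: one map answers the search in a single lookup, but every row update must then consistently update an index that may live on a different node.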

Content distribution network (CDN)

  • Facebook Owl runs a decentralized peer-to-peer data plane (like BitTorrent), while maintaining a centralized control plane with sharded Trackers per region. The P2P architecture efficiently scales out and achieves a very high traffic-growth vs server-growth ratio. Content is distributed chunk by chunk, with each chunk following a different ephemeral distribution tree composed dynamically. Besides preset policies for peer selection and caching, an emulation framework uses Random-restart Hill Climbing to search for the best policy settings. A CDN can also be seen as a special type of distributed cache.

Storage components breakdown

To plot the architecture design space for distributed storage systems, we divide it along four dimensions. They map to the static/runtime views and the non-functional goals of the architecture. Common components can be extracted from sources like the Reference architectures section above. They may overlap, but I strive to separate them concisely and clearly.

Divide by storage areas

  • Cache
  • Filesystem
    • Distributed filesystem
  • Object/Block Storage
  • Data deduplication
  • Archival storage
  • OLTP/OLAP database
    • Shared logging
  • In-memory database
    • Manycore
  • NoSQL database
  • Graph database
  • Datalake
  • Stream processing
  • Persistent memory
  • Cloud native
    • Cloud scheduling
    • Geo Migration
  • Secondary indexing
  • Query processing

Divide by static components

  • Metadata nodes
  • Data nodes
  • Indexing
  • Logging & journaling
  • Transaction control
  • Allocator
  • Data layout
  • Data compression
  • Data deduplication
  • Caching layer
  • Cold/hot tiering
  • Client
  • Storage media
  • Networking & messaging
  • Backup & disaster recovery
  • Upgrade/deployment and restart
  • Monitoring & alerting
  • Configuration management

Divide by runtime workflows

  • Read path
  • Write path - append/overwrite
  • Load balancing
  • Data replication/repair/migration
  • GC/compaction
  • Data scrubbing
  • Failure recovery
  • Node membership & failure detection
  • Background jobs
  • Clock synchronization
  • Resource scheduling & quota/throttling
  • Overload control
  • Offloading

Divide by system properties

  • Traffic pattern, query model
  • Data partitioning & placement
  • Consistency
  • Transaction & ACID
  • Scaleout
  • Scale-up
  • High availability
  • Data durability
  • Data integrity
  • Read/write amplification
  • Space amplification
  • Concurrency & parallelism
  • Throughput & latency
  • Cross geo-regions
  • Operational ease
  • Interoperability

Distributed storage system overview

Technology design spaces - Breakdown

The following sections discuss each technology design space (ordered by importance). They are rooted in the “Reference architectures” listed above, and cover the areas in “Storage components breakdown”. Unlike the breakdowns, techniques and design patterns usually interleave multiple components and require co-design. Architecture design patterns, also covered below, map to certain techniques to achieve the desired system properties. When the dots are connected, they expand into a continuous design space that reveals more choices.


Metadata section

Key problems related to metadata are its size, how to scale it out, where to store it, and consistency. Metadata size is closely related to data partitioning and placement.

Metadata size

Essentially, the size of metadata is determined by the tracking granularity and the degree of freedom per object. These are the key design-space dimensions to consider:

  • Tracking granularity. A smaller partition size generally yields better balance, though it consumes more memory. The same applies to multi-thread task scheduling. Think of randomly tossing balls into bins: the smaller and more numerous the balls, the more balanced the per-bin ball count. Different hot/cold tiers can use different tracking granularities, e.g. cache blocks but store files, e.g. Akkio u-shards.

  • Degree of freedom. The fundamental reason an object needs memory to track its placement location is that it has the freedom to be placed at any slot. Limiting the possible slots generally reduces memory consumption, e.g. hashing the object id to map it to a placement location. However, this makes placement inflexible and incurs migration cost.
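The balls-into-bins intuition can be checked with a small simulation (parameters are illustrative): with the same number of bins, many small items give a max/mean load ratio much closer to 1 than a few large ones.

```python
import random
import statistics

def imbalance(num_items: int, num_bins: int, trials: int = 200) -> float:
    """Average (max load / mean load) when items are placed uniformly at
    random into bins -- a proxy for partition-size imbalance."""
    rng = random.Random(42)                 # fixed seed for reproducibility
    ratios = []
    for _ in range(trials):
        loads = [0] * num_bins
        for _ in range(num_items):
            loads[rng.randrange(num_bins)] += 1
        ratios.append(max(loads) / (num_items / num_bins))
    return statistics.mean(ratios)

coarse = imbalance(num_items=100, num_bins=10)     # few, large partitions
fine = imbalance(num_items=10000, num_bins=10)     # many, small partitions
# fine-grained tracking drives the max/mean ratio toward 1
```

The flip side, as the text notes, is memory: 100x more tracked items means roughly 100x more placement metadata.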

Generally, techniques and design patterns range from minimal metadata to more metadata for fine-grained control

  • Hash-based placement, the extreme of zero metadata. Typical examples are Ceph CRUSH and consistent hashing. A Ceph PG’s placement is calculated by a deterministic algorithm, which has no degree of freedom, so no metadata is needed. The pro is little metadata memory cost. The cons are excessive data migration when adding/removing nodes; placement balanced for capacity but not for hotness; and difficulty placing data when the cluster is near full.

  • Track full placement, the extreme of full metadata. An object can be placed at any node, and its location is tracked in memory. The pro is that extensive migration and balancing, for capacity or temperature and at fine grain, are easy to implement. The con is large metadata size, but there are ways to reduce or offload it.

  • VNode, the hybrid approach that limits the object side. It uses two levels: an object is first deterministically mapped to a VNode, then the VNode is placed on any node. VNode increases the tracking granularity, so there is less metadata to track, while still enjoying placement freedom. Examples are DB partition placement, which groups rows mapped by hash; Ceph’s PG and Dynamo’s VNode, which group small objects (though they then still use hash placement); and Azure Storage’s “extent”, which groups small blocks from the table layer.

  • Partitioned index, the hybrid approach that limits the placement space. It is used in the Kangaroo KLog index. Rather than allowing an entry to be placed at any slot of a hashtable, the hashtable is split into partitions and an entry is only allowed into a deterministic partition. Thus the index space is reduced, and index/pointer memory is reduced. Another approach to limit the placement space is Copyset.

  • Overlay, the hybrid approach that overlays a free-placement layer over a hash-based placement layer. Existing objects keep the old hash-based placement. New objects are tracked in metadata and placed by a different algorithm. Adding nodes won’t forcefully migrate existing objects. An example is MAPX.

  • Reduce object linkage. Another source of metadata size is the mapping linkage used to look up objects, e.g. a 16-byte UUID. It grows especially when objects are small, and when components are disaggregated into different layers or nodes in the system. One technique to reduce metadata size is to piggyback child objects into their parent to save the lookup ID.
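
To make the hash-based and VNode patterns above concrete, here is a minimal consistent-hashing sketch (the `ConsistentHashRing` class is illustrative, not from any of the cited systems): placement is derived from the key, so no per-object metadata is tracked, and virtual nodes keep migration bounded when membership changes.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable 64-bit hash; Python's built-in hash() is salted per process.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Zero-metadata placement: an object's node is derived from its key."""
    def __init__(self, nodes, vnodes_per_node=16):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            for i in range(vnodes_per_node):
                self._ring.append((_hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    def locate(self, key: str) -> str:
        # First vnode clockwise from the key's hash (wrapping around the ring).
        i = bisect.bisect(self._hashes, _hash(key)) % len(self._ring)
        return self._ring[i][1]
```

Adding a node only remaps keys that fall near its vnodes, rather than reshuffling everything as a plain `hash(key) % N` scheme would.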

Metadata scaleout

The de-facto way to handle scaleout is partitioning (also called sharding). But there are also simpler methods

  • Partitioning. Metadata is cut by key ranges to be served by different Paxos rings, e.g. Tectonic. Objects can also be mapped by hashing the key. This approach solves scalability, but requires implementation complexity and poses challenges on consistency.

  • Decoupling. Not all metadata needs to be stored in the central store. Less important pieces can be decoupled to other stores that scale differently, e.g. Tectonic. This approach increases complexity, and incurs messaging cost, especially for scans that previously ran over tightly packed memory.

  • Pushdown. Metadata can be separated into two levels. The first level is still served in the central store. The second level is looked up on-demand, pushed down to many more data nodes, or pushed down to SSD. A typical example is handling “Lots of small files” (LOSF): small files are compacted into a big file, which also persists the index; HDFS only knows the big file, and loads the indexes on-demand.

  • Levels of delegation. Similar to Pushdown; the example is Bigtable. Think of the cluster-wide B+-tree as the metadata. Metadata is essentially an index to look up data; if in tree structure, it can be decomposed level by level, naturally scaling out the lower levels to the whole cluster, while the top level is specially kept in a consistent Paxos quorum.
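
A minimal sketch of the Partitioning bullet above (shard names are hypothetical): metadata keys are routed to the shard owning their key range via a binary search over the split points.

```python
import bisect

class RangePartitionedMetadata:
    """Metadata keys are cut into ranges, each served by its own Paxos ring/shard."""
    def __init__(self, split_points, shards):
        # split_points like ["g", "p"] cut the key space into
        # [min, "g"), ["g", "p"), ["p", max) served by the three shards.
        assert len(shards) == len(split_points) + 1
        self.split_points = split_points
        self.shards = shards

    def shard_for(self, key: str) -> str:
        # bisect_right finds which range the key falls into.
        return self.shards[bisect.bisect_right(self.split_points, key)]
```

A real system would additionally support splitting and merging ranges as load changes, which is where the implementation complexity and consistency challenges come from.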

Metadata storage

Where to host metadata: a dedicated cluster, distributed on data nodes, generated on the fly, etc.

  • Paxos cluster is the popular approach, e.g. Ceph, FaRM, TiDB, CockroachDB. They use a dedicated Paxos (variant) cluster to host metadata, or Etcd/ZooKeeper, which are backed by Paxos variants.

  • Peer-to-peer. Systems originating from Dynamo don’t use a dedicated metadata cluster, but distribute the info across the entire cluster. They use a Gossip protocol to reach eventual consistency. Besides, Dynamo doesn’t have much metadata to track because it uses consistent hashing placement.

  • Primary/Secondary. HDFS uses a single Namenode to host all metadata and process transactions. It’s simpler. HDFS adds a secondary standby node for HA.

  • God node. You can see distributed DBs get transaction timestamps from one (Paxos quorum of) “timestamp oracle” node, or “sequencer” node, e.g. TiDB’s PD, FoundationDB, CORFU. The sequencer node is stateless, can quickly recover by restart, and uses epochs to separate old/new.
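
The god-node pattern can be sketched as below (the `Sequencer` class is illustrative). Timestamps compare epoch-first, so after a restart with a higher epoch (obtained, say, from the Consistent Core) every new timestamp fences all pre-crash ones.

```python
class Sequencer:
    """Stateless timestamp oracle: a new epoch after restart fences stale timestamps."""
    def __init__(self, epoch: int):
        self.epoch = epoch   # bumped via the Consistent Core on every restart
        self.counter = 0     # purely in-memory; lost state is harmless

    def next_ts(self):
        self.counter += 1
        return (self.epoch, self.counter)  # tuples compare epoch-first
```

Because nothing but the epoch needs durability, recovery is just a restart plus one epoch increment.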

Metadata offloading

Metadata can be managed elsewhere to offload the scaleout, consistency, and persistence problems.

  • Consistent Core. An app can manage metadata in the ZooKeeper or Etcd provided by the microservice framework. In this way, each dimension of the problem is offloaded elsewhere. The approach is popular.

  • In-memory DB. Storage cluster-wide metadata management can be offloaded to an in-memory database. Examples are HopsFS, or Hekaton. The database manages metadata partitioning, consistency, scaleout, and tiering cold entries to SSD. At the single data node level, Ceph BlueStore offloads metadata to RocksDB, and reuses its transaction support.

  • Cold tiering. Cold metadata can be offloaded to SSD. Which/when to offload needs careful management, to avoid slowing down maintenance scan loops, especially during correlated node failures and critical data repair. It’s also possible to compress cold memory entries, though that is CPU consuming.

Metadata consistency

Different areas, such as DB, Storage, and Filesystem, favor their own terms, which sometimes brings confusion

  • The database area commonly uses terms like strong consistency, external consistency, serializability, isolation levels, snapshot consistency. See Distributed Transactions.

  • The storage area and distributed systems may use terms like linearizability, sequential consistency (see the above article); and weaker ones like eventual consistency, causal consistency. Eventual consistency (well implemented) guarantees updates finish propagation within a time window, won’t revert, and move in a certain direction. Causal consistency is frequently used in client messaging that requires seeing what-you-changed.

  • The filesystem area uses “journaling” for metadata logging, and “logging” for data logging. It talks about write atomicity, operation atomicity, and crash consistency. The example of write atomicity is: if a write changes both data and inode, they should either all succeed or all fail. The example of operation atomicity is rmdir or rename; operations on a directory should never expose half state to the user. Crash consistency means that after a node crash, the filesystem should restore a correct state, e.g. no half-done rmdir or rename exposed, e.g. no broken linked-list on PMEM.

  • VM and backup systems use terms like consistent snapshot. A DB can use compute VMs, cache VMs, storage VMs. When the Hypervisor takes a consistent snapshot, it means all VMs are snapshotted at a consistent point-in-time. The anti-example is: the compute VM thinks an update is committed, but the storage VM’s snapshot was taken earlier and says there is no such commit.

  • The Paxos algorithm uses terms like consistent read, or quorum read. The issue is that half of the voters can lag in votes, or half of the replicas can lag in execution, thus a client can read stale state from a replica. To overcome this, the client has to read only from the Paxos leader (which cannot distribute load, and may have failed over already), or use quorum reads that touch more than half of the non-leader replicas, or switch to causal consistency instead.

Metadata consistency and data consistency share common techniques, and metadata needs to be updated consistently with data. Epochs and fencing tokens are common techniques to expire stale metadata/data (after crash restart) and to exclude stale leaders. I’ll leave most of this to the data consistency part. In general, metadata needs strong consistency, or weaker but versioned.

  • Single node, strong consistency. Putting all metadata at a single node is the old way, but extremely simple to implement. HA can be achieved by a secondary standby node, or by simply relying on fast restart. Modern CPUs ensure sequential consistency per core, and cross-core linearizability can be achieved via locking.

  • Paxos, strong consistency, in quorum. Relying on a Paxos quorum is the de-facto way to achieve strong metadata consistency, e.g. Ceph, HBase. A popular variant is the Raft algorithm, which originated from RAMCloud and became even more successful.

  • Causal consistency, weaker consistency, propagating. When strong consistency is prohibitive, usually due to performance considerations, metadata can switch to weaker consistency. The most frequently used one is causal consistency, which captures the propagation constraints. It can be implemented by adding version numbers (simplified from vector clocks) to messages.

  • Snapshot consistency, weaker consistency, versioning. Like causal consistency constrains propagation, snapshot consistency constrains that within a version, all component states seen are at a consistent point-in-time. Usually both need a version number, or a timestamp. In general, “weak” consistency is vague, while versioning provides an instinctive way to measure and control.

  • Gossip. A common way to propagate metadata across nodes is gossiping, i.e. piggybacking metadata on common communications between nodes. An example is Ceph. The method is also commonly applied in heartbeats to detect node health and membership. Eventual consistency can be achieved with version tracking. A node usually also needs to periodically refresh suspected stale metadata from the Consistent Core.
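
The version-number techniques above (causal consistency, gossip with version tracking) can be sketched as follows; the `Replica` class is an illustrative toy, not any cited system. A replica ignores out-of-date gossip, and refuses causal reads older than what the client has already seen.

```python
class Replica:
    """Eventually consistent replica; version numbers make staleness measurable."""
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Gossip may arrive out of order; only newer versions win.
        if version > self.version:
            self.version, self.value = version, value

    def read(self, min_version):
        # Causal read: refuse to serve state older than the client's low watermark.
        if self.version < min_version:
            raise RuntimeError("stale replica: retry elsewhere or refresh")
        return self.version, self.value
```

The client carries `min_version` (the highest version it has observed) on each request, which is the simplified-vector-clock scheme the Causal consistency bullet describes.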


Consistency interleaves the core spine of distributed storage system design. The techniques have high variety and touch most components. I choose scale as the first-level category to illustrate the consistency design space: from single node level, datacenter level, to geo-regional level. In general, the key design space dimensions to consider are below. See Distributed Transactions for more.

  • Point of sync. When a piece of data is changed, there must be a point of time after which the change is made visible to the end user. Call it the point of sync. It must be atomic, i.e. no half state between invisible and visible. It must keep its promise, i.e. once past the point of sync, it cannot go back. It must reach consensus, i.e. components in the system will agree on the point of sync; according to how this agreement propagates, it divides into strong consistency vs eventual consistency. For implementation, the point of sync usually relies on an atomic disk sector write (e.g. logging the commit entry), an atomic memory pointer switch (e.g. B+-tree), or another (group of) nodes that act as the Consistent Core (e.g. leader participant).

  • Ensure ordering. The system must agree on what happens first or later. This is instinctive for append-only or WAL based systems, or where every operation can be serialized by a locking data structure. It becomes tricky when the system involves multiple nodes, or the logging has multiple parallel segments. Versioning (or timestamps) is introduced, where total ordering maps to Serializable, partial ordering maps to Vector Clocks, and disjoint read/write versions map to Snapshot Isolation (Serializable requires the same timestamp for read/write). The ordering the system resolves may not match the real world’s; requiring that match maps to External Consistency. How to handle ordering conflicts varies: having the newcomer wait maps to plain locking / pessimistic concurrency control, having the newcomer retry maps to OCC, and preempting a lock maps to preemptive concurrency control or wound-wait. For implementation, the CPU/memory level usually uses locks/latches, and the disk level uses flush or direct write.

  • Separating ACID. In transaction ACID, usually ACI is united with consistency, but D (durability) can potentially be separated. The majority of storage systems choose to implement them altogether, essentially because ordering on disk is done by flush or direct write, which couples with persistence. We will see techniques in the following that break the paradigm and improve performance (e.g. Soft Updates, Journal Checksumming).
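
As a concrete instance of the partial ordering mentioned under “Ensure ordering”, here is a toy vector-clock comparison (function names are mine): two events are ordered only if one clock dominates the other component-wise; otherwise they are concurrent.

```python
def vc_leq(a, b):
    """Partial order on vector clocks: a happened-before-or-equal b."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys)

def vc_concurrent(a, b):
    """Neither dominates the other: the events are causally unrelated."""
    return not vc_leq(a, b) and not vc_leq(b, a)

def vc_merge(a, b):
    """Component-wise max, used when a node receives a remote clock."""
    keys = set(a) | set(b)
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in keys}
```

A total order (Serializable) would instead force every pair of events to compare, which is why it needs a single timestamp source or consensus.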

Consistency section

Single node level consistency

At the level of CPU/memory, fundamentally a single CPU core ensures sequential consistency (though both compiler and CPU reorder instructions). Multi-core programming involves instruction atomicity (e.g. the Intel x64 arch guarantees 64-bit reads/writes are atomic), memory operation ordering (e.g. load/store semantics), and visibility of memory changes (e.g. volatile, cache invalidation); they can be summarized under the C++ memory model. CPUs provide fine-grain instructions for locking/CAS (e.g. lock, xchg, cmpxchg), memory fencing (e.g. lfence, sfence, mfence), and cache flush (e.g. CLFLUSH, CLWB). Going to a higher level, they are used to build programming locks, lock-free algorithms, and PMEM commit protocols (like O_DIRECT flushes to disk, CLFLUSH flushes cache to memory/PMEM). More advanced ones are developed for B+-tree locking techniques in databases, and Linux Kernel synchronization. They are not a main topic for architecture design.

Coming to storage, more concerns arise at the memory/disk level and in crash recovery (i.e. system integrity). Write-ahead logging (WAL) is the de-facto solution for consistency (as well as write atomicity and durability in ACID), and becomes ever more dominant with the trend of append-only storage systems (e.g. LSM-tree). WAL (redo/undo log) is also a necessity for implementing database transactions. But there are more ways to achieve consistency.

  • Write-ahead logging, consistency by sequential logging and commit entry. Metadata/data changes are made durable to disk by journaling/logging; the journal/logging commit entry, sync flushed to disk, is the point of sync at which changes are committed and visible. Logging is naturally totally-ordered, not excluding further use of versioning/timestamps. Databases further employ redo logs and undo logs (ARIES), where redo logs are the common logging, and undo logs are introduced because of “No Force, Steal”, i.e. a page can be flushed to disk even when a (large) transaction hasn’t committed.

  • Shadow paging, consistency by COW and atomic pointer switch. The example is BtrFS. New updates are copy-on-write (COW) added to new B+-tree pages. Upon committing, the point of sync is atomically switching the pointer at the parent node to the new pages. The same paradigm is used both in memory and on disk, where CPU/memory controls ordering with locks. The technique is beneficial with built-in support for snapshots, improves parallelism with COW, and won’t be bottlenecked on serialized journal committing. However, a change at a leaf node incurs a change at the parent node, propagating further up to the root, which is expensive (unless a Page Mapping Table is employed).

  • Soft updates, which track ordering in memory but without durability. The example is FFS. Inodes track update dependencies, and the system enforces them. Actual writes to disk can be delayed and happen asynchronously, improving parallelism. The end user needs to wait for notification that changes have become durable. Soft updates themselves don’t guarantee the necessary metadata/data changes are durable upon crash, and careful implementation is needed to ensure crash consistency.

  • Transactional checksumming, which tracks ordering on disk but without durability. The system starts writing blocks A/B in parallel, but expects block A to be committed only after block B. Block A carries B’s checksum; if a crash happens in the middle, leaving A on disk but not B, the checksum can tell that block A is invalid. The technique breaks the sequential bottleneck of logging; however, determining the point of sync during failure recovery becomes more expensive. See Optimistic Crash Consistency for more.
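
A toy sketch of the WAL bullet above, using JSON records for readability (a real WAL uses binary records with checksums): the fsync’d commit entry is the point of sync, and recovery discards any uncommitted tail.

```python
import json
import os

class MiniWAL:
    """Point of sync = the fsync'd COMMIT entry; a torn tail is discarded on recovery."""
    def __init__(self, path):
        self.f = open(path, "a+")

    def write_txn(self, records):
        for r in records:
            self.f.write(json.dumps({"op": "data", "rec": r}) + "\n")
        self.f.write(json.dumps({"op": "commit"}) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # only after this is the change durable/visible

    @staticmethod
    def recover(path):
        committed, pending = [], []
        with open(path) as f:
            for line in f:
                try:
                    e = json.loads(line)
                except ValueError:
                    break  # torn tail write: stop at the first corrupt line
                if e["op"] == "commit":
                    committed += pending
                    pending = []
                else:
                    pending.append(e["rec"])
        return committed  # records after the last commit entry are dropped
```

Note how atomicity falls out for free: either the commit entry made it to disk (all records replay) or it didn’t (none do).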

Consistency between metadata/data components also needs to be maintained (continued from the Metadata section). A typical storage system propagates visibility of new changes from disk data, to index, then to end user. The index here is metadata, which tells how to look up data, e.g. inode trees. From the system internals, the propagation is usually of eventual consistency, e.g. allocate disk space, write data, then after some time commit the journal. From the view of the end user, it’s made atomic by the interface (hiding system internals) and the (async) notification exposed by the write request. The same design pattern applies when metadata and data are separated onto different groups of nodes.

Datacenter level consistency

After single node level consistency, we come to the distributed multi-node level. From strong to weak: modern distributed databases typically implement distributed transactions for ACID at the Serializable or Snapshot Isolation level. Storage systems build strong consistency with data replication. NoSQL, caching, and cross-system interactions typically employ weaker consistency models to reduce complexity and performance overhead.

  • Distributed transactions. See the Distributed Transactions article for more. Examples are Spanner, Percolator, CockroachDB, TiDB. The implementations vary in point of sync, how to enforce ordering, and lock conflict handling. Besides, database global secondary indexes, kept in strong consistency with user writes, are also implemented with distributed transactions.

Distributed transaction spectrum of strategies

  • Raft data replication. Examples are CockroachDB, TiDB. Like running metadata in a Paxos quorum, data partitions are replicated with the Raft protocol (a Paxos variant). This ensures strong consistency, and reuses optimizations on Paxos, e.g. out-of-order commit. Megastore provides comprehensive optimizations for Paxos replication.

  • 3-way replication. Examples are Ceph, and similarly the Chain Replication used in Azure Storage. It’s simpler and came earlier than Raft. The classic implementation selects a leader node via the Consistent Core (e.g. the metadata cluster) to drive follower nodes with strong consistency. Throughput can be optimized with pipelining.

  • Quorum read/write. Examples are Dynamo, Cassandra. With N replicas in total, both read and write operations touch > N/2 replicas (more generally, R + W > N), so they are guaranteed to intersect on a replica holding the latest version. The implementation adds more complexity to handle read amplification (or simply return cached versions), version tracking, and node write failures.

  • Log is database. Like WAL simplifies single node consistency, a distributed system can build atop a shared logging service. Examples are FoundationDB, Helios Indexing. The idea can be expanded to build a system atop any shared storage service that provides strong consistency and acts as a single node, e.g. a distributed filesystem, a page store. Examples are AWS Aurora Multi-master, Azure Storage. The idea also extends to propagating changes in a synchronous or eventually consistent way, which naturally works with the database WAL. Examples are Helios Indexing, MySQL BinLog Replication.
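
The quorum intersection argument above can be sketched in a few lines (illustrative helper functions, not Dynamo/Cassandra APIs): as long as write and read quorums overlap, the highest-version copy in any read quorum is the latest committed write.

```python
def check_quorum(n, w, r):
    """Write and read quorums must intersect: W + R > N."""
    return w + r > n

def quorum_read(sampled_replicas):
    """Given replies from a valid read quorum (dicts with 'version'/'value'),
    the highest-version reply is the latest committed write."""
    return max(sampled_replicas, key=lambda rep: rep["version"])
```

With N=3, the common choice W=2, R=2 satisfies the check; W=1, R=1 does not, and a read could miss the latest write entirely.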

The above techniques build strong consistency. For weaker consistency

  • Eventual consistency. Typically, if a system doesn’t do anything about consistency and just lets changes propagate, it’s eventual consistency. Better implementations provide versioning to measure propagation, and guarantee a deadline for propagation.

  • Causal consistency. Same as in the Metadata section. It’s compatible with eventual consistency, and a client must see what it has already seen. For implementation, the client tracks the low watermark version it wants the server to return.

  • Custom consistency level. The example is RAMP-TAO, which checks that the local result set satisfies “read atomicity”, and fetches missing versions from the RefillLibrary. In general, a wide spectrum of custom consistency models can be implemented by tracking versions with data, checking consistency constraints on the fly, and buffering necessary lookups in a cache.

  • Compensation transaction. See the Compensation Transaction article. It unites multiple systems to build a (pseudo) ACID transaction. Each system internally supports ACID transactions and idempotent operations. The client drives the transaction, propagating changes across the multiple systems in an eventually consistent way in a single direction, with at-least-once semantics and a clear completion time. If one system fails in the middle, which breaks atomicity, the client rolls back by replaying a “compensation transaction” at each system in reverse order. Hiding all this complexity, the client exposes a seemingly ACID transaction interface. The technique is handy for building booking services at large-scale Internet companies. Additionally, a “reservation” step can be added to make a system less likely to fail, which renders it more like 2PC (except other clients can read middle states).
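
A minimal saga-style sketch of the compensation transaction pattern (the `run_saga` helper is mine; real implementations persist progress so a crashed client can resume): each step pairs an idempotent action with its compensation, and a failure replays the compensations in reverse order.

```python
def run_saga(steps):
    """Each step is (action, compensation). On failure, undo completed steps in reverse."""
    done = []
    try:
        for action, compensate in steps:
            action()
            done.append(compensate)   # only successful steps need compensating
    except Exception:
        for compensate in reversed(done):
            compensate()              # each system's op is idempotent, safe to retry
        return False
    return True
```

A booking flow illustrates it: if reserving the car fails, the hotel and flight bookings are cancelled in reverse order, restoring the appearance of atomicity.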

Geo-regional level consistency

Coming to the cross-regional multi-datacenter level, the techniques are similar to those at the single datacenter level. But the scale makes strong consistency or synchronous replication hard due to the latency overhead. Most implementations are eventually consistent, where the disaster recovery area defines the measurement concepts

  • RTO (Recovery Time Objective). How long the system and application need to recover at the second region, after a disaster happens at the first region. RTO can be long if the system startup, cache warming, and DNS hand-off take time.

  • RPO (Recovery Point Objective). Because cross-region replication is async, there is a delay from the replicated data to the latest data. RPO defines the delay window. It maps to how much recent data will be lost after recovery in the second region.

Besides those that duplicate the datacenter level, common techniques are below. In comparison, more optimizations target unstable links and low bandwidth in the WAN (wide-area network).

  • Geo-replication. Databases commonly support async replication (eventual consistency) used for backup across regions, typically by replicating logs, e.g. MySQL BinLog Replication, and Redis primary/secondary replication via command stream. Async geo-replication doesn’t exclude synchronously replicating a small piece of critical metadata; and doesn’t exclude a client querying the primary region to determine the latest version.

  • Incremental diff. Ceph provides RBD diff that exports incremental snapshots, which can be used for geo-replication in a semi-automated way.

  • Log is database. Most of this is already summarized above. Use logs to replicate changes in an eventually consistent way. Examples are Helios Indexing, MySQL BinLog Replication.

The above are eventual consistency replications. For strongly consistent geo-replication, typically Paxos replication is employed (and optimized), while clock syncing for serializable transactions becomes a bigger problem.

  • Megastore Paxos. Google Megastore synchronously replicates across the WAN using an optimized Paxos protocol. Compared to primary/secondary replication, any Paxos replica at a nearby or less utilized datacenter can lead a transaction to balance load. A write only needs > N/2 replicas to ack, which reduces cross-datacenter latency. Local datacenter reads are favored, by using a Coordinator to track which replicas have the latest versions. Writes use Leader Paxos, where leader selection favors a nearby replica. Witness replicas, which vote and replicate logs but won’t execute logs to serve DB data, are introduced to form a quorum when there are too few participants. Read-only replicas, which don’t vote but replay logs to serve DB data, are introduced to serve snapshot reads. Per implementation, Paxos across datacenters is essentially replicating logs, similar to “Log is database”.

  • Spanner & TrueTime. Like Megastore, Google Spanner stores replicas in different geo-regions, and employs Paxos replication. Distributed transactions are implemented by 2PC, whose liveness is guaranteed by the HA of each participant’s Paxos replicas. The special part is TrueTime, used to synchronize clocks across datacenters, and thus to implement External Consistency via Commit Wait (see the Distributed Transactions article). TrueTime relies on customized hardware, i.e. GPS receivers and atomic clocks, as time master nodes in each datacenter, to guarantee less than 7 ms clock drift globally.

  • CockroachDB & Hybrid Logical Clock (HLC). Like Spanner, CockroachDB employs Paxos (Raft) data replication across regions, and 2PC for distributed transactions. Reads favor nearby replicas, and writes can choose nearby replicas in the same region first and leave the others async. Different from Spanner’s TrueTime, CockroachDB uses HLC for cross-datacenter clocks. HLC provides causality tracking via its logical component, and monotonically increasing epochs via its physical component, and employs NTP as the software-only clock syncing protocol.
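
A toy HLC sketch of the commonly described send/receive rules (the `HLC` class is illustrative; `now_fn` is injected for determinism): the physical component tracks wall time, and the logical counter breaks ties while preserving causality.

```python
class HLC:
    """Hybrid logical clock: physical part from local wall time,
    logical counter breaks ties and preserves causality."""
    def __init__(self, now_fn):
        self.now = now_fn
        self.l = 0   # highest physical time seen
        self.c = 0   # logical counter within that physical time

    def send(self):
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1          # wall clock hasn't moved: bump logical part
        return (self.l, self.c)

    def recv(self, m_l, m_c):
        pt = self.now()
        l_new = max(self.l, m_l, pt)
        if l_new == self.l == m_l:
            self.c = max(self.c, m_c) + 1
        elif l_new == self.l:
            self.c += 1
        elif l_new == m_l:
            self.c = m_c + 1     # remote clock is ahead: adopt it, stay causal
        else:
            self.c = 0
        self.l = l_new
        return (self.l, self.c)
```

Timestamps are `(physical, logical)` tuples, so they compare physical-first; a message from a node with a faster clock pulls the receiver forward instead of violating causality.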

Write path

The next big component in a distributed storage system is the write path, by following which you can characterize how a system works. Append-only or update in-place fundamentally divides system styles and the next level of techniques. The write path touches almost every other component in a system, e.g. metadata, index, data organization, logging, replication, and many system properties, e.g. consistency, durability, amplification. The read path can be seen as a reflection of the write path, plus caching to optimize performance.

Write path section

Append-only vs update in-place

The first driving dimension is append-only vs update in-place. Traditional single node filesystems usually update disk data in-place (except BtrFS). Later, the quick adoption of the LSM-tree led to the predominance of append-only systems, also known as log-structured systems. Not only does HDD benefit from sequential writes; SSD also favors append-only (e.g. RocksDB) due to internal FTL & GC. Moreover, PMEM filesystems, e.g. NOVA, adopt append-only with per-inode logging; and in-memory systems, e.g. Bw-tree, adopt append-only with delta pages.

  • Update in-place. Examples are EXT4, Ceph. If a piece of data is to be updated, it’s overwritten at the same address on the disk, rather than written to a new address. Compared to append-only, address tracking is simpler, without needing extra memory metadata to track new addresses, and without extra costly GC to reclaim old data. The drawbacks are: 1) the underlying HDD doesn’t like random writes. 2) With a fixed block size, storing compressed results is tricky. 3) The double-write problem, where overwrites need a transaction to protect against crash, thus new data gets an extra write in journaling.

    • Content-based addressing. The example is XtremIO. Each piece of data has a fixed on-disk location (think of placement by data hash, but at the disk block level). When the data block location is determined by the content hash, it can be used for auto dedup. Since the data block location has zero degree of freedom, such a system needs minimal metadata to track data locations, and cannot be implemented as append-only.

    • Set-associative cache. The examples are Kangaroo and Flashcache. The entire SSD is used to map a large HDD space, just like how a CPU cache maps memory. An HDD data block can be stored on SSD, selecting from a small set of blocks. The set is determined by hashing, within which a block is found by linear probing. Similarly, by limiting the data location’s degree of freedom, minimal memory metadata is needed.

    • Database paging. A cleaner way to update in-place is to divide the address space into pages, and use the page as the atomic unit of transfer. The “page” here is like storage “blocks”. However, the system additionally needs transaction logging to guarantee crash consistency. Moreover, even if only a few bytes are updated, an entire page has to be written, i.e. write amplification. A page can have internal fragmentation where margin bytes cannot be utilized, i.e. space amplification. If pages don’t need to be equal-sized, they become “chunks”, or “micro-partitions”.

  • Append-only. Examples are the LSM-tree or RocksDB, Log is database, Azure Storage. These systems don’t support modifying written data on-disk, thus updates need to append to new places, like a log. The main drawbacks of such systems are: 1) Constant GC (or compaction) is needed to reclaim old data, which can eat up even 50% of system bandwidth. 2) Data locations have a high degree of freedom, thus the system either needs huge memory metadata for lookup, or incurs read amplification when scanning through stale data. The benefits are: 1) Everything is simplified because written data is immutable. 2) Writes are sequential, which HDD favors. 3) Transactions and crash consistency are built-in, because data is log. Over the years, after all, append-only has proven successful.

    • Sequential structure or not. The example is BtrFS. Not all append-only follows sequential logging. In BtrFS, new data is copy-on-write to a new page, and then atomically linked into the B+-tree. Besides, optimizations like parallel multi-segment logging also break the default single sequential log.

    • Cleanup inline or offline. Append-only needs to clean up stale data; should it be done on the write path, or offline? GC/compaction chooses offline. Apache Hudi copy-on-write chooses inline on the write path. Besides, the cleanup can even be delayed to the first user read, i.e. Apache Hudi merge-on-read.

    • Delta data. The idea of append-only can be expanded to indexing, on PMEM (e.g. NOVA) or in memory (e.g. Bw-tree). They exploit the fact that appending delta data benefits high concurrency, simplifies lock handling, and avoids amplification like COW. From another perspective, immutable data can be implemented either by COW or by appending deltas, while COW forces compaction on the write path.

    • Log is database. We mentioned it before already. Compared to database paging, which incurs random writes, transferring logs across components writes sequentially. Syncing pages incurs write amplification if only part of a page is modified, but repeated modifications to the same address can be absorbed; while logging carries deltas, smaller than a whole page, but can grow into a long list under repeated modification, thus needing compaction. Though the log can easily be used as the consistent truth for database state, replaying to the latest data incurs computation cost, and needs careful version aligning to leverage cached pages.

  • Hybrid approach. The example is Ceph BlueStore, where big writes are append-only, small writes overlapping no existing data are in-place, and small overwrites are merged into the RocksDB WAL. This approach was invented to overcome the Ceph double-write problem. It essentially bridges the old in-place update to append-only.
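
Returning to the set-associative cache sub-bullet, here is a toy index sketch (the class and its naive eviction are mine, not Kangaroo’s or Flashcache’s design): each HDD block hashes to exactly one small set, so the in-memory index stays tiny and a lookup probes only a few ways.

```python
import hashlib

class SetAssociativeIndex:
    """SSD cache over HDD blocks: a block may live only in one small set,
    so per-slot memory metadata shrinks to almost nothing."""
    def __init__(self, num_sets, ways):
        self.num_sets, self.ways = num_sets, ways
        self.sets = [[None] * ways for _ in range(num_sets)]  # stores block ids

    def _set_of(self, block_id):
        h = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16)
        return h % self.num_sets

    def lookup(self, block_id):
        s = self._set_of(block_id)
        for way, b in enumerate(self.sets[s]):  # linear probe within the set
            if b == block_id:
                return (s, way)
        return None

    def insert(self, block_id):
        s = self._set_of(block_id)
        ways = self.sets[s]
        way = ways.index(None) if None in ways else 0  # naive eviction: way 0
        ways[way] = block_id
        return (s, way)
```

The trade-off is exactly the degree-of-freedom argument from the Metadata section: a block can only occupy `ways` slots, so hot sets may thrash while other sets sit empty.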

Thinking at a higher level, the driving factor behind append-only vs update in-place is whether to delay maintaining the on-disk data organization, whether to do it inline or offline, or a write-optimized data format vs a read-optimized data format.

  • The write path is efficient if it doesn’t need to maintain the on-disk data organization (see the Data organization section). Writes favor batching and sequential access. This is what append-only brings, except for the extra bandwidth spent on GC/compaction. Besides, writes favor fewer co-updated components (in sync), e.g. fewer indexes, caching, less fragmented write locations.

  • The read path is efficient if data has an index, or the location can be derived from the key, or it is well-sorted to favor full scans. Data should be less fragmented, preserve locality, and have fewer stale entries. Though append-only generates fragmented deltas, GC/compaction can rewrite them into optimized read formats. Though update in-place saves GC/compaction traffic, more read-optimized formats may still need extra rewrites.

    • A data index is usually needed for an efficient read path. Update in-place reduces index size by limiting the data location degree of freedom, though this is not applicable to secondary indexes; and by preserving a (bigger) tracking granularity, i.e. unlike append-only, which redirects small updates to a new page. This also means fewer ripple updates to the index.

  • On-disk data organization. The best read-optimized data format almost always requires a full rewrite to generate, which explains why append-only is favorable, especially considering columnar compression (i.e. OLAP). More recent data, which can be separated by hot/cold tiering (or like the “levels” in LSM-tree), may still benefit from update in-place to reduce GC/compaction or churn to the index (though in fact most systems also use append-only).

Co-updating neighbor components

Besides on-disk data, the write path touches a wide range of components to co-update together, e.g. metadata, index, checkpoint, logging, cache.

  • Metadata, index. The main concern here is the propagation of visibility from a disk data change to the end user. This was mentioned before in the Consistency section.

  • Checkpoint, logging. New changes are first made atomically durable by WAL, where a typical technique is separating keys/values (WiscKey). Durable changes can then be propagated to the index and metadata to be made visible to the user. Logging is a write-optimized format, while reads need structured data. The “structured data” is either periodically flushed from memory to disk, i.e. checkpointing, or made by transferring database pages. Fragmented, overlapping checkpoints further need GC/compaction to rewrite into a more read-optimized format (e.g. LSM-tree), and to reclaim deleted storage space.

  • Cache updates are async, usually be offline from write path; unless the write wants to invalidate stale data or immediately load new data.
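The WAL-then-checkpoint ordering above can be sketched in a few lines. This is a minimal in-memory model; the `MiniStore` class and its method names are illustrative, not from any real system:

```python
import json

class MiniStore:
    """Toy store: writes go to a WAL first, then to the in-memory table;
    checkpoint() flushes the table to a 'structured' snapshot and truncates the log."""
    def __init__(self):
        self.wal = []              # durable, append-only log (modeled as a list)
        self.memtable = {}         # mutable in-memory state, visible to reads
        self.checkpoint_data = {}  # last flushed structured snapshot

    def put(self, key, value):
        # 1) make the change atomically durable in the log before anything else
        self.wal.append(json.dumps({"key": key, "value": value}))
        # 2) only then propagate it to the memtable/index to become visible
        self.memtable[key] = value

    def checkpoint(self):
        # flush memory to the read-optimized snapshot; the log can then be truncated
        self.checkpoint_data = dict(self.memtable)
        self.wal.clear()

    def recover(self):
        # crash recovery: start from the snapshot and replay the remaining log
        state = dict(self.checkpoint_data)
        for rec in self.wal:
            r = json.loads(rec)
            state[r["key"]] = r["value"]
        return state
```

A write is never visible before it is logged, so replaying the snapshot plus the log after a crash reconstructs exactly the acknowledged state.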

Besides writing locally, data replication is also interleaved in the write path. It achieves durability and serves many other purposes:

  • Durability, e.g. Raft replication, 3-way replication, quorum writes; see the Consistency section. Durability replication is usually synchronous with strong consistency.

  • Disaster recovery, e.g. backup, geo-replication; see the Consistency section. They can be async with an agreement on RPO.

  • Locality, e.g. geo-replication that moves data to the user’s local region, e.g. Akkio u-shards; and CDNs that act as static content caches and bridge across WAN providers.

  • Data layout. Examples are TiFlash and F1 Lightning. These databases maintain the main data copy in row format to serve OLTP, and replicate an extra columnar-layout copy for OLAP use. The Raft protocol or fine-grained version tracking can be used to maintain consistency between replicas.

  • Hot/cold tiering. Hot data can be copied into cache. Cold data can be offloaded to slow HDDs or archival storage. Data formats between tiers can also differ, to favor access latency, storage efficiency, or compression.

  • Data balance. Typically, data can be re-balanced to occupy empty new nodes, to spread placement away from correlated failure domains, or to balance hot/cold access across nodes.

  • Log is database. Instead of replicating data or pages, logs, which carry the deltas, are replicated and propagated as the source of truth. See the Consistency section.

  • Separating write path and read path. Examples are AnalyticDB and MySQL primary/secondary replication. The design originates from the database community, which uses one server as the write primary and replicates to multiple replicas to scale reads. It exploits the pattern that a social network generates content (writes) at a relatively constant rate, but user views (reads) can burst high.
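The quorum-write flavor of durability replication above can be illustrated with a toy model. The `QuorumKV` class is hypothetical; it only shows why W + R > N lets a read quorum always intersect the latest write quorum:

```python
class QuorumKV:
    """Toy quorum replication: N replicas, a write succeeds on W of them,
    a read asks R of them and returns the highest-versioned value.
    W + R > N guarantees the read and write quorums intersect."""
    def __init__(self, n=3, w=2, r=2):
        assert w + r > n
        self.replicas = [dict() for _ in range(n)]
        self.n, self.w, self.r = n, w, r

    def write(self, key, value, version, up_replicas):
        acks = 0
        for i in up_replicas:
            self.replicas[i][key] = (version, value)
            acks += 1
        return acks >= self.w   # durable only if a write quorum acked

    def read(self, key, up_replicas):
        answers = [self.replicas[i][key] for i in up_replicas[:self.r]
                   if key in self.replicas[i]]
        return max(answers)[1] if answers else None  # latest version wins
```

Even when a read quorum misses some replicas that took the latest write, the intersection guarantee ensures at least one answering replica holds the newest version.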

Offline background jobs touching data can also be divided by purpose. They usually rewrite data copies, which is the main source of write amplification, but is necessary to reduce read amplification by generating a more optimized data layout.

  • Durability. Typically the data repair process, which kicks in when nodes or disks go bad. These background jobs require low detection time and high-priority bandwidth. Data repair efficiency can be improved by involving more nodes to provide source data, e.g. Ceph involves the full cluster, Copyset involves a partition of the cluster, while primary/secondary replication only involves a few secondaries.

  • Storage efficiency. Data compression can be run off the write path to avoid increasing user-visible latency. Erasure coding can then further reduce the storage space needed. GC runs periodically to reclaim deleted storage space.

  • Data layout. E.g. RocksDB runs offline compaction, which removes stale data and sorts out overlapping SST files to make reads more efficient. E.g. AnalyticDB buffers new writes in incremental files, then merges them into the baseline data and builds the full index. Similar delta-merging patterns can also be found in Datalakes, e.g. Apache Hudi. W.r.t. data replication, the destination copy can be placed in another region or even another cloud service, while the computation can also be offloaded to the cloud.

  • Data integrity. Storage systems typically employ offline data scrubbing to detect silent data corruption. End-to-end CRCs can be stored along with the data. Besides, invariants across different layers can be checked, e.g. index vs data, mapping constraints.
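The end-to-end CRC plus offline scrubbing idea can be sketched as follows; `store_block` and `scrub` are illustrative names, using Python's `zlib.crc32`:

```python
import zlib

def store_block(data: bytes):
    """Attach an end-to-end CRC to the data at write time."""
    return (data, zlib.crc32(data))

def scrub(blocks):
    """Offline scrubbing pass: recompute each block's CRC and
    report the indexes of silently corrupted blocks."""
    return [i for i, (data, crc) in enumerate(blocks)
            if zlib.crc32(data) != crc]
```

Because the CRC travels with the data from the moment it was written, any later bit flip on disk or in transit is caught the next time the scrubber (or a read) re-verifies it.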

Write to different storage media

Write data flows through, or eventually persists at, one of the storage media: memory, PMEM, SSD, HDD, or archival tapes. Data structures and techniques vary according to the characteristics of the storage media and the workload access patterns. We will see more in the Data indexing and Data organization sections.

  • The memory tier does well with random access and provides the lowest latency among storage tiers. The major concerns are improving concurrency and cache efficiency, and packing more data into memory. Typical data structures are plain pointer links (e.g. FaRM), skiplists (e.g. RocksDB) and the Bw-tree, which favor concurrency; the B+-tree, whose bigger nodes benefit cache lines more than a red-black tree; and hashtables for quick lookup (e.g. Memcached). Memory compression and disk swap can be enabled (e.g. TMO).

  • The PMEM tier is 2x~3x slower than DRAM (PMEM empirical guide) and doesn’t like small random writes. The major concerns are improving concurrency, compensating for slow CPU, and maintaining crash consistency while avoiding expensive cache-flush instructions. RDMA and kernel bypassing are common techniques. Tree-based append-only data structures, e.g. per-inode logging in NOVA, are still favorable. Another approach uses a hashtable data structure, e.g. Level Hashing.

  • SSD tier. Except for a few systems that update in place, most systems shift to append-only, e.g. RocksDB, TiDB/Cockroach/MySQL which use RocksDB as the engine, HBase/ClickHouse which employ LSM-tree(-like) engines, and FoundationDB / Azure Storage which build atop shared logging. I.e. SST files and central logging are the common data structures on SSD. OLAP databases also favor append-only in batches and rewrite into compressed columnar layouts. Some databases choose to build an index for every column, while others rely solely on full scans.

  • HDD tier. Since both favor append-only, the data structures are similar on HDD and SSD, and most systems can interchangeably run on both. The difference is that the SSD variant needs more CPU and parallelism allocated per device.

  • Archival tapes tier. Append-only is also the favored write style, e.g. Data Domain, thus not much different from the HDD or SSD ones. The data is usually deduplicated, appended in a sequential structure, and relies on an index for lookup. Dedup fingerprints can be stored with the data to preserve locality. Higher compression levels and longer erasure coding codecs are used.

  • Computation tier. The above tiers are sorted by data size. The computation tier is special in that, in certain cases, no data needs to be stored and everything can be derived from calculation. In other words, data is “stored” in calculation.
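The append-only SSD write style above, with an in-memory memtable flushed into sorted runs, can be sketched as a toy LSM engine. `TinyLSM` is a hypothetical miniature, not RocksDB's actual code:

```python
import bisect

class TinyLSM:
    """Toy LSM write path: puts go to an in-memory memtable, which is
    flushed into an immutable sorted run (an 'SST file') once full.
    Reads check the memtable first, then SSTs newest-to-oldest."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.ssts = []          # list of sorted (key, value) runs, newest last
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            # append-only flush: sort once, never rewrite in place
            self.ssts.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.ssts):      # newer runs shadow older ones
            keys = [k for k, _ in run]
            i = bisect.bisect_left(keys, key)
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

All disk writes in this model are sequential appends of whole sorted runs, which is exactly why the pattern suits SSD (and HDD) so well.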

Tiering between different storage media

In general, storage media tiers are chosen according to the price, scale, and performance targets of the data. Each tier has its own optimization techniques. Data movement across tiers needs efficient temperature detection/prediction algorithms, which are usually LRU variants but more concerned with reducing tracking metadata size against the large data scale:

  • Exponential smoothing. This is the standard academic method that averages current and historical hotness with a weight, where older history is exponentially forgotten. The method doesn’t say how to implement it efficiently. Hotness can be measured by data access IOs and bytes in a time window.

  • LRU (least recently used). Like exponential smoothing, LRU is the typical method from which most temperature-tiering algorithms stem, but it doesn’t specify how to implement.

  • Bits per object. The example is the Kangaroo RRIParoo algorithm. Temperature is tracked by per-object bits. A bit can be flipped when the object is accessed, or when a global eviction is needed (e.g. clock tick, cache full). If all bits match, the object can be evicted.

  • Objects in a list. Examples are linked-list-implemented LRU, or Linux kernel memory page swap. Temperature is tracked by object position in the list. Objects are promoted to the head when accessed, pushed to the tail when cold, and evicted beyond the tail.

  • Last accessed and expire. Usually seen when the App is operating cache-aside. Simply, the last accessed item from the DB is put into cache. The oldest item is evicted if the cache is full. Cache items also expire by a timeout.

  • Offline classification. Examples are Hekaton Siberia and Google G-SWAP. When temperature tracking metadata is too large, the system can dump traffic records (possibly sampled) to disk, and employ a periodic offline classification job or Machine Learning to categorize hot/cold data.

  • User tagging. Expose an interface for end users to explicitly tag whether a piece of data is hot or cold. Simple, and users often know best.
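As one concrete example from the list, exponential smoothing can be sketched as below. The `HotnessTracker` class, its `alpha` default, and the window granularity are all illustrative choices:

```python
class HotnessTracker:
    """Exponential smoothing of per-object hotness: at each window close,
    new access counts are blended with history, and older windows decay
    exponentially (weight (1 - alpha)^k after k windows)."""
    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.hotness = {}   # object -> smoothed access rate
        self.window = {}    # raw access counts in the current window

    def access(self, obj, nbytes=1):
        # hotness can be measured in IOs or bytes; here one unit per call
        self.window[obj] = self.window.get(obj, 0) + nbytes

    def tick(self):
        # close the time window: blend current counts into history
        for obj in set(self.hotness) | set(self.window):
            now = self.window.get(obj, 0)
            old = self.hotness.get(obj, 0.0)
            self.hotness[obj] = self.alpha * now + (1 - self.alpha) * old
        self.window.clear()

    def coldest(self):
        # candidate to demote to a cheaper tier
        return min(self.hotness, key=self.hotness.get)
```

Note the tracking state is one float per object: a burst of old accesses fades over a few windows, while steady recent traffic keeps an object hot.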

Write & read paths coalescing

Though write/write and read/read coalescing are common techniques, write/read and read/write have interesting ways to combine and reuse each other’s intermediate results.

  • Writes coalescing. Small writes can be combined into one big write to disk. The system can use a timeout plug to accumulate enough small writes to combine, or just scan and sort through the write queue. Small writes to neighboring addresses can be combined to favor big sequential writes. Repeated writes to the same address can be canceled out, leaving only the last write to reach disk. A fast staging memory, flash, or PMEM tier can be kept to absorb small writes.

  • Reads coalescing. Like writes, small reads can be combined to favor sequential disk accesses, or repeated reads saved with caching. A read query typically needs to scan more physical data than what the user requested, which means multiple queries can be batched to share one single disk scan.

  • Read as a path for write. When a read query scans data to look something up, it is conceptually building an index. The read can leverage the scan and push the index parts to the write path, which is responsible for building the index. E.g. REMIX LSM-tree leverages range queries to build the index for SST files. The write path is also responsible for rewriting newly ingested data into read-optimized formats. It can reuse what the read query just scanned and loaded into memory. This is more useful when the loading is expensive and involves remote networking.

  • Write as a path for read. Newly written data is usually more likely to be read. Writes can leave it in memory or in a staging area, directly populate the cache, and organize it in read-optimized data structures. Subsequent reads benefit. E.g. the memtable in a typical LSM-tree.
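Writes coalescing, as described above, can be sketched as a small function: repeated writes to the same offset cancel out, and adjacent offsets merge into one sequential write. The 1-byte block granularity is a simplifying assumption:

```python
def coalesce_writes(queue):
    """Coalesce a queue of (offset, data) small writes:
    1) repeated writes to the same offset cancel out, keeping only the last;
    2) writes at adjacent offsets merge into one bigger sequential write.
    Assumes 1-byte 'blocks' for simplicity."""
    last = {}                      # offset -> latest data for that offset
    for off, b in queue:
        last[off] = b
    merged = []
    for off in sorted(last):
        if merged and merged[-1][0] + len(merged[-1][1]) == off:
            prev_off, prev_data = merged[-1]
            merged[-1] = (prev_off, prev_data + last[off])   # extend the run
        else:
            merged.append((off, last[off]))
    return merged
```

Five queued writes can thus collapse into two disk IOs, and the write to offset 0 is issued only once with its final value.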


Hardware offloading

Inline or offline from the write path, FPGAs and ASICs are commonly used to offload work from the CPU, e.g. compression/encryption, and multi-tenant cloud virtual network processing. Offloading relieves the CPU from growing IO hardware throughput, while pushdown shortens the data transfer path.

  • FPGAs feature reconfigurability, which favors flexibility and early experiments. ASICs are dedicated circuits, hard to change, but once shipped they are much more efficient than FPGAs. FPGAs have had successful use cases like Project Catapult. SmartNICs have also become popular.

  • Compression/encryption are typical offload use cases because the logic is fixed, involves little exception handling, and is data-pipeline oriented. Network processing is similar. Besides, nowadays high-speed RDMA is much more demanding on the CPU, and cloud virtual networking involves more layers of redirection.

  • The more recent IPU (Infrastructure Processing Unit) was proposed following the DPU, to offload common datacenter infrastructure functionality to processing chips other than the CPU.

  • Smart SSD adds computation chips to the SSD. Query filtering or GC/compaction can be pushed down into the SSD, without involving the longer data transfer path across PCIe.

Alibaba Pangu 2.0 Filesystem is a good example of offloading. Pangu is the storage backend of Alibaba Cloud Elastic Block Storage (EBS), Object Storage Service (OSS), Network-Attached Storage (NAS), PolarDB, and MaxCompute. The architecture is similar to Azure Storage and GFS. Modern cloud infrastructure storage systems are moving in new directions:

  • The storage engine and the filesystem move to userspace, to match the speed of RDMA and NVMe SSD. The kernel is bypassed, leveraging DPDK and SPDK.

    • Besides, the filesystem becomes specialized, i.e. customized for the storage. Storage wants access to the raw interface of drives, e.g. ZNS SSD and SMR.

    • Besides, it's not only kernel bypassing: RDMA bypasses the remote host CPU to access DRAM, and NVMe-over-Fabrics bypasses the remote host CPU to access SSD.

    • The eventual goal is to offload the entire data path.

  • DPUs become widely used: 1) network offloading; 2) storage offloading; 3) RDMA SmartNICs.

    • A typical DPU is composed of FPGA + CPU. The CPU can be an ARM core or a processor on the FPGA. The FPGA can also be replaced with an ASIC. The DPU is also equipped with extended chips like a VPU (Vector Processor Unit), local DRAM, fast DPU interconnect, and fast DPU-GPU interconnect.

    • The concept of FastPath: a (packet) processing only involves the FPGA/ASIC part of the DPU (fast pass-through). It doesn’t involve the CPU or memory on the DPU, and it doesn’t involve the host. At the second level, the (packet) processing bypasses middle nodes in the distributed system: the client directly reads/writes from the end data node. Besides, FastPath usually requires RDMA and NVMe-oF, which bypass the destination host CPU to access DRAM or SSD.

    • Other typical hardware offloading usages: 1) offloading compression/encryption and CRC calculation to the DPU or an RDMA SmartNIC; 2) Smart SSD, where databases can push down filters or GC to the on-SSD chip, avoiding the PCIe bottleneck between SSD and CPU.

  • Increasing adoption of zoned devices to reduce manufacturing cost, i.e. SMR drives and ZNS SSD. They expose a similar Zoned Block Device (ZBD) interface.

    • SMR drives allow higher HDD areal density. Writes are subject to zone-level append-only and open-zone limits. The storage engine or filesystem needs to be customized.

    • ZNS SSD is essentially an SSD with a very simplified FTL. The cost is lower, and it allows the storage engine to highly customize GC and data placement.

  • The DPU is frequently mentioned with FastPath and all sorts of bypassing:

    • Bypassing the DPU’s on-chip CPU and memory, the local host’s CPU and memory, the load balancer (client-cached routing info), the remote host’s CPU to access DRAM (RDMA), and the remote host’s CPU to access SSD (NVMe-oF).

    • Other types of bypassing are kernel bypassing (userspace filesystem), filesystem bypassing (userspace storage engine), Intel DDIO, and RDCA (Remote Direct Cache Access).

  • Related materials for Pangu 2.0, mostly published at FAST23: Alibaba Pangu 2.0, Pangu FS client: Fisc, Pangu RDMA, Pangu networking: Solar, Pangu DPU: X-Dragon, Pangu SMR.

Data organization

Traditionally, “data organization” refers to physical columnar/row-wise data layouts in databases. I choose to view data organization from a broader perspective, divided by purposes.

  • Durability tier. The basic need in organizing data in a storage system is to make it durable. Replication is common, at the cost of storage efficiency, yet vulnerable since corruption can be simultaneously replicated. Replication also couples with the performance tier to balance reads across extra replicas. Erasure Coding (EC) reduces storage space and improves durability, at the cost of data reconstruction. Consistency is a common issue that accompanies replication. End-to-end CRC and periodic scrubbing are necessary to protect against corruption that happens on the write path, during data transformation, or silently at rest. Backup and geo-replication are standard setups for disaster recovery, while time travel is handy to recover from manual errors by restoring an earlier version.

  • Query tier. Disk data needs to support reads and updates. Common accesses are sequential/random reads, appends, and updates (or read-modify-write) in storage-system terms, and point/range queries, scans, inserts, and updates (or update-after-query) in database terms. Traditionally, disk data serves both the durability tier and the query tier coupled together, which incurs cost on the write path to maintain a read-optimized format. Separating the read path and write path can help, or the read path can move entirely to the performance tier, e.g. in-memory databases. The query tier can further specialize for OLTP, OLAP, and Datalake, which share the main techniques but vary in query patterns, consistency, data scale, and structured data.

  • Performance tier. Commonly these are extra data copies to balance reads, an SSD tier for caching (which may also serve part of durability), a PMEM staging area to absorb and sequentialize repeated random writes, plain memory caching, or an in-memory DB that moves all computation to memory. When used as cache, SSD or memory can target small blocks rather than entire chunks from disk; see the Data caching section. Data organized in memory is more attached to indexes than data on disk; see the Data indexing section.

  • Scaleout tier. To cope with increasing volume and high throughput targets, data is partitioned, replicated, and placed so as to be served from more machines. Resource scheduling for heterogeneous job sizes, load balancing, and migration follow. See the Data partitioning section. Consistency is always a problem. On a single node it easily relies on CPU cache coherence, but scale-up is bottlenecked by CPU power/heat and the cost of cache coherence across too many cores. In distributed systems, the consistency of distributed transactions still incurs high networking cost, unless relaxed with an App-level agreement.

Essentially, the query tier carries most DB techniques when it wants to be performant, while the durability/scaleout tiers are orthogonal to it and can be offloaded to a shared storage system, and the performance tier is usually addressed by caching. We focus on the query tier for data organization, and discuss the performance/scaleout tiers in other sections.

Durability tier

We covered replication in the Consistency section and will see more about CRC and scrubbing in the Data integrity section. Below we briefly expand the design space for Erasure Coding (EC).

  • Storage overhead. The main goal of EC is to store data with comparable durability but less storage space, compared to plain replication.

  • Durability. Data must be recoverable if a set of disks goes bad. Data must be available (with reconstruction) to user reads if a set of nodes goes offline.

  • Performance. Compared to plain replication, reading (reconstructing) EC data incurs significant cost when part of the data is offline, especially in tail latency. With fewer storage copies, the total aggregated bandwidth available to serve reads is capped.

EC codecs have great richness in schema variety, especially combined with cluster layouts and user traffic patterns. Briefly, the main schemas fall into the classes below:

  • Reed-Solomon codes. The standard textbook schemas where each data symbol is symmetrically involved in each parity symbol. The code is MDS, which means it can recover the most loss patterns possible given a fixed storage overhead.

  • Local Reconstruction Codes (LRC). To reduce the bandwidth needed in reconstruction reads or data repair, some parity symbols involve fewer data symbols. In other words, the schema improves performance at the cost of recoverable loss patterns.

  • Regenerating codes. Another approach to reduce bandwidth in reconstruction. MSR (Minimum Storage Regenerating) codes reach the lower bandwidth bound without penalty on storage overhead. The code construction is usually more complex and involves more computation.
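To make the local-reconstruction idea concrete, below is a toy layout built on plain XOR parity. This is not a real Reed-Solomon or LRC construction; it only shows how a local group repairs a lost block with fewer reads than a global reconstruction would need:

```python
from functools import reduce

def xor(blocks):
    """XOR equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, t) for t in zip(*blocks))

def make_lrc(data):
    """Toy LRC-flavored layout over XOR parity: 4 data blocks in 2 local
    groups of 2, each group with a local parity, plus one global parity."""
    g0, g1 = data[:2], data[2:]
    return {"data": data,
            "local": [xor(g0), xor(g1)],
            "global": xor(data)}

def repair_local(layout, lost_idx):
    """Rebuild a single lost data block from its local group only:
    2 blocks read here, vs 4 for a global reconstruction."""
    group = 0 if lost_idx < 2 else 1
    peers = [layout["data"][i] for i in range(2 * group, 2 * group + 2)
             if i != lost_idx]
    return xor(peers + [layout["local"][group]])
```

The trade-off from the text shows up directly: local parities cut repair fan-in, but XOR parity alone tolerates fewer simultaneous losses than an MDS code with the same overhead would.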

Data layout for query tier

At a high level, we first capture the desired goals of a data layout. Ideally we want every goal to be optimal, which so far is impossible. Trading off between goals composes the design space.

  • Read amplification. To return the desired value, how many extra data reads are needed, in count and size? Locating the data starts from an index lookup. Without a fine-grained index, reads need to scan through the entire chunk. More chunks containing stale data are involved if chunks host overlapping ranges. Ideally, if any data can be located accurately, scans are not even needed. Read amplification can be amortized by batching queries, at a cost in latency.

  • Write amplification. To write a piece of data, how many extra writes are needed, in count and size? An in-place update in an array can kick off ripple data movements if the slot is too small. Append-only systems trigger background writes for GC/compaction. Background jobs also do rewrites. Write amplification can be pushed offline from the write path, at a cost of space amplification. Deletes can be treated as a special type of writes. Note that extra reads can also accompany write amplification in data movement.

  • Space amplification. Compared to only user data, how much extra storage space is spent at the query tier? This includes unreclaimed stale (or deleted) values, empty slots pre-allocated for inserts, internal fragmentation inside pages/chunks, and external fragmentation that skips allocation. Space amplification can naively be reduced by running GC/compaction more frequently, at a cost of read/write amplification. Storage space goals are critical to Cloud Storage COGS, which sells by capacity.

  • Sequentialize reads. HDDs favor sequential reads. We want the next read to hit a previous prefetch, multiple reads to batch into one bigger read, and range queries to map to sequential on-disk scans. We want data locality to be preserved for the access pattern.

  • Sequentialize writes. HDD/SSD favor sequential writes. Append-only systems sequentialize all writes. In-place update systems are harder, but it is possible with pre-allocated empty slots or filesystem extents.

  • Compression. Compression is critical to storage efficiency at the query tier. It also reduces amplification by transferring less data in reads/writes. Compression needs to work with encryption, where CBC (cipher block chaining) can randomize identical blocks. Packing similar data together makes compression more efficient (e.g. columnar layout). Transfer overhead can be reduced by directly querying and passing compressed blocks (late materialization). Queries become even more efficient with SIMD vectorization and JIT compilation.

  • Index lookup. An ideal data layout should make it easy for index lookups to serve reads or find write locations. Index structure and traversal can be embedded into data units, or data can be clustered into the index. Given limited index size and granularity, data chunks can carry second-level min-max sketching, zone maps, or bloomfilters. Data can also be compressed and support range queries without a separate index; see Succinct data structures.
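The min-max sketching (zone map) goal above can be sketched directly: a range query consults per-chunk [min, max] sketches and skips chunks that provably contain no match. `ZoneMap` is an illustrative name:

```python
class ZoneMap:
    """Per-chunk min-max sketching: a range query only scans chunks whose
    [min, max] overlaps the queried range; all other chunks are skipped."""
    def __init__(self, chunks):
        # chunks: list of key lists (need not be internally sorted)
        self.chunks = chunks
        self.sketch = [(min(c), max(c)) for c in chunks]

    def range_query(self, lo, hi):
        hits, scanned = [], 0
        for (cmin, cmax), chunk in zip(self.sketch, self.chunks):
            if cmax < lo or cmin > hi:
                continue            # data skipping: the sketch proves no match
            scanned += 1
            hits += [k for k in chunk if lo <= k <= hi]  # full scan of the chunk
        return hits, scanned
```

The sketch is two values per chunk, so read amplification shrinks while index size stays tiny, exactly the trade the text describes.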

Next, we define the data units, e.g. how big a block is, chunks, files. We need to decide at which level of data unit each property is enforced, at which granularity indexing happens, the data placement & migration unit size, etc. Listing data units from small to big:

  • Individual key-value, or only the value if the key can be derived, e.g. an incremental row id in columnar layout. This is the smallest data unit. Indexing every key-value at this layer is usually prohibitively expensive.

  • Row group. A file can have multiple row groups. It may still be too small for indexing, but carries its own min-max sketching and aggregation statistics. Examples are Parquet and the AnalyticDB row-column layout. A row group contains all columns for a set of rows, while inside the row group data is organized columnar.

  • Chunk. I use “chunk” to universally denote the smallest data unit to index. An example is the “SST file” in RocksDB, where a read first locates a chunk (think of a “shabby” index here) and then full-scans it (optimizable by per-row-group sketching). Another example is the “page” in a B+-tree index (such systems usually don’t have row groups), where we need to consider the record layout inside. The next example is the “block” in filesystems, which is indexed by inode trees; or “extents”, where the allocator assigns a larger space than asked, to append future writes.

  • Partition. The smallest unit for choosing which server hosts the data. It’s where data starts to participate in a distributed system, and it is the unit of placement, replication, and migration.

  • Data unit for classification. Storage systems need to decide a data unit as the level for tracking and classification. Classification is a common problem in storage systems for efficient GC/compaction, temperature tiering, and various background jobs. Machine Learning can be used but rarely is, mainly due to the metadata size and computation cost across vast tracking units. The unit of classification can be either bigger or much smaller than a partition, depending on the tracking cost one is willing to pay.

    • Generation. I.e. the “level” in an LSM-tree or RocksDB. It marks how many GC/compaction rounds the data has gone through, classifying how likely the data won’t be deleted/overwritten in the future. LSM-tree couples more properties with generation, e.g. chunk size, sort runs, compaction strategies, which is a design choice but not a necessity.

    • Temperature tiering. A tag with statistics to classify how likely the data will be accessed in the future under certain traffic. Efficient ways to offload cold data to cheaper storage media are critical for storage space efficiency, while quick response to sudden user reads still needs an (asymmetric) data unit of transfer. Separating GC/compaction strategies between cold and hot data is also beneficial.

    • Workload streams. The storage system serves mixed user workloads. “Stream” here means separating out the data operations from a single workload (e.g. a single App, a single content, a single user). The concept came from the NVMe protocol, and an example is FStream. Practically, “stream” groups similar data together to yield better compression and dedup, and to share close lifecycles in GC/compaction and temperature.
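Indexing at chunk granularity, i.e. "locate a chunk, then scan it", can be sketched with a sparse index over each chunk's first key. `SparseIndex` and its `chunk_size` are illustrative:

```python
import bisect

class SparseIndex:
    """Index only at chunk granularity: keep each chunk's first key, binary
    search to pick the single candidate chunk, then scan inside it. Index
    size is one entry per chunk, not one entry per record."""
    def __init__(self, sorted_records, chunk_size=3):
        # sorted_records: (key, value) pairs sorted by key
        self.chunks = [sorted_records[i:i + chunk_size]
                       for i in range(0, len(sorted_records), chunk_size)]
        self.first_keys = [c[0][0] for c in self.chunks]

    def get(self, key):
        # the candidate chunk is the last one whose first key <= key
        i = bisect.bisect_right(self.first_keys, key) - 1
        if i < 0:
            return None
        # full scan inside the chunk (could be sped up by in-chunk sketching)
        return next((v for k, v in self.chunks[i] if k == key), None)
```

Coarsening `chunk_size` trades index size for read amplification inside the chunk, the same knob the text describes for SST files and B+-tree pages.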

At the next level, we abstract the properties of a data layout. They constrain the physical data organization intra/inter data units, which writes pay to maintain and from which reads benefit for speedup. Below lists properties from small to big data units, mapping properties to high-level goals. Applying properties and trading off between them compose the design space for various techniques.

  • At the key-value level, a common technique is to separate keys and values (WiscKey). Most compaction happens on keys, which saves rewrite amplification on the (big) values. Another technique is deduping common key prefixes, which saves storage space. Examples are the “column family” in HBase, trie trees, and Masstree.

  • At the row group level, a notable property is whether data is stored in columnar or row format. OLTP databases favor row format, where data is organized as rows and rows pile into a page. OLAP databases favor columnar format, where data is organized as columns; values from one column are stored consecutively in a row group, followed by the next column.

    • Columnar format. Since a column packs similar data, compression is more efficient, which reduces storage space, and less data is read during scans. Common OLTP workloads can hardly generate columnar format from the start, thus they pay write amplification to batch and rewrite. Querying one column and then looking up another column in the same row, however, incurs extra IOs and non-sequential reads, because columns are stored at different locations. Common columnar format examples are Parquet and Apache ORC.

    • Row format. Scans involve unnecessary columns, i.e. read amplification. Compression is less efficient compared to columnar format, which also costs read transfers. But updating/inserting can directly operate in units of rows, and looking up all columns in one row costs only one read.

Continuing with data layout properties: at the chunk level, many properties come into play, such as whether data is sorted (or partially sorted), overlapping between chunks, cross-chunk linking, and allowing in-place updates. They further couple with intra-chunk or inter-chunk scope. Much of LSM-tree compaction optimization is about this level.

  • Sorted, intra chunk. Examples are RocksDB SST files, or column values in columnar format, which store records sorted. Sorting favors lookups to locate a record, allows sequential reads in range queries, and eases building an external index or embedding an index inside the file. However, since users write in arbitrary order, sorted data cannot be obtained from the start, unless it is buffered in memory or write amplification is paid for a rewrite. Besides, sorted data enables more efficient compression algorithms, e.g. Run-Length Encoding (RLE).

    • In-place updates. Maintaining both intra-chunk sorting and in-place updates is hard. Giving up internal sorting, the sort property can be pushed to the inter-chunk level, so that read amplification is still capped and an index at chunk granularity can still be built. To absorb inserts, a chunk can pay storage space to pre-allocate empty slots, or pay extra writes to move records elsewhere.

    • Index sort order vs data sort order. Database records can appear sorted by the index (e.g. traversing a B+-tree in order), but be random on disk. Though a range query saves lookups by leveraging the index sort order, the on-disk scan still suffers from random reads. To align on-disk data with the sort order, it can pay amplification for a rewrite. Or, let the index leaf level have larger chunks with sequential reads inside, then jump to the next chunk. However, secondary indexes can hardly achieve data sort order on secondary keys, though this can be compensated by Z-Order at a cost of read amplification.

  • Sorted, inter chunks. The example is “tiering” vs “leveling” in RocksDB. “Leveling” requires chunks are non-overlapping, i.e. a sorted run, or chunks have a total sort order. It favors read to quickly locate only one chunk that needs scan. However, maintaining the inter chunk sort property requires more eagerly paying writes in compaction. In “tiering”, chunks can have overlapping key ranges. Breaking “sorted” property relaxes writes, but read may need to scan multiple chunks.

    • Overlapping. Can chunks have overlapping key ranges? This is another way to say whether inter chunk sort property is enforced.

    • Partially sorted, inter chunks. The example is “Guards” in PebblesDB. A guard contains multiple chunks which can overlap, but cross guards there is no overlapping. It creates a tunable balance between read/write amplification.

    • Key assignment to chunks matters when maintaining the sort/overlapping property inter chunks. By partitioning keys into non-overlapping ranges (or hashing) and assigning to different chunks, it ensures chunks non-overlapping. You can see data partitioning is not only for scaleout, but also a method to separate conflict spaces that eases algorithm handling. Besides, it also separates addressing space, which reduces metadata size, as you see in Metadata section.

    • Fixed/variable sized blocks. For example, chunks of SST files are variable sized, database pages are fixed size, and storage system may either use fixed sized or variable sized blocks. Fixed size blocks are commonly seen in traditional filesystems, updated in-place, where size variety is however handled by allocator (which can be tricky to be robust). Internal fragmentation can waste space inside blocks. Variable sized blocks favor append-only systems and compression which outputs unpredictable size. Index metadata has to be larger, as no fixed size of tracking units. In balance of the two: 1) The system can take a minimal write size e.g. 4KB, so index metadata size is reduced even for variable sized blocks. 2) Allocate by a large “extent” rather than individual blocks, so that inside the extent it can append fixed size blocks and reduce external fragmentation.

    • Compensation by index. Having an index can ease chunk maintenance. If the chunk means B+-tree index pages, it needs to maintain both non-overlapping and fixed size block property. This is done by key assignment to chunks guarded by the index itself. With overlapping chunks, instead of scanning all matched ones, a global tree/hash index can tell whether certain key exists in a chunk, so as bloomfilters (which is commonly used). In a word, index compensates read amplification.

  • Cross chunk linking. The example is Forwarding pointers in LSM-tree that level L-1 chunks can embed pointers to level L chunks. When a read scanned level L-1 but didn’t find matching records, it can follow forwarding pointers to level L, which saves scanning startover. E.g. REMIX LSM-tree. Essentially, the method is to embed an index at inter chunk level. Note we also mentioned indexing at chunk internal, or a separate external index. Conceptually, index leverages connections between data to build, this right happens when chunks have overlapping key ranges across LSM-tree levels.

  • Data locality. Data to be accessed together should be located physically close, so that one read can fetch it all. This can happen at the node/partition level to save cross-node networking, within the same chunk/block to be cached as one unit, or aligned in neighboring records to benefit from prefetch. An example is graph databases, where edges and vertices are accessed one by one in traversal order.

    • Data clustering & data skipping, as we can also call it. Data clustering means data frequently accessed together should be physically packed together, so it can be prefetched or returned in one sequential scan. It becomes tricky when trying to pack different DB table fields; an example is Z-Order. Data skipping is the opposite of data clustering: it tries to skip as much unnecessary data as possible during a disk scan. It leverages embedded sketching filters or indexes, and avoids clustering too much unrelated data.
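
As a sketch of how Z-Order packs two column ranges into one clustering key, the classic Morton-code bit interleaving can be written as below (a minimal illustration, not a production encoder):

```python
def z_order(x, y, bits=16):
    """Interleave the bits of two column values into one sort key (Morton code).
    Rows close in (x, y) end up close in z-order, so sorting/chunking by this
    key clusters data for range predicates on either column."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return z

# Sorting rows by z_order(row.x, row.y) gives a single physical order that
# serves range filters on x and on y reasonably well at the same time.
```

Combined with per-chunk min/max statistics, this is also what makes data skipping effective on multi-column predicates.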

Other data layout properties exist at the partition and classification data unit levels:

  • Partition level. It maps to the chunk level to serve individual queries. There are a few other “larger scale” properties:

    • Replication and placement affect how queries are served at the distributed system level, but are more properly discussed in the Data partitioning section. Colocation places data used together on the same node to benefit prefetching and save networking coordination cost.

    • Interoperability. A datalake, e.g. Delta Lake, uses open formats (Parquet, Apache ORC, JSON, CSV) for its internal data, metadata, and transaction logs. This allows any other app to interoperate, and allows launching a new server anywhere else in the cloud to resume processing.

  • Classification level. It maps to an individual chunk or a group of similar chunks as the tracking unit. The grouping can either be physical, locating chunks together, or logical, tracking similar chunks with metadata.

We can summarize data layout properties by exploring two extremes, a write-optimized layout and a read-optimized layout. We can tune properties to watch the transition between the two.

  • Write-optimized layout. Newly updated/inserted data is sequentially appended to the end of the log without any special handling. The write path has the lowest cost.

  • Read-optimized layout. Chunks are fully indexed at key-value granularity. Chunks are internally sorted, non-overlapping, and large enough to avoid fragmented IOs from a range query. A columnar layout is used if there are not too many cross-column lookups. Fields frequently accessed together are packed close.

  • Transition from write-optimized to read-optimized layout. Applying the various properties, we can observe three trends: 1) introduce sort order, 2) reduce tracking granularity, 3) group similar data together.

    • Introduce sort order. A query can exploit sort order to locate data more quickly and skip unrelated data. On-disk access also benefits from sequential reads. Sorting also benefits compression such as RLE. Chunk-internal sort is usually done by a rewrite. Inter-chunk sort can be dialed from loose to tight, by directing records through guards or an index.

    • Reduce tracking granularity. The benefits come in indexing and skipping. With smaller granularity located and more unrelated data filtered out, queries save more reads. Metadata overhead is the trade-off, where low-level index parts / sketching / statistics can be embedded in chunks rather than pinned in memory. Chunks can be cut smaller, with more balanced sizes, and embed variety at the row group level.

    • Group similar data together. Examples are: separating keys and values, columnar formats that group values from a single column, generations or LSM-tree levels that group data by lifecycle, temperature tiering that groups cold/hot data, workload streams that group similar data from a single workload. Such classification, whether based on type rules, statistics, or Machine Learning, is effectively useful everywhere, e.g. compression, scanning, GC/compaction, and lifecycle-related data movements.
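
The transition can be illustrated with a minimal sketch (hypothetical helper `to_read_optimized`, names of my own choosing): it rewrites an append-only log into sorted, non-overlapping chunks, introducing sort order and dropping stale versions in one pass:

```python
from itertools import islice

def to_read_optimized(log_records, chunk_size=2):
    """Rewrite an append-only log of (key, value) pairs (later writes win)
    into sorted, non-overlapping chunks of at most chunk_size entries."""
    latest = {}
    for key, value in log_records:   # later duplicates overwrite earlier ones
        latest[key] = value          # stale versions are reclaimed here
    items = iter(sorted(latest.items()))  # introduce global sort order
    chunks = []
    while True:
        chunk = list(islice(items, chunk_size))
        if not chunk:
            break
        chunks.append(chunk)         # chunks come out non-overlapping by key
    return chunks
```

This is essentially what a full compaction does; the knobs discussed later (granularity, candidates, triggers) decide how much of this rewrite to pay and when.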

More about optimized layouts

  • Space-optimized layout. Space amplification is important to Cloud Storage COGS but receives less attention. A write-optimized layout hurts space efficiency due to unclaimed stale values. A read-optimized layout hurts space efficiency if it keeps internal fragmentation in pages, blocks, or pre-allocated empty slots. Efficient compression is also required. A space-optimized layout can be a columnar layout with closely packed records, which can seemingly be achieved together with a read-optimized layout. If we accept rewrites, we can also absorb newly ingested data with a write-optimized layout.

  • Balanced-optimized layout. Considering the cost of GC/compaction, we can hardly achieve both write- and read-optimized simultaneously. A balanced layout is worthwhile, and it is only optimal when tailored against the target App workload. This is essentially a Machine Learning problem, which Optimal Column Layout explored for in-place updates (via binary linear optimization).

Garbage collection (GC) / Compaction

GC/compaction is common in append-only or LSM-tree systems and quite bandwidth consuming. Update-in-place systems can also use compaction to generate a read-optimized layout, and need GC if some new values are temporarily written out-of-place. I choose to mix the notions of GC/compaction because both reclaim stale/deleted values. Compaction can replace GC in the LSM-tree, and GC can go without compaction if an index/bloomfilter/versioning tells which keys are stale.

Typical design goals for effective GC/compaction are below. They map to the goals of data layout.

  • Be timely enough to reduce read amplification incurred on user reads.

  • Be timely enough to reduce space amplification.

  • Pay less for write amplification either inline in write path or offline in background jobs.

  • Arrange sequential reads and sequential writes when possible.

  • Spend a reasonable amount of CPU/memory/disk. Compete less with user traffic and avoid stalling it.

The design space of GC/compaction consists of a series of “knobs” choosing when and how to run:

  • Size granularity. How big is the data unit selected for GC/compaction? It can be an individual chunk, a group of chunks, or a compaction with all chunks in an LSM-tree sorted run or level. A chunk can be configured either small or large. Essentially, enforcing sort order on a wider range implies correspondingly larger compaction granularity, which benefits reads but is more costly to maintain. A large granularity costs less tracking metadata but incurs more rewriting of unnecessary data.

  • Selecting candidates. Which chunk to GC/compact with which other chunks? A GC/compaction run is more efficient if it removes the most stale values, or when the chunks have more overlap. GC of a new chunk can also be deferred so it accumulates more stale values. Proper indexing and statistics tracking can help here. Selecting the best candidates optimizes GC/compaction.

  • When to trigger. When to start running GC/compaction? It can be when storage space fills up, a certain LSM-tree level reaches its max size, a chunk accumulates enough old stale keys, a periodic timer fires, recent read/write cost reaches an alarm threshold or writes stall due to pending work, or user traffic is low enough. The goal is to proactively maintain system properties while minimally impacting user activity.

  • Where to run. Traditionally GC/compaction needs to run on the local node to save network/disk transfer. However with shared storage, chunks can be loaded by other nodes to scaleout the computation. They can also have extra replicated copies to balance reads, or smart SSDs to compute in hardware. Moreover, GC/compaction can store data in cloud storage (e.g. S3), disaggregate storage components (e.g. Snowflake), and offload computation to the cloud (e.g. Remote Compaction).
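
A minimal sketch of candidate selection (hypothetical `pick_gc_candidates`; the field names are my own): greedily pick chunks with the highest stale ratio, bounded by a rewrite budget, similar in spirit to cost-benefit cleaning in log-structured filesystems:

```python
def pick_gc_candidates(chunks, budget_bytes):
    """Greedy candidate selection: reclaim the most stale bytes per byte
    rewritten, within a rewrite budget. Each chunk is a dict with
    (assumed) fields: name, live_bytes, stale_bytes."""
    def stale_ratio(c):
        return c["stale_bytes"] / (c["live_bytes"] + c["stale_bytes"])

    picked, spent = [], 0
    for c in sorted(chunks, key=stale_ratio, reverse=True):
        cost = c["live_bytes"]  # live data must be rewritten elsewhere
        if spent + cost <= budget_bytes:
            picked.append(c["name"])
            spent += cost
    return picked
```

Real policies also weigh chunk age and overlap with compaction targets; the stale-ratio greedy here only captures the "remove the most stale values" goal.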

Based on the data units for classification (mentioned previously), e.g. generation, temperature tiering, workload streams, different GC/compaction strategies (the above) can be applied.

  • Generation. I.e. the levels 0, 1, 2, .. N in an LSM-tree or RocksDB. Each level is typically configured with different max sizes, chunk sizes, and GC/compaction frequency. They can also use different tiering vs leveling strategies. An example is Lazy Leveling in Dostoevsky. Roughly, lower levels incur more write amplification because they compact more frequently, while higher levels incur more read amplification because of scanning large chunks.

  • Temperature tiering. It’s beneficial to delay compacting hot data and keep it in memory, where it accumulates more stale values to GC/compact in one run. Cold data should be separated from hot data, to avoid being polluted and forced into rewrites altogether. GC/compaction can run more infrequently on cold data because it has less activity. An example is TRIAD.

  • Workload streams. A stream groups data with a similar temperature hot/cold level. Correlated data with similar lifecycles is more likely to be deleted/updated together, and thus reclaimed in one GC/compaction run. An example is files in Google Drive, where a file is deleted as a whole, but mixing blocks from different files in one chunk results in fragmented lifecycles.

HyPer Scalable MVCC GC paper also gives another similar categorization of GC designs (for DB MVCC versions): Tracking Level, Trigger Frequency, Version Storage, Identification of obsolete versions, Removal of garbage versions.

HyPer paper GC categorization


Columnar formats organize data in a compressed way. The compression algorithms still allow reading records directly without decompression. The compression algorithm selection taxonomy below not only reflects common properties in data, but also how data is organized under compression to query efficiently.

What compression schema to use in columnar format

The above is categorized as “Columnar compression”. There are more compression algorithm families available for storage systems. They can be categorized as below:

  • Block based compression. The classic daily-used compression algorithms. LZ77 is the heart of most of them: ZIP, LZ4, the LZ* family, GZIP, DEFLATE, 7-ZIP, RAR, Zstd. LZ77 does dedup, where text tokens pointing to each other virtually compose a dictionary. LZ77 is usually used with an anti-entropy stage to further shorten the bit representation, see “Anti-entropy compression”.

  • Explicit dictionary based compression. A DB compressing string values can specialize to use algorithms like the FSST string symbol table, whose core is a lookup dictionary. Zstd also provides a dictionary mode, where a pre-trained dictionary can be supplied to compress small documents.

  • Succinct data structures. We mentioned them before. Besides, LZ-End is an interesting algorithm recognized in research. It slightly modifies LZ77 to support random access without block decompression, but still needs a few extra lookups with address jumps.

  • Columnar compression. Columnar DBs use it to compress columns. This family of algorithms is shown in the taxonomy picture above. Such compression assumes column data shares similarity. They typically support index point lookup, scan, and query filtering without decompression (Late Materialization).

  • Anti-entropy compression. Used along with the LZ77 family, these algorithms select bit representations for entries according to their frequency, so that the total bit length is shorter. E.g. Zstd uses LZ77 + FiniteStateEntropy.
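
To illustrate querying without decompression in the columnar family, a minimal run-length encoding (RLE) sketch is below (toy functions of my own naming); the `count_equal` query runs directly on the compressed runs:

```python
def rle_encode(column):
    """Run-length encode a column into [value, count] runs.
    Sorted or well-clustered columns compress best."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def count_equal(runs, value):
    """Answer SELECT COUNT(*) WHERE col = value on the compressed runs,
    without materializing the original rows (late materialization)."""
    return sum(n for v, n in runs if v == value)
```

One run is inspected per distinct value stretch, so the query cost scales with the number of runs, not the number of rows.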

Data indexing

Data indexes commonly reside in memory (i.e. DRAM) and hold forward links to data. Residing in PMEM is possible, but by far it’s still slower than DRAM. Data indexes stem from standard textbook data structures, evolve into more complex forms for industrial use, and scaleout in distributed systems. They serve read queries, point to where to write, and carry a cost to maintain consistency with data.

Data indexing section

Data index properties

We can summarize the common properties of data indexes. They compose the design space and the various techniques available:

  • Structure. A data index typically has a base structure, e.g. trees, hashtables, lists.

  • Sort order. An example is tree vs hashtable. Tree-based indexes commonly maintain the ordering between data, which enables range queries. A hashtable is known for O(1) lookup time, but global ordering is lost. Though a tree-based index can hardly preserve ordering on secondary keys simultaneously, and a hashtable can track ordering by maintaining a tree index alongside. In other words, combining multiple indexes together is a way to join read properties, at a cost on updates.

  • Point lookup. All data indexes support point lookup, with time cost typically ranging from O(1) to O(log(n)). Essentially, there is a trade-off with memory size: 1) If the entire key space can be put in memory, we simply need a huge array to map any key to its value. 2) A hashtable collapses the mapping space, with the hash as the mapping kernel, thus a smaller memory size is needed. The new space has to be sparse enough, due to the unpredictable degree of balance in the mapping (unless perfect hashing). 3) Coming to trees, keys are indexed by inter-connections rather than address mapping, thus an even smaller memory size is needed.

  • Range query. Data indexes that preserve sort order can support range queries, typically trees. Otherwise it has to be a full scan, unless guards/segmentation are applied to preserve partial sort order, of which the skiplist can be seen as an example. Another way to understand range query is that the data index must support looking up a key’s neighbors, even if the key itself doesn’t exist.

  • Update/insert/deletion cost. An index is essentially a constraint on how data is organized, which implies cost must be spent in the write path to maintain that constraint. Linked structures are easier to insert into, while packed arrays have to move data if no empty slots are left. Besides, a second extra cost can be spent in/off the write path to: 1) Rebalance the data structure to reduce tail latency (e.g. Red-black tree rotations). 2) Handle addressing conflicts (e.g. hashtable). 3) Expand or shrink space (e.g. expand the array 2x when a hashtable is full, or shrink likewise). 4) Garbage Collection / GC (e.g. epoch-based memory reclamation). 5) Compact deltas (e.g. Bw-tree page deltas).

  • Read-only. Some data indexes, e.g. SuRF, don’t support updates except by a total rewrite or very expensive operations. Such indexes can be packed in consecutive arrays, highly compressed to favor memory size, and with delicate interleaving to speedup (range) queries.

  • Sequential reads. Is memory access more sequential when jumping to lookup keys, when neighbor keys are accessed next, and when scanning range queries? This affects CPU cache efficiency, where tree-based indexes commonly do better than hashtables. Another dimension is on-disk sequential reads, if the index has on-disk components.

  • Sequential writes. Do data writes to disk follow sequential access? A typical example is B+-tree vs Be-tree. The Be-tree buffers small writes in middle nodes to flush to disk sequentially. Even append-only logs can be used to buffer updates for DRAM indexes and flush them in sequential batches. The LSM-tree can be seen as another type of index that achieves append-only sequential writes to disk.

  • Cache affinity. How efficiently does the data index use the CPU cache when accessed? Common measures are cache misses, IPC (instructions per cycle), branch prediction misses, pipeline stalls, memory waits, and memory writes (vs in CPU registers). Typical techniques include: 1) Embed pointers in the data struct, rather than using an explicit node struct. 2) Pack data structures to align with cache lines. 3) Avoid false sharing. 4) Exploit sequential data structures.

  • Index memory size. How much memory is needed by the data index, or more generally the metadata size of a data storage. Tree-based indexes suffer from cross-node pointer size and intra-node fragmentation. Hashtables, however, need to leave empty slots to avoid conflicts. An example is the ART tree, which tailors smaller node sizes when less occupied. Other techniques are: 1) Pointer swizzling that packs data into the tail bits of a pointer. 2) Replacing pointers with shorter-bit IDs according to the max record count. 3) Data partitioning to reduce the address space and thus pointer size. More effective ways are to decouple and scaleout, see the Metadata section.

  • Concurrency. Examples are lock coupling in B+-tree, per-inode logs in NOVA, and page deltas in Bw-tree. Common techniques are: 1) More efficient shared/intent/exclusive locking protocols, smaller lock granularity and duration. 2) Data partitioning so multiple locks can work in parallel. 3) Lock-free data structures, which however need careful design for high race conditions. 4) Symmetric parallel copies, i.e. sharding the space into non-interleaving but identical processing flows, e.g. one thread per disk doesn’t even need locking, e.g. requests targeting different files.

  • Compression. Keys can dedup common prefixes to save memory size, e.g. trie tree or Masstree. Nodes having few children can be merged into one (path compression). Less occupied nodes can trim their container size (e.g. ART tree). A big B+-tree node can also compress its contents. Cold pages can even employ memory compression or be offloaded to disk. Succinct data structures compress data while providing the same ability for search and range query, without needing a separate index.

  • Fuzziness. Data indexes may return false positives, e.g. Bloomfilter, SuRF. Allowing inaccurate results enables a new family of highly memory-efficient indexes. They can also be seen as sketch structures, e.g. min-max sketching and zone maps, commonly used in DB data chunks.

  • Data clustering. Just as an index can be embedded in data chunks as forward pointers, data can also be clustered into the index. This means data reads have one less fetch after traversing the index, and data is aligned with the index’s physical order. “Clustered index” is a database term.

  • Disk components. There are two dimensions: 1) Do the working parts of the index reside exclusively in memory? 2) How is the index recovered from disk to memory after a node restart?

    • Disk working parts. A common hashtable, skiplist, or ART tree resides only in memory. However, a B+-tree has lower-level pages residing on disk, loaded into memory on demand. The Bw-tree can also flush page deltas to disk and track them with linking pointers. In general, the disk transfer unit is the page. But the problem can be thought of in another way: which working parts of the index reside in CPU cache, and which in memory? Because cache hardware hides most complexity, the discussion falls into CPU cache efficiency and multi-core concurrency.

    • Disk recovery. A naive approach is to log every operation on the index to disk in append-only fashion. However, replaying (days of) full logs on restart is way too slow. The second approach is to keep a short tail of the log and periodically flush checkpoints to disk. This is what the LSM-tree does. The approach also applies to B+-trees and databases, where pages are synced to disk on demand (also known as checkpointing, but not requiring one full flush) and recovery follows the more delicate ARIES protocol.
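
Several of the properties above (sort order, point lookup, neighbor-based range query) can be illustrated with a minimal packed sorted-array index (a toy `SortedArrayIndex`, much like the value list inside a B+-tree node):

```python
import bisect

class SortedArrayIndex:
    """Keys packed in one sorted array: point lookup via binary search,
    range query via neighbor positioning. Illustrates why preserving
    sort order buys range scans that a hashtable cannot offer."""
    def __init__(self, items):
        self._items = sorted(items)               # list of (key, value)
        self._keys = [k for k, _ in self._items]  # packed, cache-friendly

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._items[i][1]
        return None

    def range(self, lo, hi):
        """All (key, value) with lo <= key < hi; the bounds themselves
        need not exist — we only need each bound's neighbor position."""
        i = bisect.bisect_left(self._keys, lo)
        j = bisect.bisect_left(self._keys, hi)
        return self._items[i:j]
```

The update cost trade-off also shows: inserting mid-array shifts elements, which is exactly why update-heavy indexes switch to linked or tree structures.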

There are quite a few well-encapsulated data indexes widely used in industry. They are briefly listed below. They are reference architectures and a source of techniques. Optimization matters in data indexes.

  • Hashtables. The plain old hashtable is still useful in DRAM indexing, PMEM, and database hash indexes. Hashtables vary in how they resolve address conflicts: how to choose the next address, and how to store conflicting keys. The second-level container for conflicting keys is also worth optimizing. A hashtable can simultaneously use one, two, or several different hash algorithms, and target them to App-level knowledge. Smooth capacity expansion and shrink is another optimization point. Data partitioning helps reduce the conflict space and shorten address pointers (e.g. Kangaroo).

    • Known hashtables are Cuckoo hashing that bounces between two hashtables, HotRing that switches hot keys to front, Consistent Hashing and Ceph CRUSH map, Level Hashing.

    • Particularly, Consistent Hashing suffers from load imbalance if a server goes offline and all its keys are assigned to the immediate next server on the ring. The issue can be relieved if each server has multiple points on the Consistent Hashing ring.
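
A minimal sketch of Consistent Hashing with virtual nodes (a toy `ConsistentHashRing`; 100 points per server is an arbitrary choice): when a server is removed, only the keys it owned move, and they spread over all survivors rather than piling onto one neighbor:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes: each server owns many
    points on the ring, so a departing server's keys are scattered
    across all remaining servers."""
    def __init__(self, servers, vnodes=100):
        self._ring = []   # sorted list of (point, server)
        for s in servers:
            for i in range(vnodes):
                self._ring.append((self._hash(f"{s}#{i}"), s))
        self._ring.sort()
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:8], "big")

    def lookup(self, key):
        # Clockwise successor on the ring; wrap around via modulo.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]
```

Note that keys not owned by the removed server keep their assignment, which is the whole point versus naive `hash(key) % n` placement.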

  • Skiplist is first a list structure that preserves data sort order. To speedup lookups, it adds several layers of new lists, each with increasing sparseness by skipping more keys. Essentially, it’s like a tree but with nodes linked horizontally. The data index is used in Redis and RocksDB, known for its simplicity and performance at high concurrency. Alternatively, a list packed in consecutive memory can also be useful to index small amounts of data, where sort order is preserved and lookup uses binary search. An example is the values packed in a B+-tree node.

  • Radix tree is used in Linux Kernel memory management, and NFS and NOVA inode indexing. A radix tree is a trie tree with paths compressed to the maximum, so that each node’s children count maps to the variety (i.e. radix) of the next level’s data. It can also be seen as a fragmented array, where the big consecutive array is broken into a few smaller segments, and these segments are indexed by another smaller array at the next level of the tree.

  • Red-black tree is the standard implementation of the C++ ordered map. It’s a binary search tree, self-balanced with efficient rotations, yet not so strictly balanced as to hurt update performance (unlike the AVL tree).

  • B+-tree. The plain old database index, but still proven widely useful in storage, PMEM, and caching. The B+-tree balances itself, uses packed big nodes to limit tree height, which maps to disk reads, and preserves sort order with traversal links. The B+-tree works entirely in pages, which simplifies disk data transfer in DBs and memory management. B+-trees share various optimizations for access efficiency, storage space, and locking concurrency.

    • Steal, no force. The terms came from the ARIES protocol for DB transaction recovery (well explained in 阿莱克西斯's ARIES article). They concern DB page syncs between memory and disk. “Steal” allows the DB to flush pages of an uncommitted transaction to disk, which introduces the need for an undo log. “No force” allows the DB to NOT flush pages of a committed transaction to disk, which requires a redo log for failure recovery. Not only do “steal, no force” improve DB performance on large transactions, they enable DB buffer management to become a component decoupled from transactions and indexing.

    • B+-tree locking techniques separated the DB concepts of “latch” (for data structures) vs “lock” (for transactions). They introduced the widely used technique “lock coupling”. The concurrency design space in B+-trees ranges across: 1) SIX locks, which introduced intent locks that softly drain on-going writes, 2) lock coupling, which steps through parent/child with a limited lock span, 3) Blink/OLFIT trees, which support version-based OCC lock-free reads and locked writes, 4) the Bw-tree, which is lock-free with append-only delta pages. (Well explained in the 数据库内核月报 B+tree article.)

  • Popular data indexes for in-memory databases and PMEM. They essentially stem from the B+-tree, and are frequently found in papers and industry products. Since they are already covered in the Reference architectures section, here is a brief walkthrough.

    • ART tree is used in the HyPer in-memory database. It is built from a radix tree, dedups key prefixes like a trie tree, and is made space efficient by adapting node sizes to several different record counts. The node is essentially a fixed length array. Leaf nodes can store values inline. Path compression is carried out on nodes with a single child. The ART tree supports range queries.

    • Masstree collectively applies many optimization techniques to the B+-tree. A trie tree is employed to dedup common key prefixes. The next-to-access nodes are prefetched from DRAM to CPU cache in an overlapped pipeline. Operations can be carried out concurrently at tree nodes, where reads are version-based OCC lock-free, and writes hold a lock. More fine-grained optimizations can be found in the Masstree paper.

    • Bw-tree is a lock-free B+-tree variant used in Hekaton and DocumentDB. It appends page deltas, which then need compaction and epoch-based memory reclamation. It employs a Page Mapping Table to avoid recursively propagating COW page updates up to the root. The Page Mapping Table also enables many atomic operations that need to switch page pointers. Bw-tree page deltas can be incrementally flushed to disk, which makes it comparable to the append-only LSM-tree that buffers and sequentializes writes.

    • Be-tree is a B+-tree variant that reduces random writes, comparable to the LSM-tree. Compared to the B+-tree, small writes are buffered in nodes and flushed to lower levels when full. In this way, small writes are batched and disk writes are mostly sequentialized. Compared to the LSM-tree, the Be-tree still maintains the B+-tree structure in data organization to support optimal read performance.

  • Other types of indexes. The few below are handy for special usecases.

    • Bitmap index is used in databases. Compared to the B+-tree, it becomes applicable when a column has low cardinality (count of distinct values). It works in the same way as bit-vector compression in columnar layouts.

    • Inverted index is used in full-text search engines, e.g. Lucene and ElasticSearch, to look up documents by words. Weights of words can be evaluated via the TF-IDF score. Weights of pages or documents can be evaluated by the PageRank algorithm, known from Google, where PageRank is the principal eigenvector of the page link matrix. Inverted indexes are becoming more widely adopted in databases as more of them support full-text search.
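
A minimal inverted index sketch with TF-IDF weighting (toy functions of my own naming; real engines like Lucene use far more elaborate scoring):

```python
import math
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the documents containing it, with TF-IDF weights.
    `docs` is {doc_id: text}."""
    index = defaultdict(dict)            # word -> {doc_id: term frequency}
    for doc_id, text in docs.items():
        words = text.lower().split()
        for w in words:
            index[w][doc_id] = index[w].get(doc_id, 0) + 1 / len(words)
    n = len(docs)
    # Scale term frequency by inverse document frequency: rare words weigh more.
    return {w: {d: tf * math.log(n / len(postings))
                for d, tf in postings.items()}
            for w, postings in index.items()}

def search(index, word):
    """Return doc ids containing `word`, best TF-IDF score first."""
    postings = index.get(word.lower(), {})
    return sorted(postings, key=postings.get, reverse=True)
```

The postings lists here are plain dicts; production engines compress them (delta-encoded, skip-listed) since they dominate index size.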

Data indexes in distributed storage

We discuss a few secondary topics about data indexes here.

  • How can a data index scaleout in a distributed system? Typically, a tree-based index can host its top levels in a Consistent Core, and naturally scaleout the bottom levels across the cluster. Hashtables and list-based indexes can scaleout with data partitioning on value ranges.

  • How to maintain index consistency with data updates? We already discussed this in the Consistency section. Essentially, it needs a distributed transaction interleaving data and index. If the index is partitioned and exclusively co-located with data on the same node, a local transaction is enough. The index can also receive data updates in an eventually consistent way, while versions can tell users about the propagation progress and enable snapshot isolation. A third alternative builds a full index for old data, while new incremental delta data runs without an index or with a cheap index. Typically, the index can be implemented as another plain database table to reuse data structures and transactions.

  • How to build secondary indexes? We mentioned a few secondary indexes in Reference architectures, which agree on eventual consistency. A typical database can support secondary indexes by paying transaction cost with data updates. On distributed storage, there are two categories of secondary indexes:

    • A global secondary index builds the index on the global space of the secondary key. It needs a distributed transaction to update consistently. However, if treating it as a plain database table, reusing the code is easy.

    • A local secondary index builds an index locally on each data node. Each index only covers the local space, while different data nodes can have overlapping secondary keys unknown to each other's indexes. The index only needs a local transaction to update consistently with local data. However, looking up a secondary key needs to query all data nodes. Running parallel queries may not be that bad, considering there are also databases that choose hash partitioning per row. A node can skip the query if its bloomfilter tells the key doesn’t exist.
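
A minimal sketch of the local secondary index scatter-gather (toy classes; the exact-membership check stands in for the per-node bloomfilter): every node may be asked, but nodes that cannot contain the secondary key are skipped:

```python
class DataNode:
    """One data node holding primary rows plus a local secondary index.
    Hypothetical schema: rows are {primary_key: city}; city is the
    secondary key."""
    def __init__(self, rows):
        self.rows = rows
        self.local_index = {}               # city -> [primary keys], local only
        for pk, city in rows.items():
            self.local_index.setdefault(city, []).append(pk)

    def lookup_secondary(self, city):
        return self.local_index.get(city, [])

def query_by_city(nodes, city):
    """Scatter-gather: a local secondary index forces querying every node
    and merging results. The membership test below stands in for a
    bloomfilter that lets a node be skipped up front."""
    results = []
    for node in nodes:
        if city in node.local_index:        # bloomfilter stand-in: may skip node
            results.extend(node.lookup_secondary(city))
    return sorted(results)
```

A global secondary index would answer this with one lookup, at the price of distributed transactions on every write.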

Succinct data structures

Succinct represents a family of data compression algorithms with an interesting “self-indexing” property; see below. I add a special section for it. They closely match the usecase of DNA indexing & searching. They can also be used for in-memory indexing, and for compressing in-memory data while supporting DB queries.

  • The compressed size is close to the entropy limit (Succinct wiki). I.e. the compression ratio is near that of classic block-based compression.

  • Supports point/range query, and especially text search, in-place on the compressed data. There is NO separate index, but the performance is close to using an index, and much faster than a full scan. Supporting text search is handy for DNA sequencing.

    • I.e. Succinct can be used to replace an in-memory index, especially a secondary index. Besides, Succinct also compresses your data.

  • Querying/lookup in a Succinct data structure usually involves several address jumps in its internal data structure (e.g. Compressed Suffix Array). This is OK for in-memory indexing/compression, but may not be as handy for on-disk data compression. Besides, sequential read throughput / reading a large block may also be a concern.

  • Succinct data structures are usually slow to build, compared to classic block-based compression. Once built, they are usually hard to modify. Though supporting various queries, succinct data structures can be slow for sequential scans.

    • In column-oriented DBs, common columnar compression algorithms (e.g. RLE) make powerful alternatives to Succinct data structures. Columnar compression algorithms also support directly executing DB queries. They are also easier and faster to modify. They are much more widely adopted in DBs.

There are a few most commonly used Succinct data structures

  • FM-index is one popular and versatile succinct data structure. It is based on Burrows-Wheeler Transform (BWT). How it works is close to CSA.

  • Compressed Suffix Array (CSA) is built from a different knowledge set. But it eventually converges to a very similar data structure to the FM-index and BWT. Essentially, it tracks the suffixes of the input string and sorts them. The tail char and previous char extracted from each suffix are sufficient to rebuild the original input string. The chars extracted are sorted, thus can be efficiently compressed. Text search is based on matching these chars. For point lookups that need address offsets, CSA stores them, but uses sampling to reduce the storage overhead.

  • Succinct Trie is a trie tree encoded in bits. Rank & Select primitives are used to traverse tree parents/children. The primitives can be optimized to execute fast. A Succinct Trie is typically used as a compressed index.
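
As a flavor of what underlies the FM-index, the forward Burrows-Wheeler Transform fits in a few lines (a naive sort-all-rotations sketch; real implementations build suffix arrays instead of materializing rotations):

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler Transform: sort all rotations of text+terminator,
    then take the last column. The output is a permutation of the input
    that groups similar characters together, making it compressible
    while remaining searchable (the basis of the FM-index)."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)
```

The transform is reversible (via the last-to-first mapping), which is why the compressed form can stand in for the original text plus an index.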

There are several notable adoptions of Succinct data structures

  • Compressed index in TerarkDB/ToplingDB. ToplingDB uses a Succinct Trie (CO-Index) to index RocksDB keys, while on-disk data is compressed by PA-ZIP. PA-ZIP supports random access to compressed data, without decompressing the entire block. PA-ZIP does not use succinct structures.

  • Spark RDD added a Succinct-based implementation. It is compressed, and supports text search and text occurrence count. It is published as GitHub AMPLab/Succinct along with a SuccinctStore paper.

  • GitHub simongog/sdsl-lite is a well-known opensource implementation of succinct data structures. The implementation is efficient and is mostly used for research.

  • DNA sequencing. Searching a sub-sequence in a huge compressed DNA database is handy, and exactly matches what Succinct does. See an example paper on genome compression. LZ-End is also a well-known algorithm here.

Data caching

Data caching addresses the performance tier in data organization. It exploits the skewness of data hotness and temporal access locality, trading expensive small-capacity but fast storage media for speed. Internet services commonly leverage caches heavily (e.g. Redis) to serve most user data. We first plot the design space of data caching by categorizing its different properties.

  • Storage media. The dominating caching device is DRAM, e.g. Redis, Memcached, which layer atop slow disk accesses. Later, SSD was integrated into caching devices to exploit larger capacity, warm cache restart, and its speed compared to HDD. PMEM is recent and mostly used for write staging, offloading cold memory, or as fast persistent storage for filesystems and DBs. In cloud-native scenarios, e.g. Snowflake, the ephemeral HDD in a local VM is used to cache computed results fetched from remote S3.

  • Durability semantics. Commonly, a cache is a duplicate of data persisted elsewhere, so losing the cache has no impact on durability. Cache tiering (e.g. in Ceph) however requires the cache to be persistent: one replica of the 3-replica set is migrated to the cache tier, leaving the other two replicas in slow storage (e.g. HDD, EC-ed). Write staging also requires persistence; it is used to absorb recent writes, dedup and sequentialize them, and to serve subsequent reads of recent data. Memory buffering is common in programming, where data is loaded from disk into memory before processing; it is volatile. It is also used in stream processing to buffer and compose intermediate results (e.g. in a Redis server), where durability can be enhanced with RDD.

  • What to cache and the granularity, from small units to bigger ones. Storage/DBs typically cache blocks and pages. Memcached caches key-value pairs, while Redis caches data structures, e.g. lists, sets, maps. Tiering systems can move larger chunks or files. DBs can also cache more semantically: table rows, which carry a higher density of user data than raw blocks; query results; or materialized views, which cache data as well as computation effort. Query optimizer results can also be cached, where parameterized queries are common. An in-memory DB can be seen as caching at the entire-DB level.

  • Where to host the cache. Use a separate system, offload to another server node, run in another process, or embed into the local app.

Data caching section

Memory caching

Caching data in memory is essentially managing data with DRAM indexing, which we touched on in the Data indexing section. Typical data structures are hashtables and trees. Additionally, memory compression and cold offloading can be employed to enlarge capacity. There are a few design properties to consider; we recap them here, and they also apply to SSD caching.

  • Cache partition & replication. Scaling out the cache via hash partitioning is common. But it can be capped by IOPS if clients have to split requests, e.g. an originally large request has to split into two small requests because the queried keys hash-map to two servers. Replication becomes useful here to scatter load for small pieces of very hot data. It is also used to avoid cross-region lookups. Load balancing can be done via partitioning/replication, while hot/cold rebalancing migration is usually unnecessary because the cache is volatile.

  • Cache warmup. A newly restarted cache node needs to run a while to fill with warm data. A cold restart can hurt the performance of systems that rely heavily on caching. For a warm restart, a cache process can offload data to disk before exit, backfill from other cache nodes, or let a shared process temporarily hold its memory across the restart.

  • Item eviction. The methods are shared with storage temperature tiering, already mentioned in the Write path section. Additionally, a cache can be designed to never evict until a new item arrives when full. Cascaded expiry or invalidation should be avoided: a large swarm of cache item evictions can burst the miss rate and impact latency.

  • Propagating updates and invalidation. Updates and invalidation are necessary to keep the cache consistent with the underlying persistent store. However, with N cache nodes and M app nodes, an N*M connection count is unwise. A mediator module or a central coalescing message queue can be introduced. The cache can also subscribe to DB change logs to update itself (a shared logging system).

Managing consistency between cache and persistent store has several approaches. The Facebook Memcached/TAO papers have rich discussions.

  • For read consistency, the typical method is cache aside. The app first reads from the DB and then puts the item in the cache; the app is responsible for invalidating the cache item when updating the DB. A small window of stale reads from the cache is tolerable. Cross-region cache consistency can be achieved by primary-secondary replication with a sequential ordering of updates and invalidations. Facing cross-region lag, a user can request to see its own latest updates via causal consistency, which can be implemented with version tracking in cache items.

  • For write consistency, the typical method is the same cache aside, also called write-through. The cache can also totally hide the backing persistent store, taking all writes and guaranteeing durability (e.g. write staging, or a proxy). When a staging cache writes back to the persistent store, write ordering needs consideration. An anti-example: the journal commit is flushed earlier than the journal data.

  • For multi-key consistency, the problem decomposes into atomic writes and atomic reads. Both can be enhanced by tagging versions on cache items, then detecting inconsistency and applying mitigation. A key difference from persistent storage: a cache may be inconsistent first and then detect and fix it, while persistent storage must guarantee data consistency upfront.
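
The cache-aside pattern above can be sketched as below. Plain dicts stand in for the cache (e.g. Redis) and the DB; function names are illustrative assumptions, not any library's API.

```python
# Cache-aside sketch: the app reads through the cache, and invalidates
# (rather than updates) the cache entry when writing to the DB.

db = {}      # stands in for the persistent store
cache = {}   # stands in for the volatile cache (e.g. Redis)

def read(key):
    if key in cache:            # cache hit
        return cache[key]
    value = db.get(key)         # miss: read from the DB...
    if value is not None:
        cache[key] = value      # ...then populate the cache
    return value

def write(key, value):
    db[key] = value             # update the persistent store first
    cache.pop(key, None)        # then invalidate the cached copy

write("user:1", "alice")
print(read("user:1"))   # miss, fills cache -> "alice"
print(read("user:1"))   # hit -> "alice"
write("user:1", "bob")  # invalidates the stale entry
print(read("user:1"))   # miss again -> "bob"
```

Invalidating instead of updating on write narrows (but does not eliminate) the window where a concurrent read-miss can refill a stale value; Facebook's Memcached paper closes that race with leases.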

SSD caching

An SSD cache also uses DRAM as the first-level cache and offloads cold data to SSD. The DRAM index is typically a hashtable or B+-tree. New challenges come from the larger index implied by SSD's larger capacity, handling SSD rewrites and garbage collection, managing item eviction on SSD, and SSD wear-out. Here are a few design properties.

  • SSD cache structure. There are several approaches. An SSD cache has similarities with the hardware cache between CPU and DRAM, and also shares properties with persistent storage.

    • Set-associative cache, e.g. Flashcache and KSet in Kangaroo. A set-associative cache limits the freedom of item placement to a cache line, thus needing little memory to host the index (on the same level as a hashtable).

    • Append-only storage, e.g. BCache and KLog in Kangaroo. Cache items are sequentially appended to disk and organized into larger buckets as the unit of GC.

    • Key-value stores, e.g. using RocksDB to manage SSD data. However, RocksDB is not designed for cache usage: disk point lookups have no index, and deleted space is released too slowly after many levels of compaction. This is the second key difference between cache and persistent store: deletion is much more frequent.

  • Managing index size. While a plain method is to set a larger page size, cache items can be divided into small objects and large objects, e.g. in Kangaroo. Large objects are few in count and can thus use a full DRAM index. Small objects assign most SSD capacity to the set-associative cache, which incurs little index memory, and overlay a more efficient append-only store to favor batching, which uses limited SSD capacity and thus a small DRAM index. Further metadata size reduction techniques such as "Partitioned Index" can be applied. A Bloom filter is another memory-efficient technique to tell whether an item exists on disk.

  • SSD garbage collection. A set-associative cache has huge write amplification: a cache line is aligned with the flash page, so overwriting one cache item rewrites the entire cache line (i.e. flash page). Append-only storage generally follows the common GC techniques. Buckets compose the resource-throttling unit, and high-garbage buckets can be picked first. Item eviction is the same as mentioned before, where memory needs compaction. Note that a flash cache line can merge eviction and insertion into one rewrite, i.e. never evict without an insert.

  • SSD wear-out. When used as a cache, SSD inherently suffers more severe wear-out; this is the third key difference from persistent storage. The cache capacity is much smaller than the underlying persistent store, yet the cache has to absorb most new writes, plus extra traffic from periodic hot/cold shifts. Mitigations include preventing cold data from flowing through the cache, and avoiding churn by provisioning enough capacity to host a full hot/cold cycle.
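
The set-associative structure described above can be sketched as a toy in-memory model. Set count, way count, and FIFO eviction within a set are illustrative assumptions; the real thing maps each set to a flash page and keeps only the tiny set index in DRAM.

```python
# Set-associative cache sketch: a key may only live in one small set
# (cache line), so lookup just scans that set and per-item index
# metadata is unnecessary. Eviction here is FIFO within the set.

NUM_SETS, WAYS = 4, 2

sets = [[] for _ in range(NUM_SETS)]   # each set holds up to WAYS items

def get(key):
    s = sets[hash(key) % NUM_SETS]     # key determines its only set
    for k, v in s:
        if k == key:
            return v
    return None

def put(key, value):
    s = sets[hash(key) % NUM_SETS]
    s[:] = [(k, v) for k, v in s if k != key]   # drop stale copy
    if len(s) == WAYS:
        s.pop(0)                       # evict the oldest in this set
    s.append((key, value))

put("a", 1)
put("b", 2)
print(get("a"))
```

The trade-off is visible even in the toy: a hot key can be evicted early because its set is full while other sets sit empty, which is exactly the restricted-placement cost the memory savings pay for.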

Metadata caching

This section focuses on caching data, but we also briefly mention metadata caching.

  • Metadata is usually served fully in-memory by a Consistent Core (or partitioned / disaggregated nodes). A client can directly ask the Consistent Core rather than requiring another cache service. Besides, metadata is usually much smaller in size than data.

  • The propagation of metadata usually leverages piggybacked requests, gossip protocols, or a direct refresh request to the Consistent Core. A client typically caches what it needs in local memory, with an expiry or version-checking policy.

  • Secondary indexes of data can be seen as a type of metadata. Per implementation, they are usually treated as plain data or tables and share the same caching facilities mentioned in prior sections. As indexes, they may be given higher priority to stay pinned in memory.

Data partitioning & placement

Data partitioning is the fundamental paradigm for scaling out data serving in a distributed system. It has several design properties, many of which resemble those in the Data organization section, where partitioning across nodes is like co-locating data in chunks. Data sharding is mostly a synonym of data partitioning.

  • Scaleout. Data partitioning maps the data space to partitions, so that each partition can be served on a different node to scale out system capacity. The system is dynamic: an individual partition will grow or shrink in size or hotness, which introduces the need to split or merge partitions.

  • Access locality. Data accessed together should be put into a single partition. E.g. a partition holds consecutive data ranges and preserves sort order to favor range queries. E.g. tables frequently grouped in one transaction are co-located in one partition. E.g. a partition holds the different objects or table columns that are frequently accessed together. E.g. a single object can be broken into components, each partitioned differently according to its access patterns. Access patterns are dynamic, which means either partitioning or placement needs to change over time. Finding the best partitioning can either act greedily on recent metrics, or use Machine Learning to optimize over historical behavior.

  • Granularity of units. Partitions can be small for fine-grained scheduling and still preserve locality by co-locating multiple partitions on one node. However, more metadata is paid as data volume grows. Existing partition granularity can also adapt to future growth/shrink by employing merge/split. However, hash-based partitioning needs careful design to avoid excessive data migration.

  • Balance of capacity. How to ensure each node receives similar data capacity? Either this is achieved by equalized data partitioning, or by relying on balanced data placement. Partition growth/shrink in size further introduces the need to manage merge/split and migration.

  • Balance of hot/cold. How to ensure each node receives similar IOPS/throughput? Hotness is the second dimension, besides capacity, that requires balancing. The balancing is either embedded at the data partitioning level, or relies on data placement. Adaptive data migration is needed to deal with future traffic pattern changes.

  • Shuffle. A computation may need a different partition key from the existing one. This happens frequently in MapReduce/Spark pipelines, where data needs aggregation by a different partition key, and in database joins not on primary keys. Usually the solution is a reshuffle that re-sends data by the new key; sometimes a small table can instead be replicated in full to each destination.
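
The reshuffle can be sketched as below. The record shape and the hash routing are illustrative assumptions; real engines (MapReduce, Spark) do the same routing over the network with serialization and spill-to-disk.

```python
# Shuffle sketch: records are currently partitioned by user, but a
# downstream aggregation needs them grouped by country, so every
# partition re-sends each record to a destination chosen by the new key.

partitions = [  # current partitioning, e.g. by hash(user id)
    [{"user": 1, "country": "US"}, {"user": 3, "country": "DE"}],
    [{"user": 2, "country": "US"}, {"user": 4, "country": "FR"}],
]

NUM_DEST = 2
dest = [[] for _ in range(NUM_DEST)]

for part in partitions:
    for rec in part:
        # Route by the NEW partition key (country), not the old one.
        dest[hash(rec["country"]) % NUM_DEST].append(rec)

# All records sharing a country now land in the same destination,
# so the aggregation can run locally on each destination node.
for d in dest:
    print([r["country"] for r in d])
```

The broadcast alternative mentioned above simply skips the routing: when one side of a join is tiny, copying it whole to every destination is cheaper than reshuffling the large side.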

Data placement is the next step: deciding which node hosts a partition. Usually data partitioning and placement are designed jointly to solve the above design properties. Data placement has further design properties.

  • Data migration. The first source of migration is balancing, which comes from the asymmetric growth of capacity, change of hotness, and change of access locality. Another source is node join/exit: empty nodes need filling up, and a dead node's data needs re-placing elsewhere. Hash-based placement usually needs careful design to avoid excessive data migration. The topic is closely related to load balancing, while resource scheduling focuses more on placing jobs under multiple dimensions of constraints such as CPU, memory, IO, and latency.

  • Metadata size. Allowing full freedom of object placement, and tracking at a fine granularity, both help balancing and reduce migration; however, both require spending more metadata. Metadata itself can also be partitioned and scaled out; see the Metadata section.

  • Failure domains. Co-related data, e.g. 3 replicas or EC symbols, must avoid being placed in the same failure domain. Failure domains hierarchically consist of disk, node, TOR, datacenter row, T2 switch, and region/DNS. The upgrade schedule adds yet another layer of failure domain.

Data partitioning section

Common techniques

Common data partitioning techniques for key-value structures are hash- and range-based partitioning. Partitioning gets more flexible for filesystem inode trees and graph vertices/edges. Data partition & placement techniques closely relate to the Metadata section.

  • Ranges. Frequently seen in DBs to support range queries, e.g. CockroachDB, HBase. A table is horizontally partitioned into consecutive row key ranges. Ranges are usually dynamically managed by split/merge. A table can additionally be vertically partitioned by columns frequently accessed together.

  • VNode. Keys are hash-mapped to buckets called "VNodes". VNodes are the input for further placement. Compared to directly placing each key, VNodes reduce the granularity of tracking and balance hot/cold. The number of VNodes in a system is usually pre-configured and hard to change. We mentioned VNodes before.

    • Hash partitioning. Databases, e.g. YugabyteDB, can support hash partitioning. Each partition is like a VNode; rows are assigned to them via row key hash mapping. A distributed Memcached can also scale out by hash partitioning. While hashing automatically balances hotness across nodes, IOPS can increase significantly because a range query involves all nodes.
  • Filesystem inode trees. Like range vs hash, trees can also be partitioned by sub-structure vs hash randomness.

    • Subtree based. E.g. CephFS features "dynamic subtree partitioning", where an entire subtree can be migrated across MDS nodes according to hotness. Subtree-based partitioning preserves access locality but is prone to hotness skew. When accessing a deep FS path, each intermediate node is subject to a metadata fetch; subtree partitioning helps localize them all on one node.

    • Hash based. E.g. HopsFS partitions inodes by parent inode ID to localize the operations of directory commands. Hashing favors load balancing but breaks access locality.

    • Break into different components. E.g. InfiniFS. Inode metadata is decoupled into access attributes and content attributes. Each has different access locality, so each is partitioned differently. The method enhances locality for hash-based partitioning.

  • Graph partitioning is challenging because the interconnections between graph components are irregular. Besides, computation on a graph can hardly be localized to partitions, e.g. Deep Learning needs a Parameter Server.

    • Hash/range partitioning. E.g. FaRM A1 applies hash partitioning to favor randomness. E.g. Facebook TAO is backed by MySQL with a shard_id assigned for partitioning. Adjacent edges are packed with their vertices since they are always accessed together.

    • Clique. A clique identifies a group of vertices with dense internal communication but sparse external links. Facebook Taiji partitions data via Social Hashing, i.e. by groups of friends, geo domains, organizational units, etc. Expensive partitioning can be computed offline via Machine Learning.

    • Replication. E.g. Facebook TAO. Some partitions can be frequently needed by computations happening in other partitions. The traffic is expensive if cross-region. Such partitions can be replicated to all consumer nodes to favor access locality.

Techniques for data placement follow similar categories to data partitioning.

  • Metadata tracking. Use a Consistent Core to track the placement of each partition. It costs metadata size, but the placement of a partition gains full freedom: all sorts of algorithms can be explored for fine-grained arrangement of capacity/hotness, and no excessive migration is needed on node join/exit. Examples are HDFS/HBase and Tectonic.

  • Consistent hashing. Hash methods save metadata size. Naively, a partition can hash-map to a node for placement, but one node join/exit would churn all existing placements and thus cause excessive data migration. Consistent hashing stabilizes the churn so that only neighboring VNodes are touched. Examples are Cassandra and Dynamo. We mentioned consistent hashing before.

    • CRUSH. Ceph invented the CRUSH algorithm, a hash-based placement algorithm. It generates random but deterministic placement, and limits excessive migration during node membership changes. Compared to consistent hashing, CRUSH supports hierarchical failure domains organized as a tree, and different device weights.

    • Placement dimensions. Hash-based placement is usually single-dimensional over the partitioned key space, but placement is fundamentally a multi-dimensional resource scheduling problem. Placement Groups / VNodes provide something in the middle.

  • Content-based addressing. Placement is determined by the hash of the data block's content, so dedup is automatic. The example is XtremIO. We mentioned it before.
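
The consistent-hashing behavior described above can be demonstrated with a minimal ring. The hash choice (MD5) and class names are illustrative; production rings add many virtual points per node for smoother balance.

```python
# Consistent hashing sketch: nodes and keys hash onto a ring; a key is
# owned by the first node clockwise from it. Adding a node only moves
# keys in the neighboring arc, instead of churning every placement.

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.points = sorted((h(n), n) for n in nodes)

    def locate(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect.bisect(hashes, h(key)) % len(self.points)  # wrap around
        return self.points[i][1]

keys = ["k%d" % i for i in range(100)]
ring = Ring(["node-a", "node-b", "node-c"])
before = {k: ring.locate(k) for k in keys}

ring = Ring(["node-a", "node-b", "node-c", "node-d"])   # one node joins
after = {k: ring.locate(k) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(moved)   # only a fraction of the 100 keys change node
```

Every key that moves can only move to the newly joined node, since the surviving nodes' ring positions are unchanged; with naive `hash(key) % num_nodes` placement, nearly all keys would move instead.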

Data integrity

Data integrity is critical. A storage system can be slow, have few features, and not scale, but it should never lose data. There are several failure patterns affecting data integrity.

  • Durability loss. Enough disks are down that a piece of data cannot be recovered. In comparison, availability loss means a serving node is down, but data is still recoverable offline from its disks. At the hardware level, an entire-disk failure usually maps to the power unit or disk encapsulation, while corruption usually maps to individual sector failures. RAIDShield points out that a climbing reallocated-sector count is a good predictor of incoming disk failure.

  • Disk error on reads. A disk read can produce transient or persistent errors, which may or may not map to an underlying bad sector. The rate is measured by the bit error rate, or UBER.

  • Silent disk corruption. A disk sector can become corrupted without notice. The disk hardware may not discover it until the first read; or the read succeeds but software-level CRC verification finds a mismatch.

  • Memory corruption. Memory bits can flip from time to time and generate incorrect computation results; this happens even with ECC memory. Worse is a corrupted pointer, which may tamper with a wide range of memory unpredictably.

  • Unexpected data deletion bugs. A high-ingress storage system needs to reclaim deleted space promptly, but a programming bug can unexpectedly delete valid data. This can be made infrequent with careful rollouts, but once it happens, far more data can be impacted than by plain disk failures.

  • Incorrect metadata bugs. Metadata needs frequent updates along with data changes. A programming bug can easily update metadata incorrectly, losing track of data location or state. Handling version-incompatible upgrades is especially error-prone.

  • Bugs propagated through replication. It's not uncommon to see a full set of replicas corrupted, because the bug is replicated too. Replication effectively protects against hardware corruption, but is less helpful against software bugs.

Plain techniques are used to improve data integrity.

  • Replication based. Replicate the data or apply EC. Replicate the metadata too, in case one copy is corrupted. Perform periodic backups, including to another geo location and to an offline system, to prevent bug propagation.

  • CRC is pervasively used to verify that a piece of data matches its checksum, at the cost of computing polynomials over finite fields. Compared to cryptographic hashes, CRC can even be reversed to recover a few wrong bits. The CRC algorithm satisfies a linearity property (CRC wiki), which can be exploited for optimization. A 32-bit CRC is able to detect any double-bit error, burst errors of length <= 31, and any odd number of bit errors (CRC lecture).
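
The linearity property can be demonstrated concretely with CRC-32. Because `zlib.crc32` includes an init/final XOR, the identity takes an affine form: for equal-length buffers, `crc(a ^ b) ^ crc(zeros)` equals `crc(a) ^ crc(b)`. This is a demonstration, not an optimization recipe; systems use the same property to e.g. combine or incrementally update checksums without rescanning data.

```python
# CRC-32 is linear over XOR (up to a length-dependent constant):
# for equal-length messages a, b:  crc(a^b) ^ crc(0...0) == crc(a) ^ crc(b)

import zlib

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

a = b"hello, world"
b = b"HELLO, WORLD"
zeros = bytes(len(a))   # all-zero buffer of the same length

lhs = zlib.crc32(xor_bytes(a, b)) ^ zlib.crc32(zeros)
rhs = zlib.crc32(a) ^ zlib.crc32(b)
print(lhs == rhs)   # True
```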
The techniques should be used with thoughtful methodologies. See more in Reliability against bugs article.

  • CRC should be end-to-end. The user client generates the CRC, and the CRC is persisted at the last level of the system. Data is verified against the CRC before being returned to the user. A CRC calculated in the middle of processing is less reliable because its input may already be corrupted. The more general principle: verification must be end-to-end.

  • Any data transform needs verification. Replication, EC, buffer copies, compression, network sends, format changes, disk store/load, etc.: any data transformation should compare CRCs before and after, in case memory corruption happens in the middle. The more general principle: each incremental step needs verification.

  • Save metadata twice. Metadata is so critical that it can be saved once in the Consistent Core and kept as a second copy on the data nodes, with the two copies updated by different workflows. If metadata corruption happens in the Consistent Core, it is still recoverable from the data nodes. The more general principle is heterogeneous verification: critical data or computation should be persisted or verified by two different workflows, so that corruption on one side can be recovered from the other.

  • Data ordering needs verification. A distributed system can receive packets in inconsistent order. When data is being appended, the overall ordering should be verified so that no reordering happened in the middle.

  • Periodic disk scrubbing. This is common in distributed storage, e.g. Ceph: disks need periodic scrubbing to guard against silent corruption. Finishing scrubbing on schedule requires enough throughput and deadline scheduling.

  • Verification pushdown. A storage system can be organized in multiple layers. Verification computation can be pushed down to the bottom layer to shorten the data transfer path. This is applicable because verification logic is usually fixed, has little exception handling, and is data-pipeline oriented. It can also be offloaded to hardware accelerator chips or smart hardware.

  • Chaos engineering. Periodically inject failures and corruption into the system to test its ability to detect and recover from errors. Periodically drill the engineering operations of data recovery. The riskier activities should be drilled more frequently.
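
The "any data transform needs verification" rule can be sketched for one transform, compression. The function name is an illustrative assumption; the pattern is checksum before, re-expand, and compare, so a corruption during the transform cannot silently propagate.

```python
# Verify a transform by inverting it: checksum the input, compress,
# decompress again, and compare checksums before trusting the output.

import zlib

def compress_verified(data: bytes) -> bytes:
    crc_before = zlib.crc32(data)
    compressed = zlib.compress(data)
    # Re-expand and re-checksum before letting the bytes flow downstream.
    if zlib.crc32(zlib.decompress(compressed)) != crc_before:
        raise IOError("corruption detected during compression")
    return compressed

blob = b"some page of user data " * 100
out = compress_verified(blob)
print(len(out) < len(blob))   # True: compressed, and verified en route
```

The double work is the price of the guarantee; in practice the verify step is a cheap CRC pass compared to the transform itself, and it is exactly the kind of fixed, pipeline-friendly logic the "verification pushdown" bullet suggests offloading.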

Data integrity section

High availability

I choose to combine HA into this section because it is closely related to durability, most of its content has already been covered, and the fundamental goal of integrity is to keep the correct data always available. An availability issue is usually transient and gone after node recovery, but a durability issue means the data stays lost forever.

  • Replication. The fundamental technique for data/metadata HA is to persist multiple copies. One copy can recover another, and 2 of 3 copies can vote out 1 incorrect copy. Synchronous replication acks the client only after all copies are updated, while geo-replication or backup can be employed with an RPO.

  • Active-active. The fundamental technique for computation/service HA is to run multiple instances of the service and allow failover. Active-standby saves computation resources at the standby machine, but suffers an RTO delay for standby startup. Paxos is the pervasive active-active algorithm, where the majority quorum arbitrates split-brain. Active-active can be extended to multi-datacenter or multi-region, via either Paxos/sync or async replication.

    • Cell architecture partitions data and encapsulates dependent services into cells. Each cell designates only one active primary datacenter, while all datacenters run active cells, so all datacenters are active-active with no standby datacenter. Data can be sync/async/not replicated across datacenters. Datacenter failover needs caution to avoid overloading the surviving ones.

    • Multi-zone services. AWS AZs and Azure redundancy divide the disaster failure domains within a geo region into availability zones. A service can span multiple zones so that a single-datacenter disaster won't impact availability. Zones are active-active.

  • "Two locations, three datacenters" is commonly used by banks. One city deploys two datacenters with synchronous replication, and a second city deploys the third datacenter with async replication for disaster recovery.

  • Reducing blast radius. A core concept in availability is to reduce the blast radius. Cell architecture isolates at the cluster level and reduces deployment size. Partitioning reduces blast radius and eliminates the SPOF. When a partition fails, instead of letting availability drop, migration can restore it by serving the partition on another group of nodes; such migration is usually designed to move fast. However, it can breach the blast radius if a buggy partition keeps crashing new nodes; a Logical Failure Domain can be employed to quarantine it.

    • Practical tips. Do you have services (e.g. DNS, config portal) that span more than one cluster, datacenter, or even region? Do config changes also follow a canary/baking rollout SDP procedure? Are there config changes that could instantly impact all services? Could a customer, dev, or ops engineer happen to delete something that shoots down all services (the GCS UniSuper outage)? Deletion is usually a more devastating cause of data loss than node failure, bad disks, or data corruption, because the latter don't proactively, physically eliminate user data.

HA relies on robust failure detection, where the major issue is gray failures caused by Observational Difference. Examples: a dead app whose heartbeat thread still works, network link degradation only at high percentiles, inconsistently reported health status, and intermittent failures. Common techniques to overcome such issues stem from metadata consistency:

  • Synchronized lockstep between heartbeat and application progress, e.g. use the request execution count as the heartbeat, or use an expiring fencing token / lease.

  • Gossip protocols, where multiple peers participate in observing failures, and an inquiry can be confirmed with multiple peers.

  • Quorum decisions: important events such as node failure or node membership change should engage a consensus quorum to make the final decision.
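
The fencing-token idea above can be sketched as below. The class and method names are illustrative assumptions; the essential invariant is that tokens grow monotonically and storage rejects anything older than what it has already seen.

```python
# Fencing-token sketch: the lock service hands out a monotonically
# increasing token with each lease. Storage rejects writes carrying an
# older token, so a paused "dead" leaseholder that wakes up after its
# lease was reassigned cannot corrupt data.

class LockService:
    def __init__(self):
        self.token = 0

    def acquire(self):
        self.token += 1           # every new lease gets a larger token
        return self.token

class Storage:
    def __init__(self):
        self.max_seen = 0
        self.value = None

    def write(self, token, value):
        if token < self.max_seen:
            return False          # stale leaseholder: fenced off
        self.max_seen = token
        self.value = value
        return True

locks, store = LockService(), Storage()
t1 = locks.acquire()              # client 1 holds the lease...
t2 = locks.acquire()              # ...it expires; client 2 takes over
print(store.write(t2, "from-2"))  # True
print(store.write(t1, "from-1"))  # False: the old token is rejected
```

This is why a heartbeat alone is not enough against gray failures: the "dead" client may still be running, and only the storage-side token check keeps its late writes out.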

Besides, disaggregation can be used to improve availability and durability. An example is the VastData DASE architecture. JBODs are disaggregated from server nodes, so they won't fail together with a server. Servers fail more easily due to more complex software and upgrades, while a JBOD can be a simple box with pure functions and hard-wired chips. JBODs and servers are connected with fast RDMA. Upon server failure, a JBOD can be taken over by other servers. The zstorage article also has a good analysis.

  • Smaller failure domain. Compared to a big server running everything, disaggregation naturally leads to smaller failure domains. For example, a server failure should not impact disks, and a metadata-plane failure should not impact reads. Within one functional unit, partitioning can further reduce the failure domain size.

  • Shared everything. If a parent fails, the child functional unit should be able to transfer to another parent. For example, if a server fails, its disks should be taken over by another server. Expanding the picture, everything should be shared by everything, so that no single component's failure takes down its dependents. This dramatically improves availability/durability and unlocks potential throughout the system.

  • Network cost vs PCIe cost. Disaggregation has a fundamental cost: transferring more data through the network. This has become more acceptable with today's fast network bandwidth growth. In comparison, centralizing functional units in one server replaces network cost with PCIe bus cost inside the server; PCIe bandwidth is also growing rapidly.


Durability usually shares techniques with HA, with more emphasis on disk failures/corruption and integrity verification; these have been covered before. Reliability modeling is commonly used, where the exponential distribution satisfies most needs.
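
A minimal reliability-modeling sketch with the exponential distribution follows. All numbers (MTTF, repair window) are illustrative assumptions, not measured rates; real models also account for correlated failures, which this simple independence assumption understates.

```python
# Exponential-distribution durability sketch: each disk fails at rate
# lam = 1/MTTF, so P(failure within t) = 1 - exp(-lam * t). With
# 3 replicas, after one disk fails, data is lost only if the remaining
# two replicas both fail within the repair window (assumed independent).

import math

mttf_hours = 100_000        # assumed mean time to failure per disk
lam = 1 / mttf_hours        # failure rate
repair_hours = 24           # assumed time to re-replicate a lost copy

p_fail_in_repair = 1 - math.exp(-lam * repair_hours)   # one disk
p_lose_both = p_fail_in_repair ** 2                    # both remaining
print(p_lose_both)          # ~5.8e-8 per repair episode
```

The formula also shows why shrinking the repair window matters so much: the loss probability falls roughly quadratically with `repair_hours` for 3 replicas, which is the quantitative argument behind fast, parallel re-replication.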

Resource scheduling

Multi-dimensional resource scheduling on the cloud is a big topic; see DRF/2DFQ etc. mentioned in the Reference architectures section. In this section I cover the system properties of a typical storage system.

  • Priority. A user/background job/request should be handled first or delayed, given maximum or minimal resources. Priorities are also reflected as weights on different user jobs. Usually: critical system traffic (e.g. data repair) > user latency-sensitive workloads > user batch workloads > background system jobs.

  • Throttling. A user/background job/request should not use more resources than its quota. Throttling also means isolating the propagation of impact from one user to another, where shared resources like CPU, network, IO bandwidth, and latency easily become the channel. Typical throttling algorithms are the token bucket / leaky bucket, or a simple queue limit on request count/size.

  • Elasticity has multiple meanings: 1) a service can timely expand to more resources in response to growing load; 2) a background job can borrow unused resources for faster processing, even temporarily exceeding its quota; 3) a low-priority job can timely shrink itself if a high-priority job suddenly demands more resources. Elasticity involves quick startup or resource growth, predicting usage with Machine Learning, instantly enforced quotas, and probing growth, which sometimes resembles congestion control in networking protocols.

    • Resource utilization should eventually be improved, without impacting latency-sensitive workloads. This also benefits energy efficiency, a main datacenter operating cost: CPUs can dial down frequency, and vacant nodes can shut down.
  • Fairness. Commonly mentioned in locking or resource allocation. User jobs should be given similar chances to obtain resources, proportional to their priorities/weights, rather than being biased against or starved.

    • Anti-starvation is the other side of the coin. Low-priority background jobs should not be delayed too much, e.g. GC/compaction that releases capacity. It resembles the important-but-non-urgent quadrant in time management, and requires detecting starved jobs and applying mitigation.

    • Priority inversion is another issue. A high-priority job can be waiting on a resource held by a low-priority job, e.g. a lock. The dependency chain should be traced to bump priority, or the holder preemptively killed and retried.

    • Preemption. It defines the strategies for whether higher-priority jobs should stop/pause lower ones to take their resources. Besides job scheduling, preemption is also seen in transaction scheduling and deadlock resolution. It varies whether younger jobs should preempt older ones, or vice versa. The cost of preempting a long-lived transaction can be high. OCC can also be seen as first-win jobs preempting slower ones, where frequent retries can cost dearly.
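
The token bucket mentioned under Throttling can be sketched as below. Rate and burst parameters are illustrative; real implementations also handle clock sources and concurrency.

```python
# Token-bucket throttling sketch: tokens refill at `rate` per second up
# to `burst`; a request is admitted only if it can take a token. This
# caps sustained throughput while still allowing short bursts.

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now, cost=1):
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

tb = TokenBucket(rate=10, burst=5)            # 10 req/s sustained, burst 5
print([tb.allow(now=0.0) for _ in range(6)])  # burst: 5 admitted, 6th denied
print(tb.allow(now=0.5))                      # 0.5s later: refilled, admitted
```

The `cost` parameter is what lets the same bucket throttle by request size rather than request count, matching the queue count/size cost model discussed below.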

Design dimensions

There are a few design dimensions to consider when designing resource scheduling.

  • Job granularity. Small jobs generally benefit resource scheduling balance. Think of randomly tossing balls into bins: the smaller and more numerous the balls, the more balanced each bin's final count. The method is widely used in multi-core processing, i.e. the async multi-stage pipeline. While small job granularity is beneficial, it costs metadata, increases IOPS, and disks still favor batches.

  • Overload control. System overload followed by cascading failures is not uncommon, e.g. synchronized massive cache expiry, retry counts amplified across layers, node failure repair/retries bringing down more nodes, CPU/memory/network exhaustion propagating the churn, crash failover then crash again, etc. Operational control knobs, graceful degradation, and circuit breakers are necessary.

  • Cost modeling. Read/write size is the common practical cost model in storage systems. Together they compose queue count and queue size. The most comprehensive cost modeling, as a reference, can be found in DB query optimizers. The predicted IO cost can be combined with deadlines to early-cancel requests that cannot finish in time or within resource limits.
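
A hedged sketch of combining predicted IO cost with a deadline, as in the bullet above. The `should_admit` helper and its parameters are hypothetical; a real system would also model seek cost, parallelism, and priorities.

```python
def should_admit(request_bytes, queued_bytes, device_MBps, deadline_ms):
    """Early-cancel check: estimate when this request could finish
    behind the currently queued bytes, and reject it up front if the
    predicted completion time already misses its deadline."""
    MB = 1024 * 1024
    queue_drain_ms = queued_bytes / (device_MBps * MB) * 1000
    service_ms = request_bytes / (device_MBps * MB) * 1000
    return queue_drain_ms + service_ms <= deadline_ms
```

Rejecting such requests early returns an error within the deadline instead of wasting device bandwidth on work the client will time out on anyway.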

Resource scheduling section

About Load balancing

Latency issues can generally be categorized as:

  • Resource overloading. Lasting minutes to hours. The problem happens at resource allocation. The solution is Load Balancing, which can be like Google Borg.

  • Temporary bursts. Lasting seconds or shorter. Common-sense load balancing can hardly work at this tiny granularity. The solutions can be:

    • Stealing. It’s like thread pool job stealing. The one who gets overloaded can steal the resource from a general pool or from others. Each unit can contribute a bit of its resource to compose the virtual pool. Or say, L1 cache is stealing from L3 cache.

    • Levels & Pooling. N resources are allocated to U users, and each user gets N/U. But usage is always biased, and bias is usually sparse. Thus, we introduce an extra overflow pool with M resources, and we allow the biased usage to overflow into it. As a result, it appears as if each user gets N/U + M resources. This shows how adding an extra level with small capacity can fix bias in resource usage. Resource pooling further extends the idea. Such patterns of resource allocation and bias are common, and can be seen in CPU cache design too, e.g. Victim Cache, L1/L2/L3 cache.
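
The N/U + M overflow idea can be sketched as below (hypothetical `effective_share` helper; a real allocator would arbitrate the pool fairly rather than first-come-first-served):

```python
def effective_share(n_resources, n_users, overflow_pool, demand):
    """Each user owns n/u dedicated units; biased demand beyond that
    spills into a shared overflow pool of `overflow_pool` units.
    Returns the units actually granted per user."""
    base = n_resources // n_users
    granted = {}
    pool = overflow_pool
    for user, want in demand.items():
        extra = min(max(want - base, 0), pool)   # overflow into the pool
        pool -= extra
        granted[user] = min(want, base) + extra
    return granted
```

With 100 units, 10 users, and a 20-unit pool, a bursty user asking for 25 gets all 25 while a quiet user asking for 5 gets 5: the small shared level absorbs the bias.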

  • Isolation. This is always needed. Solutions can be like Google Heracles. A common API like Linux cgroups (used by Linux Containers) can be convenient.


Though running the system fast is the most typical meaning of performance, performance maps to more system properties:

  • Latency & Throughput. Latency measures how fast a request is served and returned. It matters more to small requests. Throughput measures how fast a given size of data is processed and returned. It matters more to a single large request, or a batch of requests up to some size. Note that requests waiting in queue negatively affect latency, by adding arbitrary queuing latency to serving latency. But they generally benefit throughput, if the system is not overloaded, by exploiting batching and parallelism in request serving. Queue depth (QD), or outstanding/active/on-going request count, measures such behavior.

    • Tail latency. Request latency is a probability distribution where P25/P50/P99 usually vary greatly, especially in cloud storage that serves mixed workloads from many customers, with unpredictable burst patterns, at a large scale. P99 matters because it still maps to many customers. P25 is usually achieved by cache hits, while P99 can point to bad cases in request execution. Typical techniques to reduce tail latency include sending extra requests, monitoring lagging nodes with proactive retries, and the power of two random choices.

    • Queuing theory. The system is abstracted into components connected sequentially/in parallel by queues. While stochastic math can be used in modeling, simulations with production samples are generally more practical. Though queuing theory points out that serving latency can grow to infinity at 100% resource utilization, the assumption is fully stochastic request ingestion. In a well scheduled system, where requests arrive at chosen times instead of stochastically, it is still possible to achieve high resource utilization with low latency (for high priority jobs). Queuing theory is also used for Capacity Planning, where the queuing layout can point out the bottlenecks of data flow, and it helps debugging/troubleshooting to narrow down which point injected the excessive latency. Queuing theory also guides configuration tuning: only when the queue sizes and capacity at each component are well fitted can the overall system performance reach its max.

    • Instructions per cycle (IPC). While latency/throughput are useful to measure IO systems, what are the corresponding concepts in the CPU-cache-memory area, or in-memory processing systems? The typical measures are IPC, cache misses, and memory stalls, from CPU statistics. A well designed program increases IPC by reducing mis-predicted branch jumps and making efficient use of CPU cache, pipelining, and memory prefetch, as well as by reducing cache invalidation, cache line locking, and process lock waits due to concurrency control algorithms.
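
The queuing theory point above can be made concrete with the textbook M/M/1 result, mean time in system W = 1/(μ − λ): latency is modest at 50% utilization but blows up as utilization approaches 100%. This assumes fully stochastic arrivals, exactly the caveat noted above.

```python
def mm1_latency(service_rate, arrival_rate):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
    As utilization rho = lambda/mu approaches 1, latency diverges."""
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

# A server draining 1000 req/s: latency at 50% vs 99% utilization.
low = mm1_latency(1000, 500)    # 0.002 s, i.e. 2 ms
high = mm1_latency(1000, 990)   # 0.1 s, i.e. 100 ms: 50x worse
```

The 50x latency gap between 50% and 99% utilization is why capacity planning leaves headroom, and why well scheduled (non-stochastic) arrival is needed to push utilization higher safely.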

  • Predictable performance. A higher requirement for latency/throughput is that they should be consistent across requests, across time, and at any scale. A typical anti-example is SSD performance varying over time while background GC runs, where the term “Deterministic latency” is often used. Another anti-example is SSD performance starting to drop after the over-provisioned space is used up, where the term “Sustainable performance” is often used. People also expect cloud storage to provide consistent latency from request to request, i.e. to shorten the gap between P50 and P99, and to ensure stable performance while an App/VM runs for days and gets migrated.

    • Factors affecting predictable performance. A background maintenance job like GC/Compaction can easily block user requests with a large read/write request at the head of the queue. Workloads have changing hotspots, while load balancing and migration may not kick in in time. Customer TPS/capacity can grow rapidly, with bursts, while auto scaling is not responsive enough and the switching is not smooth. Migration itself also consumes resources. A VM can run with noisy neighbors, where co-locating is necessary for resource efficiency but quota/throttling isn’t perfect. Caches can miss, while cold restarts or traffic churn can cause cascading failures. Switching between cache hit/miss, or anything similar, is a behavior of bi-modality, which is a fundamental cause of performance variance. DBs may have schema changes running in the background. Adaptive execution switches strategies, data structures, and indexes mid-flight according to the traffic pattern; it is more efficient, but can create non-smooth jumps in performance. Networking can also have hiccups, congestion, and incast problems. Overall, achieving predictable performance is still one of the challenges in cloud storage.

    • Service-level Agreements (SLA) / Service-level Objectives (SLO). Cloud storage offers customers an SLA, a money-backed guarantee about performance and availability/durability, while the SLO gives more rigid measured numbers. Offering predictable performance is even more important to customers than simply saying we are fast. What may also outweigh being fast is offering a rich feature set, trustworthy customer service, helpful troubleshooting and visualization, and extreme data safety & security.

    • Graceful degradation. When overloaded, or when some components are offline (e.g. the Auth service), or a new feature is disabled / rolled back, the system should have a graceful path to degrade the serving level. What should be avoided are cascaded failures, retry storms, or missing operation knobs for recovery. Typical techniques include throttling with circuit breakers, canceling requests that cannot meet a future deadline, avoiding amplified retries at each level, etc.

    • Quota/throttling/admission control/deadline. These words have overlapping meanings. Customer accounts or allocated objects are provisioned with quotas, and these quotas are further used for job scheduling. Throttling is a common need in cloud storage with multi-tenancy: it enforces resource usage by quota, protects from system overload, and avoids latency impact from noisy neighbors. Soft quotas are usually allowed to be shared between customer objects, or between different customers, to temporarily absorb bursts. Long-term or periodical traffic changes can be learned by Machine Learning to proactively scale up/out on demand. Throttling can be dialed with an incremental feedback control loop, and interference measured with runtime micro experiments (NyxCache paper).

    • Cold restart is a typical issue: if a cache node restarts, it cannot serve requests well until filled up again. This can easily introduce a churn during batch upgrade, overload the system, kill more nodes, and bring a cascaded failure. AWS Redshift introduced Warmpools to provision pre-warmed cache nodes.
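
A minimal token bucket, the common building block behind the quota/throttling bullet above. This is a teaching sketch, not any specific cloud implementation; time is passed in explicitly to keep it deterministic and testable.

```python
class TokenBucket:
    """Token-bucket throttle: sustained rate `rate` tokens/sec, with
    bursts up to `burst` tokens absorbed by the bucket. This matches
    the soft-quota idea above: short bursts pass, sustained overuse
    is rejected (or, in practice, queued/delayed)."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now, cost=1):
        # Refill proportionally to elapsed time, capped at burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A feedback control loop, as mentioned above, would adjust `rate` up or down based on measured system load instead of keeping it static.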

  • Scalability. The fundamental way to conquer a scale problem is divide and conquer. With the fast growth of modern hardware, being efficient in scale-up is also necessary, e.g. working with manycore CPUs with efficient concurrency, handling large memory with NUMA, and responding fast with RDMA networking, PMEM, and NVM SSD. Scaleout is the classic cloud storage solution, with infinite scale (in theory), but every step in distributed consistency and communication charges your COGS.

    • Partitioning & Replication. Partitioning scales out the performance of the overall data space, while Replication scales out the performance of a specific data unit. They can work at different and non-symmetric fine-grain levels. Caching can also be seen as a case of replication, which leverages more expensive hardware to increase performance within spatial/temporal locality.

    • Data tiering. Caching replicates data onto faster storage hardware, while data tiering migrates data across tiers. Another fundamental way to increase performance is to run on better hardware. In recent years, growth in the hardware industry, e.g. memory, networking, SSD, disk density, has been even faster than in software, so buying new-generation hardware can beat hand-optimizing the software in both cost and time-to-market.

  • Resource efficiency. Better performance commonly requires programming efficient code. The techniques vary at different system layers, e.g. CPU-cache, in-memory computing, networking, and at different storage media, e.g. HDD, SSD, PMEM, DRAM. The next fundamental way to increase performance is to do less things. A typical example is that a system runs faster with all logging turned off, and a new system with fewer features usually runs faster. The next key part of resource efficiency is load balancing. Often it is not too few resources, but problems in exchange and fair assignment, that cause starvation.

    • Load balancing. The first hop of load balancing is efficient job scheduling and placement on servers, which best coordinates resource utilization, fairness, and co-locating jobs under SLA guarantees. Customer jobs run with close monitoring of growth, bursts, and hotspots, which involves scaleout and partition split/merge. The cluster monitors for over/under-utilized nodes and conducts migration from time to time. Quota/throttling/admission control is the next part, to protect the SLA, ensure predictable performance, and act as a trigger for migration. Node failure detection is an infrastructure ability needed in between, where gray failures can inject intermittent latency or report inconsistent health status, and need robust handling.

    • COGS. Overall, the cost of IOPS, storage space, and query TPS should be measured and controlled to understand the end-to-end resource efficiency. Project/Product management also incorporates COGS into decision making on whether an investment is worth its cost. The COGS is essentially sellable earnings compared to overall spending on datacenter purchase/operation, telecom renting, R&D, etc. Capacity Planning also takes part in COGS, deciding what SKUs and how many to purchase, usually months to years ahead.

    • Kernel bypassing. Intel DPDK became popular along with RDMA, which requires faster CPU processing; the Linux kernel networking stack is relatively slow, so it gets bypassed. RDMA can also be seen as bypassing the server CPU. The approach was then adopted in Intel SPDK, where kernel bypassing enables faster CPU processing for PMEM and NVM SSD. DPU further bypasses the host CPU to take over common storage infrastructure. Ceph also built BlueStore, which underneath implements the customized BlueFS that bypasses many functionalities of the original Linux filesystems. Kernel bypassing is another example of doing less things: shorter call paths, fewer hops, direct access, direct return.

  • Hardware acceleration/offloading. While the CPU is general purpose, the same (or less) money spent on special-purpose chips can yield even higher computation throughput at a lower energy consumption. Besides, the CPU itself is finding it harder to catch up with the rapidly growing processing speed required by modern IO devices like PMEM and RDMA networking, and by Deep Learning / Machine Learning. Offloading is easier when the computation is more standardized, e.g. network packet processing, compression/encryption; while disk IO is usually more complex and interleaved with variable data formats and exception handling.

    • ASIC based compression/encryption cards are common. AWS Nitro / Microsoft Catapult are successful business cases that ASIC/FPGA boost virtual cloud networking, as well as compression/encryption, etc.

    • SmartNIC builds virtualization, RDMA, and processor offloading into the NIC. CPU work can be offloaded to the NIC level, with a shorter roundtrip path. Smart SSDs (or computational SSD drives) build query processing at the SSD level, filtering data early to save PCIe transfer.

    • GPU/TPU are leading Machine Learning acceleration, dedicated for FLOPS in thirsty Deep Learning training. IPU/DPU try to consolidate datacenter infrastructure into more COGS efficient chips. More advanced GPU interconnection like NVLink are being developed, composing an HPC cluster.

    • HPC is another area where high-end hardware, usually with customized accelerators and manycore, is used for scientific processing. The accelerators usually then gain maturity and enter the market of commodity servers, like RDMA.

  • Debugging & Troubleshooting. Performance is not only a matter of now, but also of a good velocity to improve it. Only when there are metrics are there insights to make improvements. A well-designed monitoring system involves realtime time-series metrics, logging with exchangeable standards, and a data warehouse for retention and complex queries. OpenTelemetry, which is similar to Google Dapper, is a typical microservice tracing framework that can be used to debug performance issues.

    • A typical analysis involves a top-down breakdown of the component calling hierarchy (or queuing layout), to narrow down which component injected latency. The culprit requests are then correlated with recent system changes, certain SKU tags, traffic patterns generated by source units, etc. After going to the server level, the narrow-down further branches into disk IO, network IO, or CPU/caching inefficiency. At each branching point, there should be supporting tools for investigation and visualization. In the end, the analysis should give estimated impact numbers that match the observation, to validate the hypothesis.

    • A thought experiment starts from a bottom-up approach. Suppose latency were injected at a bottom component, by a certain type of requests, at a specific percentile level. Does the system have enough metrics and troubleshooting tools to find it out? And then, from the top down again, what is the main contributor that affects latency? Performance troubleshooting shouldn’t be a hard problem. Instead, it should be a systematic approach that discovers what we can find and where we miss metrics and tools, and then enhances the infrastructure step by step.

    • Line speed, gap analysis. Another approach to analyzing performance is to first find out the raw hardware speed (line speed) of the underlying storage or networking device, and then analyze what composes the gap between line speed and the actual performance of the storage system. This provides a systematic approach to dissect performance layer by layer, and is guaranteed to reach the max given abundant dev resources invested. In any case, optimization should start from the bottleneck, backed with metrics insight.

Performance section

Concurrency & parallelism

Exploiting concurrency & parallelism is the key technique to improve performance. This section covers those techniques. We mainly focus on optimizing a single node with multi-core here; distributed scaleout systems are put in a later section. In general, parallel means happening at the same time (needs hardware support), while concurrency means happening together but not necessarily at the same time (by interleaved scheduling).

The fundamental ability of parallelism comes from hardware parallelism. E.g. a CPU cache chip can be designed to look up all cache lines at the same time, while a software hashtable has to resort to various multi-threading techniques backed by CPU multi-core. The best performance comes from utilizing all parallel units, with minimal coordination/synchronization overhead.

  • Typical hardware parallelism to exploit is listed here. First, most commonly, CPU socket -> CPU cores -> CPU hyper-threading. Next, NUMA and DRAM banks. SSDs have built-in parallelism at the Plane level (Chip -> Die -> Plane -> Block -> Page -> Cell). PMEM may have similar internal parallelism to SSD.

  • SIMD and vectorized execution are common DB techniques to exploit the data parallelism per instruction. A column scan is treated as operating on (bit) vectors. Further, Code Generation and JIT are used to produce more CPU-efficient execution plans. AWS Redshift further looks up recent compilations in external caches.

  • ASIC, FPGA, TPU, GPU. Specialized hardware can further boost parallelism and efficiency for the target workload. For FPGA, the larger the chip area, the more computation units can be programmed, and thus the more can work in parallel. More chips can be interconnected with high bandwidth links (e.g. NVLink) to compose an HPC cluster.

Load balancing is critical to achieve max efficiency among multiple hardware units being utilized in parallel. This is just like a scaleout distributed system.

  • Cutting tasks into smaller units makes them easier to balance, just like tossing smaller balls into bins. This explains why storage engines can benefit from a multi-staged pipeline. It resembles smaller partition sizes in a distributed system. Besides cutting tasks, pipelining overlaps task execution to improve utilization of the underlying resources. Prefetching and speculative execution further overlap the future with now.

  • Work stealing is another common technique. Idle threads seize jobs from busy threads. The cost of scheduling tasks is automatically amortized across idle threads. It resembles job migration in distributed systems.
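
A toy sketch of work stealing (a single-threaded simulation kept deterministic for illustration; a real pool like Cilk-style schedulers uses per-thread deques with actual threads and lock-free steals):

```python
from collections import deque

def run_with_stealing(queues):
    """Each worker pops local work from the front of its own deque;
    an idle worker steals one task from the back of the fullest
    deque. Returns how many tasks each worker completed."""
    done = [0] * len(queues)
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                q.popleft()                 # local work: front of own deque
                done[i] += 1
            else:
                # Steal from the back of the busiest victim, so the
                # victim's front (its hot, cache-warm end) is untouched.
                victim = max(range(len(queues)), key=lambda j: len(queues[j]))
                if queues[victim]:
                    queues[victim].pop()
                    done[i] += 1
    return done
```

Starting with all 8 tasks on worker 0 and none on worker 1, the idle worker ends up doing half the work, which is the amortization the bullet above describes.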

Reducing communication is the most important topic. Locking and synchronization are the top areas in concurrency & parallelism. They are used to coordinate communication. But the best system is designed to NOT require communication at all, and is thus the most simplified. The same applies to both a distributed scaleout system and a single-node scale-up system with manycore.

  • Symmetric parallel copies (同态多副本, homogeneous multi-copy). The data and tasks are sharded into multiple copies. Each copy processes in exactly the same way. Copies don’t need any interaction. E.g. processing requests from different customers in a cloud storage system. E.g. Ceph OSD, where each thread exclusively owns a disk. E.g. a networking switch where one core schedules tasks to all other cores doing plain packet processing.

  • Communication density. Locking/latching, reading another thread’s thread-local, and accessing a shared memory/cache address are all communication. Plot the communication connections of each CPU core. How frequently is such communication done? What’s the connection fan-out? What’s the webbing density? A good algorithm should reduce all three.

    • Lock/latch-free algorithms usually have high communication cost under manycore conditions, as pointed out by the HyPer Scalable MVCC GC paper. The communication point is usually a CAS operation, which underneath locks a CPU cache line. All cores race on the lock, creating an N-to-N communication map, which is frequently triggered, high fan-out, and thus high webbing density.

    • Flat Combining, and also the thread-local technique used in the above HyPer paper. Each thread only works in its thread-local, and a leader thread consults all other threads to do the coordination work. This reduces communication to a fan-out of 1-to-N, and thus reduces webbing density.

    • Epoch-based reclamation further reduces the communication frequency. Only when the epoch has passed and every thread-local is away from the shared resource does the leader thread do the coordinated resource cleanup. Similar ideas apply to techniques like Sloppy Counters, delayed batched async updates, etc.
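
The thread-local / sloppy counter idea above can be sketched as below (hypothetical `SloppyCounter`; in real manycore code each slot would additionally be padded to its own cache line to avoid the false sharing discussed next):

```python
import threading

class SloppyCounter:
    """Each thread bumps only its own slot, so writers never touch a
    shared location; a reader combines all slots on demand. The read
    is approximate while writers are in flight, hence 'sloppy'. This
    turns N-to-N write contention into a 1-to-N read fan-out."""
    def __init__(self, num_slots):
        self.slots = [0] * num_slots

    def add(self, slot, n=1):
        self.slots[slot] += n      # private slot: no cross-thread racing

    def read(self):
        return sum(self.slots)     # leader combines; O(slots), done rarely
```

Four threads each incrementing their own slot 1000 times sum to exactly 4000 once all writers have quiesced, which is the epoch-style moment at which a combined read is exact.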

  • Reduce competing resources. Avoid racing on resources when it’s not necessary. A typical example is False Sharing, where CPU cores race on a cache line whose dependency is not required by the App, but introduced by the compiler packing memory objects.

    • Partitioning and reducing lock granularity. A typical technique is to partition the hashtable, so each lock only owns a shard. This partitions the communication web to reduce connection density. Also, typical programming courses teach fine-grain locking. This reduces the duration of each communication connection, similar to reducing frequency, and may also reduce the connection fan-out.

    • B+-tree lock coupling steps locks through tree parent/child nodes with a limited lock span, like a crab. Compared to locking the whole sub-tree, it reduces the lock scope, and thus the competing resource. It’s another example of fine-grained locks. Acquiring locks in a fixed order is related: by pre-building coordination with fixed rules, deadlock can be avoided.

    • Copy-on-write, immutable data objects, shadow paging, and delta updates are related techniques. Instead of working on the original data, updates work on a copy, or only write deltas. In this way, the updaters avoid racing on the original data. Besides, immutability can greatly simplify system design, yet poses pressure on later GC.

    • Concurrency by scheduling. The example is NetApp Waffinity. Accesses to disjoint files and address partitions can be safely parallelized. Instead of programming low-level locks, NetApp uses a top-level scheduler to ensure racing accesses won’t be scheduled concurrently.
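
The partitioned-hashtable locking mentioned above is usually called lock striping; a teaching sketch follows (hypothetical `StripedDict`; CPython's GIL hides most of the real benefit, so treat this as structure, not a benchmark):

```python
import threading

class StripedDict:
    """Hash table with lock striping: each key hashes to one of
    `stripes` locks, so threads touching different shards never
    contend. This cuts the communication web into disjoint pieces."""
    def __init__(self, stripes=16):
        self.locks = [threading.Lock() for _ in range(stripes)]
        self.shards = [{} for _ in range(stripes)]

    def _shard(self, key):
        return hash(key) % len(self.locks)

    def put(self, key, value):
        i = self._shard(key)
        with self.locks[i]:         # only this shard's lock is held
            self.shards[i][key] = value

    def get(self, key, default=None):
        i = self._shard(key)
        with self.locks[i]:
            return self.shards[i].get(key, default)
```

With 16 stripes, two threads collide on a lock only when their keys land in the same shard, roughly a 1/16 chance, instead of always as with one global lock.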

Also worth mentioning are the engineering aspects of concurrency & parallelism. I categorize coroutines under this part.

  • Coroutine, thread, and process. In theory, they should be able to achieve the same level of performance or parallelism, except that coroutines allow bypassing the kernel, and threads are more lightweight in sharing resources/memory than processes. However, a nice programming API does matter, by which coroutines quickly gained adoption with async/await. Threads are left to give developers root control over concurrency & parallelism, where the thread execution pool can be carefully designed; while processes are better at resource/fault isolation.

  • Sync & Async. In theory, they should be able to achieve the same level of performance or parallelism. But async programming more easily overlaps CPU time with IO time to improve efficiency (e.g. epoll), and more easily cuts long functions into smaller tasks to benefit load balancing. Fundamentally, async can be implemented by either busy polling (sync) or periodic checks. Busy polling can sometimes be more efficient when working with high performance IO devices like NVM and RDMA.

  • Lock and preempting. A simple lock lets the first comer win and blocks later comers. But it can be implemented differently: let either the first or a later comer win, blocking or non-blocking, and with OCC retry. Such techniques can be used to optimize DB transactions, especially short-lived OLTP transactions mixed with long-running OLAP transactions.

  • Testing the correctness of a complex concurrent program is not easy, yet important for cloud storage. C# Coyote searches through the large execution ordering space to find potential bugs. FoundationDB is also equipped with Deterministic Simulation Testing built on Flow. Besides, TLA+ is used to model the state machine on the side to verify liveness and invariants.

CPU-cache and in-memory

Performance optimization can be broken into several aspects:

  • CPU, cache, and memory. They usually overlap with Scale-up topics and optimizing a single node, i.e. how to efficiently utilize them after stacked more CPU cores and large memory. We’ll cover below.

  • IO and networking. They usually overlap with Scaleout topics, where a distributed system interconnects many nodes. Also, disk IO and networking traditionally are slower than the CPU, cache, and memory plane. We’ll cover more in the Networking section.

For optimizing the CPU, cache, and memory plane, there are a few aspects below:

  • Concurrency & parallelism, as we covered in the previous section.

  • Memory wall. Today’s CPU is much faster than DRAM (1ns vs 100ns, see Interactive latency), relying on cache as the bridge in between. Efficiency can be measured by IPC (instructions per cycle), memory stall, and cache miss counters. A good algorithm needs to 1) exploit locality for caching, 2) pipeline with cache prefetching, 3) avoid racing on the same cache line, 4) avoid extra memory writes by keeping most operations in CPU registers and cache.

  • Branch mis-prediction is costly for CPU speculative execution. An efficient data processing program should avoid too many non-deterministic if-branches. Such principles become even more important for the GPU, which has minimal control units and dedicates most chip area to synced data operations.

  • Do less things will always make the program faster. DB query Code Generation and JIT can be seen as an example, where highly customized code is compiled for each specific SQL query to improve CPU efficiency. Though such code is either unfriendly to human programmers, or there are too many combinations to handcraft.

Scaleout storage

In this section we focus on optimizing performance at the distributed scaleout storage plane. The previous sections already covered most topics, such as Load balancing, Tail latency, and Pipelining. Earlier sections discussed Partitioning, Caching, Indexes, Sequentializing IOs, etc. The majority of performance improvement comes from scaleout itself, and from carefully optimizing single node performance. We add bullets not covered above:

  • Compression is a seemingly separate topic but can significantly improve performance, because less data is transferred across IO devices. We have talked much about it in previous sections.


Networking is generally orthogonal to storage design. It’s more attached to datacenter construction and hardware equipment. It affects the storage system in a few aspects:

  • Performance plane. Networking affects the latency of inter-node communication. The network links, considering oversubscription and path routing, bound the max throughput of data transfer. The datacenter design bounds the total capacity a storage cluster can grow to, now and in the future. The stability of the network affects how much churn the application needs to handle from transient node failures and message losses.

  • Cost of ownership. The power consumption of networking devices costs money per month, and doubles by cooling (Power component comparison). Resource utilization affects how efficient the money was spent, considering traffic engineering and CPU/network bottleneck.

  • Upgrade management. Networking speed is quickly growing. Old/new switches with different bandwidth need to co-work efficiently. Upgrade needs to carry out without interrupting live services, and with minimal traffic degradation. Construction of a new datacenter can take years.

Compared to the other parts of a storage system, networking has several key differences:

  • Non-persistent. The whole storage stack is built around durability, but networking doesn’t need to worry about that. The stateless property relieves upgrading and design complexity, and allows shifting more focus to standardized interchange protocols and high performance switching.

  • HA architecture. Today’s datacenter networking is usually based on the Clos network architecture. TOR, leaf, and spine switches are connected in full mesh, which naturally supports HA when one device goes down. Switches (which usually merge router functionality) dynamically update routes with protocols like OSPF. Flows choose paths in an HA manner, with multipathing protocols like ECMP and WCMP.

  • Standardized logic. Compared to disk IO, which involves custom data formats and exception handling, network functions are more standardized and stateless. Networking in fact converges to cross-vendor protocols and published specifications. Functionalities are frequently offloaded to lower level components, e.g. SmartNIC, FPGA/ASIC, RDMA.

  • Quickly growing speed. Networking speed has been growing quickly these years, from 10Gbps and 40Gbps to 100Gbps, even 200Gbps. Though the NIC works well with small packets, a CPU core can hardly catch up. Given 100Gbps and 1KB packet size, a core has only 80ns to process each packet.
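
The 80ns figure above is simple arithmetic (taking 1KB as 1000 bytes):

```python
def ns_per_packet(link_gbps, packet_bytes):
    """Back-of-envelope CPU time budget per packet at line rate."""
    packets_per_sec = link_gbps * 1e9 / (packet_bytes * 8)
    return 1e9 / packets_per_sec

# 100Gbps / (1000B * 8 bits) = 12.5M packets/s -> 80 ns per packet
budget = ns_per_packet(100, 1000)
```

At ~80ns a core gets only a few hundred instructions per packet, and even one DRAM miss (~100ns) blows the budget, which is why kernel bypassing and NIC offloading matter at these speeds.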

Networking section

Networking architecture

The fundamental level is the networking architecture. It defines how the datacenter networking infrastructure is constructed, which constrains the baseline performance and scalability.

  • Clos network is the commonly used networking architecture. Compared to its ancestor, a huge “single switch bar”, Clos is built by connecting small switches. The advantages are allowing expansion by adding individual switches, and tolerating individual switches going offline. It consists of multiple tiers, e.g. T0 (TOR), T1 (leaf layer), T2 (spine layer). Each tier is a group of switches. Neighbor tiers are bipartite connected in full mesh. Real-world deployments can have customizations:

    • Oversubscription. To save cost, higher tiers may have sparser links and lower aggregated bandwidth than lower tiers. It assumes locality, i.e. that denser consumption is constrained within lower tiers. Similar patterns can be seen in databases pushing down query filtering to Smart SSDs, MapReduce co-locating worker jobs with data nodes, and ECWide reducing T1 traffic by leveraging intra-rack repair.

    • Sub domains. In Google Jupiter Rising, instead of full-mesh connecting T1 to T2, T1 is cut into disjoint domains called “Aggregation block”. Separating domains reduces link density. “Aggregation block” is also the controller domain used for failure isolation in Google Orion SDN. Dragonfly topology shows the similar idea called “Subnetwork” / “Group”.

    • The internal of an Aggregation block is “two layers of switches bipartite connected”. It aggregates traffic before routing through T2. It acts as a mini Clos network, or a virtual switch with high radix (many ports). A similar idea also shows up in the Dragonfly+ topology.

    • Sidelinks. In standard Clos, a switch cannot directly connect to another switch in the same tier; it must go through a higher tier. Instead, Google B4 After (WAN networking) introduces the Sidelink, which adds connections within the same tier (within the same datacenter). It exploits the asymmetry that Sidelinks are cheaper than cross-tier (cross-datacenter WAN) links. If every possible Sidelink is added, the Clos network degenerates into a symmetric full-mesh connection (like the group-to-group connection in Dragonfly+).

    • Optical Circuit Switch (OCS). Google Jupiter Evolving replaced the spine layer with OCS. Besides lower latency, OCS is data rate agnostic: it simply reflects colored lights with tiny motorized mirrors. When old electrical switches are upgraded with faster ones, OCS needs NO change to support the higher bandwidth. Internally, OCS needs to reprogram virtual optical circuits, which is done by Google Orion SDN. OCS takes bidirectional optical links, converted from electrical cables via optical circulators & WDM transceivers.
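
A back-of-envelope for the oversubscription bullet above (the port counts below are hypothetical, purely for illustration):

```python
def oversubscription(down_count, down_gbps, up_count, up_gbps):
    """Oversubscription ratio of a switch tier: aggregate bandwidth
    facing the lower tier vs. bandwidth facing the upper tier.
    E.g. 3.0 means only a third of worst-case traffic can go up,
    betting on locality keeping most traffic in the lower tier."""
    return (down_count * down_gbps) / (up_count * up_gbps)

# Hypothetical TOR: 48 x 25G down to servers, 8 x 100G up to leafs
ratio = oversubscription(48, 25, 8, 100)   # 1200G down / 800G up = 1.5
```

A ratio of 1.0 is non-blocking; datacenter tiers commonly run oversubscribed because the locality assumption usually holds, and the savings in uplinks and spine ports are significant.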

  • Routing. Today datacenter switches usually merge with router functions. Standard routing protocols update dynamically based on neighbor advertisements, e.g. OSPF (Link state based), RIP (Distance vector based). However, datacenter networking architecture is mostly static. Google Jupiter networking instead relies on Orion SDN, a centralized control plane, to periodically refresh the routing and push down to switches via OpenFlow protocol.

    • SDN enables Traffic Engineering, which collects metrics in realtime, splits flows across heterogeneous switches with WCMP multipathing, and reacts to temporal traffic churns and device failures.
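
    As an illustration, WCMP can be sketched as weighted hashing over a flow key: hashing keeps all packets of one flow on the same path (avoiding reordering), while weights split flows across unequal links. This is a minimal sketch; the hash function and weight table here are hypothetical, not Orion's actual implementation:

    ```python
    import hashlib

    def wcmp_pick(flow_key, links):
        """Pick an egress link for a flow. `links` maps link name to weight
        (e.g. proportional to link capacity). Hashing the flow key keeps a
        flow pinned to one path; weights skew the split across links."""
        total = sum(links.values())
        h = int(hashlib.md5(flow_key.encode()).hexdigest(), 16) % total
        for link, weight in links.items():
            if h < weight:
                return link
            h -= weight

    # Usage: a flow identified by its 5-tuple, two heterogeneous uplinks.
    links = {"uplink-a": 40, "uplink-b": 10}  # e.g. 40G and 10G capacities
    path = wcmp_pick("10.0.0.1:5000->10.0.1.2:80/tcp", links)
    ```

    With equal weights this degenerates to plain ECMP; WCMP's value is exactly the unequal split across heterogeneous switches.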
  • Control plane. Traditionally, a network admin configures each switch with individual terminals. Nowadays the trend shifts to a centralized control plane, which is usually combined with SDN. The centralized controller collects and aggregates metrics to show in one place for the network admin. Configurations and routing are determined with the global view, and pushed down to end switches (data plane). OpenFlow is the de-facto protocol for communication between the control plane and the data plane.

    • Besides Google Orion, there are other SDN controllers like OpenDaylight, Openstack Neutron. SDN can do interesting things like virtual distributed router, VM private networks.
  • Data plane. Besides commercial switches, Open vSwitch (OVS) is one of the best-known SDN data plane software packages that can be installed on commodity switches (e.g. Linux). They follow the OpenFlow protocol to co-work with the SDN controller.

    • Though typical datacenter networking data plane follows Clos, CDN can use P2P architecture. E.g. In Facebook Owl, neighbor nodes exchange data copies, while they still maintain a centralized control plane.

    • The term control plane network is coined in Google Orion SDN but generally applicable. For reliability, the control plane network and the data plane network are usually separated, e.g. the management port vs. normal ports on a switch. When a misconfig interrupts the data plane network, we can still use the control plane network for debugging, fixing, and sending remote commands.

  • Cross datacenters. The above mainly focuses on networking within a datacenter. Global datacenters can be interconnected with tunnel protocols (e.g. VPN) running on WAN. WAN is composed of Autonomous Systems (e.g. Internet ISPs, company networks), where BGP becomes the typical routing protocol (stable, limited update frequency, rich pathing policies).

    • Google B4 Experience follows a consistent approach with Jupiter. It uses Google Orion as the SDN controller and applies traffic engineering. It adds one “Aggregation Block” in Jupiter network, but for external facing traffic, called “FBR supernode”. Google datacenter network says internal facing traffic (large DB queries & responses) is much heavier than external facing traffic (customer web requests & responses).

Load balancers

The next level of networking is the load balancer. It’s the gate for requests to enter the datacenter. It dispatches traffic to proper nodes. It also merges with a handful of utility functions. It’s not a necessary part of a distributed storage system, but usually serves as the frontend.

  • Global load balancing is the first stop. A customer request should go to the nearby datacenter within the same geo location. The load balancing is usually done by DNS resolving. The same website domain name is translated into different IP addresses. Each IP address maps to the nearby datacenter to route to.

  • Datacenter load balancer. The load balancer at the gate of datacenter. There are many types categorized by hardware or software, single node or distributed.

    • A commercial load balancer, e.g. F5 BIG-IP, is typically a hardware box (with a backup node) deployed at the gate of datacenter. Load balancer exposes a VIP (virtual IP) to external customers and hides the group of internal servers running on physical IPs.

    • Alternatively, distributed load balancer can be built with software, e.g. Google Maglev, UCloud Vortex. It’s a group of servers that horizontally scale out. They can maintain shared states via Paxos (or simply in a database). They receive packets from the external router via ECMP. They distribute requests to internal servers via consistent hashing.
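
    The consistent hashing mentioned above can be sketched with a classic hash ring. This is a minimal sketch for illustration, not Maglev's actual lookup-table algorithm: each server gets many virtual nodes so load spreads evenly, and removing a server only remaps the keys it owned.

    ```python
    import bisect, hashlib

    class ConsistentHashRing:
        """Toy consistent-hash ring. `vnodes` virtual nodes per server
        smooth out the load; a key is owned by the first virtual node
        clockwise from the key's hash."""
        def __init__(self, servers, vnodes=100):
            self.ring = sorted(
                (self._hash("%s#%d" % (s, i)), s)
                for s in servers for i in range(vnodes)
            )
            self.keys = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.sha1(key.encode()).hexdigest(), 16)

        def lookup(self, flow):
            # Wrap around the ring with the modulo.
            i = bisect.bisect(self.keys, self._hash(flow)) % len(self.ring)
            return self.ring[i][1]

    ring = ConsistentHashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    server = ring.lookup("client-42")
    ```

    The key property for a distributed load balancer: when one backend dies, only the flows it owned move, so the other servers' connection state survives.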

    • There is another set of load balancers coming from web application area. Typical examples are NGINX, HAProxy, LVS (Linux Virtual Server). They achieve HA with a mutually watching backup node and Keepalived.

  • Load balancer working layers. Load balancers can be categorized by which layers of info in OSI model are leveraged to dispatch requests.

    • Layer 2 (by MAC address) load balancers are rarely seen.

    • Layer 3 (by IP address) load balancers route by IP addresses. The typical example is routing requests targeting a VIP to a group of physical IPs. Routers can also be seen as load balancing at layer 3, e.g. splitting traffic with multipathing, ECMP, and BGP to select Autonomous Systems with a lower toll.

    • Layer 4 (by TCP/UDP) load balancers take ports into consideration, which allows e.g. mapping port 80/443 to http/https pools, and NATs.

    • Layer 7 (by application content) load balancers leverage application level message content to dispatch requests, e.g. URL cookies to implement sticky sessions. Leveraging layer 7 is more complex and slower, but can be powerful, e.g. integrating a firewall into the load balancer (more examples below). Mostly, load balancers are moving to cover all layers up to 7.

  • API gateway is a term popularized by microservices. The load balancer is usually at the gate of the datacenter or service cluster. Many more features can be integrated into it. Like the firewall, advanced features require the load balancer to cover Layer 7.

    • Router and load balancer can merge into one (hardware) box. Just like switch and router functions can be merged. Load balancer can also run BGP to route large responses through proper Autonomous Systems.

    • Heartbeat health check can be done by the load balancer. It needs to tell which internal server is bad and avoid sending more packets to it.

    • HTTPS. Load balancer can serve as the boundary to handle encryption/decryption. External clients connect to load balancer via HTTPS, while cluster internal works on HTTP (trusted environment). VPN can be served in a similar way. Even user authentication can also be integrated into load balancer.

    • Load balancer can merge with Firewall. Being on the critical path of all traffic flows, it unpacks application messages to filter malicious content. Similarly, the load balancer can be extended to protect from DDoS attacks.

    • URL dispatching. Load balancer can merge with API gateway. It understands web application URLs and dispatch them to the desired pools of servers. Each URL pattern can map to a different microservice, where load balancer works as the service router.

    • Load balancer can merge with Circuit breaker. It tracks realtime API or user traffic usage, performs throttling, and degrades the service if it overloads the cluster, to prevent cascaded failures.

  • Direct server return (DSR) is a technique typically used with load balancer. Internal servers send response packets directly to external clients, bypassing the load balancer. It saves load balancer bandwidth, especially when responses are much larger than customer requests, e.g. video streaming.

Congestion control

Coming to the transport layer, a big topic is how to maintain max transfer throughput while avoiding congestion. There are a few key factors that affect datacenter networking performance:

  • Switch buffer buildup. The more messages queued in the switch buffer, the higher the latency. When the buffer overflows, the switch drops packets and marks congestion (ECN). This is when congestion happens. The dropped packets experience yet another round of latency due to resend. Resends in turn overload the network even more.

    • The next level of the problem is that TCP is guessing the switch buffer usage. TCP increases/shrinks the sending rate according to the congestion signals seen locally. However, the steps may be either too aggressive or too slow compared to the optimal. What TCP sees has a delay relative to the real switch buffer usage. The switch buffer is also shared by many other servers. As a result, TCP’s guesses can be inaccurate and cause periodical congestion churns and under-utilization.
  • Incast is the many-to-one traffic pattern. It’s common in datacenter communication when aggregating query results, MapReduce, or erasure coding reconstruct reads. Besides overloading the destination switch, it quickly overflows the switch buffer and causes congestion. The switch starts to drop packets, which in turn causes source resends that cascadingly increase the load.

  • Flow interference. The switch buffer is shared between ports. Congestion caused by one flow can impact other flows. A switch can run flow control protocols to mitigate congestion (e.g. PFC), however it may impact non-related flows only because they share the same switch. A flow with small messages can also be impacted by another flow with large messages, i.e. head-of-line blocking.

There are a few known TCP congestion control protocols. They customize the default TCP stack to improve traffic performance, fairness, network utilization, and burst tolerance.

  • DCQCN targets congestion control for RDMA deployed on RoCEv2. It’s based on per-flow congestion control (QCN) instead of PFC (per-port based). It divides congestion control into CP Algorithm (switch side, congestion point), RP Algorithm (sender side, reaction point), and NP Algorithm (receiver side, notification point).

    • The switch (CP) marks ECN when buffer queue exceeds limit (i.e. congestion triggered). Receiver side (NP) batches ECN then sends CNP (RoCEv2 defined congestion notification) back to sender side (RP). Sender maintains an estimation of the portion of ECN marked packets, called “α” in the paper. Upon congestion (i.e. sender sees CNP), sender decreases sending rate by α (nearly exponentially) each round.

    • Sender recovery is done by FastRecovery, AdditiveIncrease, and then HyperIncrease. FastRecovery executes a fixed number of rounds. In each round, the current rate is set to (target rate + current rate) / 2 (exponentially shortening the gap). AdditiveIncrease sets the current rate with the same formula, but additionally increases the target rate by e.g. 5Mbps each round. HyperIncrease uses the same formulas as AdditiveIncrease, but changes the 5Mbps to e.g. 50Mbps.
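
    The rate-cut and three-phase recovery above can be condensed into a toy simulation. This is a sketch of the shape of DCQCN's sender-side (RP) behavior, not the paper's exact algorithm: the constants (g, the 5/50 Mbps steps, the phase lengths) are illustrative, and the paper's byte-counter trigger is simplified into round counts.

    ```python
    def dcqcn_sender(rounds, cnp_at, g=1/16, rai=5, rhai=50, fast_rounds=5):
        """Toy DCQCN sender. Rates in Mbps; `cnp_at` is the set of rounds
        where a CNP (congestion notification) arrives from the receiver."""
        rc = rt = 1000.0          # current rate, target rate
        alpha = 1.0               # estimate of ECN-marked fraction
        since_cut = 0
        history = []
        for r in range(rounds):
            if r in cnp_at:                   # congestion: cut by alpha
                alpha = (1 - g) * alpha + g   # marked-fraction estimate rises
                rt = rc
                rc *= (1 - alpha / 2)
                since_cut = 0
            else:
                alpha = (1 - g) * alpha       # decay when no CNP is seen
                since_cut += 1
                if since_cut > 2 * fast_rounds:
                    rt += rhai                # HyperIncrease
                elif since_cut > fast_rounds:
                    rt += rai                 # AdditiveIncrease
                # FastRecovery: rt unchanged for the first fast_rounds
                rc = (rt + rc) / 2            # close half the gap each round
            history.append(rc)
        return history
    ```

    Running with a single CNP at round 0 shows the rate halving, then recovering exponentially toward the old target before climbing past it.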

  • DCTCP. DCQCN is based on DCTCP and QCN. DCTCP targets normal networks rather than RDMA. As mentioned in the DCQCN bullet, DCTCP introduced 1) let the switch mark ECN and the receiver echoes it back to the sender, 2) estimate the “α” to decrease the sending rate.

    • Unlike DCQCN, DCTCP doesn’t change the slow start behavior in TCP default congestion control. Slow start doubles in-flight packets (congestion window) each round (much slower recovery than DCQCN), until a packet loss is detected (bad, congestion already happened, switch buffer overflows).
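
    DCTCP's core update fits in a few lines. Below is a sketch of the per-window math from the DCTCP paper (F is the fraction of acked bytes carrying ECN echoes, α its moving average); the function name and units are mine, and real DCTCP of course operates on a live TCP stack:

    ```python
    def dctcp_update(alpha, cwnd, ecn_marked, total_acked, g=1/16):
        """One congestion-window's worth of DCTCP bookkeeping.
        alpha <- (1-g)*alpha + g*F, then cut cwnd in proportion to alpha
        instead of always halving like Reno."""
        f = ecn_marked / total_acked          # fraction of marked packets
        alpha = (1 - g) * alpha + g * f
        if ecn_marked:
            cwnd = cwnd * (1 - alpha / 2)     # mild congestion -> mild cut
        return alpha, cwnd
    ```

    Under persistent heavy marking α converges to 1 and the cut converges to Reno's halving; under light marking the window barely shrinks, which is what keeps switch queues short without sacrificing throughput.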
  • BBR. Unlike DCQCN/DCTCP which target datacenter networks, BBR targets WAN. The default TCP treats packet loss as congestion. The assumption is OK in a datacenter network, but NOT in WAN where packet loss is common. Besides, BBR tries to reduce switch buffer usage, where high usage (i.e. high queue length) increases latency. As the solution, BBR ignores packet loss. It gradually increases in-flight packets to probe the optimal bandwidth and latency. Congestion is avoided by steering the sending rate toward the point where bandwidth is maximized and switch buffer usage is zero. BBR has been successfully deployed in Google B4 and YouTube.

Networking stack

The next level is networking stack, i.e. the software to run networking that can be optimized to achieve better performance. Typical techniques are:

  • TCP vs UDP. TCP is a connection based protocol. Maintaining connection costs host memory. TCP handles packet resend to ensure delivery. TCP implements congestion control to set proper speed with switches, and sliding window to set proper speed with the receiver. UDP has none of them. As a result, UDP is fast, lightweight, and suitable for usecases that tolerate packet losses, e.g. video streaming. UDP is also a basis to build customized protocols.

  • Kernel bypassing. Networking is fast, however the kernel, system calls, and context switches are dragging it down. The typical technology is DPDK, which processes networking in userspace. The client can choose to poll rather than use callbacks because the notification interval is too short, like a spinlock vs. a blocking lock that yields CPU. Another example is RDMA, where DMA bypasses CPU to operate host DRAM.

  • Offloading network processing to a lower level, especially to NIC acceleration hardware. It will be covered later. Today CPU is much slower than high speed networking. It needs extra chips to help.

Networking stack also extends to customized hardware and acceleration chips. Because CPU is slow compared to networking, and most network functions are standardized, they are suitable for offloading.

  • RDMA relies on specialized NIC rather than CPU to access host DRAM. It typically runs on plain Ethernet with RoCEv2 (need RDMA NIC), or run on InfiniBand with completely new hardware stack. There are more RDMA design guidelines.

    • Further, a host can use RDMA to directly access the PMEM in another host. This enables building even faster storage systems, e.g. Orion/Octopus.
  • FPGA can be deployed near NIC to accelerate common network functions like tunneling (required by VM virtualization, e.g. VxLAN, GRE), encryption, compression, QoS, ACLs (access control list). When the code becomes stable, they can be burnt into ASIC chips that are more performant, power efficient, but hard to change.

    • AWS Nitro is a success story that uses ASIC card to offload cloud networking (Nitro card). Microsoft Catapult is another example.
  • SmartNIC embeds computation ability into NIC. It can integrate functionalities (and chips) mentioned in the above FPGA bullet. It can support SR-IOV to virtualize NICs for VM. SmartNIC can even integrate vSwitch, which offloads CPU from building VM private networks. There is a more detailed Azure SmartNIC paper.

    • SmartNIC is closely related to SDN. Commodity switches typically run Linux and Open vSwitch, which are programmable by SDN controller. These switches need acceleration hardware to compete with customized chips in commercial switches (less programmable). SmartNIC comes to fill the gap.

Application layer

The last level of networking comes to the application layer. It involves how application can use the network stack efficiently and reliably.

  • Messaging style. Servers can communicate by sending messages directly, or via RPC calls. The requests can be either sync or async with callbacks. Servers can also exchange messages via a message queue, e.g. RabbitMQ, via topics and subscriptions. A message queue can run with exactly-once semantics, e.g. Kafka Transactional. Servers can also share information with Gossip in a P2P style, with bounded converge time. It’s usually used in metadata propagation, e.g. Ceph, that piggybacks updates and health checks.

    • Connection management. A practical need is to reduce the TCP connection count. Suppose N servers, it’s expensive to manage all N*N connections. Besides pooling and keep-alive, a solution is to introduce a mediator, e.g. mcrouter, to reduce connection count to 2*N.
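
    The bounded converge time of gossip mentioned above can be seen in a toy push-gossip simulation (the fanout, topology, and seed here are hypothetical; real systems like Ceph piggyback gossip on existing traffic rather than run dedicated rounds):

    ```python
    import random

    def gossip_rounds(n, fanout=1, seed=0):
        """One node starts with an update; each round, every informed node
        pushes it to `fanout` random peers. Returns the number of rounds
        until all n nodes are informed -- expected O(log n)."""
        random.seed(seed)
        informed = {0}
        rounds = 0
        while len(informed) < n:
            for _node in list(informed):
                for _ in range(fanout):
                    informed.add(random.randrange(n))
            rounds += 1
        return rounds
    ```

    The informed set roughly doubles each round while small, which is why even a 1000-node cluster converges in a couple dozen rounds with fanout 1.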
  • Serialization. Application objects living in memory need to be serialized before messaging, and then deserialized at the receiver side. Serialization is CPU intensive. Though compression saves transfer bandwidth, it costs more CPU, which is yet becoming slower compared to today’s network. In general, serialization protocols need to make compact bit representations (e.g. varint encoding), encode fast, and allow schema changes with backward compatibility.

    • Typical serialization protocols are protobuf, bond, Thrift, FlatBuffers. Parquet and Apache ORC are columnar formats used for on-disk storing. Apache Arrow is a columnar format used for in-memory processing. Compared to Parquet, Apache Avro is a row-based format. In the end, JSON is slower and bigger, but welcomed for its human readability.

    • Client Protocol Redesign. A customized client protocol can be efficient when returning large responses, e.g. adaptively choose between row format vs columnar layout, enable compression when the benefit outweighs the CPU cost, truncate unnecessary data and padding, parallelize iterators with prefetching. A custom serialization protocol can leverage application-aware knowledge. Large strings can be specially handled with e.g. custom compression, dictionaries, prefix dedup.
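
    The varint encoding mentioned above is a good example of compact bit representation. A minimal protobuf-style sketch (7 payload bits per byte, high bit meaning "more bytes follow"):

    ```python
    def varint_encode(n):
        """Encode a non-negative int: small values take 1 byte instead
        of a fixed 4 or 8."""
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            if n:
                out.append(byte | 0x80)   # continuation bit set
            else:
                out.append(byte)
                return bytes(out)

    def varint_decode(data):
        n = shift = 0
        for byte in data:
            n |= (byte & 0x7F) << shift
            if not byte & 0x80:           # last byte of this varint
                return n
            shift += 7

    assert varint_encode(1) == b"\x01"         # one byte, not four
    assert varint_encode(300) == b"\xac\x02"   # protobuf's classic example
    ```

    The win comes from real workloads skewing toward small integers (lengths, IDs, deltas), which is why column stores often delta-encode first so varints stay short.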

  • Unstable network. Seen by the application, the network is unstable. A typical issue is membership detection, where a node transiently goes up and down. Too aggressively marking it dead causes unnecessary churn (e.g. data repair). Marking it dead too slowly impacts service reliability. More, split brain, Observational Difference, or grey failures can happen, where different groups of nodes cannot agree on what they see. The typical solution is decision making through a Consistent Core, e.g. Service Fabric, Google Orion SDN (“Fail Static”).

    • Messaging integrity is another layer of protection due to unstable network. In fact, a distributed storage can never assume network is reliable. Packet losses, message reorder, and replay can happen unpredictably. Application layer usually implements its own CRC, idempotent operations, and epoch invalidation.
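
    A minimal sketch of combining the idempotency and epoch invalidation described above (the class, return codes, and field names are hypothetical, for illustration only):

    ```python
    class ReplicaState:
        """Each request carries (epoch, request_id). Stale epochs are
        fenced off; a replayed request returns its recorded result
        instead of being applied twice."""
        def __init__(self):
            self.epoch = 0
            self.applied = {}   # request_id -> recorded result
            self.data = {}

        def write(self, epoch, request_id, key, value):
            if epoch < self.epoch:
                return "STALE_EPOCH"        # sender lost its lease/ownership
            self.epoch = epoch
            if request_id in self.applied:  # duplicate or replayed packet
                return self.applied[request_id]
            self.data[key] = value
            self.applied[request_id] = "OK"
            return "OK"
    ```

    A replayed packet is harmless (it returns the cached result without touching data), and a writer from an old epoch is rejected even if its packets arrive late and reordered.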

More topics

Compared to the Storage components breakdown section, there are a few topics I didn’t cover.

  • Allocator. It refers to the disk space allocator in a single-node filesystem. There are mature, off-the-shelf solutions in production filesystems. A distributed storage usually directly leverages them by building atop the local node filesystems. On the other hand, the “Allocator” in a multi-node case is the Data placement we covered before.

    • Allocator can go complex with filesystem compression, e.g. Btrfs compression COW, Ceph BlueStore compression. Firstly, the space allocation unit (i.e. extent, a large sequential chunk) and the update unit (blocks) are unaligned. Secondly, the range being updated and the compression boundary can be unaligned (partial write problem). Thirdly, the block sizes after compression are unaligned (extra indexing), and an overwritten block may not fit in its original physical slot (append or off-place write). The filesystem Allocator and Index need to work together on these problems, with extent level GC/compaction, to reduce internal/external fragmentation, and to reduce read/write amplification. In general, an append-only filesystem makes compression easier to implement.
  • Upgrade/deployment. Safe and incremental upgrade of a large scale distributed storage system with atomic rollback can be complex and involves many engineering practices. Microsoft SDP is an example. But they are off the topic so I didn’t cover them in this article.

  • Configuration management. CMDB is an interesting topic. E.g. you need a database to manage the many baremetal nodes in a large scale cloud. However they are off the topic so I didn’t cover them in this article.

  • Operational ease. It’s an interesting topic to design a system that makes daily operation smooth, safe, and to avoid human errors. It involves monitoring, safe config/deployment procedure, throttling & degradation, and Interoperability with Devops systems. However they are off the topic so I didn’t cover them in this article.

  • Machine learning in storage. I covered these topics but didn’t go into depth. In general, they can be categorized into

    • Classification problems. Given a data unit, how to predict its future traffic, lifecycle, hot/coldness? It allows best technology to be applied for each fine-grain scenario. A similar type of data can be grouped together to enjoy compression or tiering. Examples are G-SWAP, Cache replacement, Siberia.

    • Optimization problems. Given a few data layout strategies, how to best align them with the predicted traffic pattern? Given requests and jobs, how to best fit them to a pool of resources (i.e. scheduling)? Examples are Optimal Column Layout, DB optimizer, Cloud Resource Scheduling.

    • Auto tuning. Given a large config parameter space and a specific user scenario, how to search for the best one? It can be grid search with either model analysis or simulation runs. Examples are Dostoevsky, G-SWAP, AWS Redshift ATO, Azure SQL Database auto tuning, OtterTune.

    • Calibration. Run micro experiments or controlled simulations for a target workload to find out its characteristics and interference with others. The findings are used for better QoS control and scheduling. Examples are NyxCache, Quasar.

    • Feedback control loop. This is the classic way to find the best operation point. It’s simple: keep increasing the load until system monitors report a warning. Examples are TMO, Heracles, request throttling, and TCP probing the best sending rate.
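
    The "keep increasing the load until monitors warn" loop above can be sketched in a few lines. This is a toy model, with a hypothetical capacity and warning threshold standing in for real monitoring signals:

    ```python
    def feedback_probe(capacity, warn_ratio=0.9, step=10, max_rounds=1000):
        """Raise offered load step by step; when the monitor fires a
        warning (load above warn_ratio of capacity), back off one step
        and settle there. Returns the discovered safe operating point."""
        load = 0
        for _ in range(max_rounds):
            load += step
            if load > capacity * warn_ratio:  # monitor reports a warning
                load -= step                  # back off below the threshold
                break
        return load
    ```

    Real systems (e.g. TCP's probing, or TMO-style memory offloading) add smoothing and hysteresis on top, but the skeleton is the same: push until the signal, then retreat.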


This article (almost a book now) is composed of two parts: software architecture methodologies and storage technology design spaces. In the first part, we went through the purposes of software architecture, how to view it from the organization, the process to carry it out, and key methodologies and principles. Software architecture bridges user scenarios to a detailed working software. It handles the complexity of user facing functions and hidden system properties. It navigates through the best path in large technology design space. It drives collaboration between BUs and ensures the deliverables with quality. Software architecture is a fight with complexity. It constructs the matching model with human mind to reach simplicity, which naturally converges to human language, the battle-tested modeling of the reality. It becomes an art of structuring, to sense the influences between organization chains, the momentum from customer markets, the tension between system properties and technologies, that weaves transforming information flows into flying wheels of software construction.

In the second part, we went through technology design spaces for the distributed storage system. We first listed Reference architectures in different storage areas, and then broke down each storage component’s system properties and design spaces. The Storage components breakdown section lists the storage areas, components, and system properties to consider in software architecture. Popular techniques burn into language and become design patterns. An architecture design pattern frequently interleaves multiple components and trades off between system properties. Discrete techniques join into a continuous space of design, where the shape of the landscape emerges. We break down, navigate, and re-combine whatever we need to reach the optimal point of problem solution. The storage industry is quickly changing, with more powerful hardware, growing scale, new business scenarios, and a constant focus on reliability and COGS. They push the technology design space to continuously evolve. New opportunities emerge.

Distributed storage landscape overview

Create an Issue or comment below