This chapter was focused on comparing some common data models with each other. Common data models include (not limited to): Relational, document, graph-based.

Evolution of data models

Hierarchical and network models

In the early years, network and hierarchical models were the main approaches for data storage and querying.

Hierarchical model
In 1970s, the most popular database was IBM’s Information Management System (IMD). It used a simple hierarchical modelβ€”very similar to JSON model (later used by document databases).

  • It worked really well for one-to-many relationships; but many-to-many relationships difficult and didn’t support joins
  • Problems of ’60s and ’70s were very similar to problems devs faced with using document databases

Deep-dive into hierarchical model

Store
└── Electronics
   β”œβ”€β”€ Smartphone
   β”‚   β”œβ”€β”€ Brand: "Apple"
   β”‚   └── Price: 999
   └── Laptop
       β”œβ”€β”€ Brand: "Dell"
       └── Price: 1200
└── Clothing
   β”œβ”€β”€ Jacket
   β”‚   β”œβ”€β”€ Brand: "North Face"
   β”‚   └── Color: "Red"
   └── T-shirt
       β”œβ”€β”€ Brand: "Nike"
       └── Size: "Large"

It is organized in a tree structure with parent-child relationships.

  • Structure of each item is rigid
  • Querying for particular attribute requires traversing through the tree
    • To access price β€” Store β†’ Electronics β†’ Smartphone β†’ Price
  • Very node must have exactly one parent (except the root node)
  • Records are linked with help of pointers

There various solutions proposed to solve the limitations of the hierarchical model. The two prominent ones were (1) relational model (2) network model

Network model
This is a generalization of the hierarchical model.

Deep-dive into network model

Categories:
β”œβ”€β”€ Electronics
β”œβ”€β”€ Clothing
β”œβ”€β”€ Special Offers

Products:
β”œβ”€β”€ Smartphone
β”œβ”€β”€ Jacket
β”œβ”€β”€ T-shirt
β”œβ”€β”€ Laptop

Attributes:
β”œβ”€β”€ Brand: "Apple"
β”œβ”€β”€ Brand: "Nike"
β”œβ”€β”€ Color: "Red"
β”œβ”€β”€ Size: "Large"
β”œβ”€β”€ Price: 999
β”œβ”€β”€ Price: 1200

Relationships (Network):
β”œβ”€β”€ Electronics ↔ Smartphone ↔ Brand: "Apple" ↔ Price: 999
β”œβ”€β”€ Clothing ↔ Jacket ↔ Brand: "North Face" ↔ Color: "Red"
β”œβ”€β”€ Clothing ↔ T-shirt ↔ Brand: "Nike" ↔ Size: "Large"
β”œβ”€β”€ Electronics ↔ Laptop ↔ Brand: "Dell" ↔ Price: 1200
β”œβ”€β”€ Special Offers ↔ Smartphone
└── Special Offers ↔ Jacket

The links between records are pointers. Similar to hierarchical model, accessing a record requires traversing from root to the target record. This is called an access pattern.

For both hierarchical and network models, any change in the data model (such as adding a new access path) required updates to the access path, which meant modifying the application code.

SQL and NoSQL models

Relational model β€” data is organized into relations (tables), where each relation is an unordered collection of tuples (rows).

It included a new way of querying data: SQL is a declarative query language, whereas IMS queried using imperative code.

When SQL was defined, it was very similar to relational algebra:

select * from animals where family = 'Sharks';
Declarative queryImperative query
Specify the ops to perform and its orderSpecify the pattern, conditions and transformation; but not how to achieve the goal.

DB system’s query optimizer decides which indexes and join methods to use, and in which order to execute various parts of the query.

Driving forces for NoSQL Databases

  • A need for greater scalability than relational databases can easily achieve, including very large datasets or very high write throughput
  • A widespread preference for free and open source software over commercial database products
  • Specialized query operations that are not well supported by the relational model
  • Frustration with the restrictiveness of relational schemas, and a desire for a more dynamic and expressive data model
  • Not Only SQL

For a data structure that is self-contained, a JSON representation can be quite appropriate. Doc-oriented databases like MongoDB, CouchDB support this model.

The JSON representation has better locality than the multi-table schema.

Several differences when comparing relational databases to document databases - fault-tolerance properties and handling of concurrency.

  • Doc model - schema flexibility, better performance due to locality
  • Relational model - support for joins, many-to-one and many-to-many relationships

Schema flexibility in the document model

Document databases are called schemaless, which is misleading. More accurate term is schema-on-read.

Schema-on-write - traditional approach of relational databases.

Schema-on-read is similar to dynamic type checking in programming languages (vs) Schema-on-write is similar to static type checking.

Data locality for queries β€” Storage locality

  • storing the entire model in a single document is useful if the entire model is required (for example, to render a page).
    • But reading small portion of it, can waste memory as the entire document is read into memory
  • The idea of grouping related data together for locality is not limited to the document model
    • Column-family concept in the Bigtable data model; this is used in Cassandra and HBase, has a similar purpose of managing locality.

Quick summary

  1. Data was represented as hierarchical model which wasn’t good for representing many-to-many relationships.
  2. Relational model was invented to solve that problem. However, it doesn’t fit well in all the model structures.
  3. NoSQL data stores were introduced; diverged into two directions:
  • Document databases β€” self-contained documents and relationships are rare
  • Graph databases β€” opposite of document; targeting use cases where everything is related to everything

Similarity b/w graph & document β€” typically don’t enforce a schema; question of whether the schema is explicit (enforced on write) or implicit (handled on read)

Different models has its own query language β€” SQL, MapReduce, MongoDB, Cypher, SPARQL, and Datalog.

Appendix