This post presents Cassandra data modeling practices in dimensional modeling terms. It is aimed at readers who come from a Data Warehousing background and find themselves asked to create data models in a Cassandra environment, with the goal of helping those with Dimensional Modeling backgrounds wrap their brains around modeling in Cassandra.
Please keep in mind that we are deliberately trying to find similarities between modeling in Cassandra and modeling for Data Warehouses/Marts. Also, keep in mind that we were recently doing a lot of Data Warehousing/BI/batch integration work prior to jumping head first into Cassandra-land.
Also, please keep in mind that this post is not meant to be an exhaustive review of Cassandra data modeling techniques and best practices. Please refer to Patrick McFadin's data modeling presentations for more in-depth Cassandra modeling guidelines. Patrick's presentations are fantastic!
A bit of background
Cassandra and Data Warehousing use cases differ greatly, but the data modeling guidelines used in both technology genres are very similar, or can be if you squint your eyes a bit.
Use Cases:
Cassandra - typically referred to as online or transactional (here's a link to the target use cases for Cassandra, just click on Use Cases), think lots and lots of small amounts of data flowing into a system very quickly (milliseconds or less).
Data Warehousing (used loosely) - typically referred to as analytical or batch, think lots of large amounts of data flowing into a system at periodic times throughout the day using a data processing procedure that lasts minutes to hours.
Data Modeling Guidelines:
Cassandra
- Disk space is cheap, writes are blazing fast (thanks to log-structured storage and sequential writes)
- Align table layout to satisfy read patterns
- Duplicate data to satisfy different versions of read operations
- Eliminate joins if possible (materialized views in Cassandra-speak)
- Leverage unique keys for data distribution (partition keys)
- Leverage cascading secondary keys for column clustering and dynamic schemas (cluster keys)
- Access data through cascading key structure (a.k.a. hierarchical data access)
- No aggregations due to the distributed architecture (store pre-aggregated data)
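To make the partition key and cluster key ideas above concrete, here is a minimal CQL sketch (the table and column names are our own, purely illustrative):

```sql
-- Partition key (user_id) distributes rows across the ring;
-- clustering key (event_time) sorts rows within each partition.
CREATE TABLE events_by_user (
    user_id uuid,
    event_time timestamp,
    payload text,
    PRIMARY KEY ((user_id), event_time)
);
```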
Data Warehousing (more specifics can be found here... warning: possible TL;DR, just like this blog)
- Align table layout to satisfy read patterns, ad hoc requests, and reporting structure
- Duplicate data to simplify read access patterns (think Slowly Changing Dimensions, type II or higher)
- Leverage unique keys to identify unique combinations of dimensional attributes (surrogate keys)
- Nest cascading hierarchies in dimension attributes to enable "drill" capabilities in BI layer
- Access data through cascading hierarchical attributes using GROUP BY clauses
- For query performance, limit joining to simple joins between facts and dimensions (avoid snow-flaking)
- For performance pre-aggregate data when applicable and store in OLAP cubes, in memory layer, or summary fact tables
Hopefully you can spot a few common themes that we would like to explore in this blog, mainly:
- read based table design
- data duplication for read performance
- hierarchical data access
- storing pre-aggregated data
Read Based Table Design
Okay, so if we could provide only one piece of advice to Cassandra data modelers with a DW background, it would be to approach a data modeling problem in Cassandra as you would in Data Warehousing. Meaning, understand what queries are going to be executed against the data stored in Cassandra and let those queries guide the modeling. The same technique is used during the requirements gathering exercises of Data Warehouse projects, i.e. meet with the end users of the system and analyze how they would like to query the data. This is a different approach from modeling data in a traditional OLTP type of system.
Note that in Cassandra the read use cases will probably be a lot smaller, more targeted, and won't require ad-hoc query design practices. In other words, the read requirements for Cassandra backed systems will probably be a bit simpler compared to a DW system. So even though you should start with understanding how data should be queried in Cassandra, don't overthink this step. Also note that you are probably only dealing with one subject area, i.e. step number 1 in Dimensional Modeling. It's going to be straightforward, or at least it should be straightforward, let it be straightforward.
Start with understanding the read access patterns like you would in a Data Warehouse project and you're more than halfway to creating the right data model in Cassandra.
Data Duplication for Read Performance
Duplicating data by repeating values in one table:
This is where Cassandra and Dimension data modeling share similarities. We are assuming that readers of this post have all created dimensions. Dimensions duplicate data to enable easier query access. For example, if I had a location dimension, I would repeat the State in every row for a City. The dimension might look something like the following:
Dimension Name : Location
Key : Location Key
Attributes: City, State
For every City, the state value will be duplicated.
The same concept exists in Cassandra. You should duplicate data in Cassandra to match the read design pattern. Do this to eliminate joins, just like in a DW system.
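A rough sketch of what the same denormalized idea might look like as a CQL table (table and column names are our assumptions): the state value is simply stored again in every city row, trading cheap disk space for join-free reads.

```sql
-- State is repeated in every row, just like in the Location dimension,
-- so a read never needs to join out to a separate state table.
CREATE TABLE locations_by_city (
    location_key uuid,
    city text,
    state text,
    PRIMARY KEY (location_key)
);
```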
There is a slight nuance here between Cassandra and DW systems. In DW systems, joins slow things down a bit (or, worse, a lot). In Cassandra, no joins are possible because of the distributed architecture (i.e. the database lives on many physical servers, or nodes). Therefore, any joins have to occur at the client level.
Duplicating data by repeating data in multiple tables:
This is where Cassandra and Fact data modeling share similarities. Think of modeling in Cassandra as you would modeling multiple versions of Summary fact tables. At the atomic level all fact tables should be unique, and as my friend Leo would argue should be business process aligned. This means at the atomic level of a Data Warehouse, fact data will be contained in one and only one table. However, for performance reasons, many Data Warehouses persist similar data in special summarized fact tables that are typically referred to as Summary Fact tables.
For example, say we had an atomic POS fact table that was at the grain of an individual sale transaction. But say we had one group of users who would like to analyze POS data rolled up to monthly levels with the ability to slice that data by state. Let's say we had another group of users who would like to analyze POS data rolled up to a weekly level with the ability to slice that data by state-crossing market boundaries (like MSA). A common solution to this requirement in a Data Warehouse would be to create 2 summary fact tables, containing slightly different versions of the same data:
Monthly, State, Summary Fact Table - Grain Month, State : Fact - Sales
Weekly, MSA, Summary Fact Table - Grain Week, MSA : Fact - Sales
This concept exists in Cassandra as well. We try to duplicate data based on the grain of the query. For example, say we had a POS system in which we wanted to lookup transactions by time of day and transactions by state, we would create two different tables in Cassandra:
Transactions by Time of Day Table - Grain Date/Time, Transaction ID : Data - Transaction Attributes
Transactions by State Table - Grain State, Transaction ID : Data - Transaction Attributes
Speaking in terms of grain in the Cassandra example may be a bit misleading as both tables have an atomic grain of Transaction ID, but to help communicate concepts in terms that DW data modelers will understand, we chose to use the concept of Grain for this example.
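As a rough CQL sketch (the column names and types are our assumptions, not a prescribed schema), those two tables might look like:

```sql
CREATE TABLE transactions_by_time (
    txn_date text,          -- partition key: one partition per day
    txn_time timestamp,     -- clustering key: sorts transactions within the day
    txn_id uuid,
    amount decimal,
    PRIMARY KEY ((txn_date), txn_time, txn_id)
);

CREATE TABLE transactions_by_state (
    state text,             -- partition key: one partition per state
    txn_id uuid,
    txn_time timestamp,
    amount decimal,
    PRIMARY KEY ((state), txn_id)
);
```

The transaction attributes are duplicated across both tables; each table exists solely to serve its own read pattern.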
Hierarchical Data Access
Okay, so this one's probably a stretch, but it makes a point about table structure in Cassandra.
Cassandra has some strict rules for query access which, if thought of as drill paths in DW-speak, enable a data modeler to quickly create the correct table layout in Cassandra.
These rules are as follows (my summary of the "rules"):
1) Every table has to have at least 1 Partition Key.
- the partition key is the first field listed in a Primary Key and it is the value Cassandra uses to distribute data in a ring.
2) Partition Keys must be included as the first element in the WHERE clause of a query and can only be filtered with the IN or = operators, i.e. explicit filtering.
- explicit filtering enables Cassandra to "find" the data on the correct nodes in a ring.
3) Primary Key values that are declared after the Partition Key are called clustering columns and they determine the sort order of data contained in a CQL3 table.
4) Clustering columns must be placed into a WHERE clause in the same order as they are declared in a CQL3 table Primary Key.
5) Clustering columns can leverage WHERE clause conditions with the following operators: =, <=, <, >, >=, IN
6) Only attributes defined in the Primary Key can be included as predicates in WHERE clauses for Cassandra queries.
CREATE TABLE test (
    col1 text,
    col2 int,
    col3 int,
    PRIMARY KEY (col1, col2, col3)
) WITH CLUSTERING ORDER BY (col2 DESC);
This would store the table partitioned by col1, with col2 in descending order and col3 in ascending order. This means that if we wanted to get the largest value of col2 for any particular value of col1, we could simply do something like the following:
SELECT * FROM test WHERE col1 = somevalue LIMIT 1;
We could not run the following query, though:
SELECT * FROM test WHERE col2 = somevalue;
So how is this type of strict query access pattern similar to Data Warehousing? Well, if your data warehouse leverages a BI tool that has strict drill hierarchies defined, without the ability to skip levels (i.e. use a DESCENDANTS OLAP clause), then you are using the same type of query access pattern that is modeled in Cassandra.
The lesson here is to model data in Cassandra just like you would create a strict drill path within a Data Warehouse BI or OLAP tool. You can't get to col3 without specifying which values of col1 and col2 to "drill through" to get to the desired col3 value.
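Using the test table above, the drill-path rule plays out like this (the literal values are placeholders):

```sql
-- Valid: pins col1 and col2 before touching col3 (a full drill path)
SELECT * FROM test WHERE col1 = 'a' AND col2 = 5 AND col3 = 7;

-- Invalid: skips col2, so Cassandra will reject the query
-- SELECT * FROM test WHERE col1 = 'a' AND col3 = 7;
```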
Store Pre-Aggregated Data
There is no way to aggregate data stored in Cassandra, other than limited COUNT functionality. This is, again, by design: because Cassandra is a distributed system, performing aggregations would be very costly. I believe Jonathan Ellis wrote a nice blog on this topic but I can't seem to find it at the moment.
Anyway, the solution to this, if you would like to aggregate data, is to store data in an aggregated manner in Cassandra. This is very similar to the Summary Fact table example mentioned above.
This means that if you would like to see sums of POS transaction values per day, using the same POS example listed above, then you would want to aggregate this data in client code, or using Spark/Storm/MapReduce, and store the aggregated results in Cassandra for lightning fast data access.
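As a minimal sketch of that client-side pre-aggregation step (the transaction data here is made up, and the target table name is hypothetical), the roll-up is just a grouped sum computed before anything is written to Cassandra:

```python
from collections import defaultdict
from datetime import date

# Hypothetical POS transactions: (transaction date, sale amount in cents)
transactions = [
    (date(2014, 5, 1), 1999),
    (date(2014, 5, 1), 500),
    (date(2014, 5, 2), 1250),
]

# Roll the atomic transactions up to daily sums in client code.
daily_sales = defaultdict(int)
for txn_date, amount_cents in transactions:
    daily_sales[txn_date] += amount_cents

# Each (day, total) pair would then be written to a pre-aggregated
# Cassandra table, e.g.: INSERT INTO sales_by_day (sale_date, total) ...
for sale_date, total_cents in sorted(daily_sales.items()):
    print(sale_date, total_cents / 100)
```

Reads against the resulting sales_by_day table are then simple key lookups, which is exactly what Cassandra is built for.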
To us, the stored aggregation components are reminiscent of the role of OLAP, where purely additive metrics are pre-calculated in OLAP solutions (specifically MOLAP) so results are returned very quickly.
We hope this post helps Dimensional Modelers and Data Warehousing data modelers wrap their heads around modeling in Cassandra. If you come from an ERD background, then go ask a Dimensional Modeler how they create data models. This will help you get into the right mindset for data modeling in Cassandra.