Sunday, May 4, 2014

Measuring Page Cache Utilization in Cassandra with PCSTAT

Overview

The purpose of this post is to show how to leverage pcstat, a great little tool that our colleague Al Tobey created, to understand how well your OS page cache is being utilized with Cassandra.  The premise is that Cassandra read performance will improve when reads are served from the page cache rather than from disk.
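Under the hood, pcstat maps each file and asks the kernel, via the mincore(2) system call, which pages are resident in the page cache. The real tool is written in Go; here is a minimal Python sketch of the same idea (Linux only, and the error handling is deliberately bare):

```python
import ctypes
import mmap
import os

libc = ctypes.CDLL("libc.so.6", use_errno=True)
libc.mmap.restype = ctypes.c_void_p
libc.mmap.argtypes = [ctypes.c_void_p, ctypes.c_size_t, ctypes.c_int,
                      ctypes.c_int, ctypes.c_int, ctypes.c_long]

PROT_READ = 1   # mmap protection flag (Linux)
MAP_SHARED = 1  # mmap sharing flag (Linux)

def page_cache_stats(path):
    """Return (total_pages, cached_pages) for a file, pcstat-style.

    Sketch of pcstat's core idea: map the file, then ask the kernel via
    mincore(2) which of its pages are resident in the page cache.
    """
    size = os.path.getsize(path)
    if size == 0:
        return (0, 0)
    pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    fd = os.open(path, os.O_RDONLY)
    try:
        addr = libc.mmap(None, size, PROT_READ, MAP_SHARED, fd, 0)
        if addr in (None, ctypes.c_void_p(-1).value):  # MAP_FAILED
            raise OSError(ctypes.get_errno(), "mmap failed")
        try:
            vec = (ctypes.c_ubyte * pages)()  # mincore fills one byte per page
            if libc.mincore(ctypes.c_void_p(addr), ctypes.c_size_t(size), vec) != 0:
                raise OSError(ctypes.get_errno(), "mincore failed")
            cached = sum(v & 1 for v in vec)  # low bit set => page is resident
        finally:
            libc.munmap(ctypes.c_void_p(addr), ctypes.c_size_t(size))
        return (pages, cached)
    finally:
        os.close(fd)
```

Point it at an SSTable (e.g. `page_cache_stats("/var/lib/cassandra/data/test/test/test-test-jb-1-Data.db")`) and you get the same Pages/Cached numbers pcstat reports.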

Process

We are going to measure and observe how well page cache tuning works with Cassandra 2.0.7.  To do this we will create a baseline measurement using pcstat and then tweak the following two parameters via the cassandra.yaml file and measure how well Cassandra leverages page cache:
1) populate_io_cache_on_flush
2) preheat_kernel_page_cache (new in 2.0)
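For reference, the relevant cassandra.yaml lines might look like the sketch below. Treat it as illustrative only: exact names, defaults, and availability vary by release, and populate_io_cache_on_flush in particular moved out of the yaml in 2.0.

```yaml
# cassandra.yaml -- sketch only; check the defaults shipped with your release
populate_io_cache_on_flush: true   # warm the page cache when memtables flush
preheat_kernel_page_cache: true    # re-warm hot pages after compaction
```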

Step 1 - Baseline

We created a very simple table for this experiment:

CREATE KEYSPACE test WITH replication = {
  'class': 'SimpleStrategy',
  'replication_factor': '1'
};

CREATE TABLE test (
  col1 text,
  col2 text,
  PRIMARY KEY (col1)
);

We then loaded some simple data into this table and performed a flush.  Here's a look at the data and the results of the flush.

cqlsh:test> select * from test;

 col1 | col2
------+-----------------------------------------------------------------------------------------------------------------------------------------
    6 | This should hopefully be a lot of data to put in the column to fill up the value of the column so we can see what happens when testing.
   90 | This should hopefully be a lot of data to put in the column to fill up the value of the column so we can see what happens when testing.
   46 | This should hopefully be a lot of data to put in the column to fill up the value of the column so we can see what happens when testing.

nodetool flush

/var/lib/cassandra/data/test/test$ ls
test-test-jb-1-CompressionInfo.db  test-test-jb-1-Data.db  test-test-jb-1-Filter.db  test-test-jb-1-Index.db  test-test-jb-1-Statistics.db  test-test-jb-1-Summary.db  test-test-jb-1-TOC.

Here's the baseline after flushing:
pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-1-Data.db | 2370           | 1          | 0         | 000.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

Here's the baseline after performing some reads:
pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-1-Data.db | 2370           | 1          | 1         | 100.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

As expected, the OS cached the page after reads, especially since this is one small SSTable.
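For completeness, pcstat's Pages column is just the file size rounded up to whole pages, and Percent is cached pages over total pages; that's why the 2370-byte SSTable occupies exactly one page. A quick check of that arithmetic (assuming the common 4096-byte page size; confirm yours with `getconf PAGESIZE`):

```python
PAGE_SIZE = 4096  # typical Linux page size; an assumption, not read from the OS

def pcstat_summary(file_bytes, cached_pages):
    """Reproduce pcstat's Pages and Percent columns for one file."""
    total_pages = -(-file_bytes // PAGE_SIZE)  # ceiling division
    percent = 100.0 * cached_pages / total_pages if total_pages else 0.0
    return total_pages, percent

print(pcstat_summary(2370, 0))  # the flushed-but-unread SSTable -> (1, 0.0)
print(pcstat_summary(2370, 1))  # after a read touches its one page -> (1, 100.0)
```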

Step 2 - Change populate_io_cache_on_flush

In Step 1, we saw what we expected to see.  That is, a new SSTable was not cached until a read was performed against that table.  Let's see if we can leverage Cassandra's populate_io_cache_on_flush to get SSTables cached on:
1) flush events (though ours will be triggered manually)
2) compaction events (also triggered manually)

First, we must make our populate_io_cache_on_flush change.  I added this field to the yaml and restarted Cassandra.

Now we will write (overwrite) some data and perform a manual flush.

Here are the new SSTables for this table.

/var/lib/cassandra/data/test/test$ ls
test-test-jb-1-CompressionInfo.db  test-test-jb-1-Index.db       test-test-jb-1-TOC.txt             test-test-jb-2-Filter.db      test-test-jb-2-Summary.db
test-test-jb-1-Data.db             test-test-jb-1-Statistics.db  test-test-jb-2-CompressionInfo.db  test-test-jb-2-Index.db       test-test-jb-2-TOC.txt
test-test-jb-1-Filter.db           test-test-jb-1-Summary.db     test-test-jb-2-Data.db             test-test-jb-2-Statistics.db

And here are the results of the pcstat test:

|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-1-Data.db | 2370           | 1          | 1         | 100.000 |
| /var/lib/cassandra/data/test/test/test-test-jb-2-Data.db | 2107           | 1          | 0         | 000.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

Yikes, what happened here?  We expected the flush to populate the cache for the new SSTable.  Very interesting.

Let's see if a major compaction causes the new SSTable to be cached.

nodetool compact

After the compaction we see the newly created SSTable.

ls
test-test-jb-3-CompressionInfo.db  test-test-jb-3-Data.db  test-test-jb-3-Filter.db  test-test-jb-3-Index.db  test-test-jb-3-Statistics.db  test-test-jb-3-Summary.db  test-test-jb-3-TOC.txt

Now, let's see if the newly created SSTable was cached.

 pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-3-Data.db | 2368           | 1          | 0         | 000.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

Again, no luck.  It turns out populate_io_cache_on_flush was removed from the yaml file in 2.0, and we have heard that it wasn't fully honored by Cassandra anyway.  With more time and investigation we could figure out exactly what's occurring.
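One avenue for that investigation: in 2.0, populate_io_cache_on_flush appears to survive as a per-table property rather than a yaml setting, so it may need to be set via CQL instead. Something along these lines (syntax per the 2.0-era CQL docs; treat it as a sketch to verify against your version):

```sql
ALTER TABLE test.test WITH populate_io_cache_on_flush = 'true';
```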

BTW: Here's an interesting article about cache changes in 2.1

Big picture, thanks to Al, we now have the ability to monitor cache at a granular level for Cassandra files.  Very handy.

Step 3 - Change preheat_kernel_page_cache

So, let's take a look at the effects of tweaking another page cache setting, though this one is new to Cassandra: preheat_kernel_page_cache.

After making this change to our little node and restarting, we see that the output of pcstat is still 0% cached.

pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-3-Data.db | 2368           | 1          | 0         | 000.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

Let's add some more data, flush and measure results.

Here is the output of ls after the flush.

ls
test-test-jb-3-CompressionInfo.db  test-test-jb-3-Index.db       test-test-jb-3-TOC.txt             test-test-jb-4-Filter.db      test-test-jb-4-Summary.db
test-test-jb-3-Data.db             test-test-jb-3-Statistics.db  test-test-jb-4-CompressionInfo.db  test-test-jb-4-Index.db       test-test-jb-4-TOC.txt
test-test-jb-3-Filter.db           test-test-jb-3-Summary.db     test-test-jb-4-Data.db             test-test-jb-4-Statistics.db

And, here is what pcstat shows.

pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-3-Data.db | 2368           | 1          | 0         | 000.000 |
| /var/lib/cassandra/data/test/test/test-test-jb-4-Data.db | 1918           | 1          | 0         | 000.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

Now, let's issue a read and see what happens.  The workload we are using for this exercise is update-centric, so a single partition should exist across multiple SSTables.

Here's the output of pcstat, showing both SSTables are now cached.
pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-3-Data.db | 2368           | 1          | 1         | 100.000 |
| /var/lib/cassandra/data/test/test/test-test-jb-4-Data.db | 1918           | 1          | 1         | 100.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|

Let's check cfhistograms to ensure we really did read 2 SSTables.

SSTables per Read
2 sstables: 3

Yep, we queried 3 rows, using an IN clause, and can see that 2 SSTables were accessed.

For grins, we will issue a compaction and see if the newly created SSTable is cached.

The new SSTable is not cached.
pcstat /var/lib/cassandra/data/test/test/*-Data.db
|----------------------------------------------------------+----------------+------------+-----------+---------|
| Name                                                     | Size           | Pages      | Cached    | Percent |
|----------------------------------------------------------+----------------+------------+-----------+---------|
| /var/lib/cassandra/data/test/test/test-test-jb-5-Data.db | 2373           | 1          | 0         | 000.000 |
|----------------------------------------------------------+----------------+------------+-----------+---------|


Conclusion

The OS page cache is a very valuable tool for ensuring performant read operations.  However, understanding exactly when and how it's utilized has historically been a bit of a mystery.  Thanks to Al Tobey, we now have an easy way to precisely measure when, where, and how much page cache is being used during Cassandra operations.