Purpose
The purpose of this blog is to provide an aggregated view of the batch/bulk techniques available for integrating with Apache Cassandra using CQL3-based data models, including Cassandra 1.2 and 2.0 (DSE 3.2 and 4.0). This blog post will be augmented over time as techniques evolve.
Background
As long as people store data digitally, there will be a need to move large "chunks" of data between systems. This could be for reasons of:
- aggregation and combination (e.g. analytical systems), where multiple sources of data are combined and queried
- archival purposes (removal of old and unneeded data)
- data migration projects
- large scale data model changes
- database transition initiatives
- other
This blog will provide a comprehensive view of bulk integration techniques for moving data into and out of Cassandra, including:
- links to code samples
- tips and tricks
- recommended use cases
Information is presented as a matrix that lists the different techniques as rows. The matrix contains two columns: one for loading data into Cassandra from a source (Into Cassandra) and one for extracting data from Cassandra into a target (Out of Cassandra). "A source" generally means an RDBMS, a flat file, Hadoop, a mainframe, etc. Each cell contains a summary for the technique as well as links to additional posts that provide supporting details.
We will work to provide a low latency matrix in a subsequent post to help with near real time integration/data pipe-lining techniques.
Batch Integration Matrix
Cassandra Bulk Loader (SSTableLoader)*
Into Cassandra:
· Loads SSTables into a Cassandra ring from a node or set of nodes that are not actively part of the ring.
· Good option for migrations, cluster generation, etc.
· Enables parallel loading of data.
· Should be executed on a set of "non-cluster" nodes for optimal performance.
· Requires creation of SSTables (see below).
· CQL3 support depends on how the SSTables were created.
· Counters may not be fully supported.
· Leverages Cassandra's internal streaming.
· After completion, run a repair.
Out of Cassandra: n/a
Bulk Loading via JMX*
Into Cassandra:
· JMX-based utility to stream SSTable data into Cassandra (a minimal invocation sketch follows this entry).
· Leverages the same code/functionality as SSTableLoader.
· Enables loading of data from a Cassandra node into that same node without requiring the configuration of a separate network interface, which SSTableLoader requires when run on a cluster node.
· Same requirements and limitations as SSTableLoader.
Out of Cassandra: n/a
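To illustrate, here is a minimal sketch of calling the StorageService bulkLoad operation over JMX from Java. The host, port, and directory path are placeholders; the keyspace/table directory layout and the default JMX port of 7199 are assumptions you should verify against your own cluster.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxBulkLoad {
    public static void main(String[] args) throws Exception {
        // Cassandra's JMX port defaults to 7199; host and paths below are placeholders.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            // bulkLoad takes a directory of SSTables, typically laid out as .../<keyspace>/<table>/
            mbs.invoke(storageService, "bulkLoad",
                    new Object[] { "/tmp/load/my_ks/users" },
                    new String[] { String.class.getName() });
        } finally {
            connector.close();
        }
    }
}
```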
Copy SSTables
Into Cassandra (into the data directory):
· Loads SSTables into a Cassandra ring.
· Good option for migrations, cluster generation, etc.
· Requires creation of SSTables (see below).
· Could leverage snapshot SSTables for migration purposes.
· CQL3 support depends on the structure of the SSTables.
· Working example and blog to be provided soon.
Out of Cassandra (out of the data directory):
· Provides access to SSTable data with minimal production-system impact.
· Can be used as a source to populate another cluster.
ColumnFamily<>Format
Into Cassandra (ColumnFamilyOutputFormat):
· Thrift-based driver to load data into Cassandra Thrift-based column families.
· Not compatible with CQL3 tables.
· Example
Out of Cassandra (ColumnFamilyInputFormat):
· Thrift-based driver to extract data out of Cassandra Thrift-based column families.
· Not compatible with CQL3 tables.
· Example
A job-configuration sketch for both directions follows this entry.
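As a rough illustration, here is what the Hadoop job configuration can look like for these two formats, assuming Cassandra 1.2/2.0 with Thrift on port 9160 and the Murmur3 partitioner. The keyspace and column family names (my_ks, source_cf, target_cf) are placeholders, and the mapper/reducer classes are omitted.

```java
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.cassandra.utils.ByteBufferUtil;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ThriftFormatJobSetup {
    public static Job configure(Configuration conf) throws Exception {
        Job job = new Job(conf, "cf-copy"); // mapper/reducer classes omitted for brevity

        // Read from a Thrift column family.
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_ks", "source_cf");
        SlicePredicate predicate = new SlicePredicate().setSlice_range(
                new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
                               ByteBufferUtil.EMPTY_BYTE_BUFFER, false, Integer.MAX_VALUE));
        ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);

        // Write Thrift mutations into another column family.
        job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "my_ks", "target_cf");
        return job;
    }
}
```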
BulkOutputFormat
Into Cassandra:
· Thrift-based tool; no native CQL3 support (CQL3 support has to be creatively programmed using composites, similar to the SSTable-loading techniques for the SSTableSimpleWriter and SSTableSimpleUnsortedWriter classes below).
· Used to stream data from MapReduce into Cassandra.
· Similar to the bulk loading above, but with no need for a "fat client", i.e. a Cassandra node to execute the process.
· Can be used with Pig.
· A configuration sketch follows this entry.
Out of Cassandra: n/a
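Configuring the output side is nearly identical to ColumnFamilyOutputFormat; a minimal sketch, using the same placeholder host and keyspace/table names as above, might look like this.

```java
import org.apache.cassandra.hadoop.BulkOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.mapreduce.Job;

public class BulkOutputJobSetup {
    public static void configure(Job job) {
        // Reducers still emit Thrift mutations, but they are written locally as SSTables
        // and streamed into the cluster when each task completes.
        job.setOutputFormatClass(BulkOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "my_ks", "target_cf");
    }
}
```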
CQL<>Format
Into Cassandra (CQLOutputFormat):
· CQL3-based driver to load data into Cassandra CQL3 tables.
· This is not necessarily a "bulk loader".
· Example
Out of Cassandra (CQLPagingInputFormat):
· CQL3-based driver to extract data out of Cassandra CQL3 tables.
· This is not necessarily a "bulk loader".
· Example
A job-configuration sketch for both directions follows this entry.
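For orientation, here is a hedged sketch of the job configuration (in the shipped code the classes are named CqlPagingInputFormat and CqlOutputFormat, in the org.apache.cassandra.hadoop.cql3 package). The hosts, ports, keyspace/table names, page size, and the UPDATE statement are placeholders to adapt.

```java
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlOutputFormat;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class CqlFormatJobSetup {
    public static void configure(Job job) {
        // Read CQL3 rows page by page.
        job.setInputFormatClass(CqlPagingInputFormat.class);
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "my_ks", "source_table");
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

        // Write rows back through a prepared CQL3 UPDATE; reducers supply the bound values.
        job.setOutputFormatClass(CqlOutputFormat.class);
        ConfigHelper.setOutputInitialAddress(job.getConfiguration(), "127.0.0.1");
        ConfigHelper.setOutputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setOutputPartitioner(job.getConfiguration(), "Murmur3Partitioner");
        ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "my_ks", "target_table");
        CqlConfigHelper.setOutputCql(job.getConfiguration(),
                "UPDATE my_ks.target_table SET value = ?");
    }
}
```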
CQL3 Statements via M/R
Into Cassandra:
· We have heard that several users simply create map-only jobs and insert data into Cassandra leveraging a Java driver, like the DSE Java driver (a mapper sketch follows this entry).
· Example
Out of Cassandra: n/a
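A sketch of that pattern, assuming the DataStax Java Driver 2.0 and a map-only job (job.setNumReduceTasks(0)): each mapper opens one session in setup(), prepares an INSERT, and writes rows as it reads them. The keyspace, table, columns, and CSV input layout are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

// Map-only job: reads CSV lines and inserts them into a (hypothetical) my_ks.my_table.
public class CsvToCassandraMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private Cluster cluster;
    private Session session;
    private PreparedStatement insert;

    @Override
    protected void setup(Context context) {
        cluster = Cluster.builder()
                .addContactPoint(context.getConfiguration().get("cassandra.contact.point", "127.0.0.1"))
                .build();
        session = cluster.connect();
        insert = session.prepare("INSERT INTO my_ks.my_table (id, value) VALUES (?, ?)");
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] fields = line.toString().split(",");
        session.execute(insert.bind(Integer.parseInt(fields[0]), fields[1]));
    }

    @Override
    protected void cleanup(Context context) {
        if (session != null) session.close();
        if (cluster != null) cluster.close();
    }
}
```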
CQL3 Batch Statements
Into Cassandra:
· Batch statements group individual statements into a single operation and can be used for bulk-like integration processes.
· Use UNLOGGED batches if performance is a concern, though you lose atomicity (a driver example follows this entry).
Out of Cassandra: n/a
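A minimal example of an unlogged batch, assuming the DataStax Java Driver 2.0; the keyspace, table, and values are hypothetical. Keep batches to a modest size, since very large batches put pressure on the coordinator node.

```java
import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class UnloggedBatchExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");

        // UNLOGGED skips the batch log: better throughput, no atomicity guarantee.
        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (int i = 0; i < 100; i++) {
            batch.add(new SimpleStatement(
                    "INSERT INTO my_table (id, value) VALUES (" + i + ", 'row-" + i + "')"));
        }
        session.execute(batch);

        cluster.close();
    }
}
```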
ETL Tools - Pentaho
Into and Out of Cassandra:
· Visual ETL approach for the bulk integration of Cassandra data with other sources.
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon.
ETL Tools - JasperSoft
Into and Out of Cassandra:
· Visual ETL approach for the bulk integration of Cassandra data with other sources.
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon.
ETL Tools - Talend
Into and Out of Cassandra:
· Visual ETL approach for the bulk integration of Cassandra data with other sources.
· We are still investigating what Talend supports with regard to Cassandra. We did notice that it appears Talend would like to support the SSTableLoader mentioned above (JIRA).
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon.
Sqoop (DSE version)
Into and Out of Cassandra:
· Offers a good tool for bulk/batch integration between many different databases, Hadoop, and Cassandra.
· The DSE 3.x version does not yet support CQL3, but this should be coming in DSE 4.x.
· This is a very promising feature and we will update once DataStax releases CQL3 support.
· For CQL2, here is a write-up with a working example.
CQLSH Copy
Into Cassandra:
· Good tool for bulk loading a small amount of data; fewer than 1 million records is the recommendation.
· Uses .csv files as import sources only.
Out of Cassandra:
· Good tool for writing out a small amount of data (again, fewer than 1 million records is the recommendation) to a .csv file.
· Exports to .csv files only.
Flume Integration
Into Cassandra:
· Found a project on GitHub that offers Flume integration into Cassandra.
· Currently this is Hector based.
*SSTable Writers
In order to load data into Cassandra using the bulk loading techniques denoted by *, SSTables first need to be generated. The following entries describe the available techniques for creating SSTables, each with an overview, its limitations, and its level of CQL3 support.
CQLSSTableWriter
Overview:
· Fully supports CQL3-compatible SSTable creation (a short sketch follows this entry).
Limitations:
· Only available in Cassandra 2.x and higher.
CQL3 Support: Full
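As a minimal sketch, assuming Cassandra 2.0's org.apache.cassandra.io.sstable.CQLSSTableWriter; the keyspace, table, schema, and output path are placeholders. The resulting directory can then be fed to sstableloader or to the JMX bulkLoad operation described above.

```java
import java.io.File;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class CqlSSTableWriterExample {
    public static void main(String[] args) throws Exception {
        String schema = "CREATE TABLE my_ks.users (id int PRIMARY KEY, name text)";
        String insert = "INSERT INTO my_ks.users (id, name) VALUES (?, ?)";

        // The <output>/<keyspace>/<table> layout matches what sstableloader expects.
        File outputDir = new File("/tmp/sstables/my_ks/users");
        outputDir.mkdirs();

        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory(outputDir)
                .forTable(schema)
                .using(insert)
                .build();

        for (int i = 0; i < 1000; i++) {
            writer.addRow(i, "user-" + i); // values bound to the INSERT statement above
        }
        writer.close();
    }
}
```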
SSTableSimpleUnsortedWriter
Overview:
· Generates SSTables simply and easily; it buffers and sorts data into partitioner order internally, so rows can be added in any order (a short sketch follows this entry).
· Does not inherently support CQL3.
· CQL3 support can be created using composite types.
Limitations:
· Limited CQL3 support.
· More complex configuration for CQL3 compared to CQLSSTableWriter.
CQL3 Support: Limited
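For reference, a rough sketch of the non-CQL3 usage, assuming the Cassandra 1.2/2.0 constructor SSTableSimpleUnsortedWriter(directory, partitioner, keyspace, columnFamily, comparator, subComparator, bufferSizeInMB). The keyspace, table, column names, and paths are placeholders; a CQL3 table would instead require a CompositeType comparator and composite column names.

```java
import java.io.File;
import org.apache.cassandra.db.marshal.AsciiType;
import org.apache.cassandra.dht.Murmur3Partitioner;
import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
import static org.apache.cassandra.utils.ByteBufferUtil.bytes;

public class SimpleUnsortedWriterExample {
    public static void main(String[] args) throws Exception {
        long timestamp = System.currentTimeMillis() * 1000; // column timestamps are in microseconds

        // Buffers rows in memory (64 MB here) and flushes SSTables in partitioner order,
        // so rows may be added in any order.
        SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                new File("/tmp/sstables/my_ks/users"),
                new Murmur3Partitioner(),
                "my_ks", "users",
                AsciiType.instance, null, 64);

        writer.newRow(bytes("user1"));                        // row key
        writer.addColumn(bytes("name"), bytes("Alice"), timestamp);
        writer.close();
    }
}
```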
SSTableSimpleWriter
Overview:
· Generates SSTables without buffering or sorting, so data must be added in partition (token) sorted order.
· Not recommended for use unless there is a specific use case.
Limitations:
· Requires inserting data in partition sorted order.
CQL3 Support: Limited