Wednesday, February 26, 2014

Bulk Integration Into and Out of Cassandra CQL3 Data Models


Purpose

The purpose of this blog is to provide an aggregated view of batch/bulk techniques available for integrating with Apache Cassandra with a CQL3 based data models, including Cassandra 1.2 and 2.0 (DSE 3.2 and 4.0).  This blog post will be augmented overtime as techniques evolve. 

Background

As long as people store data digitally, there will be a need to move large "chunks" of data between systems.  This could be for reasons of:
  • aggregation and combination (i.e. Analytical systems) where multiple sources of data are combined and queries
  • archive purposes (removal of old and non needed data)
  • data migration projects
  • large scale data model changes
  • database transition initiatives
  • other
This blog will provide a comprehensive view of bulk integration techniques when moving data into and out of Cassandra: 
  • with links to code samples
  • tips and tricks
  • recommended use cases
Information will be presented using a matrix view that lists different techniques as rows.  The matrix will contain 2 columns, one for loading data into Cassandra from a source (Into Cassandra) and the other for loading data from Cassandra into a source (Out of Cassandra).   "A source" generally means an RDBMS system, flat file, Hadoop, mainframe, etc.  The cells of this matrix will contain a summary for the row topic as well as links to additional posts that contain details supporting the summary.

We will work to provide a low latency matrix in a subsequent post to help with near real time integration/data pipe-lining techniques.

Batch Integration Matrix

 

Batch Technique
Into Cassandra
Out of Cassandra
Cassandra Bulk Loader (SSTableLoader)*
·  Loads SSTables into a Cassandra ring from a node or set of nodes that are not actively part of the ring.
·  Good option for migration, cluster generation, etc
·  Enables parallel loading of data
·  Should be executed on a set of “non cluster” nodes for optimal performance.
·  Requires creation of SSTables (see below)
·  CQL3 Support depends on the creation of SSTables
·  Counters may not be fully supported
·  Leverages Cassandra internal streaming
·  After completion, run a repair
n/a
Bulk Loading via JMX*
·  JMX based utility to stream sstable data into Cassandra.
·  Leverages same code functionality as SSTableLaoder
·  Enables loading of data from a cassandra node into the same node without requiring the configuration of a separate network interface, which is required for SSTableLaoder on the same node
·  Same requirements and limitations of SSTableLoader.
·  Source and Documentation (look for bulkLoad(String))
n/a
Copy SSTables
Into Data Directory 
·  Loads SSTables into a Cassandra ring
·  Good option for migration, cluster generation, etc
·  Requires creation of SSTables (see below)
·  Could leverage Snapshot SSTables for migration purposes
·  CQL3 Support depends on the structure of SSTables.
·  Working example and blog to be provided soon
Out of Data Directory
·  Get access to SSTable data with minimal production system impact
·  Can be used as a source to populate another cluster
ColumnFamily<>Format
ColumnFamilyOutputFormat
·  Thrift based driver to load data into Cassandra Thrift based column families
·  Not compatible with CQL3 tables
·  Example
ColumnFamilyInputFormat
·  Thrift based driver to extract data out of Cassandra Thrift based column families
·  Not compatible with CQL3 tables
·  Example
BulkOutputFormat
·  Thirft based tool/no CQL3 support (CQL3 support has to be creatively programmed using composites similar to the SSTable loader techniques for the Simple and SimpleUnsorted SSTable Wrtiers
·  Used to stream data from MR to Cassandra
·  Similar to Bulk Loading above but no need for a “fat client”, i.e. cassandra node to execute process.
·  Can be used with Pig
n/a
CQL<>Format
CQLOutputFormat
·  CQL3 based driver to load data into Cassandra CQL3 tables
·  This is not necessarily a “bulk loader”
·  Example
CQLPagingInputFormat
·  CQL3 based driver to extract data out of Cassandra CQL3 tables
·  This is not necessarily a “bulk loader”
·  Example
CQL3 Statements via M/R
·  Have heard that several users simply create Map only jobs and insert data into Cassandra leveraging a java driver, like the DSE java driver.
·  Example
n/a
CQL3 Batch Statements
·  Batch statements group individual statements into single operations and can be used for bulk-like integration processes.
·  Use the UNLOGGED if performance is a concern, though you lose atomicity
n/a
ETL Tools - Pentaho
·  Visual, ETL approach for the bulk integration of Cassandra data with other sources.
·  Pentaho Data Integration 5 supports CQL3 (JIRA)
·  We have not had the chance to test this approach but will do so soon.
·  Working example and blog to be provided soon
·  Visual, ETL approach for the bulk integration of Cassandra data with other sources.
·  Pentaho Data Integration 5 supports CQL3 (JIRA)
·  We have not had the chance to test this but will do so soon.
·  Working example and blog to be provided soon
ETL Tools - JasperSoft
·  Visual, ETL approach for the bulk integration of Cassandra data with other sources.
·  Jaspersoft Studio with the Cassandra Connector v 1.0 supports CQL3 (Release)
·  We have not had the chance to test this approach but will do so soon.
·  Working example and blog to be provided soon
·  Visual, ETL approach for the bulk integration of Cassandra data with other sources.
·  Jaspersoft Studio with the Cassandra Connector v 1.0 supports CQL3 (Release)
·  We have not had the chance to test this approach but will do so soon.
·  Working example and blog to be provided soon
ETL Tools - Talend
·  Visual, ETL approach for the bulk integration of Cassandra data with other sources.
·  We are still investigating what Talend supports with regards to Cassandra.  We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above. (JIRA)
·  We have not had the chance to test this approach but will do so soon.
·  Working example and blog to be provided soon
·  Visual, ETL approach for the bulk integration of Cassandra data with other sources.
·  We are still investigating what Talend supports with regards to Cassandra.  We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above. (JIRA)
·  We have not had the chance to test this approach but will do so soon.
·  Working example and blog to be provided soon
Sqoop (DSE version)
·  Offers a good tool for bulk/batch integration between many different databases, hadoop, and Cassandra.  
·  The DSE 3.X version does not as yet support CQL3, but this should be coming in DSE 4.X.
·  This is a very promising feature and we will update once DataStax releases the CQL3 support.
·  For CQL2, here is a write up with a working example
·  Offers a good tool for bulk/batch integration between many different databases, hadoop, and Cassandra.  
·  The DSE 3.X version does not as yet support CQL3, but this should be coming in DSE 4.X.
·  This is a very promising feature and we will update once DataStax releases the CQL3 support.
·  For CQL2, here is a write up with a working example
CQLSH Copy
·  Good tool for bulk loading a small amount of data, less than 1 million records is the recommendation.
·  Uses .csv files as import sources only
·  Good tool for writing out a small amount of data, less than 1 million records is the recommendation, to a cs file.
·  Uses .csv files as export source only
Flume Integration
·  Found a project on GitHub that offers Flume integration into Cassandra.  
·  Currently this is Hector based.  


*Requires SSTable Creation

*SSTable Writers

In order to load data into Cassandra using the bulk loading technique, denoted by *, SSTables need to be generated to load.  The following table provides information on the available techniques for the creating of SSTables.

Batch Technique
Overview
Limitations
CQL3 Support
CQLSSTableWriter
·  Fully supports CQL3 compatible SSTable creation
·  Contained in C* 2.X and higher
Full
SSTableSimpleUnsortedWriter
·  Generates SSTables simply and easily using partitioned order
·  Does not inherently support CQL3
·  CQL3 support can be created using composite types
·  DataStax Example (Non CQL)
·  Limited CQL Support
·  More complex CQL configuration compared to CQLSSTableWriter
Limited
SSTableSimpleWriter
·  Generates SSTables but not in sorted order, requires data be added in partition sorted order
·  Not recommended for use unless there is a specific use case.
·  Requires inserting data in partition sorted order
Limited

Conclusion

This list is meant to be a comprehensive guide.  Please let us know if we missed anything or if you have feedback on specific techniques.  Hopefully this post provides value to people who are analyzing different techniques to move large chunks of data into or out of Cassandra.

2 comments:

  1. Get 1 Billion worldwide data for bulk marketing your business...
    http://100milliondata.blogspot.com

    ReplyDelete
  2. I would like to share one of my opinions. To transform any data into any database fast and easily ETL tools are required. There are so many useful tools. People from all over the word can use whenever they need to transform any data into any database faster. So, I think ETL tools is the best tools ever in the word.
    http://www.etl-tools.com/

    ReplyDelete