The Information Management Survival Blog: Bulk Integration Into and Out of Cassandra CQL3 Data Models

Purpose

The purpose of this blog is to provide an aggregated view of the batch/bulk techniques available for integrating with Apache Cassandra using CQL3 based data models, including Cassandra 1.2 and 2.0 (DSE 3.2 and 4.0). This blog post will be augmented over time as techniques evolve.

Background

As long as people store data digitally, there will be a need to move large "chunks" of data between systems. This could be for reasons of:

aggregation and combination (i.e. analytical systems), where multiple sources of data are combined and queried

archive purposes (removal of old and unneeded data)

data migration projects

large scale data model changes

database transition initiatives

other

This blog will provide a comprehensive view of bulk integration techniques when moving data into and out of Cassandra:

with links to code samples
tips and tricks
recommended use cases

Information will be presented using a matrix view that lists different techniques as rows. The matrix will contain 2 columns, one for loading data into Cassandra from a source (Into Cassandra) and the other for loading data from Cassandra into a source (Out of Cassandra). "A source" generally means an RDBMS system, flat file, Hadoop, mainframe, etc. The cells of this matrix will contain a summary for the row topic as well as links to additional posts that contain details supporting the summary.

We will work to provide a low latency matrix in a subsequent post to help with near real time integration and data pipelining techniques.

Batch Integration Matrix

Batch Technique	Into Cassandra	Out of Cassandra
Cassandra Bulk Loader (SSTableLoader)*	· Loads SSTables into a Cassandra ring from a node or set of nodes that are not actively part of the ring. · Good option for migration, cluster generation, etc · Enables parallel loading of data · Should be executed on a set of “non cluster” nodes for optimal performance. · Requires creation of SSTables (see below) · CQL3 support depends on how the SSTables are created · Counters may not be fully supported · Leverages Cassandra internal streaming · After completion, run a repair · Write up (with sample) from DataStax · Source and Documentation · Patrick Callaghan Example (CQL3) · Patrick McFadin Example (CQL3)	n/a
Bulk Loading via JMX*	· JMX based utility to stream SSTable data into Cassandra. · Leverages the same code functionality as SSTableLoader · Enables loading of data from a Cassandra node into the same node without requiring the configuration of a separate network interface, which is required for SSTableLoader on the same node · Same requirements and limitations as SSTableLoader. · Write up (with sample) from DataStax · Source and Documentation (look for bulkLoad(String)) · Patrick Callaghan Example (CQL3)	n/a
Copy SSTables	Into Data Directory · Loads SSTables into a Cassandra ring · Good option for migration, cluster generation, etc · Requires creation of SSTables (see below) · Could leverage Snapshot SSTables for migration purposes · CQL3 Support depends on the structure of SSTables. · Write up from Planet Cassandra · Working example and blog to be provided soon	Out of Data Directory · Get access to SSTable data with minimal production system impact · Can be used as a source to populate another cluster
ColumnFamily<>Format	ColumnFamilyOutputFormat · Thrift based driver to load data into Cassandra Thrift based column families · Not compatible with CQL3 tables · Source code and Documentation · Example	ColumnFamilyInputFormat · Thrift based driver to extract data out of Cassandra Thrift based column families · Not compatible with CQL3 tables · Source code and Documentation · Example
BulkOutputFormat	· Thrift based tool with no CQL3 support (CQL3 support has to be creatively programmed using composites, similar to the SSTable loader techniques for the Simple and SimpleUnsorted SSTable Writers) · Used to stream data from MR to Cassandra · Similar to Bulk Loading above but no need for a “fat client”, i.e. a Cassandra node, to execute the process. · Can be used with Pig · Write up from DataStax · Source Code and Documentation	n/a
CQL<>Format	CQLOutputFormat · CQL3 based driver to load data into Cassandra CQL3 tables · This is not necessarily a “bulk loader” · Source Code and Documentation · Example	CQLPagingInputFormat · CQL3 based driver to extract data out of Cassandra CQL3 tables · This is not necessarily a “bulk loader” · Source Code and Documentation · Example
CQL3 Statements via M/R	· We have heard that several users simply create map-only jobs and insert data into Cassandra using a Java driver, like the DSE Java driver. · Example	n/a
CQL3 Batch Statements	· Batch statements group individual statements into single operations and can be used for bulk-like integration processes. · Use UNLOGGED batches if performance is a concern, though you lose atomicity · Write up (with sample) from DataStax	n/a
ETL Tools - Pentaho	· Visual, ETL approach for the bulk integration of Cassandra data with other sources. · Pentaho Data Integration 5 supports CQL3 (JIRA) · We have not had the chance to test this approach but will do so soon. · Working example and blog to be provided soon	· Visual, ETL approach for the bulk integration of Cassandra data with other sources. · Pentaho Data Integration 5 supports CQL3 (JIRA) · We have not had the chance to test this but will do so soon. · Working example and blog to be provided soon
ETL Tools - JasperSoft	· Visual, ETL approach for the bulk integration of Cassandra data with other sources. · Jaspersoft Studio with the Cassandra Connector v 1.0 supports CQL3 (Release) · We have not had the chance to test this approach but will do so soon. · Working example and blog to be provided soon	· Visual, ETL approach for the bulk integration of Cassandra data with other sources. · Jaspersoft Studio with the Cassandra Connector v 1.0 supports CQL3 (Release) · We have not had the chance to test this approach but will do so soon. · Working example and blog to be provided soon
ETL Tools - Talend	· Visual, ETL approach for the bulk integration of Cassandra data with other sources. · We are still investigating what Talend supports with regard to Cassandra. We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above. (JIRA) · We have not had the chance to test this approach but will do so soon. · Working example and blog to be provided soon	· Visual, ETL approach for the bulk integration of Cassandra data with other sources. · We are still investigating what Talend supports with regard to Cassandra. We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above. (JIRA) · We have not had the chance to test this approach but will do so soon. · Working example and blog to be provided soon
Sqoop (DSE version)	· Offers a good tool for bulk/batch integration between many different databases, Hadoop, and Cassandra. · The DSE 3.X version does not yet support CQL3, but this should be coming in DSE 4.X. · This is a very promising feature and we will update once DataStax releases the CQL3 support. · For CQL2, here is a write up with a working example	· Offers a good tool for bulk/batch integration between many different databases, Hadoop, and Cassandra. · The DSE 3.X version does not yet support CQL3, but this should be coming in DSE 4.X. · This is a very promising feature and we will update once DataStax releases the CQL3 support. · For CQL2, here is a write up with a working example
CQLSH Copy	· Good tool for bulk loading a small amount of data; fewer than 1 million records is the recommendation. · Uses .csv files as import sources only · Write up (with sample) from DataStax	· Good tool for writing out a small amount of data to a .csv file; fewer than 1 million records is the recommendation. · Uses .csv files as export targets only · Write up (with sample) from DataStax
Flume Integration	· Found a project on GitHub that offers Flume integration into Cassandra. · Currently this is Hector based.
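
As a rough sketch of the CQLSH Copy row in the matrix above, the Python below writes a small .csv file and builds the matching COPY statements to paste into cqlsh. The table `demo.users`, its columns, and the helper names are hypothetical. The writer settings mirror what we understand to be cqlsh's default COPY options (comma delimiter, double-quote QUOTE, backslash ESCAPE), so adjust them if your options differ.

```python
import csv
import os
import tempfile


def write_copy_csv(path, rows):
    """Write rows as CSV in the dialect assumed for cqlsh COPY; return the row count."""
    count = 0
    with open(path, "w", newline="") as handle:
        writer = csv.writer(
            handle, delimiter=",", quotechar='"', escapechar="\\", doublequote=False
        )
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def copy_from(table, columns, path):
    """COPY statement that imports `path` into `table` (run it inside cqlsh)."""
    return "COPY {} ({}) FROM '{}';".format(table, ", ".join(columns), path)


def copy_to(table, columns, path):
    """COPY statement that exports `table` to `path` (run it inside cqlsh)."""
    return "COPY {} ({}) TO '{}';".format(table, ", ".join(columns), path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "users.csv")
        written = write_copy_csv(csv_path, [(1, "alice"), (2, "Smith, J")])
        print(written, "rows written")
        print(copy_from("demo.users", ["id", "name"], csv_path))
        print(copy_to("demo.users", ["id", "name"], csv_path))
```

Returning the row count makes it easy to check a file against the 1 million record recommendation before handing it to cqlsh.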

*Requires SSTable Creation
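
To make the CQL3 Batch Statements row in the matrix above more concrete, here is a minimal sketch that splits rows into chunks and renders one UNLOGGED batch per chunk as CQL text. The table `demo.users`, the column names, and the batch size of 50 are assumptions chosen for illustration. A real loader would more likely send prepared statements through a driver than build literal strings, and this sketch only handles simple value types.

```python
from itertools import islice


def cql_literal(value):
    """Render a simple Python value (None, bool, int, float, str) as a CQL literal."""
    if value is None:
        return "null"
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # CQL escapes a single quote inside a string by doubling it.
    return "'" + str(value).replace("'", "''") + "'"


def chunked(rows, size):
    """Yield successive lists of at most `size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def unlogged_batches(table, columns, rows, batch_size=50):
    """Yield one BEGIN UNLOGGED BATCH ... APPLY BATCH statement per chunk of rows."""
    column_list = ", ".join(columns)
    for chunk in chunked(rows, batch_size):
        inserts = [
            "  INSERT INTO {} ({}) VALUES ({});".format(
                table, column_list, ", ".join(cql_literal(v) for v in row)
            )
            for row in chunk
        ]
        yield "\n".join(["BEGIN UNLOGGED BATCH"] + inserts + ["APPLY BATCH;"])


if __name__ == "__main__":
    demo_rows = [(1, "alice"), (2, "bob"), (3, "o'neil")]
    for statement in unlogged_batches("demo.users", ["id", "name"], demo_rows, batch_size=2):
        print(statement)
```

Keeping batches small is a deliberate choice here: each batch is handled as a single operation, so very large batches put more pressure on the node that receives them.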

*SSTable Writers

In order to load data into Cassandra using the bulk loading techniques denoted by *, SSTables must first be generated. The following table provides information on the available techniques for creating SSTables.

Batch Technique	Overview	Limitations	CQL3 Support
CQLSSTableWriter	· Fully supports CQL3 compatible SSTable creation · Apache JIRA · Source and Documentation	· Contained in C* 2.X and higher	Full
SSTableSimpleUnsortedWriter	· Generates SSTables simply and easily; rows can be added in any order because the writer buffers them and writes them out in partitioner sorted order · Does not inherently support CQL3 · CQL3 support can be created using composite types · Source and Documentation · DataStax Example (Non CQL) · Patrick Callaghan Example (CQL) · Patrick McFadin Example (CQL)	· Limited CQL Support · More complex CQL configuration compared to CQLSSTableWriter	Limited
SSTableSimpleWriter	· Generates SSTables but does not sort the data itself; rows must be added in partitioner sorted order · Not recommended for use unless there is a specific use case. · Source and Documentation	· Requires inserting data in partitioner sorted order	Limited
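
Whichever writer is used, the generated files then have to be handed to the bulk loader. The sketch below stages an output directory and assembles an sstableloader command line. It relies on two things we are confident about: sstableloader takes its initial hosts through `-d`, and it reads the keyspace and table names from the last two components of the directory path. The keyspace `demo`, table `users`, host addresses, and helper names are hypothetical, and the command is only printed, not executed.

```python
import os
import tempfile


def stage_directory(root, keyspace, table):
    """Create and return <root>/<keyspace>/<table>, the layout sstableloader expects."""
    path = os.path.join(root, keyspace, table)
    os.makedirs(path, exist_ok=True)
    return path


def sstableloader_command(sstable_dir, initial_hosts):
    """Build the argument list; -d names the initial hosts used to discover the ring."""
    if not initial_hosts:
        raise ValueError("at least one initial host is required")
    return ["sstableloader", "-d", ",".join(initial_hosts), sstable_dir]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        target = stage_directory(root, "demo", "users")
        # An SSTable writer would be pointed at `target` before loading.
        print(" ".join(sstableloader_command(target, ["10.0.0.1", "10.0.0.2"])))
```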

Conclusion

This list is meant to be a comprehensive guide. Please let us know if we missed anything or if you have feedback on specific techniques. Hopefully this post provides value to people who are analyzing different techniques to move large chunks of data into or out of Cassandra.

Wednesday, February 26, 2014
