Purpose
The purpose of this blog is to provide an aggregated view of batch/bulk techniques available for integrating with Apache Cassandra with a CQL3 based data models, including Cassandra 1.2 and 2.0 (DSE 3.2 and 4.0). This blog post will be augmented overtime as techniques evolve.
Background
As long as people store data digitally, there will be a need to move large "chunks" of data between systems. This could be for reasons of:
- aggregation and combination (i.e. Analytical systems) where multiple sources of data are combined and queries
- archive purposes (removal of old and non needed data)
- data migration projects
- large scale data model changes
- database transition initiatives
- other
This blog will provide a comprehensive view of bulk integration techniques when moving data into and out of Cassandra:
- with links to code samples
- tips and tricks
- recommended use cases
Information will be presented using a matrix view that lists different techniques as rows. The matrix will contain 2 columns, one for loading data into Cassandra from a source (Into Cassandra) and the other for loading data from Cassandra into a source (Out of Cassandra). "A source" generally means an RDBMS system, flat file, Hadoop, mainframe, etc. The cells of this matrix will contain a summary for the row topic as well as links to additional posts that contain details supporting the summary.
We will work to provide a low latency matrix in a subsequent post to help with near real time integration/data pipe-lining techniques.
We will work to provide a low latency matrix in a subsequent post to help with near real time integration/data pipe-lining techniques.
Batch Integration Matrix
Batch Technique
|
Into Cassandra
|
Out of Cassandra
|
Cassandra Bulk Loader (SSTableLoader)*
|
· Loads SSTables into a Cassandra ring from a node or set of nodes that are not actively part of the ring.
· Good option for migration, cluster generation, etc
· Enables parallel loading of data
· Should be executed on a set of “non cluster” nodes for optimal performance.
· Requires creation of SSTables (see below)
· CQL3 Support depends on the creation of SSTables
· Counters may not be fully supported
· Leverages Cassandra internal streaming
· After completion, run a repair
|
n/a
|
Bulk Loading via JMX*
|
· JMX based utility to stream sstable data into Cassandra.
· Leverages same code functionality as SSTableLaoder
· Enables loading of data from a cassandra node into the same node without requiring the configuration of a separate network interface, which is required for SSTableLaoder on the same node
· Same requirements and limitations of SSTableLoader.
|
n/a
|
Copy SSTables
|
Into Data Directory
· Loads SSTables into a Cassandra ring
· Good option for migration, cluster generation, etc
· Requires creation of SSTables (see below)
· Could leverage Snapshot SSTables for migration purposes
· CQL3 Support depends on the structure of SSTables.
· Working example and blog to be provided soon
|
Out of Data Directory
· Get access to SSTable data with minimal production system impact
· Can be used as a source to populate another cluster
|
ColumnFamily<>Format
|
ColumnFamilyOutputFormat
· Thrift based driver to load data into Cassandra Thrift based column families
· Not compatible with CQL3 tables
· Example
|
ColumnFamilyInputFormat
· Thrift based driver to extract data out of Cassandra Thrift based column families
· Not compatible with CQL3 tables
· Example
|
BulkOutputFormat
|
· Thirft based tool/no CQL3 support (CQL3 support has to be creatively programmed using composites similar to the SSTable loader techniques for the Simple and SimpleUnsorted SSTable Wrtiers
· Used to stream data from MR to Cassandra
· Similar to Bulk Loading above but no need for a “fat client”, i.e. cassandra node to execute process.
· Can be used with Pig
|
n/a
|
CQL<>Format
|
CQLOutputFormat
· CQL3 based driver to load data into Cassandra CQL3 tables
· This is not necessarily a “bulk loader”
· Example
|
CQLPagingInputFormat
· CQL3 based driver to extract data out of Cassandra CQL3 tables
· This is not necessarily a “bulk loader”
· Example
|
CQL3 Statements via M/R
|
· Have heard that several users simply create Map only jobs and insert data into Cassandra leveraging a java driver, like the DSE java driver.
· Example
|
n/a
|
CQL3 Batch Statements
|
· Batch statements group individual statements into single operations and can be used for bulk-like integration processes.
· Use the UNLOGGED if performance is a concern, though you lose atomicity
|
n/a
|
ETL Tools - Pentaho
|
· Visual, ETL approach for the bulk integration of Cassandra data with other sources.
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon
|
· Visual, ETL approach for the bulk integration of Cassandra data with other sources.
· We have not had the chance to test this but will do so soon.
· Working example and blog to be provided soon
|
ETL Tools - JasperSoft
|
· Visual, ETL approach for the bulk integration of Cassandra data with other sources.
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon
|
· Visual, ETL approach for the bulk integration of Cassandra data with other sources.
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon
|
ETL Tools - Talend
|
· Visual, ETL approach for the bulk integration of Cassandra data with other sources.
· We are still investigating what Talend supports with regards to Cassandra. We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above. (JIRA)
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon
|
· Visual, ETL approach for the bulk integration of Cassandra data with other sources.
· We are still investigating what Talend supports with regards to Cassandra. We did notice that it appears as if Talend would like to support the SSTableLoader mentioned above. (JIRA)
· We have not had the chance to test this approach but will do so soon.
· Working example and blog to be provided soon
|
Sqoop (DSE version)
|
· Offers a good tool for bulk/batch integration between many different databases, hadoop, and Cassandra.
· The DSE 3.X version does not as yet support CQL3, but this should be coming in DSE 4.X.
· This is a very promising feature and we will update once DataStax releases the CQL3 support.
· For CQL2, here is a write up with a working example
|
· Offers a good tool for bulk/batch integration between many different databases, hadoop, and Cassandra.
· The DSE 3.X version does not as yet support CQL3, but this should be coming in DSE 4.X.
· This is a very promising feature and we will update once DataStax releases the CQL3 support.
· For CQL2, here is a write up with a working example
|
CQLSH Copy
|
· Good tool for bulk loading a small amount of data, less than 1 million records is the recommendation.
· Uses .csv files as import sources only
|
· Good tool for writing out a small amount of data, less than 1 million records is the recommendation, to a cs file.
· Uses .csv files as export source only
|
Flume Integration
|
· Found a project on GitHub that offers Flume integration into Cassandra.
· Currently this is Hector based. |
*SSTable Writers
In order to load data into Cassandra using the bulk loading technique, denoted by *, SSTables need to be generated to load. The following table provides information on the available techniques for the creating of SSTables.
Batch Technique
|
Overview
|
Limitations
|
CQL3 Support
|
CQLSSTableWriter
|
· Fully supports CQL3 compatible SSTable creation
|
· Contained in C* 2.X and higher
|
Full
|
SSTableSimpleUnsortedWriter
|
· Generates SSTables simply and easily using partitioned order
· Does not inherently support CQL3
· CQL3 support can be created using composite types
|
· Limited CQL Support
· More complex CQL configuration compared to CQLSSTableWriter
|
Limited
|
SSTableSimpleWriter
|
· Generates SSTables but not in sorted order, requires data be added in partition sorted order
· Not recommended for use unless there is a specific use case.
|
· Requires inserting data in partition sorted order
|
Limited
|
Get 1 Billion worldwide data for bulk marketing your business...
ReplyDeletehttp://100milliondata.blogspot.com
I would like to share one of my opinions. To transform any data into any database fast and easily ETL tools are required. There are so many useful tools. People from all over the word can use whenever they need to transform any data into any database faster. So, I think ETL tools is the best tools ever in the word.
ReplyDeletehttp://www.etl-tools.com/
Quickly Export and Import Entire Cassandra Database with Cassandra Technical Support
ReplyDeleteIn the event that you are pondering this inquiry how to import and fare the Cassandra database then without quite a bit of pressure simply influence an association with Cassandra Customer Service and Cassandra Database Support. From bringing in to sending out we additionally give speedy help to reinforcement, recuperation, establishment and design. Be that as it may, on the off chance that you experience any issue or specialized hiccups at that point contact to Cassandra Database Consulting and Support.
For More Info: https://cognegicsystems.com/
Contact Number: 1-800-450-8670
Email Address- info@cognegicsystems.com
Company’s Address- 507 Copper Square Drive Bethel Connecticut (USA) 06801
The Apache Cassandra is a free and open-source database which recommends you can on an astoundingly fundamental level alter, change and utilize it. It is versatile in nature since it enough changes the information association and handles extensive measure of information when ascended out of some excellent databases. Regardless, in the event that you're Cassandra's information has been lost then what will you do? How to recuperate them? Accreditation, the most ideal approach to manage directs understand this issue is Cassandra Database Support or Apache Cassandra Support. At Cognegic you will get most dazzling help and sensibly get back your information with Cassandra Database Consulting and Support.
ReplyDeleteFor More Info: https://cognegicsystems.com/
Contact Number: 1-800-450-8670
Email Address- info@cognegicsystems.com
microsoft online job support from india,microsoft project support from india,PHP online job support from india,PHP project job support from india,ETL testing online job support from india,ETL testing project support from india
ReplyDeletejava online job support from india,
ReplyDeletejava project support from india,
mainframe online job support from india,
mainframe project job support from india,
workday online job support from india,
workday project support from india
DevOps online job support from india,
ReplyDeleteDevOps project support from india,Android online job support from india,Android project support from india,Teradata online job support from india,Teradata project support from india
pentaho online job support from india,
ReplyDeletepentaho project support from india,SAP SD online job support from india,SAP SD project job support from india,ReactJS online job support from india,ReactJS project support from india
microsoft online job support from india,microsoft project support from india,PHP online job support from india,PHP project job support from india,ETL testing online job support from india,ETL testing project support from india
ReplyDeletePeoplesoft
ReplyDeletePeopleSoft online job support from india
PeopleSoft project job support from india
Informatica
Informatica online job support from india
Informatica project job support from india
Python
Python online job support from india
Python project job support from india