Thursday, November 21, 2013

Import cfhistograms output into Excel

One of my team members asked if I would create a tool to import cfhistograms into Excel.

Done!  It's called Excel Import :)

This post will walk people through how to import cfhistograms into Excel via the Import utility.   We'll use the recipe style for this one.

Ingredients:

1)  nodetool cfhistograms output piped into a text file (we'll call this cfhistograms.txt)
2)  Excel (We'll use Excel 2011 for this item)

Recipe:

1) Create the cfhistograms text file using the following command:
./nodetool cfhistograms <keyspace> <columnfamily> > <textfile>

Here's an example of this on one of my test machines:
./nodetool cfhistograms classproject recordscounter > /root/monitor/cfhistograms.txt

2)  Move your cfhistograms.txt file to a location that can be accessed by Excel. 
-- I placed my copy on my MacBook

3)  Open Excel and create a new workbook

4)  Select Import under the File menu
5)  Select Text File in the menu and click Import

6)  Select the input file and click Get Data
7) Ensure the Fixed Width radio button is selected and increase the "Start import at row" value to 2, then click Next

8)  Now create fixed-width column breaks that line up with the last character of each column.
See the diagram for an example.
  Look at the top box for instructions on how to create columns.
  Once you have your columns created correctly, click Finish.
9)  Choose the destination for the .csv file and click OK
10)  Enjoy your new Excel rendering of cfhistograms
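
If you would rather skip the import wizard, here is a minimal command-line sketch of the same idea. It assumes the cfhistograms output has one title line followed by a header and data rows whose columns are separated by two or more spaces, which is what the "start import at row 2" step above suggests. The sample file and its numbers are made up for illustration; check the result against your own output before trusting it.

```shell
# Hypothetical sample standing in for real `nodetool cfhistograms` output.
cat > cfhistograms_sample.txt <<'EOF'
classproject/recordscounter histograms
Offset      SSTables     Write Latency      Read Latency          Row Size      Column Count
1                  0                 0                 0                 0                 0
2                  5                12                 3                 0                 7
EOF

# Skip the title line, trim each line, then turn every run of
# two or more spaces into a comma. Single spaces inside header
# names such as "Write Latency" are left alone.
awk 'NR > 1 { sub(/^ +/, ""); sub(/ +$/, ""); gsub(/  +/, ","); print }' \
    cfhistograms_sample.txt > cfhistograms_sample.csv

cat cfhistograms_sample.csv
```

Excel can open the resulting .csv directly, so no fixed-width column setup is needed.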

Hope you all enjoyed this one.  I put almost a lot of effort into it, meaning about none :)



Analyzing cfstats in Excel

Here's a great way to get an overarching view of your Cassandra cluster.

The nodetool cfstats output is a great tool to get an understanding of what exists in your cluster.  There are many resources out there to help you understand how to interpret the output of this utility; here's one from the C* 1.2 documentation.

We like to use the output of cfstats as an entry point for analyzing C* clusters.

However, I have found that reviewing this data in a different format, i.e. in Excel, is even more helpful.

Cfstats provides output as a scrollable list of data points.   Here's a snippet from one of my test environments:


This is a good format as there are many helpful data points listed in this output.

However, leveraging an Excel table format to analyze cfstats output provides a lot of benefits over the standard output.

A picture's worth a thousand words, so here's what I mean by analyzing in Excel:


Using cfstats in a tool like Excel provides the following benefits over the traditional approach to analyzing cfstats:
  • You can sort your output to find interesting patterns and quickly pinpoint problems, like which CFs are read from or written to the most.
    • In the above example, I've sorted by read latency and can immediately see that I have a few CFs that are behaving poorly with read operations.
  • You can add aggregated columns, like Read Ratio and Write Ratio, to empirically understand the usage patterns in a keyspace.
    • In the above example, I've added two aggregate columns and changed the header cell color to grey
  • You can add charting to visually compare CF metrics
  • Leverage the output as a communication tool with other team members and management
    • Management really does love Excel, don't they :)
So, how do you get cfstats output into Excel?  Easy, you can use this little utility that I've posted on GitHub here:  https://github.com/jlacefie/cfstats-csv-parser.

Check out the README file for instructions on using the tool.  It's really easy and straightforward.
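
Once the data is in CSV form, aggregate columns like Read Ratio and Write Ratio can also be computed before the file ever reaches Excel. Here is a rough sketch. The three-column CSV layout, the file names and the numbers are all hypothetical (the real output of the parser may be laid out differently), and it assumes each ratio means that operation's share of total reads plus writes.

```shell
# Hypothetical CSV: column family name, read count, write count.
# The real column layout produced by cfstats-csv-parser may differ.
cat > cfstats_sample.csv <<'EOF'
column_family,read_count,write_count
recordscounter,300,100
users,50,150
EOF

# Append read_ratio and write_ratio, each as a share of total operations.
# A column family with no traffic gets 0 for both to avoid dividing by zero.
awk -F, '
NR == 1 { print $0 ",read_ratio,write_ratio"; next }
{
    total = $2 + $3
    if (total > 0) { r = $2 / total; w = $3 / total } else { r = 0; w = 0 }
    printf "%s,%d,%d,%.2f,%.2f\n", $1, $2, $3, r, w
}' cfstats_sample.csv > cfstats_ratios.csv

cat cfstats_ratios.csv
```

Sorting on the new columns in Excel then shows at a glance which CFs are read-heavy and which are write-heavy.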

Using cfstats in Excel as an entry point into cluster analysis is a handy way to get a big picture understanding of the contents and behavior of your cluster.  From here, leverage tools such as OpsCenter, cfhistograms and tpstats for troubleshooting and deep-dive investigation.  

Good luck and happy "Cassandra-ing"



Monday, November 11, 2013

Step by Step: CentOS Installation for Cassandra/DSE Node



The purpose of this blog is to provide a step by step guide to installing CentOS in preparation for a Cassandra installation.  By following this blog you will have a clean node that is specifically ready for a Cassandra (DSE) install.

Note:  We will be using a VirtualBox image as the baseline for this blog and for example purposes only.  It is not recommended to run Cassandra or DSE on VirtualBox in production.  Please note the bare metal installation steps are fairly similar and we will call out the differences along the way.

Note:  For help in selecting the right hardware for Cassandra, please visit the DataStax documentation here:  Architecture Planning

Step 1:  Prepare Your Installation Media:  

We used the 6.4 x86_64 iso version of CentOS:  http://wiki.centos.org/Download
We used the x86_64 LiveCD iso version of CentOS when we installed on our VirtualBox image

When doing a bare metal install, you will need to create installation media.  We created a USB key from the LiveCD version of the install.  If you need help creating a bootable USB Key, just ask Uncle Google and you will receive a lot of guidance.  Also, be sure to change any BIOS settings necessary to ensure your machine will see the USB Key when booting.

Step 2:  Start the Installation

Once your bootable media is created and you have configured your server to see the bootable media, it is time to start the install.   I created a new VirtualBox image with 4 GB of RAM and 8 GB of disk space for the purposes of this blog.  I started the machine and selected the start-up disk.



Hit Esc when CentOS starts loading.  This will bring you to the boot menu.  Select Install to install with the GUI; it's what we are going to use for this blog.  If you're up for it, go with Text mode :)


Now hit enter and let's get the install started!

Step 3:  Configure the Basics

On the first screen click Next.
Then, choose your language, we chose English, click Next.
Then, select your keyboard, we chose U.S. English, click Next.
Then, select Basic Storage Devices, click Next.
   - Note: Do not install or use any type of NAS to support Cassandra.  Cassandra is a distributed system.  One of the main benefits of Cassandra is that it doesn't have a single point of failure.  Leveraging any type of NAS would create a single point of failure for the system, which is undesirable.  You may ask, what about all the Amazon installs of Cassandra?  When installing on Amazon, we leverage ephemeral storage, not shared storage.
Select Yes, wipe my data, click Next.
Then, give your machine a hostname, click Next.
Select your timezone, click Next.
Select the root password, click Next.

Step 4:  Configure Storage

Okay, here's where we did some very specific things for Cassandra.  
We chose to install without a swap drive.  Doing this means if the node runs out of memory, it will "blow up".  We are disabling swap for the following reasons:

1)  Cassandra is a distributed system.  If a node goes down, that's okay, the system will continue operating and we will get an opportunity to know why, i.e. an opportunity to learn!
2)  Masking a memory issue with swapping is undesirable with Cassandra as it will cause disk performance issues.  We would much rather find out we have exhausted memory resources and take corrective action.  Again, Cassandra is a distributed system.  In fact, we call it "anti-fragile", thanks Mr. Taleb for that concept.  Cassandra installations, and teams, tend to get better when node failures are observed and corrected.
3)  Cassandra is a database system.  Like any database system, we want to explicitly control as much disk I/O as possible.  We do not want undesired disk I/O from memory swapping.
4)  Our colleague, and fellow Bald-Jonathan, Jonathan Shook said it was a good idea and that guy's awesome.

On the following menu select Create Custom Layout, click Next

Delete all partitions that exist on your hard drives.  This assumes you have nothing important on the hard drives.  If you have anything of importance on your system, then stop the installation and move your data.  Once all partitions are deleted, you should see something like the following.

Each physical drive should be empty and should only show "Free" under the physical drive.  Because we are using a virtual machine for this exercise, we only see sda.

Note:  You should have at least 2 separate drives in your machine when installing Cassandra, preferably more.  Try to use SSDs.  Here's a good link to physical disk recommendations for Cassandra:  Hardware Recommendations (SSDs)

Now create a set of standard partitions:
1) Boot partition for the OS
2) Cassandra partition
   -- If you have only 2 disks and are not using SSDs, it is recommended to place the OS and data together on the same disk, thus isolating the commit log on the other disk.
   -- If you are utilizing SSDs, then you can place the commit log and data on the same partition.
   -- We aren't going to discuss the differences of using RAID or JBOD in this blog.  That will be a topic for another day.

Here's a screenshot of what you will see when you click the Create button.  Select Standard Partition.

Here's a screenshot of the partition information.  Your information will differ.  It is recommended to use xfs for the file system type if you are able.

Here's a screenshot of the Cassandra partition.  We are using SSDs, so we only have one partition.

Finally, here's what our disk layout looks like once we are ready for the install.  
Note that in a real, non-virtual install, we would have the /boot partition on a separate disk from the /cassandra partition.
Click Next.
When warned that you have not specified a swap partition, select Yes.
Next, select Format on the format prompt and then click Continue.
Now, select Next when prompted about installing a boot loader.
You are now installing CentOS!

Once the installer completes you will see the following message.

Give yourself a high five, then reboot.  Walk through the last few steps.
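
Once the node is back up and you have a terminal, it is worth confirming that it really came up without swap. Here is a small sketch, assuming a Linux box where /proc/meminfo reports a SwapTotal line; the function name is ours and purely illustrative.

```shell
# Print the configured swap size in kB from a meminfo-style file.
# Defaults to /proc/meminfo when no file is given.
swap_total_kb() {
    awk '/^SwapTotal:/ { print $2 }' "${1:-/proc/meminfo}"
}

# A value of 0 means no swap partition or swap file is active.
if [ "$(swap_total_kb)" = "0" ]; then
    echo "swap is disabled"
else
    echo "swap is still configured (or /proc/meminfo is unavailable)"
fi
```

Running swapon -s or free -m by hand gives the same answer if you prefer to eyeball it.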

Note:  In Production installations, be sure to use NTP to keep the Cassandra node clocks synchronized.

Step 5:  Configure the OS for Cassandra.

In this step we are going to do a few items to get the OS ready for the Cassandra install:
1) update the OS
2) enable SSH so we can use a terminal.  BTW:  have you used ClusterSSH?  It makes working with clusters a lot easier.  Check it out here:  http://sourceforge.net/projects/clusterssh/   Thanks to Johnny Miller for making us aware of this tool!
3) disable unwanted "stuff", including the UI if you installed it
-- if needed, disable the firewall and SELinux.
    -  we will not cover this topic.  It's up to you to decide if this is a good idea for your environment.
4) since we are using SSDs, we need to tweak a few settings to ensure the system can use SSDs properly
5) install the right version of Java

Note: We are using root to install and interact with the system.  Using root as a user is never recommended in a Production environment.

Update the OS:
   1)  run yum update

Enable SSH:  
   1)  To start sshd for the current session, run the following:   /sbin/service sshd start
   2)  To enable sshd on startup, run the following:  chkconfig sshd on

Disable Unwanted Stuff
   1)  We disabled the UI by opening the following file in our favorite editor:  /etc/inittab
         -- now change the 5 to a 3 so that the line id:5:initdefault: becomes id:3:initdefault:
         -- now reboot
   2)  We turned off a lot of unwanted services by following these guides:
   3)  Because we are running a demo machine, we disabled SELinux and the Firewall
        -- to disable SELinux, open the following file in your favorite editor and change SELINUX=enforcing to SELINUX=disabled:  /etc/sysconfig/selinux
        -- to disable the Firewall run the following:
             # service iptables save
             # service iptables stop
             # chkconfig iptables off
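
If you have more than a handful of nodes, the two file edits above (runlevel and SELinux) are easier to script than to do by hand in an editor. Here is a sketch using sed. It works on sample copies with made-up names so you can try it without root; on a real node the targets would be /etc/inittab and the SELinux config file, and we are assuming those files contain the stock lines shown.

```shell
# Sample copies stand in for /etc/inittab and /etc/sysconfig/selinux,
# so the substitutions can be tried without root.
printf 'id:5:initdefault:\n' > inittab_sample
printf 'SELINUX=enforcing\nSELINUXTYPE=targeted\n' > selinux_sample

# Default runlevel 5 (graphical) becomes 3 (text mode); a .bak backup is kept.
sed -i.bak 's/^id:5:initdefault:/id:3:initdefault:/' inittab_sample

# SELinux mode becomes disabled; the SELINUXTYPE line is left untouched.
sed -i.bak 's/^SELINUX=.*/SELINUX=disabled/' selinux_sample

cat inittab_sample selinux_sample
```

One caution: if /etc/sysconfig/selinux is a symlink on your system, an in-place sed may replace the link with a regular file, so it is safer to point sed at the link's target. Both changes take effect at the next reboot.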

 Enable SSDs on the OS
 We want the OS to understand how to interact with SSDs.  Most OSes are configured for HDDs by default.  The following three commands help the OS get the most out of SSDs by tweaking the I/O scheduler and two disk I/O specific settings.
           echo noop > /sys/block/sda/queue/scheduler
           echo 0 > /sys/block/sda/queue/read_ahead_kb
           echo 0 > /sys/block/sda/queue/rotational
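
Keep in mind that values written under /sys do not survive a reboot. One common way to reapply them is to add the same commands to /etc/rc.local, which CentOS 6 runs at the end of boot. The sketch below generates those lines for a list of devices. The device names sda and sdb and the function name are hypothetical, and it writes to a sample file rather than the real /etc/rc.local so you can inspect the result first.

```shell
# Emit the three tuning commands from this post for one block device.
ssd_tuning_lines() {
    dev="$1"
    echo "echo noop > /sys/block/$dev/queue/scheduler"
    echo "echo 0 > /sys/block/$dev/queue/read_ahead_kb"
    echo "echo 0 > /sys/block/$dev/queue/rotational"
}

# Hypothetical device list; output goes to a sample file instead of /etc/rc.local.
for dev in sda sdb; do
    ssd_tuning_lines "$dev"
done > rc_local_sample

cat rc_local_sample
```

After reviewing the generated lines, append them to /etc/rc.local as root, and only for the devices that are actually SSDs.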

  Install Java
  Here is a good Cassandra guide to installing the right version of Java.  Please note that OpenJDK is not a good choice for Cassandra.   Cassandra 1.2 JAVA Installation
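
After installing Java, a quick sanity check is to see which JVM is actually first on the path. Here is a minimal sketch that flags an OpenJDK build by looking for the word "OpenJDK" in the version banner. The function name is ours, and matching on the banner text is a heuristic rather than an official check.

```shell
# Succeeds when the given `java -version` text identifies an OpenJDK build.
is_openjdk() {
    printf '%s\n' "$1" | grep -qi 'openjdk'
}

# `java -version` prints to stderr, so capture stderr as well.
# If java is not installed, the text simply will not mention OpenJDK.
version_text="$(java -version 2>&1 || true)"

if is_openjdk "$version_text"; then
    echo "active JVM looks like OpenJDK"
else
    echo "active JVM does not report OpenJDK (or java is not installed)"
fi
```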

Step 6:  Install Cassandra

Well, at this point you are ready to install Cassandra/DSE.  The documentation for installing Cassandra and DSE is pretty straightforward.

We will follow up with another blog to show you how to create a demo environment that contains a few individual nodes hosted on a single machine.

Good luck and feel free to reach out with questions or feedback.