Wednesday, February 13, 2013

Install Cassandra Cluster within 30 minutes

Installing a Cassandra cluster is pretty straightforward, and you will find a lot of detailed documentation on the DataStax site. But if you still find it overwhelming, you can follow the steps I mention here. It's just an ordered version of all the steps you need to install a basic Cassandra cluster. I still recommend reading the DataStax documents for detailed information.

I am using an EBS-backed CentOS AMI for this setup. You can choose a different AMI based on your needs.

Please check the current Cassandra version if you want to set up the latest one. Today, I am going to install Cassandra 1.2.3.

Step #1: Create an instance in Amazon AWS. I chose the following one for this setup:

EBS Backed CentOS 6.3

This is an EBS-backed 64-bit CentOS 6.3 AMI. For me, the IP address assigned to this server was:

Step #2: Once you have created this instance, it will not allow you to log in as the root user by default. So, log in to that server as ec2-user.

Step #3: Now allow root login on that machine:
Using username "ec2-user".
Authenticating with public key "imported-openssh-key"
[ec2-user@ip-10-0-0-57 ~]$ sudo su
[root@ip-10-0-0-57 ec2-user]# cd /root/.ssh
[root@ip-10-0-0-57 .ssh]# cp /home/ec2-user/.ssh/authorized_keys .
cp: overwrite `./authorized_keys'? y
[root@ip-10-0-0-57 .ssh]# chmod 700 /root/.ssh
[root@ip-10-0-0-57 .ssh]# chmod 640 /root/.ssh/authorized_keys
[root@ip-10-0-0-57 .ssh]# service sshd restart
Step #4: Now you should be able to log in as the root user on that server. So, log in again as root and check the current Java version:
[root@ip-10-0-0-57 ~]# java -version
The AMI I used for this server didn't have Java pre-installed, so I'm going to install the latest version of Java 1.6 on this server. It is recommended not to use Java 1.7 for Cassandra.

Step #5: Download Java rpm (jre-6u43-linux-x64-rpm.bin) from the following location:

Step #6: Copy the downloaded rpm to that server using WinSCP (you can use any other tool if you want).

Step #7: Make the downloaded file executable and run it to install Java:
[root@ip-10-0-0-57 ~]# chmod a+x jre-6u43-linux-x64-rpm.bin
[root@ip-10-0-0-57 ~]# ./jre-6u43-linux-x64-rpm.bin
Step #8: Java should now be installed, and java -version should show the expected information:
[root@ip-10-0-0-57 ~]# java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)
Step #9: Now download the Cassandra 1.2.3 rpm from the following location and copy it to your server using WinSCP:

DataStax Cassandra 1.2.3 rpm

You can also install it from the DataStax repository (check the DataStax Cassandra manual for that); I'm just doing it this way as a preference.

Step #10: After you copy the Cassandra rpm to your server, install it:
[root@ip-10-0-0-57 ~]# yum install cassandra12-1.2.3-1.noarch.rpm
Step #11: Cassandra should now be installed, and you can check its status like this (by default, the Cassandra server is stopped right after it's installed):
[root@ip-10-0-0-57 ~]# /etc/init.d/cassandra status
cassandra is stopped
Step #12: During installation, Cassandra pulls in the OpenJDK version of Java as a dependency, which overrides your Java setting. So follow these steps to set Java back to the Oracle version:

Step #13: At this point, Cassandra is ready to start. But before you start it, you need to update its configuration for your cluster. For a packaged install, all configuration files are located in "/etc/cassandra/conf" by default. For the cluster, I will change one file there, cassandra.yaml, which is the main configuration file for Cassandra.

There are many properties you can change in the main configuration file, and you can find their details here: But as I am installing only the most basic version of a Cassandra cluster, I will change only the following properties:
  • cluster_name: Name of your cluster. It must be the same on all hosts or instances.
  • initial_token: Used in versions prior to 1.2. For this setup, I will set this value manually. That is not required, though: you can set up your cluster using either the initial_token or the num_tokens property. You can read more about it on the web and try out different configs once you are familiar with Cassandra.
  • partitioner: Determines which node stores each piece of data. Remember, the partitioner cannot be changed without reloading all of your data. So configure the correct partitioner before initializing your cluster.
  • seed_provider: A comma-delimited list of IP addresses to use as contact points when a node joins a cluster. This value should be the same on all of your hosts.
  • listen_address: Local IP address of each host.
  • rpc_address: Listener address for client connections. I set it to listen on all configured interfaces.
  • endpoint_snitch: Sets which snitch Cassandra uses for locating nodes and routing requests. Since I'm installing the Cassandra cluster on AWS within a single region, I will be using Ec2Snitch. Again, this value should be the same on all of your hosts.
So, after modifying, here is my updated configuration file (showing only the changes I made):
Step #14:

Murmur3Partitioner: This is a new partitioner available from Cassandra 1.2 onward. The initial_token value depends on the partitioner you are using. To generate initial_token values for Murmur3Partitioner, you can run the following commands:
Note: Here, 3 is the number of nodes which I will be using for my cluster setup. Change that value based on your needs.
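
As a rough illustration of the arithmetic involved, here is a minimal Python sketch. It assumes the tokens should be evenly spaced across the Murmur3Partitioner range of -2**63 to 2**63 - 1; the function name murmur3_tokens is made up for this example and is not part of Cassandra.

```python
def murmur3_tokens(node_count):
    """Return evenly spaced initial_token values for Murmur3Partitioner.

    Assumes the token range is -2**63 .. 2**63 - 1 (2**64 tokens in total).
    """
    step = 2**64 // node_count  # integer spacing between neighbouring tokens
    return [i * step - 2**63 for i in range(node_count)]


# 3 is the number of nodes in the cluster; change it to match yours.
for node, token in enumerate(murmur3_tokens(3), start=1):
    print("node%d initial_token: %d" % (node, token))
```

With 3 nodes, the second and third values are the ones used for Node #2 and Node #3 in Step #16.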

RandomPartitioner: If you want to use RandomPartitioner, you can generate your initial_token values with the token-generator tool that comes with the Cassandra installation:
Step #15: Now, I am going to create a new AMI from my current instance so that any new instance I create from it will have Cassandra installed with the expected configuration.
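
For intuition about the values involved, here is a minimal Python sketch. It assumes evenly spaced tokens over the RandomPartitioner range of 0 to 2**127 - 1; the function name random_partitioner_tokens is made up for this example, and the sketch is not a replacement for the bundled tool.

```python
def random_partitioner_tokens(node_count):
    """Return evenly spaced initial_token values for RandomPartitioner.

    Assumes the token range is 0 .. 2**127 - 1.
    """
    step = 2**127 // node_count  # integer spacing between neighbouring tokens
    return [i * step for i in range(node_count)]


# 3 is the number of nodes in the cluster; change it to match yours.
for node, token in enumerate(random_partitioner_tokens(3), start=1):
    print("node%d initial_token: %d" % (node, token))
```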

Since I'm installing a 3-node cluster, I will create two new instances from the AMI I just created. So, I got IP addresses like this:

- cassandra node1 ->
- cassandra node2 ->
- cassandra node3 ->

If you are using Amazon Virtual Private Cloud (VPC), then you have the option to choose a specific IP address based on your needs.

Step #16: Remember, even though you created the instances from the AMI, you still need to change some values in the cassandra.yaml file that vary by host. Those are:

Node #2:
initial_token: -3074457345618258603
Node #3:
initial_token: 3074457345618258602
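
To keep track of which values differ per host, a small Python sketch like the one below can help. It assumes Murmur3Partitioner with evenly spaced tokens, and the IP addresses are placeholders rather than the addresses used in this post; per_node_settings is a made-up helper name.

```python
def per_node_settings(listen_addresses):
    """Pair each host's listen_address with an evenly spaced Murmur3 token."""
    step = 2**64 // len(listen_addresses)
    return [
        {"listen_address": address, "initial_token": i * step - 2**63}
        for i, address in enumerate(listen_addresses)
    ]


# Placeholder private IPs; replace them with the addresses of your own nodes.
nodes = per_node_settings(["10.0.0.11", "10.0.0.12", "10.0.0.13"])
for number, settings in enumerate(nodes, start=1):
    print("Node #%d:" % number)
    for key in ("listen_address", "initial_token"):
        print("%s: %s" % (key, settings[key]))
```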
Step #17: You may need to update your "/etc/hosts" file in case your hostname is not configured. I have updated that file on each server like this:
Step #18: That's it! Now you are ready to start Cassandra. Execute this on each node to start your cluster:
[root@ip-10-0-0-57 ~]# /etc/init.d/cassandra start
Starting cassandra: OK
Step #19: You can check the status of your Cassandra by executing this:

The two main things to look at in the status output are the "Owns" and "Status" columns. Here you can see that all nodes are up and each owns the same percentage of the ring.

As I said at the beginning, this is the most basic (or minimum) setup of a Cassandra cluster. There are a lot of things you can change, tune and modify based on your needs. Once you are familiar with basic Cassandra, I highly recommend experimenting with different configurations. The more you try, the more you learn!
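
To see why evenly spaced tokens lead to equal ownership, here is a minimal Python sketch. It assumes Murmur3Partitioner (a ring of 2**64 tokens) and treats each node's ownership as the share of the ring between the previous node's token and its own; the function name ownership is made up for this example.

```python
RING_SIZE = 2**64  # number of tokens in the Murmur3Partitioner ring


def ownership(tokens):
    """Map each token to the fraction of the ring its node owns."""
    ordered = sorted(tokens)
    shares = {}
    for i, token in enumerate(ordered):
        previous = ordered[i - 1]  # index -1 wraps around to the last token
        # A single node owns the whole ring, hence the "or RING_SIZE".
        span = (token - previous) % RING_SIZE or RING_SIZE
        shares[token] = span / RING_SIZE
    return shares


tokens = [-9223372036854775808, -3074457345618258603, 3074457345618258602]
for token, share in sorted(ownership(tokens).items()):
    print("%d owns %.2f%%" % (token, share * 100))
```

Each of the three tokens from this setup comes out at roughly one third of the ring.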

In another post, I will write about how to install DataStax OpsCenter to monitor this Cassandra Cluster.

Note: For privacy reasons, I had to modify several lines in this post from my original post. So if you find that something is not working, or you run into any issues, please do not hesitate to contact me.
