Monday, February 25, 2013

Setup a Storm cluster on Amazon EC2

Storm - Real-time Hadoop! Yes, you can call it that. Just as Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of primitives for doing real-time computation. It's a very powerful tool, and setting up a Storm cluster is pretty straightforward. If you want to set up a Storm cluster on Amazon EC2, you should first try Nathan's storm-deploy project, which automatically deploys a Storm cluster on EC2 based on your configuration. But if you want to deploy a Storm cluster manually, you can follow the steps below (for more detail, you can also follow the original Storm documentation):

Let me show you my machine's current configuration first:
  • Machine type: m1.large (for supervisor) and m1.small (for nimbus)
  • OS: 64-bit CentOS 6.3
  • JRE Version: 1.6.0_43
  • JDK Version: 1.6.0_43
  • Python Version: 2.6.6


For this tutorial, I am going to set up a Storm cluster with three supervisor nodes and one Nimbus node. The IP addresses of the hosts and my target configuration are:

10.0.0.194 - StormSupervisor1
10.0.0.195 - StormSupervisor2
10.0.0.196 - StormSupervisor3
10.0.0.182 - StormNimbus

Storm depends on Zookeeper for coordinating the cluster. I have already installed Zookeeper on each of the above hosts. Now apply each of the following steps on all Supervisor and Nimbus nodes:


A. Install ZeroMQ 2.1.7

Step #A-1: Download zeromq-2.1.7.tar.gz from http://download.zeromq.org/.

Step #A-2: Extract the gzip file:
[root@ip-10-0-0-194 tool]# tar -zxvf zeromq-2.1.7.tar.gz
Step #A-3: Build ZeroMQ and update the library:
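The exact build commands were stripped from this post; for ZeroMQ 2.1.7 the standard autotools sequence would be something like the following (run inside the extracted source directory, as root for the install step):

```shell
cd zeromq-2.1.7
./configure
make
make install
# Refresh the shared library cache so JZMQ/Storm can find libzmq
ldconfig
```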
Note: If you hit the "cannot link with -luuid, install uuid-dev." error when executing "./configure", you need to install that library first. On CentOS you can do so with "yum install libuuid-devel".


B. Install JZMQ

Step #B-1: Get the project from the Git by executing:
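The clone command itself was lost from this post; assuming the jzmq repository under Nathan's GitHub account, it would be something like:

```shell
git clone https://github.com/nathanmarz/jzmq.git
```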
Step #B-2: Install it:
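Again, the exact commands aren't preserved here; JZMQ uses autotools, so the build would likely be:

```shell
cd jzmq
./autogen.sh
./configure
make
make install
```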

C. Setup Storm

Step #C-1: Download the latest version (for this tutorial, I'm using 0.8.1 version) from https://github.com/nathanmarz/storm/downloads.

Step #C-2: Unzip the downloaded zip file:
[root@ip-10-0-0-194 tool]# unzip storm-0.8.1.zip
Step #C-3: Now change the configuration to match your environment. The main Storm configuration file is "conf/storm.yaml" inside the extracted Storm directory (here, "storm-0.8.1/conf/storm.yaml"). Any setting you put in this file overrides the corresponding default. Here is what I changed in the storm.yaml file:
Note: I created the "storm" folder manually inside the "/var" directory.
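The actual settings I changed were removed from this post, but a minimal storm.yaml for this layout might look something like the following. The ZooKeeper server list (the post says ZooKeeper runs on each host) and the local dir (matching the "/var/storm" note above) are assumptions; the slot ports are Storm's conventional defaults:

```yaml
storm.zookeeper.servers:
  - "10.0.0.194"
  - "10.0.0.195"
  - "10.0.0.196"
  - "10.0.0.182"
nimbus.host: "10.0.0.182"
storm.local.dir: "/var/storm"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```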

At this point, you are ready to start your Storm cluster. In my case, I installed and set everything up on a single instance first (supervisor1 - 10.0.0.194), created an AMI from that instance, and then launched the other two supervisor nodes and the Nimbus node from that AMI.

Launch the daemons using the storm script (bin/storm) on each node. I started the nimbus and ui daemons on the Nimbus host and the supervisor daemon on each of the supervisor nodes:

  • bin/storm nimbus on 10.0.0.182
  • bin/storm ui on 10.0.0.182
  • bin/storm supervisor on 10.0.0.194,10.0.0.195,10.0.0.196



You can view the Storm UI by navigating to your Nimbus host: http://{nimbus host}:8080. In my case, it was http://54.208.24.209:8080 (here, 54.208.24.209 is the public IP address of my Nimbus host).



Note: For privacy reasons, I had to modify several lines of this post from my original. So if you find that something is not working or you run into any issues, please do not hesitate to contact me.


5 comments:

  1. " I have already installed Zookeeper in each of those above hosts."

    Why do you need Zookeeper in each machine? Only one Zookeeper is needed.

  2. Thanks Rami for stopping by. It's true that a single-node Zookeeper is sufficient in most cases, but when I said "installed Zookeeper in each of those above hosts" I meant a replicated Zookeeper (for failover). A Zookeeper cluster is also suggested when deploying a large Storm cluster.

  3. Thanks man, useful article

  4. I am also following the same trail of setting up a clustered setup on EC2, with 2 supervisors, one nimbus, and one zookeeper. But for me only one supervisor instance shows up in the Storm UI at a time. Both supervisors are able to connect and communicate with zookeeper, but at any given time only one is shown in the UI. There is continuous switching between the supervisors at some random time interval. Need help.
    Thanks

    Replies
    1. Change the data dir in the supervisors. The path must be different for each supervisor.
