Cloud for Beginners: I generally write tutorials related to cloud and Hadoop in our private wiki. I realized that some of those tutorials might also be helpful to others, especially to someone who wants to get their hands dirty in this domain. So my primary intention here is to write about the things that will help someone at the very beginning and show a path to move forward. If you are already an expert in this field, this may not be the right place for you. This blog is only for beginners. Happy Clouding! (Tanzir Musabbir)

<b>Elastic Load Balancing (ELB) with a Java Web Application + Tomcat + Session Stickiness</b> (2013-06-15)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Suppose you have a web application and you want to deploy it in the Amazon cloud with load balancing support. The whole process is pretty straightforward and generally doesn't take much time.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
For this post, I'm using the Apache Tomcat web server, and I already have a WAR file from my HelloWorld application. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here is the Tomcat version I'm using:</div>
<div style="text-align: justify;">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqGHGk-X23qxt35RF1V3KF5Yd70EIcbeOrsZTZsx84AIS3mXjzkMgDVuqG1ogRn82g_S9OmjlOcOPVlRqczzbKguAc2UNrP0heMwBgJI3DmInjw62jTuLRZQXTWUthUY6Mrn64k3-HpgN4/s1600/1.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhqGHGk-X23qxt35RF1V3KF5Yd70EIcbeOrsZTZsx84AIS3mXjzkMgDVuqG1ogRn82g_S9OmjlOcOPVlRqczzbKguAc2UNrP0heMwBgJI3DmInjw62jTuLRZQXTWUthUY6Mrn64k3-HpgN4/s1600/1.JPG" height="177" width="640" /></a></div>
<br /></div>
<div style="text-align: justify;">
I'm using two instances, and I have extracted my Tomcat zip file into the <b><i>/opt/</i></b> folder on each of them. I have also placed the HelloWorld.war file into the <b><i>/opt/apache-tomcat-7.0.39/webapps</i></b> folder.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAWxsXnNHr4x_E90deM3seDEibetPuuTzo0C5AI_d7WcRj60xyo5A3hHVvga3hYWGHpLJta8YuGDxS4-w41s3jA3ZDXXDzz405HK4XzrvJy7I9o365DglrsIY8RNQi3-DjLfqE9446NuZ7/s1600/2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjAWxsXnNHr4x_E90deM3seDEibetPuuTzo0C5AI_d7WcRj60xyo5A3hHVvga3hYWGHpLJta8YuGDxS4-w41s3jA3ZDXXDzz405HK4XzrvJy7I9o365DglrsIY8RNQi3-DjLfqE9446NuZ7/s1600/2.JPG" /></a></div>
<br />
Now, I will go to each of those two instances and start the Tomcat server. After a few minutes (or seconds) I should see my deployed web application up and running. That means I can navigate to these URLs and see the log-in screen (the initial page of my web app).<br />
<br />
<ul>
<li>http://ip.address.instance-1:8080/HelloWorld/login.jsp</li>
<li>http://ip.address.instance-2:8080/HelloWorld/login.jsp</li>
</ul>
<br />
All of the steps described so far have nothing to do with Elastic Load Balancing (ELB); like anyone else, I simply deployed a web app on a Tomcat server. Before I start the ELB steps, I'm assuming your web application is also up and running and that you can reach each instance's URL separately.<br />
<br />
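Before moving on to ELB, it's worth scripting that per-instance check. Here is a minimal sketch; the addresses are the placeholders from above, and the helper name is mine:

```shell
# Placeholder addresses from above -- replace with your real instance IPs.
instances="ip.address.instance-1 ip.address.instance-2"

# Build the login URL for one instance (port and path from this post).
login_url() {
  echo "http://$1:8080/HelloWorld/login.jsp"
}

for ip in $instances; do
  # -sf fails on HTTP errors; --max-time keeps a dead host from hanging us.
  if curl -sf --max-time 5 -o /dev/null "$(login_url "$ip")"; then
    echo "$ip UP"
  else
    echo "$ip DOWN"
  fi
done
```

Only when every instance reports UP does it make sense to put a load balancer in front of them.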
<br />
<b><u><span style="font-size: large;">Create Load Balancer</span></u></b><br />
<br />
<b><u>Step#1:</u></b> On the AWS EC2 console, click the Load Balancers option under the "<b>Network & Security</b>" section. If you do not have any ELBs yet, you will see an empty list. Click the "Create Load Balancer" button.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikbkZUMwuDj01mtInwSapr-clNvBF-nqDc52UpNw2AS20VyVIOe1l7U18B9072aq4t0EX_RgS3DXRzS_HidkmyndtVgb9kyQ1QmprQ0pRUjRoNiTx9UdWNayu7g8PQ5Cd876YvmiDy-nta/s1600/3.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEikbkZUMwuDj01mtInwSapr-clNvBF-nqDc52UpNw2AS20VyVIOe1l7U18B9072aq4t0EX_RgS3DXRzS_HidkmyndtVgb9kyQ1QmprQ0pRUjRoNiTx9UdWNayu7g8PQ5Cd876YvmiDy-nta/s1600/3.JPG" height="216" width="400" /></a></div>
<br />
<b><u>Step#2:</u> </b>Enter a name for your Load Balancer; this name will be used when ELB creates the default link. I'm also creating this Load Balancer inside my Virtual Private Cloud (VPC), which is why I'm selecting a specific VPC Id. By default, you might see only port 80 in the listener configuration list; I have added port 8080 because my web app runs on port 8080. Add the appropriate port for your web application and click "<b>Continue</b>".<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLWQxJQA9P2O-nun50rd0WzW3g8JBcKlvHpfJpOqafJ1SQQKlKSYr45y-clrqV7rualeZJPHdBrAUBn3KwYJ5GcI4j3lqm-oWtRrmvy3wEqnoc-0wVgEkW2zWiUG9ImoMPoaLFe6ah4MSf/s1600/4.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjLWQxJQA9P2O-nun50rd0WzW3g8JBcKlvHpfJpOqafJ1SQQKlKSYr45y-clrqV7rualeZJPHdBrAUBn3KwYJ5GcI4j3lqm-oWtRrmvy3wEqnoc-0wVgEkW2zWiUG9ImoMPoaLFe6ah4MSf/s1600/4.JPG" height="293" width="400" /></a></div>
<br />
<b><u>Step#3:</u> </b>This screen is dedicated to the health check configuration. Based on this configuration, ELB will ping the given path on the given port to check each instance's health, and if the check fails it will automatically remove that instance from the load balancer.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUh4EOLUVhS5DVbGxrqfW2aD6fAaj7dPi4ijQeLgdWRzThiVa0M_JDlf9gKINpj5W9yrWguDY_0AasZ-0B7wDSnMobXm7q-d5XoCekS-44LLZtEfs9CTk41aY-lrp5HOj3EhLg7e2Rk0Ox/s1600/5.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgUh4EOLUVhS5DVbGxrqfW2aD6fAaj7dPi4ijQeLgdWRzThiVa0M_JDlf9gKINpj5W9yrWguDY_0AasZ-0B7wDSnMobXm7q-d5XoCekS-44LLZtEfs9CTk41aY-lrp5HOj3EhLg7e2Rk0Ox/s1600/5.JPG" height="291" width="400" /></a></div>
<br />
Since log-in is the default screen of my application (the welcome page), I'm using the path of the log-in screen as my ping path.<br />
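How quickly ELB reacts follows directly from the health-check numbers: an instance is taken out of service after roughly interval × unhealthy-threshold seconds of failures, and brought back after interval × healthy-threshold seconds of successes. A quick sketch with assumed values (substitute whatever you entered on this screen):

```shell
# Assumed health-check settings -- substitute the values from your console.
interval=30     # seconds between pings
unhealthy=2     # consecutive failures before removal
healthy=10      # consecutive successes before re-adding

# Seconds until ELB changes an instance's state: interval * threshold.
state_change_secs() { echo $(( $1 * $2 )); }

echo "out of service after ~$(state_change_secs $interval $unhealthy)s of failures"
echo "back in service after ~$(state_change_secs $interval $healthy)s of successes"
```

This is why a freshly registered instance takes a while to show "In Service": it must pass several consecutive checks first.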
<br />
<b><u>Step#4:</u> </b>Choose your subnet id based on where you want to run your Load Balancer. In my case, <i><b>subnet-2e961843</b></i> is the subnet I want.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7i8d1Kg7rCFJ8sjLvVum0eVVh9N3Xh773U0e8We7ZI6vPw_lck-nkbAA9APwIT6UtexYCGC2JEDaDh_phwD03u-GugqZdVwbJrzIl9zvmoW9-CgNPOZQm8-YMAbN4ACDYCxel0e_QwxI_/s1600/6.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh7i8d1Kg7rCFJ8sjLvVum0eVVh9N3Xh773U0e8We7ZI6vPw_lck-nkbAA9APwIT6UtexYCGC2JEDaDh_phwD03u-GugqZdVwbJrzIl9zvmoW9-CgNPOZQm8-YMAbN4ACDYCxel0e_QwxI_/s1600/6.JPG" height="290" width="400" /></a></div>
<br />
<b><u>Step#5:</u> </b>The next screen will ask you to select your security groups. I already have a security group for my VPC, and I'm using it here too.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9WWotCDZna76ySfJ0zzm0gGsaEAu5V1JMFEGgmUOI9TzTosktSNOz5W_iZ1pJCK1MZBrCMLPYWpkYtxAk8EjCxad3xplDGbWyeC5vxTcZHpdKIEAmlDghSCUl6oAaNYvUJgumQKmVX3n9/s1600/7.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEh9WWotCDZna76ySfJ0zzm0gGsaEAu5V1JMFEGgmUOI9TzTosktSNOz5W_iZ1pJCK1MZBrCMLPYWpkYtxAk8EjCxad3xplDGbWyeC5vxTcZHpdKIEAmlDghSCUl6oAaNYvUJgumQKmVX3n9/s1600/7.JPG" height="291" width="400" /></a></div>
<br />
<b><u>Step#6:</u> </b>In the "Add EC2 Instances" section, add the instances on which you have already deployed Tomcat and your web application.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFq2J7pimjx5EVV_at7PHAeR3vj7r8KdjGF_9J94L2AlzVlRRZZRwXhG7Qx7RgiSqAkKSozUvuNlLLi4egw87LE8kkfD7nizTGqHDru2W9fO2uitkjT9n7TRdxel1LyGLtEQfHVqi6dSYQ/s1600/8.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFq2J7pimjx5EVV_at7PHAeR3vj7r8KdjGF_9J94L2AlzVlRRZZRwXhG7Qx7RgiSqAkKSozUvuNlLLi4egw87LE8kkfD7nizTGqHDru2W9fO2uitkjT9n7TRdxel1LyGLtEQfHVqi6dSYQ/s1600/8.JPG" height="287" width="400" /></a></div>
<br />
<b><u>Step#7:</u> </b>This screen is for review purposes. Once you have reviewed everything, you can finally create your load balancer by clicking the "Create" button.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKAMt58wLKlWE8CetqHgCUGCyLZ8H5kOnPDOX8NTyyQvrGJ_mQDUzKNDYEaKDOIJFQpvWoFyoakrFQa8yUW-xcmJLMJRpk3Jv9a1mVf7mw_51tIrwtwjim7ZH-QDG4QwKzCrhSRNGFOOzG/s1600/9.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjKAMt58wLKlWE8CetqHgCUGCyLZ8H5kOnPDOX8NTyyQvrGJ_mQDUzKNDYEaKDOIJFQpvWoFyoakrFQa8yUW-xcmJLMJRpk3Jv9a1mVf7mw_51tIrwtwjim7ZH-QDG4QwKzCrhSRNGFOOzG/s1600/9.JPG" height="290" width="400" /></a></div>
<br />
<b><u>Step#8:</u> </b>Once you create your load balancer, you will be redirected to the Load Balancer list, where you will now see the newly created entry. The DNS Name column shows the DNS name generated for your load balancer, and you should be able to reach it on the appropriate port.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLx0KG-fQjkaW_Wx9v0F200YLqPjs4gIIFhusJSoQt4NVlI3D1Jq9Gdk9lWzX8UvuhSRJzzMGSBseYuBVIOPyu7a30rJKxotXcjFDx2zRG3s1bPDa6Q5kEmJvgkMTPp2aHNAAvPz3vEZuB/s1600/10.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiLx0KG-fQjkaW_Wx9v0F200YLqPjs4gIIFhusJSoQt4NVlI3D1Jq9Gdk9lWzX8UvuhSRJzzMGSBseYuBVIOPyu7a30rJKxotXcjFDx2zRG3s1bPDa6Q5kEmJvgkMTPp2aHNAAvPz3vEZuB/s1600/10.JPG" height="175" width="400" /></a></div>
<br />
So in my case, I can reach my load balancer at:<br />
<br />
http://helloworld-353060791.us-east-1.elb.amazonaws.com:8080/HelloWorld/login.jsp<br />
<br />
<br />
<b><u><span style="font-size: large;">Sticky Session:</span></u></b><br />
Since you are using Tomcat behind a load balancer, you will most likely want to enable sticky sessions (ideally with session replication in Tomcat). My web application is a Spring MVC application, and it uses Spring Security for all types of authorization and authentication. If I go straight to the log-in screen of my load balancer and try to authenticate, it might not work. That's expected: without stickiness, consecutive requests can be routed to different instances, and the Tomcat session created on one instance does not exist on the other. Enabling sticky sessions avoids this issue.<br />
<br />
You can do this from the AWS EC2 console. Open the Load Balancers screen and select your newly created load balancer.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhquCOOJEFFbFhCjrVxxtB6A5ILW1dlLHaHrRgNLDwoy4gCUfKjHlO7H5MZKolNS4mcoMHd-GE82D5Yy53sO5uTWF9WvbY9nRrNrTJUvm2XUpJNo4bZd0iB1PwnyzFB4FYaizdMmONAW7iN/s1600/11.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhquCOOJEFFbFhCjrVxxtB6A5ILW1dlLHaHrRgNLDwoy4gCUfKjHlO7H5MZKolNS4mcoMHd-GE82D5Yy53sO5uTWF9WvbY9nRrNrTJUvm2XUpJNo4bZd0iB1PwnyzFB4FYaizdMmONAW7iN/s1600/11.JPG" height="201" width="400" /></a></div>
<br />
If you look carefully at the port configuration part, you will see "<b>Stickiness: Disabled</b>" for all of your ports; by default, stickiness is disabled for every port you select for the load balancer. Now click the "<b><i>edit</i></b>" button for the port on which you want to enable stickiness. In my case, that's port 8080. Once you click "edit", it will ask how you want to enable session stickiness: you can choose either Load Balancer Generated Cookie Stickiness or Application Generated Cookie Stickiness. For my simple application, I selected "Load Balancer Generated Cookie Stickiness" and entered 86400 (one day in seconds) as my cookie expiration period.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilqjQMfX88gEbRQYTN_uSBiXfApaw6WsEKADsM-pMogGPM3mTmY8Wz0HbM7LhwWyV0xUVWCdrM_RExxTbOfp6PnAs1TaqW0GtnlHtOPgNbb6OQbWnDk77XyP0GK1xJyWuHrCJ4vL3w8FZf/s1600/12.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEilqjQMfX88gEbRQYTN_uSBiXfApaw6WsEKADsM-pMogGPM3mTmY8Wz0HbM7LhwWyV0xUVWCdrM_RExxTbOfp6PnAs1TaqW0GtnlHtOPgNbb6OQbWnDk77XyP0GK1xJyWuHrCJ4vL3w8FZf/s1600/12.JPG" height="197" width="400" /></a></div>
After you enable it, you should be able to test your session stickiness. In my case, I'm now able to authenticate to my application successfully.<br />
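One way to check stickiness from the command line: with Load Balancer Generated Cookie Stickiness, ELB sets a cookie named AWSELB, and repeated requests that share a cookie jar should keep the same value. A sketch follows; the curl calls are left commented because the URL is my (now defunct) balancer, so point them at yours:

```shell
# Cookie jar for curl; the URL is this post's balancer -- use your own.
jar=$(mktemp)
url="http://helloworld-353060791.us-east-1.elb.amazonaws.com:8080/HelloWorld/login.jsp"

# Pull the AWSELB value out of a Netscape-format cookie jar
# (field 6 is the cookie name, field 7 its value).
elb_cookie() {
  awk '$6 == "AWSELB" { print $7 }' "$1"
}

# Two requests sharing one jar should reuse the same AWSELB value:
# curl -s -c "$jar" -b "$jar" -o /dev/null "$url"
# first=$(elb_cookie "$jar")
# curl -s -c "$jar" -b "$jar" -o /dev/null "$url"
# [ "$first" = "$(elb_cookie "$jar")" ] && echo sticky
```

If the AWSELB value changes between requests, stickiness is not in effect for that port.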
<br />
<b style="text-decoration: underline;">Some considerations:</b> Sometimes you might see that your load balancer is down, the link is not working, or no page is shown. In that case, the quickest test is to check each instance where Tomcat is running and see whether you can access them individually (e.g. http://ip.address.instance-1:8080/HelloWorld/login.jsp). If each instance is up and running, try removing them from your load balancer and adding them again. Remember, the "Status" section under the "Description" tab of your load balancer is not updated instantly; it waits for the result of the next health check. So wait a few minutes until you see "Status: N of N instances in service".<br />
<br />
That's pretty much it! This is a very basic AWS load balancer example with a minimal configuration of Tomcat + session stickiness. Once it's working for you, you can try the other options (highly encouraged) and see how they work for you.<br />
<br />
<br />
<b>Note:</b> For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working, or you face any issues, please do not hesitate to contact me :)</div>
</div>
<b>Cassandra Performance Tuning</b> (2013-06-01)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In my previous post, I discussed how to stress test Cassandra. In this post, I will cover some easy steps to tune its performance. I'm a big fan of Cassandra; it is optimized for very fast and highly available writes. There are many things you can do to further optimize its write and read performance, but today I will only cover some major, easy tune-up steps that you can apply right away.</div>
<div style="text-align: justify;">
<br />
<br />
<b><u>Dedicated Commit Log Disk:</u></b> I think this is the first tune-up you may want to try, as it gives a significant performance improvement. Before changing the commit log destination, it helps to understand why it gives a performance boost. A Cassandra write goes first to a commit log on disk and then to an in-memory table structure called a memtable. When thresholds are reached, that memtable is flushed to disk in a format called an SSTable. So if you move the commit log to its own disk, you isolate commit log I/O from the other Cassandra traffic: reads, memtable flushes and SSTable I/O. Remember, after the flush the corresponding commit log data is no longer needed and is deleted, so the commit log disk doesn't need to be large; it just needs to be big enough to hold the memtable data before it is flushed. You can follow these steps to change the commit log location:<br />
<br />
<u>Step#1:</u> Mount a separate partition for commit log<br />
<script src="http://gist.github.com/8381783.js?file=change-commit-log-location-1"></script>
<u>Step#2:</u> Make sure you give expected ownership and access on that drive<br />
<script src="http://gist.github.com/8381783.js?file=change-commit-log-location-2"></script>
<u>Step#3:</u> Edit the Cassandra configuration file, which can be found at <b><i>conf/cassandra.yaml</i></b>. You will find a property <b>commitlog_directory</b>; update it based on your mount location. In my case, it will be:<br />
commitlog_directory: /mnt/commitlog<br />
<u>Step#4:</u> Restart your Cassandra cluster.<br />
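If the gists above don't load, the configuration part of these steps boils down to rewriting one line of cassandra.yaml. Here is a sketch, demonstrated against a throwaway copy of the file; the sed approach and the /tmp path are mine, so adapt them to your install:

```shell
# Rewrite commitlog_directory in a cassandra.yaml-style file.
set_commitlog_dir() {  # usage: set_commitlog_dir <yaml-file> <new-dir>
  sed -i "s|^commitlog_directory:.*|commitlog_directory: $2|" "$1"
}

# Demo against a throwaway copy; on a real node edit conf/cassandra.yaml
# and restart Cassandra afterwards.
printf 'commitlog_directory: /var/lib/cassandra/commitlog\n' > /tmp/cassandra.yaml
set_commitlog_dir /tmp/cassandra.yaml /mnt/commitlog
grep '^commitlog_directory' /tmp/cassandra.yaml
```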
<br />
<br />
<b><u>Increasing Java Heap Size:</u></b> Cassandra runs on the JVM, so you might face out-of-memory issues when you run a heavy load on Cassandra. There is a rule of thumb for choosing the heap size:<br />
<ul>
<li>Heap Size = 1/2 of system memory when system memory &lt; 2GB</li>
<li>Heap Size = 1GB when system memory &gt;= 2GB and &lt;= 4GB</li>
<li>Heap Size = 1/4 of system memory (but not more than 8GB) when system memory &gt; 4GB</li>
</ul>
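The rule of thumb above is easy to encode; a sketch follows (sizes in MB, function name is mine):

```shell
# Recommended Cassandra heap (MB) for a machine with $1 MB of RAM,
# following the three rules above.
heap_mb() {
  mem=$1
  if [ "$mem" -lt 2048 ]; then
    echo $(( mem / 2 ))                 # under 2GB: half of RAM
  elif [ "$mem" -le 4096 ]; then
    echo 1024                           # 2-4GB: fixed 1GB
  else
    h=$(( mem / 4 ))                    # over 4GB: a quarter of RAM...
    [ "$h" -gt 8192 ] && h=8192         # ...capped at 8GB
    echo "$h"
  fi
}

heap_mb 16384   # a 16GB box -> prints 4096
```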
<div>
Remember, a larger heap size alone might not give you a performance boost, so a well-tuned Java heap size is very important. To change the Java heap size, update the <b><i>cassandra-env.sh</i></b> file and then restart the Cassandra cluster. If you are using OpsCenter, you should see the updated heap size in one of OpsCenter's metrics.</div>
<br />
<br />
<b><u>Tune Concurrent Reads and Writes:</u></b> Cassandra is implemented using the Staged Event-Driven Architecture (<a href="http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf">SEDA</a>), which breaks the application into stages. The concurrent reader and writer settings control the maximum number of threads allocated to a particular stage, so optimal values for concurrent reads and writes will improve Cassandra's performance, while raising them beyond the limit will decrease it. These values are closely tied to the number of CPU cores in the system. As with the Java heap size, there is a rule of thumb for selecting them:<br />
<ul>
<li>Concurrent Reads: 4 concurrent reads per processor core</li>
<li>Concurrent Writes: Most of the time you do not need it as write is usually fast. If needed, you can set the value to equal or higher than the concurrent reads.</li>
</ul>
To change the values, update the <b><i>conf/cassandra.yaml</i></b> configuration file. There are two parameters for these: <b>concurrent_reads</b> and <b>concurrent_writes</b>. Update them based on your system and restart Cassandra for the change to take effect.<br />
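For example, the reads value can be derived straight from the core count; this sketch uses the Linux `nproc` command, and the helper name is mine:

```shell
# 4 concurrent reads per core, per the rule of thumb above.
reads_for_cores() { echo $(( 4 * $1 )); }

cores=$(nproc)
echo "concurrent_reads: $(reads_for_cores "$cores")"
echo "concurrent_writes: $(reads_for_cores "$cores")"   # equal or higher, if you set it at all
```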
<br />
<br />
<u style="font-weight: bold;">Tune-Up Key Cache:</u> For each column family, the key cache holds the locations of row keys in memory. Since keys are usually small, it can store a large cache without using much memory, and each cache hit means less disk activity. The key cache is enabled by default with a default size of 200000 keys. You can alter the default value as follows: <br />
<br />
<script src="http://gist.github.com/8381783.js?file=update_keys_cached"></script>
<br />
<div>
You can monitor key cache performance by using the nodetool cfstats command.</div>
<b><u><br /></u></b>
<script src="http://gist.github.com/8381783.js?file=nodetool-cfstats"></script>
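cfstats prints one block of statistics per column family, so a small filter helps when you only want the hit rate. A sketch (the field label is as printed by Cassandra 1.x-era cfstats; adjust if your version prints it differently):

```shell
# Extract the key cache hit rate from `nodetool cfstats` output on stdin.
key_cache_hit_rate() {
  grep 'Key cache hit rate' | awk -F': ' '{ print $2 }'
}

# Real use:  nodetool cfstats | key_cache_hit_rate
# Demo with a canned line:
printf 'Key cache hit rate: 0.95\n' | key_cache_hit_rate
```

A hit rate close to 1.0 means most key lookups are served from memory; a low rate suggests the cache is too small for your working set.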
<b><u><br /></u></b><br />
<b><u>Tune-Up Row Cache:</u></b> In Cassandra, the row cache is disabled by default. The row cache holds the entire content of a row in memory, so a column family with large rows could easily consume system memory and hurt Cassandra's performance; that's why it is disabled by default and should remain disabled in most cases. But if your rows are small, the row cache can significantly improve read performance, as it keeps the most-accessed rows hot in memory. To enable the row cache, alter your column family and pass the number of rows to cache.<br />
<br />
<script src="http://gist.github.com/8381783.js?file=update_rows_cached"></script>
You can also monitor it by using the nodetool cfstats command as above (watch for "Row cache hit rate").<br />
<br />
<b><u><br /></u></b>
<b><u>Conclusion:</u></b> As I said earlier, these are only some of the tune-up steps; there are more (a high-performing RAID level, file system optimization, disabling swap, memory-mapped disk modes and so on). But this gives you something to start with; once you see improved Cassandra performance you can try the rest of the tuning. Cassandra is highly scalable, and scaling up is done by enhancing each node (more RAM, higher network throughput, SSDs, more disk, etc.). Remember, if you are using AWS EC2 instances, do not expect much performance improvement from small or medium instance types, as they are not optimized for I/O or network throughput; use xlarge or larger instances instead.<br />
<br />
And finally, DO NOT forget to check the <a href="http://www.slideshare.net/adrianco/cassandra-performance-on-aws">Cassandra Performance and Scalability</a> slides by Adrian Cockcroft. <br />
<br />
<b><br /></b>
<b>Note:</b> For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working, or you face any issues, please do not hesitate to contact me :)</div>
</div>
<b>Cassandra Stress Test</b> (2013-05-25)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In this post, I will go through how you can quickly stress test your Cassandra cluster. Before you start tuning Cassandra, you might want to see how well it's performing so far, or where it's slowing down. You could certainly write your own benchmark tool that inserts some random data, reads it back, and measures performance based on elapsed time; when I was first asked to stress test Cassandra, I started writing pretty much that kind of tool. But along the way I found an existing tool that stress tests Cassandra and is good enough to start with. It's basically a POM-based Java project that uses Hector (my project also uses Hector, a Java client for Cassandra).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You can go directly here to get more information about how it's written and how to run it:</div>
<div style="text-align: justify;">
<a href="https://github.com/zznate/cassandra-stress" style="text-align: left;">https://github.com/zznate/cassandra-stress</a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
But if you just want a quick way to run it, follow these steps:<br />
<br />
<b><u>Step#1: Install It</u></b></div>
<script src="http://gist.github.com/8379250.js?file=stress-test-install"></script>
<br />
<div style="text-align: justify;">
<u><b>Step#2: Run It:</b></u><br />
<script src="http://gist.github.com/8379250.js?file=run-stress-test"></script>
</div>
What the above command does:<br />
<ul>
<li>Inserts (-o insert) 1000000 records (-n) into the column family StressStandard, which has 10 columns (-c)</li>
<li>Uses 5 threads (-t) with a batch size of 1000 (-b)</li>
<li>So each thread handles 1000000 / 5 = 200000 inserts; with a batch size of 1000, each thread actually issues 200000 / 1000 = 200 batch inserts.</li>
</ul>
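The arithmetic behind those bullets, spelled out:

```shell
# Work distribution for the stress run above.
records=1000000   # -n
threads=5         # -t
batch=1000        # -b

per_thread=$(( records / threads ))   # inserts handled by each thread
batches=$(( per_thread / batch ))     # batch inserts each thread issues

echo "$per_thread inserts per thread, in $batches batches of $batch"
```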
After it inserts all 1000000 records, it shows a brief summary of insert performance. For the test above, it took around 3 minutes to insert all records (no optimization), which came out to 140.87 write requests per second with a bandwidth of 15730.39 kb/sec. You can also test read performance, as well as the performance of some other Hector APIs (rangeslice, multiget, etc).<br />
<br />
I played with this stress tool a lot and later adapted it to my needs (to work with my own Cassandra keyspace and column families) and ran it for my stress tests. I highly recommend this stress tool; it will cover most of the basic cases.<br />
<br />
<div style="text-align: justify;">
<br />
<br />
<b>Note:</b> For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working, or you face any issues, please do not hesitate to contact me :)</div>
</div>
<b>Chunk data import / Incremental Import in Sqoop</b> (2013-05-17)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Recently I faced an issue while importing data from Oracle with <a href="http://sqoop.apache.org/">Sqoop</a>. It had been working fine until I got a new requirement. Before discussing the new requirement, let me quickly describe how it currently works.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Currently I am running Sqoop from <a href="http://oozie.apache.org/">Oozie</a>, but I am not using a coordinator job, so I am executing each Oozie job manually from the command prompt.</div>
<br />
You can check these links if you want to know how to run Sqoop and Oozie together.<br />
<ul style="text-align: left;">
<li><a href="http://www.tanzirmusabbir.com/2013/02/sqoop-importexport-fromto-oracle.html">Basic Sqoop Import and Export</a></li>
<li><a href="http://www.tanzirmusabbir.com/2013/03/oozie-example-sqoop-actions.html">Simple Oozie workflow for Sqoop</a></li>
</ul>
<div>
In our options parameter file, I have a clause like the one below:</div>
<pre class="prettyprint">--where
ID &lt;= 1000000
</pre>
For each run, I used to change that field manually and re-run my Oozie job.<br />
<div>
<br /></div>
<div>
<b><u>New Requirement</u></b></div>
<div>
<br /></div>
<div>
<div style="text-align: justify;">
Now, what I have been asked to do is run my Oozie job through a coordinator and import block-wise/chunked data from Oracle. Based on the current requirement, what I'm trying to achieve is to import the rows from M to N. Ideally, each run should import 15 million rows from that specific table, and Hadoop will process those records and be ready for another batch before the following run.</div>
</div>
<div>
<br />
As an example:<br />
1st run: 1 to 20<br />
2nd run: 21 to 40<br />
3rd run: 41 to 60<br />
and so on...<br />
<br />
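Each chunked run therefore needs a --where window computed from the previous one; the core of that computation is just two additions. A sketch (the function name is mine, and ID is the column used in the options file earlier):

```shell
# Compute the --where window for one chunked run, and the start of the next.
next_range() {  # usage: next_range <startIndex> <chunkSize>
  start=$1; size=$2
  end=$(( start + size - 1 ))
  echo "ID >= $start AND ID <= $end"
  echo "nextStartIndex=$(( start + size ))"
}

next_range 1 20    # first run of the example above
next_range 21 20   # second run
```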
<div style="text-align: justify;">
The first thing I started exploring was the "--<b>boundary-query</b>" parameter that comes with Sqoop. From the documentation: "<i><b>By default sqoop will use query select min(&lt;split-by&gt;), max(&lt;split-by&gt;) from &lt;table name&gt; to find out boundaries for creating splits. In some cases this query is not the most optimal so you can specify any arbitrary query returning two numeric columns using --boundary-query argument.</b></i>"</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
After spending some time on it and discussing it on the Sqoop mailing list, I came to know that incremental import does not work with chunks: it imports everything since the last import (more specifically, everything from <b>--last-value</b> to the end).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Then I decided to create a shell action in Oozie that updates the appropriate parameter after each Sqoop execution, so that subsequent Sqoop runs have new options for their import.</div>
<br />
So I made some changes in my option parameter file (<b>option.par</b>) and here is the new one:<br />
<script src="http://gist.github.com/7018842.js?file=option.par"></script>
To store current index value and chunk-size, I used another property based file <b>import.properties</b>:<br />
<script src="http://gist.github.com/7018842.js?file=import.properties"></script>
My shell script will update the value of <b><i>startIndex </i></b>by the <b><i>chunkSize</i></b>. Here is the script (<b>script.sh</b>) which I wrote for this:<br />
<script src="http://gist.github.com/7018842.js?file=script.sh"></script>
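If the gist doesn't load, the essential part of such a script can be sketched like this: read startIndex and chunkSize from import.properties, then advance startIndex. The awk/sed approach below is mine, not necessarily what my original script did:

```shell
# Advance startIndex by chunkSize in a .properties file.
bump_start_index() {  # usage: bump_start_index <properties-file>
  f=$1
  start=$(awk -F= '$1 == "startIndex" { print $2 }' "$f")
  size=$(awk -F= '$1 == "chunkSize" { print $2 }' "$f")
  sed -i "s/^startIndex=.*/startIndex=$(( start + size ))/" "$f"
}

# Demo against a throwaway copy (15 million rows per chunk, as in the post):
printf 'startIndex=1\nchunkSize=15000000\n' > /tmp/import.properties
bump_start_index /tmp/import.properties
grep '^startIndex' /tmp/import.properties
```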
<br />
<div style="text-align: justify;">
I want to add that when you modify a file with a script run through Oozie, it is actually a cached version of the file in HDFS that gets updated; that's why I had to copy those files back to their original location in HDFS. Also, behind the scenes a <b><i>mapred</i></b> user is doing the work, but I'm running the Oozie job as the <b><i>ambari_qa</i></b> user (note: I'm using Hortonworks Hadoop, HDP 1.2.0); that's why I had to give all users permissions on those files.</div>
<br />
Here is my Oozie workflow (<b>workflow.xml</b>):<br />
<script src="http://gist.github.com/7018842.js?file=workflow.xml"></script>
I put everything inside my Oozie application path in HDFS. Here is my folder structure:<br />
<script src="http://gist.github.com/7018842.js?file=folder-structure"></script>
Don't forget to give the "write" permission when you first put it inside HDFS. Now you can run the Oozie workflow by executing this:<br />
<pre class="prettyprint">[ambari_qa@ip-10-0-0-91 ~]$ oozie job -oozie http://ip-10-0-0-91:11000/oozie -config job.properties -run</pre>
Here is the <b>job.properties</b> file:
<script src="http://gist.github.com/7018842.js?file=job.properties"></script>
<br />
<div style="text-align: justify;">
This is it! Now every time you execute the Oozie job, it will import a new chunk of data from Oracle. I will cover how I run it as a coordinator job in another post. Jarcec mentioned in one of the Sqoop user mail threads that Sqoop will have this feature soon, but I'm not sure of its time frame, so I had to do this work-around. It worked for me; I hope it will work for you too!</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Note: </b>For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working, or you face any issues, please do not hesitate to contact me.</div>
</div>
</div>
<b>Configure Ganglia for multiple clusters in Unicast mode</b> (2013-04-06)<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In my previous post I talked about <a href="http://www.tanzirmusabbir.com/2013/03/setting-up-ganglia-in-centos.html">how to: Setting up Ganglia</a> in a CentOS environment. At that time, I used only a single cluster for the whole setup, but it's highly unlikely that you have only a single cluster in your development/production environment. Suppose you have two clusters, 1. Storm and 2. Kafka, and you want to monitor all of their nodes through a single Ganglia UI. You do not have to install Ganglia multiple times for that; you just need to configure it. This would have been much easier if AWS <a href="http://sourceforge.net/apps/trac/ganglia/wiki/FAQ">supported multicast, but as it doesn't</a>, you need a work-around in unicast mode to monitor multiple clusters in a single Ganglia instance.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The idea behind this workaround is pretty straightforward. Suppose I have two clusters, cluster #1 (Storm) and cluster #2 (Kafka), and their respective IP addresses are:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
10.0.0.194 - Storm Cluster (supervisor 1)</div>
<div style="text-align: justify;">
10.0.0.195 - Storm Cluster (supervisor 2)</div>
<div style="text-align: justify;">
10.0.0.196 - Storm Cluster (supervisor 3)</div>
<div style="text-align: justify;">
10.0.0.182 - Storm Cluster (nimbus)</div>
<div style="text-align: justify;">
10.0.0.249 - Kafka Cluster</div>
<div style="text-align: justify;">
10.0.0.250 - Kafka Cluster</div>
<div style="text-align: justify;">
10.0.0.251 - Kafka Cluster</div>
<div style="text-align: justify;">
10.0.0.33 - my client machine</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I will configure each cluster to send its collected data (<b><i>gmond</i></b>) to one designated node only, and configure the <b>gmetad</b> daemon so that it collects data only from that designated node (<b>gmond daemon</b>) in each cluster. Ganglia will categorize each cluster's data by the unique cluster name defined in its <b><i>gmond.conf</i></b> file.</div>
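In gmond.conf terms, "sending to one specific node" is just a udp_send_channel block pointing at that node. A minimal sketch for the Kafka hosts might look like the following; the cluster name, addresses, and the default port 8649 are assumptions for illustration, and the gists later in this post show my actual files:

```conf
# gmond.conf sketch for every Kafka host -- illustrative values only
cluster {
  name = "Kafka"            # must be identical on all hosts of this cluster
}
udp_send_channel {
  host = 10.0.0.249         # designated Kafka node that aggregates metrics
  port = 8649
}
udp_recv_channel {
  port = 8649               # the designated node receives everyone's data here
}
tcp_accept_channel {
  port = 8649               # gmetad polls this port for the cluster's XML dump
}
```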
<div style="text-align: justify;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYQ8w1cPMp_KOIl_THnNT0Ccn-iIxTyS9D-EjP3U9Nzy0aUl4JhdY3LxsphLrHDQ7xrich377JTb21t4AFlKs5ZeGr0WkKkLonoTfKShSm2NgwbuS8qIB-krzW3JrwsKc8b09PKA7-M-7W/s1600/cluster-2.jpeg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiYQ8w1cPMp_KOIl_THnNT0Ccn-iIxTyS9D-EjP3U9Nzy0aUl4JhdY3LxsphLrHDQ7xrich377JTb21t4AFlKs5ZeGr0WkKkLonoTfKShSm2NgwbuS8qIB-krzW3JrwsKc8b09PKA7-M-7W/s400/cluster-2.jpeg" width="361" /></a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As you can see in the figure above, all of the Kafka cluster's data is sent to one specific node (10.0.0.249), and all of the Storm cluster's data is sent to one of its nodes (10.0.0.182). The client machine (10.0.0.33) runs the <b><i>gmetad daemon</i></b>, and I will configure that daemon to look for two data sources, one per cluster, whose source IP addresses are 10.0.0.249 and 10.0.0.182 for Kafka and Storm respectively.</div>
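Expressed as gmetad configuration, "two data sources" means just two data_source lines, one per designated node. This sketch assumes the default gmond port 8649; my actual gmetad.conf appears further down:

```conf
# gmetad.conf sketch -- one data_source per cluster
data_source "Kafka" 60 10.0.0.249:8649
data_source "Storm" 60 10.0.0.182:8649
```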
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I'm assuming that you have already set up Ganglia and it's running as expected, so I am not going to discuss what the gmond.conf and gmetad.conf files are. In case you have not set it up yet, you might want to take a look at <a href="http://www.tanzirmusabbir.com/2013/03/setting-up-ganglia-in-centos.html">this post</a>. </div>
<div style="text-align: justify;">
This is my <b><i>gmond.conf</i></b> file (only the part which I modified) that I'm using for all Kafka hosts (the same file is used on every host in this cluster):</div>
<script src="http://gist.github.com/7019093.js?file=gmond-kafka.conf"></script>
<div style="text-align: justify;">
And here is my <b><i>gmond.conf</i></b> file for all Storm hosts (again, the same file is used on every host in this cluster):</div>
<script src="http://gist.github.com/7019093.js?file=gmond-storm.conf"></script>
<br />
<div style="text-align: justify;">
Notice that I'm using a distinct host address for <b><i>udp_send_channel</i></b> in each cluster. Now I need to tell my gmetad daemon to look for those two host addresses to collect data from. Here is my <b><i>gmetad.conf</i></b> file:
<br />
<script src="http://gist.github.com/7019093.js?file=gmetad.conf"></script>
You are done! Now restart all gmond daemons and the gmetad daemon, and wait a few minutes.<br />
<script src="http://gist.github.com/7019093.js?file=restart-cmd"></script>
Once you navigate to your Ganglia UI URL, you should be able to see your grid and the list of your clusters in the drop-down.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEWoUMYNg2t708ey2ggi8djkvriz3PL9N3C8U3CiViQIEvWZ0IHElZPXQjJEflDJ1fu_RjKSKSQSpPY5SmTAjzUm7CX3b29RDpLfYDwDmpV6mAoiwFwIzL0Paa2umnSqmkZ7zXN2LxdCK-/s1600/Capture.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjEWoUMYNg2t708ey2ggi8djkvriz3PL9N3C8U3CiViQIEvWZ0IHElZPXQjJEflDJ1fu_RjKSKSQSpPY5SmTAjzUm7CX3b29RDpLfYDwDmpV6mAoiwFwIzL0Paa2umnSqmkZ7zXN2LxdCK-/s400/Capture.JPG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br /></div>
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgV45G4djmnPSbJhAP9jaTA_Ngx529Ze_nPBZabTifJ9pb29JGOXOasSzia4oSc61owaJPKnn9D3I9B-hXJCaeeVrNexv1gVLyED7iieoBqoAbWeDRouHSUJ_uU63CdTdB8wX-2We6RnaYN/s1600/Capture-1.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="205" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgV45G4djmnPSbJhAP9jaTA_Ngx529Ze_nPBZabTifJ9pb29JGOXOasSzia4oSc61owaJPKnn9D3I9B-hXJCaeeVrNexv1gVLyED7iieoBqoAbWeDRouHSUJ_uU63CdTdB8wX-2We6RnaYN/s400/Capture-1.JPG" width="400" /></a></div>
<br /></div>
<div style="text-align: justify;">
You can dig further to see each host in each cluster:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX7a60-L-L5w0uWt4y-qUePTinZg72VSD9nX2c1zrT4V-OS7qi5rvZrbMyy6hyLh_gY02miPAonvjEmYM1w6kwk_4IHcchY8z8C98xo0qG9Ze36u0Uzc6GjDRqigzgMlfkgYOLWyNVh5t-/s1600/Capture-2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="182" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgX7a60-L-L5w0uWt4y-qUePTinZg72VSD9nX2c1zrT4V-OS7qi5rvZrbMyy6hyLh_gY02miPAonvjEmYM1w6kwk_4IHcchY8z8C98xo0qG9Ze36u0Uzc6GjDRqigzgMlfkgYOLWyNVh5t-/s400/Capture-2.JPG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPJvs3KmFfGYZfcGpIvIREU6qxcncBCKbUbDW5amtiGpF6zDBg4uzr3wGz2I7evehQLPKhcIrBxBU3QK2Prd_RbHojR4gMvkWBDop8bWz2_ZIIsHV-ozPC7AMC7aS0gLnt_6uO36WOmdBi/s1600/Capture-3.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="163" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgPJvs3KmFfGYZfcGpIvIREU6qxcncBCKbUbDW5amtiGpF6zDBg4uzr3wGz2I7evehQLPKhcIrBxBU3QK2Prd_RbHojR4gMvkWBDop8bWz2_ZIIsHV-ozPC7AMC7aS0gLnt_6uO36WOmdBi/s400/Capture-3.JPG" width="400" /></a></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<br />
There is another workaround which you can also try to get a better understanding of Ganglia. In that case, you need to use a separate port number for each cluster. Here, I'm distinguishing each cluster's data source by IP address, but in that workaround you can have a single IP address for all clusters with multiple port numbers. You can try that workaround as an exercise :).<br />
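To hint at what that exercise involves: each cluster's gmond instances would all point at the same IP but a cluster-specific port, and gmetad would distinguish the clusters by port. The sketch below is my own illustration (ports 8650/8651 are arbitrary), and you would also need matching udp_recv_channel/tcp_accept_channel blocks on the receiving side for each port:

```conf
# gmond.conf on Storm hosts -- illustrative port-based variant
udp_send_channel {
  host = 10.0.0.33
  port = 8650
}

# gmond.conf on Kafka hosts -- same target IP, different port
udp_send_channel {
  host = 10.0.0.33
  port = 8651
}

# gmetad.conf -- one data_source per port
data_source "Storm" 60 10.0.0.33:8650
data_source "Kafka" 60 10.0.0.33:8651
```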
<br />
<br />
<br />
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from my original version. So if you find something is not working or you face any issues, please do not hesitate to contact me.<br />
<div>
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<br /></div>
</div>
</div>
Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com12tag:blogger.com,1999:blog-740854994288835243.post-58436504253558022892013-04-01T21:25:00.000-05:002013-10-16T23:35:40.707-05:00Setting up Ganglia in CentOS<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<i>Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids</i> (<a href="http://ganglia.sourceforge.net/">ref</a>). Installing and configuring Ganglia is very straightforward. It has two major parts:</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Gmond</b> (Ganglia monitoring daemon)<b>:</b> Runs on every node, collects data, and sends it to the meta daemon node.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Gmetad</b> (Ganglia meta daemon)<b>: </b>Runs on a head (or client) node, gathers data from all monitoring nodes, and displays it in the UI.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Assume I have a 4-node cluster where one of the nodes also works as the client, so I will install the Ganglia PHP UI on that machine.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here are their IP addresses and the list of services I am going to install on them:</div>
<ul style="text-align: left;">
<li style="text-align: justify;">10.0.0.33 - client node (gmetad, gmond, ui)</li>
<li style="text-align: justify;">10.0.0.194 - monitoring node (gmond)</li>
<li style="text-align: justify;">10.0.0.195 - monitoring node (gmond)</li>
<li style="text-align: justify;">10.0.0.196 - monitoring node (gmond)</li>
<li style="text-align: justify;"><br /></li>
</ul>
<div style="text-align: justify;">
<b><u>On client node:</u></b></div>
<div style="text-align: justify;">
--> Install the meta daemon, monitoring daemon, and web UI by executing:</div>
<script src="http://gist.github.com/7019185.js?file=metadaemon"></script>
<div style="text-align: justify;">
--> If the packages are not available, you might need to add the <a href="http://fedoraproject.org/wiki/EPEL">EPEL</a> repository to your machine. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b><u>On monitoring node:</u></b></div>
<div style="text-align: justify;">
--> Install the monitoring daemon by:</div>
<script src="http://gist.github.com/7019185.js?file=monitoringdaemon"></script>
<div style="text-align: justify;">
<br />
<b><u>Configuration:</u></b><br />
<b><u><br /></u></b></div>
<div style="text-align: justify;">
At this point, everything is installed and you need to configure Ganglia.</div>
<ul style="text-align: left;">
<li style="text-align: justify;">/etc/ganglia/gmetad.conf --- configuration file for gmetad daemon</li>
<li style="text-align: justify;">/etc/ganglia/gmond.conf --- configuration file for gmond daemon</li>
</ul>
<div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I have updated only the following part of the <b><i>gmond.conf</i></b> file on each monitoring node.</div>
</div>
<script src="http://gist.github.com/7019185.js?file=gmond.conf"></script>
<div>
<div style="text-align: justify;">
Notice that I have commented out <b><i>mcast_join</i></b> and <b><i>bind</i></b> because <a href="http://sourceforge.net/apps/trac/ganglia/wiki/FAQ">multicast is not supported by AWS EC2</a>, so unicast is the only option for Ganglia there. All monitoring nodes send their collected data to the node (10.0.0.33) that runs the gmetad daemon.</div>
</div>
<div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In the <b><i>gmetad.conf</i></b> file I have updated this:</div>
</div>
<pre class="prettyprint" style="text-align: justify;">data_source "Cloud for Beginners" 60 10.0.0.33:8649
</pre>
<div style="text-align: justify;">
Here I'm telling the meta daemon the cluster name (which must match the name in gmond.conf so that hosts are grouped by cluster), the host IP address and port from which data will be collected, and the polling interval (collect data every 60 seconds).</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You are done! Now start the gmond daemon on all nodes and the gmetad daemon on the client node.</div>
<script src="http://gist.github.com/7019185.js?file=cmd"></script>
<div style="text-align: justify;">
After 1-2 minutes you should be able to see all your monitoring data through:</div>
<div style="text-align: justify;">
<a href="http://54.236.224.235/ganglia/">http://client.host.public.ip.address/ganglia/</a></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
You might want to change the boot configuration so that the gmetad and gmond daemons start at boot:</div>
<script src="http://gist.github.com/7019185.js?file=bootconfig"></script>
<div style="text-align: justify;">
<br />
<b><u>Common Issue:</u></b><br />
In case gmetad is not starting up, you can check the log with:<br />
<script src="http://gist.github.com/7019185.js?file=log"></script>
In the log you might see a "<i>Please make sure that /var/lib/ganglia/rrds is owned by nobody</i>" error; in that case, you need to execute this:<br />
<script src="http://gist.github.com/7019185.js?file=chown-rrds"></script>
<b><br /></b>
<b><br /></b>
<b><br /></b>
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from my original version. So if you find something is not working or you face any issues, please do not hesitate to contact me.</div>
<br />
<br /></div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com1tag:blogger.com,1999:blog-740854994288835243.post-87480280943537676092013-03-21T10:38:00.000-05:002013-10-17T00:14:14.625-05:00A basic Oozie coordinator job<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Suppose you want to run your workflow every two hours or once per day; that's where a coordinator job comes in very handy. There are several more use cases where you can use the Oozie coordinator. Today I'm just showing you how to write a very basic Oozie coordinator job.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I'm assuming that you are already familiar with Oozie and have a workflow ready to be used as a coordinator job. For this tutorial, my Oozie workflow is a shell-based action workflow. I want to execute a shell script every two hours, starting from today and continuing for the next 10 days. My workflow.xml is already inside an HDFS directory.</div>
<script src="http://gist.github.com/7019407.js?file=folderStructure"></script>
<div style="text-align: justify;">
Without the coordinator, I'm currently running it like this:<br />
<script src="http://gist.github.com/7019407.js?file=cmd"></script>
Here is my <b>job.properties</b> file:<br />
<script src="http://gist.github.com/7019407.js?file=job.properties"></script>
Now I want to run this workflow with a coordinator. The <b><i>Oozie Coordinator Engine</i></b> is responsible for the coordinator job, and the input of the engine is a <b><i>Coordinator App</i></b>. At least two files are required for each Coordinator App:<br />
<ol>
<li>coordinator.xml - Defines the coordinator job. What triggers your workflow (time-based or input-based), how long it will continue, the workflow wait time - all of this information needs to be written in this coordinator.xml file.</li>
<li>coordinator.properties - Contains properties for the coordinator job; it behaves the same as the job.properties file.</li>
</ol>
Based on my requirement, here is my <b><i>coordinator.xml</i></b> file:<br />
<script src="http://gist.github.com/7019407.js?file=coordinator.xml"></script>
As I need to pass the <b><i>coordinator.properties</i></b> file for a coordinator job, I cannot pass the previous <b><i>job.properties</i></b> file at the same time. That's why I need to move all properties from the job.properties file to the coordinator.properties file. Remember one thing: the coordinator.properties file <b>must have</b> a property which specifies the location of the <b><i>coordinator.xml</i></b> file (similar to <span style="text-align: start; white-space: pre-wrap;">oozie.wf.application.path in job.properties)</span>. After moving those properties, my <b><i>coordinator.properties</i></b> file became:<br /><br />
<script src="http://gist.github.com/7019407.js?file=coordinator.properties"></script>
As you noticed, I set the application path <b><i>oozie.coord.application.path</i></b>, and that path contains the coordinator.xml file.<br />
<script src="http://gist.github.com/7019407.js?file=coord-folder"></script>
Now I'm pretty much set. If I execute a coordinator job, it will execute the coordinator app located in the coordinator application path. The coordinator app has a <b><i><workflow><app-path>.... </app-path></workflow></i></b> tag which specifies the actual workflow location. At that location, I have my workflow.xml file, so that workflow.xml will be triggered based on how I defined the job in the coordinator.xml file.<br />
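To make the pieces concrete, a minimal time-based coordinator matching "every two hours for 10 days" might look like the sketch below; the dates, app name, and the ${workflowAppPath} property are illustrative placeholders, not copies of my actual file (which is in the gist above):

```xml
<coordinator-app name="shell-coord" frequency="${coord:hours(2)}"
                 start="2013-03-21T00:00Z" end="2013-03-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <action>
    <workflow>
      <!-- HDFS directory that contains the workflow.xml to trigger -->
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```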
<br />
I'm submitting my coordinator job by:<br />
<script src="http://gist.github.com/7019407.js?file=coordinator-cmd"></script>
Once your coordinator job is running successfully, I highly recommend that you go through this <a href="https://github.com/yahoo/oozie/wiki/Oozie-Coord-Use-Cases">document </a>and try out some other use cases and alternatives.<br />
<br />
<br /></div>
<div style="text-align: justify;">
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from my original version. So if you find something is not working or you face any issues, please do not hesitate to contact me.</div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com4tag:blogger.com,1999:blog-740854994288835243.post-7161547299886294092013-03-08T14:42:00.000-06:002013-10-16T21:00:14.154-05:00Oozie Example: Sqoop Actions<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
To run a Sqoop action through Oozie, you need at least two files: a) workflow.xml and b) job.properties. But if you prefer to pass Sqoop options through a parameter file, then you also need to copy that parameter file into your Oozie workflow application folder. I prefer to use a parameter file for Sqoop, so I'm passing that file too.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In this tutorial, what I'm trying to achieve is to run a Sqoop action which will export data from HDFS to Oracle. <a href="http://www.tanzirmusabbir.com/2013/02/sqoop-importexport-fromto-oracle.html">In my previous post</a>, I already wrote about how to do import/export between HDFS & Oracle. Before running the Sqoop action through Oozie, make sure Sqoop works without any errors on its own. Once it works without Oozie, try it through Oozie using a Sqoop action.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I'm assuming that when you execute the following line, it completes successfully and the data is loaded into Oracle without any errors:</div>
<script src="http://gist.github.com/7013556.js?file=sqoop-cmd"></script>
<div style="text-align: justify;">
A successful Sqoop export should end with the following message on the console:
<br />
<script src="http://gist.github.com/7013556.js?file=sqoop-success"></script>
In Oracle, I have already created the table "<b>Emp_Record</b>", which matches the data present in HDFS (under the /user/ambari-qa/example/output/sqoop/emprecord folder). That means each row in the HDFS files represents a row in the table. The data in HDFS is tab-delimited, and each field represents a column in the "Emp_Record" table. To know more about this, please check <a href="http://www.tanzirmusabbir.com/2013/02/sqoop-importexport-fromto-oracle.html">my previous post</a>, as I'm using the same table and HDFS files here.<br />
<br />
So, here is my <b><i>option.par</i></b> file which I'm using for my Sqoop export:<br />
<script src="http://gist.github.com/7013556.js?file=option.par"></script>
And my <b><i>workflow.xml</i></b> file:<br />
<script src="http://gist.github.com/7013556.js?file=workflow.xml"></script>
As you can see, all the Sqoop options which we generally use on the command line can be passed as arguments using the <b><i><arg></i></b> tag. If you do not want to use a parameter file, you need to pass each option in a separate <arg> tag, like:
<script src="http://gist.github.com/7013556.js?file=argExample.xml"></script>
Since I'm using a parameter file for this Sqoop action, I also need to put it inside the Oozie workflow application path and reference it through a <b><i><file></i></b> tag. So, my Oozie workflow application path becomes:
<script src="http://gist.github.com/7013556.js?file=appPath"></script>
<div style="text-align: justify;">
Finally my <b>job.properties</b> file for this workflow:
<br />
<script src="http://gist.github.com/7013556.js?file=job.properties"></script>
The execution command for this Oozie workflow is the same as for the others:
<script src="http://gist.github.com/7018058.js?file=oozie"></script>
<br /></div>
<div style="text-align: justify;">
<b><u><span style="font-size: large;">A common issue:</span></u></b><br /><br />If your Sqoop job is failing, you need to check your log. Most of the time (I'm using the Hortonworks distribution) you might see this error message in the log:<br />
<script src="http://gist.github.com/7013556.js?file=common-sqoop-issue.log"></script>
This happens when the required Oracle library file is not on Oozie's classpath. For standalone Sqoop, you need to manually copy the required <b>ojdbc6.jar</b> file to Sqoop's lib folder "<b><i>/usr/lib/sqoop/lib</i></b>". When running Sqoop through Oozie, you need to do the same thing, but in Oozie's Sqoop lib folder. You can do that by executing:
<script src="http://gist.github.com/7013556.js?file=issue-sol.sh"></script>
<br />
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from my original version. So if you find something is not working or you face any issues, please do not hesitate to contact me.</div>
</div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com1tag:blogger.com,1999:blog-740854994288835243.post-52021570833329830402013-03-03T12:48:00.000-06:002013-10-17T00:33:31.465-05:00Oozie Example: Java Action / MapReduce Jobs<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Running a Java action through Oozie is very easy, but there are some things you need to consider before you run it. In this tutorial, I'm going to execute a very simple Java action. I have a JAR file, TestMR.jar, which is a MapReduce application, so it will be executed on the Hadoop cluster as a MapReduce job. </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The TestMR.jar file has a class, TestMR, with a <b><i>public static void main(String[] args)</i></b> method that initiates the whole application. To run a Java action, you need to pass the main Java class name through the <b><main-class></b> tag.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This is the <b>workflow.xml</b> file for a Java action with the minimum number of parameters:</div>
<script src="http://gist.github.com/7019540.js?file=workflow.xml"></script>
<br />
<div style="text-align: justify;">
Your Java action has to be configured with <b><job-tracker></b> and <b><name-node></b>. As you know, Hadoop throws an exception if the output folder already exists; that's why I'm using the <b><prepare></b> tag, which deletes the output folder before execution. My JAR also takes command-line arguments. One of the arguments is "-r 6", which specifies how many reducers I want to use for the MR job, so I'm using the "<b><arg></b>" tag to pass command-line arguments. You can have multiple <arg> tags for a single Java action. As with other actions, to take the "<b><i>ok</i></b>" transition, the main Java application needs to complete without any error. If it throws an exception, the workflow takes the "<b><i>error</i></b>" transition.</div>
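Taken in isolation, the two pieces discussed above might look like this inside the java action; the output path here is illustrative, and the full file is in the gist above:

```xml
<java>
  <job-tracker>${jobTracker}</job-tracker>
  <name-node>${nameNode}</name-node>
  <prepare>
    <!-- remove the output folder so Hadoop does not fail on an existing path -->
    <delete path="${nameNode}/user/ambari-qa/example/output"/>
  </prepare>
  <main-class>TestMR</main-class>
  <!-- each command-line token gets its own <arg> tag -->
  <arg>-r</arg>
  <arg>6</arg>
</java>
```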
<div style="text-align: justify;">
<br />
Now on to the folder structure inside HDFS. When Oozie executes any action, it automatically adds all JAR files and native libraries from the "<b><i>/lib</i></b>" folder to its <i><b>classpath</b></i>. Here, the "/<b><i>lib</i></b>" folder is a sub-folder of the Oozie workflow application path. So, if "java-action" is the workflow application path, then the structure would be:<br />
- java-action<br />
- java-action/workflow.xml</div>
<div style="text-align: justify;">
- java-action/lib</div>
<div style="text-align: justify;">
<br />
In my HDFS, I have:
<br />
<script src="http://gist.github.com/7019540.js?file=folder-structure"></script>
And here is my <b>job.properties</b> file:
<script src="http://gist.github.com/7019540.js?file=job.properties"></script>
<br />
That's pretty much it! Now you can execute your workflow by:
<br />
<script src="http://gist.github.com/7018058.js?file=oozie"></script>
Remember, this is a very basic and simple workflow for running a Java action through Oozie. You can do a lot more than this by using several other options provided by Oozie. Once you are able to run a simple workflow, I would recommend going through the Oozie documentation and trying some workflows with different settings.<br />
<b><u><br /></u></b>
<b><u>Consideration:</u></b> Be careful about what you have inside your "<b>/lib</b>" folder. If the version of a library which you are using for your application conflicts with the version of Hadoop's own library files, it will throw errors, and those types of errors are hard to find. To avoid them, it is better to match your library files with the versions you have inside the "/usr/lib/[hadoop/hive/hbase/oozie]/lib" folders on your client node.
</div>
<div style="text-align: justify;">
<b><br /></b>
<b><br /></b>
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from my original version. So if you find something is not working or you face any issues, please do not hesitate to contact me.</div>
</div>
Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com11tag:blogger.com,1999:blog-740854994288835243.post-15774420785222205132013-03-01T23:13:00.000-06:002013-10-17T00:50:54.033-05:00Oozie Example: Hive Actions<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Running Hive through Oozie is pretty straightforward, and it's getting much simpler day by day. The first time I used it (old versions), I faced some issues, mostly related to the classpath, though I resolved them. But when I used the recent versions (Hive 0.8+, Oozie 3.3.2+), I faced only one or two issues at most.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
In this example, I'm going to execute a very simple Hive script through Oozie. I have a Hive table "<b>temp</b>" which is currently empty. The script will load some data from HDFS into that specific Hive table.</div>
<script src="http://gist.github.com/7019600.js?file=hive-cmd"></script>
And here is the content of the <b>script.hql</b>:<br />
<script src="http://gist.github.com/7019600.js?file=script.hql"></script>
<div style="text-align: justify;">
Now you need to set up your Oozie workflow app folder. You need one very important file to execute a Hive action through Oozie: <b>hive-site.xml</b>. When Oozie executes a Hive action, it needs Hive's configuration file. You can provide multiple configuration files in a single action. You can find your Hive configuration file at "<b>/etc/hive/conf.dist/hive-site.xml</b>" (default location). Copy that file and put it inside your workflow application path in HDFS. Here is the list of files that I have in my Oozie Hive action's workflow application folder.</div>
<script src="http://gist.github.com/7019600.js?file=folder"></script>
And here is my <b>workflow.xml</b> file:<br />
<script src="http://gist.github.com/7019600.js?file=workflow.xml"></script>
<div style="text-align: justify;">
Look at the <b><job-xml></b> tag: since I'm putting <b>hive-site.xml</b> in my application path, I'm passing just the file name, not the whole location. If you want to keep that file in some other location in HDFS, you can pass the whole HDFS path there too. In older versions of Hive, the user had to provide the <b>hive-default.xml</b> file using the property key <b>oozie.hive.defaults</b> when running an Oozie Hive action, but as of Hive 0.8+ it's not required anymore.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Here I'm using another tag, <param>, which is not required, but I'm using it just to show how to pass parameters among the Hive script, the job properties, and the workflow. If you are using any parameter variable inside your Hive script, it needs to be passed through the Hive action. So you can do either:
<br />
<ul>
<li><param>INPUT_PATH=${inputPath}</param> (where inputPath can be passed through job properties) , <b>Or</b></li>
<li><param>INPUT_PATH=/user/ambari-qa/input/temp</param></li>
</ul>
<br />
Inside my HDFS, the "<b>/hive-input/temp</b>" folder contains the files which need to be loaded into the Hive table:</div>
<script src="http://gist.github.com/7019600.js?file=hive-dir"></script>
And here is my <b>job.properties</b> file:
<br />
<script src="http://gist.github.com/7019600.js?file=job.properties"></script>
That's it! You can now run your Hive workflow by executing this on the client node:<br />
<script src="http://gist.github.com/7018058.js?file=oozie"></script>
<b><u><br /></u></b><b><u><span style="font-size: large;">Two common issues:</span></u></b><br />
You might face some issues if the required jar files are not present inside the "<b>/user/oozie/share/lib/hive</b>" folder (HDFS). One of the common issues is not having the hcatalog* jar files in that folder. In that case you will see something like this in the log:<br />
<script src="http://gist.github.com/7019600.js?file=common-issue"></script>
In that case, you need to manually copy those required jar files into that folder. You can do that by running the following:<br />
<script src="http://gist.github.com/7019600.js?file=issue-res"></script>
<br />
<b>Another common issue you might face is:
</b><br />
<pre class="prettyprint">SemanticException [Error 10001]: Table not found
</pre>
<div style="text-align: justify;">
Even though you can see that your table exists, you might see this error when running through Oozie. Most of the time it happens when your Hive is not properly pointing to the right metastore, and the problem goes away when you copy the correct hive-site.xml into the Hive lib folder inside HDFS. Make sure you check your <b>hive-site.xml</b> file to see that all properties are correctly set, like "hive.metastore.uris", "javax.jdo.option.ConnectionURL", and "javax.jdo.option.ConnectionDriverName". But I and other users (<a href="http://hadoop-common.472056.n3.nabble.com/Hive-Action-Failing-in-Oozie-td4004208.html">Hive action failing in oozie</a>) also found that the above error message is ambiguous and doesn't give much insight. If the expected jar files are not present in the share lib folder, Hive throws the same error message! So be careful about what you have on the classpath when running Hive through Oozie.</div>
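For reference, the metastore-related properties in hive-site.xml have this shape; the hostnames, database, and driver below are purely illustrative stand-ins for whatever your metastore actually uses:

```xml
<!-- hive-site.xml metastore settings -- illustrative values only -->
<property>
  <name>hive.metastore.uris</name>
  <value>thrift://metastore.example.com:9083</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://db.example.com/hivemetastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```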
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from my original version. So if you find something is not working or you face any issues, please do not hesitate to contact me.</div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com8tag:blogger.com,1999:blog-740854994288835243.post-25289054145429670492013-02-25T22:39:00.000-06:002013-10-17T01:33:53.237-05:00Setup a Storm cluster on Amazon EC2<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
<b>Storm - real-time Hadoop!</b> Yes, you can call it that. Just as Hadoop provides a set of general primitives for batch processing, Storm provides a set of primitives for real-time computation. It's a very powerful tool, and setting up a Storm cluster is pretty straightforward. If you want to set up a Storm cluster on Amazon EC2, you should first try Nathan's <a href="https://github.com/nathanmarz/storm-deploy/wiki">storm-deploy</a> project, which auto-deploys a Storm cluster on EC2 based on your configuration. But if you want to deploy a Storm cluster manually, you can follow these steps (for more detailed information, you can also follow the <a href="https://github.com/nathanmarz/storm/wiki/Setting-up-a-Storm-cluster">original documentation</a> of Storm):</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Let me show you my machine's current configuration first:</div>
<div style="text-align: justify;">
</div>
<ul>
<li>Machine type: m1.large (for supervisor) and m1.small (for nimbus)</li>
<li>OS: 64-bit CentOS 6.3</li>
<li>JRE Version: 1.6.0_43</li>
<li>JDK Version: 1.6.0_43</li>
<li>Python Version: 2.6.6</li>
</ul>
<div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitVd6zAe9qU54VVsbAOqVQ0VaY6_H-aSsOdAld9IcezBOtJwS2ZZ9KlfRWCOUItMkY2z0I6WxHRGla9bVIi8R9M1H-CRZApO72jQmbBRUxS3oHwLZGhO4qdsk7MHcpvgPzFy7x5UlU7fpD/s1600/storm-dependency-version.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="225" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEitVd6zAe9qU54VVsbAOqVQ0VaY6_H-aSsOdAld9IcezBOtJwS2ZZ9KlfRWCOUItMkY2z0I6WxHRGla9bVIi8R9M1H-CRZApO72jQmbBRUxS3oHwLZGhO4qdsk7MHcpvgPzFy7x5UlU7fpD/s400/storm-dependency-version.JPG" width="400" /></a></div>
<br />
<br /></div>
<div>
For this tutorial, I am going to set up a Storm cluster with three supervisor nodes and one nimbus node. The IP address of each host and my targeted configuration are:</div>
<div>
<br /></div>
<div>
10.0.0.194 - StormSupervisor1</div>
<div>
10.0.0.195 - StormSupervisor2</div>
<div>
10.0.0.196 - StormSupervisor3</div>
<div>
10.0.0.182 - StormNimbus</div>
<div>
<br /></div>
<div>
Storm depends on Zookeeper for coordinating the cluster. I have already installed Zookeeper in each of those above hosts. Now apply each of the following steps on all Supervisor and Nimbus nodes:</div>
<div>
<br />
<br /></div>
<div>
<b><u>A. Install ZeroMQ 2.1.7</u></b></div>
<div>
<br /></div>
<div style="text-align: justify;">
<b><u>Step #A-1:</u></b> Download zeromq-2.1.7.tar.gz from <a href="http://download.zeromq.org/" style="text-align: left;">http://download.zeromq.org/</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b><u>Step #A-2:</u></b> Extract the gzip file:</div>
<div style="text-align: justify;">
<pre class="prettyprint">[root@ip-10-0-0-194 tool]# tar -zxvf zeromq-2.1.7.tar.gz
</pre>
</div>
<div style="text-align: justify;">
<b><u>Step #A-3:</u></b> Build ZeroMQ and update the library:</div>
<script src="http://gist.github.com/7019936.js?file=install-zeromq"></script>
<div style="text-align: justify;">
<b><u>Note:</u></b> If you see the <i>"cannot link with -luuid, install uuid-dev."</i> error when executing <i>"./configure"</i>, you need to install the uuid development package. You can install it by executing <b><i>"yum install libuuid-devel"</i></b>.<br />
<br />
<br />
<b style="text-align: left;"><u>B. Install JZMQ</u></b></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b><u>Step #B-1:</u></b> Get the project from the Git by executing:<br />
<script src="http://gist.github.com/7019936.js?file=git"></script>
<b><u>Step #B-2:</u> </b>Install it:<br />
<script src="http://gist.github.com/7019936.js?file=install-jzmq"></script>
</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b><u>C. Setup Storm</u></b><br />
<br />
<b><u>Step #C-1</u></b>: Download the latest version (for this tutorial, I'm using version 0.8.1) from <a href="https://github.com/nathanmarz/storm/downloads">https://github.com/nathanmarz/storm/downloads</a>.<br />
<br />
<b><u>Step #C-2</u></b>: Unzip the downloaded zip file:<br />
<pre class="prettyprint">[root@ip-10-0-0-194 tool]# unzip storm-0.8.1.zip
</pre>
<b><u>Step #C-3:</u></b> Now change the configuration based on your environment. The default location of the main Storm configuration file is "<b><i>/storm/conf/storm.yaml</i></b>" (inside the extracted Storm directory). Any setting you put in this file overrides the corresponding setting in the default configuration file. Here is what I changed in the <b><i>storm.yaml</i></b> file:<br />
<script src="http://gist.github.com/7019936.js?file=storm.yaml"></script>
<b><u>Note:</u></b> I have created the "storm" folder manually inside <b><i>"/var"</i></b> directory.<br />
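For reference, a minimal storm.yaml for this layout might look like the sketch below. The IP addresses are the ones from this post; the ZooKeeper list assumes a three-node ensemble on the supervisor hosts, so adjust it to wherever your ZooKeeper actually runs:

```yaml
# Sketch only -- adapt to your own ZooKeeper and directory layout.
storm.zookeeper.servers:
  - "10.0.0.194"
  - "10.0.0.195"
  - "10.0.0.196"
nimbus.host: "10.0.0.182"
storm.local.dir: "/var/storm"
```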
<br />
At this point, you are ready to start your Storm cluster. I installed and set up everything on a single instance first (supervisor1 - 10.0.0.194), then created an AMI from that instance and launched the remaining two supervisors and the nimbus node from that AMI.<br />
<br />
Launch the daemons using the storm script (bin/storm) on each node. I started the nimbus and UI daemons on the nimbus host and the supervisor daemon on each of the supervisor nodes.<br />
<br />
<ul>
<li><b><i>bin/storm nimbus</i></b> on 10.0.0.182</li>
<li><b><i>bin/storm ui</i></b> on 10.0.0.182</li>
<li><b><i>bin/storm supervisor</i></b> on 10.0.0.194,10.0.0.195,10.0.0.196</li>
</ul>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghxoUwmaq7uN16x2o5YlvuGVdNGQHRfoAdbbTTcsCiG0OAgg463517tD_XCL7ejLxCBrRpHP3n6fFBI5jzqvcwLd5_YqNej7qZ9_y-sQT6CNolnKasxqm4tz7ln7ska7PbO7Olcy8v5I5N/s1600/storm.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="216" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghxoUwmaq7uN16x2o5YlvuGVdNGQHRfoAdbbTTcsCiG0OAgg463517tD_XCL7ejLxCBrRpHP3n6fFBI5jzqvcwLd5_YqNej7qZ9_y-sQT6CNolnKasxqm4tz7ln7ska7PbO7Olcy8v5I5N/s400/storm.JPG" width="400" /></a></div>
<div>
<br /></div>
<div>
<br /></div>
<br />
You can see the Storm UI by navigating to your nimbus host: http://{nimbus host}:8080. In my case, it was http://54.208.24.209:8080 (here, 54.208.24.209 is the public IP address of my nimbus host).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiDNKvXhv7IvdvF4unadJTAj77fikpWk9dTzgRTONnJiex6v5uUG0csQoli-jWKs1HQPpMIjk4NG_u7ji_6fXBE2Kpx7OYqnHZc8ImRyyfE1kjObZbx8-mI8TLnNd0j9fGBROdg2h3T-X0/s1600/storm-ui.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="118" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhiDNKvXhv7IvdvF4unadJTAj77fikpWk9dTzgRTONnJiex6v5uUG0csQoli-jWKs1HQPpMIjk4NG_u7ji_6fXBE2Kpx7OYqnHZc8ImRyyfE1kjObZbx8-mI8TLnNd0j9fGBROdg2h3T-X0/s400/storm-ui.JPG" width="400" /></a></div>
<br />
<br />
<b>Note: </b><span style="background-color: white; color: #222222; font-family: Arial, Tahoma, Helvetica, FreeSans, sans-serif; font-size: 13px; line-height: 18px;">For privacy purposes, I had to modify several lines in this post from the original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.</span><br />
<br />
<br /></div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com7tag:blogger.com,1999:blog-740854994288835243.post-24873954176791812032013-02-23T22:43:00.000-06:002013-10-17T01:51:37.155-05:00Install Opscenter in CentOS environment<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
In my <a href="http://www.tanzirmusabbir.com/2013/02/install-cassandra-cluster-within-30.html">previous post</a>, I talked about how to install Cassandra in CentOS environment. This is the follow-up post of my previous post and here I am going to show you how to install <a href="http://www.datastax.com/what-we-offer/products-services/datastax-opscenter">OpsCenter</a> in the same environment. "<i>OpsCenter is a browser-based user interface for monitoring, administering, and configuring multiple Cassandra clusters in a single, centralized management console</i> (<a href="http://www.datastax.com/docs/opscenter/index">ref</a>)". </div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Last time, I installed Cassandra cluster on these nodes:</div>
<div style="text-align: justify;">
</div>
<ul>
<li>cassandra node1 -> 10.0.0.57</li>
<li>cassandra node2 -> 10.0.0.58</li>
<li>cassandra node3 -> 10.0.0.59</li>
</ul>
<div style="text-align: justify;">
Now I am going to install OpsCenter agents on these nodes and will treat 10.0.0.57 as my client node (meaning the OpsCenter console will be deployed on that host). Before installing OpsCenter, make sure your Cassandra cluster is up and running successfully.<br />
<br />
<b><u>Step #1:</u></b> Create a new yum repository definition for DataStax OpsCenter in 10.0.0.57 node.<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# vim /etc/yum.repos.d/datastax.repo</pre>
<b><u>Step #2:</u></b> Write the edition you want to install in the datastax.repo file (I am installing OpsCenter community edition):<br />
<script src="http://gist.github.com/7020111.js?file=datastax-repo"></script>
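For reference, a community-edition repo definition looks roughly like this. It is a sketch of the standard DataStax community repository settings of that time, so verify the baseurl against DataStax's current instructions:

```ini
[datastax]
name = DataStax Repo for Apache Cassandra
baseurl = http://rpm.datastax.com/community
enabled = 1
gpgcheck = 0
```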
<b><u>Step #3:</u></b> Install the OpsCenter package:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# yum install opscenter-free
</pre>
The above steps will install the most recent OpsCenter community edition on your system. But today I want to install a specific version of OpsCenter (one appropriate for the Cassandra version I installed earlier). To do that, I first need to check which OpsCenter versions are currently available:
<br />
<script src="http://gist.github.com/7020111.js?file=ops-version"></script>
I want to install version 3.0.2-1, so I'm installing it with:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# yum install opscenter-free-3.0.2-1</pre>
<b><u>Step #4:</u></b> If you do not want to use the repository and prefer to manually install a specific version of OpsCenter, you can download the rpm file from <a href="http://rpm.datastax.com/community/noarch/">http://rpm.datastax.com/community/noarch/</a> and install it with:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# yum install opscenter-free-3.0.2-1.noarch.rpm</pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq_Ti7tpVzrKRbDdx6ZvGl8naQEZEAAyc2wezIdhAsMlbMZXG9nFXFowTtlkZ4GeeX7W6fgRrUHhiKdGAhhZuDSfo7vqjoDjlGcZD67jf_jsDy103Wqu8YtOTNvjLi1-4M3GOJHWkrp5R5/s1600/opscenter.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="183" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjq_Ti7tpVzrKRbDdx6ZvGl8naQEZEAAyc2wezIdhAsMlbMZXG9nFXFowTtlkZ4GeeX7W6fgRrUHhiKdGAhhZuDSfo7vqjoDjlGcZD67jf_jsDy103Wqu8YtOTNvjLi1-4M3GOJHWkrp5R5/s400/opscenter.JPG" width="400" /></a></div>
<b><u>Step #5:</u></b> Now edit the OpsCenter configuration file (<b><i>/etc/opscenter/opscenterd.conf</i></b>) to set your web server's IP address or hostname:<br />
<script src="http://gist.github.com/7020111.js?file=opscenterd.conf"></script>
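For reference, the relevant part of opscenterd.conf is the [webserver] section. This is a sketch: 0.0.0.0 makes the console listen on all interfaces, and 8888 is the default port:

```ini
[webserver]
port = 8888
interface = 0.0.0.0
```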
<b><u>Step #6:</u></b> Now start your OpsCenter by:<br />
<script src="http://gist.github.com/7020111.js?file=ops-start"></script>
<b><u>Step #7:</u></b> Now you can see your OpsCenter console by navigating to http://<opscenter_host>:8888. In my case, it is http://54.208.29.59:8888, since that is the public IP address for the host 10.0.0.57.<br />
<b><u><br /></u></b>
<b><u>Step #8:</u></b> Wait, you are not done yet! You still need to install the OpsCenter agents. The first time you open your OpsCenter console, it will ask whether you want to create a new cluster or use an existing one. In the <a href="http://www.tanzirmusabbir.com/2013/02/install-cassandra-cluster-within-30.html">previous post</a>, I installed a cluster named "Simple Cluster", and I want to register that existing cluster with OpsCenter. So I'm selecting the "Use Existing Cluster" option.</div>
<div style="text-align: justify;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTgsPvPu7Dn26V_bF9D0Pwgo7C95Z5HlchcreVGcXxzNnVsvaRbqe2BfmQvkuePCED-KvS8kc5IbV_t1PrD46P6Zod_yKAWYHPzWojIsWCOvYH5MXW0mHMSn7QcHFQ6nmdVOXQAQaNSMdY/s1600/opscenter-2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="188" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgTgsPvPu7Dn26V_bF9D0Pwgo7C95Z5HlchcreVGcXxzNnVsvaRbqe2BfmQvkuePCED-KvS8kc5IbV_t1PrD46P6Zod_yKAWYHPzWojIsWCOvYH5MXW0mHMSn7QcHFQ6nmdVOXQAQaNSMdY/s400/opscenter-2.JPG" width="400" /></a></div>
<br />
<b><u>Step #9:</u></b> Now enter the hostnames or IP addresses of your cluster nodes, one per line (leave the other fields as they are):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbopbcBzqnW7YoRU2-D2keYB_cGz9FmObmWpDwHUCtYJTLH88XbTDFvp2NnW1iqcCFiGwxnLo0UlgVMDK6Titq8SPSc_XABPhN1tkWjxA88Lf0ZVGPL_rvqmYh6ZpJQTmFN0R2AnwZqZdW/s1600/opscenter-3.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="237" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbopbcBzqnW7YoRU2-D2keYB_cGz9FmObmWpDwHUCtYJTLH88XbTDFvp2NnW1iqcCFiGwxnLo0UlgVMDK6Titq8SPSc_XABPhN1tkWjxA88Lf0ZVGPL_rvqmYh6ZpJQTmFN0R2AnwZqZdW/s400/opscenter-3.JPG" width="400" /></a></div>
<br />
<b><u>Step #10:</u></b> At this point you should be able to see your OpsCenter console.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrl4iWmrfJB9l12dkYQ3UHw7lJa1Hg1F5mHF89iVsUPt413mhntloZDzpRfvkJyjO1M6bQYjVvJtWgzgklW15DUrfvX35kfVUUoRYLvbKgUpkEXHjO79gz8TYhA_N045ldf_u03QUeSYu0/s1600/opscenter-4.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="201" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhrl4iWmrfJB9l12dkYQ3UHw7lJa1Hg1F5mHF89iVsUPt413mhntloZDzpRfvkJyjO1M6bQYjVvJtWgzgklW15DUrfvX35kfVUUoRYLvbKgUpkEXHjO79gz8TYhA_N045ldf_u03QUeSYu0/s400/opscenter-4.JPG" width="400" /></a></div>
<br />
Note that at the top of your console there is a notification labeled "<i>0 of 3 agents connected</i>" with a link called "<span style="color: blue;">fix</span>". This is because none of your OpsCenter agents are installed yet. Click that link to install the agents automatically.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjM6rYTddFY7rY8vn6WtP1DFYj1N5kpmYiu6KJ_3qNPDKmZYvCoVGsWm6gBMNPlzgOLS85q47sE75PgldiqKTWaUXhpjPCV9FvWH5v-Wsip65DOu147bdVFssIau4DMZ0CgC-lQeBgwZuxr/s1600/opscenter-5.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="305" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjM6rYTddFY7rY8vn6WtP1DFYj1N5kpmYiu6KJ_3qNPDKmZYvCoVGsWm6gBMNPlzgOLS85q47sE75PgldiqKTWaUXhpjPCV9FvWH5v-Wsip65DOu147bdVFssIau4DMZ0CgC-lQeBgwZuxr/s400/opscenter-5.JPG" width="400" /></a></div>
<br />
<b><u>Step #11:</u></b> Enter the appropriate credentials for your machines. In my case, I'm entering "<b><i>root</i></b>" as the username and pasting my private key (including the commented part, i.e. --BEGIN RSA PRIVATE KEY):<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgcu6CqujvWW9t7awUPEP86XGfW1S-kJyquN4rd-YGfZUPcr9zwGee1jl8G6QfkgAgIWR-33PYjnvn5PY3dtBoskfWBlX633Dt4IyU3sNdQIIx7D3XSbhdEvSU5MQNLGccvoaezV2IxgM/s1600/opscenter-6.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="341" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgqgcu6CqujvWW9t7awUPEP86XGfW1S-kJyquN4rd-YGfZUPcr9zwGee1jl8G6QfkgAgIWR-33PYjnvn5PY3dtBoskfWBlX633Dt4IyU3sNdQIIx7D3XSbhdEvSU5MQNLGccvoaezV2IxgM/s400/opscenter-6.JPG" width="400" /></a></div>
<br />
Click on "<b><i>Done</i></b>" and finally install it by clicking on the "<b><i>Install on all nodes</i></b>" button.<br />
<br />
<b><u>Step #12:</u></b> Accept all the fingerprints by clicking the "<b><i>Accept Fingerprints</i></b>" button. Once you click it, each of your hosts will download the agent package from the internet and then install and configure the agent on the system.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjG1FAlF6Nw2JkxcBWehQBmvPH-TL7OspPnUFt88O5RJESaLBp_39GXX2QbiLMsvq_4FwC101zomJrA3u63AgiTIGz95qOMUpSAgDTLv_4czyzTKogVHhNsT3u-6L3ZNeTUkWH9dZ0-xQc/s1600/opscenter-7.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="261" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjG1FAlF6Nw2JkxcBWehQBmvPH-TL7OspPnUFt88O5RJESaLBp_39GXX2QbiLMsvq_4FwC101zomJrA3u63AgiTIGz95qOMUpSAgDTLv_4czyzTKogVHhNsT3u-6L3ZNeTUkWH9dZ0-xQc/s400/opscenter-7.JPG" width="400" /></a></div>
<br />
At the end of the installation you should be able to see a successful message on your screen:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBnFBUZgUGk6rmIVOrvk-2_5D_T6qRLgiAuYkUn_w-3ASG9p58UNHSsv5JbIf3Xs2YmjBaiXgftsxNsSbbtL_lIpqkrtZ8XS2bpRNNFbqpinjRqyNE7Ev4TqXSlP_o9dBJ1na7vJFI2sSt/s1600/opscenter-8.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="273" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiBnFBUZgUGk6rmIVOrvk-2_5D_T6qRLgiAuYkUn_w-3ASG9p58UNHSsv5JbIf3Xs2YmjBaiXgftsxNsSbbtL_lIpqkrtZ8XS2bpRNNFbqpinjRqyNE7Ev4TqXSlP_o9dBJ1na7vJFI2sSt/s400/opscenter-8.JPG" width="400" /></a></div>
<br />
Since each host downloads the agent package from the internet, every host needs to be able to reach the internet. If you do not have that setup, you can also install the <a href="http://www.datastax.com/docs/opscenter/agent/agent_manual">OpsCenter agent manually</a>.<br />
<br />
At the end of the installation, you should no longer see the "<i>0 of 3 agents connected</i>" notification.<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiVDo-0Dy7Nfp4tBAN66_nTK66zmbcs5WZ28qp5KdNeLPB8_Awrc4bASluJ-dnn_zpnkbX7WXdbe30Q2kU-KxXhayi5au-IWY434I-9UAbunaKKtCrZKDTGLQD9mhKbM6J5IwxmYKTPUBQ/s1600/opscenter-9.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgiVDo-0Dy7Nfp4tBAN66_nTK66zmbcs5WZ28qp5KdNeLPB8_Awrc4bASluJ-dnn_zpnkbX7WXdbe30Q2kU-KxXhayi5au-IWY434I-9UAbunaKKtCrZKDTGLQD9mhKbM6J5IwxmYKTPUBQ/s400/opscenter-9.JPG" width="400" /></a></div>
<br />
Finally, you are done! There are so many things you can do in OpsCenter. I highly recommend playing with it and experimenting with different settings and configurations. It also comes with a large set of very good performance metrics, so do not forget to check those out too.<br />
<br />
<br />
<br />
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from the original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.<br />
<br /></div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com0tag:blogger.com,1999:blog-740854994288835243.post-49679471668340486552013-02-21T15:07:00.000-06:002013-10-17T01:21:41.643-05:00Sqoop import/export from/to Oracle<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
I love Sqoop! It's a fun tool to work with and it's very powerful. Importing and exporting data from/to Oracle with Sqoop is pretty straightforward. One <b><i>crucial </i></b>thing to remember when working with Sqoop and Oracle together is to use <b><i>all capital letters</i></b> for Oracle table names. Otherwise, Sqoop will not recognize the Oracle tables.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
This is my database (Oracle) related information:</div>
<div style="text-align: justify;">
</div>
<ul>
<li>URL: <span style="white-space: pre-wrap;">10.0.0.24</span></li>
<li>db_name (SID): test</li>
<li>Schema: Employee_Reporting</li>
<li>Username: username1</li>
<li>Password: password1</li>
<li>Table name: employee (I'm going to import this table into HDFS with Sqoop)</li>
</ul>
<div>
<br /></div>
<div>
<b><u><span style="font-size: large;">Import from Oracle to HDFS:</span></u></b></div>
<div>
Let's go through this option file (option.par):</div>
<script src="http://gist.github.com/7019840.js?file=option.par"></script>
<div>
<div style="text-align: justify;">
You can see that most of the parameters are self-explanatory. Notice that I'm providing the table name in all capital letters. How do you want the imported columns to be laid out in HDFS? For that, we use the <b><i>--fields-terminated-by</i></b> parameter. Here I'm passing "<b><i>\t</i></b>", which means the columns/fields of each row will be tab-delimited after import. Sqoop will generate a class (with a list of setters and getters) to represent an employee record; the name of that class is defined by the <b><i>--class-name</i></b> parameter. So in this case, it will create a class named <i>Employee.java</i> inside the <i>com.example.reporting</i> package. I'm using the <b><i>--verbose</i></b> parameter to print information while Sqoop is working; it's not mandatory and you can omit it if you want. The <b><i>--split-by</i></b> parameter names the column to use for splitting the import data; here, <i>ID</i> is the primary key of the <i>Employee</i> table. You can use any WHERE clause for your import, passed with the <b><i>--where</i></b> parameter. In the above example, it will import all rows from the <i>Employee</i> table where ID is less than or equal to 100000 (i.e., importing 100000 rows). You need to mention an HDFS location to use as the destination directory for the imported data (the <b><i>--target-dir</i></b> parameter). Remember that the target directory must not already exist when you run the import command; otherwise Sqoop will throw an error. The last parameter, <b><i>-m</i></b>, is the number of map tasks to run in parallel for the entire import job.</div>
<br />
Once you have your option file ready, you can execute the Sqoop import command as:<br />
<pre class="prettyprint">sqoop import --options-file option.par
</pre>
</div>
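For reference, an option file matching the parameters discussed above might look like the sketch below (one option or value per line). The Oracle port 1521, the target directory, and the mapper count are assumptions for illustration:

```text
# Hypothetical option.par -- adjust connection details and paths to your setup.
import
--connect
jdbc:oracle:thin:@10.0.0.24:1521:test
--username
username1
--password
password1
--table
EMPLOYEE
--fields-terminated-by
\t
--class-name
com.example.reporting.Employee
--split-by
ID
--where
ID<=100000
--target-dir
/user/hdfs/employee-import
--verbose
-m
4
```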
<div>
<div style="text-align: justify;">
Using an option file is not mandatory; I'm just using it for convenience. You can also pass each parameter from your console and execute the import job. Example:</div>
<script src="http://gist.github.com/7019840.js?file=sqoop-import-cmd"></script>
</div>
<div>
<b><u><br /></u></b>
<b><u><span style="font-size: large;">Export from Hive to Oracle:</span></u></b></div>
<div>
<div style="text-align: justify;">
For export, I will be using some of the parameters from the import, as they are common to both import and export jobs. Assume I processed the data generated by the import job (with MapReduce jobs) and inserted it into <a href="http://hive.apache.org/">Hive</a> tables. Now I want to export those Hive tables to Oracle. If you are familiar with Hive, you may know that Hive moves/copies data to its warehouse folder by default. I'm using the <a href="http://hortonworks.com/">Hortonworks</a> distribution, and in my case Hive's warehouse folder is located at "<span style="white-space: pre-wrap;"><b><i>/apps/hive/warehouse/emp_record</i></b>". Here, <b><i>emp_record</i></b> is the Hive table I want to export from.</span></div>
</div>
<div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I have already created a matching table "<b><i>Emp_Record</i></b>" in Oracle under the same schema, "Employee_Reporting". To export the Hive table, I'm executing the following command:</div>
<script src="http://gist.github.com/7019840.js?file=sqoop-export-cmd"></script>
</div>
<div>
Notice that instead of <b><i>--target-dir</i></b>, I'm using <b><i>--export-dir</i></b>; this is the location of the Hive table's warehouse folder, and data will be exported from there.<br />
<br />
Now assume that inside the warehouse directory I have a file <b><i>00000_1</i></b> (which contains the data of the Hive table Emp_Record), and some of its lines are:<br />
<script src="http://gist.github.com/7019840.js?file=file structure"></script>
</div>
<div>
As you can see, the columns/fields are tab-delimited and the rows are separated by newlines. Notice that there is an entire row whose values are all <b><i>null</i></b> (ideally you would filter such values out in your MapReduce jobs). But since we may have all kinds of values, we need to tell Sqoop how to treat each of them. For that, I'm using the <b>--input-fields-terminated-by</b> parameter to say that the fields are tab-delimited and the <b>--input-lines-terminated-by</b> parameter to distinguish rows. From Sqoop's point of view, there are two kinds of columns: string columns and non-string columns. Sqoop needs to know which string value represents a null value, so I'm using the <b><i>--input-null-string</i></b> and <b><i>--input-null-non-string</i></b> parameters for the two column types and passing '<b><i>\\N</i></b>' as their value, because in my case '\\N' means null.</div>
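Before running the export, you can sanity-check how the null markers will be seen. The sketch below fabricates a tiny two-row sample file (invented data, not the real table) and counts the rows carrying the \N marker:

```shell
# Fabricated two-row sample of a tab-delimited Hive warehouse file;
# \N is the marker Hive writes for SQL NULL.
printf '1\tJohn\tSales\n2\t\\N\t\\N\n' > /tmp/00000_1

# Count the rows containing at least one \N marker -- these are the values
# the --input-null-string / --input-null-non-string flags describe to Sqoop.
NULL_ROWS=$(grep -Fc '\N' /tmp/00000_1)
echo "rows with nulls: $NULL_ROWS"   # -> rows with nulls: 1
```

If the count is unexpectedly high, check that your Hive job really writes \N (and not an empty string) for missing values before pointing Sqoop at the directory.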
<br />
<br />
<b><u><span style="font-size: large;">Wrapping Up:</span></u></b><br />
<div style="text-align: justify;">
Sometimes you will face issues during export when your Oracle table has very tight constraints (e.g. NOT NULL, timestamps, values expected in a specific format, etc.). In that case, the best approach is to export the Hive table/HDFS data to a temporary Oracle table without any modification, to keep Sqoop happy :). Then write a SQL script to convert and filter the data into your expected values and load it into your main Oracle table.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
The above two examples show only very basic Sqoop import and export jobs. There are a lot of settings in Sqoop you can use. Once you are able to run export/import jobs successfully, I recommend trying the same job with different parameters and seeing how it goes. You can find all available options in the <a href="http://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html">Sqoop user guide</a>.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Note:</b> For privacy purposes, I had to modify several lines in this post from the original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.</div>
<div style="text-align: justify;">
<br /></div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com11tag:blogger.com,1999:blog-740854994288835243.post-71221025421702035422013-02-13T14:14:00.000-06:002013-10-17T01:11:53.541-05:00Install Cassandra Cluster within 30 minutes<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: justify;">
Installing a Cassandra cluster is pretty straightforward, and you will find a lot of detailed documentation on the DataStax site. But if you still find it overwhelming, you can follow the steps I mention here. This is just a condensed version of all the steps you need for a basic Cassandra cluster setup. I still recommend reading the DataStax documents for detailed information.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
I am using an EBS-backed CentOS AMI for this setup. You can choose a different AMI based on your needs.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
Please check the current Cassandra version if you want to set up the latest one. Today, I am going to install Cassandra version 1.2.3.</div>
<br />
<u style="font-weight: bold;">Step #1:</u> Create an instance in Amazon AWS. I chose the following one for this setup:<br />
<br />
<a href="https://aws.amazon.com/amis/centos-6-3-ebs-backed">EBS Backed CentOS 6.3</a><br />
<div class="MsoNormal">
<br /></div>
<div class="MsoNormal">
This is an EBS-backed 64-bit CentOS 6.3 AMI. For me, the IP address assigned to this server was 10.0.0.57.</div>
<div class="MsoNormal">
<br /></div>
<u style="font-weight: bold;">Step #2:</u> Once you have created this instance, it will not allow you to log in as the root user by default. So log in to that server as ec2-user.<br />
<br />
<u style="font-weight: bold;">Step #3:</u> Now allow root login on that machine:<br />
<pre class="prettyprint">Using username "ec2-user".
Authenticating with public key "imported-openssh-key"
[ec2-user@ip-10-0-0-57 ~]$ sudo su
[root@ip-10-0-0-57 ec2-user]# cd /root/.ssh
[root@ip-10-0-0-57 .ssh]# cp /home/ec2-user/.ssh/authorized_keys .
cp: overwrite `./authorized_keys'? y
[root@ip-10-0-0-57 .ssh]# chmod 700 /root/.ssh
[root@ip-10-0-0-57 .ssh]# chmod 640 /root/.ssh/authorized_keys
[root@ip-10-0-0-57 .ssh]# service sshd restart
</pre>
<u style="font-weight: bold;">Step #4:</u> Now you should be able to log in as the root user on that server. So log in again as root and check the current Java version:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# java -version
</pre>
The AMI I used for this server didn't have Java pre-installed, so I'm going to install the latest version of Java 1.6 on this server. It is recommended not to use Java 1.7 with Cassandra.<br />
<br />
<u style="font-weight: bold;">Step #5:</u> Download Java rpm (<span style="font-family: Calibri, sans-serif; font-size: 11pt; line-height: 115%;">jre-6u43-linux-x64-rpm.bin) </span>from the following location:<br />
<br />
<div class="MsoNormal">
<a href="http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jre-6u43-oth-JPR">Oracle jre-6u43-linux-x64-rpm.bin</a></div>
<div class="MsoNormal">
<u style="font-weight: bold;"><br /></u>
<u style="font-weight: bold;">Step #6:</u> Copy your downloaded rpm to that server by using <a href="http://winscp.net/eng/index.php">WinSCP</a>. (You can use any other tool if you want).</div>
<div class="MsoNormal">
<u style="font-weight: bold;"><br /></u>
<u style="font-weight: bold;">Step #7:</u> Give required permissions to that rpm and install it:</div>
<div class="MsoNormal">
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# chmod a+x jre-6u43-linux-x64-rpm.bin
[root@ip-10-0-0-57 ~]# ./jre-6u43-linux-x64-rpm.bin
</pre>
</div>
<div class="MsoNormal">
<b><u>Step #8:</u></b> Java should now be installed successfully, and it should show the expected version information:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# java -version
java version "1.6.0_43"
Java(TM) SE Runtime Environment (build 1.6.0_43-b01)
Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01, mixed mode)
</pre>
</div>
<div class="MsoNormal">
<b><u>Step #9:</u></b> Now download the Cassandra 1.2.3 rpm from the following location and copy it to your server with WinSCP:</div>
<br />
<a href="http://rpm.datastax.com/community/noarch/">DataStax Cassandra 1.2.3 rpm</a><br />
<br />
You can also install it from the DataStax repository (check the DataStax Cassandra manual for details); I'm following this approach simply as a preference.<br />
<br />
<b><u>Step #10:</u></b> After you copy the Cassandra rpm to your server, install it:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# yum install cassandra12-1.2.3-1.noarch.rpm
</pre>
<b><u>Step #11:</u></b> Cassandra should be installed successfully, and you can check its status like this (by default, the Cassandra server is stopped right after it is installed):<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# /etc/init.d/cassandra status
cassandra is stopped
</pre>
<b><u>Step #12:</u></b> During the Cassandra installation, yum pulls in the OpenJDK version of Java as a dependency, which overwrites your Java setting. Follow these steps to switch Java back to the Oracle version:<br />
<script src="http://gist.github.com/7019759.js?file=configure-java"></script>
<br />
<div style="text-align: justify;">
<b><u>Step #13:</u></b> At this point, Cassandra is ready to start. But before you start it, you need to update its configuration file for your cluster. For a packaged install, all configuration files live in "<b><i>/etc/cassandra/conf</i></b>" by default. For the cluster, I will change one file there: <b>cassandra.yaml</b>, the main configuration file for Cassandra. </div>
<br />
<div style="text-align: justify;">
There are many properties you can change in the main configuration file; the details are here: <a href="http://www.datastax.com/docs/1.2/configuration/node_configuration">http://www.datastax.com/docs/1.2/configuration/node_configuration</a>. Since I am just installing the most basic version of a Cassandra cluster, I will change only the following properties:</div>
<ul style="text-align: left;">
<li style="text-align: justify;">cluster_name: The name of your cluster. It must be the same on all hosts or instances.</li>
<li style="text-align: justify;">initial_token: The node's position on the ring. For this setup, I will set this value manually, but that is not required: from version 1.2 you can also set up your cluster with the num_tokens property (virtual nodes) instead. You can read more about both options and try out different configurations once you are familiar with Cassandra.</li>
<li style="text-align: justify;">partitioner: Determines how data is distributed across the nodes. Remember, the partitioner cannot be changed later without reloading all of your data, so configure the correct partitioner before initializing your cluster.</li>
<li style="text-align: justify;">seed_provider: A comma-delimited list of IP addresses to use as contact points when a node joins the cluster. This value should be the same on all of your hosts.</li>
<li style="text-align: justify;">listen_address: Local IP address of each host.</li>
<li style="text-align: justify;">rpc_address: The listen address for client connections. Set it to 0.0.0.0 to listen on all configured interfaces.</li>
<li style="text-align: justify;">endpoint_snitch: Sets which snitch Cassandra uses for locating nodes and routing requests. Since I'm installing the Cassandra cluster on AWS within a single region, I will use Ec2Snitch. Again, this value should be the same on all of your hosts.</li>
</ul>
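To make the shape of these settings concrete, here is an illustrative cassandra.yaml fragment for the first node of this setup. The cluster name here is a placeholder, the seed list assumes node1 (10.0.0.57) is the only seed, and the initial_token shown is the first of the three evenly spaced Murmur3 tokens generated in Step #14:

```yaml
# Illustrative fragment for node1 (10.0.0.57); names marked as
# placeholders are assumptions, not values from the original setup.
cluster_name: 'MyCluster'               # placeholder name; must match on every node
initial_token: -9223372036854775808     # node-specific; first of the 3 Murmur3 tokens
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.57"              # assumed seed list; same on every node
listen_address: 10.0.0.57               # this node's private IP
rpc_address: 0.0.0.0                    # accept client connections on all interfaces
endpoint_snitch: Ec2Snitch              # same on every node
```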
<div>
<div style="text-align: justify;">
So, after modifying, here is my updated configuration file (showing only the settings I changed):</div>
</div>
<div>
<script src="http://gist.github.com/7019759.js?file=config"></script>
<b><u>Step #14:</u></b> Generate the <i>initial_token</i> values for your partitioner.<br />
<br />
<div style="text-align: justify;">
<b style="font-weight: bold;">Murmur3Partitioner:</b> This is a new partitioner available from Cassandra 1.2. The <i>initial_token</i> value depends on the partitioner you are using. To generate <i>initial_token</i> values for Murmur3Partitioner, you can run the following commands<b>:</b></div>
<script src="http://gist.github.com/7019759.js?file=Murmur3Partitioner"></script>
<b>Note</b>: Here, <b>3</b> is the number of nodes I will use for my cluster setup. Change that value based on your needs.</div>
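The token math behind those commands is simple: Murmur3 tokens span -2**63 to 2**63 - 1, and for N nodes you place tokens at even intervals across that range. This Python sketch (with the node count of 3 used in this setup) reproduces the node2 and node3 values that appear in Step #16:

```python
def murmur3_tokens(node_count):
    """Evenly spaced initial_token values for Murmur3Partitioner.

    The Murmur3 token range is [-2**63, 2**63 - 1], so node i gets
    i * (2**64 // node_count) - 2**63.
    """
    return [i * (2**64 // node_count) - 2**63 for i in range(node_count)]

for i, token in enumerate(murmur3_tokens(3), start=1):
    print("node%d: %d" % (i, token))
# node1: -9223372036854775808
# node2: -3074457345618258603
# node3: 3074457345618258602
```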
<div>
<b style="text-align: justify;"><br /></b>
<b style="text-align: justify;">RandomPartitioner</b><span style="text-align: justify;">: If you want to use RandomPartitioner instead, you can generate your </span><b style="text-align: justify;">initial_token</b><span style="text-align: justify;"> values with the token-generator tool that comes with the Cassandra installation:</span><br />
<script src="http://gist.github.com/7019759.js?file=RandomPartitioner"></script>
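For RandomPartitioner the same even-spacing idea applies, but over the range 0 to 2**127 - 1. As a sketch of what the token-generator tool computes (the 3-node case matches this setup; verify against the tool's actual output before relying on these values):

```python
def random_partitioner_tokens(node_count):
    """Evenly spaced initial_token values for RandomPartitioner.

    RandomPartitioner tokens fall in [0, 2**127), so node i gets
    i * (2**127 // node_count).
    """
    return [i * (2**127 // node_count) for i in range(node_count)]

for i, token in enumerate(random_partitioner_tokens(3), start=1):
    print("node%d: %d" % (i, token))
# node1: 0
# node2: 56713727820156410577229101238628035242
# node3: 113427455640312821154458202477256070484
```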
<b><u>Step #15:</u></b> Now, I am going to create a new AMI from my current instance so that when I create a new instance from that, I will have Cassandra installation ready with expected configuration.</div>
<div>
<br /></div>
Since I'm installing a 3-node cluster, I will create two new instances from the AMI I just created, which gives me IP addresses like this:<br />
<br />
- cassandra node1 -> 10.0.0.57<br />
- cassandra node2 -> 10.0.0.58<br />
- cassandra node3 -> 10.0.0.59<br />
<br />
If you are using Amazon Virtual Private Cloud (VPC), then you have the option to choose specific IP addresses based on your needs.<br />
<br />
<b><u>Step #16:</u></b> Remember, even though you created the instances from the AMI, you still need to change the values in the <b>cassandra.yaml</b> file that vary per host. Those are:<br />
<b><br /></b>
Node #2:<br />
<pre class="prettyprint">initial_token: -3074457345618258603
listen_address: 10.0.0.58
</pre>
Node #3:<br />
<pre class="prettyprint">initial_token: 3074457345618258602
listen_address: 10.0.0.59
</pre>
<b><u>Step #17:</u></b> You may need to update your "<b><i>/etc/hosts</i></b>" file in case your hostname is not configured. I have updated that file on each server like this:<br />
<script src="http://gist.github.com/7019759.js?file=localhost"></script>
<b><u>Step #18:</u></b> That's it! Now you are ready to start Cassandra. Execute this on each node to start your cluster:<br />
<pre class="prettyprint">[root@ip-10-0-0-57 ~]# /etc/init.d/cassandra start
Starting cassandra: OK
</pre>
<b><u>Step #19:</u></b> You can check the status of your Cassandra by executing this:<br />
<script src="http://gist.github.com/7019759.js?file=localhost-ring"></script>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghwCGaP4PdDIpl-2pPGYv6FMpaa7UBrQ3i24T4KE78NMgJ6hJS0j3j4sSY9u3ishsglrenmyXZ6SoqeeKAnChYnEb3xKt5najygZ2v50D8yHVWsthC5QYrvl4fthg-VpmAIE9AK8_a3kfI/s1600/status-2.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="185" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEghwCGaP4PdDIpl-2pPGYv6FMpaa7UBrQ3i24T4KE78NMgJ6hJS0j3j4sSY9u3ishsglrenmyXZ6SoqeeKAnChYnEb3xKt5najygZ2v50D8yHVWsthC5QYrvl4fthg-VpmAIE9AK8_a3kfI/s400/status-2.JPG" width="400" /></a></div>
<br />
<div style="text-align: justify;">
The two main things to look at in the status output are the "<b>Owns</b>" and "<b>Status</b>" columns. You can see here that all nodes are up and each owns the same percentage of the total data. </div>
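The equal ownership follows directly from the evenly spaced tokens: each node owns the slice of the ring between its predecessor's token and its own. A quick sanity check in Python, using this setup's three Murmur3 tokens:

```python
# Murmur3 tokens assigned to the three nodes in this setup
tokens = [-9223372036854775808, -3074457345618258603, 3074457345618258602]

RING = 2**64  # size of the Murmur3 token space

# Each node owns the span from the previous token (wrapping around) to its own
ring = sorted(tokens)
spans = [(ring[i] - ring[i - 1]) % RING for i in range(len(ring))]
ownership = [100.0 * span / RING for span in spans]

for i, pct in enumerate(ownership, start=1):
    print("node%d owns %.2f%%" % (i, pct))  # each node owns ~33.33%
```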
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
As I said at the beginning, this is the most basic (or minimal) setup of a Cassandra cluster. There are a lot of things you can change, tune, and modify based on your needs. Once you are familiar with basic Cassandra, I highly recommend experimenting with different configurations. The more you try, the more you learn!</div>
<div style="text-align: justify;">
<br />
In another post, I will write about how to install <a href="http://www.datastax.com/what-we-offer/products-services/datastax-opscenter">DataStax OpsCenter</a> to monitor this Cassandra Cluster.</div>
<div style="text-align: justify;">
<br /></div>
<div style="text-align: justify;">
<b>Note:</b> For privacy reasons, I had to modify several lines in this post from my original. So if something is not working or you face any issues, please do not hesitate to contact me.</div>
</div>Tanzir Musabbirhttp://www.blogger.com/profile/03929295039758618334noreply@blogger.com1