Sunday, March 3, 2013

Oozie Example: Java Action / MapReduce Jobs

Running a Java action through Oozie is very easy. But there are some things you need to consider before you run your Java action. In this tutorial, I'm going to execute a very simple Java action. I have a JAR file TestMR.jar which is a MapReduce application. So this application will be executed on the Hadoop cluster as a Map-Reduce job. 

TestMR.jar file has a class TestMR.java which has a public static void main method(String args[]) that initiates the whole application. To run a Java action, you need to pass the main Java class name through the tag <main-class>.

This is the workflow.xml file for a Java action with minimum number of parameters:

Your Java action has to be configured with <job-tracker> and <name-node>. And as you know, Hadoop will throw exceptions if the output folder is already exists. That's why I'm using <prepare> tag which will delete the output folder before execution. My jar also takes command line arguments. One of the argument is "-r 6" which means how many reducers I want to use for the MR job. So I'm using "<arg>" tag to pass command line arguments. You can have multiple <arg> for a single Java action. As like other actions, to indicate a "ok" transition, the main Java application needs to be completed without any error. If it throws any exception, the workflow will indicate a "error" transition.

Now comes to the folder structure inside HDFS. When Oozie executes any action, it automatically adds all JAR files and native libraries from the "/lib" folder to its classpath. Here, "/lib" folder is a sub-folder inside Oozie workflow application path. So, if "java-action" is the workflow application path then the structure would be:
- java-action
- java-action/workflow.xml
- java-action/lib

In my HDFS, I have:
And here is my job.properties file:
That's pretty much it! Now you can execute your workflow by:
Remember, this is a very basic and simple workflow to run a Java action through Oozie. You can do a lot more than these by using several other options provided by Oozie. Once you are able to run a simple workflow, I would recommend you to go through Oozie documentation and try some workflows with different settings.

Consideration: Be careful about what you have inside your "/lib" folder. If the version of the library which you are using for your application conflicts with Hadoop's library file's version, it will throw errors and those type of errors are hard to find. To avoid those kind of errors, better to match your library files with the versions you have inside "/usr/lib/[hadoop/hive/hbase/oozie]/lib" folder on your client node.


Note: For privacy purpose, I had to modify several lines on this post from my original post. So if you find something is not working or facing any issues, please do not hesitate to contact me.

6 comments:

  1. As a beginner, i really find your blog very useful thanks for sharing.

    Hadoop training velachery

    ReplyDelete
  2. Hi Admin, I went through your article and it’s totally awesome. You can consider including RSS feed for easy content sharing, So that you can drive huge traffic to your blog. Hadoop Training in Chennai | Big Data Training in Chennai

    ReplyDelete
  3. i got EJ001 error code with "could not locate sharelib. It is compulsory to use sharelib to run mapreduce job with oozie.

    ReplyDelete
    Replies
    1. Yes, you can install sharelib if you haven't yet. If you don't have sharelib, Oozie will not find all required libraries. Follow this link to know more about sharelib: http://blog.cloudera.com/blog/2012/12/how-to-use-the-sharelib-in-apache-oozie/

      Delete



  4. In the site was very excellent,i think this is very best one compare others and then the more information to get after read this post.


    HADOOP Training in Chennai

    ReplyDelete
  5. Hi,

    did you created the .jar file and uploaded to HDFS.? If so please let me know how did you created a jar file.

    ReplyDelete