Thursday, March 21, 2013

A basic Oozie coordinator job

Suppose you want to run your workflow every two hours, or once per day; that's when a coordinator job comes in very handy. There are several more use cases for the Oozie coordinator, but today I'm just showing you how to write a very basic coordinator job.

I'm assuming that you are already familiar with Oozie and have a workflow ready to be used in a coordinator job. For this tutorial, my Oozie workflow is a shell-based action workflow. I want to execute a shell script every two hours, starting from today and continuing for the next 10 days. My workflow.xml is already inside an HDFS directory.
Without the coordinator, I'm currently running it like this:
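The command itself did not survive in this post; a typical standalone run looks like the following, where the Oozie server URL is a placeholder:

```shell
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```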
Here is my job.properties file:
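The file contents were removed from this post; a minimal job.properties sketch for a shell-action workflow could look like this (host names and paths are placeholders):

```properties
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8050
queueName=default
# directory in HDFS that contains workflow.xml
oozie.wf.application.path=${nameNode}/user/ambari-qa/shell-action
```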
Now I want to run this workflow with a coordinator. The Oozie Coordinator Engine is responsible for the coordinator job, and the input to the engine is a Coordinator App. At least two files are required for each Coordinator App:
  1. coordinator.xml - The definition of the coordinator job. What triggers your workflow (time-based or input-based), how long it will continue, the workflow wait time - all of this information needs to be written in this coordinator.xml file.
  2. coordinator.properties - Contains properties for the coordinator job; it behaves the same as the job.properties file.
Based on my requirement, here is my coordinator.xml file:
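The original file was removed from this post; a minimal coordinator.xml matching the requirement (run every two hours, for ten days) could look like the sketch below, where the dates, name and paths are placeholders:

```xml
<coordinator-app name="shell-coord" frequency="${coord:hours(2)}"
                 start="2013-03-21T00:00Z" end="2013-03-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <!-- HDFS directory that contains workflow.xml -->
            <app-path>${nameNode}/user/ambari-qa/shell-action</app-path>
        </workflow>
    </action>
</coordinator-app>
```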
As I need to pass a properties file for the coordinator job, I cannot pass the previous job.properties file at the same time. That's why I need to move all properties from the job.properties file to the coordinator.properties file. Remember one thing: this properties file must have a property which specifies the location of the coordinator.xml file (similar to the workflow application path in job.properties). After moving those properties, my properties file became:
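The merged file was removed from this post; it might look something like this sketch, with host names and paths as placeholders:

```properties
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8050
queueName=default
# points at the HDFS directory that contains coordinator.xml
oozie.coord.application.path=${nameNode}/user/ambari-qa/shell-coord
```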

As you noticed, I mentioned the application path oozie.coord.application.path, and that path contains the coordinator.xml file.
Now I'm pretty much set. If I execute a coordinator job, it will execute the coordinator app located in the coordinator application path. The coordinator app has a tag <workflow><app-path>.... </app-path></workflow> which specifies the actual workflow location. At that location, I have my workflow.xml file. So that workflow.xml will be triggered based on how I defined the job in the coordinator.xml file.

I'm submitting my coordinator job by:
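The command was removed from this post; it would be along these lines, with the server URL as a placeholder:

```shell
oozie job -oozie http://localhost:11000/oozie -config coordinator.properties -run
```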
Once you are running your coordinator job successfully, I highly recommend going through the Oozie coordinator documentation and trying out some other use cases and alternatives.

Note: For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.

Friday, March 8, 2013

Oozie Example: Sqoop Actions

To run a Sqoop action through Oozie, you need at least two files: a) workflow.xml and b) job.properties. But if you prefer to pass Sqoop options through a parameter file, then you also need to copy that parameter file into your Oozie workflow application folder. I prefer to use a parameter file for Sqoop, so I'm passing that file too.

In this tutorial, what I'm trying to achieve is to run a Sqoop action which exports data from HDFS to Oracle. In my previous post, I already wrote about how to do import/export between HDFS and Oracle. Before running the Sqoop action through Oozie, make sure your Sqoop command works without any errors. Once it works without Oozie, then try it through Oozie using a Sqoop action.

I'm assuming that when you execute the following line, it completes successfully and the data is loaded into Oracle without any errors:
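The command was removed from this post; a Sqoop export driven by an options file generally looks like this (the export directory matches the HDFS folder mentioned below; the options file name comes from later in this post):

```shell
sqoop --options-file option.par \
      --export-dir /user/ambari-qa/example/output/sqoop/emprecord
```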
A successful Sqoop export should end with a message like the following on the console:
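The exact line was removed from this post; on a successful export, Sqoop logs a record count, roughly like this (the wording and count vary by version and data):

```
INFO mapreduce.ExportJobBase: Exported 10 records.
```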
In my Oracle database, I have already created the table "Emp_Record", which resembles the data present in HDFS (under the /user/ambari-qa/example/output/sqoop/emprecord folder). That means each row in the HDFS files represents a row in the table. The data in HDFS is tab-delimited, and each field maps to a column of "Emp_Record". To know more about this, please check my previous post, as I'm using the same table and HDFS files here.

So, here is my option.par file which I'm using for my Sqoop export:
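The file contents were removed from this post; a Sqoop options file for an export like this puts each option and value on its own line. A sketch, with the connection string, credentials and delimiter as placeholders:

```
export
--connect
jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL
--username
scott
--password
tiger
--table
Emp_Record
--fields-terminated-by
'\t'
```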
And my workflow.xml file:
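The file was removed from this post; a Sqoop action workflow along these lines would fit the description, with the app name and transitions as assumptions:

```xml
<workflow-app name="sqoop-export-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="sqoop-node"/>
    <action name="sqoop-node">
        <sqoop xmlns="uri:oozie:sqoop-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <arg>--options-file</arg>
            <arg>option.par</arg>
            <arg>--export-dir</arg>
            <arg>/user/ambari-qa/example/output/sqoop/emprecord</arg>
            <!-- ships the parameter file with the action -->
            <file>option.par</file>
        </sqoop>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Sqoop failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```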
As you can see, all the Sqoop options which we generally use on the command line can be passed as arguments using the <arg> tag. If you do not want to use a parameter file, then you need to pass each part of the command in a separate <arg> tag. Since I'm using a parameter file for this Sqoop action, I also need to put it inside the Oozie workflow application path and reference it through a <file> tag. So, my Oozie workflow application path contains both workflow.xml and option.par.
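Without the parameter file, the same export split into individual <arg> tags might look like this sketch (connection details are placeholders):

```xml
<arg>export</arg>
<arg>--connect</arg>
<arg>jdbc:oracle:thin:@//oracledb.example.com:1521/ORCL</arg>
<arg>--table</arg>
<arg>Emp_Record</arg>
<arg>--export-dir</arg>
<arg>/user/ambari-qa/example/output/sqoop/emprecord</arg>
```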
Finally, my job.properties file for this workflow:
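The file contents were removed from this post; a sketch with host names and paths as placeholders:

```properties
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8050
queueName=default
# needed so Oozie picks up the Sqoop JARs from the share lib
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/ambari-qa/sqoop-action
```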
The execution command for this Oozie workflow is the same as for others:
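The command was removed from this post; it would be along these lines, with the server URL as a placeholder:

```shell
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```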
A common issue: 

If your Sqoop job is failing, then you need to check your log. Most of the time (I'm using the Hortonworks distribution) you might see this error message in the log:
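The exact message was removed from this post, but the classic symptom of a missing Oracle driver in Sqoop looks like:

```
java.lang.RuntimeException: Could not load db driver class: oracle.jdbc.OracleDriver
```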
This happens when the required Oracle library file is not in Oozie's classpath. On the command line, the fix is to manually copy the required ojdbc6.jar file into Sqoop's lib folder "/usr/lib/sqoop/lib". When running Sqoop through Oozie, you need to do the same thing, but in Oozie's Sqoop share lib folder in HDFS. You can do that by executing:
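A sketch of the copy, assuming the Hortonworks-style share lib location in HDFS:

```shell
hadoop fs -put /usr/lib/sqoop/lib/ojdbc6.jar /user/oozie/share/lib/sqoop/
```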
Note: For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.

Sunday, March 3, 2013

Oozie Example: Java Action / MapReduce Jobs

Running a Java action through Oozie is very easy, but there are some things you need to consider before you run it. In this tutorial, I'm going to execute a very simple Java action. I have a JAR file, TestMR.jar, which is a MapReduce application, so it will be executed on the Hadoop cluster as a MapReduce job.

TestMR.jar contains a class with a public static void main(String[] args) method that initiates the whole application. To run a Java action, you need to pass this main class name through the <main-class> tag.

This is the workflow.xml file for a Java action with minimum number of parameters:
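The file was removed from this post; a minimal Java action workflow matching the description could look like this sketch, where the class name com.example.TestMR and the output path are assumptions:

```xml
<workflow-app name="java-main-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="java-node"/>
    <action name="java-node">
        <java>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <prepare>
                <!-- delete the output folder before execution -->
                <delete path="${nameNode}/user/ambari-qa/java-action/output"/>
            </prepare>
            <main-class>com.example.TestMR</main-class>
            <arg>-r</arg>
            <arg>6</arg>
        </java>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Java action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```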

Your Java action has to be configured with <job-tracker> and <name-node>. And as you know, Hadoop throws an exception if the output folder already exists; that's why I'm using the <prepare> tag, which deletes the output folder before execution. My JAR also takes command-line arguments. One of the arguments is "-r 6", which sets how many reducers I want for the MR job. So I'm using the <arg> tag to pass command-line arguments; you can have multiple <arg> tags for a single Java action. As with other actions, to take the "ok" transition, the main Java application needs to complete without any error. If it throws an exception, the workflow takes the "error" transition.

Now on to the folder structure inside HDFS. When Oozie executes any action, it automatically adds all JAR files and native libraries from the "lib" folder to its classpath. Here, "lib" is a sub-folder inside the Oozie workflow application path. So, if "java-action" is the workflow application path, then the structure would be:
- java-action
- java-action/workflow.xml
- java-action/lib

In my HDFS, I have:
And here is my job.properties file:
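The file contents were removed from this post; a sketch with host names and paths as placeholders:

```properties
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8050
queueName=default
oozie.wf.application.path=${nameNode}/user/ambari-qa/java-action
```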
That's pretty much it! Now you can execute your workflow by:
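The command was removed from this post; it would be along these lines, with the server URL as a placeholder:

```shell
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```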
Remember, this is a very basic and simple workflow to run a Java action through Oozie. You can do a lot more than this by using the several other options Oozie provides. Once you are able to run a simple workflow, I recommend going through the Oozie documentation and trying some workflows with different settings.

Consideration: Be careful about what you have inside your "lib" folder. If the version of a library you use in your application conflicts with the version of Hadoop's own copy, it will throw errors, and those types of errors are hard to track down. To avoid them, it's better to match your library files with the versions you have inside the "/usr/lib/[hadoop/hive/hbase/oozie]/lib" folders on your client node.

Note: For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.

Friday, March 1, 2013

Oozie Example: Hive Actions

Running Hive through Oozie is pretty straightforward, and it's getting simpler day by day. The first time I used it (on older versions), I faced some issues, mostly related to the classpath, though I resolved them. But with the recent versions (Hive 0.8+, Oozie 3.3.2+), I only faced one or two issues at most.

In this example, I'm going to execute a very simple Hive script through Oozie. I have a Hive table "temp", and it's currently empty. The script will load some data from HDFS into that specific Hive table.
And here is the content of script.hql:
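The script itself was removed from this post; loading files from an HDFS folder into a table typically takes a single statement. A sketch, using the "/hive-input/temp" folder mentioned later in this post:

```sql
LOAD DATA INPATH '/hive-input/temp' INTO TABLE temp;
```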
Now you need to set up your Oozie workflow application folder. One very important file you need in order to execute a Hive action through Oozie is hive-site.xml: when Oozie executes a Hive action, it needs Hive's configuration file. (You can provide multiple configuration files to a single action.) You can find your Hive configuration file at "/etc/hive/conf.dist/hive-site.xml" (the default location). Copy that file and put it inside your workflow application path in HDFS. Here is the list of files that I have in my Oozie Hive action's workflow application folder:
And here is my workflow.xml file:
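The file was removed from this post; a Hive action workflow matching the description could look like this sketch, with the app name and transitions as assumptions:

```xml
<workflow-app name="hive-wf" xmlns="uri:oozie:workflow:0.2">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- hive-site.xml sits in the application path, so file name only -->
            <job-xml>hive-site.xml</job-xml>
            <script>script.hql</script>
            <param>INPUT_PATH=${inputPath}</param>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```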
Look at the <job-xml> tag: since I'm putting hive-site.xml in my application path, I'm passing just the file name, not the whole location. If you want to keep that file somewhere else in HDFS, you can pass the full HDFS path there too. In older versions of Hive, users had to provide the hive-default.xml file via the property key oozie.hive.defaults when running an Oozie Hive action, but from Hive 0.8+ onward it's no longer required.

Here I'm using another tag, <param>, which is not required, but I'm using it just to show how to pass parameters between the Hive script, the job properties, and the workflow. If you use any parameter variable inside your Hive script, it needs to be passed through the Hive action. So you can do either:
  • <param>INPUT_PATH=${inputPath}</param> (where inputPath can be passed through job properties) , Or
  • <param>INPUT_PATH=/user/ambari-qa/input/temp</param>

Inside my HDFS, the "/hive-input/temp" folder contains the files which need to be loaded into the Hive table:
And here is my job.properties file:
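The file contents were removed from this post; a sketch with host names and paths as placeholders:

```properties
nameNode=hdfs://namenode.example.com:8020
jobTracker=jobtracker.example.com:8050
queueName=default
# needed so Oozie picks up the Hive JARs from the share lib
oozie.use.system.libpath=true
inputPath=/hive-input/temp
oozie.wf.application.path=${nameNode}/user/ambari-qa/hive-action
```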
That's it! You can now run your Hive workflow by executing this on the client node:
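The command was removed from this post; it would be along these lines, with the server URL as a placeholder:

```shell
oozie job -oozie http://localhost:11000/oozie -config job.properties -run
```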

Two common issues:
You might face some issues if the required JAR files are not present inside the "/user/oozie/share/lib/hive" folder in HDFS. One of the common issues is not having the hcatalog* JAR files in that folder. In that case you will see something like this in the log:
In that case, you need to manually copy the required JAR files into that folder. You can do that as follows:
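The command was removed from this post; a sketch of the copy, where the local HCatalog path is an assumption that varies by distribution:

```shell
hadoop fs -put /usr/lib/hcatalog/share/hcatalog/hcatalog-*.jar /user/oozie/share/lib/hive/
```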

Another common issue you might face is:
SemanticException [Error 10001]: Table not found
Even though you can see that your table exists, you might get this error when running through Oozie. Most of the time it happens when Hive is not properly pointing to the right metastore. Usually the problem goes away when you copy the correct hive-site.xml into the Hive lib folder inside HDFS. Make sure you check your hive-site.xml file to see that all properties are correctly set, like "hive.metastore.uris", "javax.jdo.option.ConnectionURL", and "javax.jdo.option.ConnectionDriverName". But I and other users (see "Hive action failing in oozie") have also found that the above error message is ambiguous and doesn't give much insight: if the expected JAR files are not present in the share lib folder, Hive throws the same error message! So be careful about what you have in the classpath when running Hive through Oozie.

Note: For privacy purposes, I had to modify several lines of this post from my original. So if you find that something is not working or you face any issues, please do not hesitate to contact me.