Thursday, February 21, 2013

Sqoop import/export from/to Oracle

I love Sqoop! It's a fun tool to work with and it's very powerful. Importing and exporting data from/to Oracle by Sqoop is pretty straightforward. One crucial thing you need to remember when working with Sqoop and Oracle together, that is using all capital letters for Oracle table names. Otherwise, Sqoop will not recognize Oracle tables.

This is my database (Oracle) related information:
  • URL: 10.0.0.24
  • db_name (SID): test
  • Schema: Employee_Reporting
  • Username: username1
  • Password: password1
  • Tablename: employee ( I'm going to export this table to HDFS by Sqoop)

Import from Oracle to HDFS:
Let's go through this option file (option.par):
You can see most of the parameters are self-explanatory. Notice that I'm providing table name in all capital letters. How you want to see the imported columns in HDFS? For that, we need to use --fields-terminated-by parameter. Here I'm passing "\t" for that parameter, which means that the column or field for each rows will be tab delimited after import. Sqoop will generate a class(with a list of setter and getter) to invoke the employee object, the name of that class name is defined by --class-name parameter. So in this case, it will create a class named Employee.java inside com.example.reporting package. I'm using --verbose parameter to print out information while Sqoop is working. It's not mandatory and you can ignore it if you want. --split-by parameter represents the name of the column which I want to use for splitting the import data. Here, ID is the primary key of the table Employee. You can use any WHERE clause for your import, in that case you need to pass that with the --where parameter. For the above example, it will import all rows from the table Employee where ID is less than or equal to 100000 (e.g. importing 100000 rows). You need to mention a HDFS location which will be used as a destination directory for the imported data (--target-dir parameter). Remember one thing here is that the target directory should not be existing prior to run import command otherwise Sqoop will throw an error. The last parameter -m represents the number of map tasks to run in parallel for the entire import job.

Once you have your option file ready, you can execute the Sqoop import command as:
sqoop import --options-file option.par
Using option file is not mandatory, I'm just using it for my convenience. You can also pass each of the parameter from your console and execute the import job. Example:

Export from Hive to Oracle:
For export, I will be using some of the parameters which I used during import as they are common for both import and export job. Assume I processed(by MR jobs) the data generated by import job and inserted them into Hive tables. Now I want to export those Hive tables to Oracle. If you are familiar with Hive, you may know that Hive moves/copies data to its warehouse folder by default. I'm using Hortonworks's distribution and for my case Hive'e warehouse folder is located at: "/apps/hive/warehouse/emp_record". Here, emp_record is one of the Hive table I want to export from.

I have already created a matching table "Emp_Record" in my Oracle inside the same schema "Employee_Reporting". To export the Hive table, I'm executing the following command:
Notice that instead of using --target-dir, I'm using --export-dir, this is the location of the Hive table's warehouse folder and data will be exporting from there.

Now assume, inside the warehouse directory, I have a file 00000_1 (which contains the data of Hive table Emp_Record) and some of its lines are:
As you can see, each of columns/fields are tab delimited and each of the rows are separated by a new line. Again we see here that there is an entire row which contains null as their values (Ideally you might not have null values as you might want to filter those values from your M-R jobs). But say we have all kind of values, so we need to tell Sqoop how to treat each of those values. Because of that, I'm using --input-fields-terminated-by parameter to inform that the fields are tab delimited and --input-lines-terminated-by parameter to distinguish rows. Again from Sqoop side, there are two kinds of column - string column and non-string column. Sqoop needs to know what string value is interpreting a null value. Because of that I'm using --input-null-string and --input-null-non-string parameters for two column types and passing '\\N' as their value because for my case '\\N' is null.


Wrapping Up:
Sometimes you will face some issues during export when your Oracle table has a very tight constraint (e.g. not null, time-stamp, expecting value in specific format, etc). In that case, the best idea is to export the Hive table/HDFS data to a temporary Oracle table without any modification to make Sqoop happy :). And then write a SQL script to convert and filter those data to your expected values and load them into your main Oracle table.

The above two examples are just showing a very basic import and export Sqoop job. There are a lot of setting in Sqoop you can use. Once you are able to run export/import job successfully, I would recommend you to try the same job with different parameters and see how it goes. You can find all available options in the Sqoop user guide.


Note: For privacy purpose, I had to modify several lines on this post from my original post. So if you find something is not working or facing any issues, please do not hesitate to contact me.

8 comments:

  1. This is very informative. We use sqoop import to pull data from oracle tables to our Hive database. However of late we are encountering java heap size errors in the log for sqoop import. Increasing the mappers have worked temporarily but the same problem has resurfaced in the following weeks. Any idea as to what this java heap error with respect to sqoop import be related to ? any insights would be very helpful.

    ReplyDelete
  2. Hi Rohan,
    Thanks for stop-by. What's the size of import you are using? Is it constant for all runs? We also experienced pretty similar issue but in our case decreasing the mappers helped us. We figured out that when the export size is too small and if we use >6 # of mappers, then it happens only (not always though). So the work-around which we followed was to change mapper size dynamically based on input size. Hope it helps.

    ReplyDelete
  3. Hi, Nice Post. I have tried to add a comment mentioning the problem what I am facing, am not sure weather it reached you or not. Here am again publishing it as Anonymous user. Import goes fine CLOB data in Oracle table to HIVE/HBASE. But after import the newline/tabs which are in CLOB data are been imported as '\x0A' and '\x09' which I dont want. Could you please let me know if there is any way to avoid such extra characters..thanks in advance.

    ReplyDelete
    Replies
    1. Hi,
      I haven't faced that issue so far. One thing you can try is - consider that CLOB data as a string and see how it goes to Hive/HBASE. You can do so by using --map-column-java. Like:

      sqoop import --map-column-java $columnName=String

      Delete
  4. Hi did you try importing oracle views and synonyms to hive

    ReplyDelete
  5. Is there a way to preserve new line characters and special characters as is while importing from Oracle to Hive using Sqoop?

    ReplyDelete
  6. It looks like there is no activity on the post for a year, but I decided to try my luck and post my question here anyways :). Hopefully, one of you sqoop experts can guide me in the right direction.

    I have a script to create a schema(TestV100), a table (Xy100) in that schema and export a tab delimited flat file to this oracle table.

    This is the shell script: – ExportOracleTestV100.sh
    #!/bin/bash
    HOST=$1
    USER=$2
    PASS=$3
    SCHEMA=$4
    PORT=$5
    SID=$6
    SQOOP=/usr/bin/sqoop
    JDBC="jdbc:oracle:thin:@$1:$5:$6"
    SQOOP_EVAL="$SQOOP eval --connect $JDBC --username $USER --password $PASS --query"
    #Create Schema and Tables;
    ${SQOOP_EVAL} "CREATE USER \"TestV100\" identified by \"password\""
    ${SQOOP_EVAL} "GRANT CONNECT TO \"TestV100\""
    ${SQOOP_EVAL} "ALTER USER \"TestV100\" QUOTA UNLIMITED ON USERS"
    ${SQOOP_EVAL} "DROP TABLE \"TestV100\".\"Xy100\""
    ${SQOOP_EVAL} "CREATE TABLE \"TestV100\".\"Xy100\"( \"a\" NVARCHAR2(255) DEFAULT NULL, \"x\" NUMBER(10,0) DEFAULT NULL, \"y\" NUMBER(10,0) DEFAULT NULL )"

    ## Load Data into tables; ##
    SQOOP_EXPORT="/usr/bin/sudo -u hdfs $SQOOP export --connect ${JDBC} --username $USER --password $PASS --export-dir"
    ${SQOOP_EXPORT} "/hdfs_nfs_mount/tmp/oracle/TestV100/Xy100.txt" --table "\"\"$SCHEMA\".\"Xy100\"\"" --fields-terminated-by "\t" --input-null-string null -m 1


    Input file: – cat /hdfs_nfs_mount/tmp/oracle/TestV100/Xy100.txt
    c 8 3
    a 1 4

    Execution - sh ./ExportOracleTestV100.sh oracle11 test password TestV100 1521 orcl --verbose

    Output:
    [root@abc-repo-app1 rv]# sh ./ExportOracleTestV100.sh oracle11 test password TestV100 1521 orcl --verbose
    Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
    Please set $ACCUMULO_HOME to the root of your Accumulo installation.
    15/11/02 12:40:07 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.4.1
    15/11/02 12:40:07 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
    15/11/02 12:40:07 INFO oracle.OraOopManagerFactory: Data Connector for Oracle and Hadoop is disabled.
    15/11/02 12:40:07 INFO manager.SqlManager: Using default fetchSize of 1000
    15/11/02 12:40:07 INFO tool.CodeGenTool: Beginning code generation
    15/11/02 12:40:08 INFO manager.OracleManager: Time zone has been set to GMT
    15/11/02 12:40:08 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM "TestV100"."Xy100" t WHERE 1=0
    15/11/02 12:40:08 ERROR sqoop.Sqoop: Got exception running Sqoop: java.lang.IllegalArgumentException: There is no column found in the target table "TestV100"."Xy100". Please ensure that your table name is correct.
    java.lang.IllegalArgumentException: There is no column found in the target table "TestV100"."Xy100". Please ensure that your table name is correct.
    at org.apache.sqoop.orm.ClassWriter.generate(ClassWriter.java:1658)
    at org.apache.sqoop.tool.CodeGenTool.generateORM(CodeGenTool.java:96)
    at org.apache.sqoop.tool.ExportTool.exportTable(ExportTool.java:64)
    at org.apache.sqoop.tool.ExportTool.run(ExportTool.java:100)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)


    I see a lot of posting online for sqoop import and this error, and solution is to change the table name to UPPERCASE in the command. But, I am running export. Also, the oracle table HAS to be created with mixed case in my environment.

    I am running:
    Sqoop version: 1.4.5-cdh5.4.1
    Oracle Version: 11.1.0.6.0
    Ojdbc Driver: ojdbc6.jar

    ReplyDelete