Tuesday, February 28, 2017

UDF, UDAF and UDTF in Hive

There are two different interfaces you can use for writing UDFs for Apache Hive. 
  • Simple API - org.apache.hadoop.hive.ql.exec.UDF
  • Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF
How to write UDF (User-Defined Functions) in Hive?
  1. Create a Java class for the user-defined function that extends org.apache.hadoop.hive.ql.exec.UDF
  2. Implement the evaluate() method.
package com.xyz.udf;
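import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;

// Simple-API UDF that sums the non-null elements of an array of doubles.
public class ArraySum extends UDF {

  // Hive locates evaluate() by reflection, based on the argument types.
  public double evaluate(List<Double> value) {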
    double sum = 0;
    for (int i = 0; i < value.size(); i++) {
      if (value.get(i) != null) {
        sum += value.get(i);
      }
    }
    return sum;
  }

}

  3. Package your Java class into a JAR file
  4. ADD your JAR in the Hive shell
ADD JAR Test_UDF-1.0-SNAPSHOT-jar-with-dependencies.jar;

  5. CREATE TEMPORARY FUNCTION in Hive, pointing to your Java class
CREATE TEMPORARY FUNCTION arraySum AS "com.xyz.udf.ArraySum";

  6. Use it in Hive SQL and have fun! Since evaluate() takes a list, pass the values as a single array:
SELECT arraySum(array(1.0, 2.0, 3.0)) FROM table_name;
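Before packaging the JAR, the summing logic can be checked locally with a plain JDK. The following is a minimal sketch, not the UDF itself: ArraySumLogic is a hypothetical stand-in that repeats the evaluate() body without the Hive base class, so no Hive dependency is needed to run it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for ArraySum: same evaluate() body, no Hive base class.
class ArraySumLogic {

  // Sums the non-null elements of the list.
  public double evaluate(List<Double> value) {
    double sum = 0;
    for (int i = 0; i < value.size(); i++) {
      if (value.get(i) != null) {
        sum += value.get(i);
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    ArraySumLogic f = new ArraySumLogic();
    // Null elements are skipped rather than causing an error.
    System.out.println(f.evaluate(Arrays.asList(1.0, null, 3.0)));
  }
}
```

Note that a null list (as opposed to a null element) would still throw a NullPointerException here, just as in the original evaluate().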


How to write UDAF (User-Defined Aggregation Functions)?
  • Create a Java class that extends org.apache.hadoop.hive.ql.exec.UDAF
  • Create an inner class that implements UDAFEvaluator
  • Implement five methods:
    • init() – initializes the evaluator and resets its internal state. In this example, assigning new Column() indicates that no values have been aggregated yet.
    • iterate() – called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of the aggregation (a sum in this example). Returning true indicates that the input was valid.
    • terminatePartial() – called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
    • merge() – called when Hive decides to combine one partial aggregation with another.
    • terminate() – called when the final result of the aggregation is needed.
  • Compile and Package JAR
  • ADD JAR <JarName>
  • CREATE TEMPORARY FUNCTION in the Hive CLI
  • Run an aggregation query and verify the output!
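The order in which the five methods are called is easier to see in a small simulation. The sketch below is plain Java and uses no Hive classes. SumEvaluator, its Column state holder and UdafLifecycleDemo are hypothetical names chosen for illustration, and the way the input is split into slices is an assumption about how Hive might distribute the work, not a description of its planner.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the five-method evaluator lifecycle (no Hive classes).
class SumEvaluator {

  // Hypothetical state holder, standing in for the Column object mentioned above.
  static class Column {
    double sum = 0;
  }

  private Column col;

  // Reset the state: no values have been aggregated yet.
  void init() {
    col = new Column();
  }

  // Called once per input value; null values are skipped.
  boolean iterate(Double value) {
    if (value != null) {
      col.sum += value;
    }
    return true;
  }

  // Return the state accumulated so far as a partial result.
  Column terminatePartial() {
    return col;
  }

  // Fold another partial result into this evaluator's state.
  boolean merge(Column other) {
    if (other != null) {
      col.sum += other.sum;
    }
    return true;
  }

  // Produce the final value of the aggregation.
  double terminate() {
    return col.sum;
  }
}

class UdafLifecycleDemo {

  // Simulates one task aggregating its slice of the input.
  static SumEvaluator.Column partial(List<Double> slice) {
    SumEvaluator e = new SumEvaluator();
    e.init();
    for (Double v : slice) {
      e.iterate(v);
    }
    return e.terminatePartial();
  }

  // Simulates the final stage merging all partial results.
  static double aggregate(List<List<Double>> slices) {
    SumEvaluator fin = new SumEvaluator();
    fin.init();
    for (List<Double> slice : slices) {
      fin.merge(partial(slice));
    }
    return fin.terminate();
  }

  public static void main(String[] args) {
    List<List<Double>> slices =
        Arrays.asList(Arrays.asList(1.0, 2.0), Arrays.asList(3.0, null));
    System.out.println(aggregate(slices));
  }
}
```

In a real UDAF the evaluator would be an inner class implementing UDAFEvaluator, and Hive, not a demo class, would decide when each method is called.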
How to write UDTF (User-Defined Table Functions)?

A user-defined table function (UDTF) takes one row as input and returns multiple rows as output. Hive's built-in EXPLODE() function is an example: if an array column USER_IDS holds 10,12,5,45, then SELECT EXPLODE(USER_IDS) returns 10, 12, 5 and 45 as four separate rows.
  • Create a Java class that extends the base class GenericUDTF (org.apache.hadoop.hive.ql.udf.generic.GenericUDTF)
  • Override 3 methods
    • initialize()
    • process()
    • close()
  • Package your Java class into a JAR file
  • ADD your JAR
  • CREATE TEMPORARY FUNCTION in Hive, pointing to your Java class
  • Use it in Hive SQL
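The three overridden methods follow a simple flow: set up once, emit rows per input row, then clean up. The sketch below mimics that flow in plain Java for an EXPLODE-like function. It uses no Hive classes: ExplodeSketch and its rows() accessor are hypothetical, and the private forward() method only imitates the role that GenericUDTF's forward() plays in handing output rows back to Hive.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of the initialize/process/close flow (no Hive classes).
class ExplodeSketch {

  // Stands in for the output rows that Hive would collect.
  private List<Object[]> output;

  // Set up per-query state. In Hive, this step also checks the argument
  // types and declares the output columns.
  void initialize() {
    output = new ArrayList<>();
  }

  // Called once per input row; emits one output row per array element.
  void process(List<Integer> userIds) {
    if (userIds == null) {
      return; // a null array produces no rows
    }
    for (Integer id : userIds) {
      forward(new Object[] {id});
    }
  }

  // Called after the last input row; nothing to flush in this sketch.
  void close() {
  }

  // Imitates handing one output row back to the engine.
  private void forward(Object[] row) {
    output.add(row);
  }

  // Hypothetical accessor so the emitted rows can be inspected.
  List<Object[]> rows() {
    return output;
  }
}
```

Calling initialize(), then process() with the list 10, 12, 5, 45, then close() leaves four single-column rows in rows(), matching the EXPLODE(USER_IDS) example above.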


