Spark, Hadoop, Hive and Programming: UDF, UDAF and UDTF in Hive

Tuesday, February 28, 2017

UDF, UDAF and UDTF in Hive

There are two different interfaces you can use for writing UDFs for Apache Hive.

Simple API - org.apache.hadoop.hive.ql.exec.UDF
Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF

How to write a UDF (User-Defined Function) in Hive?

1. Create a Java class for the user-defined function that extends org.apache.hadoop.hive.ql.exec.UDF
2. Implement the evaluate() method.

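package com.xyz.udf;

import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;

// Simple-API UDF that sums the non-null elements of an array<double> argument.
public class ArraySum extends UDF {

    // Hive locates evaluate() by reflection; null elements are skipped.
    public double evaluate(List<Double> value) {
        double sum = 0;
        for (int i = 0; i < value.size(); i++) {
            if (value.get(i) != null) {
                sum += value.get(i);
            }
        }
        return sum;
    }
}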

3. Package your Java class into a JAR file
4. ADD your JAR in the Hive shell

ADD JAR Test_UDF-1.0-SNAPSHOT-jar-with-dependencies.jar;

5. CREATE a TEMPORARY FUNCTION in Hive that points to your Java class

CREATE TEMPORARY FUNCTION arraySum AS "com.xyz.udf.ArraySum";

6. Use it in Hive SQL and have fun!

SELECT arraySum(array(1.0, 2.0, 3.0)) FROM table_name;
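Before packaging the JAR, the summation logic can be checked outside Hive. The sketch below is plain Java with no Hive classes, so it runs without Hive on the classpath. The names ArraySumLogic and sum() are hypothetical and only used for illustration. It also returns null for a null list, an assumption about the desired behavior; evaluate() as written above would throw a NullPointerException if it were passed a null list.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the summation logic; not a Hive UDF by itself.
// ArraySumLogic and sum() are hypothetical names, not part of the Hive API.
class ArraySumLogic {

    // Returns null for a null list (NULL in, NULL out) and skips null elements.
    static Double sum(List<Double> values) {
        if (values == null) {
            return null;
        }
        double total = 0;
        for (Double v : values) {
            if (v != null) {
                total += v;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1.0, 2.0, 3.0)));  // prints 6.0
        System.out.println(sum(Arrays.asList(1.5, null, 2.5))); // prints 4.0
        System.out.println(sum(null));                          // prints null
    }
}
```

A UDF's evaluate() method could delegate to a helper like this, which keeps the logic unit-testable without a Hive session.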

How to write a UDAF (User-Defined Aggregation Function)?

Create a Java class that extends org.apache.hadoop.hive.ql.exec.UDAF
Create an inner class that implements UDAFEvaluator
Implement five methods:

init() – initializes the evaluator and resets its internal state. Using new Column() for the state, for example, indicates that no values have been aggregated yet.
iterate() – called every time there is a new value to be aggregated. The evaluator should update its internal state with the result of the aggregation (a sum in this example). It returns true to indicate that the input was valid.
terminatePartial() – called when Hive wants a result for the partial aggregation. The method must return an object that encapsulates the state of the aggregation.
merge() – called when Hive decides to combine one partial aggregation with another.
terminate() – called when the final result of the aggregation is needed.

Compile the class and package it into a JAR
ADD JAR <JarName>
CREATE a TEMPORARY FUNCTION in the Hive CLI
Run an aggregation query and verify the output
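The order in which Hive calls the five evaluator methods is easier to see in a small simulation. The sketch below is plain Java and does not use any Hive classes: in a real UDAF the outer class would extend org.apache.hadoop.hive.ql.exec.UDAF and the inner class would implement UDAFEvaluator. The names SumLifecycleSketch, SumEvaluator and aggregate() are hypothetical, and the fields of Column are an assumption, since only new Column() is mentioned in the description of init().

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the UDAF evaluator lifecycle for a sum.
// All names here are hypothetical; nothing in this file is Hive API.
class SumLifecycleSketch {

    // Aggregation state. The field layout is an assumption.
    static class Column {
        double sum = 0;
        boolean empty = true; // true until at least one value is aggregated
    }

    static class SumEvaluator {
        private Column col;

        SumEvaluator() {
            init();
        }

        // Resets the state: no values aggregated yet.
        void init() {
            col = new Column();
        }

        // Called once per input value; null values are ignored.
        boolean iterate(Double value) {
            if (value != null) {
                col.sum += value;
                col.empty = false;
            }
            return true;
        }

        // Hands the partial state to the next stage (null if nothing was seen).
        Column terminatePartial() {
            return col.empty ? null : col;
        }

        // Folds another partial state into this one.
        boolean merge(Column other) {
            if (other != null) {
                col.sum += other.sum;
                col.empty = false;
            }
            return true;
        }

        // Final result; null when no values were aggregated.
        Double terminate() {
            if (col.empty) {
                return null;
            }
            return col.sum;
        }
    }

    // Simulates two partial aggregations whose results are merged into a final one.
    static Double aggregate(List<Double> split1, List<Double> split2) {
        SumEvaluator partial1 = new SumEvaluator();
        for (Double v : split1) {
            partial1.iterate(v);
        }
        SumEvaluator partial2 = new SumEvaluator();
        for (Double v : split2) {
            partial2.iterate(v);
        }
        SumEvaluator finalStage = new SumEvaluator();
        finalStage.merge(partial1.terminatePartial());
        finalStage.merge(partial2.terminatePartial());
        return finalStage.terminate();
    }

    public static void main(String[] args) {
        // prints 6.0
        System.out.println(aggregate(Arrays.asList(1.0, 2.0), Arrays.asList(3.0, null)));
    }
}
```

The split into two partial evaluators mirrors how partial results from separate tasks are combined; the exact number of partial aggregations in a real query is decided by Hive.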

How to write a UDTF (User-Defined Table Function)?

A user-defined table function (UDTF) takes one row as input and returns multiple rows as output. Hive's built-in EXPLODE() function is an example: if an array column USER_IDS holds [10, 12, 5, 45], then SELECT EXPLODE(USER_IDS) returns 10, 12, 5 and 45 as four separate rows.

Create a Java class that extends the base class GenericUDTF (org.apache.hadoop.hive.ql.udf.generic.GenericUDTF)
Override three methods:

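initialize() – validates the argument types and returns an ObjectInspector describing the output rows.
process() – called once per input row; emits zero or more output rows by calling forward().
close() – called when there are no more input rows to process.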

Package your Java class into a JAR file
ADD your JAR
CREATE a TEMPORARY FUNCTION in Hive that points to your Java class
Use it in Hive SQL
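The one-row-in, many-rows-out behavior can be sketched in plain Java. The class below does not extend GenericUDTF and uses no Hive classes, so it is only a model of the call sequence: a Consumer stands in for Hive's forward() method, and initialize() takes an argument count instead of ObjectInspectors. The names ExplodeSketch and explode() are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model of the UDTF call sequence: initialize(), process(), close().
// Not Hive API: a real UDTF extends GenericUDTF and works with ObjectInspectors.
class ExplodeSketch {

    // Stands in for forward(): receives each output row.
    private final Consumer<Object[]> collector;

    ExplodeSketch(Consumer<Object[]> collector) {
        this.collector = collector;
    }

    // Checks the arguments and declares the output columns (simplified).
    List<String> initialize(int argumentCount) {
        if (argumentCount != 1) {
            throw new IllegalArgumentException("explode takes exactly one argument");
        }
        return Arrays.asList("col");
    }

    // Called once per input row; emits one output row per array element.
    void process(Object[] args) {
        List<?> elements = (List<?>) args[0];
        if (elements == null) {
            return; // a NULL array produces no rows
        }
        for (Object element : elements) {
            collector.accept(new Object[] { element });
        }
    }

    // Nothing is buffered in this sketch, so there is nothing to flush.
    void close() {
    }

    // Drives the three methods for a single input row holding an array of IDs.
    static List<Integer> explode(List<Integer> userIds) {
        List<Integer> rows = new ArrayList<>();
        ExplodeSketch udtf = new ExplodeSketch(row -> rows.add((Integer) row[0]));
        udtf.initialize(1);
        udtf.process(new Object[] { userIds });
        udtf.close();
        return rows;
    }

    public static void main(String[] args) {
        // prints [10, 12, 5, 45], one list entry per output row
        System.out.println(explode(Arrays.asList(10, 12, 5, 45)));
    }
}
```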
