Wednesday, April 19, 2017

Over and Window in Hive

The OVER clause
  • OVER with standard aggregates:
    • COUNT
    • SUM
    • MIN
    • MAX
    • AVG
  • OVER with a PARTITION BY statement with one or more partitioning columns of any primitive datatype.
  • OVER with PARTITION BY and ORDER BY with one or more partitioning and/or ordering columns of any datatype.
    • OVER with a window specification. Windows can be defined separately in a WINDOW clause. Window specifications support the following formats:
      (ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
      (ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
      (ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING
      When ORDER BY is specified with missing WINDOW clause, the WINDOW specification defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW.
      When both ORDER BY and WINDOW clauses are missing, the WINDOW specification defaults to ROW BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

PARTITION BY with one partitioning column, no ORDER BY or window specification

SELECT a, COUNT(b) OVER (PARTITION BY c)
FROM T;

PARTITION BY with two partitioning columns, no ORDER BY or window specification

SELECT a, COUNT(b) OVER (PARTITION BY c, d)
FROM T;

PARTITION BY with one partitioning column, one ORDER BY column, and no window specification

SELECT a, SUM(b) OVER (PARTITION BY ORDER BY d)
FROM T;

PARTITION BY with two partitioning columns, two ORDER BY columns, and no window specification

SELECT a, SUM(b) OVER (PARTITION BY c, d ORDER BY e, f)
FROM T;

PARTITION BY with partitioning, ORDER BY, and window specification

SELECT a, SUM(b) OVER (PARTITION BY ORDER BY ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY ORDER BY ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY ORDER BY ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
FROM T;
SELECT a, AVG(b) OVER (PARTITION BY ORDER BY ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING)
FROM T;

There can be multiple OVER clauses in a single query. A single OVER clause only applies to the immediately preceding function call. In this example, the first OVER clause applies to COUNT(b) and the second OVER clause applies to SUM(b):
SELECT 
 a,
 COUNT(b) OVER (PARTITION BY c),
 SUM(b) OVER (PARTITION BY c)
FROM T;

Aliases can be used as well, with or without the keyword AS:
SELECT 
 a,
 COUNT(b) OVER (PARTITION BY c) AS b_count,
 SUM(b) OVER (PARTITION BY c) b_sum
FROM T;

WINDOW clause

SELECT a, SUM(b) OVER w
FROM T
WINDOW w AS (PARTITION BY ORDER BY ROWS UNBOUNDED PRECEDING);

Distinct counting for each partition

SELECT a, COUNT(distinct a) OVER (PARTITION BY b)
FROM T;

More Example are available here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics


No comments:

Post a Comment