Monday, June 24, 2019

HBase

Apache HBase is a NoSQL column oriented database which is used to store the sparse data sets. It runs on top of the Hadoop distributed file system (HDFS) and it can store any kind of data.

Use Apache HBase™ when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google's Bigtable: A Distributed Storage System for Structured Data.

Features:

Linear and modular scalability.
Strictly consistent reads and writes.
Automatic and configurable sharding of tables
Automatic failover support between RegionServers.
Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
Easy to use Java API for client access.
Block cache and Bloom Filters for real-time queries.
Query predicate push down via server side Filters
Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options
Extensible jruby-based (JIRB) shell
Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX


What are the key components of HBase?

The key components of HBase are Zookeeper, RegionServer and HBase Master. 

Key components of HBase

When would you use HBase?

  • HBase is used in cases where we need random read and write operations and it can perform a number of operations per second on a large data sets.
  • HBase gives strong data consistency.
  • It can handle very large tables with billions of rows and millions of columns on top of commodity hardware cluster.

Define the difference between Hive and HBase?

Apache Hive is a data warehousing infrastructure built on top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive Query Language (HQL), which is a SQL-like language, that gets translated into MapReduce jobs. Hive performs batch processing on Hadoop.
Apache HBase is NoSQL key/value store which runs on top of HDFS. Unlike Hive, HBase operations run in real-time on its database rather than MapReduce jobs. HBase partitions the tables, and the tables are further splitted into column families. 
Hive and HBase are two different Hadoop based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database of Hadoop. We can use them together. Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from HBase to Hive and vice-versa.

Explain the data model of HBase.

HBase comprises of:
  • Set of tables.
  • Each table consists of column families and rows.
  • Row key acts as a Primary key in HBase.
  • Any access to HBase tables uses this Primary Key.
  • Each column qualifier present in HBase denotes attributes corresponding to the object which resides in the cell.

Define column families?

Column Family is a collection of columns, whereas row is a collection of column families.

What is RegionServer?

A table can be divided into several regions. A group of regions is served to the clients by a Region Server.

What is the use of ZooKeeper?

The ZooKeeper is used to maintain the configuration information and communication between region servers and clients. It also provides distributed synchronization. It helps in maintaining server state inside the cluster by communicating through sessions.
Every Region Server along with HMaster Server sends continuous heartbeat at regular interval to Zookeeper and it checks which server is alive and available. It also provides server failure notifications so that, recovery measures can be executed.

Define catalog tables in HBase?

Catalog tables are used to maintain the metadata information.

How HBase Handles the write failure?

Failures are common in large distributed systems, and HBase is no exception.
If the server hosting a MemStore that has not yet been flushed crashes. The data that was in memory, but not yet persisted are lost. HBase safeguards against that by writing to the WAL (Write Ahead Log)  before the write completes. Every server that’s part of the.
HBase cluster keeps a WAL to record changes as they happen. The WAL is a file on the underlying file system. A write isn’t considered successful until the new WAL entry is successfully written. This guarantee makes HBase as durable as the file system backing it. Most of the time, HBase is backed by the Hadoop Distributed Filesystem (HDFS). If HBase goes down, the data that were not yet flushed from the MemStore to the HFile can be recovered by replaying the WAL.

What is a Bloom filter and how does it help in searching rows?

HBase supports Bloom Filter to improve the overall throughput of the cluster. A HBase Bloom Filter  is a space efficient mechanism to test whether a HFile contains a specific row or row-col cell.
Without Bloom Filter, the only way to decide if a row key is present in a HFile  is to check the HFile’s block index, which stores the start row key of each block in the HFile. There are many rows drops between the two start keys. So, HBase has to load the block and scan the block’s keys to figure out if that row key actually exists.
How will you implement joins in HBase?
HBase, not support joins directly but uses MapReduce jobs join queries can be implemented by retrieving data with the help of different HBase tables.
HBase Shell Commands:
ComponentDescription
Region ServerA table can be divided into several regions. A group of regions is served to the clients by a Region Server
HMasterIt coordinates and manages the Region Servers (similar as NameNode manages DataNodes in HDFS).
ZooKeeperZookeeper acts like as a coordinator inside HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.

No comments:

Post a Comment