Big Data Dictionary

HadoopDB

The HadoopDB project (commercialized by Hadapt) is a hybrid system that tries to combine the scalability advantages of the Hadoop framework with the performance and efficiency advantages of parallel databases. The basic idea behind HadoopDB is to connect multiple single-node database systems (PostgreSQL) using Hadoop as the task coordinator and network communication layer. Queries are expressed in SQL, but their execution is parallelized across nodes using the MapReduce framework; however, as much of the single-node query work as possible is pushed inside the corresponding node databases. HadoopDB thus tries to achieve fault tolerance and the ability to operate in heterogeneous environments by inheriting the scheduling and job-tracking implementation from Hadoop, while at the same time approaching the performance of parallel databases by doing most of the query processing inside the database engine.
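
As a rough illustration of this pushdown idea, the sketch below (not HadoopDB's actual code; the class, table and connection details are hypothetical) shows a Hadoop map task that runs an aggregation query inside its node's local PostgreSQL instance over JDBC and emits only the partial results as key-value pairs, which a reducer then merges:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Each map task delegates its share of the query to the PostgreSQL
    // instance on its own node; the aggregation runs inside the database,
    // and only partial results cross the network.
    public class LocalDbMapper
            extends Mapper<NullWritable, NullWritable, Text, LongWritable> {
        @Override
        protected void map(NullWritable key, NullWritable value, Context ctx)
                throws IOException, InterruptedException {
            String sql = "SELECT region, SUM(amount) FROM sales GROUP BY region";
            try (Connection c = DriverManager.getConnection(
                         "jdbc:postgresql://localhost/salesdb", "hadoop", "");
                 Statement s = c.createStatement();
                 ResultSet rs = s.executeQuery(sql)) {
                while (rs.next()) {
                    // One partial aggregate per group; Hadoop's shuffle routes
                    // all pairs with the same key to a single reducer.
                    ctx.write(new Text(rs.getString(1)),
                              new LongWritable(rs.getLong(2)));
                }
            } catch (SQLException e) {
                throw new IOException(e);
            }
        }
    }

    // The reduce side merges the per-node partial sums into global totals.
    class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values,
                Context ctx) throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            ctx.write(key, new LongWritable(total));
        }
    }

Compared with scanning raw files in HDFS, this keeps the heavy relational work (selection, aggregation) inside the database engine and uses Hadoop only for scheduling and for the shuffle between the map and reduce phases.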

The architecture of HadoopDB consists of two layers:

  1. A data storage layer or the Hadoop Distributed File System (HDFS). 
  2. A data processing layer or the MapReduce Framework.

In this architecture, HDFS is a block-structured file system managed by a central NameNode. Individual files are broken into blocks of a fixed size and distributed across multiple DataNodes in the cluster. The NameNode maintains metadata about the size and location of blocks and their replicas. The Hadoop MapReduce framework follows a simple master-slave architecture: the master is a single JobTracker and the slaves, or worker nodes, are TaskTrackers. The JobTracker handles the runtime scheduling of MapReduce jobs and maintains information on each TaskTracker's load and available resources.

On top of these two layers, HadoopDB adds its own components. The Database Connector is the interface between the TaskTrackers and the independent database systems residing on the nodes of the cluster: it connects to a database, executes the SQL query and returns the results as key-value pairs. The Catalog component maintains metadata about the databases, their location, replica locations and data partitioning properties. The Data Loader component is responsible for globally repartitioning data on a given partition key upon loading and for breaking single-node data apart into multiple smaller partitions, or chunks. The SMS (SQL to MapReduce to SQL) planner extends the HiveQL translator and transforms SQL into Hadoop jobs that connect to tables stored as files in HDFS.
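
To make the Data Loader's repartitioning concrete, here is a minimal sketch, assuming hypothetical names, cluster sizes and a CRC32 hash (not HadoopDB's actual code): a first hash assigns each row to a node, and a second, derived hash assigns it to a chunk within that node, keeping each chunk small enough for a single task to process.

    import java.nio.charset.StandardCharsets;
    import java.util.zip.CRC32;

    public class PlacementSketch {
        // Stable, non-negative hash of the partition key (CRC32 returns an
        // unsigned 32-bit value widened to long).
        static long hash(String partitionKey) {
            CRC32 crc = new CRC32();
            crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
            return crc.getValue();
        }

        // Global step: which node stores rows with this partition key.
        static int nodeFor(String partitionKey, int numNodes) {
            return (int) (hash(partitionKey) % numNodes);
        }

        // Local step: which chunk on that node holds the row. Dividing by
        // numNodes first decorrelates the chunk choice from the node choice.
        static int chunkFor(String partitionKey, int numNodes, int chunksPerNode) {
            return (int) ((hash(partitionKey) / numNodes) % chunksPerNode);
        }

        public static void main(String[] args) {
            // Example: route rows keyed "customer-42" in a 10-node cluster
            // with 20 chunks per node.
            String key = "customer-42";
            System.out.printf("key %s -> node %d, chunk %d%n",
                    key, nodeFor(key, 10), chunkFor(key, 10, 20));
        }
    }

Because both steps depend only on the partition key, rows that join or group on that key land in the same chunk, which is what lets HadoopDB push joins and aggregations down into the single-node databases.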
