General strategy

Big data → small data → analysis

“make big data as small as possible as quick as possible” – @EllieMcDonagh


The following quotes are adapted from Wikipedia. Some text was changed.

Apache Hadoop is an open-source software framework, written in Java, for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware. All the modules in Hadoop are designed with the fundamental assumption that hardware failures (of individual machines, or of whole racks of machines) are commonplace and should therefore be handled automatically in software by the framework.

The core of Apache Hadoop consists of a storage part, the Hadoop Distributed File System (HDFS), and a processing part, YARN. Hadoop splits files into large blocks (128 MB by default) and distributes the blocks among the nodes in the cluster. To process the data, Hadoop YARN transfers code (specifically Java JAR files) to the nodes (slaves) that hold the required data, and those nodes then process it in parallel. This approach takes advantage of data locality: the data can be processed faster and more efficiently than with a more conventional supercomputer architecture, which relies on a parallel file system where computation and data are connected via high-speed networking.
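To make the block splitting concrete, here is a minimal Python sketch of cutting a file into 128 MB blocks and spreading them over nodes. It is only an illustration: the function names, the node names, and the round-robin placement are assumptions made for this example, and HDFS's real placement policy is more involved.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def assign_round_robin(blocks, nodes):
    """Toy placement (an assumption, not HDFS's policy): hand out blocks in turn."""
    return {i: nodes[i % len(nodes)] for i in range(len(blocks))}


if __name__ == "__main__":
    size = 300 * 1024 * 1024  # hypothetical 300 MB file
    blocks = split_into_blocks(size)
    placement = assign_round_robin(blocks, ["node1", "node2", "node3"])
    for i, (offset, length) in enumerate(blocks):
        print(i, offset, length, placement[i])
```

A 300 MB file yields two full blocks and one shorter final block, so even a single file can be processed by several nodes at once.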

The primary point is data locality: the processing job is sent to the machine that holds the data. It is therefore essential that each machine that holds data can also run processing tasks, which means data should not be kept on a network storage system or other "dumb" storage media. It is assumed that each slave machine has large disks, powerful processors, and lots of memory.
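The idea of sending the job to the data can be sketched as a small scheduling function. This is a toy model under stated assumptions: `pick_node`, the block-location map, and the per-node slot counts are hypothetical names invented for this example, not part of Hadoop's API.

```python
def pick_node(block_id, block_locations, free_slots):
    """Choose where to run a task that reads block_id.

    block_locations maps a block id to the nodes that store a copy of it.
    free_slots maps a node name to the number of tasks it can still accept.
    Returns (node, is_local); node is None if no node has a free slot.
    """
    # First choice: a node that already holds the block (data-local).
    for node in block_locations.get(block_id, []):
        if free_slots.get(node, 0) > 0:
            return node, True
    # Fallback: any node with capacity; the block must then cross the network.
    for node, slots in free_slots.items():
        if slots > 0:
            return node, False
    return None, False


if __name__ == "__main__":
    locations = {"block-7": ["node1", "node2"]}
    slots = {"node1": 0, "node2": 1, "node3": 2}
    print(pick_node("block-7", locations, slots))
```

The fallback branch is the case the notes warn about: the task still runs, but it pays the cost of moving the data.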

Hadoop is designed for horizontal scaling rather than vertical scaling. There is very little hierarchy with respect to storage and processing capabilities. More slave nodes may be added without changes to the architecture of the cluster. Each slave node is similar and does not have a special role.


YARN (“Yet Another Resource Negotiator”) is the Hadoop “operating system.” That is, it manages jobs and resources.

The following notes are adapted from Hortonworks, which offers a customized version of Hadoop.

YARN is made up of:

The ResourceManager and the NodeManager form a generic system for managing applications in a distributed manner. The ResourceManager is the ultimate authority that arbitrates resources among all applications in the system. The ApplicationMaster is a framework-specific entity that negotiates resources from the ResourceManager and works with the NodeManager(s) to execute and monitor the component tasks.

The ResourceManager has a scheduler, which is responsible for allocating resources to the various applications running in the cluster, according to constraints such as queue capacities and user limits. The scheduler schedules based on the resource requirements of each application.
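As a rough illustration of scheduling under queue capacities and user limits, here is a toy allocator in Python. It is a sketch only: `ToyScheduler` and its parameters are hypothetical, memory is the only resource modeled, and nothing here reflects the real YARN scheduler classes or configuration.

```python
class ToyScheduler:
    """Grants memory requests subject to a per-queue capacity and a per-user limit."""

    def __init__(self, total_memory_mb, queue_shares, user_limit_mb):
        # Each queue receives a fixed fraction of the cluster's memory.
        self.queue_capacity = {
            queue: int(total_memory_mb * share)
            for queue, share in queue_shares.items()
        }
        self.queue_used = {queue: 0 for queue in queue_shares}
        self.user_used = {}
        self.user_limit_mb = user_limit_mb

    def request(self, queue, user, memory_mb):
        """Return True and record the allocation if both constraints allow it."""
        if self.queue_used[queue] + memory_mb > self.queue_capacity[queue]:
            return False  # the queue would exceed its capacity
        if self.user_used.get(user, 0) + memory_mb > self.user_limit_mb:
            return False  # the user would exceed their limit
        self.queue_used[queue] += memory_mb
        self.user_used[user] = self.user_used.get(user, 0) + memory_mb
        return True


if __name__ == "__main__":
    scheduler = ToyScheduler(8192, {"prod": 0.75, "dev": 0.25}, user_limit_mb=4096)
    print(scheduler.request("dev", "alice", 1024))
    print(scheduler.request("dev", "alice", 2048))
```

The point of the sketch is that the scheduler decides purely from stated resource requirements and configured limits; it does not need to know what the application will do with the resources.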

Each ApplicationMaster has responsibility for negotiating appropriate resource containers from the scheduler, tracking their status, and monitoring their progress. From the system perspective, the ApplicationMaster runs as a normal container.

The NodeManager is the per-machine slave, which is responsible for launching the applications’ containers, monitoring their resource usage (CPU, memory, disk, network), and reporting that usage to the ResourceManager.

Finally, a definition of containers:

A container is an application-specific process that’s created by a NodeManager on behalf of an ApplicationMaster. The ApplicationMaster itself is also a container, created by the ResourceManager. A container created by an ApplicationMaster can be an arbitrary process […]. This is the power of YARN—the ability to launch and manage any process across any node in a Hadoop cluster. — from Hadoop in Practice, p. 29; typos fixed.


Hadoop was created by Doug Cutting and Mike Cafarella in 2005 in order to crawl the web with their Nutch software. Cutting was employed at Yahoo! at the time, and chose the name “Hadoop” and the elephant logo (his son had a toy elephant he called hadoop).

Hadoop implemented two significant systems developed by Google and published as research papers. One is MapReduce, which has been generalized into YARN; the other is the Google File System, which became the model for HDFS.

Cluster design

Some machines have dedicated roles:

Simulation vs. real hardware

In real production clusters there is no server virtualization, no hypervisor layer. That would only amount to unnecessary overhead impeding performance. Hadoop runs best on Linux machines, working directly with the underlying hardware. That said, Hadoop does work in a virtual machine. That’s a great way to learn and get Hadoop up and running fast and cheap. – source

Replication and fault tolerance

You do not want any one slave to be the only place a block of data is kept. Rather, you want data to be replicated (not too much, but also not too little). A replication factor of 3 is the default. This means each block of data gets stored on 3 slaves.
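The cost of replication is easy to work out. The short sketch below shows the arithmetic; the function names and the example cluster sizes are assumptions chosen for illustration.

```python
def raw_storage_needed(data_tb, replication=3):
    """Disk space consumed when every block is stored `replication` times."""
    return data_tb * replication


def usable_capacity(nodes, disk_tb_per_node, replication=3):
    """Amount of distinct data a cluster can hold at a given replication factor."""
    return nodes * disk_tb_per_node / replication


if __name__ == "__main__":
    # Hypothetical numbers: 10 TB of data, and a cluster of 12 nodes with 4 TB each.
    print(raw_storage_needed(10))
    print(usable_capacity(12, 4))
```

With the default factor of 3, a block survives the loss of any two of the slaves that hold it, at the price of tripling the disk space used.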

Hadoop can be made “rack aware,” meaning you can describe how the slaves are physically arranged into racks in a data center. Sometimes entire racks are lost (e.g., to power or network outages), so ideally the copies of a block will not all be stored on machines in the same rack. However, machines in the same rack can communicate with each other faster (lower latency), so you want adjacent blocks of data to be stored on machines in the same rack. The rule is “for every block of data, two copies will exist in one rack, another copy in a different rack.” (source)
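The quoted rule can be sketched as a placement function. This is a minimal illustration, assuming a hypothetical `topology` dictionary that maps rack names to node names; it picks nodes at random and is not how HDFS actually chooses targets.

```python
import random


def place_replicas(topology, rng=random):
    """Pick three nodes for one block: two in one rack, one in a different rack.

    topology maps a rack name to the list of nodes in that rack.
    rng is anything with choice() and sample(), e.g. random.Random(seed).
    """
    racks_with_two = [rack for rack, nodes in topology.items() if len(nodes) >= 2]
    if not racks_with_two:
        raise ValueError("need a rack with at least two nodes")
    shared_rack = rng.choice(racks_with_two)
    other_racks = [
        rack for rack, nodes in topology.items() if rack != shared_rack and nodes
    ]
    if not other_racks:
        raise ValueError("need a second, non-empty rack")
    # Two copies on distinct machines in the shared rack.
    pair = rng.sample(topology[shared_rack], 2)
    # One copy in a different rack, so a whole-rack failure cannot lose the block.
    third = rng.choice(topology[rng.choice(other_racks)])
    return pair + [third]


if __name__ == "__main__":
    cluster = {"rackA": ["a1", "a2", "a3"], "rackB": ["b1", "b2"]}
    print(place_replicas(cluster, random.Random(0)))
```

Losing either rack in this example still leaves at least one copy of the block, while two of the three copies remain close together on the network.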

The NameNode detects and fixes problems with slaves according to the following diagram (source):

NameNode roles

Some of these notes are adapted from this blog post and this blog post.

Several companies produce Hadoop distributions, sometimes with significant enhancements:

Some databases work on Hadoop/HDFS:

Some projects generalize or specialize the processing workflow (i.e., beyond simple MapReduce):

Other tools:

CINF 401 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.