Big Data and Hadoop

-

Concepts

-

Why Hadoop

-

Why Hadoop

Larger datasets take up more space. Do we:

  1. Keep buying larger and larger hard drives to store the data?
  2. Add additional drives to the server?
  3. Make more servers with additional drives?

-

Why Hadoop - Larger Drives

Pros:

Cons:

-

Why Hadoop - Additional Drives

Pros:

Cons:

-

Why Hadoop - Distributed Servers

Pros:

Cons:

-

The Solution

Hadoop - An ecosystem of technologies that distributes data and scales out naturally across distributed nodes. Hadoop handles resource negotiation, so a middle-tier service won't have to be modified to search through larger data sets.

-

The Ecosystem

-

HDFS

Hadoop Distributed File System

-

YARN

Yet Another Resource Negotiator

-

MapReduce

-

The Basic System

The distributed file system (HDFS) stores data in blocks spread across the nodes, keeps track of where each block lives, and replicates blocks so the data is backed up if a node fails.
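As a rough illustration of the idea, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The function names, the tiny block size, and the round-robin placement are assumptions made for this example; real HDFS uses much larger blocks (128 MB by default in recent versions), a default replication factor of 3, and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the file contents into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int = 3) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes.

    Simple round-robin placement (an assumption for this sketch,
    not the policy HDFS actually uses).
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"abcdefghij", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=2)
```

Because every block lives on more than one node, losing a single node does not lose any data in this toy model.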

YARN decides how the cluster's resources (CPU and memory) are allocated and schedules jobs so data is processed efficiently.

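MapReduce processes the data in parallel: a map step turns input records into key/value pairs on the nodes that hold the data, and a reduce step combines the values for each key into the final result.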
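The map and reduce steps can be sketched with the classic word-count example. This is a single-process simulation in plain Python, not the Hadoop API; the function names are hypothetical and chosen only for illustration. In a real cluster the map calls run on many nodes and the framework performs the grouping (the "shuffle") between the two phases.

```python
from collections import defaultdict


def map_phase(line: str) -> list[tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: list[str]) -> dict[str, int]:
    """Run map, shuffle, and reduce over the input lines in order."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "Big clusters"])
```

Each map call only needs its own line, which is why the work can be spread across the nodes that already store the data.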

-

Hadoop Community

-

Building on Hadoop

Hadoop works as a technology but can be hard to work with directly. For certain steps (such as resource negotiation), there might be a better tool for your specific use case.

You may also decide to use a technology that interacts with Hadoop on your behalf. Some of these technologies handle query requests to Hadoop, and others handle things like failover.

-

Mesos

-

Tez

-

Spark

-

Pig

-

Hive

-

Many More!

There are coordination and workflow tools like ZooKeeper and Oozie. There are web UIs for interacting with your cluster, like Apache Ambari. There are data ingestion tools like Kafka.

You can also integrate standard data stores like MySQL and MongoDB. The limits of Hadoop have not been found yet!