Big Data-Hadoop (2017-11-17)


Hadoop is an open source, Java-based programming framework that supports the processing and storage of extremely large data sets in a distributed computing environment.


What will you learn?

The only Big Data-Hadoop training program that gives you in-depth knowledge of every module of Big Data-Hadoop, with practical hands-on exposure.

Introduction to Big Data

Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organise, process, and draw insights from large datasets.

  • Importance of Data
  • ESG Report on Analytics
  • Big Data & Its Hype
  • What is Big Data?
  • Structured vs Unstructured data
  • Definition of Big Data
  • Big Data Users & Scenarios
  • Challenges of Big Data
  • Why Distributed Processing?


Introduction to Hadoop

Hadoop is an open-source software framework for the distributed storage and processing of big data sets using the MapReduce programming model.

  • History Of Hadoop
  • Hadoop Ecosystem
  • Hadoop Animal Planet
  • When to use & when not to use Hadoop
  • What is Hadoop?
  • Key Distinctions of Hadoop
  • Hadoop Components/Architecture
  • Understanding Storage Components
  • Understanding Processing Components
  • Anatomy Of a File Write
  • Anatomy of a File Read

Understanding Hadoop Cluster

A Hadoop cluster is a special type of computational cluster designed specifically for storing and analysing huge amounts of unstructured data in a distributed computing environment.

  • Handout discussion
  • Walkthrough of CDH setup
  • Hadoop Cluster Modes
  • Hadoop Configuration files
  • Understanding Hadoop Cluster configuration
  • Data Ingestion to HDFS
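The "Data Ingestion to HDFS" step above hinges on how HDFS chops a file into fixed-size blocks and replicates each one across DataNodes. A minimal Python sketch of that bookkeeping (128 MB blocks and a replication factor of 3 are the common defaults; a real cluster reads them from hdfs-site.xml, and real placement is rack-aware):

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # 128 MB, the usual dfs.blocksize default
REPLICATION = 3                  # the usual dfs.replication default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) for each HDFS block of a file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Toy round-robin replica placement (real HDFS is rack-aware)."""
    ring = itertools.cycle(datanodes)
    return {i: [next(ring) for _ in range(replication)]
            for i, _ in enumerate(blocks)}

blocks = split_into_blocks(300 * 1024 * 1024)     # a 300 MB file
print(len(blocks))                                # 3 blocks
print(blocks[-1][1] // (1024 * 1024))             # last block is only 44 MB
```

Note that the last block is smaller than the block size; HDFS does not pad short final blocks.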


MapReduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

  • Meet MapReduce
  • Word Count Algorithm – Traditional approach
  • Traditional approach on a Distributed system
  • Traditional approach – Drawbacks
  • MapReduce approach
  • Input & Output Forms of a MR program
  • Map, Shuffle & Sort, Reduce Phases
  • Workflow & Transformation of Data
  • Word Count Code walkthrough
  • Input Split & HDFS Block
  • Relation between Split & Block
  • MR Flow with Single Reduce Task
  • MR flow with multiple Reducers
  • Data locality Optimization
  • Speculative Execution
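The Map, Shuffle & Sort, and Reduce phases listed above can be traced end-to-end with a small in-process word count. Real Hadoop jobs are written against the Java MapReduce API; this Python sketch only mirrors the data flow through the three phases:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    return [(word, 1) for word in line.lower().split()]

def shuffle_and_sort(pairs):
    # Framework step: group all values by key, keys in sorted order.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reducer: sum the counts for one word.
    return (key, sum(values))

lines = ["big data big hadoop", "hadoop big"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))
print(result)   # {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real job each input split feeds one map task, and the shuffle moves the grouped pairs across the network to the reduce tasks.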

Advanced MapReduce

  • Combiner
  • Partitioner
  • Counters
  • Hadoop Data Types
  • Custom Data Types
  • Input Format & Hierarchy
  • Output Format & Hierarchy
  • Side Data distribution – Distributed cache
  • Joins
  • Map side Join using Distributed cache
  • Reduce side Join
  • MR Unit – An Unit testing framework
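The combiner listed above is a "mini reducer" that runs on the map side to shrink what the shuffle must send across the network. A hedged sketch of the idea in Python (in real Hadoop a combiner is a Java Reducer class registered via Job.setCombinerClass):

```python
from collections import Counter

# Intermediate output of one mapper for a word-count job.
map_output = [("big", 1), ("data", 1), ("big", 1), ("big", 1)]

# Without a combiner: every pair crosses the network in the shuffle.
shuffled = len(map_output)

# With a combiner: counts are pre-aggregated locally per mapper first.
combined = list(Counter(k for k, _ in map_output).items())
print(shuffled, len(combined))   # 4 2
```

The combiner is only safe for operations that are associative and commutative (like summing counts), since the framework may run it zero, one, or many times.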


Pig

Pig is a high-level scripting language used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java.

  • What is Pig?
  • Why Pig?
  • Pig vs SQL
  • Execution Types or Modes
  • Running Pig
  • Pig Data types
  • Pig Latin relational Operators
  • Multi Query execution
  • Pig Latin Diagnostic Operators
  • Pig Latin Macro & UDF statements
  • Pig Latin Commands
  • Pig Latin Expressions
  • Schemas
  • Pig Functions
  • Pig Latin File Loaders
  • Pig UDF & executing a Pig UDF
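Pig's relational operators such as GROUP and FOREACH ... GENERATE boil down to grouping tuples by a key and aggregating each group. Since the course's scripts are written in Pig Latin, the Python sketch below only mimics that flow; the Pig Latin equivalents appear in the comments:

```python
from itertools import groupby

# Relation equivalent to: logs = LOAD 'logs' AS (user, bytes);
logs = [("alice", 200), ("bob", 50), ("alice", 100)]

# grouped = GROUP logs BY user;
rows = sorted(logs)               # groupby needs sorted input
grouped = groupby(rows, key=lambda t: t[0])

# totals = FOREACH grouped GENERATE group, SUM(logs.bytes);
totals = {user: sum(b for _, b in items) for user, items in grouped}
print(totals)   # {'alice': 300, 'bob': 50}
```

On a cluster, Pig compiles this pipeline into one or more MapReduce jobs; the GROUP corresponds to the shuffle phase.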


Hive

Hive is a data warehouse software project built on top of Apache Hadoop for providing data summarisation, query, and analysis.

  • Introduction to Hive
  • Pig Vs Hive
  • Hive Limitations & Possibilities
  • Hive Architecture
  • Metastore
  • Hive Data Organisation
  • Hive QL
  • SQL vs HiveQL
  • Hive Data types
  • Data Storage
  • Managed & External Tables
  • Partitions & Buckets
  • Storage Formats
  • Built-in SerDes
  • Importing Data
  • Alter & Drop Commands
  • Data Querying
  • Using MR Scripts
  • Hive Joins
  • Sub Queries
  • Views
  • UDFs
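HiveQL stays close to standard SQL, which is why the "SQL vs HiveQL" comparison above is mostly about execution, not syntax. The sketch below uses Python's built-in sqlite3 purely as a stand-in engine to show the query shape; Hive would compile the same statement into MapReduce jobs over HDFS data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INT)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10), ("west", 20), ("east", 5)])

# The comparable HiveQL is identical:
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region "
    "ORDER BY region").fetchall()
print(rows)   # [('east', 15), ('west', 20)]
```

The differences the course covers, such as partitions, buckets, and SerDes, live in how Hive lays the table out on HDFS, not in the SELECT syntax.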


HBase

HBase is one of the most popular non-relational databases, built on top of Hadoop and HDFS (Hadoop Distributed File System). It is also known as the Hadoop database.

  • Introduction to NoSql & HBase
  • Row & Column oriented storage
  • Characteristics of a huge DB
  • What is HBase?
  • HBase Data-Model
  • HBase vs RDBMS
  • HBase architecture
  • HBase in operation
  • Loading Data into HBase
  • HBase shell commands
  • HBase operations through Java
  • HBase operations through MR
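The HBase data model listed above places each value at (row key, column family:qualifier, timestamp), with multiple timestamped versions per cell. A plain nested dict is enough to show the shape; a real cluster is accessed through the HBase shell or the Java client API:

```python
# Toy model of an HBase table: row key -> (family, qualifier) -> versions.
table = {}

def put(row, family, qualifier, value, ts):
    # Each cell keeps a list of (timestamp, value) versions.
    cell = table.setdefault(row, {}).setdefault((family, qualifier), [])
    cell.append((ts, value))
    cell.sort(reverse=True)            # newest version first

def get(row, family, qualifier):
    """Return the latest version, like a default HBase Get."""
    return table[row][(family, qualifier)][0][1]

put("row1", "info", "city", "Pune", ts=1)
put("row1", "info", "city", "Mumbai", ts=2)   # newer version wins
print(get("row1", "info", "city"))            # Mumbai
```

This sparsity is the point: rows only store the cells they actually have, unlike a fixed-schema RDBMS row.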

Zookeeper & Oozie

Zookeeper and Oozie are widely used Hadoop administration tools.

  • Introduction to Zookeeper
  • Distributed Coordination
  • Zookeeper Data Model
  • Zookeeper Service
  • Zookeeper in HBase
  • Introduction to Oozie
  • Oozie workflow
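The ZooKeeper data model above is a hierarchy of "znodes" addressed by slash-separated paths, like a tiny filesystem for coordination data. A dict keyed by path is enough to show the model; real clients use the ZooKeeper API (create, get, exists) with watches for change notification:

```python
# Toy znode tree: path -> data bytes. Real znodes also carry version
# counters and ACLs, and can be ephemeral (deleted when a session dies).
znodes = {"/": b""}

def create(path, data=b""):
    # Like ZooKeeper, a znode can only be created under an existing parent.
    parent = path.rsplit("/", 1)[0] or "/"
    if parent not in znodes:
        raise KeyError("parent znode missing: " + parent)
    znodes[path] = data

create("/hbase")
create("/hbase/master", b"node-1:16000")   # e.g. active-master election data
print(znodes["/hbase/master"])             # b'node-1:16000'
```

This is how "Zookeeper in HBase" works in outline: the active master and region server liveness are tracked as znodes under /hbase.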


Sqoop

Apache Sqoop efficiently transfers bulk data between Apache Hadoop and structured data stores such as relational databases.

  • Introduction to Sqoop
  • Sqoop design
  • Sqoop Commands
  • Sqoop Import & Export Commands
  • Sqoop Incremental load Commands
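Sqoop's incremental append mode remembers a --last-value for a --check-column and, on the next run, imports only rows beyond it. The Python sketch below mimics that bookkeeping over an in-memory "table"; the real invocation looks like `sqoop import --incremental append --check-column id --last-value 2 ...`:

```python
# Source table rows as (id, payload); id is the check column.
rows = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

def incremental_import(rows, last_value):
    """Import only rows whose check column exceeds last_value."""
    new = [r for r in rows if r[0] > last_value]
    new_last = max((r[0] for r in new), default=last_value)
    return new, new_last

imported, last = incremental_import(rows, last_value=2)
print(imported, last)   # [(3, 'c'), (4, 'd')] 4
```

The returned high-water mark becomes the --last-value for the following run, which is what a saved Sqoop job stores for you automatically.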

Hadoop 2.0 and YARN

An introduction to YARN and its advantages over classic MapReduce in Hadoop 2.0.

  • Hadoop 1 Limitations
  • HDFS Federation
  • NameNode High Availability
  • Introduction to YARN
  • YARN Applications
  • YARN Architecture
  • Anatomy of a YARN application
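In the YARN architecture above, the ResourceManager grants "containers" (bundles of memory and CPU) to each application's ApplicationMaster. A toy allocator is enough to show the negotiation; the class name and numbers below are illustrative only, not YARN's actual API:

```python
# Toy ResourceManager tracking only memory; real YARN also tracks
# vcores, queues, and per-node capacity via NodeManager heartbeats.
class ResourceManager:
    def __init__(self, memory_mb):
        self.free = memory_mb

    def allocate(self, requests):
        """Grant container requests while capacity remains."""
        granted = []
        for mem in requests:
            if mem <= self.free:
                self.free -= mem
                granted.append(mem)
        return granted

rm = ResourceManager(memory_mb=4096)
granted = rm.allocate([1024, 1024, 4096])   # third request won't fit
print(granted, rm.free)   # [1024, 1024] 2048
```

Decoupling this scheduling from MapReduce itself is YARN's advantage over Hadoop 1: the same cluster can now run non-MapReduce applications side by side.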

Join The 10,000+ Satisfied Trainees!

Connect with us!