Hadoop 2.0 Introduction
  • This training introduces attendees to the core concepts of Hadoop: a deep dive into the critical architecture paths of HDFS, MapReduce, and HBase; the basics of writing effective Pig and Hive scripts; and how to choose the correct use cases for Hadoop.
Intended Audience:
  • Engineers, Programmers, Networking specialists, Managers, Executives
Key Skills:
  • Advanced Map Reduce Concepts & Algorithms
  • Big Data & Hadoop Ecosystem
  • Hadoop Best Practices, Tips and Techniques
  • Importing and exporting data
  • Hadoop Distributed File System – HDFS
  • Using the MapReduce API and writing common algorithms
  • Best practices for developing and debugging map reduce programs
  • Explore a dataset of products, reviews and images
  • Managing and Monitoring a Hadoop Cluster
Prerequisites:
  • Participants should have a basic understanding of Java and Linux.
Instructional Method:
  • This is an instructor-led course combining lectures with the practical application of Hadoop and its underlying technologies. Most concepts are presented pictorially, and a detailed case study ties together the technologies, patterns and design.
Hadoop Introduction
  • Move computation not data.
  • Volunteer Computing
  • Hadoop Releases
  • Hadoop performance and data scale facts.
  • The Apache Hadoop Project.
  • Grid Computing
  • Hadoop in the context of other data stores.
  • The Hadoop Ecosystem.
  • Apache Hadoop and the Hadoop Ecosystem
  • A Brief History of Hadoop
  • Hadoop – an inside view: MapReduce and HDFS.
  • What about NoSQL?
  • RDBMS
  • Comparison with Other Systems
MapReduce
  • Constructing the basic template of a MapReduce program
  • Running a Distributed MapReduce Job
  • Data Flow
  • Combiner Functions
  • Java MapReduce
  • Scaling Out
  • Counting things
  • Analyzing the Data with Hadoop
  • Map and Reduce
  • Hadoop Pipes
  • Adapting for Hadoop’s API changes
  • Improving performance with combiners
  • Hadoop Streaming (Ruby and Python)
  • Streaming in Hadoop
  • Streaming with key/value pairs
  • Streaming with Unix commands
  • Streaming with the Aggregate package
  • Streaming with scripts
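The Streaming topics above boil down to a simple contract: the mapper and reducer read lines on stdin and write tab-separated key/value lines on stdout, with the framework sorting by key between the two phases. A minimal word-count sketch in Python (the hadoop-streaming invocation in the comment is only indicative; run locally, the same pipeline is mapper | sort | reducer):

```python
def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' record per word, as Streaming expects."""
    for line in lines:
        for word in line.strip().split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per key; relies on input being sorted by key (the shuffle)."""
    current, total = None, 0
    for line in lines:
        key, value = line.split("\t")
        if key != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = key, 0
        total += int(value)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # On a cluster each half runs as its own script, roughly:
    #   hadoop jar hadoop-streaming.jar -mapper mapper.py -reducer reducer.py \
    #       -input in -output out
    # Locally, sorted() stands in for the framework's shuffle-and-sort.
    for out in reducer(iter(sorted(mapper(["the cat sat", "the dog sat"])))):
        print(out)
```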
Distributing Data with HDFS
  • Interfaces
  • Hadoop Filesystems
  • The Design of HDFS
  • Using Hadoop Archives
    • Limitations
  • Parallel Copying with distcp
    • Keeping an HDFS Cluster Balanced
    • Hadoop Archives
  • Data Flow
    • Anatomy of a File Write
    • Anatomy of a File Read
    • Coherency Model
  • The Command-Line Interface - Basic Filesystem Operations
    • The Java Interface
    • Querying the Filesystem
    • Reading Data Using the FileSystem API
    • Directories
    • Deleting Data
    • Reading Data from a Hadoop URL
    • Writing Data
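One detail from the anatomy of a file write is worth sketching: HDFS's default policy for three replicas places the first on the writer's node, the second on a node in a different rack, and the third on a different node in that same remote rack. A toy simulation of that policy (node and rack names are invented; this is not Hadoop's actual BlockPlacementPolicy code):

```python
def place_replicas(writer, nodes_by_rack):
    """Toy version of the default HDFS 3-replica placement: replica 1 on the
    writer's node, replica 2 on a node in another rack, replica 3 on a
    different node in that same remote rack."""
    writer_rack = next(rack for rack, nodes in nodes_by_rack.items() if writer in nodes)
    remote_rack = next(rack for rack in nodes_by_rack if rack != writer_rack)
    second = nodes_by_rack[remote_rack][0]
    third = next(node for node in nodes_by_rack[remote_rack] if node != second)
    return [writer, second, third]

# Two invented racks of two datanodes each:
racks = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
```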
Understanding Hadoop I/O
  • File-Based Data Structures
    • MapFile
    • SequenceFile
  • Serialization
  • Implementing a Custom Writable
  • Serialization Frameworks
  • The Writable Interface
  • Writable Classes
  • Avro
  • Compression
    • Codecs
    • Using Compression in MapReduce
    • Compression and Input Splits
  • Data Integrity
    • ChecksumFileSystem
    • LocalFileSystem
    • Data Integrity in HDFS
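HDFS's data integrity rests on checksumming: a CRC is computed for every io.bytes.per.checksum bytes of data (512 by default) when a file is written, then recomputed and compared whenever the data is read. A minimal sketch of the idea, not Hadoop's actual ChecksumFileSystem implementation:

```python
import zlib

CHUNK = 512  # HDFS checksums each io.bytes.per.checksum bytes (512 by default)

def checksums(data, chunk=CHUNK):
    """CRC-32 per chunk, as a datanode stores alongside the block data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def verify(data, stored, chunk=CHUNK):
    """Recompute on read and compare; a mismatch signals silent corruption."""
    return checksums(data, chunk) == stored

original = b"x" * 1024
stored_sums = checksums(original)
# Flip one byte in the second chunk to simulate bit rot on disk:
corrupted = original[:600] + b"!" + original[601:]
```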
Advanced MapReduce
  • Chaining MapReduce jobs
    • Chaining preprocessing and postprocessing steps
    • Chaining MapReduce jobs in a sequence
    • Chaining MapReduce jobs with complex dependency
  • Creating a Bloom filter
    • What does a Bloom filter do?
    • Bloom filter in Hadoop version 0.20+
    • Implementing a Bloom filter
  • Joining data from different sources
    • Reduce-side joining
    • Replicated joins using DistributedCache
    • Semijoin: reduce-side join with map-side filtering
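A Bloom filter answers "possibly present" or "definitely absent", which is what makes the semijoin's map-side filtering cheap: mappers drop records whose join key is definitely absent from the other dataset. A minimal sketch (Hadoop ships its own implementation in org.apache.hadoop.util.bloom; the bit-array size and hashing scheme here are illustrative):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions over an m-bit array.
    Lookups may return false positives but never false negatives."""

    def __init__(self, m_bits=1024, k_hashes=3):
        self.m, self.k = m_bits, k_hashes
        self.bits = 0  # one big int as the bit array

    def _positions(self, item):
        # Derive k positions by salting a cryptographic hash (illustrative choice).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))
```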
Writing Map-Reduce Applications
  • Hadoop in the Cloud
  • Cluster Setup and Installation
  • Hadoop Configuration
  • YARN Configuration
  • The Configuration API
  • Running Locally on Test Data
  • Configuring the Development Environment
  • Cluster Specs
  • Tuning
  • MapReduce Workflows
  • Monitoring and debugging on a production cluster
  • Tuning for performance
  • Benchmarking a Hadoop Cluster
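The key behavior of the Configuration API is resource stacking: resources are read in order (e.g. core-default.xml, then core-site.xml), with later resources overriding earlier ones except for properties marked final. A toy model of that resolution order (dfs.replication is a real Hadoop property; the data structures are invented):

```python
def resolve(*resources):
    """Toy model of Hadoop Configuration resource stacking: resources apply
    in order, later ones overriding earlier ones, unless a property was
    already marked final by an earlier resource."""
    merged, finals = {}, set()
    for resource in resources:
        for key, (value, is_final) in resource.items():
            if key in finals:
                continue  # a final property cannot be overridden downstream
            merged[key] = value
            if is_final:
                finals.add(key)
    return merged

defaults = {"dfs.replication": ("3", False)}   # shipped default
site = {"dfs.replication": ("2", True)}        # admin pins it in *-site.xml
job = {"dfs.replication": ("1", False)}        # a job tries to override
```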
Map-Reduce Internals
  • Failures
    • Failures in YARN
    • Failures in Classic MapReduce
  • Anatomy of a MapReduce Job Run
    • Classic MapReduce (MapReduce 1)
    • YARN (MapReduce 2)
  • Shuffle and Sort
    • The Reduce Side
    • The Map Side
    • Configuration Tuning
  • Task Execution
    • Skipping Bad Records
    • Output Committers
    • The Task Execution Environment
    • Speculative Execution
    • Task JVM Reuse
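The shuffle-and-sort topics above can be condensed to two steps: map outputs are hash-partitioned across reducers, and each partition is sorted by key so a reducer sees every key's values grouped together. A toy single-process simulation (not the real spill-and-merge implementation):

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def shuffle(map_outputs, num_reducers):
    """Toy shuffle: hash-partition (key, value) pairs across reducers, then
    sort each partition by key, mimicking the map side's partitioned sort."""
    partitions = defaultdict(list)
    for key, value in map_outputs:
        partitions[hash(key) % num_reducers].append((key, value))
    return {r: sorted(pairs, key=itemgetter(0)) for r, pairs in partitions.items()}

def reduce_partition(pairs):
    """Sum values per key, mirroring the reduce side of a word count."""
    return {key: sum(v for _, v in group)
            for key, group in groupby(pairs, key=itemgetter(0))}
```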
Managing Hadoop
  • Setting permissions
  • Enabling trash
  • Adding DataNodes
  • Managing NameNode and Secondary NameNode
  • Designing network layout and rack awareness
  • Checking the system’s health
  • Managing quotas
  • Setting up parameter values for practical use
  • Removing DataNodes
  • Recovering from a failed NameNode
Map-Reduce Features
  • Counters
  • Sorting
  • Map-Reduce Library
  • Side Data Distribution
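Counters are the standard way to gather job-wide statistics without a side channel: each task increments named counters locally, and the framework sums them into job totals. A Python sketch of that aggregation (the counter names and record format are invented):

```python
from collections import Counter

def parse_task(records):
    """One map task: parse integer records, counting malformed ones instead
    of failing, the way a Hadoop job bumps a custom counter for bad input."""
    counters = Counter()
    values = []
    for record in records:
        try:
            values.append(int(record))
            counters["VALID_RECORDS"] += 1
        except ValueError:
            counters["MALFORMED_RECORDS"] += 1
    return values, counters

# The framework sums each task's counters into job-wide totals:
_, task1_counters = parse_task(["1", "2", "oops"])
_, task2_counters = parse_task(["3", "bad", "worse"])
job_totals = task1_counters + task2_counters
```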
Map-Reduce Ecosystem
  • Hive
    • Installing and configuring Hive
    • HiveQL in detail
    • Example queries
    • Hive summary
  • HBase
    • Introduction
    • Clients
    • Concepts
    • HBase vs RDBMS
  • Installing Pig
  • Running Pig
  • Thinking like a Pig
    • Data flow language
    • User-defined functions
    • Data types
  • Speaking Pig Latin
    • Execution optimization
    • Expressions and functions
    • Relational operators
    • Data types and schemas
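Pig Latin's relational operators describe a dataflow, and that dataflow can be mirrored step by step over in-memory Python tuples, which is a useful way to reason about a script before running it (the 'sales' relation, its schema, and the values are invented):

```python
from itertools import groupby
from operator import itemgetter

# records = LOAD 'sales' AS (product, amount);  -- toy in-memory stand-in
records = [("widget", 30), ("gadget", 5), ("widget", 12), ("gadget", 40)]

# big = FILTER records BY amount > 10;
big = [r for r in records if r[1] > 10]

# grouped = GROUP big BY product;  (groupby needs its input sorted by the key)
grouped = groupby(sorted(big, key=itemgetter(0)), key=itemgetter(0))

# totals = FOREACH grouped GENERATE group, SUM(big.amount);
totals = {product: sum(amount for _, amount in rows) for product, rows in grouped}
```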