Spark Internals
Distributing Data with HDFS
  • Interfaces
  • Hadoop Filesystems
  • The Design of HDFS
Parallel Copying with distcp
  • Keeping an HDFS Cluster Balanced
  • Hadoop Archives
Using Hadoop Archives
  • Limitations
Data Flow
  • Anatomy of a File Write
  • Anatomy of a File Read
  • Coherency Model
The Command-Line Interface
  • Basic Filesystem Operations
The Java Interface
  • Querying the Filesystem
  • Reading Data Using the FileSystem API
  • Directories
  • Deleting Data
  • Reading Data from a Hadoop URL
  • Writing Data
Understanding Hadoop I/O
  • Serialization
  • Implementing a Custom Writable
  • Serialization Frameworks
  • The Writable Interface
  • Writable Classes
  • Avro
Data Integrity
  • ChecksumFileSystem
  • LocalFileSystem
  • Data Integrity in HDFS
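HDFS guards against corruption by storing a separate checksum for every io.bytes.per.checksum bytes of data (512 by default) and re-verifying on read. A minimal pure-Python sketch of that chunked-checksum idea (HDFS itself uses CRC32C inside the Java client; zlib's CRC-32 stands in here):

```python
import zlib

CHUNK = 512  # HDFS default: one checksum per io.bytes.per.checksum bytes

def chunk_checksums(data: bytes):
    """Compute one CRC per fixed-size chunk, like an HDFS .crc sidecar file."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def verify(data: bytes, checksums):
    """Recompute CRCs on read and compare; a mismatch signals corruption."""
    return chunk_checksums(data) == checksums

data = b"x" * 1300                       # spans three 512-byte chunks
sums = chunk_checksums(data)
assert verify(data, sums)

corrupted = data[:600] + b"y" + data[601:]   # flip one byte in chunk 2
assert not verify(corrupted, sums)
```

Per-chunk checksums also localize the damage: only the bad chunk's replica needs re-reading from another datanode, not the whole block.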
ORC Files
  • Large size enables efficient read of columns
  • New types (datetime, decimal)
  • Encoding specific to the column type
  • Default stripe size is 250 MB
  • A single file as output of each task
  • Split files without scanning for markers
  • Bound the amount of memory required for reading or writing
  • Lowers pressure on the NameNode
  • Dramatically simplifies integration with Hive
  • Break file into sets of rows called a stripe
  • Complex types (struct, list, map, union)
  • Support for the Hive type model
ORC Files: Footer
  • Count, min, max, and sum for each column
  • Types, number of rows
  • Contains list of stripes
ORC Files: Index
  • Required for skipping rows
  • Position in each stream
  • Min and max for each column
  • Currently every 10,000 rows
  • Could include bit field or bloom filter
ORC Files: Postscript
  • Contains compression parameters
  • Size of compressed footer
ORC Files: Data
  • Directory of stream locations
  • Required for table scan
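The four pieces above fit together from the tail of the file: the last byte gives the postscript length, the postscript gives the footer length, and the footer locates the stripes and their column statistics, so a reader never scans the file for markers. A toy sketch of that tail-first layout (JSON stands in for ORC's real protobuf encoding; the field names are illustrative):

```python
import json

def write_toy_orc(stripes):
    """Toy layout mirroring ORC: stripe data, then a footer with stripe
    locations and per-column stats, then a postscript, then a 1-byte length."""
    body = b""
    stripe_meta = []
    for rows in stripes:
        blob = json.dumps(rows).encode()
        stripe_meta.append({"offset": len(body), "length": len(blob),
                            "rows": len(rows),
                            "min": min(rows), "max": max(rows), "sum": sum(rows)})
        body += blob
    footer = json.dumps(stripe_meta).encode()
    ps = json.dumps({"compression": "NONE", "footer_length": len(footer)}).encode()
    assert len(ps) < 256                     # must fit in the final length byte
    return body + footer + ps + bytes([len(ps)])

def read_footer(buf):
    """Read from the tail: last byte -> postscript size -> footer size."""
    ps_len = buf[-1]
    ps = json.loads(buf[-1 - ps_len:-1])
    f_len = ps["footer_length"]
    return json.loads(buf[-1 - ps_len - f_len:-1 - ps_len])

f = read_footer(write_toy_orc([[1, 2, 3], [10, 20]]))
assert [s["rows"] for s in f] == [3, 2]
assert f[1]["sum"] == 30
```

The min/max/sum stats in the footer are what let a reader skip entire stripes that cannot match a predicate.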
Parquet
  • Nested Encoding
  • Configurations
  • Error recovery
  • Extensibility
  • Nulls
  • File format
  • Data Pages
  • Motivation
  • Unit of parallelization
  • Logical Types
  • Metadata
  • Modules
  • Column chunks
  • Separating metadata and column data
  • Checksumming
  • Types
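The core idea under the items above is columnar layout: the values of one column are stored contiguously, so a scan reads only the columns it needs. A tiny row-store versus column-store sketch in plain Python (not Parquet's actual encoding):

```python
rows = [("ana", 31), ("bo", 25), ("cy", 40)]   # made-up sample records

# Row layout: whole records interleaved; reading one column touches everything.
row_store = [field for record in rows for field in record]

# Columnar layout: one contiguous chunk per column, as in a Parquet column chunk.
column_store = {"name": [r[0] for r in rows], "age": [r[1] for r in rows]}

# A projection like SELECT avg(age) reads a single chunk in the column store.
ages = column_store["age"]
assert sum(ages) / len(ages) == 32.0
assert row_store[1::2] == ages      # same data, different physical layout
```

Keeping a column's values together is also what makes type-specific encodings (dictionary, run-length) effective.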
File-Based Data Structures
  • MapFile
  • SequenceFile
  • Compression
  • Codecs
  • Using Compression in MapReduce
  • Compression and Input Splits
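One reason splittability matters: a plain DEFLATE stream has no internal sync points, so decompression must begin at byte 0. A small illustration with the stdlib zlib module (the codec underneath gzip):

```python
import zlib

line = b"some log line\n"
blob = zlib.compress(line * 1000)

# Decompression works from the start of the stream...
assert zlib.decompress(blob) == line * 1000

# ...but not from an arbitrary offset, which is why a gzipped file cannot be
# cut into input splits: the whole .gz file must go to a single mapper.
try:
    zlib.decompress(blob[1:])
    mid_stream_ok = True
except zlib.error:
    mid_stream_ok = False
assert not mid_stream_ok
```

Block-oriented codecs (bzip2) and container formats (SequenceFile with block compression) avoid this by providing resynchronization points a split can start from.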
Spark Introduction
  • GraphX
  • MLlib
  • Spark SQL
  • Data Processing Applications
  • Spark Streaming
  • What Is Apache Spark?
  • Data Science Tasks
  • Spark Core
  • Storage Layers for Spark
  • Who Uses Spark, and for What?
  • A Unified Stack
  • Cluster Managers
RDDs
  • Lazy Evaluation
  • Common Transformations and Actions
  • Passing Functions to Spark
  • Creating RDDs
  • RDD Operations
  • Actions
  • Transformations
  • Scala
  • Java
  • Persistence
  • Python
  • Converting Between RDD Types
  • RDD Basics
  • Basic RDDs
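The key property of the operations listed above is laziness: transformations only build a lineage, and work happens when an action runs. A plain-Python analogy using generators (this is not Spark, just the same evaluation shape):

```python
# "Transformations" build a recipe; generators do the same: no work happens here.
nums = range(1, 1_000_000)
squared = (n * n for n in nums)              # like rdd.map(lambda n: n * n)
evens = (n for n in squared if n % 2 == 0)   # like .filter(lambda n: n % 2 == 0)

# Only an "action" forces evaluation, and it can stop early, so the
# million-element range is never fully materialized.
first_five = []
for n in evens:                              # like rdd.take(5)
    first_five.append(n)
    if len(first_five) == 5:
        break

assert first_five == [4, 16, 36, 64, 100]
```

As with generators, re-running an action recomputes the chain from the source unless the intermediate result is persisted.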
RDD Internals:Part-1
  • Expressing Existing Programming Models
  • Fault Recovery
  • Interpreter Integration
  • Memory Management
  • Implementation
  • MapReduce
  • RDD Operations in Spark
  • Iterative MapReduce
  • Console Log Mining
  • Google's Pregel
  • User Applications Built with Spark
  • Behavior with Insufficient Memory
  • Support for Checkpointing
  • A Fault-Tolerant Abstraction
  • Evaluation
  • Job Scheduling
  • Spark Programming Interface
  • Advantages of the RDD Model
  • Understanding the Speedup
  • Leveraging RDDs for Debugging
  • Iterative Machine Learning Applications
  • Explaining the Expressivity of RDDs
  • Representing RDDs
  • Applications Not Suitable for RDDs
RDD Internals:Part-2
  • Sorting Data
  • Operations That Affect Partitioning
  • Determining an RDD’s Partitioner
  • Grouping Data
  • Motivation
  • Aggregations
  • Data Partitioning (Advanced)
  • Actions Available on Pair RDDs
  • Joins
  • Creating Pair RDDs
  • Operations That Benefit from Partitioning
  • Transformations on Pair RDDs
  • Example: PageRank
  • Custom Partitioners
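Most of the pair-RDD machinery above rests on one rule: a partitioner maps every occurrence of a key to the same partition. A pure-Python sketch of hash partitioning (Spark's HashPartitioner takes a non-negative mod of the JVM hashCode; Python's hash() stands in here):

```python
def hash_partition(key, num_partitions):
    """Mimic Spark's HashPartitioner: partition = hash(key) mod n."""
    return hash(key) % num_partitions

def partition_pairs(pairs, num_partitions):
    """Bucket key-value pairs the way a shuffle would."""
    parts = [[] for _ in range(num_partitions)]
    for k, v in pairs:
        parts[hash_partition(k, num_partitions)].append((k, v))
    return parts

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
parts = partition_pairs(pairs, 4)

# Every pair with key "a" lands in one partition, which is what lets
# reduceByKey and partitioned joins avoid shuffling both sides.
assert all((k, v) in parts[hash_partition(k, 4)] for k, v in pairs)
```

A custom partitioner replaces hash_partition with domain knowledge, for example hashing only the domain of a URL so pages from one site co-locate.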
Data Ingress and Egress
  • Hadoop Input and Output Formats
  • File Formats
  • Local/“Regular” FS
  • Text Files
  • Java Database Connectivity
  • Structured Data with Spark SQL
  • Elasticsearch
  • File Compression
  • Apache Hive
  • Cassandra
  • Object Files
  • Comma-Separated Values and Tab-Separated Values
  • HBase
  • Databases
  • Filesystems
  • SequenceFiles
  • JSON
  • HDFS
  • Motivation
  • Amazon S3
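For the text-based formats above, the loading pattern is the same: read the input as lines, then parse each record. A stdlib-only sketch for line-delimited JSON and TSV (the sample data is made up):

```python
import csv, io, json

# JSON records, one object per line -- the shape line-oriented loaders expect.
raw = '{"name": "ana", "likes_pandas": true}\n{"name": "bo"}\n'
records = [json.loads(line) for line in raw.splitlines()]
assert records[0]["likes_pandas"] is True

# CSV/TSV: same idea, with the delimiter as the only difference.
tsv = "name\tage\nana\t31\nbo\t25\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
assert rows[1] == {"name": "bo", "age": "25"}
```

Note that csv gives back strings; imposing real types is the loader's job, which is one reason schema-carrying formats like Parquet and SequenceFiles are preferred for intermediate data.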
Running on a Cluster
  • Scheduling Within and Between Spark Applications
  • Spark Runtime Architecture
  • A Scala Spark Application Built with sbt
  • Packaging Your Code and Dependencies
  • Launching a Program
  • A Java Spark Application Built with Maven
  • Hadoop YARN
  • Deploying Applications with spark-submit
  • The Driver
  • Standalone Cluster Manager
  • Cluster Managers
  • Executors
  • Amazon EC2
  • Cluster Manager
  • Dependency Conflicts
  • Apache Mesos
  • Which Cluster Manager to Use?
Spark Internals
  • Spark: YARN Mode
  • Resource Manager
  • Node Manager
  • Workers
  • Containers
  • Threads
  • Task
  • Executors
  • Application Master
  • Multiple Applications
  • Tuning Parameters
Spark: Local Mode
  • Spark Caching
  • With Serialization
  • Off-heap
  • In Memory
Spark Serialization
  • StandAlone Mode
  • Task
  • Multiple Applications
  • Executors
  • Tuning Parameters
  • Workers
  • Threads
Advanced Spark Programming
  • Working on a Per-Partition Basis
  • Optimizing Broadcasts
  • Accumulators
  • Custom Accumulators
  • Accumulators and Fault Tolerance
  • Numeric RDD Operations
  • Piping to External Programs
  • Broadcast Variables
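An accumulator is a shared variable that tasks may only add to and only the driver reads, useful for side-channel counts such as blank or malformed records. A pure-Python sketch of the idea (not Spark's implementation):

```python
class Accumulator:
    """Add-only shared variable: tasks call add(), only the driver reads value."""
    def __init__(self, zero=0):
        self.value = zero

    def add(self, v):
        self.value += v

blank_lines = Accumulator()

def process(line, acc):
    """A 'task': produces the main result, counts blanks on the side."""
    if line == "":
        acc.add(1)
    return line.upper()

lines = ["hello", "", "world", ""]
result = [process(l, blank_lines) for l in lines]

assert result == ["HELLO", "", "WORLD", ""]
assert blank_lines.value == 2   # driver-side read after the "job" finishes
```

The fault-tolerance caveat from the list above applies: if a task can rerun, its add() calls can be applied more than once, so only updates inside actions are guaranteed exactly-once.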
Spark Streaming
  • Stateless Transformations
  • Output Operations
  • Checkpointing
  • Core Sources
  • Receiver Fault Tolerance
  • Worker Fault Tolerance
  • Stateful Transformations
  • Batch and Window Sizes
  • Architecture and Abstraction
  • Performance Considerations
  • Streaming UI
  • Driver Fault Tolerance
  • Multiple Sources and Cluster Sizing
  • Processing Guarantees
  • A Simple Example
  • Input Sources
  • Additional Sources
  • Transformations
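Windowed stateful transformations recompute over the last N batch intervals, every slide interval. A sketch of that micro-batch windowing shape in plain Python (window of 3 batches sliding by 2; the batch contents are made up):

```python
from collections import deque

def windowed_counts(batches, window_len=3, slide=2):
    """Sliding-window aggregation over micro-batches, the shape of a
    windowed count: emit the event count of the last `window_len`
    batches, once every `slide` batches."""
    window = deque(maxlen=window_len)   # old batches fall off automatically
    out = []
    for i, batch in enumerate(batches, start=1):
        window.append(batch)
        if i % slide == 0:
            out.append(sum(len(b) for b in window))
    return out

batches = [["a"], ["b", "c"], [], ["d", "e", "f"], ["g"]]
assert windowed_counts(batches) == [3, 5]
```

As in real streaming jobs, both window length and slide must be multiples of the batch interval, which is why they are expressed here in whole batches.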
Spark SQL
  • User-Defined Functions
  • Long-Lived Tables and Queries
  • Spark SQL Performance
  • Apache Hive
  • Loading and Saving Data
  • Initializing Spark SQL
  • Parquet
  • Performance Tuning Options
  • SchemaRDDs
  • Caching
  • JSON
  • From RDDs
  • Linking with Spark SQL
  • Spark SQL UDFs
  • Using Spark SQL in Applications
  • Basic Query Example
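The UDF pattern above (register a Python function, then call it from SQL) can be illustrated with the stdlib sqlite3 module standing in for the Spark SQL engine; the strLen function and tweets table are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT)")
conn.executemany("INSERT INTO tweets VALUES (?)", [("hi",), ("hello world",)])

# Register a Python function under a SQL name so queries can call it --
# the same registration pattern as a Spark SQL UDF.
conn.create_function("strLen", 1, lambda s: len(s))

rows = conn.execute("SELECT strLen(text) FROM tweets ORDER BY 1").fetchall()
assert rows == [(2,), (11,)]
```

The trade-off is also the same: every row crosses the SQL-to-Python boundary, so a UDF is slower than an equivalent built-in function.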
Tuning and Debugging Spark
  • Driver and Executor Logs
  • Memory Management
  • Finding Information
  • Configuring Spark with SparkConf
  • Key Performance Considerations
  • Components of Execution: Jobs, Tasks, and Stages
  • Spark Web UI
  • Hardware Provisioning
  • Level of Parallelism
  • Serialization Format
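On serialization format the practical advice is to measure rather than guess, the same exercise as choosing between Java serialization and Kryo: serialize a representative record with each candidate and compare size and CPU cost. A stdlib-only sketch comparing pickle and JSON (the record is made up; which format wins depends on your data):

```python
import json
import pickle

record = {"user_id": 12345, "tags": ["a", "b", "c"], "score": 0.97}

pickled = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
as_json = json.dumps(record).encode()

# Compare serialized sizes; repeat with timeit for the CPU side.
sizes = {"pickle": len(pickled), "json": len(as_json)}
assert all(n > 0 for n in sizes.values())
```

Serialized size matters twice in Spark: for shuffle traffic over the network and for the footprint of serialized cached partitions.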
Kafka Internals
  • Kafka Core Concepts
  • Brokers
  • Topics
  • Producers
  • Replicas
  • Partitions
  • Consumers
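These concepts meet in the producer's partitioner: a keyed record's partition is derived from its key, which is what gives per-key ordering within a topic. Real Kafka hashes the key bytes with murmur2; Python's hash() stands in for it in this sketch:

```python
def choose_partition(key, num_partitions):
    """Keyed records go to hash(key) % partitions, so one key always lands
    on the same partition of the topic."""
    return hash(key) % num_partitions

topic_partitions = 6
p1 = choose_partition("user-42", topic_partitions)
p2 = choose_partition("user-42", topic_partitions)

assert p1 == p2                    # same key -> same partition -> ordering
assert 0 <= p1 < topic_partitions
```

It also implies the usual operational caveat: changing the partition count remaps keys, breaking any ordering assumption that spans the change.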
Operating Kafka
  • P&S Tuning
  • Monitoring
  • Deploying
  • Architecture
  • Hardware Specs
Developing Kafka Apps
  • Serialization
  • Compression
  • Testing
  • Case Study
  • Reading from Kafka
  • Writing to Kafka
Storm Internals
  • Developing Storm Apps
  • Case Studies
  • Bolts and Topologies
  • P&S Tuning
  • Serialization
  • Testing
  • Kafka Integration
Storm Core Concepts
  • Spouts
  • Groupings
  • Tuples
  • Bolts
  • Parallelism
  • Topologies
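The concepts above compose into a topology: a spout emits tuples and bolts transform them, with a fields grouping routing all copies of a word to one counter. A plain-Python sketch of the word-count shape (generators stand in for Storm's streams; the sentences are made up):

```python
def sentence_spout():
    """Spout: the source of tuples."""
    for s in ["the quick fox", "the lazy dog"]:
        yield s

def split_bolt(sentences):
    """Bolt: one tuple in, many tuples out."""
    for s in sentences:
        yield from s.split()

def count_bolt(words):
    """Stateful bolt, fed by a fields grouping on the word so each word's
    count lives in exactly one place."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
assert counts["the"] == 2 and counts["fox"] == 1
```

Parallelism in real Storm comes from running multiple instances of each bolt, with the grouping deciding which instance receives a given tuple.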
Operating Storm
  • Monitoring
  • Architecture
  • Deploying
  • Hardware Specs