Spark Internals
Distributing Data with HDFS
- Interfaces
- Hadoop Filesystems
- The Design of HDFS
- Keeping an HDFS Cluster Balanced
- Hadoop Archives
- Limitations
- Anatomy of a File Write
- Anatomy of a File Read
- Coherency Model
- Basic Filesystem Operations
- Querying the Filesystem
- Reading Data Using the FileSystem API
- Directories
- Deleting Data
- Reading Data from a Hadoop URL
- Writing Data
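A minimal sketch of the FileSystem API topics above (reading, writing, seeking); the NameNode address and file path are placeholders:

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

object HdfsReadWrite {
  def main(args: Array[String]): Unit = {
    // Hypothetical NameNode address and file path
    val uri = "hdfs://namenode:8020/user/demo/sample.txt"
    val fs = FileSystem.get(URI.create(uri), new Configuration())

    // Writing data: create() returns an FSDataOutputStream
    val out = fs.create(new Path(uri))
    out.write("Hello, HDFS!\n".getBytes("UTF-8"))
    out.close()

    // Reading data: open() returns an FSDataInputStream, which supports seek()
    val in = fs.open(new Path(uri))
    IOUtils.copyBytes(in, System.out, 4096, false)
    in.seek(0)                                  // rewind and read the file again
    IOUtils.copyBytes(in, System.out, 4096, true)
  }
}
```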
- Serialization
- Implementing a Custom Writable
- Serialization Frameworks
- The Writable Interface
- Writable Classes
- Avro
- ChecksumFileSystem
- LocalFileSystem
- Data Integrity in HDFS
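A sketch of a custom Writable along the lines of the classic TextPair example; the class and field names are illustrative:

```scala
import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.{Text, WritableComparable}

// A pair of Text fields: write()/readFields() define the serialized form,
// compareTo() makes it usable as a sortable key.
class TextPair(var first: Text, var second: Text) extends WritableComparable[TextPair] {
  def this() = this(new Text(), new Text())     // no-arg constructor required by the framework

  override def write(out: DataOutput): Unit = {
    first.write(out)
    second.write(out)
  }

  override def readFields(in: DataInput): Unit = {
    first.readFields(in)
    second.readFields(in)
  }

  override def compareTo(other: TextPair): Int = {
    val cmp = first.compareTo(other.first)
    if (cmp != 0) cmp else second.compareTo(other.second)
  }

  override def hashCode: Int = first.hashCode * 163 + second.hashCode

  override def equals(o: Any): Boolean = o match {
    case tp: TextPair => first == tp.first && second == tp.second
    case _            => false
  }

  override def toString: String = first.toString + "\t" + second.toString
}
```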
- Large size enables efficient read of columns
- New types (datetime, decimal)
- Encoding specific to the column type
- Default stripe size is 250 MB
- A single file as output of each task
- Split files without scanning for markers
- Bound the amount of memory required for reading or writing
- Lowers pressure on the NameNode
- Dramatically simplifies integration with Hive
- Break file into sets of rows called a stripe
- Complex types (struct, list, map, union)
- Support for the Hive type model
- Count, min, max, and sum for each column
- Types, number of rows
- Contains list of stripes
- Required for skipping rows
- Position in each stream
- Min and max for each column
- Currently every 10,000 rows
- Could include bit field or bloom filter
- Contains compression parameters
- Size of compressed footer
- Directory of stream locations
- Required for table scan
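A hedged sketch of writing an ORC file with the standalone ORC core Java API (org.apache.orc); the Hive-embedded writer the bullets above describe differs in packaging, and the output path and schema are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.exec.vector.{BytesColumnVector, LongColumnVector}
import org.apache.orc.{OrcFile, TypeDescription}

object OrcWriteSketch {
  def main(args: Array[String]): Unit = {
    val schema = TypeDescription.fromString("struct<name:string,age:int>")
    val writer = OrcFile.createWriter(
      new Path("people.orc"),                    // placeholder output path
      OrcFile.writerOptions(new Configuration())
        .setSchema(schema)
        .stripeSize(256L * 1024 * 1024))         // stripe size is configurable per writer

    // Rows are buffered into a vectorized batch, then appended to the current stripe
    val batch = schema.createRowBatch()
    val name  = batch.cols(0).asInstanceOf[BytesColumnVector]
    val age   = batch.cols(1).asInstanceOf[LongColumnVector]

    val row   = batch.size
    val bytes = "alice".getBytes("UTF-8")
    name.setRef(row, bytes, 0, bytes.length)
    age.vector(row) = 30
    batch.size += 1

    writer.addRowBatch(batch)
    writer.close()                               // close() writes the file footer and postscript
  }
}
```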
- Nested Encoding
- Configurations
- Error recovery
- Extensibility
- Nulls
- File format
- Data Pages
- Motivation
- Unit of parallelization
- Logical Types
- Metadata
- Modules
- Column chunks
- Separating metadata and column data
- Checksumming
- Types
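A sketch of inspecting Parquet footer metadata (row groups, column chunks) with parquet-mr; readFooter is the older, since-deprecated entry point, and the input path is taken from the command line:

```scala
import scala.collection.JavaConverters._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader

object InspectParquetFooter {
  def main(args: Array[String]): Unit = {
    // Reading only the footer illustrates how Parquet separates metadata from column data
    val footer = ParquetFileReader.readFooter(new Configuration(), new Path(args(0)))
    println(footer.getFileMetaData.getSchema)             // schema, including logical types

    for (rowGroup <- footer.getBlocks.asScala) {          // a row group is the unit of parallelization
      println(s"row group: ${rowGroup.getRowCount} rows")
      for (chunk <- rowGroup.getColumns.asScala) {        // column chunks, each made of data pages
        println(s"  ${chunk.getPath} codec=${chunk.getCodec} bytes=${chunk.getTotalSize}")
      }
    }
  }
}
```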
- MapFile
- SequenceFile
- Compression
- Codecs
- Using Compression in MapReduce
- Compression and Input Splits
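A sketch of writing a block-compressed SequenceFile, which remains splittable; the output path and codec choice are illustrative:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, SequenceFile, Text}
import org.apache.hadoop.io.SequenceFile.CompressionType
import org.apache.hadoop.io.compress.DefaultCodec

object SequenceFileWriteSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val path = new Path("numbers.seq")                    // placeholder output path

    // BLOCK compression compresses batches of records, and the file stays splittable
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(path),
      SequenceFile.Writer.keyClass(classOf[IntWritable]),
      SequenceFile.Writer.valueClass(classOf[Text]),
      SequenceFile.Writer.compression(CompressionType.BLOCK, new DefaultCodec()))

    val key = new IntWritable()
    val value = new Text()
    for (i <- 1 to 100) {
      key.set(i)
      value.set(s"record-$i")
      writer.append(key, value)
    }
    writer.close()
  }
}
```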
- GraphX
- MLlib
- Spark SQL
- Data Processing Applications
- Spark Streaming
- What Is Apache Spark?
- Data Science Tasks
- Spark Core
- Storage Layers for Spark
- Who Uses Spark, and for What?
- A Unified Stack
- Cluster Managers
- Lazy Evaluation
- Common Transformations and Actions
- Passing Functions to Spark
- Creating RDDs
- RDD Operations
- Actions
- Transformations
- Scala
- Java
- Persistence
- Python
- Converting Between RDD Types
- RDD Basics
- Basic RDDs
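A small sketch tying together the RDD topics above (creating RDDs, lazy transformations, actions, persistence); the input file and local master are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddBasics {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

    // Creating RDDs: from a file or from a local collection
    val lines = sc.textFile("README.md")                  // placeholder input file
    val nums  = sc.parallelize(List(1, 2, 3, 4))

    // Transformations are lazy: nothing runs until an action is called
    val errors = lines.filter(_.contains("ERROR"))
    errors.persist()                                      // keep the result around for reuse

    // Actions force evaluation
    println(errors.count())
    errors.take(5).foreach(println)
    println(nums.map(x => x * x).reduce(_ + _))

    sc.stop()
  }
}
```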
- Expressing Existing Programming Models
- Fault Recovery
- Interpreter Integration
- Memory Management
- Implementation
- MapReduce
- RDD Operations in Spark
- Iterative MapReduce
- Console Log Mining
- Google's Pregel
- User Applications Built with Spark
- Behavior with Insufficient Memory
- Support for Checkpointing
- A Fault-Tolerant Abstraction
- Evaluation
- Job Scheduling
- Spark Programming Interface
- Advantages of the RDD Model
- Understanding the Speedup
- Leveraging RDDs for Debugging
- Iterative Machine Learning Applications
- Explaining the Expressivity of RDDs
- Representing RDDs
- Applications Not Suitable for RDDs
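Roughly the console log mining example from the RDD paper referenced above, assuming an existing SparkContext `sc`, a placeholder log path, and tab-separated log lines whose fourth field is a timestamp:

```scala
val lines  = sc.textFile("hdfs://namenode:8020/logs")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()                                          // keep the error lines in cluster memory

// Count errors mentioning MySQL
errors.filter(_.contains("MySQL")).count()

// Pull the time field of errors mentioning HDFS
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()
```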
- Sorting Data
- Operations That Affect Partitioning
- Determining an RDD’s Partitioner
- Grouping Data
- Motivation
- Aggregations
- Data Partitioning (Advanced)
- Actions Available on Pair RDDs
- Joins
- Creating Pair RDDs
- Operations That Benefit from Partitioning
- Transformations on Pair RDDs
- Example: PageRank
- Custom Partitioners
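A sketch of pair-RDD aggregation and explicit partitioning, assuming an existing SparkContext `sc`; the sample data is made up:

```scala
import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Aggregation on a pair RDD
val counts = pairs.reduceByKey(_ + _)

// Pre-partitioning lets operations such as join reuse the partitioner
// instead of reshuffling the whole dataset
val partitioned = pairs.partitionBy(new HashPartitioner(4)).persist()
println(partitioned.partitioner)                          // Some(HashPartitioner)

val other  = sc.parallelize(Seq(("a", "x"), ("b", "y")))
val joined = partitioned.join(other)
joined.collect().foreach(println)
```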
- Hadoop Input and Output Formats
- File Formats
- Local/“Regular” FS
- Text Files
- Java Database Connectivity
- Structured Data with Spark SQL
- Elasticsearch
- File Compression
- Apache Hive
- Cassandra
- Object Files
- Comma-Separated Values and Tab-Separated Values
- HBase
- Databases
- Filesystems
- SequenceFiles
- JSON
- HDFS
- Motivation
- JSON
- Amazon S3
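A sketch of loading and saving a few of the formats listed above, assuming an existing SparkContext `sc`; all paths are placeholders:

```scala
// (On Spark before 1.3, `import org.apache.spark.SparkContext._` supplies
// the Writable conversions used below.)

// Text files
val text = sc.textFile("file:///tmp/input.txt")
text.saveAsTextFile("file:///tmp/text-out")

// Comma-separated values: load as text, then split each line
val csvRows = sc.textFile("file:///tmp/people.csv").map(_.split(","))

// SequenceFiles: keys and values are converted to and from Writables
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2)))
pairs.saveAsSequenceFile("file:///tmp/pairs-seq")
val reloaded = sc.sequenceFile[String, Int]("file:///tmp/pairs-seq")
println(reloaded.collect().toList)
```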
- Scheduling Within and Between Spark Applications
- Spark Runtime Architecture
- A Scala Spark Application Built with sbt
- Packaging Your Code and Dependencies
- Launching a Program
- A Java Spark Application Built with Maven
- Hadoop YARN
- Deploying Applications with spark-submit
- The Driver
- Standalone Cluster Manager
- Cluster Managers
- Executors
- Amazon EC2
- Cluster Manager
- Dependency Conflicts
- Apache Mesos
- Which Cluster Manager to Use?
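A minimal build.sbt sketch for the "A Scala Spark Application Built with sbt" topic; names and version numbers are placeholders:

```scala
// build.sbt
name := "simple-spark-app"
version := "0.1.0"
scalaVersion := "2.11.12"

// Mark Spark as "provided": spark-submit supplies it at runtime,
// so the assembly jar stays small and avoids dependency conflicts.
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.3" % "provided"
```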
- Spark YARN Mode
- Resource Manager
- Node Manager
- Workers
- Containers
- Threads
- Task
- Executors
- Application Master
- Multiple Applications
- Tuning Parameters
- Spark Caching
- With Serialization
- Off-heap
- In Memory
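A sketch of the caching options above using explicit storage levels, assuming an existing SparkContext `sc` and a placeholder input path:

```scala
import org.apache.spark.storage.StorageLevel

// An RDD can be cached at one storage level at a time.
val data = sc.textFile("hdfs:///tmp/big-input")

data.persist(StorageLevel.MEMORY_ONLY)          // in memory, as deserialized objects
// data.persist(StorageLevel.MEMORY_ONLY_SER)   // in memory, serialized: smaller but more CPU
// data.persist(StorageLevel.OFF_HEAP)          // off-heap storage (experimental in Spark 1.x)
// data.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk when memory runs short

println(data.count())                           // the first action materializes the cache
data.unpersist()
```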
- Standalone Mode
- Task
- Multiple Applications
- Executors
- Tuning Parameters
- Workers
- Threads
- Working on a Per-Partition Basis
- Optimizing Broadcasts
- Accumulators
- Custom Accumulators
- Accumulators and Fault Tolerance
- Numeric RDD Operations
- Piping to External Programs
- Broadcast Variables
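A sketch of accumulators, broadcast variables, and per-partition work, using the Spark 1.x accumulator API and assuming an existing SparkContext `sc`; the input path and lookup table are made up:

```scala
val blankLines = sc.accumulator(0)
val lookup = sc.broadcast(Map("spark" -> 1, "hdfs" -> 2))  // shipped once per executor

val lines = sc.textFile("file:///tmp/input.txt")
val words = lines.flatMap { line =>
  if (line.isEmpty) blankLines += 1             // counted on the workers, read on the driver
  line.split(" ")
}

words.foreachPartition { iter =>                // per-partition work: set up costly state once
  iter.foreach(w => lookup.value.get(w))
}
println(s"Blank lines: ${blankLines.value}")
```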
- Stateless Transformations
- Output Operations
- Checkpointing
- Core Sources
- Receiver Fault Tolerance
- Worker Fault Tolerance
- Stateful Transformations
- Batch and Window Sizes
- Architecture and Abstraction
- Performance Considerations
- Streaming UI
- Driver Fault Tolerance
- Multiple Sources and Cluster Sizing
- Processing Guarantees
- A Simple Example
- Input Sources
- Additional Sources
- Transformations
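A minimal DStream word count in the spirit of "A Simple Example" above; the socket source, port, and checkpoint directory are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("streaming-wordcount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))          // 1-second batch interval
    ssc.checkpoint("file:///tmp/streaming-checkpoint")        // needed for stateful transformations

    val lines = ssc.socketTextStream("localhost", 9999)       // placeholder socket source
    val counts = lines.flatMap(_.split(" "))
                      .map(w => (w, 1))
                      .reduceByKey(_ + _)                     // stateless, per-batch transformation
    counts.print()                                            // an output operation

    ssc.start()                                               // nothing runs until start()
    ssc.awaitTermination()
  }
}
```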
- User-Defined Functions
- Long-Lived Tables and Queries
- Spark SQL Performance
- Apache Hive
- Loading and Saving Data
- Initializing Spark SQL
- Parquet
- Performance Tuning Options
- SchemaRDDs
- Caching
- JSON
- From RDDs
- Linking with Spark SQL
- Spark SQL UDFs
- Using Spark SQL in Applications
- Basic Query Example
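A sketch of initializing Spark SQL, running a basic query, and registering a UDF, using the Spark 1.3-era API (earlier releases, where these were still SchemaRDDs, used registerFunction instead of udf.register); the JSON file and its fields are hypothetical, and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val tweets = sqlContext.jsonFile("file:///tmp/tweets.json")
tweets.registerTempTable("tweets")

val topTweets = sqlContext.sql(
  "SELECT text, retweetCount FROM tweets ORDER BY retweetCount LIMIT 10")
topTweets.collect().foreach(println)

// A simple UDF usable from SQL
sqlContext.udf.register("strLen", (s: String) => s.length)
sqlContext.sql("SELECT strLen(text) FROM tweets LIMIT 10").collect()
```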
- Driver and Executor Logs
- Memory Management
- Finding Information
- Configuring Spark with SparkConf
- Key Performance Considerations
- Components of Execution: Jobs, Tasks, and Stages
- Spark Web UI
- Hardware Provisioning
- Level of Parallelism
- Serialization Format
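A sketch of configuring an application with SparkConf, touching the serialization format and level of parallelism; the values are illustrative, not recommendations:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("tuned-app")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // serialization format
  .set("spark.default.parallelism", "200")                               // level of parallelism
  .set("spark.executor.memory", "2g")                                    // memory per executor
val sc = new SparkContext(conf)

// Parallelism can also be set per shuffle operation
val counts = sc.textFile("hdfs:///tmp/words")                            // placeholder path
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKey(_ + _, 100)
```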
- Kafka Core Concepts
- Brokers
- Topics
- Producers
- Replicas
- Partitions
- Consumers
- P&S Tuning
- Monitoring
- Deploying
- Architecture
- Hardware Specs
- Serialization
- Compression
- Testing
- Case Study
- Reading from Kafka
- Writing to Kafka
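A minimal producer sketch with the Kafka Java client from Scala; the broker address, topic, and record contents are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")     // placeholder broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("acks", "all")                             // wait for the in-sync replicas

    val producer = new KafkaProducer[String, String](props)
    // Records with the same key land in the same partition of the topic
    producer.send(new ProducerRecord[String, String]("events", "user-1", "clicked"))
    producer.close()
  }
}
```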
- Developing Storm Apps
- Case Studies
- Bolts and Topologies
- P&S Tuning
- Serialization
- Testing
- Kafka Integration
- Storm Core Concepts
- Spouts
- Groupings
- Tuples
- Bolts
- Parallelism
- Topologies
- Monitoring
- Architecture
- Deploying
- Hardware Specs
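A tiny topology sketch wiring a spout to a bolt with a fields grouping, using the org.apache.storm packages (older releases used backtype.storm); component IDs and parallelism hints are illustrative:

```scala
import org.apache.storm.{Config, LocalCluster}
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.tuple.{Fields, Tuple, Values}

// A bolt that keeps a running count per word and emits (word, count) tuples.
class CountBolt extends BaseBasicBolt {
  private var counts = Map.empty[String, Long].withDefaultValue(0L)

  override def execute(tuple: Tuple, collector: BasicOutputCollector): Unit = {
    val word = tuple.getString(0)
    counts = counts.updated(word, counts(word) + 1)
    collector.emit(new Values(word, Long.box(counts(word))))
  }

  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word", "count"))
}

object WordCountTopology {
  def main(args: Array[String]): Unit = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout, 2)        // spout parallelism hint
    builder.setBolt("counter", new CountBolt, 4)           // bolt parallelism hint
           .fieldsGrouping("words", new Fields("word"))    // same word always hits the same task

    val cluster = new LocalCluster()
    cluster.submitTopology("word-count", new Config, builder.createTopology())
  }
}
```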