Ultra Low Latency Systems Design
Understanding and Designing Ultra Low Latency Systems
- Traditional models of concurrent programming have been around for some time. They work well and have matured considerably over the last few years. Every once in a while, however, low latency and high throughput requirements arise that traditional models of concurrency and application design cannot meet: consider handling 400-500 million operations per second on a single core. Designing such systems means setting aside the traditional models of application design and thinking differently. In this seminar we discuss some of the approaches that make this possible. The approach is hardware friendly: it takes a fresh look at data from the hardware's perspective, requires a working understanding of how modern hardware behaves, and calls for tools that can track down a particular stall and, ideally, the reason behind it, which in turn provides pointers for a redesign where one is needed. In essence, this training is about hardware-friendly architectures and highly specialized data structures that fully exploit the underlying behavior of the processor, caches, memory, disk, filesystem, and network. The training is structured as a repeating cycle: understand a hardware/OS concept, apply it to a design, measure it with a tool.
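To make "a fresh look at data from the hardware's perspective" concrete, here is a small illustrative sketch of ours (not course material): it sums the same array once sequentially and once with a cache-line-sized stride. On typical hardware the sequential pass is several times faster because it streams whole cache lines and keeps the prefetcher engaged. The timing is deliberately naive; the benchmarking segment of the course shows how to measure this properly with JMH.

```java
public class TraversalDemo {
    static final int SIZE = 1 << 24; // 16M ints = 64 MB, larger than any cache

    public static void main(String[] args) {
        int[] data = new int[SIZE];
        for (int i = 0; i < SIZE; i++) data[i] = i; // touch every page up front

        long t0 = System.nanoTime();
        long sequential = 0;
        for (int i = 0; i < SIZE; i++) sequential += data[i]; // prefetcher-friendly
        long t1 = System.nanoTime();

        long strided = 0;
        final int STRIDE = 16; // 16 ints = 64 bytes = one cache line per access
        for (int s = 0; s < STRIDE; s++)
            for (int i = s; i < SIZE; i += STRIDE)
                strided += data[i]; // same additions, but each access lands on a new line
        long t2 = System.nanoTime();

        System.out.printf("sequential %d ms, strided %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sequential, strided);
    }
}
```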
Audience:
- Software Architects
- JVM Tuners
- Technology Officers
- Senior Software Engineers
Objectives:
- Understand the cache coherency protocol
- Gain an architectural understanding of the Disruptor
- Processor micro-architecture refresher
- Understand and measure the effect of thread affinity
- Understand off-heap techniques
- Understand and measure the effect of various levels of caches
- Understand how ultra low latency designs are done
- Understand and measure the effect of prefetches
Prerequisites:
- A very good knowledge of Java
- Substantial experience in software design and architecture
- A working knowledge of C/C++ is helpful
- Familiarity with the existing data structures in Java
Methodology:
- Instructor led
- Slides with pictures to represent concepts
- Hands-on sessions to gauge the effects of hardware
Course Outline:
- Branch Mispredictions, Wasted Work, Misprediction Penalties and UOP Flow
- Intel® Xeon™, Sandy Bridge, Ivy Bridge and Haswell Processors
- Uncore Memory Subsystem
- Performance Analysis
- Processor Performance Events: Overview
- Performance Analysis and the Intel® Core™ i Processor and Intel® Xeon™ Processor
- Basic Intel® Core™ i Processor and Intel® Xeon™ Processor Architecture and the Core Out-of-Order Pipeline
- Core Performance Monitoring Unit (PMU)
- Core Memory Subsystem
- Uncore Performance Monitoring Unit (PMU)
- CPU Run Queues
- Saturation
- Software
- Priority Inversion
- Terminology
- Word Size
- Concepts
- Instruction Pipeline
- CPU Memory Caches
- Microcode and Exceptions
- Models
- Clock Rate
- Branch Mispredictions
- Front End Events
- Hardware
- User-Time/Kernel-Time
- Compiler Optimization
- CPU Architecture
- Utilization
- Instruction Width
- Multiprocess, Multithreading
- Preemption
- Front-End Code Generation Metrics
- CPI, IPC
- Architecture
- Lock Effects
- Ordering Effects
- Branch Prediction Effects
- Cache Line Effects (see the false-sharing sketch after this list)
- Cache Effects
- Thread Affinity
- Multi-Core Effects
- Prefetcher Effects
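As a taste of what cache line effects mean in practice, here is a minimal illustrative sketch of ours (not course material): two threads increment adjacent longs that share a cache line, then increment padded longs that do not. The padded version typically runs several times faster because it avoids false sharing. Note that manual padding is a heuristic, since the JVM may reorder fields; JDK 8+ also offers the @Contended annotation (with -XX:-RestrictContended).

```java
public class FalseSharingDemo {
    // Two counters packed next to each other: they likely share a cache line.
    static class Shared {
        volatile long a;
        volatile long b;
    }

    // Manual padding pushes the two counters onto different cache lines.
    // Caveat: the JVM does not guarantee field order; @Contended is the robust way.
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
        volatile long b;
    }

    static final long ITERATIONS = 100_000_000L;

    static long run(Runnable w1, Runnable w2) throws InterruptedException {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long t0 = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        Shared s = new Shared();
        Padded p = new Padded();
        long shared = run(() -> { for (long i = 0; i < ITERATIONS; i++) s.a++; },
                          () -> { for (long i = 0; i < ITERATIONS; i++) s.b++; });
        long padded = run(() -> { for (long i = 0; i < ITERATIONS; i++) p.a++; },
                          () -> { for (long i = 0; i < ITERATIONS; i++) p.b++; });
        System.out.printf("shared line: %d ms, padded: %d ms%n", shared, padded);
    }
}
```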
- Second-Level Cache
- Random versus Sequential I/O
- Models
- Terminology
- Concepts
- Caching
- File System Interfaces
- File System Latency
- Prefetch
- File System Cache
- Configuring Write-Ahead
- Configuring Read-Ahead
- Ramfs and tmpfs
- File System Latency
- Volumes and Pools
- Access Timestamps
- Memory-Mapped Files (see the sketch after this list)
- Synchronous Writes
- Random versus Sequential I/O
- Metadata
- Second-Level Cache
- Write-Back Caching
- Non-Blocking I/O
- Prefetch
- File System I/O Stack
- Logical versus Physical I/O
- File System Interfaces
- VFS
- File System Latency
- Special File Systems
- File System Types
- Terminology
- Caching
- Read-Ahead
- Raw and Direct I/O
- File System Features
- File System Caches
- File System Cache
- Models
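For the memory-mapped files item above, a minimal sketch of the core JDK mechanism (java.nio is the real API; the file name is illustrative): mapping a file gives direct access to the page cache, so reads and writes become plain loads and stores rather than read()/write() system calls.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("/tmp/mmap-demo.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // Map 4 KB of the file directly into the process address space.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, 42L);     // a plain memory store: no write() syscall
            long v = buf.getLong(0); // a plain memory load: no read() syscall
            System.out.println("read back: " + v);
            buf.force();             // msync: flush dirty pages to the device
        }
    }
}
```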
- Terminology
- Caching Disk
- Simple Disk
- Controller
- IOPS Are Not Equal
- Caching
- I/O Size
- I/O Wait
- Concepts
- Storage Type
- Disk Types
- Utilization
- Time Scales
- Synchronous versus Asynchronous
- Read/Write Ratio
- Measuring Time
- Non-Data-Transfer Disk Commands
- Operating System Disk I/O Stack
- Saturation
- Random versus Sequential I/O
- Scaling
- Micro-Benchmarking
- Event Tracing
- Latency Analysis
- Cache Tuning
- Resource Controls
- Static Performance Tuning
- Virtual Memory
- Concepts
- Terminology
- File System Cache Usage
- Swapping
- Utilization and Saturation
- Overcommit
- Allocators
- Process Address Space
- Demand Paging
- Paging
- Word Size
- Ordering fields of DataValueClasses
- Write with Direct Reference
- Off-Heap Data Structures (see the sketch after this list)
- Off-Heap Queues
- Write with Direct Instance
- Off-Heap Maps
- Read with Direct Reference
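A minimal sketch of the off-heap idea using only the core JDK (ByteBuffer.allocateDirect is real API; the fixed-layout record is our illustrative assumption): values live outside the GC heap at computed offsets, so they are never scanned or moved by the collector and incur no per-record object header.

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    // A fixed binary layout instead of an object graph: 8-byte id + 8-byte price.
    static final int RECORD_SIZE = 16;
    static final int ID_OFFSET = 0;
    static final int PRICE_OFFSET = 8;

    public static void main(String[] args) {
        int records = 1_000_000;
        // Native memory: invisible to the garbage collector.
        ByteBuffer store = ByteBuffer.allocateDirect(records * RECORD_SIZE);

        // "Write with direct reference": poke fields at computed offsets,
        // no boxing, no per-record allocation.
        for (int i = 0; i < records; i++) {
            int base = i * RECORD_SIZE;
            store.putLong(base + ID_OFFSET, i);
            store.putLong(base + PRICE_OFFSET, Double.doubleToLongBits(i * 0.5));
        }

        // "Read with direct reference": fetch one field of record 42 without
        // materializing the record as an object.
        int base = 42 * RECORD_SIZE;
        double price = Double.longBitsToDouble(store.getLong(base + PRICE_OFFSET));
        System.out.println("record 42 price = " + price);
    }
}
```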
- Network Interface
- Terminology
- Models
- Replication: How It Works
- Zero Copy / sendfile (see the sketch after this list)
- TCP/IP Throttling
- Multiple Processes on the same server with Replication
- How to Set Up UDP Replication
- TCP / UDP Background
- TCP / UDP Replication
- Identifier for Replication
- Software
- Latency
- Controller
- Encapsulation
- Protocols
- Buffering
- Packet Size
- Protocol Stack
- Connection Backlog
- Networks and Routing
- Hardware
- Interface Negotiation
- Utilization
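For the zero copy / sendfile item above, a minimal sketch using the real java.nio API (the file path and port are illustrative): FileChannel.transferTo hands the copy to the kernel, which on Linux uses sendfile(2) and avoids staging the data through a user-space buffer.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("/tmp/payload.bin"),
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9000))) {
            long position = 0, remaining = file.size();
            // transferTo may send less than requested; loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```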
- DTrace
- Perf
- Profiling
- Observability Sources
- Solaris Analyzer
- /sys
- Ftrace
- kstat
- Tracing
- SystemTap
- /proc
- JMH (see the benchmark skeleton after this list)
- Tool Types
- Delay Accounting
- Microstate Accounting
- Counters
- Monitoring
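Because naive timing loops mislead (dead-code elimination, warm-up, on-stack replacement), the hands-on sessions lean on JMH. Here is a minimal skeleton using the real JMH annotations (the benchmarked expression is an arbitrary placeholder):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)     // report mean time per operation
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)                 // one instance of the fields per thread
public class AddBenchmark {
    long x = 1, y = 2;

    @Benchmark
    public long add() {
        // Returning the value stops the JIT from eliminating the work.
        return x + y;
    }
    // Build the JMH jar and run: java -jar target/benchmarks.jar AddBenchmark
}
```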
- Passive Benchmarking
- Sanity Check
- Ramping Load
- Activities
- Replay
- Active Benchmarking
- Workload Characterization
- Micro-Benchmarking
- Background
- Statistical Analysis
- Benchmarking Types
- Custom Benchmarks
- Effective Benchmarking
- Methodology
- CPU Profiling
- Industry Standards
- Simulation
- Benchmarking Sins
- OpenHFT Architecture
- Compiler
- Generating off-heap classes for on-heap interfaces
- Affinity Thread Factory
- isolcpus
- Using perf and likwid to measure L1, L2 and L3 cache performance
- PosixJNAAffinity
- Write Buffer
- Lock Inventory
- How much does thread affinity matter? (see the AffinityLock sketch after this list)
- Using likwid to measure prefetchers
- Non Forging Affinity Lock
- Affinity Strategies
- Affinity Support
- Cache Architecture
- Using mpstat to measure and verify
- Read Buffers
- CPU Layout
- Consumer insensitive
- How does it collect garbage?
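A minimal sketch of pinning a critical thread with the OpenHFT Java-Thread-Affinity library covered above (AffinityLock.acquireLock() is the library's real entry point; the busy loop is an illustrative placeholder for the latency-critical work):

```java
import net.openhft.affinity.AffinityLock;

public class PinnedWorker {
    public static void main(String[] args) {
        new Thread(PinnedWorker::run, "pinned-worker").start();
    }

    static void run() {
        // Reserve a free CPU for this thread; combined with isolcpus the
        // kernel keeps other tasks off it, so the hot loop is not descheduled.
        AffinityLock lock = AffinityLock.acquireLock();
        try {
            System.out.println("pinned to CPU " + lock.cpuId());
            long ops = 0;                        // placeholder for the hot loop
            while (ops < 1_000_000_000L) ops++;
        } finally {
            lock.release();
        }
    }
}
```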
- Messaging between processes via shared memory (see the sketch after this list)
- Synchronous text logging
- High throughput trading systems
- Messaging across systems
- Low latency, high frequency trading
- Supports recording micro-second timestamps across the systems
- Synchronous binary data logging
- Cache friendly
- Functionality is simple and low level by design
- Very fast embedded persistence for Java.
- Replay of production data in test
- Introduction to Chronicle
- Lock-less
- Supports thread affinity
- Shared memory
- Text or binary
- GC free
- Replicated over TCP
- Advanced Off-Heap IPC in Java
- Low latency, high throughput software
- OpenHFT Chronicle, low latency logging, event store and IPC. (record / log everything)
- Micro-second latency.
- OpenHFT Collections, cross process embedded persisted data stores. (only need the latest)
- Millions of operations per second.
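To ground the "messaging between processes via shared memory" bullet, here is a deliberately simplified sketch of the mechanism Chronicle builds on (ours, not Chronicle's API): two JVMs map the same file, the writer appends a length-prefixed record, and the reader polls the length word to detect it. A real queue additionally needs memory fences, cache-line-aligned headers, and wrap-around handling.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedMemoryHello {
    public static void main(String[] args) throws Exception {
        boolean writer = args.length > 0 && args[0].equals("write");
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/ipc-demo.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer shm = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            if (writer) {
                byte[] msg = "hello from another process".getBytes();
                for (int i = 0; i < msg.length; i++) shm.put(4 + i, msg[i]);
                shm.putInt(0, msg.length);   // publish: length word written last
            } else {
                int len;
                while ((len = shm.getInt(0)) == 0) Thread.onSpinWait(); // poll
                byte[] msg = new byte[len];
                for (int i = 0; i < len; i++) msg[i] = shm.get(4 + i);
                System.out.println(new String(msg));
            }
        }
    }
}
```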
- HugeHashMap
- SharedHashMap (since renamed Chronicle Map; see the sketch after this list)
- Around 8x faster than System V IPC.
- Memory mapped files
- Durable on application restart
- One copy in memory.
- Can be used without serialization / deserialization.
- Thread safe operations across processes
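SharedHashMap evolved into Chronicle Map; a minimal sketch with the current builder API (key/value types, sizing hints and the file path are illustrative, and exact builder methods may differ between versions): the map is backed by a memory-mapped file, so other processes mapping the same file share the entries, and the data survives a restart.

```java
import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class SharedMapDemo {
    public static void main(String[] args) throws Exception {
        try (ChronicleMap<String, Long> positions = ChronicleMap
                .of(String.class, Long.class)
                .averageKeySize(16)      // sizing hint for variable-size keys
                .entries(1_000_000)
                .createPersistedTo(new File("/tmp/positions.dat"))) {
            positions.put("AAPL", 1_000L);
            System.out.println("AAPL position: " + positions.get("AAPL"));
        }
    }
}
```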
- Serialization and deserialization of data
- Writing and reading enumerable types with object pooling
- Writing and reading primitive types in binary and text without creating any garbage
- Random access to memory in native space (off heap)
- Provides the low-level functionality used by Java Chronicle
- Writing and reading Strings without creating an object (if it has been pooled)
- Serialization and deserialization of small messages in under a microsecond
- Data structure and work flow with no contention.
- Very fast message passing
- Overview of the Disruptor (see the sketch after this list)
- Allows you to go truly parallel
- Create your own
- Magical ring buffers
- Single writer principle
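A minimal end-to-end sketch with the real LMAX Disruptor 3.x API (the event class and handler body are illustrative): a publisher claims slots in the pre-allocated ring buffer and a handler consumes them, sequenced without locks.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorDemo {
    // Mutable event, pre-allocated once per slot: no allocation on the hot path.
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) throws InterruptedException {
        int bufferSize = 1024; // must be a power of two
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        // Consumer: runs on its own thread, sequenced by the ring buffer.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("got " + event.value + " @ seq " + sequence));
        disruptor.start();

        // Producer: claim a slot, mutate the pre-allocated event, publish.
        RingBuffer<LongEvent> ring = disruptor.getRingBuffer();
        for (long i = 0; i < 5; i++) {
            ring.publishEvent((event, sequence, v) -> event.value = v, i);
        }
        Thread.sleep(100);   // let the handler drain (demo only)
        disruptor.shutdown();
    }
}
```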
- DoubleAdder
- DoubleAccumulator
- LongAdder
- Measuring the incremental, tangible benefits of hardware-aware structures (see the LongAdder sketch below)
- ConcurrentAutoTable
- LongAccumulator
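As a closing illustration of "hardware-aware structures" (ours, not course material): LongAdder stripes its count across multiple cells to dodge the contended compare-and-swap that serializes AtomicLong under heavy write load. The comparison below is naive, useful only as a starting point before redoing it properly in JMH.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

public class AdderDemo {
    static final int THREADS = 8;
    static final long PER_THREAD = 10_000_000L;

    static long time(Runnable perThreadWork) throws InterruptedException {
        Thread[] ts = IntStream.range(0, THREADS)
                .mapToObj(i -> new Thread(perThreadWork)).toArray(Thread[]::new);
        long t0 = System.nanoTime();
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();
        // All threads CAS the same word: heavy cache-line ping-pong.
        long tAtomic = time(() -> {
            for (long i = 0; i < PER_THREAD; i++) atomic.incrementAndGet();
        });
        // Each thread mostly hits its own cell; sum() combines them on read.
        long tAdder = time(() -> {
            for (long i = 0; i < PER_THREAD; i++) adder.increment();
        });
        System.out.printf("AtomicLong: %d ms (%d), LongAdder: %d ms (%d)%n",
                tAtomic, atomic.get(), tAdder, adder.sum());
    }
}
```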