Ultra Low Latency Systems Design
Understanding and Designing Ultra Low Latency Systems
- Traditional models of concurrent programming have been around for some time. They work well and have matured considerably over the last few years. Every once in a while, however, low latency and high throughput requirements arise that traditional models of concurrency and application design cannot meet: consider handling 400-500 million operations per second on a single core. Designing such systems means setting aside the traditional models of application design and thinking differently. In this seminar we discuss some of the approaches that make this possible. The approach is hardware friendly: it takes a fresh look at data from the hardware's perspective, requires a working understanding of how modern hardware behaves, and calls for tools that can track down a particular stall and, ideally, the reason behind it, which in turn provides pointers for a redesign where one is needed. In essence, this training is about hardware-friendly architectures and highly specialized data structures that fully exploit the underlying behavior of the processor, caches, memory, disk, filesystem, and network. The training is structured as a repeating cycle: understand a hardware/OS concept, apply it to a design, measure it with a tool.
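To make "a fresh look at data from the hardware's perspective" concrete, here is a small illustrative sketch of ours (not course material): it sums the same array once sequentially and once with a cache-line-sized stride. On typical hardware the sequential pass is several times faster because it streams whole cache lines and keeps the prefetcher engaged. The timing is deliberately naive; the benchmarking segment of the course shows how to measure this properly with JMH.

```java
public class TraversalDemo {
    static final int SIZE = 1 << 24; // 16M ints = 64 MB, larger than any cache

    public static void main(String[] args) {
        int[] data = new int[SIZE];
        for (int i = 0; i < SIZE; i++) data[i] = i; // touch every page up front

        long t0 = System.nanoTime();
        long sequential = 0;
        for (int i = 0; i < SIZE; i++) sequential += data[i]; // prefetcher-friendly
        long t1 = System.nanoTime();

        long strided = 0;
        final int STRIDE = 16; // 16 ints = 64 bytes = one cache line per access
        for (int s = 0; s < STRIDE; s++)
            for (int i = s; i < SIZE; i += STRIDE)
                strided += data[i]; // same additions, but each access lands on a new line
        long t2 = System.nanoTime();

        System.out.printf("sequential %d ms, strided %d ms (sums %d / %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sequential, strided);
    }
}
```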
Audience:
- Software Architects
- JVM Tuners
- Technology Officers
- Senior Software Engineers
Objectives:
- Understand the cache coherency protocol
- Gain an architectural understanding of the Disruptor
- Processor micro-architecture refresher
- Understand and measure the effect of thread affinity
- Understand off-heap techniques
- Understand and measure the effect of various levels of caches
- Understand how ultra low latency designs are done
- Understand and measure the effect of prefetches
Prerequisites:
- A very good knowledge of Java
- Substantial experience in software design and architecture
- A working knowledge of C/C++ is helpful
- Familiarity with the existing data structures in Java
Methodology:
- Instructor led
- Slides with pictures to represent concepts
- Hands-on sessions to gauge the effects of hardware
Course Outline:
- Branch Mispredictions, Wasted Work, Misprediction Penalties and UOP Flow
- Intel® Xeon™, Sandy Bridge, Ivy Bridge and Haswell Processors
- Uncore Memory Subsystem
- Performance Analysis
- Processor Performance Events: Overview
- Performance Analysis and the Intel® Core™ i Processor and Intel® Xeon™ Processor
- Basic Intel® Core™ i Processor and Intel® Xeon™ Processor Architecture and the Core Out-of-Order Pipeline
- Core Performance Monitoring Unit (PMU)
- Core Memory Subsystem
- Uncore Performance Monitoring Unit (PMU)
- CPU Run Queues
- Saturation
- Software
- Priority Inversion
- Terminology
- Word Size
- Concepts
- Instruction Pipeline
- CPU Memory Caches
- Microcode and Exceptions
- Models
- Clock Rate
- Branch Mispredictions
- Front End Events
- Hardware
- User-Time/Kernel-Time
- Compiler Optimization
- CPU Architecture
- Utilization
- Instruction Width
- Multiprocess, Multithreading
- Preemption
- Front-End Code Generation Metrics
- CPI, IPC
- Architecture
- Lock Effects
- Ordering Effects
- Branch Prediction Effects
- Cache Line Effects (see the false-sharing sketch after this list)
- Cache Effects
- Thread Affinity
- Multi-Core Effects
- Prefetcher Effects
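As a taste of what cache line effects mean in practice, here is a minimal illustrative sketch of ours (not course material): two threads increment adjacent longs that share a cache line, then increment padded longs that do not. The padded version typically runs several times faster because it avoids false sharing. Note that manual padding is a heuristic, since the JVM may reorder fields; JDK 8+ also offers the @Contended annotation (with -XX:-RestrictContended).

```java
public class FalseSharingDemo {
    // Two counters packed next to each other: they likely share a cache line.
    static class Shared {
        volatile long a;
        volatile long b;
    }

    // Manual padding pushes the two counters onto different cache lines.
    // Caveat: the JVM does not guarantee field order; @Contended is the robust way.
    static class Padded {
        volatile long a;
        long p1, p2, p3, p4, p5, p6, p7; // 56 bytes of padding
        volatile long b;
    }

    static final long ITERATIONS = 100_000_000L;

    static long run(Runnable w1, Runnable w2) throws InterruptedException {
        Thread t1 = new Thread(w1), t2 = new Thread(w2);
        long t0 = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        Shared s = new Shared();
        Padded p = new Padded();
        long shared = run(() -> { for (long i = 0; i < ITERATIONS; i++) s.a++; },
                          () -> { for (long i = 0; i < ITERATIONS; i++) s.b++; });
        long padded = run(() -> { for (long i = 0; i < ITERATIONS; i++) p.a++; },
                          () -> { for (long i = 0; i < ITERATIONS; i++) p.b++; });
        System.out.printf("shared line: %d ms, padded: %d ms%n", shared, padded);
    }
}
```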
- Second-Level Cache
- Random versus Sequential I/O
- Models
- Terminology
- Concepts
- Caching
- File System Interfaces
- File System Latency
- Prefetch
- File System Cache
- Configuring Write-Ahead
- Configuring Read-Ahead
- Ramfs and tmpfs
- File System Latency
- Volumes and Pools
- Access Timestamps
- Memory-Mapped Files (see the sketch after this list)
- Synchronous Writes
- Random versus Sequential I/O
- Metadata
- Second-Level Cache
- Write-Back Caching
- Non-Blocking I/O
- Prefetch
- File System I/O Stack
- Logical versus Physical I/O
- File System Interfaces
- VFS
- File System Latency
- Special File Systems
- File System Types
- Terminology
- Caching
- Read-Ahead
- Raw and Direct I/O
- File System Features
- File System Caches
- File System Cache
- Models
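For the memory-mapped files item above, a minimal sketch of the core JDK mechanism (java.nio is the real API; the file name is illustrative): mapping a file gives direct access to the page cache, so reads and writes become plain loads and stores rather than read()/write() system calls.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("/tmp/mmap-demo.dat", "rw");
             FileChannel channel = file.getChannel()) {
            // Map 4 KB of the file directly into the process address space.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            buf.putLong(0, 42L);     // a plain memory store: no write() syscall
            long v = buf.getLong(0); // a plain memory load: no read() syscall
            System.out.println("read back: " + v);
            buf.force();             // msync: flush dirty pages to the device
        }
    }
}
```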
- Terminology
- Caching Disk
- Simple Disk
- Controller
- IOPS Are Not Equal
- Caching
- I/O Size
- I/O Wait
- Concepts
- Storage Type
- Disk Types
- Utilization
- Time Scales
- Synchronous versus Asynchronous
- Read/Write Ratio
- Measuring Time
- Non-Data-Transfer Disk Commands
- Operating System Disk I/O Stack
- Saturation
- Random versus Sequential I/O
- Scaling
- Micro-Benchmarking
- Event Tracing
- Latency Analysis
- Cache Tuning
- Resource Controls
- Static Performance Tuning
- Virtual Memory
- Concepts
- Terminology
- File System Cache Usage
- Swapping
- Utilization and Saturation
- Overcommit
- Allocators
- Process Address Space
- Demand Paging
- Paging
- Word Size
- Ordering fields of DataValueClasses
- Write with Direct Reference
- Off-Heap Data Structures (see the sketch after this list)
- Off-Heap Queues
- Write with Direct Instance
- Off-Heap Maps
- Read with Direct Reference
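A minimal sketch of the off-heap idea using only the core JDK (ByteBuffer.allocateDirect is real API; the fixed-layout record is our illustrative assumption): values live outside the GC heap at computed offsets, so they are never scanned or moved by the collector and incur no per-record object header.

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    // A fixed binary layout instead of an object graph: 8-byte id + 8-byte price.
    static final int RECORD_SIZE = 16;
    static final int ID_OFFSET = 0;
    static final int PRICE_OFFSET = 8;

    public static void main(String[] args) {
        int records = 1_000_000;
        // Native memory: invisible to the garbage collector.
        ByteBuffer store = ByteBuffer.allocateDirect(records * RECORD_SIZE);

        // "Write with direct reference": poke fields at computed offsets,
        // no boxing, no per-record allocation.
        for (int i = 0; i < records; i++) {
            int base = i * RECORD_SIZE;
            store.putLong(base + ID_OFFSET, i);
            store.putLong(base + PRICE_OFFSET, Double.doubleToLongBits(i * 0.5));
        }

        // "Read with direct reference": fetch one field of record 42 without
        // materializing the record as an object.
        int base = 42 * RECORD_SIZE;
        double price = Double.longBitsToDouble(store.getLong(base + PRICE_OFFSET));
        System.out.println("record 42 price = " + price);
    }
}
```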
- Network Interface
- Terminology
- Models
- Replication: How It Works
- Zero Copy / sendfile (see the sketch after this list)
- TCP/IP Throttling
- Multiple Processes on the same server with Replication
- How to Set Up UDP Replication
- TCP / UDP Background
- TCP / UDP Replication
- Identifier for Replication
- Software
- Latency
- Controller
- Encapsulation
- Protocols
- Buffering
- Packet Size
- Protocol Stack
- Connection Backlog
- Networks and Routing
- Hardware
- Interface Negotiation
- Utilization
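For the zero copy / sendfile item above, a minimal sketch using the real java.nio API (the file path and port are illustrative): FileChannel.transferTo hands the copy to the kernel, which on Linux uses sendfile(2) and avoids staging the data through a user-space buffer.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel file = FileChannel.open(Path.of("/tmp/payload.bin"),
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9000))) {
            long position = 0, remaining = file.size();
            // transferTo may send less than requested; loop until done.
            while (remaining > 0) {
                long sent = file.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```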
- DTrace
- Perf
- Profiling
- Observability Sources
- Solaris Analyzer
- /sys
- Ftrace
- kstat
- Tracing
- SystemTap
- /proc
- JMH (see the benchmark skeleton after this list)
- Tool Types
- Delay Accounting
- Microstate Accounting
- Counters
- Monitoring
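Because naive timing loops mislead (dead-code elimination, warm-up, on-stack replacement), the hands-on sessions lean on JMH. Here is a minimal skeleton using the real JMH annotations (the benchmarked expression is an arbitrary placeholder):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

import java.util.concurrent.TimeUnit;

@BenchmarkMode(Mode.AverageTime)     // report mean time per operation
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Thread)                 // one instance of the fields per thread
public class AddBenchmark {
    long x = 1, y = 2;

    @Benchmark
    public long add() {
        // Returning the value stops the JIT from eliminating the work.
        return x + y;
    }
    // Build the JMH jar and run: java -jar target/benchmarks.jar AddBenchmark
}
```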
- Passive Benchmarking
- Sanity Check
- Ramping Load
- Activities
- Replay
- Active Benchmarking
- Workload Characterization
- Micro-Benchmarking
- Background
- Statistical Analysis
- Benchmarking Types
- Custom Benchmarks
- Effective Benchmarking
- Methodology
- CPU Profiling
- Industry Standards
- Simulation
- Benchmarking Sins
- OpenHFT Architecture
- Compiler
- Generating off-heap classes for on-heap interfaces
- Affinity Thread Factory
- isolcpus
- Using perf and likwid to measure L1, L2 and L3 cache performance
- PosixJNAAffinity
- Write Buffer
- Lock Inventory
- How much does thread affinity matter? (see the AffinityLock sketch after this list)
- Using likwid to measure prefetchers
- Non Forging Affinity Lock
- Affinity Strategies
- Affinity Support
- Cache Architecture
- Using mpstat to measure and verify
- Read Buffers
- CPU Layout
- Consumer insensitive
- How does it collect garbage?
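A minimal sketch of pinning a critical thread with the OpenHFT Java-Thread-Affinity library covered above (AffinityLock.acquireLock() is the library's real entry point; the busy loop is an illustrative placeholder for the latency-critical work):

```java
import net.openhft.affinity.AffinityLock;

public class PinnedWorker {
    public static void main(String[] args) {
        new Thread(PinnedWorker::run, "pinned-worker").start();
    }

    static void run() {
        // Reserve a free CPU for this thread; combined with isolcpus the
        // kernel keeps other tasks off it, so the hot loop is not descheduled.
        AffinityLock lock = AffinityLock.acquireLock();
        try {
            System.out.println("pinned to CPU " + lock.cpuId());
            long ops = 0;                        // placeholder for the hot loop
            while (ops < 1_000_000_000L) ops++;
        } finally {
            lock.release();
        }
    }
}
```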
- Messaging between processes via shared memory (see the sketch after this list)
- Synchronous text logging
- High throughput trading systems
- Messaging across systems
- Low latency, high frequency trading
- Supports recording micro-second timestamps across the systems
- Synchronous binary data logging
- Cache friendly
- Functionality is simple and low level by design
- Very fast embedded persistence for Java.
- Replay of production data in test
- Introduction to Chronicle
- Lock-less
- Supports thread affinity
- Shared memory
- Text or binary
- GC free
- Replicated over TCP
- Advanced Off-Heap IPC in Java
- Low latency, high throughput software
- OpenHFT Chronicle, low latency logging, event store and IPC. (record / log everything)
- Micro-second latency.
- OpenHFT Collections, cross process embedded persisted data stores. (only need the latest)
- Millions of operations per second.
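To ground the "messaging between processes via shared memory" bullet, here is a deliberately simplified sketch of the mechanism Chronicle builds on (ours, not Chronicle's API): two JVMs map the same file, the writer appends a length-prefixed record, and the reader polls the length word to detect it. A real queue additionally needs memory fences, cache-line-aligned headers, and wrap-around handling.

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class SharedMemoryHello {
    public static void main(String[] args) throws Exception {
        boolean writer = args.length > 0 && args[0].equals("write");
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/ipc-demo.dat", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer shm = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            if (writer) {
                byte[] msg = "hello from another process".getBytes();
                for (int i = 0; i < msg.length; i++) shm.put(4 + i, msg[i]);
                shm.putInt(0, msg.length);   // publish: length word written last
            } else {
                int len;
                while ((len = shm.getInt(0)) == 0) Thread.onSpinWait(); // poll
                byte[] msg = new byte[len];
                for (int i = 0; i < len; i++) msg[i] = shm.get(4 + i);
                System.out.println(new String(msg));
            }
        }
    }
}
```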
- HugeHashMap
- SharedHashMap (since renamed Chronicle Map; see the sketch after this list)
- Around 8x faster than System V IPC.
- Memory mapped files
- Durable on application restart
- One copy in memory.
- Can be used without serialization / deserialization.
- Thread safe operations across processes
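SharedHashMap evolved into Chronicle Map; a minimal sketch with the current builder API (key/value types, sizing hints and the file path are illustrative, and exact builder methods may differ between versions): the map is backed by a memory-mapped file, so other processes mapping the same file share the entries, and the data survives a restart.

```java
import net.openhft.chronicle.map.ChronicleMap;
import java.io.File;

public class SharedMapDemo {
    public static void main(String[] args) throws Exception {
        try (ChronicleMap<String, Long> positions = ChronicleMap
                .of(String.class, Long.class)
                .averageKeySize(16)      // sizing hint for variable-size keys
                .entries(1_000_000)
                .createPersistedTo(new File("/tmp/positions.dat"))) {
            positions.put("AAPL", 1_000L);
            System.out.println("AAPL position: " + positions.get("AAPL"));
        }
    }
}
```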
- Serialization and deserialization of data
- Writing and reading enumerable types with object pooling
- Writing and reading primitive types in binary and text without creating any garbage
- Random access to memory in native space (off heap)
- Provides the low-level functionality used by Java Chronicle
- Writing and reading Strings without creating an object (if it has been pooled)
- Serialization and deserialization of small messages in under a microsecond
- Data structure and work flow with no contention.
- Very fast message passing
- Overview of the Disruptor (see the sketch after this list)
- Allows you to go truly parallel
- Create your own
- Magical ring buffers
- Single writer principle
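A minimal end-to-end sketch with the real LMAX Disruptor 3.x API (the event class and handler body are illustrative): a publisher claims slots in the pre-allocated ring buffer and a handler consumes them, sequenced without locks.

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorDemo {
    // Mutable event, pre-allocated once per slot: no allocation on the hot path.
    static class LongEvent {
        long value;
    }

    public static void main(String[] args) throws InterruptedException {
        int bufferSize = 1024; // must be a power of two
        Disruptor<LongEvent> disruptor = new Disruptor<>(
                LongEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        // Consumer: runs on its own thread, sequenced by the ring buffer.
        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("got " + event.value + " @ seq " + sequence));
        disruptor.start();

        // Producer: claim a slot, mutate the pre-allocated event, publish.
        RingBuffer<LongEvent> ring = disruptor.getRingBuffer();
        for (long i = 0; i < 5; i++) {
            ring.publishEvent((event, sequence, v) -> event.value = v, i);
        }
        Thread.sleep(100);   // let the handler drain (demo only)
        disruptor.shutdown();
    }
}
```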
- DoubleAdder
- DoubleAccumulator
- LongAdder
- Measuring the incremental, tangible benefits of hardware-aware structures (see the LongAdder sketch below)
- ConcurrentAutoTable
- LongAccumulator
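As a closing illustration of "hardware-aware structures" (ours, not course material): LongAdder stripes its count across multiple cells to dodge the contended compare-and-swap that serializes AtomicLong under heavy write load. The comparison below is naive, useful only as a starting point before redoing it properly in JMH.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

public class AdderDemo {
    static final int THREADS = 8;
    static final long PER_THREAD = 10_000_000L;

    static long time(Runnable perThreadWork) throws InterruptedException {
        Thread[] ts = IntStream.range(0, THREADS)
                .mapToObj(i -> new Thread(perThreadWork)).toArray(Thread[]::new);
        long t0 = System.nanoTime();
        for (Thread t : ts) t.start();
        for (Thread t : ts) t.join();
        return (System.nanoTime() - t0) / 1_000_000;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicLong atomic = new AtomicLong();
        LongAdder adder = new LongAdder();
        // All threads CAS the same word: heavy cache-line ping-pong.
        long tAtomic = time(() -> {
            for (long i = 0; i < PER_THREAD; i++) atomic.incrementAndGet();
        });
        // Each thread mostly hits its own cell; sum() combines them on read.
        long tAdder = time(() -> {
            for (long i = 0; i < PER_THREAD; i++) adder.increment();
        });
        System.out.printf("AtomicLong: %d ms (%d), LongAdder: %d ms (%d)%n",
                tAtomic, atomic.get(), tAdder, adder.sum());
    }
}
```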