Chaos Engineering Fundamentals

“Site reliability through controlled disruption”


Dates available upon request


Course Description

Chaos Engineering (CE) was pioneered by companies like Netflix and Amazon to proactively test how systems respond in the presence of failure, so that problems can be identified and fixed before they become outages. This approach helps make complex, distributed systems more reliable and resilient.

During this one-day course, you will be introduced to Chaos Engineering and given the tools and techniques to get started with it in your own organisation.

You will learn to:
- Design and run a chaos engineering experiment
- Get CE on your company roadmap
- Experiment with chaos in Kubernetes
- Improve site reliability
- Promote a safe-to-fail culture

Who should attend

This Chaos Engineering training is suitable for anyone willing to look at things from a different perspective. The following background is helpful:
  • basic familiarity with Linux
  • basic familiarity with Python, Go or another high-level language (enough to read example code)
  • basic familiarity with networking (IP, HTTP)


Chaos Engineering Fundamentals
  • Break things on purpose, so that they don’t break on you
    • Site Reliability Engineering (SRE)
    • SLI, SLO, SLA, error budgets
    • Principles of Chaos Engineering
    • Testing systems
    • Blast radius
    • Observability
    • Steady state
    • Hypothesis
    • Killing processes
    • Network slowness
    • Chaos Engineering and Kubernetes
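
To give a flavour of how steady state, hypothesis and blast radius fit together, here is a minimal sketch of an experiment loop in Python. It is an illustration only, not material taken from the course: `run_experiment`, `ToyService` and the 0.99 threshold are hypothetical names and values, and the "failure" is simulated in memory rather than injected into a real system.

```python
from typing import Callable


def run_experiment(
    measure: Callable[[], float],
    inject_failure: Callable[[], None],
    rollback: Callable[[], None],
    min_steady_state: float,
) -> bool:
    """Return True if the hypothesis held, i.e. the metric stayed at or
    above min_steady_state while the failure was active."""
    # 1. Confirm the system is in its steady state before touching it.
    if measure() < min_steady_state:
        raise RuntimeError("system is not in steady state; aborting")
    # 2. Inject the failure, 3. observe the metric, 4. always roll back.
    inject_failure()
    try:
        return measure() >= min_steady_state
    finally:
        rollback()


class ToyService:
    """Hypothetical stand-in for a real system: a pool of identical replicas."""

    def __init__(self, replicas: int) -> None:
        self.total = replicas
        self.healthy = replicas

    def kill_one(self) -> None:
        # Blast radius is limited to a single replica.
        self.healthy = max(0, self.healthy - 1)

    def restore(self) -> None:
        self.healthy = self.total

    def success_rate(self) -> float:
        # Assumption: requests succeed while at least one replica is up.
        return 1.0 if self.healthy > 0 else 0.0


if __name__ == "__main__":
    service = ToyService(replicas=3)
    held = run_experiment(
        service.success_rate, service.kill_one, service.restore, 0.99
    )
    print("hypothesis held" if held else "weakness found")
```

The rollback sits in a `finally` clause so the system is restored even when the observation step fails, and the experiment refuses to start if the system is already unhealthy. In a real setting `measure` would query a monitoring system and `inject_failure` would act on an actual process or network link.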

Duration: 1 day
Delivery format: virtual (in-person training available on request)
EUR 650 per seat
GBP 550 per seat
Private training upon request
We accept company purchase orders


About the author 

This course is written and delivered by Mikolaj Pawlikowski, the author of the book Chaos Engineering: Site reliability through controlled disruption (Manning). Mikolaj leads a team of SREs managing Kubernetes at Bloomberg. He first started with CE as a surprisingly effective sleeping aid: the more failures his team simulated during working hours, the fewer outages happened while they were asleep.