2016 IEEE/ACM 38th IEEE International Conference on Software Engineering Companion
Chaos Engineering Panel
Lorin Hochstein
Casey Rosenthal
One of the most significant changes in the software industry over the past two decades has been the transition from standalone applications to networked applications, known as the software-as-a-service model. Whether users are interacting with these services through a web browser or a custom app, on a laptop, desktop, mobile phone, tablet, or even a television or other networked device, they are ultimately using a client to connect to a remote server in order to consume a service provided over the Internet.
A more recent change has been the transition in server applications from monolithic architectures to microservice architectures. Where once the typical server was implemented as a single application, modern software services are increasingly implemented as a collection of microservices that communicate among themselves over the network. One of the advantages of microservice architectures is that they reduce coupling within a software service, which makes it easier for multiple teams of engineers to make concurrent changes to the overall system. However, a disadvantage of this architecture is that it increases the distributed nature of the overall system, making it more difficult to reason about the system's behavior.
These distributed systems must work reliably despite being developed and operated by fallible human beings while running atop unreliable distributed infrastructure. High availability is key for services provided over the Internet, since every minute the service is down can result in lost revenue. However, these systems do not have the same reliability requirements as safety-critical systems, and Internet service providers typically do not have the time and resources to apply the testing and quality assurance approaches used in safety-critical domains.
The traditional approach to achieving high availability is to use hardware and software that have already been proven to be reliable, and then to avoid making changes. However, for service providers it is not possible to avoid making changes: many such services frequently undergo code and configuration changes in order to remain competitive. These services often run on cloud computing environments, which bring additional challenges because the underlying hardware is commodity-grade and therefore more prone to failure. In this environment, traditional approaches to achieving high reliability cannot easily be applied.
In some cases, even traditional integration testing is simply not possible. For service providers such as Netflix, whose software deployment has evolved organically over time, it is sometimes simply not possible to reproduce the entire production system in an isolated test environment and then run end-to-end tests.
Several of the larger tech companies have been running experiments directly on production systems in order to assess the availability of the system under adverse conditions. Netflix's Chaos Monkey is likely the most famous example of this approach. However, other companies such as Amazon, Google, Microsoft, Facebook, and LinkedIn are also deliberately injecting failure into production systems.
Chaos Engineering is a discipline emerging from the practitioner community around experimenting on a distributed system in order to build confidence in the system's capability to withstand turbulent conditions in production. Chaos Engineering is based on the following four principles for designing experiments:
• Build a hypothesis around steady-state behavior.
• Vary real-world events.
• Run experiments in production.
• Automate experiments to run continuously.
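To make the principles concrete, the shape of such an experiment can be sketched in a few lines. This is a minimal in-process simulation, not any real tool: all names (`request_success_rate`, `steady_state_ok`) and the numeric failure rates are hypothetical; production systems like Chaos Monkey inject failures into live infrastructure and observe real service metrics instead.

```python
import random

def request_success_rate(failure_injected: bool, n: int = 10_000) -> float:
    """Simulate the steady-state metric: fraction of successful requests.
    When a replica is lost, a resilient service retries against healthy
    replicas, so the rate should stay close to the baseline.
    (Failure probabilities here are made up for illustration.)"""
    rng = random.Random(42)  # fixed seed so the sketch is deterministic
    base_failure_p = 0.001
    # Simulated replica loss adds a small, retry-absorbed error rate.
    extra_p = 0.005 if failure_injected else 0.0
    failures = sum(rng.random() < base_failure_p + extra_p for _ in range(n))
    return 1 - failures / n

def steady_state_ok(rate: float, threshold: float = 0.99) -> bool:
    """Hypothesis: the success rate stays above 99%, even under failure."""
    return rate >= threshold

# 1. Build a hypothesis around steady-state behavior.
baseline = request_success_rate(failure_injected=False)
assert steady_state_ok(baseline)

# 2./3. Vary real-world events (here: a simulated replica loss) and
# check whether the hypothesis still holds. In practice this runs
# against production, and (4.) is automated to run continuously.
under_failure = request_success_rate(failure_injected=True)
print(f"baseline={baseline:.4f} under_failure={under_failure:.4f}")
print("hypothesis holds" if steady_state_ok(under_failure) else "hypothesis refuted")
```

The essential move is the same at any scale: a falsifiable statement about a steady-state metric, a deliberately injected disturbance, and a comparison that either sustains or refutes confidence in the system.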
We are striving to build a community of practice around
these concepts, leveraging events such as the proposed panel
to bring together practitioners to share experiences.
The intended audience for the panel is both industry practitioners and academic researchers.
While Chaos Engineering principles are being adopted at multiple tech companies, their usage is currently confined to a small minority of the industry. Anecdotally, many Netflix employees have spoken with engineers from other organizations at conferences about Chaos Engineering approaches, specifically about the well-known Chaos Monkey. A common refrain from engineers at other organizations is, "That's an interesting idea, but it would only work at a place like Netflix. It would never work at my organization."
One of the goals of the panel is to counteract this type of response by exposing audience members to practitioners from a range of organizations, each with unique challenges, and by letting them hear how multiple organizations saw the value in this approach and were able to implement it successfully.
Academic researchers rarely have the opportunity to be exposed to the context that large tech companies work within. One of the goals of holding this panel is to expose researchers to the nature of the problems that practitioners are facing when it comes to achieving high availability, and to let them hear how practitioners are solving these problems through Chaos Engineering approaches. In particular, we hope that discussions about outstanding challenges in this area will expose researchers to a new problem domain.
Sample questions we will ask panelists include:
• Why not just do traditional integration testing in a test environment?
• What metrics do you use to measure the steady-state behavior of your system?
• Where has Chaos Engineering been successful in your organization?
• Have you experienced any failures when trying Chaos Engineering?
• How do you mitigate the risks of experimenting on a production system?
• Did you encounter any resistance when introducing it in the organization, from either engineering or management?
• People often say "This approach won't work for me because...". Did you have any challenges that were unique to your organization? How did you overcome them?
• Were you able to reuse any existing tooling, or did you have to build your own?
• Have you been able to automate any Chaos Engineering experiments?
• Where would you like to take this approach next, and what makes it difficult to get there?
Heather Nakama, Microsoft
Heather Nakama is a Senior Software Engineer at Microsoft with Azure Search, a managed cloud service on the Azure platform, where she works on multiple aspects of the distributed backend, from scalability and cluster management to fault injection and auto-mitigation systems. Previously, Heather developed auto-mitigation and deployment tools for Azure Storage and other teams across the Azure Stack, and worked on automated testing tools for Windows Embedded. She has a Bachelor of Arts in Comparative Religion and a minor in Mathematics from the University of Washington, and a Master of Software Engineering from Seattle University.
Ian Van Hoven, Yahoo
Ian Van Hoven is the Vice President of Engineering for Yahoo. He leads the Production Engineering and Engineering Services teams for the Publisher Products division, with global responsibility across three screens for release velocity, operability, reliability, and performance of numerous market-leading internet experiences. Previously, he was the Director of Technical Operations at LinkedIn, the Director for Content Delivery Operations at Netflix, the Director of Platform Operations at Quantcast, and the VP of Engineering and Operations for CDNetworks. He was the Co-Founder and Director of Engineering and IT of OpSource. He has a Bachelor's Degree in Mathematics and Economics from the University of California, Santa Barbara.
Chris Adams, Uber
Chris Adams graduated in Computer Science from Monash University, Australia, in 2005, while also co-founding Extreme LAN and growing it into one of Australia's largest LAN parties. His career has spanned all aspects of software engineering, from desktop to web to large-scale backend integrations. In 2013 he moved to San Francisco to lead Pivotal's Cloud Foundry PaaS operations team, then moved to Uber to manage a team building tools for engineers to improve Uber's reliability. Chris's role at Uber has seen him build and lead the team that built its failure injection test tool, uDestroy, as well as its first chaos injection tool, uHammer. Chris and his team are making sure Uber provides transportation as reliable as running water.
Kyle Parrish, Fidelity Investments
Currently a performance architect at Fidelity Investments, Kyle leads the operations, process, and data work behind efforts to deliver massive-scale market-open testing of Fidelity's brokerage systems. Once called "Too Big to Test", the team at Fidelity solved the riddle of testing at scale using a combination of production and disaster recovery systems, encompassing everything from the cloud to the exchanges on the street. Prior to intentionally crashing brokerage systems, Kyle spent over a decade consulting across industries and running IT operations in startup and academic settings.
Casey Rosenthal, Netflix
Casey is the Traffic and Chaos Engineering Manager at Netflix, with a mission to fortify availability in anticipation of failures. As an executive manager, senior architect, and software engineer, Casey has managed teams to tackle Big Data, architected solutions to difficult problems, and trained others to do the same. He leverages experience with distributed systems and artificial intelligence, translating novel algorithms and academia into working models. For fun, he models human behavior using personality profiles in Ruby, Erlang, Prolog, and Scala.