How did this problem come to your attention?
The Department of Homeland Security (DHS) has identified 16 infrastructures deemed critical in the U.S. These include defense, energy, finance, and transportation, among others. As technology continues to advance, these infrastructures are becoming more complex, which challenges engineers to develop systems capable of maintaining them while persevering through any potential system failures.
The problem I sought to address in my thesis was how systems engineers could improve a system's resiliency, or its ability to respond to and recover from potential system failures while in use. Specifically, I wanted to address the resiliency of cyber-physical systems (CPSs). CPSs include interacting digital, analog, and physical components designed to perform a specific logical function. These systems form the basis of numerous “smart” systems, such as smart grids and smart cities.
How did you use a systems engineering approach to solve this problem?
Test and evaluation is the process by which a system is compared against predetermined requirements and specifications through testing. Traditionally, this testing is conducted in pre-application environments. However, due to the complexity of CPSs, it can be difficult, if not impossible, to predict and model potential system failures in test environments.
In my research, I came across a new approach for testing resiliency called “chaos engineering.” Traditional tests require engineers to predict potential system failures in advance, which leaves a large margin for error. Chaos engineering instead tests systems in production environments, where systems are offered in a “live” format, and seeks to discover potential failures that cannot be anticipated. This is accomplished by strategically executing chaos experiments (carefully planned system failures) in production, observing and analyzing the outcomes, and using the results to improve system resiliency over time.
What is your solution for improving current CPS resiliency testing to better anticipate the unexpected and improve system resiliency over time?
For my thesis, I chose to focus on one of the 16 critical infrastructures identified by DHS: transportation cyber-physical systems (TCPS). Current testing methodologies cannot sufficiently evaluate every potential system failure in these systems. My thesis demonstrated how an engineering team would go about developing, executing, observing, and analyzing chaos experiments to improve resiliency for a TCPS.
I configured a test system to represent a TCPS, then developed a chaos experiment that would cause a failure in the system. The experiment and the program's resiliency were observed, and then the system was returned to its original operating state. This provided a simplified demonstration of how systems engineering teams could go about planning, developing, and implementing chaos engineering in a representative system.
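As a rough illustration of that sequence (confirm normal operation, inject a planned failure, observe the response, restore the original state), the sketch below uses a toy model in Python. It is not the test system from the thesis: the `ToyIntersection` class, its sensor and fallback fields, and the `run_chaos_experiment` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class ToyIntersection:
    """Hypothetical stand-in for a TCPS: a signal controller fed by a vehicle sensor."""
    sensor_online: bool = True
    fallback_enabled: bool = True  # assumed resilience feature: fixed timing plan

    def signal_mode(self) -> str:
        # With a live sensor the controller adapts to traffic; without one it
        # falls back to fixed timing if that feature exists, else goes dark.
        if self.sensor_online:
            return "adaptive"
        return "fixed_timing" if self.fallback_enabled else "dark"

    def is_serving_traffic(self) -> bool:
        return self.signal_mode() != "dark"


def run_chaos_experiment(system: ToyIntersection) -> dict:
    """Inject a sensor outage, observe the system, then restore its original state."""
    report = {"steady_before": system.is_serving_traffic()}
    if not report["steady_before"]:
        # Never inject a failure into a system that is already unhealthy.
        report["aborted"] = True
        return report
    report["aborted"] = False

    original_sensor_state = system.sensor_online
    try:
        system.sensor_online = False  # the planned failure
        report["mode_during_failure"] = system.signal_mode()
        report["resilient"] = system.is_serving_traffic()
    finally:
        # Always return the system to its original operating state.
        system.sensor_online = original_sensor_state

    report["steady_after"] = system.is_serving_traffic()
    return report


if __name__ == "__main__":
    print(run_chaos_experiment(ToyIntersection()))
    print(run_chaos_experiment(ToyIntersection(fallback_enabled=False)))
```

Running the experiment against a variant with `fallback_enabled=False` reports `resilient` as false, which is the kind of weakness a chaos experiment is meant to surface before it shows up as an unplanned outage.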
Do you think your solution is a realistic option, and might it be developed later in real life?
Chaos engineering is already in use by large technology companies such as Netflix, LinkedIn, Microsoft, and Amazon, and multiple resources have been created to help engineers develop chaos experiments.
As more companies and industries are exposed to chaos engineering and come to understand its unique capability to uncover unknown weaknesses and failures in systems, more engineering teams are likely to adopt it as a systems engineering test and evaluation best practice.
In what other circumstances have you been able to present your findings?
In April 2021, at the suggestion of my faculty advisors, I presented my thesis at the 14th INCOSE Great Lakes Regional Conference. Since chaos engineering is a relatively new concept, the conference was an excellent forum to introduce the topic to the systems engineering field, and I was honored to play a part in advancing it within the community.