Story Template: Jennifer from ABC Energy
Jennifer's goal is to scale up energy storage projects across the US. She is the kind of person who takes every step carefully and methodically. Think of her as an asset manager with the mind of a finance portfolio manager.
This is an hour from her life in 2017, when she was working with a battery storage operator in California.
I woke up to the sound of chimes on my phone. There was an issue at one of our storage projects, and the site was not able to dispatch energy to the grid.
This was a bit surprising because these systems follow automated processes and require little manual intervention. We checked the logs, and everything seemed to have been fine until the previous night. At some point the batteries stopped charging but carried on with dispatches. Once the charge level reached its minimum threshold, the system automatically stopped dispatches too.
While that explained why the system stopped, it still didn’t explain what triggered the problem in the first place. I sat down with the system logs to look at every single event from the previous night, hoping to find a clue.
We needed to put our finger on the exact cause so that we wouldn’t let this happen again. I looked at the dispatch signals, the C-rates for dispatches, the voltage levels and the meter output, but it all seemed normal. After about half an hour of frantic searching (and some coffee), I came across some log entries that pointed to a warranty issue. Upon further investigation, I found that the command to stop charging had been triggered by a brief, momentary warranty violation.
I realized that the warranty system was configured to react to any instance of an extreme value, without considering the duration or the magnitude of the violation. A system may momentarily cross a threshold, and that is acceptable as long as the excursion is neither sustained nor extreme. This was a gap in the way the system was configured.
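A minimal sketch of a check that accounts for duration and magnitude, as described above. Everything in it is an assumption made for illustration: the names `ViolationRule` and `should_stop_charging`, the three thresholds, and the sample format are hypothetical, and the account does not say which quantity the warranty system monitored or how it was actually reconfigured.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ViolationRule:
    """Hypothetical rule; all thresholds are illustrative assumptions."""
    limit: float            # warranty threshold for the monitored value
    extreme_margin: float   # distance above the limit that counts as extreme
    max_duration_s: float   # how long a mild excursion may last, in seconds


def should_stop_charging(samples: List[Tuple[float, float]],
                         rule: ViolationRule) -> bool:
    """Return True only for an extreme or sustained excursion above the limit.

    `samples` is a time-ordered list of (timestamp_seconds, value) pairs.
    """
    excursion_start: Optional[float] = None
    for timestamp, value in samples:
        if value <= rule.limit:
            excursion_start = None  # back within limits, reset the timer
            continue
        if value - rule.limit >= rule.extreme_margin:
            return True  # magnitude alone is enough to act
        if excursion_start is None:
            excursion_start = timestamp  # first sample of a mild excursion
        if timestamp - excursion_start >= rule.max_duration_s:
            return True  # mild, but sustained for too long
    return False


rule = ViolationRule(limit=45.0, extreme_margin=10.0, max_duration_s=60.0)
brief_spike = [(0.0, 40.0), (10.0, 46.0), (20.0, 41.0)]
sustained = [(0.0, 46.0), (30.0, 46.0), (60.0, 47.0)]
extreme = [(0.0, 40.0), (10.0, 56.0)]
```

With these made-up numbers, the brief spike is tolerated, while the sustained and the extreme excursions both return True. A check that fires on any single sample above the limit would have stopped charging in all three cases.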
We fixed the issue and restarted the system, and we have not faced such an issue since.
This led me to believe in the importance of manual oversight of automated systems. An algorithm can only do as it is told, and sometimes it needs to be told again what to do. That requires us to dig deeper, ask questions and methodically search for answers to find the root cause.
Since then, Jennifer has built a root cause analysis culture in her team, where incidents are not just fixed but investigated using:
- Exception reporting: manually look for exceptions or outliers in the data
- Evidence-based decision making: find and report evidence from the data
- 5 Whys: keep asking why until there are no more answers; what is left is the truth
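Exception reporting, the first practice in the list above, can be supported by a small script that surfaces candidate outliers for a person to review. The sketch below is one possible approach and is not taken from Jennifer's team: it uses a modified z-score based on the median absolute deviation, with a cutoff of 3.5 chosen as an assumed default. The function name `flag_outliers` and the sample readings are hypothetical.

```python
from statistics import median
from typing import List


def flag_outliers(values: List[float], cutoff: float = 3.5) -> List[int]:
    """Return indices of values whose modified z-score exceeds the cutoff."""
    if not values:
        return []
    center = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        return []  # no spread to measure against, nothing to flag
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > cutoff
    ]


# Made-up readings; only the last one stands apart from the rest.
readings = [50.0, 50.1, 49.9, 50.0, 50.2, 49.8, 62.0]
flagged = flag_outliers(readings)
```

The script only points at values worth a closer look. Deciding whether a flagged value is a real exception remains a manual step.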