I got a cancellation notice today for my driver’s license renewal appointment. Reason? “Internet down statewide and will not be resolved today.”
The outage was actually worldwide. You probably saw news reports or were affected by it.
A software update by cybersecurity firm CrowdStrike caused some Windows systems to stop working. The outages didn’t just affect administrative matters like my license renewal. They also stopped television transmissions, airline check-ins, and bank operations. More importantly, they caused safety-critical outages in 911 systems, hospitals, and other health services. Some are calling it the largest IT outage in history.
Medical device implications
What does this have to do with medical devices?
A lot.
Medical device manufacturers are increasingly incorporating Internet capabilities and cloud services. This can improve intended use, reduce time to market, reduce device cost, and provide other benefits.
The catch?
Manufacturers are still responsible for the safety and efficacy of their entire device … even if they use third-party systems and software.
Questions to consider
Here are some questions that stand out from the CrowdStrike outage:
- Are you considering the entire “system of systems” required to fulfill intended use?
- Do your risk management and device design account for possible outages or failures of third-party components? Just because you didn’t develop it doesn’t mean you aren’t responsible for it if it is in your device.
- How are you planning to update your devices once in the field?
The failure was caused by a single automatic content update to CrowdStrike’s software. This raises many questions, including:
- If this was a single content update pushed worldwide, was it not tested? It would seem that catching an update that crashes a typical machine would be a basic check.
- Was the file corrupted somehow before, during, or after transmission? That would point to missing integrity checks in the update process.
- Was the file incorrectly prepared? If so, was it human or system error? How did it pass verification and validation? Are there appropriate checks before distribution?
- Is your device fragile when it comes to system updates? Will any subsystem data or performance issue cause your device to degrade or completely fail? Are there check mechanisms before faulty data is accepted into the system? (A minimal sketch of such a check appears after this list.)
- Does a third-party component have “privileged” access to your device operations? Privileged access should be included in device architectures and designs only if absolutely required, and then only after meticulous review and mitigation of the potential problems that access creates.
- How do you deal with update and configuration data?
According to CrowdStrike, "It is normal for multiple 'C-00000291*.sys' files to be present in the CrowdStrike directory – as long as one of the files in the folder has a timestamp of 0527 UTC or later, that will be the active content." Apparently, the active configuration/data file is selected by its date/time stamp.
While this didn’t seem to factor into the outage, it could be a problem for several reasons, including:
- Issues with the date/time stamp could cause an incorrect file to be used.
- Accidental file deletions could cause an incorrect file to be used.
- Leaving old files of the same name in place, rather than cleaning them up, is generally poor engineering practice. (A short sketch contrasting timestamp-based selection with an explicit manifest also follows this list.)
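To make the “check mechanisms” point above concrete, here is a minimal sketch (in Python) of the kind of acceptance check a device could run before trusting a new update file. The file name, size limit, and the idea of an expected digest supplied out of band are assumptions for illustration; this is not CrowdStrike’s mechanism or any particular manufacturer’s process.

```python
# Hypothetical acceptance check before a device trusts a new content/update file.
# The expected digest would come from a trusted, out-of-band source (e.g., a signed manifest).
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def accept_update(path: Path, expected_sha256: str, max_size: int = 10_000_000) -> bool:
    """Reject an update that is missing, implausibly sized, or fails its integrity check."""
    if not path.is_file():
        return False                              # missing or not a regular file
    size = path.stat().st_size
    if size == 0 or size > max_size:
        return False                              # truncated or unexpectedly large
    if sha256_of(path) != expected_sha256:
        return False                              # corrupted in transit or at rest
    return True
```

A device that runs checks like these, and that stages updates so it can fall back to the last known-good file, degrades gracefully instead of failing outright when a bad file arrives.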
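On that last bullet, here is a small, hypothetical contrast between “newest timestamp wins” selection and an explicit manifest that names the active file. The directory layout, file pattern, and manifest format are assumptions for the sketch, not a description of CrowdStrike’s internals beyond the quoted statement.

```python
# Hypothetical contrast: selecting the active content file by timestamp vs. by manifest.
import json
from pathlib import Path


def active_by_timestamp(directory: Path, pattern: str = "C-00000291*.sys") -> Path:
    """Fragile: whichever matching file has the latest modification time wins.
    A skewed clock, an accidental copy, or a deleted file silently changes the answer."""
    candidates = sorted(directory.glob(pattern), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError("no matching content files")
    return candidates[-1]


def active_by_manifest(directory: Path, manifest_name: str = "manifest.json") -> Path:
    """More robust: a manifest names the single active file explicitly,
    so stray copies, deletions, and clock issues cannot change the selection."""
    manifest = json.loads((directory / manifest_name).read_text())
    active = directory / manifest["active_file"]
    if not active.is_file():
        raise FileNotFoundError(f"manifest names a missing file: {active.name}")
    return active
```

With an explicit pointer to the active file, cleaning up stale copies becomes routine housekeeping rather than something that can change the device’s behavior.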
Some of this isn’t “new”
Good engineering practices to help address issues like this have been around for decades. International device standards provide pertinent processes and practices as well. Both, however, require training and discipline in execution to be effective.
Is your team up to the challenge?
It’s early in the outage correction cycle and more information may come to light. If it does, we’ll update this post accordingly.