Shepherdly: A bug prediction & code resilience coverage tool
In software development, Resilience Coverage measures the overall protection a pull request has against bugs that could impact large portions of your user base or expose observability gaps. When coupled with a Software Quality Risk Score driven by a predictive model, engineers get instant guidance on which code changes need further mitigation.
Today, most teams, especially those in regulated domains, employ repetitive procedures and checklists as part of their overall software quality assurance process. Resilience Coverage embraces this practice and maps it to a Software Quality Risk Score: the greater the risk, the greater the Resilience Coverage requirement.
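To make that mapping concrete, here is a minimal sketch in Python of how a risk score might translate into a set of required mitigations. The band thresholds and mitigation names are illustrative assumptions, not Shepherdly's actual configuration.

```python
# Illustrative only: the bands, thresholds, and mitigation names below are
# hypothetical, not Shepherdly's actual configuration.
REQUIRED_MITIGATIONS = {
    "low":    ["unit tests"],
    "medium": ["unit tests", "second reviewer"],
    "high":   ["unit tests", "second reviewer", "feature flag", "monitoring alert"],
}

def risk_band(score: float) -> str:
    """Bucket an assumed 0-100 Software Quality Risk Score into a band."""
    if score < 30:
        return "low"
    if score < 70:
        return "medium"
    return "high"

def required_mitigations(score: float) -> list[str]:
    """Higher risk means a longer list of required mitigations."""
    return REQUIRED_MITIGATIONS[risk_band(score)]
```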
The role of predictive scoring in software development
Until now, software defect prediction has largely been the province of computer science researchers; it has been an active area of study for decades.
One of the main ingredients for applying machine learning in this area is an underlying dataset that exhibits repeatable patterns from which a classifier can be developed. Software development, and ultimately the humans behind it, throws off plenty of repetitive, quantifiable patterns in the code being created. While not straightforward (as the research indicates), it is possible to apply a predictive model in a practical setting for software engineers.
While there is a broad spectrum of ML approaches, a straightforward way to think about this is the "lookalike audience" concept common in recommendation engines: for example, "people who bought X also bought Y." Similar methods can be used to classify which pull requests look like past pull requests that did or did not cause a bug, as the sketch below illustrates.
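The sketch trains a classifier on a toy set of per-PR features. The features, data, and choice of scikit-learn are assumptions made for the example; they are not a description of Shepherdly's internals.

```python
# Toy "lookalike" defect prediction: train on historical PRs, then score a
# new PR by how much it resembles past bug-introducing changes.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row is a historical pull request (features are hypothetical):
# [lines changed, files touched, author's prior commits to those files]
X = np.array([
    [820, 14,  2],   # large, sprawling change by a newcomer to the files
    [ 35,  1, 40],   # small change by a frequent contributor
    [450,  9,  5],
    [ 60,  2, 31],
])
y = np.array([1, 0, 1, 0])  # 1 = later implicated in a bug fix, 0 = clean

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# The predicted probability of class 1 acts as a raw risk score:
# "how much does this PR look like past PRs that caused bugs?"
new_pr = np.array([[510, 11, 3]])
print(model.predict_proba(new_pr)[0, 1])
```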
Predictive models also survive team member attrition and can evolve as development patterns change. This provides continuity for quality efforts and sidesteps common human biases.
Behavior Change
So how might one use this Software Quality Risk Score? The most obvious use is dynamically adjusting the planned software verification activities. For example, a higher risk score (e.g., above a selected threshold) could trigger additional code review that otherwise might not have been performed. In one case study, Shepherdly was shown, with statistical significance, to concentrate nearly 4x more review activity on pull requests given a high Software Quality Risk Score.
A case study at the other end of the spectrum showed that Shepherdly enabled a team to decrease cycle time by ~70% for low-risk changes.
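A minimal sketch of how both behaviors, escalating review for high-risk changes and fast-tracking low-risk ones, might be wired into a merge policy. The thresholds, score scale, and policy fields are hypothetical, not Shepherdly's defaults.

```python
# Hypothetical merge policy keyed off the risk score (assumed 0-100 scale).
HIGH_RISK_THRESHOLD = 70
LOW_RISK_THRESHOLD = 30

def verification_policy(risk_score: float) -> dict:
    """Escalate verification for risky PRs; fast-track low-risk ones."""
    if risk_score >= HIGH_RISK_THRESHOLD:
        return {"min_reviewers": 2, "require_qa_signoff": True}
    if risk_score <= LOW_RISK_THRESHOLD:
        return {"min_reviewers": 1, "require_qa_signoff": False}
    return {"min_reviewers": 1, "require_qa_signoff": True}
```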
Why predicting quality risk is important
At its core, a predictive model is an objective means of managing risk. This approach helps address regulations such as 21 CFR 820.30(f), which states that "Design verification shall confirm that the design output meets the design input requirements." Using a Software Quality Risk Score to drive additional process rigor can give a manufacturer greater assurance in the final software quality. Additionally, having an auditable quality risk score and resilience coverage metric at the pull request level helps with adjusting testing and code review focus areas, potentially reducing downstream bugs.
While preventing a bug from ever reaching production is always the goal, the next best thing is controlling the magnitude of a bug when it does occur: specifically, how many users are impacted and how long it goes undetected. Bringing both of these factors as close to zero as possible is the primary goal of mitigations. However, since applying this effort to every change may not be practical, the extra effort requires justification. This is where a quality risk score can help.
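One illustrative way to formalize that trade-off (this framing is an assumption, not a formula Shepherdly publishes): treat a bug's magnitude as the product of its blast radius and its time to detection, so a mitigation that shrinks either factor bounds the harm.

```python
def bug_magnitude(users_impacted: int, hours_undetected: float) -> float:
    """Illustrative only: harm grows with both blast radius and detection
    delay, so mitigations that shrink either factor bound the damage."""
    return users_impacted * hours_undetected

# The same defect behind a 1% feature-flag rollout with an alert that fires
# in 15 minutes, vs. fully shipped and found via support tickets two days in:
print(bug_magnitude(1_000, 0.25))     # 250.0
print(bug_magnitude(100_000, 48.0))   # 4800000.0
```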
How the model is constructed
Predicting quality risk involves training on data exclusive to your code base. Shepherdly learns the nuanced patterns that lead to bugs versus the ones that don't. In other words, it reflects back to you, in a simple metric, what has and has not worked in the past.
This process starts with classifying which changes were bug fixes. From there, the approach looks back through the Git history to find the changes that introduced the lines the bug fix touched, as sketched below.
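This mirrors the SZZ approach from the defect-prediction literature: blame each line a bug-fix commit changed back to the commit that last touched it. In the simplified sketch that follows, the fix commit and file locations are hard-coded placeholders, and real implementations (presumably including Shepherdly's) handle renames, whitespace-only changes, and false positives far more carefully.

```python
# Simplified SZZ-style tracing: for each line a bug-fix commit modified,
# run `git blame` on the fix commit's parent to find the commit that
# introduced that line.
import subprocess

def blame_line(fix_commit: str, path: str, line_no: int) -> str:
    """Return the SHA of the commit that last touched path:line_no as it
    existed just before the fix commit."""
    out = subprocess.run(
        ["git", "blame", "--porcelain",
         f"-L{line_no},{line_no}", f"{fix_commit}^", "--", path],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()[0]  # porcelain output starts with the commit SHA

# Placeholder inputs: the lines a hypothetical fix commit changed.
suspect_lines = [("src/payment.py", 42), ("src/payment.py", 57)]
bug_introducers = {blame_line("abc1234", path, n) for path, n in suspect_lines}
print(bug_introducers)
```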
How it works in practice
Every time a pull request is updated, its Resilience Coverage and Quality Risk Score are calculated and posted as a PR comment for the author and reviewers to see.
Since the Quality Risk Score drives the Resilience Coverage requirements, the list of required mitigations varies with how risky the change is. As mitigations are detected or manually attested to, the coverage metric increases. This gives developers clear guidance on the steps they are expected to take and gives credit for making the codebase more resilient.
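As a rough sketch (mitigation names again hypothetical), the coverage metric can be thought of as the fraction of risk-required mitigations that have been detected or attested:

```python
# Illustrative Resilience Coverage computation for a single PR.
required = {"unit tests", "second reviewer", "feature flag", "monitoring alert"}
satisfied = {"unit tests", "feature flag"}  # detected in the diff or attested

coverage = 100 * len(required & satisfied) / len(required)
print(f"Resilience Coverage: {coverage:.0f}%")  # -> Resilience Coverage: 50%
```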
This activity can be aggregated to better understand how much resilience coverage teams are achieving.
Above, the day-to-day coverage metric is plotted to identify trends, and the types of mitigations applied are aggregated as well. This can be further filtered by the Quality Risk Score to better understand compliance across different risk categories (e.g., low vs. high).
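Under the hood, that kind of roll-up amounts to grouping per-PR records by day and risk band. A hypothetical sketch, assuming pandas and an exported table of PR results:

```python
# Hypothetical aggregation of per-PR coverage into a team-level trend.
import pandas as pd

prs = pd.DataFrame([
    {"day": "2024-05-01", "risk_band": "high", "coverage": 100},
    {"day": "2024-05-01", "risk_band": "low",  "coverage": 60},
    {"day": "2024-05-02", "risk_band": "high", "coverage": 75},
    {"day": "2024-05-02", "risk_band": "low",  "coverage": 80},
])

# Mean daily coverage, split by risk category (e.g., low vs. high).
daily = prs.groupby(["day", "risk_band"])["coverage"].mean().unstack()
print(daily)
```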
Resilience Coverage should be thought of as an SLO because it is effectively a reliability and quality metric, much like uptime and latency.
Conclusion
Engineers have world-class tools for observability in production environments, but virtually nothing to tell them whether a risky PR has been, or is about to be, merged. While the "shift left" paradigm is still filtering through the developer toolchain, today it is primarily geared towards static code analysis and reactive observability tools. When you pair this reality with the rise of LLMs for code generation, you don't need to be a futurist to surmise that teams will be releasing to production at an increasing rate.
To better adapt, engineers need better tools that quickly tell them what deserves a large chunk of their time. Given that tenured, senior engineering capacity is a relatively static (and usually scarce) resource, directing it to where it can have the most impact is already necessary; soon it will be critical.
For folks working in highly regulated environments, automating some of these necessary, albeit manual, steps can yield back efficiency for the team and give greater control over how much risk and speed to trade off.
Mark Greene, Founder | Shepherdly
Note: The author was invited to post on our website for the general interest of our readers, particularly those tasked with software quality assurance. SoftwareCPR® receives no financial incentives from Shepherdly and claims no conflict of interest. Our mission is to raise the overall quality of medical device and HealthIT software.