
A new proximity risk calculation for the NHS Test & Trace Covid app

The NHS Covid-19 phone app in England and Wales uses Bluetooth to detect if people have been near anyone infected. This is how we helped to implement a second generation, more accurate exposure detection algorithm.

NHSX tracing app
9 minute read

  • Detecting Covid exposures using Bluetooth is a difficult problem that requires advanced statistics
  • Translating a research prototype into reliable production code takes significant effort
  • Sometimes deep domain understanding is necessary to build an effective implementation


Introduction

One feature of the NHS Test and Trace Covid-19 app in England and Wales is that it alerts users that they might have been near someone who later tested positive for Covid-19. It does this by measuring the strength of Bluetooth signals from smartphones.

The implementation of this detection is provided by the Exposure Notification API from Apple and Google, which is embedded in the operating system for privacy and to minimise battery drain. The first version of the API provides the application with an estimate of exposure risk, which most app developers use, and also some sampled raw signal data.

Now Apple and Google have released a second version of their API that has opened up some additional signal data, and a team at the Alan Turing Institute in London worked closely with them to provide more accurate risk estimates using a different algorithm. They prototyped their approach and then handed over to the Zuhlke engineering team to add it to the Covid-19 app, for both iOS and Android.

This turned out to require more effort than we first thought. It took two rewrites to engineer a solution that would work well on a phone, but field trials have shown that the new risk scores are more accurate than the first solution.

The mathematics is too complicated to explain here, except to say that the Turing Institute team applied Unscented Kalman Filtering to process the raw Bluetooth data. It’s a technique that compensates for non-linear noise, in both observations and process. Its use of sampling also greatly reduces the calculation effort.
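To give a flavour of the idea, here is a minimal sketch of the unscented transform, the core step of an Unscented Kalman Filter, for a one-dimensional state. The function name and the spread parameter are illustrative, not taken from the app's code.

```kotlin
import kotlin.math.sqrt

// Minimal 1-D unscented transform: sample sigma points around the current
// estimate, push them through a non-linear function, and recombine them into
// a new mean and variance. Illustrative only; the app's filter is richer.
fun unscentedTransform(
    mean: Double,
    variance: Double,
    f: (Double) -> Double,
    kappa: Double = 2.0 // spread parameter; a common default for a 1-D state
): Pair<Double, Double> {
    val spread = sqrt((1.0 + kappa) * variance)
    val sigmaPoints = listOf(mean, mean + spread, mean - spread)
    val weights = listOf(
        kappa / (1.0 + kappa),  // weight for the central point
        0.5 / (1.0 + kappa),    // weight for each outer point
        0.5 / (1.0 + kappa)
    )

    val transformed = sigmaPoints.map(f)
    val newMean = transformed.zip(weights).sumOf { (y, w) -> y * w }
    val newVariance = transformed.zip(weights)
        .sumOf { (y, w) -> w * (y - newMean) * (y - newMean) }
    return newMean to newVariance
}
```

For a linear function the transform is exact; its advantage over linearisation shows when the function is non-linear, such as mapping signal attenuation to distance.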

The new risk calculation library has been included in this release of the app and is now being used by millions of people.

Prototypes are not for production

Our original expectation was that we could, more or less, do a direct port of the statisticians’ Python code to Swift and Kotlin. This didn’t work for several reasons, principally that the needs of scientists for exploring a concept are almost entirely at odds with those of an application used by millions of people.

Scientists need interactivity and immediate turnaround so they can explore how the system responds to changes in algorithm and configuration values. They used Julia, for its speed, to write a simulator to generate sample readings, and Python, for its interactivity and mathematical ecosystem, to implement the risk analysis — pulling in third-party code and frameworks to get the job done. We found that their requirements (correctly) did not include making it work on someone else’s machine.

Our first job was to retrofit some engineering rigour to this prototype so that we, and others, could easily install and run it with confidence. This included adding tests that flushed out a few bugs. As the app counts as a medical device, this stage in the project also required that the scientists validate their results, which they did with a parallel implementation using techniques such as particle filtering and sequential importance sampling.
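To give a flavour of that validation approach, here is a hedged sketch of plain importance sampling, the building block underlying the sequential importance sampling the scientists used. The distributions and values below are invented for illustration and have nothing to do with the app's actual parameters.

```kotlin
import java.util.Random
import kotlin.math.exp

// Importance sampling in one dimension: estimate the mean of a target
// distribution p by drawing from an easier proposal q and reweighting each
// sample by (an unnormalised) p(x)/q(x). Self-normalising the weights means
// the densities only need to be known up to a constant.
fun importanceSampleMean(
    samples: Int,
    rng: Random,
    logTarget: (Double) -> Double,    // log p(x), up to a constant
    logProposal: (Double) -> Double,  // log q(x), up to a constant
    drawProposal: (Random) -> Double
): Double {
    var weightedSum = 0.0
    var weightTotal = 0.0
    repeat(samples) {
        val x = drawProposal(rng)
        val w = exp(logTarget(x) - logProposal(x))
        weightedSum += w * x
        weightTotal += w
    }
    return weightedSum / weightTotal // self-normalised estimate of E_p[x]
}
```

For example, sampling from a wide Gaussian centred at 0 to estimate the mean of a unit Gaussian centred at 1 recovers an estimate close to 1.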

From this point of stability, we ported the Python code more or less mechanically to Swift, as an initial platform, but found that it was too slow to run on a phone, and that without understanding the technique it was impossible to write meaningful tests. We had to learn at least a minimum of the mathematics involved. During a stressful three weeks of study, the breakthrough was discovering Rhudy and Gu’s tutorial on implementing Kalman filters.

We explored the use of existing implementations of Kalman filtering but found them unsuitable for the project. In particular, they were mostly written in C, which would have complicated integration on the phone platforms, and their complexity and generality suggested that they would make the library difficult to support. We decided to write the implementation we needed in Swift. We’ll see later how this turned out to be the right choice.

Prototypes are intended to answer questions quickly; they usually need different tooling from production artefacts. Only the concepts should be ported to the real system.

We all want to ship early, but sometimes the domain takes time to learn, even to begin to communicate with the experts. This cannot be rushed.

Reimplementing a general solution

The next attempt was to implement a “clean” solution, based on our new understanding of the algorithm, in Swift. This allowed us to focus on a clearer implementation that would be safer to roll out and easier to maintain, and that included a suite of automated tests.

A surprise here was the current lack of support for numerics in the Swift ecosystem, which will be addressed by an upcoming library called (fortuitously) Swift Numerics. In the meantime, we bundled some of the numerics from the Boost C++ library; other packages were available but their licensing was inappropriate. For the Kotlin implementation, we used libraries from Apache Commons.

One technique that helped was Property-Based Testing, in which a test specifies properties of the inputs and expected outputs, and the test framework generates values in the specified range, looking for failures. We learned the technique from the Haskell community.

For this sort of application, the input properties might require floating-point numbers in a given range, which should generate certain types of output—perhaps within a certain range, or with a given relationship to the inputs. Of course, there are often additional tests with known values to check that the code basically works. Here’s an example in Kotlin, which repeatedly generates a number and two lists of numbers to feed into the calculation.

"normalized risk score is bounded above by 1" {
    checkAll(
        smallPositiveDoubles,
        smallPositiveDoubles.nonEmptyList(),
        smallPositiveDoubles.nonEmptyList()
    ) { minDistance, durations, distances ->
        val score = QuadraticRiskScore(minDistance).calculate(
            durations, distances, shouldNormalize = true
        )
        score shouldBeLessThanOrEqual 1.0
    }
}

In practice, the property-based tests found cases of inputs incorrectly generating infinite values and negative numbers. They were especially useful for testing during integration where it’s difficult to predict which values might produce unexpected results. Finally, the property tests proved to be a useful format for documentation, as they can express the intent of a unit of code in terms of a formula with quantifiers that it implements.

The property-based tests also proved useful as informal performance tests: because they automatically create many runs, the test suite started taking longer and longer, even when we wrote a command-line interface to test release builds, which run much faster than development builds. As field tests later proved, this approach turned out to be good enough for performance testing of the final version.

The testing slowdown suggested that this implementation, although improved, was still not appropriate to roll out to millions of users. Taking another look at the algorithm with a profiler, we realised that much of the data being processed was one-dimensional — they could be vectors rather than matrices. This meant that, rather than using generic linear-algebra libraries, much of the code could be simplified and so made much faster.
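As a hypothetical illustration of the kind of simplification involved (not the app's actual code): the Kalman covariance prediction P' = F·P·Fᵀ + Q, written for generic n×n matrices, collapses to plain scalar arithmetic once the state is known to be one-dimensional.

```kotlin
// Generic n×n helpers, as a linear-algebra library would provide them.
fun matMul(a: Array<DoubleArray>, b: Array<DoubleArray>): Array<DoubleArray> {
    val n = a.size
    return Array(n) { i ->
        DoubleArray(n) { j -> (0 until n).sumOf { k -> a[i][k] * b[k][j] } }
    }
}

fun transpose(a: Array<DoubleArray>): Array<DoubleArray> =
    Array(a.size) { i -> DoubleArray(a.size) { j -> a[j][i] } }

// Generic covariance prediction: P' = F·P·Fᵀ + Q, with allocations and loops.
fun predictCovariance(
    f: Array<DoubleArray>, p: Array<DoubleArray>, q: Array<DoubleArray>
): Array<DoubleArray> {
    val fpft = matMul(matMul(f, p), transpose(f))
    return Array(f.size) { i -> DoubleArray(f.size) { j -> fpft[i][j] + q[i][j] } }
}

// One-dimensional specialisation: no matrices, no allocation, trivially fast.
fun predictCovariance1D(f: Double, p: Double, q: Double): Double = f * p * f + q
```

The two forms agree for a 1×1 state, but the scalar version is far cheaper and far easier to read and test.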

Sometimes a deep understanding of the domain allows transformational improvements in the implementation. Often, this understanding can only be achieved by working on, and spending time with, a more generic implementation.

Different testing techniques provide different insights.

Third time lucky

The third implementation reduced the code to just the needs of the application by using one-dimensional calculations where possible. It’s now faster and easier to maintain and, at about 2000 lines in each app, does not need to carry the overhead of a generic n-dimensional Kalman library.

This is essentially the version that is included in this release of the app. At the time of writing, it was the only alternative risk calculation implementation to that provided by Apple and Google. As open source, it will be interesting to see whether it gets taken up by any of the other contact tracing apps.

One point of effort was validating the final implementation against its Python counterpart. It’s difficult to understand what has happened when the only result is final numbers that differ. As part of the debugging process, we introduced additional functionality to generate and publish intermediate values. This worked, but introduced test functions into the production code that were difficult to structure without breaching module visibility.

Later we refactored this to introduce a listener that provided a cleaner entry point for receiving intermediate results; a default Null implementation means that this feature is only visible in code when needed. Without the intermediate-values feature, the calculation code simplified to something that more closely resembled the algorithm.
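A sketch of the listener refactoring described above (all names here are hypothetical, not the app's): intermediate values are published through a small interface, and a Null Object default keeps the calculation code free of debugging concerns unless a real listener is plugged in.

```kotlin
// Receives intermediate values from inside the calculation, e.g. for
// validation against the Python reference implementation.
interface IntermediateResultListener {
    fun onIntermediate(name: String, value: Double)
}

// Null Object: the default does nothing, so production code pays no cost
// and the feature is invisible unless a listener is supplied.
object NullListener : IntermediateResultListener {
    override fun onIntermediate(name: String, value: Double) {}
}

class RiskCalculator(
    private val listener: IntermediateResultListener = NullListener
) {
    fun calculate(distances: List<Double>): Double {
        val mean = distances.average()
        listener.onIntermediate("meanDistance", mean)
        // ... further steps of the (elided) risk calculation ...
        return 1.0 / mean
    }
}
```

A validation harness passes in a recording listener to capture and publish the intermediate values; production code simply uses the default constructor.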

During these refactoring sessions, we learned once again the value of having a comprehensive test suite that runs in seconds—and of running them very frequently. With this kind of code, minor mistakes are easy to make and difficult to spot visually. For what it’s worth, line coverage is about 90%, with the missing coverage being an artefact of the widespread use of value types.

Finally, there are some lessons to learn about the programming culture aspects of this project.

First, it takes tenacity to step into a complex, unfamiliar domain like this and to see it through. As we pointed out earlier, although we develop incrementally, there was a significant level of domain knowledge to acquire. This sort of project needs programmers who have that focus and who can cope with the early uncertainty of not knowing what to do — and they need the right support.

Second, almost all programmers understand a problem by attempting to program it. This means that early versions are often flawed and unfit for release, and that getting to an effective implementation takes multiple revisions. One sign of having stabilised is when new concepts appear that simplify the code. Lines of code is not always a measure of productivity.

Third, there are times when technical depth really counts. In the many years that we have been in the business, there has been a constant refrain that technical skills are secondary to consulting skills. This is true for much of software development — except for when it isn’t. A software organisation of any size needs a range of skills and roles, and to be able to recognise who is appropriate for a particular task. The totally fungible developer is a myth unless your ambitions are minimal.

The source code for this project is available at https://github.com/nhsx/riskscore-swift-public

Contact Paul Cadman: https://www.linkedin.com/in/paul-cadman-05aa75150/

Contact us if you have questions or would like to know more!
Steve Freeman
Distinguished Consultant, contact for the United Kingdom

Steve Freeman, author of Growing Object Oriented Software, Guided by Tests (Addison-Wesley), was a pioneer of agile software development in the UK. He has developed software in many organisations, from small vendors to multinational institutions. Before joining Zühlke, he worked as an independent consultant, for research centres in Palo Alto and Grenoble, and for software houses; earned a PhD; wrote shrink-wrap applications for IBM; and taught at several universities.
