Broken backwards compatibility in access token service for client credentials, 20th Dec 2021
All timestamps are in UTC

Summary

When a new release of the Naviga ID access token service was deployed to production on Dec 20th, it broke backwards compatibility for the scope parameter in the fetch token endpoint when using client credentials. Previously the scope parameter had been ignored, but in the new version it was not. This caused clients that set the scope parameter to anything other than an empty string to fail to fetch tokens.

Impact

The bug was present between 12:43 and 16:39 on December 20th 2021. During this time, only two applications, both belonging to one customer, are known to have experienced the issue. For the customer, this meant that one import flow and one export flow stopped working. The customer received workaround instructions at 15:24. About 75 minutes later, at 16:39, the problem was resolved when a fix was deployed to the access token service.

Background

Naviga ID allows applications to authenticate themselves using client credentials. In return, an access token is issued that can be used to access Naviga's content creation services. One common use case is import and export applications.
The HTTP endpoint for fetching access tokens is designed to be compatible with the OAuth 2 standard. Apart from the credentials (client_id and client_secret), the standard also defines a parameter named scope. Up until 12:43 on Dec 20th, clients were allowed to set the scope parameter, but its value was ignored and had no effect. The reason for allowing scope to be set (even though it was not in use) was to support tools and libraries that might default to setting scope to an empty string unless otherwise configured.
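As an illustration, a client credentials request body in the OAuth 2 form-encoded style could be built as in the sketch below. The helper name build_token_request_body and the credential values are hypothetical, and the grant_type field follows the OAuth 2 standard rather than anything confirmed here about the Naviga ID endpoint.

```python
from typing import Optional
from urllib.parse import urlencode


def build_token_request_body(client_id: str, client_secret: str,
                             scope: Optional[str] = None) -> str:
    """Build a form-encoded body for an OAuth 2 client credentials request.

    Hypothetical helper for illustration; not part of Naviga ID.
    """
    fields = {
        "grant_type": "client_credentials",  # defined by the OAuth 2 standard
        "client_id": client_id,
        "client_secret": client_secret,
    }
    # Only include scope when the caller explicitly provides one.
    if scope is not None:
        fields["scope"] = scope
    return urlencode(fields)


# Some tools send an empty scope by default; others send a non-empty value.
print(build_token_request_body("my-client", "my-secret", scope=""))
print(build_token_request_body("my-client", "my-secret", scope="basic"))
```

Leaving scope out entirely, sending it empty, and sending a non-empty value are three distinct request shapes, which is why each one needs to be considered separately.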
Setting an empty scope variable was both supported and tested in Naviga ID. And, while it was also supported to set it to any string value, there were no test cases implemented for that scenario.
On Dec 20th, support for a new type of scope-based application was added to Naviga ID. These new applications make use of the scope parameter when fetching access tokens. Erroneously, the presence of a non-empty string value in the scope parameter was used as the identifier for these new scope-based applications. When the system tried to handle a legacy application as a scope-based application, an internal error was triggered and an HTTP 500 error was returned. Because no test case existed for this scenario for legacy applications, the problem was not identified during development.
Before the bug was fixed, one customer was identified that had two applications where the scope value was set to the string value "basic". A workaround for the customer was to remove the scope value. The fix in the access token service was to restore the previous behavior of ignoring the scope parameter for legacy applications.
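The faulty decision and the restored behavior can be sketched roughly as follows. This is not the actual Naviga ID code: the Application record, the scope_based flag and all function names are assumptions made for illustration, and how the real fix identifies legacy applications is not described in this report.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Application:
    """Hypothetical application record; the real data model is not described."""
    client_id: str
    scope_based: bool  # True only for the new scope-based application type


def issue_legacy_token(app: Application) -> str:
    return f"legacy-token-for-{app.client_id}"


def issue_scoped_token(app: Application, scope: str) -> str:
    if not app.scope_based:
        # Stands in for the internal error that surfaced as an HTTP 500.
        raise RuntimeError("legacy application handled as scope-based")
    return f"scoped-token-for-{app.client_id}:{scope}"


def fetch_token_buggy(app: Application, scope: Optional[str]) -> str:
    # Bug: a non-empty scope in the request is taken to mean "scope-based app".
    if scope:
        return issue_scoped_token(app, scope)
    return issue_legacy_token(app)


def fetch_token_fixed(app: Application, scope: Optional[str]) -> str:
    # Fix: decide from the application itself; legacy apps ignore scope entirely.
    if app.scope_based:
        return issue_scoped_token(app, scope or "")
    return issue_legacy_token(app)


legacy = Application(client_id="import-flow", scope_based=False)
for scope in (None, "", "basic"):
    # The fixed version returns the same token regardless of scope.
    assert fetch_token_fixed(legacy, scope) == "legacy-token-for-import-flow"
```

In this sketch the buggy version succeeds for a missing or empty scope and only fails for a non-empty one, which matches why the existing empty-scope tests did not catch the problem.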

Timeline

Time                   Time since last event   Event
December 10th, 12:02   -                       Scope-based application support deployed to Naviga ID Stage. Backwards compatibility broken.
December 20th, 12:43   10 days                 Scope-based application support deployed to Naviga ID Production. Backwards compatibility broken.
December 20th, 12:44   1 min                   Elevated 500 error levels, but no alarm triggered.
December 20th, 15:01   2 h 17 min              Customer reports issues with APIs in Slack.
December 20th, 15:03   2 min                   First response by Naviga personnel.
December 20th, 15:08   5 min                   Naviga ID developers notified about the issue. Unfortunately, both developers had clocked out and were not at a computer.
December 20th, 15:20   12 min                  Naviga ID developer online and root cause identified.
December 20th, 15:24   4 min                   Customer informed of workaround (remove scope from request).
December 20th, 15:30   6 min                   Workaround confirmed to work by customer.
December 20th, 16:29   59 min                  Fix released and verified in stage.
December 20th, 16:39   10 min                  Fix released and verified in prod.

Actions

  • Bug fix to restore the previous functionality where scope is ignored for legacy applications. Completed Dec 20th 2021.
  • Add tests to verify that scope behavior is not changed in the future. Completed Dec 20th 2021.
  • Add these scenarios to our live system tests that are run every minute. Completed Dec 21st 2021.
  • Add monitoring on increased error rates in the load balancer. First iteration of alarms on elevated levels of 500 errors implemented Dec 21st 2021. More to come on identifying unusual activity.
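
An alarm on elevated 500 error levels could, in principle, be based on the share of 5xx responses in a sliding time window. The sketch below is only an illustration of that idea. The class name, window length, threshold and minimum request count are all assumptions, and it does not describe the alarms actually configured for the load balancer.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateAlarm:
    """Sliding-window alarm on the share of HTTP 5xx responses (illustrative only)."""

    def __init__(self, window_seconds: float = 300.0,
                 threshold: float = 0.05, min_requests: int = 20) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_requests = min_requests
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_5xx)

    def record(self, timestamp: float, status: int) -> None:
        self._events.append((timestamp, 500 <= status <= 599))
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= timestamp - self.window_seconds:
            self._events.popleft()

    def firing(self) -> bool:
        total = len(self._events)
        if total < self.min_requests:
            return False  # too little traffic to judge
        errors = sum(1 for _, is_error in self._events if is_error)
        return errors / total > self.threshold


alarm = ErrorRateAlarm()
for second in range(100):
    # One request per second, every tenth one failing with a 500.
    alarm.record(float(second), 500 if second % 10 == 0 else 200)
print(alarm.firing())
```

A ratio-based threshold with a minimum request count is one way to avoid alarming on a handful of failures during quiet periods, though suitable values depend on real traffic.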

Other learnings

  • Stage environments are not used in a similar enough fashion to prod environments. Had they been, this issue would have been found and fixed before reaching production systems.
  • In future feature development, implement stricter validation from the beginning. Fail fast and early. Catching all backwards compatibility scenarios gets a lot trickier when looser validation is implemented.
  • The impact of an issue might seem greater than it is when the issue is triggered by one or two traffic-intensive clients.