Failed Authentication to Linux Servers running sftd 1.0.2
Incident Report for ScaleFT
Postmortem

ScaleFT May 23, 2016 Incident Report

(Note: All times are in UTC)

Summary: Changes in the ScaleFT Server Agent (sftd) in version 1.0.2 revealed an existing bug in the ScaleFT platform. The bug was in the parsing of “SSH Known Host Keys”. When a user attempted to access a Linux server that was running sftd version 1.0.2 and had an ed25519 host key, attempts to issue credentials (SSH Certificates) failed with an HTTP 500 status code. Many common Linux distributions, including Ubuntu, Debian, and Amazon Linux, include support for ed25519 keys in their latest versions.

Impact: From Saturday, May 21 at 03:41 until Monday, May 23 at 08:08, some Linux server builds received sftd version 1.0.2, and all attempts to access them using ScaleFT received an HTTP 500 error code. At 08:08, as mitigation for new server builds, version 1.0.2 was removed from the Linux package repositories. On Monday, May 23 at 09:10, a platform release was deployed to app.scaleft.com to fully resolve the issue.

Detailed Timeline:

  • Saturday, May 21:
    • 03:41: sftd version 1.0.2 is released and distributed to pkg.scaleft.com; beginning of the impact window.
  • Monday, May 23:
    • 07:28: Incident opened, escalated to ScaleFT emergency on-call engineers.
    • 07:36: Incident acknowledged by ScaleFT on-call engineers.
    • 07:54: Cause of issue identified by ScaleFT. Mitigations proposed:
      • Revert sftd version 1.0.2 to version 1.0.1
      • Platform patch
      • Remove ed25519 host key from affected servers
    • 08:08: sftd version 1.0.2 is removed from pkg.scaleft.com and a cache invalidation is pushed to CloudFront, providing a partial mitigation for new server builds.
    • 09:10: ScaleFT Platform 0.20.14 is deployed to app.scaleft.com, providing full mitigation.
    • 15:45: sftd version 1.0.2 is restored to pkg.scaleft.com.

Metrics:

  • Time to Detection: 2 days, 3 hours, 47 minutes
  • Time to Acknowledgement: 8 minutes
  • Time to Partial Mitigation: 40 minutes
  • Time to Full Mitigation: 1 hour, 42 minutes

Technical Details: On startup and as part of server enrollment, the ScaleFT Server Agent (sftd) on Linux systems reads the /etc/ssh/sshd_config file. It scans for the HostKey directive, and attempts to load all referenced files. The software uses a 3rd party library, golang.org/x/crypto/ssh, for parsing the referenced files. If sftd encounters an error reading any one of the files, it skips that file and continues processing the others. This behavior was documented as being desired because OpenSSH is known to add new host key types, and our agent may not know about all possible host keys.

Once a list of host keys is parsed, they are submitted to the Platform as part of the server enrollment in the “device info” data structure. The Platform only validates a subset of the device info on submission, storing some attributes in indexed database fields, while some are stored opaquely for later use.

When the Platform receives a valid and authorized request for SSH Credentials from the ScaleFT Client, it deserializes the target server’s device info in order to assemble an SSH known hosts list for the client. The client uses the SSH known hosts list to prevent man-in-the-middle attacks.

The bug was that when iterating the list of SSH host keys provided by sftd, the Platform would error with an HTTP 500 if it was unable to parse any of the SSH host keys.

The error was triggered by a change in sftd between versions 1.0.1 and 1.0.2: the third-party library, golang.org/x/crypto/ssh, was upgraded. The newer version included changes to support ed25519 host keys[1]. When sftd 1.0.1 encountered an ed25519 host key, it skipped the key and did not include it in the device info submission. When sftd 1.0.2 encountered an ed25519 host key, it successfully parsed the key and included it in device info. The Platform used the same golang.org/x/crypto/ssh library, but before Platform version 0.20.14 it used an older version without ed25519 support. When the Platform encountered a host key it could not understand, it failed all attempts to generate the SSH known hosts list and returned an HTTP status code 500 for attempts to get credentials.

[1] https://github.com/golang/crypto/commit/1e61df8d9ea476e2e1504cd9a32b40280c7c6c7e

The difference in behavior, with sftd being lenient on parsing errors and the Platform being strict, hid this issue until sftd added support for ed25519 host keys in version 1.0.2. Additionally, both sftd and the Platform made assumptions about the parsability of SSH host keys, based on a library that they both use.

The dependency upgrade in sftd between versions 1.0.1 and 1.0.2 was not noted in the release notes, because it was not viewed as a user-visible change in behavior or a bug fix.

The mitigation deployed in 0.20.14 changes how the Platform processes potential SSH host keys: it now skips any key it cannot parse and logs the failure, instead of returning an error.

Areas of Improvement: We will take several steps to prevent similar issues, reduce the time to detection and resolution, and communicate more clearly during incidents.

  • Automated testing: ScaleFT Platform, sftd and the ScaleFT Client have a complex relationship and supported version matrix. Automated testing currently covers only the latest versions of each component. We will improve this by running full integration tests across a matrix of Platform, Client and sftd versions, on multiple operating systems.
  • Improved Version Management: Currently the ScaleFT Client and sftd are released into a single stable channel. We will create an additional “testing” channel with separate Linux package repositories, and releases from testing will periodically be promoted to the stable channel.
  • Improved Dogfooding: ScaleFT uses the ScaleFT product for our own infrastructure, but most of our infrastructure is running an older version of sftd, and was not automatically upgraded to the latest version available. We will change a subset of our infrastructure to be automatically upgraded with the latest version of sftd before it is released.
  • Improved Monitoring: We are not currently alerting on every HTTP status code 500. Doing so may have significantly reduced the time to detection of this issue.
  • Improved Release History Documentation: Because of an intentional policy of documenting only perceived user-visible changes in the change history, there was no mention of the library upgrades in sftd version 1.0.2. We will now include these upgrades in our release history documentation.
  • Improved Release Process and CI: The ScaleFT Platform release process can be made easier, and a newer CI pipeline will be adopted. The ScaleFT client and sftd already use our newer CI pipeline, and the platform will be migrated, enabling faster releases and more confidence during emergency releases.
  • Improved External Communication: The ScaleFT Status page at https://www.scaleftstatus.com/ was not updated during the incident. During future incidents, we will update this status page.
Posted May 23, 2016 - 23:01 UTC

Resolved
This incident has been resolved.
Posted May 23, 2016 - 09:10 UTC