Postmortem on the Simple In/Out Outage of September 1, 2023
September 2, 2023
On September 1, 2023, just after 9:00 am Central Daylight Time, Simple In/Out experienced an intermittent outage lasting approximately 3 hours. During this time Microsoft Teams presence integrations were also disconnected.
Like the previous outage we suffered this spring, this one was a confluence of events that ultimately lands on us. For this, I’m sorry. While we cannot go back in time, we strive to learn from our mistakes at Simple In/Out and improve as an organization. Our customers deserve zero downtime.
Below is a technical explanation of what happened and our remedies in the interest of complete transparency.
What happened?
We’ve been rapidly shipping updates recently, setting the stage for our next new feature: Single Sign On. While we’re not ready to announce this new capability today, it’s nearing completion. In this rush, a database update caused a problem for a few customers. The database misalignment was not a big deal, but it forced us to look closely at our servers.
Separately, we shipped a new safety check to ensure our servers are operating smoothly. This safety check noticed the database change and sent us a false positive: it reported servers as broken that were running just fine. The false alarms set off a chain of poor reactive decisions until we discovered we were chasing non-existent problems.
What can I do?
No action is required unless you use the Microsoft Teams presence integration. Those users must reconnect their integrations here. Microsoft requires that we contact them every 45 minutes concerning every Teams presence integration, so a persistent outage results in Microsoft canceling all access.
What are we doing to stop this from happening again?
First, we have fixed the check that raised the false positive and caused us to react poorly.
Second, we will speed code deployments dramatically, ensuring any rollbacks take far less time. Faster code deployment will help stem further disconnects of the Microsoft Teams presence feature.
Speaking of Teams, we’re also attempting to convince Microsoft to amend their policies regarding Teams presence to allow more time before disconnecting. Microsoft allows anywhere from 3 to 30 days for all other resources, so while we’re not optimistic they will extend the window for Teams presence, we believe they can do so safely.
Thank you for reading, trusting us at Simple In/Out, and allowing us to serve you. We may have fallen well short yesterday, but we’ll be better tomorrow because of our relentless pursuit of building the best in/out board on the planet.