Partial outages in week of November 26

Incident reviews, also known as ‘postmortems’, are a common practice in the tech industry for keeping a record of outages, bugs or security issues.

They contain lots of technical detail that won’t be relevant for most of the community, but they are a good way to stay accountable and share learnings, so I’ve decided to publish this one on the forum regardless.

Incident review: partial outages in week of November 26

Starting late in the day (around 4:00 PM UTC) on Wednesday November 26, we made some changes to our Content Delivery Network (CDN) configuration. The CDN is a set of servers that makes content on the Street Art Cities platform available worldwide in a performant way.

Specifically, we upgraded from a legacy configuration format to a new format, and enabled some AWS Web Application Firewall (WAF) protections against DDoS attacks and malicious requests.

This caused a series of issues across the following days.

User impact

  • For about 3 hours after this change, a bug in our configuration meant that no POST requests were allowed. These are typically used to take ‘actions’ on the platform, meaning that logging in or liking artworks didn’t work. This was swiftly spotted and fixed.
  • Bot detection and malicious request detection rules were applied too stringently, causing some legitimate user requests to be flagged as malicious (e.g. a user with ‘virus’ in their name signing in, or a user linking 20 artworks to a place, which produces a large request payload).
  • Other systems within the Street Art Cities platform were also flagged as bots in certain situations, meaning that the community email system wasn’t able to forward emails to our servers.
    • 👉 Emails sent during this time were not forwarded to the hunters they were meant for. Only a handful of emails were impacted, and in each case the original sender received an email telling them their message couldn’t be forwarded.
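To illustrate how overly strict rules lead to this kind of false positive, here is a minimal sketch. It is not the actual AWS WAF rule set: the keyword list, the size limit and the function name `naive_block_reason` are all hypothetical and chosen only to mirror the two examples above.

```python
import json
import re
from typing import Optional

# Hypothetical rule parameters, for illustration only (not the real WAF rules).
SUSPICIOUS_KEYWORDS = re.compile(r"virus|malware|exploit", re.IGNORECASE)
MAX_BODY_BYTES = 1024


def naive_block_reason(body: str) -> Optional[str]:
    """Return why an overly strict filter would block this body, or None."""
    # Rule 1: reject any payload above a fixed size, regardless of content.
    if len(body.encode("utf-8")) > MAX_BODY_BYTES:
        return "body too large"
    # Rule 2: reject any payload containing a 'suspicious' word anywhere.
    if SUSPICIOUS_KEYWORDS.search(body):
        return "suspicious keyword"
    return None


# A legitimate sign-in whose username happens to contain 'virus'.
login = json.dumps({"username": "Virusfreak79"})

# A legitimate request linking 20 artworks to a place (made-up payload shape).
link = json.dumps(
    {"place": "p1", "artworks": [{"id": i, "note": "x" * 60} for i in range(20)]}
)

print(naive_block_reason(login))
print(naive_block_reason(link))
```

Both requests are harmless, yet both get blocked. Rules that match on surface features like these need to be tuned against real traffic before they are enforced.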

Timeline

| Date and time | Description |
| --- | --- |
| Wednesday November 26th, 3:52 PM UTC | Initial changes were made |
| Wednesday November 26th, 6:19 PM UTC | Issue with signing in was flagged by Tim |
| Wednesday November 26th, 6:49 PM UTC | Login and other POST request issues were resolved with a configuration change that allowed all HTTP request methods |
| Saturday November 29th, 12:45 AM UTC | User Virusfreak79 flagged issues logging in and taking other actions within the SAC app |
| Sunday November 30th, 9:42 PM UTC | WAF rules were loosened to limit the number of false positive bot detections |
| Monday December 1st, 1:07 AM UTC | Confirmation from several users that they have regained access to the platform |

Learnings and next steps

  • We currently do not have any monitoring in place at the firewall/CDN level of the platform. Monitoring should be added there to flag when the number of 4xx responses (e.g. not found or access denied) or 5xx responses (server errors) spikes, rather than relying solely on monitoring of the servers behind the CDN.
  • I skipped proper code review because I considered the change small, which was an incorrect judgement. At the very least, more testing should have been done when this new configuration was put into production, especially around different request types.
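As a rough idea of what the first point could look like, here is a minimal sketch of spike detection on response status codes. It assumes the status codes for a time window can be exported from the CDN or firewall logs. The function names, the 3x factor and the 5% floor are illustrative assumptions, not part of our current setup.

```python
from collections import Counter
from typing import Dict, Iterable, List


def error_rates(status_codes: Iterable[int]) -> Dict[str, float]:
    """Return the share of 4xx and 5xx responses in a window of status codes."""
    codes = list(status_codes)
    if not codes:
        return {"4xx": 0.0, "5xx": 0.0}
    # Group by the hundreds digit: 403 -> 4, 500 -> 5, and so on.
    classes = Counter(code // 100 for code in codes)
    total = len(codes)
    return {"4xx": classes[4] / total, "5xx": classes[5] / total}


def should_alert(
    current: Dict[str, float],
    baseline: Dict[str, float],
    factor: float = 3.0,  # assumed threshold: 3x the normal rate
    floor: float = 0.05,  # assumed minimum rate, to ignore noise
) -> List[str]:
    """List error classes whose rate is above the floor and well above baseline."""
    return [
        name
        for name in ("4xx", "5xx")
        if current[name] >= floor and current[name] >= factor * baseline[name]
    ]


# A normal window versus a window where the firewall blocks many requests.
baseline = error_rates([200] * 95 + [404] * 4 + [500] * 1)
incident = error_rates([200] * 60 + [403] * 38 + [500] * 2)

print(should_alert(incident, baseline))
```

Comparing against a baseline rather than a fixed number means a normal trickle of 404s does not page anyone, while a sudden wave of 403s from the firewall does.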