nShift service interruption 19.07.2024

Incident Report for nShift

Postmortem

Public Post Mortem (times in CEST):

On Friday, 19 July 2024, parts of our nShift infrastructure were affected by the global CrowdStrike outage, and many of our services were unavailable for several hours. An incident timeline follows below. There was no data breach, and there is no suspicion of a cyber attack related to this incident.

Preventative actions:

Cyber security is very important to us at nShift, and CrowdStrike is a highly trusted cyber security vendor used by many companies. Since this was a global IT incident, there was no immediate action we could have taken to prevent this scenario. We will have further dialogue with CrowdStrike on how to avoid a recurrence of similar incidents and how to reduce recovery time.

Improved recovery time actions:

This is the first time we have experienced an outage of this magnitude, and we have made several observations about our infrastructure. As part of our commitment to continuous service improvement, we will keep evaluating our current infrastructure to improve our recovery time.

Incident resolution timeline:

06:40 Our monitoring tools first started alerting us to issues with our infrastructure.

06:47 Our 24/7 Emergency Response Team (ERC) started investigating whether some instances (servers hosting our applications) could be recovered.

07:00 ERC escalated to our Problem Manager

07:10 Problem Manager initiated processes to establish a Crisis Team and start the Major Incident Process

07:15 The first Statuspage notification was sent.

07:20 Direct cause confirmed: the Blue Screen of Death (BSOD) bug caused by the CrowdStrike software we use.

07:20 - 10:00 The Crisis Team evaluated several recovery strategies and was in contact with AWS Support and CrowdStrike throughout. Several of the recovery options published online were not applicable, as they targeted desktop workstations rather than cloud servers.

10:10 Together with AWS, we found a recovery method and verified that it worked on some of our AWS instances.

10:10 - 14:30 The Crisis Team recovered instances in prioritized order. All instances were back online at 14:30, but service remained somewhat degraded because AWS was still having issues (this was fully resolved by 15:00).

15:00 Major Incident was closed.
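
The durations implied by the timeline above can be worked out with a short calculation. This is only an illustrative sketch: the timestamps are taken from the timeline, but the milestone names (`first_alert`, `recovery_method_verified`, and so on) and the helper functions are labels chosen for this example, not official nShift metrics.

```python
from datetime import datetime

FMT = "%H:%M"

# Timestamps copied from the timeline (CEST); the key names are
# hypothetical labels introduced for this example only.
timeline = {
    "first_alert": "06:40",
    "recovery_method_verified": "10:10",
    "all_instances_online": "14:30",
    "incident_closed": "15:00",
}


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


def as_hm(minutes: int) -> str:
    """Format a minute count as 'Hh MMm'."""
    return f"{minutes // 60}h {minutes % 60:02d}m"


# Time from the first alert until a working recovery method was verified.
time_to_method = minutes_between(
    timeline["first_alert"], timeline["recovery_method_verified"]
)
# Time spent recovering instances once the method was known.
recovery_window = minutes_between(
    timeline["recovery_method_verified"], timeline["all_instances_online"]
)
# Total time from the first alert until the Major Incident was closed.
total_incident = minutes_between(
    timeline["first_alert"], timeline["incident_closed"]
)

print("First alert to verified recovery method:", as_hm(time_to_method))
print("Instance recovery window:", as_hm(recovery_window))
print("First alert to incident closure:", as_hm(total_incident))
```

Under these assumptions, roughly three and a half hours passed before a recovery method was verified, instance recovery took a little over four hours, and the whole incident lasted 8 hours 20 minutes from first alert to closure.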

Posted Jul 23, 2024 - 13:24 CEST

Resolved

System fully operational. We will continue to monitor closely. We are sorry for the inconvenience this has caused our customers and will work closely with CrowdStrike to mitigate future risk.

Post Mortem will follow next week - ETA Tuesday afternoon

Posted Jul 19, 2024 - 15:01 CEST

Monitoring

All services verified and system is operational again! Customers might experience some slowness in the beginning. We will continue to monitor all services closely and make a final update following this one just to re-confirm that everything looks good.
A post-mortem will follow.

Next Update 15:00

Posted Jul 19, 2024 - 14:31 CEST

Update

Recovery still in progress – All instances recovered (50/50). System is partially working but some more configuration is needed.

Next Update 14:30

Posted Jul 19, 2024 - 14:16 CEST

Update

Several critical instances have been fully recovered - we have opened up for traffic but are monitoring closely.
We are still recovering less critical services, and full traffic should be possible within about 30 minutes.

Next Update 14:30

Posted Jul 19, 2024 - 14:05 CEST

Update

Recovery is still in progress as the fix is being implemented, and the issue has the highest priority for our team. 35 out of 50 critical instances have been recovered. When all instances are operational, we need to test and verify interoperability. It is very difficult to estimate how long this will take, but it might take several hours.

Next Update 14:00

Posted Jul 19, 2024 - 13:30 CEST

Update

Recovery still in progress - 20 out of 50 critical instances recovered. After the 50 primary instances have been recovered, we also need to recover the autoscaling groups.

Next Update 13:30

Posted Jul 19, 2024 - 13:00 CEST

Update

Recovery still in progress - ETA next update: 13:00

Posted Jul 19, 2024 - 12:30 CEST

Update

Recovery still in progress - we have verified the recovery steps (ref. previous status). Before we can activate production again, we need to recover the most critical instances. ETA next update: 12:30

Posted Jul 19, 2024 - 11:56 CEST

Update

We are still working on the issue and appear to have recovered some instances. ETA next update: 12:00

Posted Jul 19, 2024 - 11:31 CEST

Update

We are still working on recovering our infrastructure. ETA next update: 11:30

Posted Jul 19, 2024 - 10:55 CEST

Update

A workaround for the issue has been identified by CrowdStrike.
We are working with AWS to try to apply this fix across our systems.

This will take some time, as we have to apply the fix machine by machine. We will return with a timeline.

Posted Jul 19, 2024 - 09:59 CEST

Update

We are still having issues - the CrowdStrike software we are using is also in use globally by many companies that are experiencing similar issues.
AWS and CrowdStrike are assisting us and all of their other affected clients.

Posted Jul 19, 2024 - 09:30 CEST

Update

We are still trying to deactivate CrowdStrike - next update is 09:30

Posted Jul 19, 2024 - 09:00 CEST

Update

The issue is still under investigation. AWS and CrowdStrike are aware of the issues. Next update 09:00

Posted Jul 19, 2024 - 08:35 CEST

Update

We have identified a workaround to disable CrowdStrike that we are trying to apply to our production environment.

CrowdStrike themselves have also just announced that they are working on the problem.

ETA next update: 08:30

Posted Jul 19, 2024 - 08:03 CEST

Identified

We have identified that our infrastructure has been affected by downtime due to issues with a third-party security tool, CrowdStrike.

ETA next update: 08:00

Posted Jul 19, 2024 - 07:32 CEST

Investigating

We are currently having some infrastructure issues affecting multiple platforms - we will update this shortly (ETA next update: 07:30)

Posted Jul 19, 2024 - 07:15 CEST

This incident affected: Ship (Ship Dashboard, Rest API v1, Rest API v2, Ticket, Klarna, Message Hub (Drop Zone)).