Stress Test
As we announced this morning, we're going to bring the servers back up for a while for a public test. The stress test will start today at 8 PM CET / 11 AM PST.
To join, all you have to do is wait for the update on Steam and launch the game as you would normally.
Due to the data corruption that occurred in our database, the game will start fresh with a new world and a full wipe.
You can expect longer queue times, as we've made changes to keep the influx of players from overloading our systems. We'll be addressing the waiting times later, but the game should otherwise run normally. For testing purposes, there may be occasional restarts. We will keep you all updated as frequently as possible, and we'll weigh our options throughout the test.
The purpose of this test is to verify that the fixes we applied throughout the week will allow us to re-launch the game into Early Access.
We'll be observing the test's performance closely and will decide afterwards how to proceed. If things run well, our plan is to restart the servers tomorrow. Should we encounter more issues, we'll have to take another close look.
[h2]What happened and what did we do about it?[/h2]
A lot of things went wrong at launch, and while we could've tackled them one by one, they all happened at the same time. We felt prepared for the launch and wouldn't have released yet otherwise. We ran a beta for over a year and we've been continuously fixing all the issues we've come across. On top of that, we did many large-scale load tests throughout the last year, which showed no problems in any of our systems.
As many of you saw, the first hours after the release went smoothly. After some time, we started getting occasional reports of characters being stuck on the world map. Our engineers looked into it, but given how few of those reports there were, it didn't seem like a widespread problem at the time. Our original plan for the first days after release was to split the team into a day shift and a night shift, fix problems quickly as they popped up, and push updates 24/7 until things stabilized. Soon after, however, we noticed that the number of reports was increasing, the backend was becoming more and more unstable, and the situation was getting progressively worse.
More and more people failed to connect, and a large number of players were stuck in queues. We were pushing quick fixes out, but in such a short time frame we couldn't truly understand the problems without proper debugging.
In the middle of it all, the most critical problem appeared: servers were continuously shutting down.
To explain what that means, we'd like to dive a bit deeper into how our world system works.
In our game, every oasis represents one server. A massive database makes sure that the world stays consistent: a player who leaves one server has to successfully arrive on another server for that consistency to be guaranteed. Because of how the system works, the servers shutting down wasn't so much a bug on its own as it was a database failsafe. It meant that something happened in the world that broke that consistency, and whenever that happens, the server shuts down and reboots to make sure the last correct state is resumed (what many of you referred to as a rollback).
At some point, the database became so slow under all the connections that the game servers couldn't verify their consistency state in time anymore, forcing them all to shut down simultaneously. Again, this was a failsafe to make sure the world doesn't become inconsistent and break the game's logic in the long run.
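For those of you curious about the technical side, here's a heavily simplified sketch of that failsafe idea. This is illustrative Python, not our actual server code, and the names and numbers are made up: each server keeps simulating only as long as the database confirms its state in time, and otherwise shuts down and resumes from the last confirmed state.

[code]
import random
import time

VERIFY_DEADLINE = 0.5  # seconds the database gets to confirm our state (illustrative value)

def verify_with_database(world_state):
    """Stand-in for the round-trip to the central database.
    Here we only simulate a reply that sometimes arrives too late."""
    time.sleep(random.uniform(0.0, 1.0))  # pretend network/database latency
    return True  # the database agrees with the state we sent

def tick_loop(world_state):
    """One oasis server: simulate the world, periodically confirm it with the
    database, and trigger the failsafe if confirmation fails or is too late."""
    last_verified = dict(world_state)
    for _ in range(10):  # a handful of ticks for the demo
        started = time.monotonic()
        ok = verify_with_database(world_state)
        on_time = (time.monotonic() - started) <= VERIFY_DEADLINE

        if not (ok and on_time):
            # Failsafe: don't keep simulating on top of unverified data.
            # Shut down and come back up from the last confirmed state --
            # this is what showed up in-game as a "rollback".
            print("verification late or failed -> restarting from last verified state")
            return last_verified

        last_verified = dict(world_state)
        world_state["tick"] += 1  # advance the world

    return world_state

if __name__ == "__main__":
    print("final state:", tick_loop({"tick": 0}))
[/code]

When the database itself got slow, every server started missing that deadline at roughly the same time, which is why they all went down together.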
The continuous shutdowns revealed two more errors that had previously gone unnoticed. First, our lobby and joining queues weren't optimized for tens of thousands of players connecting simultaneously; normally, players join in smaller waves over a period of time. Clients were sending massive numbers of requests to our backend to check their queue status, essentially spamming it. That shouldn't have overloaded the game servers with the verification process, but the sheer volume of those requests did, and the system in place couldn't handle the load. Second, another bug left many people stuck in the queue, so their clients kept sending those requests indefinitely.
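To illustrate why that is so punishing, here's another simplified sketch (again illustrative Python rather than our client code, and not necessarily the exact change we made): a client that polls the backend on a short, fixed interval with no limit turns tens of thousands of stuck players into a constant request storm, while a common mitigation is to back off exponentially and add jitter so the retries thin out and stop arriving in lockstep.

[code]
import random
import time

def check_queue_status():
    """Stand-in for the real backend call that reports your place in the queue."""
    return "ready" if random.random() < 0.05 else "waiting"

def poll_naively():
    """Roughly the launch situation: every stuck client hitting the backend
    on a short, fixed interval, forever, all at the same rhythm."""
    while check_queue_status() == "waiting":
        time.sleep(1.0)  # multiply this by tens of thousands of clients

def poll_with_backoff(base=1.0, cap=60.0):
    """A common mitigation: exponential backoff with jitter, so status checks
    become rarer over time and clients stop retrying in sync."""
    delay = base
    while check_queue_status() == "waiting":
        time.sleep(random.uniform(0.0, delay))  # jitter de-synchronizes the clients
        delay = min(delay * 2.0, cap)           # wait longer after each retry, up to a cap

if __name__ == "__main__":
    poll_with_backoff(base=0.1, cap=1.0)  # small values so the demo finishes quickly
    print("made it through the queue")
[/code]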
Our systems were essentially stuck in a loop, with multiple issues feeding into each other. As all the servers were shutting down and restarting, over 20k people were trying to rejoin at the same time, which caused our queue system to fail, which in turn kept overloading the master server, letting only a few people join at a time until the master server shut off again and took all the servers down with it. And the cycle would continue.
A lot of you were understandably angry about this, so we were trying to get the systems into a working state and fix the underlying problems at the same time. It felt like rebuilding a house of cards in the middle of a hurricane. We kept hoping to find the magical solution that would interrupt the cycle, waiting to see if the next patch would fix the main issue. The pressure and stress, combined with communication made harder by the quarantine situation, made things even more difficult.
At some point, we had to accept the reality that it wasn't just one fix away. Too many systems had failed simultaneously, which proved to be a much bigger problem for the release than we initially thought. A lot of you brought up that the decision to shut down the servers on Sunday came too late, and we agree. It was very difficult to pick the right time for the most nuclear option, because with so many patches coming in, we kept asking ourselves: "What if we fix it all in the next hour?"
After we announced the servers going down for maintenance, most of the team went to get some sleep after days of working non-stop and woke up to something we did not expect: a lot of you expressing understanding and support. We can't tell you how much of a morale boost that was for everyone on the team. It helped us get our shit together: organize properly, analyze logs and code, and fully work out what went wrong. We can safely say we found a lot over the last few days. We've been refactoring large parts of the code that we've seen cause problems, changing the database structure that was failing, and fixing various integrity issues that appeared. As far as we can tell, all the known issues have been fixed.
The scary part for us, and we'd like to be as upfront as we can, is that we can't guarantee yet that it's fully solved. With so many issues causing related issues to appear at once, we know which bugs happened, but not which exact ones started the chain that led to the terminal problem. To stick with the house of cards analogy: all we can see is that it collapsed, and now we need to reconstruct which card fell first, something we'll only be able to start on once real players join the servers.
This is why we decided to start a public stress test today and verify all the fixes we've made. We've also added additional server logging so that we have better data to analyze any new problems that come up.
Some features that we feel need more time to be properly refactored have been temporarily disabled; we believe they were a large part of the source of the problems. Specifically: server decay, time-based structure packing, and deleting your own character.
Once things are running smoothly again, we'll address those features as quickly as possible and bring them back in a more stable form.
The immediate goal, however, is to let you all play and enjoy Last Oasis again as soon as we can.