Ueda
Posted March 27, 2023

Dear Ninja,

It's been a very turbulent couple of months. We started having server issues out of nowhere in late November 2022. This was especially surprising and disorienting because, at that point, we had not made any code changes to the server since about July 2022. We were sent on a goose chase to figure out why these issues were happening, and without our dear beloved (rest in peace) Robin to come rescue us from issues like this as he had in the past, we had an especially hard time. I want to write this development log to explain why this issue was so pervasive, evasive, and abrasive, so players understand why this wasn't a simple "fix your server" kind of issue.

To understand the rest of this post, you need to know what the following terms refer to:

Server Software - Our proprietary, in-house server software, written to run authoritative logic for Nin Online and to handle networking, i.e. keeping players connected and sending data between them.
Server Hardware - The actual machine we rent to run the server software. This hardware belongs to a third-party company.
Third Party Service Providers - Services that provide us with databases, for the most part, in our case, but this can refer to any company that provides SaaS.
DDoS - Distributed Denial of Service attack, typically on a server, to prevent normal operations.
SYN - A client requests a connection to a server by sending a SYN (synchronize) message to the server.
ExitLag - A shady company.

Phase 1: Locating Fault

With software as complicated as Nin Online, there are a lot of places fault can be found. All we knew, based on player reports, was the following...

Quote
The server hangs for minutes on end once in a while.
The server hanging gets worse over time.
We had conflicting reports that there was no hanging, just that people couldn't log in.
During these "hangs", players were unable to log in.
The server will kick players out after a while, and recover without any manual intervention.
Eventually, the server will stop functioning completely, disconnect players who are online, and not recover.

The fault could lie with third-party service providers, the client, the server software, or the server hardware (OS, networking) - it could come down to almost anything. First, I'll talk about what we tried in Phase 1:

Restart the server software
Restart the server hardware
Check and reboot all third-party service providers (MongoDB, MySQL)

So basically, all the things you do when you have a faulty piece of technology - "Have you unplugged it and plugged it back in?" "Have you tried restarting your modem?"

The next thing we tried was to make sure it wasn't an isolated issue with that specific server. We rented a new VPS and hosted Nin Online there for a while, to see if the problem was solvable that way, and whether it could come down to Windows Server settings, an issue or change on our hosting provider's end, DNS issues - anything that could be isolated to the server hardware or provider. This was not the case, so we moved back to the original server.

From this, we diagnosed that the issue had to be specific to our server software, because it happened on multiple different machines. A further clue was that Nin Online's Brazil server was fine, and the Brazil server didn't have a lot of the code changes that Nin Online NA had. So it was a good lead. And even though we hadn't made any recent code changes, it was not impossible that code written months earlier had started acting up later.

Phase 2: Looking through code changes

We first looked at content changes. Nin Online's engine gives developers a large amount of freedom to create content on the fly, so although no code changes had been made, it was completely possible the problem was caused by a content change, e.g. a new map, a new item, a new NPC. But nothing really aligned with the timeline of the bug. One thing stood out, though: Erox had just launched the Christmas Event, and this year was the first time we had pathfinding changes. This led to the train of thought that perhaps a massive number of NPCs (zombies) was causing the server to hang for long stretches, during which the server could not send any data to players - hence the hanging.

The caveat to this is that our pathfinding is threaded. Meaning, even if the pathfinding was hanging, the rest of the server's processes should have continued fine. But to be safe, we decided to first disable all A* pathfinding. We left the server online to see if the problem stopped, but it persisted.
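To illustrate why a hanging pathfinder shouldn't have been able to stall the whole server, here's the general shape of threaded pathfinding. This is a minimal Python sketch, not our engine code; find_path() is a stand-in for the real A* search.

```python
import queue
import threading
import time

def find_path(start, goal):
    # Stand-in for a real A* search; imagine this taking a long time
    # on a big map full of zombies.
    time.sleep(0.2)
    return [start, goal]

requests = queue.Queue()   # (npc_id, start, goal) from the main loop
results = queue.Queue()    # (npc_id, path) back to the main loop

def pathfinding_worker():
    # Runs on its own thread, so even if find_path() hangs,
    # the tick loop below keeps going.
    while True:
        npc_id, start, goal = requests.get()
        results.put((npc_id, find_path(start, goal)))

threading.Thread(target=pathfinding_worker, daemon=True).start()

# Simplified main loop: queue requests, collect whatever is ready
# each tick, and never block waiting on the worker.
requests.put((1, (0, 0), (10, 5)))
npc_paths = {}
for tick in range(10):
    while not results.empty():
        npc_id, path = results.get_nowait()
        npc_paths[npc_id] = path
    time.sleep(0.05)  # the rest of the tick's work happens here
print(npc_paths)
```

In a setup like this, the worst case for a broken pathfinder is NPCs standing still while the rest of the tick carries on - which is exactly why this lead was so confusing.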
We later went back to the drawing board many times, looking at what content or code could have changed. We investigated whether Erox had added any events/NPCs/items and forgotten about them. (He hadn't.) The next thing I tried was looking through all the error logs the server created. There were a dozen or so errors the server was throwing that seemed inconsequential - things like a projectile/jutsu trying to target a player who had already disconnected. The server would normally ignore these errors, but I fixed them just to be sure. This didn't help either.

After a few days, and after discussing it with Wolf, we thought perhaps the issue could be due to threading the pathfinding in the first place. Threading it was a risky idea, even though it was necessary, because as I said, A* pathfinding is expensive. So we decided to remove threading for pathfinding. This didn't solve the issue, but it did mislead us for a few months.

Phase 3: Completely misled

A few weeks later, having made no progress on the issue, we started looking at other data. Sadly, as we'd soon find out, the data lied. We looked at server performance while the server was having these hanging/spike/mass-disconnection events. We did this by profiling the server and digging into metrics we have been collecting for years, and we found that during increased player activity, the server showed obvious signs of degradation. I'm skimming over weeks of work collecting data, but basically our finding was that there was a correlation between the hanging issues and degrading server stats, namely TPS. The server was running fewer ticks per second when these issues were occurring! Hurrah - if we could figure out what was causing this, we could solve the issue.

We spent weeks figuring out what was causing the server performance drop. Clearly something was wrong with it; if we solved it, we would most definitely fix the issue... right...? So we started looking into call stacks and performance profilers to figure out what was causing the drops in performance, and at what changes could have been made around late November to cause them. (Just note that although the graph looks like it only shows degradation in mid-December, this is only obvious now that we have a lot more data than we did in December.)
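For anyone unfamiliar with the metric: TPS (ticks per second) is just how many full passes the main server loop completes each second. A minimal sketch of how a loop might track it - the target rate here is a made-up number, not Nin Online's actual tick rate:

```python
import time

TARGET_TPS = 60  # made-up target rate, not Nin Online's actual figure

ticks = 0
window_start = time.monotonic()
while True:
    # ... handle packets, NPCs, combat, saves: one full pass = one tick ...
    time.sleep(1.0 / TARGET_TPS)  # stand-in for the tick's actual workload
    ticks += 1
    now = time.monotonic()
    if now - window_start >= 1.0:
        # If any tick stalls (a blocking database call, an expensive
        # path search), fewer ticks fit into each second and this drops.
        print(f"TPS: {ticks / (now - window_start):.1f}")
        ticks, window_start = 0, now
```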
We found that certain packets ran processes on the server that were taking a long time to process - namely packets/processes that interacted with MongoDB. So I spent a few days moving these processes to Jobs (basically threading). It was possible that these packets, not being threaded, were causing a long pause while the server processed them on the main thread - hence the hanging. Unfortunately, this did not resolve the issue either.
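The idea behind those Jobs looks roughly like this - a sketch, with a hypothetical save_character() standing in for the real MongoDB work:

```python
from concurrent.futures import ThreadPoolExecutor
import time

db_jobs = ThreadPoolExecutor(max_workers=4)  # pool size is arbitrary here

def save_character(char_id, data):
    # Stand-in for a slow MongoDB round trip.
    time.sleep(0.5)
    print(f"character {char_id} saved")

def handle_save_packet(packet):
    # Before: the save ran right here, stalling every tick queued behind it.
    # After: the main thread only pays the cost of submitting the job.
    db_jobs.submit(save_character, packet["char_id"], packet["data"])

handle_save_packet({"char_id": 42, "data": {"level": 10}})
print("tick loop continues immediately")
db_jobs.shutdown(wait=True)  # only so this demo exits cleanly
```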
The server was optimized. There should have been nothing left that took so long to process that it would bring the TPS down... except... pathfinding. We later realized that the reason TPS was down was simply that we had stopped threading pathfinding. Pathfinding was so expensive that it single-handedly brought TPS down more than anything else. We had sent ourselves on a goose chase because of what we did in Phase 2. In hindsight, of course this was the case. But we were trying and doing so many things at once that we lost track of what we had changed, and we forgot to go back to basics. We were consumed with the idea that the TPS drop was causing the login issues, when it wasn't. It was months of stress trying to figure out what in the server was causing the TPS to drop that much, and it was just pathfinding. I'm glossing over days of me and Wolf diagnosing server performance and running third-party software against the server to figure out what was causing it to hang. But this was work. Real work.

Phase 4: Back to basics

We went back to the drawing board, looked at what we knew as fact - the timeline of everything - and decided to look into what was happening during one of the disconnection events. We let the server fail in debug mode, so we could look into its internals while it was having disconnections. Up until this point, the server was still functional for most of the day; it was just crashing every 24 hours or so. I was on full-time watch, making sure the server went back up whenever it was down. It had been months of this, and it was stressing me out a lot.

We noticed that the server had a large number of connection sockets (TCP connections, usually used to send data between client and server and vice versa). We started looking for issues within our login system's code that could be causing them to pile up without clearing. We spent weeks on this, making potential changes and hopeful fixes, to no success.

One of the hardest parts of this issue was that we could not recreate it locally, so we had to rely on the live server to debug it. Each time we tried a code change, we had to wait until the next time the server crashed, so there was a lot of time when we could do nothing but wait for another crash. Sometimes code changes seemed to work, but really didn't: the bug would not appear for a few days, or even a full week, and then suddenly happen again. So we were constantly being thrown between "Yay, we fixed it!" and "Fuck, it's back." The only clue we still had was that, no matter what we did, these connection socket leaks were still happening.

Phase 5: Player testimony

We went back to player testimonies, hearing what people were experiencing and getting footage of what was happening when everything went down. We heard people tell us it was probably to do with Automated Tournaments, Quick Login and various other features. So we went through rounds of disabling things and re-enabling things until we could find what was wrong. Eventually, for some reason, the bug seemed to change its modus operandi. It started manifesting as logout issues instead of login issues. Players who were logged in were not having their characters ever log out... Curious.

Phase 6: Discovery

When there were few players online, only some players had logout issues. But once there were a lot of players online (around 100), the logout issues became widespread. Because players were getting stuck on logout, I started investigating why they weren't being logged out. Since it mainly happened to a small number of individuals when the server was fresh and not very populated, I started with those users. I found that these players were having ping packets sent even when they told me they had the Nin Online client closed. That's literally impossible, I thought. Without the Nin Online client open, what could possibly be pinging the server...?

ExitLag.

Phase 7: ExitLag

Around November last year, ExitLag started using a method of "optimization" that was, essentially, at scale, a SYN flood attack. The culprit is third-party software that wants desperately to provide "better ping" for players, so it uses a combination of techniques to do so. One of those involves using multiple relay servers to send the same packets to our server, spamming our server with unnecessary information multiple times. It sends dozens of SYN packets per second to our server through the port our game client uses to connect, and it does so through distributed servers across the world.

About SYN flood attacks:
https://www.cloudflare.com/learning/ddos/syn-flood-ddos-attack/
https://en.wikipedia.org/wiki/SYN_flood

It doesn't even hide it. In the picture above, it shows that it has established multiple connections to our server and is constantly sending and receiving unnecessary data through them. What's scary is (we've not fully investigated this claim) that the software seems to also triple the amount of data our server sends for large packets of data, like map data.

[Image: IP address/connection slot of someone using ExitLag, and their source port number]

The reason it was causing login issues was that it was filling up all the temporary slots allocated to TempPlayers (a method we use to verify real players and give them a slot in the game), because the server had no choice but to check all these empty packets being sent to it. The reason it was causing logout issues for players not using ExitLag was that it was overwhelming the disconnection system, blocking the disconnection queue and causing a thread leak that was slowing down the server.

A normal DDoS attack would have quickly triggered the DDoS protection we have had in place since 2013. But because this was done at the connection-establishment level (it wasn't spamming data packets, it was spamming SYN packets), it created a lot of new issues. Our DDoS protection was "per connection", whereas this was creating new connections constantly.

Another thing that really pissed me and Wolf off is that this isn't the first time ExitLag's methods have caused us issues. It was causing our server to throw errors in the past, and we actually built workarounds for it. If only we had straight-up banned it then.

Lastly, we unfortunately had an issue with timing out TempPlayers. The intervals our KeepAlive packets were being sent at were 30000-60000 seconds instead of 30-60 seconds, which is a dumb mistake on our part. This made the logout issues much worse, and also aggravated the issue of us being DDoSed. We never found that mistake for the five years before this, because we never had this issue.
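A factor-of-1000 error like that looks like the classic milliseconds-versus-seconds mix-up. In miniature, with Python's timer standing in for however the server actually schedules these:

```python
import random
import threading

def send_keepalive():
    print("keepalive sent")

interval_ms = random.randint(30_000, 60_000)  # intended: 30-60 s, stored in ms

# The bug in miniature: a millisecond value handed to an API that expects
# seconds. 30000-60000 "seconds" is 8-16 hours, so dead connections sat
# around looking alive instead of timing out within a minute.
broken = threading.Timer(interval_ms, send_keepalive)  # would fire after hours (left unstarted here)

# The fix: convert to seconds first.
threading.Timer(interval_ms / 1000, send_keepalive).start()
```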
Phase 8: Resolution

With the following now in place, future SYN flood attacks should be prevented, and players shouldn't get accidentally banned for using ExitLag:

We've tweaked Windows Server's built-in SYN flood protection capabilities to suit Nin Online.
With the help of ChatGPT, I wrote a new application that checks for SYN floods and quickly (within a minute) bans IP addresses that are flooding our server (a sketch of the idea follows this list).
We've banned ExitLag from being used with Nin Online, so players don't accidentally get their IPs banned.
We fixed the bug that was causing KeepAlive packets to only be sent out every few hours, so even if a SYN flood attack happens, it will not cause widespread logout issues.
We've contacted ExitLag to remove Nin Online from their listings.
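The general shape of that watcher is something like the following. This is a sketch of the approach, not the actual application; the port, threshold and poll rate are placeholders, and it assumes Windows, shelling out to netstat and the Windows Firewall via netsh:

```python
import subprocess
import time
from collections import Counter

GAME_PORT = ":5000"  # placeholder for the game server's listen port
THRESHOLD = 20       # half-open connections from one IP before we ban it

def half_open_counts():
    # Count half-open (SYN_RECEIVED) connections per remote IP.
    out = subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout
    counts = Counter()
    for line in out.splitlines():
        parts = line.split()  # e.g. TCP  local:port  remote:port  STATE
        if len(parts) == 4 and parts[3] == "SYN_RECEIVED" and parts[1].endswith(GAME_PORT):
            counts[parts[2].rsplit(":", 1)[0]] += 1
    return counts

def ban(ip):
    # Block all inbound traffic from this IP with a Windows Firewall rule.
    subprocess.run(["netsh", "advfirewall", "firewall", "add", "rule",
                    f"name=synflood_{ip}", "dir=in", "action=block",
                    f"remoteip={ip}"])

banned = set()
while True:
    for ip, count in half_open_counts().items():
        if count >= THRESHOLD and ip not in banned:
            ban(ip)
            banned.add(ip)
    time.sleep(10)  # poll often enough to ban well within a minute
```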
This has been one of the hardest five months of developing Nin Online. I've been on full-time "make sure the server is not malfunctioning" duty the entire time, and this ordeal has caused me severe mental distress. All this to say, I don't like ExitLag.

Thank you to Wolf, Delp, and all the players who provided information that helped in solving this issue.

Regards,
Ueda