Server AP012

Triangle

Server AP012

Every once in a while you run into a problem that simply doesn’t make sense at all, each new piece of information contradicts the previous one while you continue to go down the rabbit hole. One of those situations that makes you completely doubt everything you know and do, well, here’s the tale of server AP012.

Some backstory, I switched jobs a couple of months ago, I now work as a/the devops at medium company but due to the amount of (IT) history I haven’t touched every single server in the company. So a call comes in stating that the helpdesk site/server is slow, we host it ourselves, and so it begins…

It’s an old website running in a docker container with PHP and MySQL on Ubuntu, host as well as the container, which has been on the never ending lists of ToDo’s for years.

Step 1; validate the claim

Is it the user? Internet slow? Is it the server? Infra? Reverse proxy? Database?

First thing you do is check out if it’s actually slow, not because you doubt the customer/reporter but you want to know whether or not it was just a fluke and if you can reproduce the complaint. So I check out the helpdesk website and notice a horribly slow website, 4+ seconds of load time just for the login screen.

Step 2; check the server

Connect to the VPN, SSH into the machine and start checking the stats. So I start with the load, single core at 0.15, that’s not a lot but ok the load isn’t high so that should be ok. Next up RAM, 8GB, again not that much but by the looks of it plenty, swap, 1GB, couple of MBs used so nothing to worry about there. Diskspace? did the logs fill up the disk again? Nope, again all good, disk usage at 61%. So at this point I’m thinking MySQL, so you log on to the MySQL server, check the session manager to see if there are any long running queries, queue the first WTF, no running queries. At this point I go back to the step 1, check if the system is still slow, could have been a query that has finished in the meantime right? Well… nope, still slow….

Step 3; check the container

So at this time I’m starting to wonder what it could be and check network tab of the dev tools in Firefox. The static files are loading fast but not the / request… My first thought was that it’s something with the php code but that would be weird, that hasn’t changed in a (too) long time. Either way, static files are loading fast so it’s most likely PHP and not Apache or Nginx. To rule out the app itself, lets create a phpinfo file in the container. I bash into the container, cd to the htdocs/http root and try to nano a file. nano: command not found… Great… Vi exit hell it is, vi: command not found… wtf? but ok Emacs? nope… Ok what ever, lets install it, apt install nano, apt not found, how old is this thing… apt-get install nano, E: Unable to locate package, ow of course, outdated or cleaned sources, apt-get update -> 404’s all the way, it’s Ubuntu Precise… i.e. 12.04, i.e. 2012… FUCK, well that’s long gone, lets fall back to some Bash thingies. So using echo "<?php phpinfo(); ?>" >> info.php, load it in browser, again over 4+ seconds load time, WTF… At this point I’m thinking it’s something in the config that connects to an external service or so trying to get or version number or something like that. Just as an extra check, echo "<?php print('Hello World'); ?>" >> info.php, again 4+ seconds! How is this possible? Let’s restart the containers, it’s not Windows but hey it might just help right? NOPE… Ok what about the entire VPS? Again nope… At this point I’m thinking it’s not the machine, it must be something between the customer/me and the server.

Step 4; check the infra

To exclude as much as possible I check other services within the network but they do respond quickly. This tells me it’s not likely to be the reverse proxy or anything else on the network. To make sure this is actually the case I curl to the helpdesk directly and set the host in the header manually; curl -w "\n%{time_connect}:%{time_starttransfer}:%{time_total}\n" -o /dev/null --header "Host: helpdesk.example.com" 172.16.1.12, 4+ seconds! WTF!!! If I recall correctly, this is about the time I’m starting to pull my hair out… What the hell is going on? What am I missing?

Now, this can’t be a performance issue, I know the server is far from powerful but; 1, performance issues hardly every sprung up like that out of the blue, they usually build up over time. 2, 4+ seconds over and over again, this is way too consistent be a performance issue, if it was you would expect a 1 second response here and 10+ there, no always 4+ seconds.

Step 5; call in help

With a heavy heart I start calling colleagues who’ve been there way longer than I have and ask them to weigh in. Multiple people take a look, multiple people can’t figure it out. One suggests IO, partition corrupt? PHP corrupt? After checking the host Ubuntu I don’t have any indication for disk corruption, no long waits in top, no weird things with iotop and iostat. Could it be the image of the container? Stopping the container, pull the image again results in OK checksum, still theoretically possible but HIGHLY unlikely the image is corrupt in a way it didn’t affect the checksum.

Again a dead end and the pressure is starting to rise. While it’s great it’s not just me, they are “just devs”, I’m the devops guy, this is my job, I should be able to fix this…

Step 6; call in even more help

I call the provider of the VPSes and beg for help, most of them are infra guys but they do know how to set up a server. They check a lot of things including a bunch I can’t access, namely the bare metal and hypervisor, again all seems well. No issues found on their side, now what?

Step 7; move the container

Lets find another machine, run the container there and see if it’s still slow and… yes, still slow. HOW IS THIS POSSIBLE? At this point I don’t really know what to do anymore, it’s been almost a week, it started to keep me up at night… Since this is a different machine nothing is pointing to this container, think reverse proxy, so I can play with it all I want. After a bunch of changes, changing ever more with each check suddenly BAM! FAST, 50ms on the login screen… WTF did I change? You know what, at this point I don’t care, let me just copy over this container 1 to 1 to the AP012, the machine that should be running it, and slide it inplace there. As I do that and start it on the AP012, guess what… 4+ seconds… you know what, let’s just point the reverse proxy to the new container and be done with it. This is costing me too much time and years of my life for something that should have been replaced years ago.

Stumbling upon it…

Fast-forward a couple of days to a scheduled disaster recovery test where we shut down the machines and move them to a different physical machine in another warehouse. Doing the check to get some proof it actually moved I run tracepath google.com and on the AP012 it takes an awfully long time, ok… wtf? Running tracepath -n google.com is fast, so DNS? Pinging dns.google.com takes seconds… pinging 8.8.4.4 takes milliseconds, so yeah it’s DNS… Checking the /etc/resolv.conf it has a bunch of 10.0.x.x addresses, something provided by the VPS provider. Going over them one by one seeing if I can ping them and right off the bat I can’t ping the first one… the others worked just fine. So I remove the entry, restart the resolv service and the tracepath seems fine again. Just in case I check the website but no luck there… I finished up the tests, reverted the machines, updated the config again, applied it but still 4+ seconds. Still I had a feeling… what if? I can’t imagine this having anything to do with loading hello world but you never know… A full ping test before test was 7 seconds, not the ping itself but the entire test, after the fix it was 3 seconds, AGAIN WITH THE 4… Just in case of some caching thingy somewhere I restart the container and lo and behold, 50ms load time…

But how… and why was this on the other machine too? Checking the resolv.conf on the other machine it turns out there was the same IP but in 3rd place rather than first.

I still have no clue how this ties together, why would Apache take 4+ seconds to load, due to DNS, when there should be no DNS lookup at all? I really don’t know where that happens, I’m just glad it’s fixed, I can finally get a decent night’s sleep again and push the feelings of self-doubt down again (for a while).


See also