DevOps, you're doing it wrong.

Somewhere there is a highly experienced systems engineer unable to find work because everyone is looking for a developer to maintain their systems.

Being a sysadmin is a funny pursuit. You are put in charge of systems that, more often than not, you only had a small hand in designing and that someone else is actively trying to break: from hackers trying to use your server, to Ion-cannon LolCatz, to a dev who accidentally enabled debugging on all the prod servers and is filling up the drives with core dumps faster than services can be restarted. You are constantly looking for repeatable tasks to stop repeating, not because the work stops being done but because you stop doing it, and some script has replaced a skill you used to have.

Stored in a GitHub account is a pantheon of scripts I have written over the years, some that compress old logs, delete the even older ones and HUP processes so open files can be compressed. I could probably only repeat 70% of what they do, even though I wrote them not so long ago. The space in my head once occupied by the syntax for piping awk into sed and sort in a meaningful way has been filled with other things. With a little tweaking, each of these scripts does as it is told and what it needs to do. These little snippets of code are not the poetry a career developer hopes to birth into the world. They are more akin to scribbles on a bathroom wall: they do what they need to do, nothing more, nothing less. They are pretty in their own way.
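A sketch of what one of those bathroom-wall scripts tends to look like — the directory, retention windows and daemon name here are all illustrative assumptions, not anything from my actual pantheon:

```shell
#!/bin/sh
# Bathroom-wall log janitor: nothing clever, it just does what it is told.
clean_logs() {
    logdir="$1"
    # compress plain logs older than a week
    find "$logdir" -name '*.log' -mtime +7 -exec gzip {} +
    # delete compressed logs older than a month
    find "$logdir" -name '*.log.gz' -mtime +30 -delete
    # HUP the daemon (assumed name "myappd") so it reopens its log files
    pkill -HUP -x myappd 2>/dev/null || true
}

clean_logs /var/log/myapp 2>/dev/null || true
```

No error handling, no locking, no poetry. It runs from cron, it keeps the disk from filling, and in a year its author will remember roughly 70% of why it works.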

Let’s take an example: 10 servers need to be built. The same engineer clicks through the same steps, down a VNC tunnel, on a VM cloned from a 6-month-old template, running sysprep. Perhaps this time they put the subnet mask in wrong, 254 instead of 255. Perhaps on one server they forget to add /generalize to the sysprep, and just like that a time bomb is placed; someone will have to spend time figuring out why a server won’t join the domain. These servers are then thrust into the world to toil away, each requiring love and care, each demanding unique attention from an ops engineer who has been getting paged every 30 minutes since 2 am for a 90% /var warning. They now get to play the murder-mystery hunt: is the problem /var/log/mysql? Is it /var/spool? Nope. Dammit, du -hs *, too many entries, du -hs * | grep G, and so it goes, all the while wondering why that new server can’t join the domain, just as a dev walks over to the busy ops engineer and gets told to log a ticket for the Redis cluster that was built a few months ago. See, they actually need Sentinel set up because, as they read on a .io site, it will solve all the problems.
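The murder-mystery hunt usually collapses into one line once you stop guessing directories: walk the filesystem one level at a time, biggest first. A sketch, assuming GNU coreutils:

```shell
# What is eating /var? One directory level at a time, largest first.
# -x stays on the /var filesystem, sort -rh orders human-readable sizes.
du -xh --max-depth=1 /var 2>/dev/null | sort -rh | head -n 10
```

Re-run it against whichever directory tops the list, and the culprit falls out in two or three iterations — no grep G required.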

In a utopian world you don’t care that /var on a server is almost full: you run a bit of Ansible to clean it up or, worst case, change a variable in a config, delete the server, and automation recreates it new and fresh and re-adds it to the cluster. You then have the time for a conversation with the developer about Sentinel and the problem he or she is actually trying to solve, rather than a ticket in your backlog that in 8 weeks’ time makes no sense and has your scrum master breathing down your neck because the dev team is now blocked and has fallen behind on a deadline. Now, this isn’t always possible. You can’t destroy and recreate a SAN every time it gets full, or extend an MS Exchange MTA cluster by one server because today is a little busy and then delete the server tomorrow to save cost. This is why there are, and always will be, well-paid sysadmins out there who have no desire to automate things or engage with the larger, more complicated problems.

Enter the idea of DevOps.

DevOps was born from developers, sysadmins, ops and infra engineers, most of whom had spent a lot of time at the coal face and realized there had to be an easier way than the 3 am panic phone calls because code has been thrown over the wall, production is down, and everyone will be unemployed by that afternoon unless SELinux is disabled on all the production servers right now. These engineers started embedding themselves in dev teams, building and using tooling to stop the shock of production being different to staging, smooth the path to production and increase the overall speed at which this can all happen. The concept, however, has changed over the 10+ years since; the tooling is built and pretty efficient. Twenty minutes of Googling and pretty much anyone can copy-paste the Terraform instructions for getting an AWS Kubernetes cluster up. So naturally the role has had to become more dev than ops, and if that Kubernetes cluster develops, say, a NAT issue with a few service endpoints, they can log a ticket and someone at AWS will help them out, or better yet destroy it and re-run the code that created it. Boom! Problem solved, right? Sure, and just like that you get stuck in a world of */15 * * * * root apachectl restart. For the innocent amongst us, that cron entry is a way of keeping a badly written PHP site alive, without having to understand why it is failing, by forcing garbage collection every 15 minutes. To anyone who has PTSD from that cron entry, I am deeply sorry.
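For the truly innocent, here is that entry as it would sit in a system crontab such as /etc/crontab: the first five fields are minute, hour, day-of-month, month and day-of-week, the sixth is the user to run as, and the step syntax */15 in the minute field means "every 15 minutes":

```shell
# /etc/crontab — keep a leaky PHP site alive by restarting Apache
# every 15 minutes instead of ever finding the leak.
*/15 * * * * root apachectl restart
```

Note the difference a column makes: put the step in the hour field instead (* */15 * * *) and you restart Apache every single minute during hours 0 and 15, which is a different kind of outage.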

So, knowing all these things, and having seen the dark heart of complex systems held together with 2>&1 > /dev/null, you live in fear and drink heavily. You have this vacant position on your team, and you know that lurking out there in a sea of garbage is an engineer who lives on HashiCorp and breathes Ansible, is faster than a speeding Ceph outage, and can leap the hybrid-cloud divide in a single bound. And how do you hope to find this person? By offering a large salary and using a coding test, in any language they so choose, to filter out the weak. I’m sorry to break it to you, but I can say with confidence that the ability to solve rot-n in under 10 lines of readable code doesn’t help one bit when your Kubernetes cluster is in a panic because etcd is trying to resolve split brain due to latency on the AWS inter-AZ network and all the masters are flapping while an election is held every 1000ms. Test real-world scenarios: give them a laptop and access to AWS and ask them to deploy something. Better yet, give them a broken something and work with them to fix it. Some of the best engineers I have ever worked with couldn’t write SOLID code, but they damn sure knew the mechanics of why limits.conf would screw up your life if you forgot it was there, and more than that, they knew what could be ignored and what couldn’t.
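For what it’s worth, the rot-n party trick is a one-liner in the shell this whole world already runs on (rot13 shown here; tr does the rotation):

```shell
# The interview classic: rot13 in one readable line with tr.
rot13() { tr 'A-Za-z' 'N-ZA-Mn-za-m'; }
echo 'Attack at dawn' | rot13   # prints "Nggnpx ng qnja"
```

Which proves exactly nothing about whether its author can calm a flapping etcd quorum at 3 am.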

Ideal DevOps, as the name suggests, lives between Development and Operations, but weirdly not in either. It’s a culture formed by people who are focused not on the source code of the application or the dedup efficiency of the storage but on the application as a whole: the pipeline that takes the code, molds it, tests it and verifies it will run on production. They understand how it all works, how the microservices fit and where the pain points are, and if they are good they have strong convictions on how those need to be fixed. That’s where they work. Not pushing the poetry to the light, but fixing the grammar on the bathroom walls so the next line of perfect code doesn’t have to.

Written on March 26, 2019