Modern Sysadmins Need To Fail Fast

December 7, 2024

I’ve been on course this week for work. I’m taking successively more challenging Amazon Web Services (AWS) courses and certifications and it’s been an eye-opening experience. Tech is fast-moving and nearly everyone working in it is under-trained and barely up to speed on whatever the new thing is, so when I actually get loaded on a training course it’s a pretty big deal. This week I’ve discovered the widening delta between what my role as a Linux Sysadmin used to be, and what it entails today.

Most of you reading this will associate the company Amazon with buying stuff online. That’s OK, I get that. I have an ongoing Amazon “Prime” bill just like the rest of us. Amazon is a giant in the retail world and the model for how to build an online business. But the AWS product category is a profit juggernaut that powers a huge chunk of the internet. Fun fact: AWS accounted for almost 70% of Amazon’s operating profit in Q4 2019. Some of you are unaware of the specifics of AWS, but you’ll likely be familiar with the term “the cloud”. Some technologists like to quip that the cloud is really just someone else’s computer, which is true, but it vastly oversimplifies the benefits of the cloud. It’s in those benefits that the current set of Sysadmin skills diverges from the skills needed in days of yore.

The old job

Prior to widespread adoption of the cloud, we used hardware components. Systems folks bought insanely expensive servers and switches, and we paid very expensive monthly fees for data center co-location and backbone internet transfer. There are still reasons to do that; the cloud isn’t suitable for every workload, but it’s a good choice for many. I work in a hybrid environment where we have a fleet of physical servers deployed globally, and also a sizeable AWS cloud footprint.

Physical hardware has two main problems. The initial cost outlay is the primary one and AWS eliminates that. Anyone can create an AWS account for free and won’t start incurring fees until they launch something. Even then, AWS uses a “pay as you go” model where you’ll pay a few cents an hour to run a server or a load balancer, or a few cents a GB to store data in Amazon’s cloud, but nothing at all up-front. Contrast that with a data center deployment that can very easily eclipse $500K in equipment, space leases, and internet transit contracts before the equipment is even deployed to start paying for itself.
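To put rough numbers on that contrast, here is a small back-of-the-envelope sketch. The $500K figure is the one from the paragraph above; the hourly rate and fleet size are made-up illustrative values (not real AWS pricing), and the function name is hypothetical. It also ignores storage, data transfer, and the ongoing costs a data center carries, so treat it as arithmetic, not a pricing model.

```python
HOURS_PER_YEAR = 24 * 365


def hours_to_match_upfront(upfront_cost: float, hourly_rate: float, instances: int) -> float:
    """Hours of pay-as-you-go usage before cumulative spend equals upfront_cost."""
    if hourly_rate <= 0 or instances <= 0:
        raise ValueError("hourly_rate and instances must be positive")
    return upfront_cost / (hourly_rate * instances)


if __name__ == "__main__":
    # Assumptions for illustration only: 50 servers at $0.10 per hour each.
    hours = hours_to_match_upfront(500_000, 0.10, 50)
    print(f"{hours:,.0f} hours, or about {hours / HOURS_PER_YEAR:.1f} years")
```

With those assumed inputs the fleet would run for roughly a decade before the pay-as-you-go bill caught up with the up-front outlay. Plug in your own numbers; the point is only that the money goes out gradually instead of all at once.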

The second issue with physical hardware is that stuff breaks. Sometimes stuff breaks because it is old and worn down, and sometimes stuff breaks because it was just built poorly. Regardless of the reason, replacing physical hardware in various countries around the world has a lot of logistical problems. It is hard to import expensive stuff into many countries because of protectionist tariffs that can sometimes cost more than the equipment itself (cough Brazil). There’s also a wildly varying degree of technologist skills across the globe, so unless you have your own staff on-site everywhere, you’re going to have issues with the local data center technicians trying to install the replacement equipment on your behalf.

The traditional Sysadmin and SRE role has been to avoid throwing away expensive stuff.

For these reasons and more, the traditional Sysadmin and SRE (Site Reliability Engineer) role has been to avoid throwing away this expensive stuff. I don’t mean that we try to keep hardware running as long as humanly possible. That would be irresponsible. Hardware has warranty limits and lifespans that tell us when we should start to consider it for replacement to avoid sudden catastrophic failures. But I do mean that if an SSD drive has a 5-year warranty, we want to make sure we buy a model of SSD drive that matches our workload so that we get all 5 of those years out of them.

All that care and feeding comes at a cost. I have been in this game a long time and while there are certainly a lot of people that know more than me, I know a lot. I have abilities ranging from diagnosing RAM issues, to understanding TCP ports, to wiring patch panels, to writing code to keep things running. I am a “full-stack sysadmin”, to coin a phrase. Senior Sysadmins are expensive in general, but when you have a $25K server down, it makes sense for a Sysadmin to spend hours, perhaps days, to nurse it back to life.

The new job

The cloud changes all of that. Because computing, storage, and network devices are all virtual, throwing them out and starting over costs nothing. One popular AWS service is named Elastic Compute Cloud (EC2). EC2 instances are servers and they are a direct replacement for a hardware server humming away in a data center. AWS users can spin up a new EC2 instance in seconds and have a fully functional server working within minutes. The same quick deployment applies to other AWS services such as Elastic Load Balancers (ELB), database servers, Simple Storage Service (S3), and all other AWS offerings.

Because computing resources are now disposable, there’s no longer a need to keep them running to the end of their warranty period. Indeed, even the concept of a warranty period does not apply to the users of the cloud. There’s no need to ship equipment through winding delivery routes and pay fees to get it through customs in the destination country. There’s no need to source suppliers for replacement hardware and there’s never a shortage of resources such as the recent global shortage of SSD drives.

The new job is to “fail fast”

The new job is to “fail fast”; throw away broken servers and just spin up new ones. Unlike Past Sysadmin, who spent a lot of time diagnosing and cajoling a server back to life, Future Sysadmin won’t be doing much of that. Future Sysadmin is going to spend 5 minutes checking if the issue is trivial and if not, she’s going to spin up a new instance, terminate the broken one, and then go for drinks.

Even more likely, Future Sysadmin isn’t even going to know that an instance failed because AWS Auto Scaling will see it is down, spin up a new one, and terminate the old one without any human intervention at all.
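For the curious, the replace-don’t-repair idea can be sketched in a few lines of plain Python. This is a toy simulation, not the AWS API and not how Auto Scaling is actually implemented; the Fleet and Instance classes, the instance ID format, and the reconcile method are all names I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instance:
    instance_id: str
    healthy: bool = True


@dataclass
class Fleet:
    """Toy stand-in for an auto-scaled group with a fixed desired size."""
    desired: int
    instances: List[Instance] = field(default_factory=list)
    next_id: int = 1

    def launch(self) -> Instance:
        # Hypothetical ID format; real instance IDs look different.
        inst = Instance(f"i-{self.next_id:04d}")
        self.next_id += 1
        self.instances.append(inst)
        return inst

    def reconcile(self) -> List[str]:
        """Drop unhealthy instances, launch replacements, return the terminated IDs."""
        terminated = [i.instance_id for i in self.instances if not i.healthy]
        self.instances = [i for i in self.instances if i.healthy]
        while len(self.instances) < self.desired:
            self.launch()
        return terminated


if __name__ == "__main__":
    fleet = Fleet(desired=3)
    fleet.reconcile()                       # first pass launches three instances
    fleet.instances[1].healthy = False      # simulate a failure
    print("terminated:", fleet.reconcile())
    print("running:", [i.instance_id for i in fleet.instances])
```

Notice that nothing in the loop tries to diagnose the failed instance. It only compares what is running against what should be running and closes the gap.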

Either way, we’re going for drinks.

The mindset

This winding path has finally brought me to the point of this post: our jobs are not about the daily tasks we do. The soldier’s job is to carry out the political will of the country. The firefighter’s job is to prevent death and property damage. The police officer’s job is to keep the peace.

Sometimes, in order to do the job, the soldier has to perform the task of shooting the weapon. Sometimes the firefighter needs to put out the fire and sometimes the police officer needs to arrest someone. But those are tasks; they are not the job. The underlying tasks can change and still achieve the objective of the job.

This needs to be the mindset of the modern Sysadmin or SRE. The job is to keep the systems running properly. The tasks comprising this job used to be things like poring over dmesg output looking for faults, or sourcing hardware that meets a certain spec, or flying to random countries to install and maintain gear. Now the tasks are things like writing cloud deployment templates, developing auto-scale plans, and working with developers to make sure the proper code or configuration is applied when a new instance is brought to life.
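As a taste of what one of those new tasks might look like, here is a minimal sketch that builds a CloudFormation-style template as a Python dictionary and prints it as JSON. The resource names (WebLaunchTemplate, WebGroup), the parameter names, the instance type, and the bootstrap script are placeholders I picked for illustration. I have not deployed this exact template, so take it as a starting shape rather than something production-ready.

```python
import json


def build_template(instance_type: str, min_size: int, max_size: int, bootstrap: str) -> dict:
    """Return a small template: one launch template plus one auto-scaled group."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "AmiId": {"Type": "AWS::EC2::Image::Id"},
            "SubnetIds": {"Type": "List<AWS::EC2::Subnet::Id>"},
        },
        "Resources": {
            "WebLaunchTemplate": {
                "Type": "AWS::EC2::LaunchTemplate",
                "Properties": {
                    "LaunchTemplateData": {
                        "ImageId": {"Ref": "AmiId"},
                        "InstanceType": instance_type,
                        # The bootstrap script runs when a new instance comes to life.
                        "UserData": {"Fn::Base64": bootstrap},
                    }
                },
            },
            "WebGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                    "VPCZoneIdentifier": {"Ref": "SubnetIds"},
                    "LaunchTemplate": {
                        "LaunchTemplateId": {"Ref": "WebLaunchTemplate"},
                        "Version": {
                            "Fn::GetAtt": ["WebLaunchTemplate", "LatestVersionNumber"]
                        },
                    },
                },
            },
        },
    }


if __name__ == "__main__":
    # Placeholder values; swap in whatever your application actually needs.
    script = "#!/bin/bash\necho 'configure the app here'\n"
    print(json.dumps(build_template("t3.micro", 2, 4, script), indent=2))
```

The bootstrap script is where the “proper code or configuration” gets applied, which is why that conversation with the developers matters so much.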

The job is still to keep the systems running, but the underlying tasks are changing rapidly.

my shorter content on the fediverse: https://the.mayhem.academy/@jdw