On Building Software Bots

December 7, 2024

image courtesy of pixabay.com

I work for one of those very rare companies that lets me try to do things I’ve never done before. I’m many years into this Linux sysadmin thing (La Cosa Linuxstra) and you don’t get very far into this gig without having some scripting chops. Sysadmin code is generally only as elegant as “whatever stops this emergency right now” and I’m the champ of hacky code that keeps stuff running. What I have less experience with is designing complex, easily maintainable code.

Despite that shortcoming, I’ve been entrusted to write some software automation. I’ve spent the last several months creating a bot that takes loosely formatted customer input and tries to produce useful responses. The challenges have been great throughout the stack. At the bottom, I’ve had to become more familiar with developer stuff like pull requests (wha?). Somewhere in the middle, I’ve had to shelve some really powerful concepts because the underlying infra doesn’t exist. At the top, I’ve spent some tense moments trying not to cringe with embarrassment as my fledgling bot gives an absolutely asinine answer to a fairly straightforward request.

Bots are like kids. They’re never really grown up. But you can still learn a lot along the way. Here are some of the things I’ve learned on this ride.

Automation should not be the goal

You’d be hard pressed to find an organization operating at scale that isn’t employing some type of automation somewhere. Whether it’s the mouth of the dreaded sales funnel, the deployment of services, or customer service, there’s a robot or two in there somewhere plugging away. It’s easy to fall into the mindset that you’re working towards automation because that’s just what we do, right?

Automation should never be the goal. Automation is a tool that solves some problems; it’s not a panacea that solves all problems. A robot won’t solve your insensitive advertising, it won’t impress customers with canned responses that don’t solve their issues, and it won’t make you any money if it’s a poorly conceived Frankensteinian monster staggering around your infra wrecking other things.

If you’ve already developed a good system and you’re pushing the limits of it, automation may be your solution. Automation can solve problems in three primary areas:

Consistency

Problems and solutions are a many-to-many relationship (sorry, old school DBAs). Some solutions solve many problems and some problems have many solutions. Different people and different situations probably carve different paths through the problem. If that’s presenting a problem, then you may be able to employ automation to help.

Automation doesn’t say “Problem A can only be solved with Solution B”. Automation says “We’ve made a decision to solve Problem A with Solution B even though we know there are other solutions”.

Thresholds

One of the particular problems our automation is attempting to address is the need to do more in less time. If your current system is simply overwhelmed, then automation may be able to help.

Analysis of our current customer input flow showed us that a certain percentage of customer requests are repetitive and computationally trivial to answer. Relieving our fellow human support staff of answering those requests not only makes good business sense but, quite frankly, it’s our duty to the human race. If bots can’t uplift us from drudgery, then who can?

Accuracy

There is almost no end to the amount of information a person may need to comprehend and process during the course of their day. Support staff in particular are the unsung heroes of our current age. Each call or ticket requires a different mindset, a different set of information, and a different set of solutions. It can be difficult to remember every specification about every product at every moment.

Bots suffer no such frailties. Bots can query databases for product specifications, correlate that with the model information provided by the customer, and clearly and definitely state “No, sir. That product will definitely break if you do THAT.”
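Here’s a rough sketch of what that kind of lookup could look like. Everything in it is made up for illustration: the product_specs table, the max_load_kg column, and the model names are assumptions, not anything from our actual systems. It just shows the shape of “look up the spec, compare it to what the customer said, answer plainly”.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical product_specs table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product_specs (model TEXT PRIMARY KEY, max_load_kg REAL)"
    )
    conn.executemany(
        "INSERT INTO product_specs VALUES (?, ?)",
        [("X100", 50.0), ("X200", 120.0)],  # invented example data
    )
    return conn


def check_load(conn: sqlite3.Connection, model: str, requested_load_kg: float) -> str:
    """Compare the customer's stated load against the stored spec for their model."""
    normalized = model.strip().upper()  # customers rarely type model names consistently
    row = conn.execute(
        "SELECT max_load_kg FROM product_specs WHERE model = ?", (normalized,)
    ).fetchone()
    if row is None:
        return f"I could not find a product with model {normalized!r}."
    max_load = row[0]
    if requested_load_kg > max_load:
        return (
            f"No. Model {normalized} is rated for {max_load:g} kg, "
            f"so {requested_load_kg:g} kg exceeds its limit."
        )
    return f"Yes. Model {normalized} is rated for {max_load:g} kg."


if __name__ == "__main__":
    conn = build_demo_db()
    print(check_load(conn, " x100 ", 80))
```

The bot never has to remember the spec. It only has to know where to ask.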

Customers are not to blame for automation failures

One of the challenges we faced with customer service automation is the quality of input. Much of the data we get from customers is free-form text which is notoriously difficult to deal with. In our initial analysis to create the first case models we took all the existing customer data we had and derived a set of cases that our bot should look for. This method of inductive reasoning is a supremely logical way to start out. It wasn’t until our bot was old enough to digest actual customer data that we realized a little deduction would help a lot.

The inductive method scrapes a dead frog from the road, examines the degree of flatness, and says “Based on the evidence before me, I think a truck ran over this frog”. The deductive method says “I envision a frog run over by a truck would look like this. And look slightly different when dropped from a plane. And different again after being trampled by an elk. Therefore, I think this frog was dropped from a plane”. The best solutions come from a combination of both types of reasoning.

No two people will describe the same problem in the same way

Two customers of similar technical expertise will use different terms to describe their website being down. One may say “the site is down” whereas another may say “domain.com has not been available for 20 minutes”. Another may say “Hyperspin has been alerting since 1pm today”. There’s literally no end to the variety of ways a person may describe a problem.

Language is heavily dependent on technical expertise

A technical person may provide traceroute results or DNS lookups to describe the issue they’re seeing, whereas a non-technical person may simply say “Firefox won’t show my site”. Both descriptions are correct, but they look nothing alike.
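To make that concrete, here’s a minimal sketch of mapping several very different phrasings onto one case. The case name site_down and the patterns are hypothetical, written for this post, and a real bot would need far more than a handful of regular expressions. The point is only that one case needs many doors into it.

```python
import re

# Hypothetical patterns that all point at the same underlying case.
SITE_DOWN_PATTERNS = [
    re.compile(r"\b(site|website)\b.*\b(down|offline)\b", re.IGNORECASE),
    re.compile(r"\bnot\s+(been\s+)?(available|reachable|loading)\b", re.IGNORECASE),
    re.compile(r"\bhas\s+been\s+alerting\b", re.IGNORECASE),
    re.compile(r"\bwon['’]t\s+(show|load|open)\b", re.IGNORECASE),
]


def classify(text: str) -> str:
    """Return the hypothetical case name 'site_down', or 'unknown' if nothing matches."""
    for pattern in SITE_DOWN_PATTERNS:
        if pattern.search(text):
            return "site_down"
    return "unknown"


if __name__ == "__main__":
    samples = [
        "the site is down",
        "domain.com has not been available for 20 minutes",
        "Hyperspin has been alerting since 1pm today",
        "Firefox won’t show my site",
    ]
    for sample in samples:
        print(f"{classify(sample):10} <- {sample}")
```

Anything that falls through to unknown is worth collecting, because that pile is where the next pattern comes from.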

Some customers don’t care

There’s a certain population of customers who just simply don’t give a crap. They’ll literally give you nothing at all to work with other than “doesn’t work”. It’s easy to pass judgement that this customer is THE problem, but you don’t know who they are. Maybe they’re the accountant who never wanted this “website gig” at their firm anyhow, but it was foisted on them. Or maybe they’re very technically adept but they recommended a competing service, were overruled, and think you suck.

It’s extremely tempting to get into the mindset that the automation isn’t working because the customers aren’t describing their problems correctly.

“If only customers would give us the right input the automation would do so much better.”

If you find yourself saying that, you’ve fallen into the trap of making automation the goal. Robots don’t read like humans. If customers are not giving you information the robot understands, that’s not their fault. You need to find ways to elicit that information.

Automation touches everything

It’s usually not possible to build a discrete piece of automation that runs all by its lonesome in some dark corner of your network. Bots are pass-through processes. They take information and they produce information. The things on the other ends that give and take information will probably need to be modified to play nice with the bot.

In our case, we know that the front-end UI which collects customer data can really help our bot. While we can’t completely control humans (yet), we can try to provide an interface that elicits the information our bot needs to do better.

Improving the UI can have problems in and of itself. Creating a wizard that traps users in form hell is going to go badly for everyone. On the other hand, having a single free-form input field is guaranteed to produce unusable input. Somewhere in the middle is a submission process that gives your robot just enough information that it can fill in the blanks.

Your humans have to like the bot

Building this bot is really interesting and fulfilling work for me. I think about it all the time — how can it detect more problems; how can its responses be less robotic; how can it respond faster? These things are always on my mind. But those things are not on everyone’s minds.

Other people in the organization may be thinking about what they’ll do when this bot replaces them, or changes their job into something they didn’t sign up for.

Those concerns are completely legit and can’t be ignored.

It’s important to get folks onboard when a part of their job is becoming automated. I don’t mean “onboard” in the slimy manager way, I mean onboard in the “the bot is gonna take away all that crap you hate doing” way.

I’m not an HR guy, I’m a Linux hacker. I’m not giving this advice because I’m worried about the HR issues in your company, I’m giving this advice because engaged people give really good feedback. Unless you’re going to watch your bot every minute of every day, you need everyone else who sees the bot to feel comfortable telling you what bot-boy did well and when he cratered. The loudest, most annoying person in your company is probably giving the best feedback.

Sharing is caring. Recognize the importance of the humans working alongside the bot and actively solicit information from them about how the bot is doing and how it can be made better.

The “High Level” view is not good enough

I’ve lost count of how many times I’ve written a test knowing in my soul that it would completely nail a customer request, and then I cried myself to sleep when it blew up.

Prior to embarking on this project I had nearly a year of firehose-level exposure to customer input. I considered myself pretty knowledgeable about it. To build on this, some use cases had already been identified and the whole package was pretty complete. Customers ask this, we say this…should work.

Honestly, the first runs did pretty much work. It wasn’t awesmazing, but it also wasn’t a total disaster. I’m pretty proud of how good that first version was, but I also can’t ignore what it taught me. The code I wrote was both inspiring and embarrassing…sometimes at the same time.

The lesson is that building automation needs to be done in the muck. The basic framework can be laid out in a theoretical way (and it should be) but the real lessons come from the trenches. Different organizations have different environment segregation requirements. If it is possible in your organization, de-couple the bot input from the bot output. Take prod input and give QA output. We used Slack for this: the bot ate production customer input but spit its responses into a Slack channel where we could review and debug while smart humans continued to talk to the customers. If there is any one thing I can point to that allowed us to really fast-track this project, it was this decoupling.
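For the curious, the decoupling can be sketched roughly like this. This is not our actual code. The function names, the placeholder reply logic, and the webhook URL you would pass in are all assumptions for illustration. The only real-world detail I’m leaning on is that a Slack incoming webhook accepts a JSON body with a text field. The idea is that the bot drafts a reply from real input, and a swappable “sink” decides where that reply goes.

```python
import json
import urllib.request
from typing import Callable, List, Tuple

# A sink receives (customer_message, bot_reply) and delivers it somewhere.
Sink = Callable[[str, str], None]


def make_slack_review_sink(webhook_url: str) -> Sink:
    """Post the bot's draft to a review channel instead of the customer.

    webhook_url is a placeholder for a Slack incoming webhook you control.
    """
    def sink(customer_message: str, bot_reply: str) -> None:
        payload = {
            "text": f"Customer wrote: {customer_message}\nBot would reply: {bot_reply}"
        }
        request = urllib.request.Request(
            webhook_url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=10):
            pass  # we only care that the post went through
    return sink


def make_list_sink(store: List[Tuple[str, str]]) -> Sink:
    """Collect drafts in memory, handy for local testing without any network."""
    def sink(customer_message: str, bot_reply: str) -> None:
        store.append((customer_message, bot_reply))
    return sink


def draft_reply(message: str) -> str:
    """Placeholder for the real bot logic."""
    if "down" in message.lower():
        return "It looks like your site may be unavailable. We are checking now."
    return "Thanks for your message. A support engineer will follow up."


def handle(message: str, sink: Sink) -> str:
    """Draft a reply from production input and hand it to whichever sink is configured."""
    reply = draft_reply(message)
    sink(message, reply)
    return reply


if __name__ == "__main__":
    reviewed: List[Tuple[str, str]] = []
    handle("my site is down", make_list_sink(reviewed))
    print(reviewed)
```

Because the customer never sees what the review sink receives, the bot can be as wrong as it likes while you figure out why.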

Cast your net…narrow?

Going back to the deductive/inductive concepts, there’s an argument for casting a wide net in testing. Grab as much data as you can in order to create opportunities to spot problems. This makes sense and if you have the opportunity to expose your automation to real life without any risk, you should do that. The de-coupled environment we use allows us to do that.

If you’re stuck in a place where you only get fake, obfuscated QA data to work with, then my advice is to cast a narrow net when promoting to production. If you make people wait until you have a fully working car before they see anything in prod, then they’ll never get to appreciate the skateboard and bike rides en route. And your car will also have extremely embarrassing public fireball moments.

It’s best to let your bot deal with a small number of trusted situations in prod, and then iteratively introduce new responses.
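One way to picture the narrow net is an allowlist of cases the bot is trusted to answer on its own, with everything else going to a human. This is a sketch under my own assumptions. The case names and the route function are invented for illustration.

```python
from typing import FrozenSet, Tuple

# Hypothetical set of cases the bot has earned the right to answer in prod.
TRUSTED_CASES: FrozenSet[str] = frozenset({"site_down"})


def route(
    case: str, draft: str, trusted: FrozenSet[str] = TRUSTED_CASES
) -> Tuple[str, str]:
    """Send the draft to the customer only when the detected case is trusted."""
    if case in trusted:
        return ("customer", draft)
    return ("human_review", draft)


if __name__ == "__main__":
    print(route("site_down", "We are checking your site now."))
    print(route("billing_dispute", "Here is a refund."))
```

Introducing a new response then becomes a one-line change to the trusted set, made only after that case has behaved itself in review.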

Plan to throw the first one away

Among his guidelines for creating good software, Eric S. Raymond includes this one:

Plan to throw one [version] away; you will, anyhow.

It’s paraphrased from The Mythical Man-Month, but whatever. The point is that you don’t know what you don’t know until you build the thing you think you know about. It’s basically a Buddhist ideal: don’t get attached to your stinky code.

In this project I haven’t needed, nor wanted, to throw away the entire thing but I am fairly sure I’ve deleted more code than I’ve written at this point. From a volume standpoint, that’s essentially the same thing.

Reluctance to throw away code should not be a matter of pride. There are many reasons why code gets trounced:

it really was a bad idea, but who knew?
the thing it depended on doesn’t exist or work as expected
it works, but too poorly to keep
management sucks

I am a huge believer in being willing to throw the first one away. I think it pays dividends in some very valuable ways. First, it forces me to write the best modular code that I can. Even if my idea is flawed I may think that this particular bit is useful elsewhere, perhaps in places I don’t even know about yet. Second, it keeps me humble. When my bot screws up and someone points and laughs, I’m more willing to kill off that little bastard (the code block, not the human…?)

I think my biggest takeaway from this whole experience is that building automation requires a very global view of things. The code has to be good, the scenarios the bot is expected to react to have to be realistic, the humans who have to coexist with the bot have to be OK with it, and the customers have to be able to withstand a little weirdness.

Good luck.

my shorter content on the fediverse: https://the.mayhem.academy/@jdw