How I replaced 50 lines of code with a single LLM call

Midjourney prompt: blocks of code matching the color palette of (haihai logo)

This is a guest post from my friend Ben Stein. Ben and I worked together for many years at Twilio, where his roles included leading all of Messaging, overseeing all Twilio SDKs, and creating Studio, Twilio's no-code application builder. Today Ben is the co-founder of QuitCarbon, where he helps homeowners transition off of fossil fuels.

I recently needed to write some code to compare two mailing addresses. Seems easy enough. For instance, it's pretty easy to figure out that these two represent different locations:

"123 Main St, Brooklyn, NY 11217"
"29 St Marks Place, Brooklyn, NY 11217"

But what about these two?

"123 Main St, Brooklyn, NY 11217"
"123 Main Street, Brooklyn, NY 11217"

Or these:

"123 Main St, Brooklyn, NY 11217"
"123 Main St. Brooklyn, NY 11217"
"123 MAIN ST. BROOKLYN, NY 11217"
"123 Main St, Brooklyn, NY 11217-3437"

The edge cases were endless. I spent an entire afternoon writing string-matching heuristics, regular expressions, and even implementing Levenshtein distance to answer the simple question, "is this address the same as that one?"

But then, on a lark, I replaced all that code – 50+ lines in all – with a single call to GPT. And within ten minutes and just a few lines of code, I hit 100% accuracy against my test suite!

That experience raised some super interesting questions:

  • Where can you use LLMs to solve day-to-day programming problems faster, with fewer lines of code?
  • Which problems are best suited for LLM replacement?
  • What are the cost and performance implications of LLM replacement?
  • Can we keep a high bar of operational excellence in production?

This post will dive into these topics and lots more.


Tinkering with LLMs

Like many software developers, I’ve spent 2023 tinkering with LLMs, using every new AI coding tool that comes out, and racing to add generative AI experiences into our products.

But only recently have I started using LLMs to solve day-to-day programming problems, replacing more traditional algorithms, data structures, and heuristics with API calls to language models.

When talking to folks about how they are using LLMs – which I do a lot! – the conversation generally fits into 4 categories:

  1. Standalone LLM interfaces via web/mobile interfaces (e.g. ChatGPT, Midjourney)
  2. AI features integrated into tools (e.g. GitHub Copilot, Adobe’s Generative Fill, Notion’s “Start writing with AI…”)
  3. AI-specific products (e.g. Twilio’s CustomerAI, Intercom’s Fin support bot)
  4. Analysis and back office tools (e.g. analyze our company’s PDF with Langchain)

In each case, AI functionality is directly exposed to the user and it’s part of the product or feature. But I’ve had surprisingly few conversations about using AI in code. As software developers, LLMs give us a fascinating new tool in our tool belts to solve day-to-day programming problems (even if our company hasn’t yet changed its name from .com to .ai 🧌).

Having recently started using LLMs to solve traditional programming problems, I was surprised by how different a way of thinking it requires, and by how poorly many of our legacy tools and best practices hold up in this new paradigm.

As such, I’ll be documenting my experiences and learnings as we go. This is the first post in a series that will discuss how we’ve successfully (🤞) used prompts instead of writing business logic in code, how to identify good candidates for such an approach, how to operationalize and maintain such code, limitations and risks, and more.

A Real World Example: Matching Addresses

Rather than rambling on abstractly, let’s jump in with the real world coding example that inspired this post.

Quick context: My company, QuitCarbon, helps homeowners transition off of fossil fuel appliances. As part of this process, we analyze how much natural gas (methane) a family uses, which we can determine from their utility bill. But when importing utility bills, we want to make sure that the service address on their utility bill matches the address of their property. We wouldn’t want to import data from one property into another by accident; multi-tenancy bugs are bad news bears.

Let’s consider a home at “123 Main St, Brooklyn, NY 11217”. We want to be sure that the service address on the bill matches. So in our code, we want to call a function like this:

if property.address.matches?( service_address )
  # continue importing data
end

That should be a trivial string comparison:

def matches?(service_address)
  self.to_s == service_address
end

Hahaha this failed on literally the very first utility bill I checked! It had the address written as “123 MAIN ST, BROOKLYN, NY 11217”. No problem:

def matches?(service_address)
  self.to_s.downcase == service_address.downcase
end

That test passed! Next up, “123 Main Street, Brooklyn, NY 11217”. Notice the “St” vs “Street.” OK so we just need to replace "street" with "st" and repeat for all common abbreviations like Ct, Rd, St, and Ave. Maybe something like:

def matches?(service_address)
  self.to_s.downcase.gsub('street', 'st') == service_address.downcase.gsub('street', 'st')
end

Probably also need to do it for Ter, Cir, Way, Pl, Blvd. Are there others? And annoyingly, what do we do with “123 St Marks Pl”? More heuristics! Zip vs Zip+4? More heuristics!! Brooklyn vs New York? MORE HEURISTICS!!!
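
To give a sense of where this was heading, the normalization helper was growing into something like the sketch below (the abbreviation list is illustrative, not the exact production code):

ABBREVIATIONS = {
  "street" => "st", "avenue" => "ave", "road" => "rd", "court" => "ct",
  "place" => "pl", "boulevard" => "blvd", "terrace" => "ter", "circle" => "cir"
}.freeze

def normalize(address)
  # lowercase, strip punctuation, collapse whitespace
  normalized = address.downcase.gsub(/[.,]/, " ").squeeze(" ").strip
  # swap long street-type names for their abbreviations
  ABBREVIATIONS.each { |long, short| normalized.gsub!(/\b#{long}\b/, short) }
  # drop a trailing Zip+4 suffix
  normalized.sub(/-\d{4}\z/, "")
end

def matches?(service_address)
  normalize(self.to_s) == normalize(service_address)
end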

OK this is getting tedious and I'll never get it all. My next attempt was to switch to fuzzier string matching via Levenshtein distance. We can compute how many single-character edits separate the two address strings, and if that edit distance is small, we can assume they match:

> Levenshtein('123 Main St, Brooklyn, NY 11217', '29 St Marks Place, Brooklyn, NY 11217')
=> 13 # way far off

> Levenshtein('123 Main St, Brooklyn, NY 11217', '123 Main Street, Brooklyn, NY 11217')
=> 4 # quite close

Great! Small differences, say, less than 10 (hella arbitrary), mean that there’s a very good chance they match.

> Levenshtein('123 Main St, Brooklyn, NY 11217', '124 Main St, Brooklyn, NY 11217')
=> 1 # ruh roh

Damn! These are so clearly different properties even though they only differ by a single character! (yes, this was a real world example – we had a homeowner with multiple properties next to one another).
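
For the record, the fuzzy version boiled down to this (a sketch; I'm assuming the text gem's Text::Levenshtein here, and the threshold really is that arbitrary):

require "text"  # provides Text::Levenshtein (assumed gem choice for this sketch)

def matches?(service_address)
  distance = Text::Levenshtein.distance(self.to_s.downcase, service_address.downcase)
  distance < 10  # catches "St" vs "Street", but happily also matches "123" vs "124" :(
end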

Alright, so it’s now been a couple mind-numbing hours of adding more test cases and more and more heuristics for each edge case. I can keep adding if-statements and keep increasing our test suite, and yeah, I could probably get a pretty darn reliable function written after another hour or two. But, there’s gotta be…

A Better Way!

It’s 2023 after all. So on a lark, I deleted all my code and rewrote it like this:

def matches?(service_address)
  prompt = "I will give you two addresses, potentially in different formats."
  
  prompt << " You will reply 'Yes' if there is a good chance they represent the same property or 'No' if they likely do not."
  
  prompt << " You will not provide any explanations."
  
  prompt << " Here are the two properties I want you to compare: #{self.to_s} and #{service_address}"
  
  OpenAI.chat(prompt).response =~ /Yes/
end

Whoa! I simply told the LLM the logic I wanted. It took fewer than 5 minutes to write and got 90%+ accuracy against our test suite on the first try!
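
For concreteness, OpenAI.chat above is just a thin wrapper around the chat completions API. With the ruby-openai gem (my assumption for this sketch, not necessarily what we run in production), the underlying call looks roughly like this:

require "openai"  # ruby-openai gem (assumed client for this sketch)

client = OpenAI::Client.new(access_token: ENV["OPENAI_API_KEY"])

response = client.chat(
  parameters: {
    model: "gpt-3.5-turbo",
    temperature: 0,  # we want determinism, not creativity (more on this below)
    messages: [{ role: "user", content: prompt }]
  }
)

answer = response.dig("choices", 0, "message", "content")  # => "Yes" or "No"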

Can we futz around^H^H^H prompt engineer a little bit and improve this? Here was my 2nd attempt:

# Formatting and bullets provided for readability for you, mere mortal

I will give you two addresses, potentially in different formats.

You will reply 'Yes' if there is a good chance they represent the same property or 'No' if they likely do not. You will not provide any explanations.

* For example, given "123 Main St" and "123 Main Street" you would reply "Yes" because one is an abbreviation.
* Given "123 Main St, Brooklyn, NY 11217" and "123 Main St" you would reply "Yes" because one just has more specificity but they are likely the same.
* Given "123 MAIN ST" and "123 Main St" you would reply "Yes" because it's just a case difference..
* Given "123 Main St" and "124 Main St" you would reply "No" because they are different street numbers.
* Given "123 Main St, Brooklyn, NY 11217" and "123 Main St, Baltimore MD 21208" you would reply "No" because they are different cities.
* Given "123 Main St, Brooklyn, NY 11217" and "123 Main St, New York, NY 11217" you would reply "yes" because they have the same street and zip code and Brooklyn is either the same or next to New York.

Here are the two properties I want you to compare: #{self.to_s} and #{service_address}
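
In the actual code, that whole prompt lives in one string; a heredoc keeps it readable without the chain of appends. Roughly, as a sketch (with a hypothetical comparison_prompt helper and the example bullets elided):

def comparison_prompt(service_address)
  <<~PROMPT
    I will give you two addresses, potentially in different formats.
    You will reply 'Yes' if there is a good chance they represent the same property or 'No' if they likely do not. You will not provide any explanations.
    * For example, given "123 Main St" and "123 Main Street" you would reply "Yes" because one is an abbreviation.
    ... (remaining examples as above) ...
    Here are the two properties I want you to compare: #{self.to_s} and #{service_address}
  PROMPT
end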

And BOOM! 100%(!) accuracy against our test suite with just 2 prompt tries and under 10 minutes of programming time!
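
For context, the test suite is nothing exotic: pairs of addresses plus the expected answer. A minimal sketch in Minitest, with illustrative pairs and a hypothetical Address class standing in for our model:

require "minitest/autorun"

class AddressMatchTest < Minitest::Test
  SAME = [
    ["123 Main St, Brooklyn, NY 11217", "123 Main Street, Brooklyn, NY 11217"],
    ["123 Main St, Brooklyn, NY 11217", "123 MAIN ST. BROOKLYN, NY 11217"],
    ["123 Main St, Brooklyn, NY 11217", "123 Main St, Brooklyn, NY 11217-3437"]
  ].freeze

  DIFFERENT = [
    ["123 Main St, Brooklyn, NY 11217", "124 Main St, Brooklyn, NY 11217"],
    ["123 Main St, Brooklyn, NY 11217", "29 St Marks Place, Brooklyn, NY 11217"]
  ].freeze

  def test_matching_addresses
    SAME.each { |a, b| assert Address.new(a).matches?(b), "expected #{a} to match #{b}" }
  end

  def test_non_matching_addresses
    DIFFERENT.each { |a, b| refute Address.new(a).matches?(b), "expected #{a} not to match #{b}" }
  end
end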

Third Time’s the Charm

OK so I’m super happy with the accuracy and almost ready to ship it, but the battle-scarred graybeard in me is a little worried that when we start seeing edge cases in production, we won’t be able to determine why we got a given result, which will make future prompt improvements hard.

So we made one small change to the Reply ‘Yes’ or ‘No’ part to be:

You will respond with JSON and only JSON without extra prose.

One key will be 'result' and the possible values are Yes and No.

The other key will be 'reason' and include a free text explanation of why you chose Yes or No.


And check out the final result for “123 Main Street” vs “123 MAIN ST”:

{
  "result": "Yes",
  "reason": "The addresses are very similar, with only minor differences in formatting (lowercase vs uppercase). The street name 'Main Street' is abbreviated as 'MAIN ST' in the second address. However, the numbers '123' are the same in both addresses, indicating that they likely refer to the same property."
}

Wawaweewah! We can now add a simple

log.debug response['reason']

for future debugging and our code can check

response['result'] =~ /Yes/

And voila we’re done! A fraction of the engineering time AND more accurate results. You got your chocolate in my peanut butter.
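
Putting those pieces together, the tail end of matches? comes out roughly like this (a sketch: the same thin OpenAI.chat wrapper as before, plus JSON parsing and a fail-closed fallback that I'm assuming here):

require "json"

def matches?(service_address)
  prompt = comparison_prompt(service_address)  # builds the prompt shown above (hypothetical helper name)
  raw    = OpenAI.chat(prompt).response        # same thin wrapper as before

  parsed = JSON.parse(raw)                     # e.g. { "result" => "Yes", "reason" => "..." }
  log.debug parsed["reason"]                   # keep the model's explanation around for debugging
  parsed["result"] =~ /Yes/
rescue JSON::ParserError
  false                                        # if the model ignores the JSON instruction, fail closed (assumption)
end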

Technical details and considerations

Determining viable problem candidates for LLMs

What made this a good candidate for an LLM? First, there was lots of string manipulation. Whenever I find myself doing lots of string interpolation, substitution, and regexing, that’s a good flag to think LLM. Next, I was layering on edge cases and heuristics and if statements; there wasn’t a clear algorithm or science behind my work. That’s likely another good flag that AI could be a good solution.

Using Prompts for Flow Control

Because we’re using this for flow control (if-then-else block), we want deterministic response formats. GPT is quite good at respecting directives like “Only reply Yes or No” and “You will not provide any explanations.” Those are critical to avoid responses like “123 Main St and 321 Main St are not the same”. Once I was confident in the “Yes” or “No” response, it was easy to just check for that string.

To get more determinism, we also set the temperature on the API request. Temperature is a value between 0 and 2 that controls the desired level of randomness and creativity. Since we want neither creativity nor randomness in our responses, we set the temperature as low as possible (0), which helped.

Model Selection

We ran our test suite against both GPT-4 and GPT-3.5 Turbo and found no differences, so we chose the latter for its better speed and lower cost.

Performance

This function is called from an asynchronous background job that runs infrequently, so the performance implications – potentially a second or more of added latency per call – aren’t concerning.

Cost

Since it’s called only once per customer, the costs are fairly trivial. A more interesting perspective: if the LLM approach saved just one hour of engineering time, we could process over 100,000(!) utility bills and it would still be cheaper.
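
Back-of-the-envelope, using assumed 2023 gpt-3.5-turbo list prices (roughly $0.0015 per 1K prompt tokens and $0.002 per 1K completion tokens) and a guessed $150/hour fully loaded engineering cost; none of these figures are measured:

# All numbers below are illustrative assumptions, not measured values
prompt_tokens     = 400   # prompt with few-shot examples (rough estimate)
completion_tokens = 50    # small JSON response (rough estimate)

cost_per_call = (prompt_tokens / 1000.0) * 0.0015 + (completion_tokens / 1000.0) * 0.002
# => ~$0.0007 per address comparison

bills_per_engineering_hour = (150.0 / cost_per_call).floor
# => roughly 200,000 comparisons for the price of one engineering hour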

Conclusion

The takeaway here is that the product or feature or customer experience in question has nothing to do with AI. It’s a different way to approach traditional programming problems. Just like we can solve a problem differently by using an array or a hash, by swapping out our sort algorithm, or by moving business logic from application code to SQL, LLMs give us a new tool in our tool belts to approach software problems.

Up Next

Writing this simple function and deploying it to production – while keeping a high bar of operational excellence – raised a TON more questions, many of which will be discussed in future posts:

  • Identifying software problems that are good candidates for AI
  • Crafting prompts for application logic and flow control, not for creativity
  • Operational excellence: architecting for reliability
  • Change management for prompts
  • Monitoring response accuracy and quality
  • Tracking costs, token counts, and optimizations
  • How to do test-driven development with LLMs
  • Performance implications of using LLMs in production
  • Multisourcing vendors and LLM redundancy

Thanks to Bharat Guruprakash, Umair Akeel, Greg Baugues, Rob Spectre, and Ricky Robinett for reviewing early drafts of this post, and to Sam Harrison and Matt Fornaciari for their code reviews and letting me actually deploy this in production.