LLMs are the new Toilet Paper

Look up “analysis paralysis” in the dictionary and you’ll find a picture of the toilet paper aisle. Dozens of brands and varieties, each with its own marketing promises. What’s the difference between plush, soft, extra soft, and ultra soft? And is that more important than strong, extra strong, or durable? Is 2-ply better than single-ply, and is 3-ply too much? Consumers are overwhelmed by choice, and it’s impossible to differentiate from the packaging. I hate buying toilet paper.

That anxious feeling of choosing toilet paper is the exact same anxious feeling I got today after reading OpenAI’s blog post on their new embeddings and models. I have a new use case and want to evaluate whether an LLM is the right tool. Which model should I start with? Here are the options I can get just from OpenAI:

  • gpt-4
  • gpt-4-32k
  • gpt-4-0125-preview
  • gpt-4-1106-preview
  • gpt-4-1106-vision-preview
  • gpt-3.5-turbo-1106
  • gpt-3.5-turbo-instruct

Now add in Claude 2.0, Claude 2.1, and Claude Instant from Anthropic and popular open source models like LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B, LLaMA-1 65B, Mistral 7B, Mixtral 8x7B. And this is all before we even consider the half a million(!) open source models on Hugging Face!

Sure, most people in 2024 are going to start with the GPT-4 API, especially to validate their use case. But the FOMO, the curiosity, and the feeling that I’m missing something hit very quickly, and they’re very real. Am I missing opportunities for increased performance? Speed? Cost savings? Should I multisource? Fine-tune? And do I need to refactor the app I wrote last year on gpt-3.5-turbo? How would I even know??

Back to the toilet paper aisle

Here I am. I’m standing dumbfounded in the grocery store aisle. I’m trying to make a decision about Charmin vs Cottonelle vs Quilted Northern. Unfortunately (arguably thankfully), you can’t test toilet paper in the store. You need to buy it first, take it home, and try it out. But with all the FAMILY PACK MEGA ROLLS, that’s certainly not convenient. What I actually want is a sample pack of 10 different mini rolls, each with a different feature (soft vs extra soft vs ultra soft) so I can assess the differences and what I care about. I also need to test with my partner and my kids and understand their preferences.

Is testing different LLM models/features harder or easier than testing toilet paper? Well, it’s easy enough to swap out the model name string in an API call on my laptop. That’s definitely easier than a trip to Costco. But unfortunately it’s really not that simple. The results depend greatly on (a) use case and (b) prompt quality/customization. For example, it may be possible to get gpt-4 quality out of the gpt-3.5 model just by improving the prompt. What I would then need is an evaluation rubric that works for my use case, and then a way to run a grid of N system prompts * M user inputs against each model (see the sketch after the notes below).

(Some quick notes to pre-empt the HN comments: 

  • I’m intentionally glossing over a formal definition of LLM “quality”. Quality is both use-case specific and the subject of ever-evolving research. In this post, just assume it’s some unitless measure of “good” or “bad”
  • I’m focusing on the common use cases where out-of-the-box APIs or open source models are good fits, which is the vast majority of usage today. If you’re training models from scratch or fine-tuning, that’s a different beast

)
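
To make that grid concrete, here’s a minimal sketch using the OpenAI Python client (v1+). The model names and prompts are just examples pulled from the list above, and the score() rubric is a stand-in; defining that rubric is the genuinely hard, use-case-specific part.

```python
# Minimal sketch of the N system prompts x M user inputs grid.
# Assumes OPENAI_API_KEY is set; score() is a placeholder rubric.
from itertools import product
from openai import OpenAI

client = OpenAI()

MODELS = ["gpt-4-0125-preview", "gpt-3.5-turbo-1106"]
SYSTEM_PROMPTS = ["You are a terse assistant.", "You are a detailed assistant."]
USER_INPUTS = ["Summarize the plot of Hamlet in one sentence."]

def score(output: str) -> float:
    """Placeholder: stands in for whatever 'good' means for your use case."""
    return float(len(output) > 0)

results = {}
for model, system, user in product(MODELS, SYSTEM_PROMPTS, USER_INPUTS):
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    results[(model, system, user)] = score(resp.choices[0].message.content)

# Print the grid, best-scoring combinations first
for key, value in sorted(results.items(), key=lambda kv: -kv[1]):
    print(key, value)
```

Swapping in Anthropic or an open source model means swapping the client call, but the shape of the grid stays the same. The loop is the easy part; the rubric is where all the work hides.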

Estimating Consumption-Based Usage Costs for 2-Ply

Comically, my weird analogy between LLMs and toilet paper extends to pricing as well, so let’s keep running with it. LLM queries are typically priced at something like $0.0010 / 1,000 tokens, and toilet paper is priced at $0.51 / 100 sheets. And just like I can’t intuitively reason about how many tokens my customers will use this month without an online calculator / scenario generator, I have no clue how long 12 rolls * 300 sheets will last a family of four, or whether that’s a lot or a little.
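
For what it’s worth, the token math itself is trivial once you commit to some usage assumptions; picking those assumptions is the hard part. A quick back-of-envelope sketch, with completely made-up numbers:

```python
# Back-of-envelope monthly cost estimate under made-up usage assumptions.
requests_per_day = 5_000       # assumed request volume
tokens_per_request = 1_500     # assumed prompt + completion tokens
price_per_1k_tokens = 0.0010   # example price, $ per 1,000 tokens

monthly_tokens = requests_per_day * tokens_per_request * 30
monthly_cost = monthly_tokens / 1_000 * price_per_1k_tokens
print(f"{monthly_tokens:,} tokens/month -> ${monthly_cost:,.2f}/month")
# 225,000,000 tokens/month -> $225.00/month
```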

Estimating costs aside, price adds an additional layer of complexity to the buying/selection criteria. I might love chemical-free, quilted, organic 2-ply, but is that worth the extra $0.045 / 100 sheets? And while I’d prefer my LLM to have an average MTEB score of 64.6% and an MMLU of 50+, is that worth an extra $0.0001 / 1K tokens? In both cases, I don’t know!

Can we Learn from History?

When faced with new scenarios like this, I always try to learn from history and look for analogues. Are there similar scenarios I’ve faced in the past where I had this much choice in a technology? And I’m struggling to find any!

Sure, we test & debate Postgres vs MySQL vs NoSQL databases. But realistically there’s just a handful of choices. I guess Javascript frameworks have a ton of choices, but that decision usually comes down to the team’s experience and a preference for new hotness vs tried-and-true. Payment API vendors? Email providers? Linux distros? While there’s always some choice in the market, none of those past scenarios (a) offered such a wide variety of options and price points or (b) changed by the week(!)

So Now What?

At this point (Feb ‘24), I don’t yet have a strong opinion about how LLM selection criteria will evolve, either for me and my team or for developer best practices in general.

I’ll reiterate that, for most use cases in 2024, developers will start by testing against the latest-and-greatest from OpenAI first to validate feasibility. Then they’ll look to other models to balance latency, alignment, and cost. But I’d wager most of this testing will be fairly ad hoc and lack rigor, e.g. “claude 2.1 feels like it hallucinates less” or “gpt-3.5 was good enough in our testing”. I also think there will be more multisourcing in this space than we see in many other domains, but that introduces lots more complexity and is likely a subject for another post.

So tell us! How are YOU making your decisions about which models to choose?

Oh, and just to close out the toilet paper thing… turns out most shoppers just buy what their parents bought. If you grew up a 2-ply Charmin kid, you’re probably still buying 2-ply Charmin. Same for things like toothpaste (Crest vs Colgate) and dish soap (Dawn vs Ajax).

Sadly, my Mom doesn’t have brand loyalty to any transformer architecture 😂