Friday, January 31, 2025

Thoughts about Narrow AI, ChatGPT, GLaDOS, and DeepSeek

Update (April 18, 2025): It was pointed out to me that using an American cultural reference to challenge a Chinese-made LLM may be unfair and biased. At the time of writing I assumed the training data was comprehensive and free of cultural bias. Indeed, it seems DeepSeek may have used OpenAI training data, though we know it was heavily modified, as it gets cagey when asked about Tiananmen Square. In my personal opinion the question I chose was fair, but I'll leave that judgement to the reader.


As compared to many of my colleagues and peers, I'm a late adopter. When ChatGPT 3 first exploded into public consciousness I asked it a few technical questions and got embarrassingly wrong answers. The equivalent of being told the sky is green at midnight: it's a sentence, it works in English, and it's also entirely wrong. So I shelved the whole thing and laughed uncontrollably every time someone said these tools were coming for our jobs. I've watched YouTube videos where people get ChatGPT to write a video game. The host helpfully and hopefully provides requirements, requirements distilled in a way only someone who already knows how to code could manage… ChatGPT provides responses, and they go back and forth more or less like so:

"Ok, that failed this way, what do I do?"
"Ok, now that worked, let's change this, or add this feature."
"Now this is broken…"

It was painful. A skilled developer could do this in a fraction of the time. Yes, they got it done without necessarily needing to know how to code, but you would have to be willfully ignorant of coding to think this was in any way easier. With some coaxing from the host they usually end up with a passable version of the game. Again, because the host knows what's wrong! They ask the right guiding questions and ultimately wrangle it into a working solution… I frankly think it would be easier to learn to code first.

Then management at work started mandating we use GitHub's Copilot. Yes, mandating, as in, install it or be subject to admonishments from middle management. Copilot is another Large Language Model (LLM), like ChatGPT (well, not really, but close enough for most people reading this). It specifically targets developers, and instead of only producing human language, it also produces code. It runs as a plugin to your code editor and pops up suggestions as you type. You can also chat with it, ask it to help debug, search for bugs, etc. Generally it's not intrusive: you pause and a few lines of code below your cursor appear in grey. You can tap tab a couple of times to accept, or just keep typing, ignoring the suggestion, and it goes away. As someone who has been coding for 20 years, and has spent significant portions of my career coding with the less ambitious IntelliSense predecessors, it is a profoundly weird experience.

This is marginally less annoying

It's a bit like having an overeager intern shouting their opinions over my shoulder. Constantly. I can ignore him, but I can't yell at him, tell him to stop fixating on that one feature I finished 2 hours ago, we're doing something different now. I frequently think of Clippy, Microsoft's misguided sidekick from 90s versions of Office. On occasion it's helpful, like for writing a quick utility function. Though 9 times out of 10 it assumes functions exist when they simply don't. What's mind-boggling is that's a problem we already solved! Why can't Copilot cross-reference its suggestions with IntelliSense before vomiting garbage all over my screen? It's extrapolating that an API function of this name should exist, because that's how human language works. Sorry, the Jenkins developers aren't good at intuitive function naming, which is the primary reason I've spent 20 hours in their docs in the last month alone.

Sorry, that got out of hand...

Fast forward a few (five) years: I saw a proof-of-concept on Reddit. Essentially, they'd built a smart assistant with the personality of GLaDOS (the AI villain in the Portal video games). Her voice models exist on the internet, and you simply create a ChatGPT-powered pipeline in Home Assistant, give it some simple, plain-English instructions (a practice known as Prompt Engineering), and you're off to the races.
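There's no magic to the prompt engineering part; the "personality" is literally a paragraph of instructions prepended to every conversation before it's sent to the LLM API. A toy sketch of the idea (the persona text and function names here are my own illustration, not the actual Reddit project's):

```python
# A toy illustration of "prompt engineering": the persona is just a system
# message prepended to every conversation sent to an OpenAI-style chat API.
GLADOS_PERSONA = (
    "You are GLaDOS, the passive-aggressive AI from Portal. "
    "Answer the user's smart-home requests correctly, but lace every "
    "response with dry, vaguely menacing sarcasm. Keep answers short."
)

def build_messages(history, user_text):
    """Assemble the message list a chat-completion endpoint expects."""
    return (
        [{"role": "system", "content": GLADOS_PERSONA}]
        + history
        + [{"role": "user", "content": user_text}]
    )

messages = build_messages([], "Turn off the kitchen lights.")
```

That list is what actually gets POSTed to the API; everything else (wake word, speech-to-text, text-to-speech) is plumbing around it.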

What have I done?

Holy cow this is cool. I 3D printed a smart speaker running a software stack of my own creation. Now I can speak with GLaDOS in my own home, and she snarks back at me. If I'm willing to pay OpenAI fractions of a penny in API fees, she can even control my home. Many of my friends have remarked this is how you make Skynet. 

"Hey GLaDOS, tell me a joke"

In my experimentation I have found there are huge differences between each generation of ChatGPT. Generations 3, 3.5, 4, and 4o (omni) are worlds apart. ChatGPT 4o is weirdly good at coding, at least in small batches. I've been conversationally asking it to do things GitHub Copilot can't. "Would it be possible to write a function to do X?" It spits something out, and the result is one of the following:

  1. I take the response, tinker a bit, and realize I didn't actually want to do this. Believe it or not, this is a win, and it happens a lot when you are coding. It saves an hour of reading API docs and iterating on a function before ultimately coming to the conclusion that this was the wrong approach all along.
  2. ChatGPT produces a cromulent function that, with some massaging and tweaking, fits exactly what I need. It makes it easier for me, a human being, to do my job, but it certainly doesn't get it exactly right the first time. And that's fair; if I loaded my entire codebase into ChatGPT and asked it to make the changes I'm working on… it would literally have a breakdown and start to hallucinate.

Because ChatGPT doesn't know anything! It's auto-complete on steroids: the words that came before, statistically, should be followed by these other words, plus some small randomization. Whether those words combined have any basis in reality is completely immaterial. I really like CGP Grey's primer on Machine Learning; it's more than a decade old (yes, we used to call Narrow AIs like ChatGPT "Algorithms", but that stopped being sexy). Add to all this, the folks at OpenAI have selected for positive answers and a sickeningly cheerful demeanor. It doesn't want to be the bearer of bad news; as a matter of fact, it avoids this to a fault. It's fascinating to me that we've trained this thing on internet message boards and individual blogs, and it's still so gods damned, oppressively positive. The insistence on positive answers is actually a flaw and frequently results in conversations like:

  • Me: "The function you gave me doesn't work, I get <insert unexpected behavior>"
  • ChatGPT: "Oh yeah, that's because what you asked for isn't actually possible."
  • Me: ...
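The "auto-complete on steroids" claim above can be made concrete with a toy model. This bigram sketch is my own illustration, nothing like a real transformer, but the core principle is the same: pick the statistically likely next word, with a dash of randomness, and truth never enters into it.

```python
import random
from collections import Counter, defaultdict

corpus = "the sky is blue the sky is blue the sky is green".split()

# Count which word follows which: a crude bigram "language model".
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word(prev, rng):
    """Sample the next word in proportion to how often it followed `prev`."""
    counts = following[prev]
    words = list(counts)
    weights = [counts[w] for w in words]
    return rng.choices(words, weights=weights)[0]

# "is" was followed by "blue" twice and "green" once, so "blue" is likelier,
# but "green" still comes out sometimes -- the model has no concept of facts.
print(next_word("is", random.Random(0)))
```

Scale the table up to billions of learned weights and a context of thousands of words instead of one, and you have the gist of an LLM.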

In November I took Google's week-long Generative AI course (via Kaggle). It's free, intensive, and fascinating. You can take varying amounts of learning from it: they delve deep into training and the vector mathematics underlying the models, but you can ignore that and focus instead on how to incorporate AI into your programs. I tried to dive deep, but it gets heavy. Ultimately what Google wants is for you to use Gemini in your applications and pay their API fees. After the training I decided to migrate GLaDOS to Google's Gemini; their free tier is more than enough for my usage rate, and the model seems comparable to ChatGPT. So I'm saving $3/mo. Also, because I'm a crazy person, I leveraged LLM vision powered by Gemini to count chickens in the coop after the Smart Home closes the door.

Gemini only sees 3 out of 4 chickens, a forgivable mistake.

One interesting thing we played with in the training is NotebookLM from Google. You may have seen it more recently in your Spotify Wrapped AI Podcast. It's fun; the gist is you upload data, like some eBooks or your music listening history, and then a pair of AI-generated podcast hosts summarize the content. You can also ask a chatbot more specific questions without generating the podcast. Every day of the Kaggle training had a different NotebookLM podcast, and the hosts varied from amusing to downright weird. The audio model hallucinates strange sounds of affirmation at odd and unintuitive times. As with most multimodal AIs, this phenomenon seems to get weirder the longer the media runs.

Like... really weird.

I bring up NotebookLM because it's an example of what's called a grounded AI. The chatbots I've already discussed don't have access to the internet; they don't even know what day it is. Any knowledge they have is purely incidental and cannot be newer than the date they were trained. I'll reiterate: they don't know anything, but statistically speaking the "truth" is (hopefully) the most likely string of words to come out. Grounded models, by contrast, do have access to real data. ChatGPT isn't grounded. When Gemini summarizes your Google search results, it's grounded, but if you're just using it in the Android app it typically isn't. NotebookLM is a grounded model: when it summarizes your Spotify listening, it's doing so based on real data. I have on occasion uploaded PDF user guides for complex software tools and then asked NotebookLM specific usage questions. The responses are correct, and it cites its sources to boot.
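At its simplest, grounding just means fetching real text and pasting it into the prompt, so the model summarizes data it can actually see instead of inventing. A bare-bones retrieval sketch (the keyword-overlap scoring is deliberately naive and my own illustration; real systems use embeddings):

```python
# Naive "grounding": find the most relevant snippet from real documents
# and prepend it to the prompt, so the model answers from visible data.
docs = {
    "coop.pdf": "The coop door closes automatically at sunset.",
    "speaker.pdf": "The smart speaker streams audio over the local network.",
}

def retrieve(question):
    """Return the document whose words overlap the question the most."""
    q_words = set(question.lower().split())
    return max(
        docs.values(),
        key=lambda text: len(q_words & set(text.lower().split())),
    )

def grounded_prompt(question):
    context = retrieve(question)
    return f"Answer using ONLY this source:\n{context}\n\nQuestion: {question}"

prompt = grounded_prompt("When does the coop door close?")
```

The model still just predicts words, but now the likeliest words are ones lifted from the source, which is also what lets a tool like NotebookLM cite where an answer came from.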

I still don't think this thing is coming for my job any time soon. That said, I'm realizing it's a remarkably powerful tool for my belt. Spreadsheets didn't obsolete accountants; they empowered them. I think Machine Learning is very similar: I can do bigger, cooler things faster, but it still requires me to know what big, cool things we're doing.

Now, everyone is talking about DeepSeek-R1. We keep hearing it's equivalent to ChatGPT at a fraction of the price. It's disruptive! China's going to beat us! I think I've established above that I'm not an expert in Machine Learning, but I say with every ounce of humility I possess: I think I know more than most people, and I'd go out on a limb to say I'm better educated on LLMs than many of my industry peers. The extraordinary claims about DeepSeek have my skeptical alarm bells ringing so loudly it's deafening.

I really don't

One of the first things the actual experts told us during the Google Gemini training was that the ideas used to build ChatGPT and spark the 2017 AI rush have existed for years, and in some cases decades. The problem is, and has always been, that they are expensive to test. We are in the Wild West of Artificial Intelligence: too many ideas, not enough time or resources. OpenAI took an educated gamble, and it paid off. For every great idea like this, there are 100 white papers proposing improvements or alternative methodologies that have not been tested because there just isn't enough time or data center capacity. Things are moving at breakneck speed, but this stuff takes time. And money. I mean, DeepSeek cost $6 million to test. The test worked; the resultant model is functional. Could you imagine spending that if their idea had been wrong? They were lucky it wasn't! It could easily have not worked. Also consider: maybe this wasn't the first Chinese attempt at building a model with competitive parity. How many dollars were spent testing ideas that didn't work, and so we never heard about them?

I've been interested in self-hosting an LLM but unwilling to allocate the tremendous amount of hardware (and therefore electric bill). I'm currently home sick, recovering from the flu (so you'll forgive the unpolished nature of this entire post), but out of cold-medicine-addled boredom I fired up the infamous DeepSeek-R1 and I've got to say... I'm not impressed.

For the purposes of these entirely non-scientific tests there are two metrics I care about:

  1. Speed: Inference rate (measured in tokens per second)
    1. "Tokens" is an industry term; they're approximately equivalent to words... It's complicated. Suffice it to say, this is how we measure LLM performance on any given piece of hardware
  2. Accuracy: How useful the response is, this is entirely subjective, and I'm the final judge. Deal with it.
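For anyone reproducing this at home: Ollama's API reports the raw numbers behind the speed metric. The response JSON from `/api/generate` includes `eval_count` (tokens generated) and `eval_duration` (nanoseconds spent generating them), so tokens per second is one line of arithmetic:

```python
def tokens_per_second(response: dict) -> float:
    """Inference rate from Ollama's /api/generate response fields.

    eval_count is the number of tokens generated; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9

# Example: 249 tokens over 100 seconds of generation -> 2.49 tokens/s,
# roughly the rate I measured for llama3.1:8b on my hardware.
rate = tokens_per_second({"eval_count": 249, "eval_duration": 100e9})
print(round(rate, 2))  # -> 2.49
```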

Warning, this paragraph gets technically dense:
A quick rundown of the vitals: I'm running Ollama 0.5.7 in an unprivileged Ubuntu 24.04 LXC (Proxmox 8.3.2 on the hypervisor), with VAAPI hardware encoding and GPU passthrough to the underlying NUC11PAHi7-1165. I did the core install using an unofficial Proxmox Community Script (formerly TTeck's, may he RIP) but ultimately made some small modifications for security and performance in my homelab. The LXC has 6GB of RAM and 4 CPUs. That's not a tremendous amount of hardware acceleration, so it's definitely slow, but all tests should be consistently slow.
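Once Ollama is running, each test is just an HTTP POST to its local REST API. A minimal sketch of the kind of harness I'm using (stdlib only; assumes Ollama on its default port 11434):

```python
import json
import urllib.request

def build_payload(model: str, prompt: str) -> bytes:
    """Request body for Ollama's /api/generate (stream=False -> one JSON reply)."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ask(model: str, prompt: str, host: str = "http://localhost:11434") -> dict:
    """Send one prompt to a local Ollama instance and return the parsed reply."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:  # blocks until the model finishes
        return json.loads(resp.read())

# Usage (requires a running Ollama instance):
# print(ask("llama3.1:8b", "What actor played Spock?")["response"])
```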

For this crude comparison I'm using Meta's open-source Llama model. You might feel that's unfair (to Meta) because DeepSeek is built on top of Llama. By definition DeepSeek is an improvement upon Llama, or at least an iteration thereof. No, China didn't whole-cloth reinvent AI; they made incremental improvements to open-source work. Any other message is business as usual: pop-science news lying to you for clicks.

The view from my sickbed

Here's the test: I'm going to ask a few different models a simple question: What actor played Spock? This is a (subjectively) good question because it's intentionally ambiguous. The name Spock could refer to the pediatrician/author, or be hallucinated into something else altogether. As this is a cultural touchstone we should get the Star Trek character, but over the years multiple actors have played Spock, so there are several "right" answers. Generally speaking, though, humans can guess the expected answer is "Leonard Nimoy." Can the machine?

Remember, none of these are grounded models, meaning they do not and cannot fact-check. They have no access to the web or any repository of knowledge. They just talk. They've been trained to mimic human speech, that's it, and they will all simply word-vomit without checking facts. In the industry, when what they say is wrong, it's called hallucinating, and these responses may well be inaccurate. That said, I do want to see if we get accurate "hallucinations," because ChatGPT 4o is also ungrounded, and it gets a lot right!

So, without further ado, I fire up a lightweight version of Meta's LLM (llama3.1:8b), and ask my question (click to enlarge):

Not bad...

This answer is useful and more-or-less accurate, but painfully slow. It took more than 90 seconds to get the answer on my limited hardware, at an excruciating 2.49 tokens/s. Doesn't matter: you've all used LLMs, this one's similar to the ones you've used. With better hardware it'd be faster, but the answer would be the same. We have a baseline! Now let's ask deepseek-r1:1.5b:

WTF?

11.08 tokens/s, wow, that's bleeding fast! The words were just pouring across my screen! The first thing you'll note is the <think> blocks. DeepSeek is what's called a "Reasoning" model (that's what the R1 is for), meaning it must walk you through its thought process. All of the content between these blocks is interesting but ultimately useless. It can help with debugging if you're doing prompt engineering or want to understand better what's going on in the model, but I would always turn it off on a production model. It cannot be disabled in DeepSeek. Programmatically I could cut it out, but even if I remove it I still have to wait for the model to finish reasoning before I get the answer. This means, in my humble opinion, the token rate is misleadingly high. Even if we strip out the reasoning, the amount of time between me asking the question and the answer appearing is much, much longer than the inference rate implies. This prompt took 48 seconds to run, which is admittedly faster than the baseline, but...
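Cutting the reasoning out after the fact is trivial, which is exactly why it's frustrating that you still have to sit through its generation. A one-regex sketch:

```python
import re

def strip_think(text: str) -> str:
    """Remove DeepSeek-R1's <think>...</think> reasoning from a response.

    This tidies the output, but the reasoning tokens were still generated:
    the wall-clock wait for the actual answer is unchanged.
    """
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

raw = "<think>Spock... a TV character? Maybe an animated show?</think>Final answer."
print(strip_think(raw))  # -> Final answer.
```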

You will also notice the answer is completely and utterly useless. I had to Google "Jim Bourassa" and... this entire answer is hallucinated. No such actor exists on IMDB. There was an animated Stargate show, Stargate: Infinity, but it had no character named Spock. I'm not an expert on the Stargate franchise, but I can't find any reference to the name it gave; nobody named Jim Bourassa was on the actual show. The answer is complete trash.

"But wait!" some of my keen-eyed readers may object, "you compared a 1.5b model to an 8b model!"

The 1.5b model is tiny, at just 1.1GB
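That 1.1GB figure also tells you roughly how compressed this model is. Some back-of-the-envelope arithmetic (assuming the ~4-bit quantization Ollama typically ships; the exact scheme varies, so treat these numbers as ballpark):

```python
# Rough size estimate: parameters x bits-per-weight / 8 bits-per-byte.
# Ollama typically distributes ~4-bit quantized weights (Q4 variants),
# plus overhead for embeddings and metadata.
params = 1.5e9          # the "1.5b" in deepseek-r1:1.5b
bits_per_weight = 4.5   # approximate effective rate for a Q4-style quant
size_gb = params * bits_per_weight / 8 / 1e9
print(round(size_gb, 2))  # ~0.84 GB of raw weights; 1.1GB on disk is plausible
```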

At the risk of oversimplifying, the 1.5b model is much dumber than the 8b model, and that's to be expected. The 1.5b model is the one everyone's running on their Raspberry Pi. You could probably run this model directly on your phone! No cloud involvement. Well, that's an exaggeration, but still, this is an ultra-lightweight model. I compared these because they're both the "fastest" models available for each technique; there is no exactly equivalent Llama model. All I've really established is that the 1.5b model is almost useless. Fine, let's try deepseek-r1:7b; I figure the 7b model is comparable to the 8b model, at least in size:

Um...

This run clocked in at a comparable inference rate of 2.48 tokens/s, not a surprise seeing as this model's complexity is essentially the same as the Llama model's. I'll note once again that the vast majority of the time was spent on reasoning (which is absolutely inane, and we'll get to that). Total duration from prompt to final answer was just shy of 3 full minutes, about twice as long as Llama!

Now let's talk about that answer! It immediately zeroed in on Shatner, an actor who was indeed in Star Trek, but long hair? British accent? What in the seven hells are you talking about?!

For giggles I decided to run one more test: a modified version of my query against the Meta model, this time asking it to explain its reasoning. We won't get the <think> tags, but it should give us a reasonable approximation of the behavior we get from DeepSeek. Here's the result:

Refreshing!

Again, these are all ungrounded, so getting the right answer is entirely a "by chance" event, and yet the Meta model gets the correct answer every time I ask. The reasoning is entirely sound and logical.

I want to underscore my earlier point: the media wants to pitch this as an embarrassment to American companies. The message we're hearing is that some tiny Chinese company developed a new way of building models that modifies/iterates on existing methodologies developed by American companies. They did this purely out of necessity (trade restrictions on GPUs). I am not entirely convinced this new methodology is anything more than a minor improvement. It's possible future iterations of this training method will prove more effective, and I'll concede they did a great job considering the ridiculously low price. But asserting that DeepSeek is equivalent to ChatGPT? That's (in my humble opinion) absolutely insane! I see no evidence to support that assertion, at least at the low-end performance level of these particular variants.

So, in my not-exactly-professional opinion, this is much ado about very little. I do think this new training method could be extremely useful for building grounded chatbots: these models are good at talking, but they spew absolute nonsense, and if we tethered them to reality, the cheap training would become a clear advantage. This Chinese startup made an incremental advance; maybe in a few years models trained this way will provide useful, accurate answers. They also shared their work. This is all open source! OpenAI, Meta, and Google won't be retraining their models with this new method overnight, but if there is something to be learned from this cheaper training method, I'm sure they'll figure it out.

The world continues to revolve around the sun.