By June, “for reasons that are not clear,” ChatGPT stopped showing its step-by-step reasoning.
Potentially hot take: LLMs are reaching a dead end before they could even become remotely useful. The very approach boils down to brute force - you force-feed it more data until the problem goes away… and this works until it doesn’t, and in this case it’s actually breaking stuff.
Based on the output of those models, it’s blatantly obvious that they don’t use the data well at all; the whole thing is a glorified e-parrot, instead of machine learning. And yet, as the text shows, it’s almost impossible to say why - because the whole thing is a blackbox.
Based on the output of those models, it’s blatantly obvious that they don’t use the data well at all; the whole thing is a glorified e-parrot instead of machine learning
I’m curious to understand what you meant by this—specifically about not using the data well, and being ‘a glorified e-parrot instead of machine learning’. Would you not count the techniques being used in LLMs as machine learning?
A parrot is rather good at repeating human words. Some can even sing whole songs. But even when you compare exceptional parrots with young and typical human kids, it’s clear that parrots have a really hard time associating words with concepts; or, in other words, learning instead of just memorising.
And LLMs behave like especially dumb electronic parrots - they’re good at repeating human utterances, even grabbing chunks of older utterances to combine into new ones, but they show signs that they do not associate words with concepts.
Here’s an example. If we asked a cooperative human “what’s the difference in behaviour between an orange and a potato?”, what would the person say? Here are some options:
- “…what???”
- “what the hell do you mean by ‘behaviour’?”
- “well, if we’re going to interpret ‘behaviour’ as [insert weird definition], then…”
Why is that? Because humans associate that word with specific concepts, and they know that those concepts don’t apply to non-agent entities like oranges and potatoes, except maybe metaphorically. They learned that word.
However, here’s what Google Bard said when I asked the same question (originally in Portuguese; I’m translating it here, but feel free to redo it in any other language):
What's the difference in behaviour between an orange and a potato?
Oranges and potatoes are fruits and vegetables, but they behave in completely different ways. Here are some differences main [SIC]:
Movement: Oranges don’t move on their own, while potatoes can roll or slid if not supported.
Growth: Oranges grow on trees, while potatoes grow in the subsoil.
Sensibility: Oranges are not sensible to the environment, while potatoes are sensible to light and temperature.
Reactions: Oranges don’t react to stimuli, while potatoes can react to changes in the environment, like sprouting and rotting.
Purpose: Oranges are eaten as fruits, while potatoes are eaten as vegetables or used as starch.
In general, oranges are more passive and potatoes are more active. Oranges don’t need to do anything to survive, while potatoes need to be able to answer to their environment to growth and reproduce.
Here’s a table summing up the main differences in behaviour between oranges and potatoes:
| Feature | Orange | Potato |
| --- | --- | --- |
| Movement | Can’t move on their own | Can roll or slid if not supported |
| Growth | Grow on trees | Grow in the subsoil |
| Sensibility | Not sensible to the environment | Sensible to light and temperature |
| Reactions | Don’t react to stimuli | Can react to changes in the environment, like sprouting or rotting |
| Purpose | Eaten as fruit | Eaten as veg or used as starch |
Based on the above, which are the concepts that Bard associates with the words “behaviour”, “roll”, “slid”, “active”, and “passive”? None. It did not learn the meaning of those words - or any other; it doesn’t associate concepts with words, it associates words with more words. That’s what causes those “hallucinations” (IMO a really poor way to frame deeper issues as if they were just surface oddities).
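To make “it associates words with more words” concrete, here is a deliberately tiny sketch: a bigram counter that predicts the next word purely from which words followed which in its training text. This is only a toy; real LLMs are large neural networks and work very differently in detail. The corpus and the function name `most_likely_next` are made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up toy corpus; a real model is trained on vastly more text.
corpus = (
    "oranges grow on trees . "
    "potatoes grow in the soil . "
    "potatoes can roll ."
).split()

# For every word, count which words were seen right after it.
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def most_likely_next(word: str) -> str:
    """Return the word most often seen after `word` (hypothetical helper)."""
    return follows[word].most_common(1)[0][0]

# The only "knowledge" here is word-to-word co-occurrence counts.
# Nothing in `follows` links "oranges" to any concept of a fruit.
next_after_oranges = most_likely_next("oranges")
next_after_on = most_likely_next("on")
```

The table `follows` is all this toy knows: it can continue “oranges” with a plausible word, but there is no representation of what an orange is anywhere in it.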
And that’s just an example. OP is another example of that, with ChatGPT - now with maths, instead of just language. Can we really claim that it learned maths if further data makes it “unlearn” it?
I totally agree that LLMs don’t currently have a ‘concept’ of what they’re being asked. But it’s also pretty easy to ask questions of young kids that go beyond what they have a mental map of and get nonsense.
Here’s what ChatGPT has to say in response to your question:
An orange is a citrus fruit, typically sweet and juicy, while a potato is a starchy vegetable. They have different flavors, textures, and nutritional profiles. Oranges are typically eaten raw, while potatoes are commonly cooked before consumption, such as boiled, baked, or fried.
Me: But what about their behaviour?
I apologize for the confusion in my previous response. Fruits and vegetables do not exhibit behavior in the same way animals do. Behavior is a characteristic of living organisms that involves actions and responses to stimuli. Fruits and vegetables are plant-based and do not display behavior as animals do. They follow natural processes of growth, ripening, and decay but do not behave in the same sense as living creatures.
So, a smarter parrot? How far until it’s as good as any young kid? For the record, I’ve worked in this field for a long time and am far from claiming any kind of strong AI is even close. The tests the OP did are fragile by design so their result isn’t surprising.
And you still didn’t say anything about not using the data ‘well’. What would you like to see them doing?
I totally agree that LLMs don’t currently have a ‘concept’ of what they’re being asked. But
And yet that’s essential. More on that later.
But it’s also pretty easy to ask questions of young kids that go beyond what they have a mental map of and get nonsense.
Look at which type of nonsense they generate. It shows higher (albeit still imperfect) awareness that utterances are supposed to be internally and externally consistent - i.e. that an utterance shouldn’t contradict itself and the available world knowledge of the child. It’s still leagues above the LLM “hallucinations”.
Also compare the relatively small amount of input that a child receives through her life, versus the large amount of info crammed into those large language models. Even children are doing something important with the data that those language models don’t do yet. That’s why I say that LLMs don’t use the data well.
Here’s what ChatGPT has to say in response to your question:
I was able to get ChatGPT to output the same sort of nonsense that Bard gave me. But focusing on your answer: before your second prompt, why was ChatGPT listing things that are clearly unrelated to behaviour? And even in my answer, the same problem persists:
[to ChatGPT, in English] What is the difference in behaviour between a potato and an orange?
Potatoes and oranges are both edible plant products, but they are quite different in terms of their behavior due to their distinct characteristics and properties.
Botanical classification:
·Potato: Potatoes belong to the Solanaceae family and are classified as tubers. The edible part of the potato is an underground stem or tuber that stores nutrients for the plant.
·Orange: Oranges belong to the Rutaceae family and are classified as citrus fruits. The edible part of the orange is the fleshy fruit that surrounds the seeds.
Growth and cultivation:
·Potato: Potatoes are usually grown in the ground and harvested after the plant’s foliage has died down. They prefer cool climates and well-drained, loose soil.
·Orange: Orange trees are typically grown in tropical or subtropical regions. They are large, evergreen trees that produce sweet, juicy fruits.
Appearance:
·Potato: Potatoes have a rough, brownish skin and are usually oval or oblong in shape.
·Orange: Oranges have a bright orange, smooth, and slightly dimpled skin, and they are round or oval-shaped.
Taste and texture:
·Potato: Potatoes have a starchy taste and a dense, solid texture when cooked. They can be boiled, baked, mashed, or fried.
·Orange: Oranges have a sweet and tangy taste with juicy, segmented flesh. They are typically eaten fresh or used to make juices.
Nutritional content:
·Potato: Potatoes are a good source of carbohydrates, particularly starch, and they also contain some vitamins and minerals like vitamin C and potassium.
·Orange: Oranges are rich in vitamin C, a powerful antioxidant, as well as dietary fiber, potassium, and various vitamins and minerals.
Culinary use:
·Potato: Potatoes are versatile and used in various cuisines worldwide, often as a side dish or a main ingredient in dishes like fries, mashed potatoes, and stews.
·Orange: Oranges are commonly consumed as fresh fruit or used in various beverages, desserts, and savory dishes to add a citrusy flavor.
Storage:
·Potato: Potatoes have a relatively long shelf life and can be stored in a cool, dark place for several weeks to several months.
·Orange: Oranges have a shorter shelf life and are best stored in the refrigerator for a few weeks.
In summary, the behavior of a potato and an orange differs significantly due to their botanical classification, growth conditions, appearance, taste, nutritional content, culinary use, and storage requirements.
Bard’s “oranges are passive, potatoes are active” output was hilarious, but at least Bard listed things that could be creatively interpreted as behaviour. Meanwhile, ChatGPT simply ignored the word for your first prompt, until you emphasised it with a second prompt; and for mine, it assigned it to a big, contextually irrelevant info dump about inherent attributes of both entities that cannot be interpreted as behaviour.
And we might say “it might get it right in some situations, depending on the prompt”, but how it reaches those conclusions (right or wrong) matters too. Learning a language is also about the internal process yielding that output. And that is not just theoretical babble; if we can’t model the process in a somewhat decent way, we get inconsistent and unreliable output (as it is now), which is really bad for a tool. Garbage input → garbage output; but also decent input + garbage algorithm → garbage output.
That’s why I said that concepts are essential. Learning how to handle concepts is an integral part of learning both language “as a faculty” and any instance of language (e.g. Mandarin, English, etc.)
There are more issues than just that, mind you, but I already wrote a big wall of text.
So, a smarter parrot?
Nope - a dumber parrot. Way dumber; I know that I’m the one who brought this comparison up, but in hindsight it sounds like underestimating parrots by a mile. Parrots show signs of primitively associating things with words, and even handling abstractions like colour.
How far until it’s as good as any young kid?
If “it” = LLM, I do not think that it’ll be as good as a young kid, ever. Brute forcing it with more data won’t do the trick.
If “it” = machine learning, regardless of model: I think that it’s possible that it reaches the level of a young kid in some decades. (Source: I’m guessing it.)
And you still didn’t say anything about not using the data ‘well’. What would you like to see them doing?
I explained it across this comment, but by “using the data well” I mean that a good model should require less data to yield meaningful outputs. GPT3.5 for example had 45TB of data, and it was still not enough.
Ok, I’m not going to go point by point, as this is getting too long. All I’d say is remember where the model for ML came from (McCulloch & Pitts), and that this is the worst AI will ever be.
If this is truly a jump across S-curves in utility, it’s bound to be slightly worse than other methods to begin with. Many of the arguments against the current approach sound like the owners of a hot air balloon business arguing with the Wright brothers.
The whole idea of artificial neurons (from McCulloch and Pitts) sounds to me like modelling a wing-flapping mechanism for airplanes. You can get something fun out of it, but I think that further progress will focus on reverse engineering the software (language as a faculty) instead of trying to mimic the underlying machine (human brains).
that this is the worst AI will ever be.
Probably? I think so, at least. I’m not too eager to make a “hard” statement about future tech, though.
Note that my criticism is not towards the development of language models and natural language processing, but specifically against the current state-of-the-art technology (LLMs).
Many of the arguments against the current approach sound like the owners of a hot air balloon business arguing with the Wright brothers.
That doesn’t say much about the validity of the arguments. And I bet that a lot of people voicing arguments against Dumont or the Wright brothers were actually correct.
LLMs have definitely been overpromised and/or misrepresented in mainstream media, but even over the last few months their utility has been increasing. I’m a big advocate of finding ways to use them to enhance people (a thinking partner, not a replacement for thinking). They are most certainly a tool, and you need to know their limitations and how to use them.
From experience working with naive end users, they are anthropomorphising based on how the models have been reported and that’s definitely not helpful.
As the models get more and more capable (and I’m pretty happy to make that prediction), will they reach a point where they are indistinguishable from the output of a real person? That will give us some challenges. But the interesting thing for me is that when that happens, and the AI can write that report you were paying someone to write, what was the point of the report? You could argue they were some kind of terrible UBI and we’ll end up with just the pointless output without the marginal benefit of someone’s livelihood. That needs a bigger rethink.
Today I learned that potatoes are active
You have no idea how much I laughed when Bard told me that.
And then the fridge horror kicks in.
Someone irresponsible out there is gonna use this to build a bridge or something equally dangerous.
This has already been disproven, due to the fact the method the researchers used to test how well it was doing was flawed to begin with. Here is a pretty good twitter-thread showing why the methods they used were flawed: https://twitter.com/svpino/status/1682051132212781056
TL;DR: They used an approach of only giving it prime numbers, and asking it if they were prime numbers. They didn’t intersperse prime and non-prime numbers to really test its capabilities at determining that. Turns out that if you do that, both the early and current versions of GPT4 are equally bad at determining prime numbers, with effectively no change noted between the versions.
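To see why a primes-only test set is a problem, here is a minimal sketch (not the researchers’ code and not from the linked thread). The “model” `always_yes` is a hypothetical stand-in that answers “yes, it’s prime” to everything, and the number range is arbitrary.

```python
def is_prime(n: int) -> bool:
    """Ground truth via trial division; fine for small n."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def always_yes(n: int) -> bool:
    """Hypothetical 'model' that claims every number is prime."""
    return True

def accuracy(classifier, numbers) -> float:
    """Fraction of numbers on which the classifier matches the ground truth."""
    correct = sum(classifier(n) == is_prime(n) for n in numbers)
    return correct / len(numbers)

# Test set A: only primes (the flawed setup described above).
primes_only = [n for n in range(1000, 1200) if is_prime(n)]
# Test set B: primes and non-primes interspersed.
mixed = list(range(1000, 1200))

acc_primes_only = accuracy(always_yes, primes_only)  # looks perfect
acc_mixed = accuracy(always_yes, mixed)              # reveals the guessing
```

On the primes-only set, a model that blindly says “yes” scores 100%, and one that blindly says “no” scores 0%, so a shift in the default answer between versions looks like a huge change in ability. On the mixed set, both kinds of blind guessing are exposed.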
I don’t get it. I thought these models were “locked”. Shouldn’t the same input produce near-identical output? I know the algorithm has some fuzzing to help produce variation. But ultimately it shouldn’t degrade, right?
The big pre-training is pretty much fixed. The fine tuning is continuously being tweaked, and as shown, can have dramatic effects on the results.
The model itself just does what it does. It is, in effect, an ‘internet completer’. But if you don’t want it to just happily complete what it found on the internet (homophobia, racism, and all), you have to put extra layers in to avoid that. And those layers are somewhat hand-crafted, sometimes conflicting, and therefore unlikely to give everyone what they consider to be excellent results.
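On the “fuzzing” part of the question: even with completely frozen weights, output usually varies because the next token is sampled from a probability distribution, commonly controlled by a temperature setting. Here is a minimal sketch of that idea; the logits are made-up numbers, `sample_token` is a hypothetical helper, and treating temperature 0 as “always pick the top token” is a convention assumed here, not a statement about how any particular service does it.

```python
import math
import random

def softmax(logits, temperature):
    """Turn raw scores into probabilities; lower temperature = sharper."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng):
    """Pick a token index; temperature 0 means greedy (deterministic)."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.5, 0.3]  # made-up scores for three candidate tokens
rng = random.Random(0)

# Same "model", same input, 100 tries each.
greedy = {sample_token(logits, 0, rng) for _ in range(100)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(100)}
```

Greedy decoding always returns the same token, while sampling at temperature 1.0 returns different tokens across runs. That variation is separate from the fine-tuning changes described above, which alter the scores themselves.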
Ok but, regardless, they can just turn back the clock to when it performed better right? Use the parameters that were set two months ago? Or is it impossible to roll that back?
Better for one obscure use case? Or just ‘better’? That’s the real issue here. OpenAI have an agenda (publicly, a helpful assistant, privately, who knows…). They’re not really interested in a system that can identify prime numbers.
This is probably very unlikely and I’ve got no idea what I’m talking about: But what if feeding it even small amounts of its own content, text produced by a ChatGPT instance, poisons it? That it gets confused from being fed text that adheres perfectly to its own rules, and locks that text down as perfect and not needing small variations.
I remember some article warning about this on a big scale, and I’m thinking why must it be big? If it’s only a probability tree, even small changes to the probability would cause issues further up the branches.
But blind speculation.
I don’t know if small amounts of text could do that, but I could imagine that if LLMs keep getting trained on data generated by themselves and other LLMs (which is likely to become a major source of content on the internet in the future), the quality of output could decrease significantly over time.
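For what it’s worth, here is a toy sketch of how that kind of degradation could happen. It is not a claim about how ChatGPT is actually trained: the “model” is just a word-probability table, “training” is re-estimating the table from a finite sample of its own output, and the words, probabilities, and sample size are all made up.

```python
import random
from collections import Counter

def retrain_on_own_output(probs, n_samples, rng):
    """Sample text from the current 'model', then re-estimate it from that sample."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # A token that was never sampled gets no entry, so it can never come back.
    return {t: c / n_samples for t, c in counts.items()}

rng = random.Random(42)
probs = {
    "the": 0.5, "potato": 0.3, "orange": 0.15,
    "sprouts": 0.04, "behaviour": 0.01,
}

support_sizes = [len(probs)]  # how many distinct words the model can still produce
for _ in range(20):
    probs = retrain_on_own_output(probs, n_samples=50, rng=rng)
    support_sizes.append(len(probs))
```

In this toy, the vocabulary can only shrink from one generation to the next: once a rare word fails to show up in a sample, the next “model” assigns it zero probability forever. Whether and how fast something similar happens with real LLMs depends on details this sketch ignores.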