The biggest problem with AI isn't intrinsic to AI. It's that it's owned by the same few people, and I have less and less interest in what those people think, and more and more criticism of what the effect of their work has been.
The definition of “open source” AI sucks. It could just mean that the generated model weights are shared under an open source license. If you don’t have the code used to train the model under an open source license, or you can’t fully reproduce the model using the code they share and open source datasets, then calling a model “open source” feels weird as hell to me.
At the same time, I don’t know of a single modern model that only used training data that was taken with informed consent from the creators of that data.
tbf the widely used nomenclature for them is “open weights”, specifically to draw that distinction. There are genuinely open source models, in that the training data and everything is also documented, just not as many.
The OSI doesn’t require open access to training data for AI models to be considered “open source”, unfortunately. https://opensource.org/ai/open-source-ai-definition
I agree that “open weights” is a more apt description, though
If you're familiar with the concept of an NP-complete problem, think of the weights as just one candidate solution to one.
The Traveling Salesman Problem is probably the easiest analogy. It's as though we're all trying to find the shortest path through a bunch of points (e.g. towns), and when someone says "here is a path that I think is pretty good", that's analogous to sharing network weights for an AI. We can then all openly test that solution against other solutions and determine which is "best".
What they aren't telling you is whether people traveling that path somehow benefits them (maybe they own all the gas stations along it, or maybe they've hired highwaymen to rob travelers on it). And figuring out whether that's the case in a hyper-dimensional space is non-trivial.
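To make the "easy to check, hard to find" part concrete, here's a minimal sketch in plain Python with made-up city coordinates: anyone can verify the length of a published tour in linear time and compare it against other tours, even though finding the optimal tour is intractable. This isn't anyone's actual benchmark code, just an illustration of the idea.

```python
import math

# Hypothetical city coordinates; any list of (x, y) points works.
cities = {"A": (0, 0), "B": (3, 4), "C": (6, 0), "D": (3, -4)}

def tour_length(tour):
    """Sum the straight-line distances along a proposed tour, returning to the start."""
    total = 0.0
    for i in range(len(tour)):
        x1, y1 = cities[tour[i]]
        x2, y2 = cities[tour[(i + 1) % len(tour)]]
        total += math.hypot(x2 - x1, y2 - y1)
    return total

# Someone publishes a tour ("here is a path that I think is pretty good"):
proposed = ["A", "B", "C", "D"]
print(tour_length(proposed))  # cheap to evaluate and compare against other published tours
```

The point being: checking a published solution (or benchmarking published weights) is cheap; knowing how it was found, or what else is baked into it, is not.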
uh sure. My point is that sharing weights is analogous to sharing a compiled binary, not source code.
Yes, and I don’t like the common comparison to binary blobs, and I’m attempting to explain why.
It is inherently safer to blindly run weights than to blindly execute a binary. The issues only arise if you then blindly trust the outputs from the AI. But you should already have something in place to sanitize outputs and limit permissions, even for the most trustworthy weights.
It’s basically like hiring someone and wondering if they’re Hydra; no matter how deep your background check is, they could always decide to spontaneously defect and try to sabotage you. But that won’t matter if their decisions are always checked against enough other non-Hydra employees.
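For what it's worth, here's a rough sketch of what "sanitize outputs and limit permissions" can look like in practice. This isn't any particular framework's API, just illustrative Python around a hypothetical model response: treat the text as untrusted data, validate it against a whitelist, and never execute it directly.

```python
import json

ALLOWED_ACTIONS = {"summarize", "translate", "label"}  # whitelist of things we permit

def run_untrusted_output(raw_output: str):
    """Treat model output as untrusted data: parse it, validate it, only then act on it."""
    try:
        request = json.loads(raw_output)           # never exec/eval model text directly
    except json.JSONDecodeError:
        return None                                # reject anything that isn't well-formed
    if not isinstance(request, dict):
        return None                                # reject unexpected shapes
    if request.get("action") not in ALLOWED_ACTIONS:
        return None                                # reject actions outside the whitelist
    if len(request.get("argument", "")) > 10_000:
        return None                                # cap sizes to limit abuse
    return request                                 # only now hand it to downstream code
```

The same checks apply whether the weights came from the most trusted lab or a random torrent; the trust boundary sits around the outputs, not the weights.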
I don’t know of a single modern human that only used training data that was taken with informed consent from the creators of that data.
Sorry, just to be clear, are you equating a human learning to an organization scraping creative works as inputs for their software?
Everything is a remix of a copy of derivative works. People learn from other people, from other artists, from institutions that teach entire industries. Some of it had "informed consent"; some of it was "borrowed" from other ideas without anybody's permission. We see legal battles on a monthly basis over whether these four notes are too similar to those other four notes. Half of the movies are reboots, and some of them are themselves reboots of reboots.
"Good artists copy, great artists steal."
No human has truly had an original thought in their head, ever. It's all stolen knowledge, whether we realize it's stolen or not. In the age of the Internet, we steal now more than ever. We pass memes to each other made of stolen material. Journalists copy entire quotes from other journalists, then write articles about some guy's Twitter post (which he actually got from Reddit), and that article gets posted on Facebook. And by the time it reaches Lemmy, we post the whole article because we don't want the paywall.
We practice Fair Use on a regular basis by using and remixing images and videos into other videos, but isn't that similar to an AI bot looking at an image, deriving some weights from it, and throwing it away? I don't own this picture of Brian Eno, but it's certainly sitting in my browser cache, and Brian Eno and Getty Images didn't grant me "informed consent" to do anything with it. Did I steal it? Is it legally or morally wrong to have it sitting on my hard drive? If I photoshop it into a dragon with a mic at a mixing board, and then pass it around, is that legally or morally wrong? Fair Use says no.
It’s folly to pretend that AI should be held to a standard that we ourselves aren’t upholding.
A: “Hey, did you just download all the movies of 2024?”
B: “Yeah”
A: “That’s illegal”
B: “No no, it’s not illegal. I’m gonna train my AI model. So it’s fine”
A: “With your i3 PC with 4 GB of RAM?”
B: “I’ll get better hardware later on. I might have to watch them to understand what to use for training”
A: “As long as you use it for training, it’s fine. Make sure to not watch it for your entertainment”
B: “Yeah sure, bye”
Legally, it's the one distributing the material who gets punished, not the downloader. Though if they're using torrent software, they're both downloading and hosting at once. Copyright law doesn't give a shit why they're watching it.
If I download pictures from the internet just by visiting a website, I'm not suddenly going to get punished because they're copyrighted. Otherwise, every one of us is now in trouble because of the linked article.