

Basically a deer with a human face. Despite probably being some sort of magical nature spirit, his interests are primarily in technology, politics, and science fiction.
Spent many years on Reddit before joining the Threadiverse as well.
They believe that Jesus will come back when Armageddon happens, and Armageddon is supposed to happen in Israel.
You’re still setting a high standard here. What counts as a “well trained” human, and how many SO commenters count as that? Also, “easier to teach” is complicated. It takes decades for a human to become well trained; an LLM can be trained in weeks. And an individual computer that’ll be running the LLM is “trained” in minutes, since it just needs to load the model into memory. Once you have an LLM you can run as many instances of it as you want to spend money on.
There’s no guarantee LLM will get reliably better at everything
Never said they would. I said they’re as bad as they’re ever going to be, which allows for the possibility that they don’t get any better.
Even if they don’t, though, they’re still good enough to have killed Stack Overflow.
It still makes some mistakes today that it did when introduced and nobody knows how to fix that yet
And humans also make mistakes. Do we know how to fix that yet?
If they aren’t comfortable with their Discord messages being public, perhaps they shouldn’t have posted those messages in a public forum that the public can access.
Good thing human teachers never have hidden biases.
How does this play out when you hold a human contributor to the same standards? They also often fail to summarize information accurately or bring up the wrong thing. Lots of answers on Stack Overflow are just plain wrong, or focus on the wrong thing, or don’t reference the correct sources (when they reference anything at all). The most common criticism of Stack Overflow I’m seeing is how its human contributors direct people to other threads and declare that the question is “already answered” there when it isn’t really.
LLMs can do a decent job. And right now they are as bad as they’re ever going to be.
That’s the neat thing, you don’t.
LLM training is primarily about getting the LLM to understand concepts. When you need it to be factual, or are working with it to solve novel problems, you can put a bunch of relevant information into the LLM’s context and it can use that even if it wasn’t explicitly trained on it. It’s called RAG, retrieval-augmented generation. Most of the general-purpose LLMs on the net these days do this; when you ask Copilot or Gemini about something, the response will often have footnotes pointing to the material they searched up in the background and used as context.
So for a future Stack Overflow LLM replacement, I’d expect the LLM to be backed up by being able to search through relevant documentation and source code.
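To make that concrete, here’s a rough sketch of the retrieve-then-prompt idea in Python. The function names and the keyword-overlap “retriever” are placeholders of my own, not any product’s actual implementation; a real system would use a proper search index or embeddings.

```python
# Minimal sketch of the RAG flow described above. The naive keyword-overlap
# scoring is only there to keep the example self-contained; real systems use
# a search index and/or embeddings.

def retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Rank candidate documents by crude keyword overlap with the query."""
    terms = set(query.lower().split())
    return sorted(
        documents,
        key=lambda doc: len(terms & set(doc.lower().split())),
        reverse=True,
    )[:top_k]

def build_rag_prompt(query: str, documents: list[str]) -> str:
    """Put the retrieved passages into the model's context alongside the question."""
    context = "\n\n".join(retrieve(query, documents))
    return (
        "Answer the question using only the sources below.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}"
    )

# The returned prompt is what gets sent to whatever LLM API you're using, so the
# model answers from the supplied documentation rather than from its weights alone.
```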
This is an area where synthetic data can be useful. For example, you could scrape the documentation and source code for a Python library and then use an existing LLM to generate questions and answers about that content to train future coding assistants on. As long as the training data gets well curated for quality, it’s perfectly useful for this kind of thing; no need for an actual forum.
AI companies have a lot of clever people working for them; they’re aware of these problems.
There will eventually be enough public domain content that AI will be at the quality it is today with public materials alone.
So, AI will always be ~95 years behind the times?
Except the AIs produced by Disney et al, of course. And those produced by Chinese companies with the CCP stamp of approval. They’ll be up to date.
Many people with positive sentiments towards AI also want that.
So they’re still feeding LLMs their own slop, got it.
No, you don’t “got it.” You’re clinging hard to an inaccurate understanding of how LLM training works because you really want it to work that way, because you think it means that LLMs are “doomed” somehow.
It’s not the case. The curation and synthetic data generation steps don’t work the way you appear to think they work. Curation of training data has nothing to do with Yahoo’s directories. I have no idea why you would think that’s a bad thing even if it was like that, aside from the notion that “Yahoo failed therefore if LLM trainers are doing something similar to Yahoo then they will also fail.”
I mean that they’re discontinuing search engines in favour of LLM generated slop.
No they’re not. Bing is discontinuing an API for their search engine, but Copilot still uses it under the hood. Go ahead and ask Copilot to tell you about something; it’ll have footnotes linking to other websites, showing the search results it’s summarizing. Similarly with Google: you say it yourself right here that their search results have AI summaries in them.
No there’s not, that’s not how LLMs work, you have to retrain the whole model to get any new patterns into it.
The problem with your understanding of this situation is that Google’s search summary is not solely from the LLM. What happens is that Google does the search, finds the relevant pages, puts the content of those pages into their LLM’s context, and asks the LLM to create a summary of that information relevant to the search that was used to find it. So the LLM doesn’t actually need to have that information trained into it; it’s provided as part of the context of the prompt.
You can experiment a bit with this yourself if you want. Google has a service called NotebookLM, https://notebooklm.google.com/, where you can upload documents and then ask an LLM questions about their contents. Go ahead and upload something that hasn’t been in any LLM training sets and ask it some questions. Not only will it give you answers, it’ll include links that point to the sections of the source documents where it got those answers from.
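If it helps, here’s roughly what that packaging step looks like in code. This is my own illustration, not Google’s actual pipeline, but it shows why nothing needs to be retrained: the pages are fetched at query time and handed to the model as numbered context it can cite.

```python
# Rough illustration (not Google's actual pipeline) of how retrieved pages or
# uploaded documents can be packaged into a prompt so the model's answer can
# point back at numbered sources, NotebookLM-style.

def build_cited_prompt(question: str, sources: list[tuple[str, str]]) -> str:
    """sources is a list of (title, text) pairs produced by an ordinary search step."""
    numbered = "\n\n".join(
        f"[{i}] {title}\n{text}"
        for i, (title, text) in enumerate(sources, start=1)
    )
    return (
        "Using only the numbered sources below, answer the question and cite "
        "sources like [1] or [2] after each claim.\n\n"
        f"{numbered}\n\n"
        f"Question: {question}"
    )

# Nothing here touches the model's weights: brand-new information shows up in
# the answer because it was supplied in the prompt, not because it was trained in.
```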
No, it’s not “LLMs all the way down.” Synthetic data is still ultimately built on raw data, it just improves the form that data takes and includes lots of curation steps to filter it for quality.
I don’t know what you mean by “a replacement for search engines.” LLMs are commonly being used to summarize search engine results, but there’s still a search engine providing it with sources to generate that summary from.
Thanks for asking. My comment was off the top of my head, based on stuff I’ve read over the years, so first I did a little fact-checking of myself to make sure. There’s a lot of black magic still involved in training LLMs, so the exact mix of training data varies a lot depending on who you ask. In some cases raw data is still used for the initial training of LLMs, to get them to the point where they’re capable of responding coherently to prompts, and synthetic data is more often used for the fine-tuning phase, where LLMs are trained to be good at responding to prompts in particular ways. But there doesn’t seem to be any reason why synthetic data can’t be used for the whole training run; it’s just that well-curated, high-quality raw data is already available.
This article on how to use LLMs to generate synthetic data seems to be pretty comprehensive, starting with the basics and then going into detail about how to generate it with a system called DeepEval. In another comment in this thread I pointed to NVIDIA’s Nemotron-4 models as another example.
Raw source data is often used to produce synthetic data. For example, if you’re training an AI to be a conversational chatbot, you might produce synthetic data by giving a different AI a Wikipedia article on some subject as context and then telling it to generate questions and answers about the content of the article. That Q&A output is then used for training.
The resulting synthetic data does not contain any of the raw source, but it’s still based on that source. That’s one way to keep the AI’s knowledge well grounded.
It’s a bit old at this point, but last year NVIDIA released a set of AI models called Nemotron-4 that were specifically designed for performing this process. That page might help illustrate the process in a bit more detail.
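As a rough sketch of that step (the generate() function here is a hypothetical stand-in for whatever teacher model you use, not Nemotron-4’s actual interface):

```python
# Sketch of the synthetic-data step described above: hand a teacher model some
# raw source text and have it produce Q&A pairs for training. generate() is a
# hypothetical stand-in for whatever teacher model API you're using.
import json

def make_qa_pairs(article_text: str, generate, num_pairs: int = 5) -> list[dict]:
    prompt = (
        f"Read the following article and write {num_pairs} question/answer pairs "
        "about its content, as a JSON list of objects with 'question' and 'answer' keys.\n\n"
        f"{article_text}"
    )
    return json.loads(generate(prompt))

def to_training_records(articles: list[str], generate) -> list[dict]:
    """Flatten the generated pairs into chat-style records for fine-tuning."""
    records = []
    for article in articles:
        for pair in make_qa_pairs(article, generate):
            records.append({
                "messages": [
                    {"role": "user", "content": pair["question"]},
                    {"role": "assistant", "content": pair["answer"]},
                ]
            })
    return records  # derived from the source articles rather than copied from them
```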
Betteridge’s law of headlines.
Modern LLMs are trained using synthetic data, which is explicitly AI-generated. It’s done so that the data’s format and content can be tailored to optimize its value in the training process. Over the past few years it’s become clear that simply dumping raw data from the Internet into LLM training isn’t a very good approach. It sufficed to bootstrap AI development, but we’re kind of past that point now.
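To give a toy example of what that tailoring and curation can mean in practice (a deliberate oversimplification on my part, not any lab’s actual pipeline), candidate data gets run through filters before it ever reaches a training run:

```python
# Toy illustration of training-data curation: deduplicate and apply cheap
# quality heuristics so low-value samples never reach the training run.
# Real pipelines use far more sophisticated filtering and scoring than this.

def curate(samples: list[str], min_words: int = 20, max_words: int = 2000) -> list[str]:
    seen = set()
    kept = []
    for text in samples:
        normalized = " ".join(text.lower().split())
        words = normalized.split()
        if normalized in seen:
            continue  # drop exact duplicates
        if not (min_words <= len(words) <= max_words):
            continue  # drop fragments and walls of text
        if len(set(words)) / len(words) < 0.3:
            continue  # drop highly repetitive junk
        seen.add(normalized)
        kept.append(text)
    return kept
```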
Even if there were a problem with training new AIs, that would just mean they won’t get better until the problem is overcome. It doesn’t mean they’ll perform “increasingly poorly,” because the old models still exist; you can just use those.
But lots of people really don’t like AI and want to hear headlines saying it’s going to get worse or even go away, so this bait will get plenty of clicks and upvotes. Though I’ll give credit to the body of the article: if you read more than halfway down, you’ll see it raises these sorts of issues itself.
Interesting. As poorly as I think of X as an organization, I do hope they follow through with their open system prompt commitment. That’s something that other major AI companies should be doing too.
Could also be malicious compliance on the part of whatever engineer set this up, prompting Grok in such a way that it makes it obvious what’s going on under the hood.
Elon Musk decided years ago that they absolutely would not use lidar, back when lidar was expensive enough that a decision like that made economic sense to at least try to make work. Nowadays lidar is a lot cheaper, but for whatever reason Musk has drawn a line in the sand and refuses to back down on it.
Unlike many people online these days, I don’t believe that Musk is some kind of sheer-luck, bought-his-way-into-success grifter; he has been genuinely involved in many of the decisions that made his companies grow. But this is one of the downsides of that (the Cybertruck is another). He’s forced through ideas that turned out to be amazing, but he’s also forced through ideas that sucked. He seems to be having increasing trouble telling the two apart.