Study: AI “incest” may cause model collapse for tools like ChatGPT, Microsoft Copilot

What you need to know

  • AI tools like ChatGPT and Microsoft Copilot are driving a ton of hype throughout the tech world. 
  • Generative AI models rely on training data, typically stolen from human content creators across the internet. 
  • However, as an industrial-scale firehose of AI-generated content floods the internet, researchers are worried about how AI models may be affected by their own regurgitated data. 
  • Now, a comprehensive study published in Nature suggests that AI “inbreeding” fears may indeed be well founded. 

As someone who has watched the tech industry evolve over the past few decades, I’ve seen my fair share of hype cycles and buzzwords that promised to revolutionize the world. But none have been as intriguing and potentially dangerous as the current wave of AI tools and generative models.


What do AI models, European royal families, and George R. R. Martin have in common? No, it’s not a troubling obsession with incest.

AI models and the tools built on them, from Google, Microsoft, Meta, and others, are the tech industry’s trendsetters of the moment. Large language models (LLMs) like ChatGPT, Microsoft Copilot, and Google Gemini stand poised to disrupt how we interact with computers, although for now much of that disruption remains theoretical.

Right now, AI tools are hugely demanding on servers and expensive to run, with industry leader OpenAI reportedly facing potential bankruptcy without further investment. Even giants like Google and Microsoft are struggling to monetize the technology, since consumers have yet to see enough value in today’s tools to pay for them. Some argue that AI models may already be as good as they are ever going to get, and could only deteriorate from here.

“Model collapse” is a largely theoretical concept which predicts that, as more and more of the web’s content becomes AI-generated and high-quality human-made data grows increasingly scarce, AI models will begin essentially “inbreeding” on AI-generated training data. There have already been instances of this in corners of the internet where localized data is scarce, such as content written in less widely spoken languages. Now, a new paper published in Nature gives us a more comprehensive look at the phenomenon.

“Our research reveals that excessively employing generated content from models for training purposes can lead to irreversible flaws in the resulting models, with the extremes of the original content distribution becoming non-existent. This phenomenon is referred to as ‘model collapse.’ We demonstrate that this issue is not unique to Large Language Models but also affects variational autoencoders (VAEs) and Gaussian mixture models (GMMs).”

In plain language, you can think of “model collapse” as doing to large language models (LLMs) what repeated JPEG compression does to images. As memes and JPEGs are shared and re-saved around the internet, they accumulate errors and distortions that get passed along with every copy. The researchers warn that indiscriminately training LLMs on openly scraped web data could lead to a similar degradation in model quality.
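To make that feedback loop concrete, here is a deliberately simplified, hypothetical sketch (my own illustration, not code or data from the Nature paper): imagine a “model” that does nothing but resample its own training set, so each generation learns only from the previous generation’s output and no fresh human-made data ever enters the loop.

```python
# Toy illustration of the self-training loop described above (a hypothetical
# sketch, not the Nature paper's method). The "model" simply resamples its own
# training data, so rare values, once dropped, can never come back.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(0.0, 1.0, size=10_000)  # generation 0: "human" data

for generation in range(1, 6):
    # The next "generation" trains only on what the previous one produced.
    data = rng.choice(data, size=data.size, replace=True)
    print(
        f"generation {generation}: "
        f"distinct values = {np.unique(data).size}, "
        f"largest outlier = {np.abs(data).max():.2f}"
    )
# Both numbers can only go down, never up: anything the loop drops is gone for
# good, so the distribution narrows toward its most common values.
```

The paper’s actual experiments with LLMs, VAEs, and GMMs are far more sophisticated, but the intuition is the same: without a steady supply of fresh human-made data, the rare “tail” content quietly disappears first.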

The paper goes on to argue that this phenomenon, which the authors examine theoretically and observe across a range of generative models, must be taken seriously if we want to keep the benefits of training on vast quantities of web data. Genuine human-generated data, it notes, will only become more valuable as LLM-generated content increasingly pollutes what gets scraped from the Internet.

Tech companies don’t care about ‘healthy’ AI 


It’s been quite a spectacle watching tech giants like Google and Microsoft rush, amid an excess of hype and anticipation, to profit from this supposedly generational shift in computing. Unlike past trends such as blockchain and the metaverse, which felt more like fleeting fads, LLMs and generative AI are undeniably more substantial. Regrettably, these companies have been tripping over each other in their eagerness to ship new products, sometimes with less than desirable results. Google’s hasty rollout of AI-generated search answers famously advised users to eat rocks, among other absurd responses. Microsoft’s “Recall” feature for its Copilot+ PC launch, meanwhile, was a complete misstep, demonstrating a striking lack of sensitivity and foresight about how AI should interact with consumers.

Microsoft and Google have intensified their commitments to reducing their environmental impact, even as surging AI usage sends data center energy and water costs soaring. Unfortunately, Microsoft also dismissed its ethical AI team, a decision that arguably undermines those long-term corporate responsibility efforts.

As a tech enthusiast, I can’t help but be concerned by how some of these companies behave in the name of artificial intelligence (AI). To me, it looks like they’re prioritizing financial gain above all else, shrugging off the risks and responsibilities that come with AI development. “Model collapse” probably doesn’t even register on their radar; it’s someone else’s problem to solve in some future fiscal year.

Microsoft and Google are also working hard on features that could undermine revenue for content creators large and small by lifting their work directly into search results. That could make content creation economically unsustainable for anyone but the biggest corporations, degrading the quality of information on the web and potentially worsening the “filter bubble” phenomenon. Then again, perhaps that’s part of the plan.

Somehow, I doubt Microsoft and Google will pay much heed to findings like these, much less compensate creators for blatantly using our work without permission. Either way, I’m bracing myself for a rather dismal future for the internet.


2024-07-26 10:39