This Research Changed AI Forever

This week's AI story

It seems like the last few years have been an absolute explosion in AI news. Hell, just 5 years ago, there might not have been a huge need for a newsletter like this one. But it feels as if every day, we get more and more news. Every week we get more and more developments. 

This week, Sam Altman (CEO and co-founder of OpenAI) stated that GPT-4 is “the dumbest version of AI we will ever use again.” Considering how quickly and how thoroughly that model has taken over, it's almost hard to believe. It's almost as hard to believe that one man’s research helped spark this boom. 

Jared Kaplan, a distinguished theoretical physicist at Johns Hopkins University, made a groundbreaking contribution to the field of AI in 2020. His research on scaling laws single-handedly ended one of the hottest debates in the field. He showed that there are no diminishing returns when it comes to training data: the more data you train an AI model on, the better it performs, and it improves in a remarkably predictable way. This revelation revolutionized the way we approach AI training. 

So, the only thing separating us from the best version of AI is data. 

Upon hearing of Kaplan's research, every major tech company developing AI sprang into action. They went on a mission to gather as much information as possible, pulling data from Wikipedia, Reddit, and other publicly available sources. These companies were determined to feed their AI models the best training data available. If scaling was the answer, they would scale up as fast as possible.

While this strategy has worked well and has helped AI grow at this rapid pace, it has one major drawback: there is only so much usable data on the Internet that doesn't run into copyright and intellectual property issues. Research from Epoch AI estimates that these chatbots could run out of high-quality training data as early as 2026. 

And this is where things get both confusing and fascinating. The big three companies in the space—Google, Meta (the parent company of Facebook), and OpenAI—started bending rules and regulations to get their hands on that precious data, transcribing podcast episodes and YouTube videos and adding copyrighted material to their training datasets. 

Ultimately, the major players seem willing to accept lawsuits (like the one Sarah Silverman has filed against OpenAI) if it gets them more data. It all traces back to Jared Kaplan’s research showing that if you want better AI, you just need more data. 

Will we run out of data? Will there ever be a path forward for AI that doesn’t stampede over copyright law? Who knows. What we do know is that a single research paper caused such a frenzy that data became one of the most precious resources in the world, and we might be running out.

