Nvidia’s latest model, Llama-3.1-Nemotron-70B-Instruct, has quickly drawn attention across the AI community.
On October 15, Nvidia quietly introduced the new model, claiming it surpasses current top-tier AI systems such as GPT-4o and Claude-3 in performance.
According to a post shared by the Nvidia AI Developer team on X, Llama-3.1-Nemotron-70B-Instruct currently ranks among the top models on LMArena’s Chatbot Arena leaderboard.
Nemotron
Essentially, Llama-3.1-Nemotron-70B-Instruct is a fine-tuned adaptation of the open-source Llama-3.1-70B-Instruct; the “Nemotron” in its name marks Nvidia’s contribution to the final product.
Meta’s Llama family of models serves as a freely available starting point for developers, who can build on and extend the base models.
With Nemotron, Nvidia set out to surpass well-known models such as OpenAI’s ChatGPT and Anthropic’s Claude-3 in helpfulness. It transformed Meta’s base model into one of the “most helpful” models available by applying curated datasets, refined fine-tuning techniques, and its cutting-edge AI hardware.
One early tester wrote: “I asked it a few coding questions I usually ask to compare LLMs and got some of the best answers from this one. lol, holy shit.”
Benchmarking
There is no clear-cut methodology for deciding which AI model is “the best.” Unlike measuring the ambient temperature with a mercury thermometer, AI model performance has no single ground truth to read off.
Instead, models are assessed in a way comparable to human evaluation, through comparative testing known as benchmarking. Multiple AI models are presented with identical queries, tasks, or problems, and their responses are compared. Because what counts as a useful answer is subjective, human evaluators are typically employed to judge each model’s output blind, without knowing which model produced it.
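Leaderboards like Chatbot Arena typically turn these blind pairwise votes into a ranking using an Elo-style rating system. As an illustrative sketch only (not LMArena’s actual implementation, which uses a related Bradley-Terry model), a minimal Elo update after one head-to-head vote looks like this:

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, a_won, k=32):
    """Update both ratings after one blind pairwise vote.

    a_won: 1.0 if A's answer was preferred, 0.0 if B's was, 0.5 for a tie.
    k: learning-rate constant (32 is a common default, chosen here for illustration).
    """
    e_a = expected_score(r_a, r_b)
    r_a_new = r_a + k * (a_won - e_a)
    r_b_new = r_b + k * ((1.0 - a_won) - (1.0 - e_a))
    return r_a_new, r_b_new

# Example: two models start at 1000; A's answer is preferred by the evaluator.
ra, rb = elo_update(1000.0, 1000.0, a_won=1.0)
print(round(ra), round(rb))  # 1016 984
```

Aggregated over many thousands of votes, these per-comparison updates converge to the relative rankings shown on the leaderboard.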
Nvidia is suggesting that its new model significantly outperforms current leaders such as GPT-4o and Claude-3.
Nvidia’s Llama-3.1-Nemotron-70B-Instruct is not yet explicitly displayed in the “Difficult” rankings on the Chatbot Arena leaderboard. However, if the developers’ claim that it scored 85 on this test is accurate, it would become the leading model in this category by default.
What makes the result more striking is that Llama-3.1-70B is a mid-sized open-source model from Meta. A far larger variant exists, Llama-3.1-405B, which was trained with roughly 405 billion parameters.
By comparison, GPT-4o is estimated to have been developed with over one trillion parameters.
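To give a sense of scale, parameter count translates roughly into memory footprint. A back-of-the-envelope sketch (assuming 2 bytes per weight, as in FP16/BF16; real deployments often quantize further):

```python
def weight_memory_gb(num_params, bytes_per_param=2):
    """Approximate memory needed just to hold the model weights, in GB."""
    return num_params * bytes_per_param / 1e9

# Illustrative arithmetic for the models discussed above.
for name, params in [("Llama-3.1-70B", 70e9), ("Llama-3.1-405B", 405e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB")  # ~140 GB and ~810 GB
```

The gap explains why a 70B model topping charts dominated by much larger systems is notable: it is far cheaper to serve.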
2024-10-17 20:21