Anthropic’s upgraded Claude AI model outperforms OpenAI-o1 in coding and can use a Windows 11 PC like humans — potentially backing NVIDIA CEO’s claim about software development being dead in the water

What you need to know

  • Anthropic recently shipped an upgraded version of Claude 3.5 Sonnet alongside a new model dubbed Claude 3.5 Haiku, with enhanced coding capabilities and more.
  • The AI firm also unveiled computer use, a new capability that allows users to prompt Claude to use computers as people do.
  • The company admits that shipping the capability to the public poses great risks, but it plans to use the release to observe how people are using the tool. It has measures in place to prevent misuse, such as restricted access to the web during training.

As an observer with over two decades of experience in the tech industry, I find myself both intrigued and cautious about Anthropic’s latest release, Claude 3.5 Sonnet and Haiku. Having witnessed the rapid evolution of AI since its infancy, it’s fascinating to see a model like Claude venture into uncharted territory – computer use in public beta.


It appears that the field of generative AI is moving into a new stage, going beyond just creating images and text. Recently, Anthropic introduced an updated version of Claude 3.5 Sonnet, along with a new model called Claude 3.5 Haiku. According to the company, the release brings advanced coding skills, matching the performance of Anthropic’s Claude 3 Opus LLM.

Intriguingly, the new capability, called “computer use,” is now available in public beta. Through the API, developers can instruct Claude 3.5 Sonnet to operate computers as humans do: by viewing a screen, moving a cursor, clicking buttons, and typing text. That makes Claude 3.5 Sonnet the first AI model to offer computer use in public beta.
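To make the idea concrete, here is a minimal sketch of the observe-and-act cycle described above: take an action, then look at the screen again. It is a toy simulation only. `Action`, `FakeDesktop`, and `run_actions` are hypothetical names invented for this illustration and are not part of Anthropic's API; the "screen" is a single text field at a fixed position.

```python
from dataclasses import dataclass


@dataclass
class Action:
    """One step an agent might request (hypothetical schema)."""
    kind: str          # "move", "click", or "type"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class FakeDesktop:
    """Toy stand-in for a real screen: one text field at a fixed position."""
    cursor: tuple = (0, 0)
    focused: bool = False
    field_text: str = ""
    field_pos: tuple = (100, 50)

    def screenshot(self) -> dict:
        # A real system would return pixels; here we return the visible state.
        return {
            "cursor": self.cursor,
            "focused": self.focused,
            "field_text": self.field_text,
        }

    def apply(self, action: Action) -> None:
        if action.kind == "move":
            self.cursor = (action.x, action.y)
        elif action.kind == "click":
            # Clicking focuses the field only if the cursor is on it.
            self.focused = self.cursor == self.field_pos
        elif action.kind == "type" and self.focused:
            # Keystrokes are lost unless the field has focus.
            self.field_text += action.text


def run_actions(desktop: FakeDesktop, actions: list) -> list:
    """Apply each action, capturing a fresh observation after every step."""
    observations = []
    for action in actions:
        desktop.apply(action)
        observations.append(desktop.screenshot())
    return observations


if __name__ == "__main__":
    desktop = FakeDesktop()
    steps = [Action("move", 100, 50), Action("click"), Action("type", text="hello")]
    for obs in run_actions(desktop, steps):
        print(obs)
```

In a real deployment the model, not a fixed list, would choose each next action after inspecting the latest screenshot. That is also why missed or stale screenshots translate directly into mistakes.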

Anthropic acknowledges that users might run into various hurdles when engaging with the model, such as mistakes and a less-than-smooth interaction. Their goal is to leverage user feedback to refine and optimize the model, making it more effective and efficient.

Companies such as Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun using the capability to streamline complex processes that typically involve numerous steps. For example, Replit is using Claude 3.5 Sonnet’s computer use and UI navigation abilities to build a key feature that evaluates apps as they are being built for its Replit Agent product.

The upgraded Claude 3.5 Sonnet is now available on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. Anthropic plans to ship Claude 3.5 Haiku later this month.

According to the benchmarks provided, Anthropic’s upgraded Claude 3.5 Sonnet shows a substantial improvement in performance, particularly in coding tasks. For example, its score on SWE-bench Verified has increased from 33.4% to 49.0%, suggesting it outperforms publicly available models, including OpenAI’s o1 reasoning model (code-named Strawberry), by a significant margin, all while keeping the same cost and speed as its earlier version.
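For readers who want to size that jump, the short calculation below separates the gain in percentage points from the relative improvement over the earlier score. The function name `improvement` is just an illustrative helper; the two input figures are the ones quoted above.

```python
def improvement(before: float, after: float) -> tuple:
    """Return (percentage-point gain, relative gain in percent), rounded to 0.1."""
    points = after - before
    relative = points / before * 100
    return round(points, 1), round(relative, 1)


if __name__ == "__main__":
    points, relative = improvement(33.4, 49.0)
    print(f"Gain: {points} percentage points ({relative}% relative to the old score)")
```

With the quoted scores, this works out to a gain of 15.6 percentage points, or roughly 46.7% relative to the earlier result.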

The model corrects its own errors, retrying when it recognizes a problem so that it stays on track toward the intended outcome. It’s worth noting that OpenAI’s o1 and o1-mini models also excel in coding tasks and have passed the coding portion of OpenAI’s research engineer hiring interview at a rate of 90-100%.

AI agents are here but proceed with caution

Although the improvements are noteworthy, the upgraded Claude 3.5 Sonnet completed fewer than half of the tasks in an evaluation of airline booking tasks, such as modifying a flight reservation. In a separate test involving tasks such as initiating a return, the model failed roughly a third of the time.

Anthropic’s model struggles with zooming and scrolling, and it can miss pop-up alerts because of the way it handles screenshots, the firm noted, adding that Claude’s computer use remains slow and prone to errors.

The company acknowledges that launching this model could involve substantial risks, yet emphasizes that understanding its application in real-world scenarios brings more advantages than potential hazards.

According to Anthropic:

A better approach is to give computer access to today’s more limited, relatively safer models. That way, any problems that surface at this lower level can be observed and studied, and computer use and safety measures can be built up gradually and at the same time.

To keep Claude 3.5 Sonnet from being misused by malicious actors or causing harm unintentionally, Anthropic has taken several precautions. The model isn’t trained on users’ screenshots or prompts, and it was kept from accessing the web during training. Additionally, Anthropic has built classifiers that steer the model away from risky actions such as creating accounts and posting on social media platforms.

2024-10-23 12:39