The Turing Point: 29th Edition
For the best possible experience, we recommend viewing this edition online.
📰 Featured in This Edition:
Events
Neurons & Notions - AI SOC Fortnightly Discussion Session
AI News Recap
Research Spotlight
🗓 Upcoming: In AI Society
Neurons & Notions

Image Credit: UGAResearch
Join AI Society for our fortnightly discussion sessions! Each session starts with 45 minutes of key AI news and trends from our newsletter, followed by 45 minutes exploring recent research papers and their potential impact. Stay informed, engage in discussions, and deepen your understanding of this rapidly evolving field.
We will also be streaming the discussion on YouTube, so feel free to join us live (physically or virtually) or catch up on the session later!
📅 Date: Wednesday Week 4, Term 1 (12/03/2025)
🕒 Time: 1:00 - 2:30 pm
📍 Location: UNSW Business School 119
📺 YouTube Channel (Subscribe!): UNSW AISoc - Neurons & Notions #1
AI News Recap
Grok 3 - The New Top Dog?
Grok 3 is the latest LLM from Elon Musk’s xAI, and it has surprised the world with its incredible performance, feature-rich package and apparent lack of censorship. The new model boasts significant advancements in decision-making, reasoning and human-like thinking and interaction. To achieve this, xAI assembled one of the largest computing clusters in the world, training the model on 200,000 NVIDIA H100 GPUs, an order-of-magnitude increase in training compute over GPT-4. The model was also partly trained on the endless stream of data from the X (Twitter) platform, which improves its performance but raises questions about privacy and data quality.

Image Credit: Business Insider
So how does Grok 3 compare to other LLMs? On reasoning and STEM benchmarks such as the American Invitational Mathematics Examination (AIME), LiveCodeBench and the science benchmark GPQA, Grok 3 outperformed all of its competitors by a significant margin. It also placed first on Chatbot Arena’s LLM leaderboard, which is impressive given that those rankings come from blind user comparisons and therefore capture intangible qualities that traditional benchmarks miss.
Grok 3 also comes equipped with the following features:
Reasoning: As mentioned earlier, Grok 3 joins the new wave of reasoning models and boasts impressive performance across various reasoning tasks. There are two reasoning modes available:
Think, which is the standard mode and will display Grok’s reasoning as it performs a task
Big Brain, which is a more powerful mode for complex tasks that require more time and computational power.
DeepSearch: An agentic feature that lets Grok search the web for sources relevant to the task at hand. It then reasons over the information it gathers and provides an interface that traces its chain of thought (a generic sketch of this kind of loop follows the list below).
Voice Mode: Grok 3’s conversation mode, which lets it produce audio outputs. It is powerful but far less censored than the voice modes of LLMs like ChatGPT: the model can make distressing noises, adopt stronger tones conveying sorrow, anger or annoyance, and even swear.
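xAI has not said how DeepSearch is implemented, but agentic search features generally follow a search-summarise-reason loop. Here is a generic, minimal sketch of that pattern; search_web, fetch and the llm callable are hypothetical stand-ins, not xAI APIs:

```python
# search_web, fetch and llm are hypothetical stand-ins for whatever tooling
# the real system uses; nothing here is an xAI API.
def deep_search(question, llm, search_web, fetch, max_sources=5):
    """Generic agentic search loop: gather sources, reason over them,
    then answer with the accumulated notes as context."""
    notes = []
    for result in search_web(question)[:max_sources]:
        page = fetch(result["url"])              # raw page text
        notes.append(llm(f"Summarise what this page says about "
                         f"{question!r}:\n{page}"))
    reasoning = llm("Think step by step over these notes:\n" + "\n".join(notes))
    return llm(f"Question: {question}\nReasoning: {reasoning}\nAnswer concisely:")
```

The reasoning trace users see in the interface corresponds roughly to the intermediate summaries and step-by-step pass in this loop.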
As mentioned earlier, Grok 3 is far less censored than models like GPT-4o, and will respond on ethical dilemmas, moral conflicts, political figures and other controversial topics. It is not completely uncensored, however: it will still avoid generating responses involving violence, crime or explicit details.
Grok 3 initially debuted exclusively for Premium+ users on the X (Twitter) platform, but is now available to all X users for free, including the DeepSearch and Think features. The Big Brain feature, however, is still reserved for Premium+ subscribers. Additionally, xAI offers a SuperGrok subscription that gives users access to the latest Grok updates and advancements on the Grok website and the Grok app.
Published by Abhishek Moramganti, February 2025
Hunyuan Turbo S - Never-Before-Seen Speed
Moments after we were introduced to the acclaimed DeepSeek-V3, China's AI landscape has delivered yet another significant development. The Chinese company Tencent has introduced its latest AI model, Hunyuan Turbo S, which boasts response times faster than both ChatGPT and DeepSeek, often delivering answers in under a second.

Image Credit: CommonWealth Magazine (天下雜誌)
Even though the model’s most notable feature is its speed, it also possesses other attributes matching or surpassing those of mainstream models:
Performance: In benchmark tests across various domains including knowledge, reasoning, math, and code, Hunyuan Turbo S has allegedly demonstrated capabilities on par with DeepSeek-V3, OpenAI's GPT-4o, Claude 3.5 Sonnet, and Llama 3.1.
Architecture: Hunyuan Turbo S introduces a Hybrid-Mamba-Transformer fusion architecture, which Tencent describes as the first successful integration of the Mamba and Transformer deep-learning architectures in a large-scale model. The design reduces the computational complexity and Key-Value (KV) cache usage associated with pure Transformer stacks (a toy sketch of the idea follows this list).
Efficiency: Tencent claims that the cost of deploying Hunyuan Turbo S is significantly lower than that of mainstream large-scale models, greatly lowering the barrier to adopting advanced AI technologies.
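Tencent has not published implementation details, so the following is only a toy PyTorch sketch of the general hybrid idea: interleave cheap recurrent state-space layers (a simple stand-in for real Mamba blocks, which use learned, input-dependent dynamics) with occasional attention layers, so most of the stack runs in linear time and only a few layers carry a KV cache. Every class, layer count and parameter here is illustrative rather than Tencent's design.

```python
import torch
import torch.nn as nn

class ToySSMBlock(nn.Module):
    """Stand-in for a Mamba-style state-space block: a gated linear
    recurrence with O(sequence length) cost and no KV cache."""
    def __init__(self, d_model):
        super().__init__()
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        self.decay = nn.Parameter(torch.full((d_model,), 0.9))
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        u, gate = self.in_proj(x).chunk(2, dim=-1)
        state = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):             # recurrent scan, constant memory
            state = self.decay * state + u[:, t]
            outs.append(state)
        h = torch.stack(outs, dim=1)
        return self.out_proj(h * torch.sigmoid(gate))

class AttentionBlock(nn.Module):
    """Standard self-attention block: quadratic cost and a KV cache at
    inference time, but strong at global token mixing."""
    def __init__(self, d_model, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        h, _ = self.attn(x, x, x, need_weights=False)
        return self.norm(x + h)

class HybridStack(nn.Module):
    """Interleave SSM and attention layers: mostly SSM, with periodic
    attention, to cut compute and KV-cache size."""
    def __init__(self, d_model=64, n_layers=6, attn_every=3):
        super().__init__()
        self.layers = nn.ModuleList(
            AttentionBlock(d_model) if (i + 1) % attn_every == 0
            else ToySSMBlock(d_model)
            for i in range(n_layers)
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 16, 64)                     # (batch, seq, d_model)
print(HybridStack()(x).shape)                  # torch.Size([2, 16, 64])
```

With attn_every=3, only a third of the layers pay the quadratic attention cost, which is the rough intuition behind the efficiency claims above.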
Tencent’s Hunyuan Turbo S represents a pivotal evolution in AI technology. Its blend of speed, advanced architectural design, and cost efficiency demonstrates that high-quality AI can be achieved with innovative engineering, further intensifying the global AI race.
Published by Victor Velloso, February 2025
Claude 3.7 - A friend or foe of the modern SWE?
Anthropic has unveiled Claude 3.7 Sonnet, a cutting-edge AI model that pushes the boundaries of artificial intelligence in software development. With enhanced reasoning, coding capabilities, and a new tool called Claude Code, this release is poised to reshape the way engineers interact with AI.

Claude 3.7 Sonnet introduces hybrid reasoning, allowing users to toggle between quick responses and detailed step-by-step analyses. This flexibility enhances problem-solving, catering to both rapid prototyping and complex debugging tasks. The model’s improved coding capabilities also make it a powerful tool for developers, particularly for generating and understanding code across multiple languages. A new extended thinking mode lets the model self-reflect before answering, producing more nuanced and thoughtful responses than Anthropic’s previous Claude models, with visible improvements in math, physics and coding-related tasks. Claude 3.7 Sonnet particularly excels at coding, achieving an industry-leading 70.3% accuracy on the SWE-bench Verified benchmark, a significant improvement over Claude 3.5 Sonnet that opens the door to complex coding workflows and AI agent integration.
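For developers, the hybrid reasoning toggle is exposed through Anthropic's Messages API as an optional thinking budget. A minimal sketch, assuming the anthropic Python SDK and the dated model id current at the time of writing (check Anthropic's docs for exact names and token limits):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",   # dated id; may change over time
    max_tokens=4096,
    # Extended thinking: give the model an explicit reasoning budget.
    # Omit this parameter entirely for the fast, non-thinking mode.
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user",
               "content": "Why does 0.1 + 0.2 != 0.3 in floating point?"}],
)

# The reply interleaves "thinking" blocks (the visible reasoning)
# with ordinary "text" blocks (the final answer).
for block in response.content:
    if block.type == "thinking":
        print("[reasoning]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```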
The software engineering landscape is shifting as AI-driven tools become more prevalent. Studies show that over 60% of developers already use AI-powered code assistants like Claude, GitHub Copilot, or Google Gemini in their workflows. Moreover, a recent survey by Stack Overflow found that 44% of professional developers believe AI will significantly change their daily responsibilities within the next five years.
Despite fears of automation, industry leaders argue that AI is more likely to augment than replace engineers. Senior developers benefit the most from AI’s capabilities, using it to automate repetitive tasks so they can focus on higher-level problem-solving. Indeed, Claude 3.7 Sonnet exemplifies AI’s potential to revolutionise software development. As AI becomes an integral part of engineering workflows, the industry will likely shift toward a collaborative model, where AI handles routine coding while human engineers focus on innovation, architecture, and ethical considerations.
While automation will undoubtedly change the job market, software engineering is far from obsolete. Instead, developers must adapt to a future where coding knowledge is essential, but the ability to leverage AI effectively will be the true differentiator.
Published by Aditya Shrivastava, February 2025
OpenAI Unveils GPT-4.5
OpenAI has unveiled GPT-4.5, its latest language model, which promises significant leaps over its predecessors with the core aim of being more natural and human-like. Early impressions, however, have been underwhelming.

Apart from the expected improvements in its knowledge base, OpenAI claims that the defining features of 4.5 are that the model feels more natural, shows higher emotional intelligence and hallucinates less. Although some of these improvements are immediately noticeable, qualitative testing such as this video by AI Explained suggests that 4.5 still slightly underperforms Claude 3.7 on EQ and creativity. Needless to say, such claims are vague and difficult to capture in popular benchmarks, which has contributed to the early mixed reception.
Speaking of benchmarks, while 4.5 is an upgrade over 4o, the improvements are marginal rather than generational, unlike the leap from GPT-3.5 to GPT-4. Moreover, OpenAI has only released benchmarks comparing the model to its own 4o series, excluding comparisons with models like Claude 3.7 or Grok 3, making it difficult to gauge the model’s true performance. Nevertheless, even relatively small improvements in the base model can be incredibly useful, as they enable better reasoning models (such as o1 and o3) and other post-training innovations that have greatly boosted performance in this new age of LLMs. Currently, GPT-4.5 is only available to ChatGPT Pro tier users (arriving for Plus users next week) or through the OpenAI API, where it costs a whopping $150 USD per million output tokens, making it the most expensive model currently on the market.
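To put that price in perspective, a quick back-of-the-envelope comparison helps. The $150-per-million output rate is from above; the GPT-4.5 input rate ($75 per million) and the GPT-4o rates ($2.50 in, $10 out per million) are launch-time list prices and may have changed since:

```python
def api_cost_usd(tokens_in, tokens_out, rate_in, rate_out):
    """Cost in USD, with rates given per million tokens."""
    return (tokens_in * rate_in + tokens_out * rate_out) / 1_000_000

# 1,000 requests, each ~500 tokens in and ~500 tokens out.
n_in = n_out = 1_000 * 500
print(f"GPT-4.5: ${api_cost_usd(n_in, n_out, 75.0, 150.0):.2f}")   # $112.50
print(f"GPT-4o:  ${api_cost_usd(n_in, n_out, 2.50, 10.0):.2f}")    # $6.25
# Roughly an 18x price gap for the same traffic.
```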
Published by Abhishek Moramganti, February 2025
Wan 2.1 - Best AI Video Model?
Wan 2.1 is the new free, open-source and feature-rich AI video generation model from Alibaba that has swiftly taken its place among top proprietary contenders like OpenAI’s Sora.

While it is common for open-source models to lag behind monetised proprietary models, Wan 2.1 follows DeepSeek into this new age of free, open-source models that match or even outperform the best, with Wan 2.1 beating Sora on metrics like scene generation quality, single-object accuracy and spatial positioning. Moreover, Wan 2.1 handles spatial and temporal consistency well and can deliver smooth 1080p video at 30 fps, contributing to its impressive 84.7% VBench score. At its core, Wan pairs a diffusion transformer with a 3D causal variational autoencoder, trained on a mammoth 1.5 billion videos and 10 billion images. Wan 2.1 itself isn’t a single model and comes in four variants (a hedged loading sketch follows this list):
T2V-1.3B: The lightweight text-to-video model, requiring just 8.19GB of VRAM, making it practical on a wide range of consumer GPUs
T2V-14B: The heavyweight text-to-video model with enhanced quality
I2V-14B-720P: Image-to-video transformation at 720p resolution
I2V-14B-480P: Image-to-video transformation at 480p resolution
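If you want to try the lightweight variant yourself, below is a minimal sketch using the Hugging Face Diffusers integration. The WanPipeline and AutoencoderKLWan classes and the model id are taken from recent Diffusers releases; argument names and defaults may change, so treat this as a starting point rather than canonical usage:

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Model id on the Hugging Face Hub; requires a recent diffusers release.
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"

# The 3D causal VAE is kept in float32 for stability; the diffusion
# transformer runs in bfloat16 to fit consumer-GPU VRAM.
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae",
                                       torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae,
                                   torch_dtype=torch.bfloat16).to("cuda")

frames = pipe(
    prompt="A cat walking through tall grass, golden hour, realistic style",
    height=480, width=832,        # 480p-class output from the 1.3B model
    num_frames=81,                # ~5 seconds at 16 fps
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "wan_output.mp4", fps=16)
```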
The full feature list of Wan 2.1 includes:
Multilingual Text Support (English and Chinese)
Video editing to enhance existing videos
Text-to-Video
Image-to-Video
Text-to-Image
And even Video-to-Audio
Overall, Wan 2.1 looks to be a powerful contender in the GenAI space, boosted by its flexibility and open-source availability, which improve AI accessibility for artists and developers alike. The core Wan 2.1 model is free to download from Hugging Face, and also free to use on Alibaba’s Model Studio, a cloud-based generative AI platform.
Published by Abhishek Moramganti, March 2025
Sesame Voice Assistant

Image Credit: medial.app
The question Alan Turing originally posed, "Can machines think?", is more usefully reframed as "Can machines be distinguished from humans in conversation?". This shift in perspective gave rise to the Turing Test, which sought to explore the potential for machines to exhibit human-like intelligence. During the early days of computing, this was a groundbreaking concept. Today, however, we find ourselves approaching a new frontier, one that surpasses Turing's vision. Enter Sesame, the latest leap forward in speech-based generative AI, which promises to blur the boundaries of what we know as sentience.
Sesame distinguishes itself from other AI systems by focusing not only on conversational ability but also on contextual understanding and emotional nuance. While earlier systems could replicate human-like responses from pre-programmed rules or statistical models, Sesame goes a step further, leveraging deep learning that enables it to grasp the underlying emotions, intentions, and other subtleties in human speech. This allows it to generate responses that feel more genuine, empathetic, and contextually aware.
As this technology continues to evolve, it raises important questions about the nature of sentience and consciousness. If machines like Sesame can engage in conversations that evoke empathy, understanding, and emotional connection, can they truly be considered sentient? Or is the line between human and machine proposed by Turing becoming so blurred that the distinction itself may become irrelevant?
Published by Victor Velloso, March 2025
📑 Research Spotlight💡
Reinforcement Learning … For Among Us?
February 2025
Minecraft Played By Multimodal AI Agents?
February 2025
Can AI Trained On Small Sample Rival The Big Players?
February 2025
Closing Notes
We welcome any feedback / suggestions for future editions here or email us at [email protected]
Stay curious,

🥫Sauces 🥫
Here, you can find all sources used in constructing this edition of The Turing Point:
Grok 3:
https://www.oneusefulthing.org/p/a-new-generation-of-ais-claude-37
https://medium.com/ai-unscripted/grok-3-best-model-on-the-planet-c6d008f24848
Hunyuan Turbo S:
https://pandaily.com/tencent-hunyuans-new-generation-turbo-s-fast-thinking-model-released/
OpenAI GPT-4.5:
Wan 2.1:
https://bgr.com/tech/alibabas-new-wan-2-1-text-to-video-ai-is-unbelievable/
Sesame:
https://www.theverge.com/news/621022/sesame-voice-assistant-ai-glasses-oculus-brendan-iribe