The Turing Point - 34th Edition

🗓 Upcoming: In AI Society

UNSW AI Con

AISoc's first-ever flagship tech conference! Built at the intersection of students, industry, and research, we're bringing together future engineers and current experts to actually talk to each other (and not just on LinkedIn).

Open to all: UNSW students, other uni AI societies, and anyone who wants in.

📅 Date: July 25th (Friday Week 8)

🕒 Time: 11am - 3pm

📍 Location: Leighton Hall, John Niland Scientia building

🎬 AI News Recap

From Tweets to Theorems: Musk’s Most Serious AI Yet

As the global AI arms race accelerates, xAI's Grok 4 is stepping onto the battlefield with an unusually confident stride. Touted as xAI’s most advanced foundation model to date, Grok 4 isn’t just a flashy release — it’s a technically serious upgrade built for reasoning, research, and real-time responsiveness.

Whether you're a professor dissecting transformer architectures or a student curious about the future of multimodal intelligence, Grok 4 is worth your attention (pun intended).

One of Grok 4’s headline features is its strong performance on high-difficulty academic benchmarks, including:

  • AIME (American Invitational Mathematics Examination)

  • GPQA (Graduate Physics Question Answering)

  • MATH and MMLU

  • HumanEval (for code synthesis)

xAI claims Grok 4 outperforms GPT‑4 and Claude 3.5 on reasoning-intensive tasks. In multi-agent "Heavy" mode, the model reportedly reaches ~50% accuracy on Humanity’s Last Exam — a benchmark designed to challenge PhD-level knowledge application.

What’s notable is Grok’s compositional reasoning: it isn’t just regurgitating memorised patterns, but demonstrating emergent behaviour in structured problem-solving — a strong signal of architectural maturity and efficient instruction tuning.

A standout innovation is the introduction of Grok 4 Heavy, a premium-tier configuration in which multiple agents collaborate asynchronously on a single task. This echoes trends seen in sparse expert models and tool-augmented agents, but with one key twist: rather than distributing parameters across experts, Grok 4 Heavy spawns separate, specialised Grok instances that communicate during the reasoning process, effectively creating a "think tank" of AIs. Potential applications include distributed scientific paper summarisation, logic-chain debugging across maths proof sequences, and parallel generation with consensus synthesis.
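To make the "think tank" idea concrete, here is a minimal sketch of the parallel-generation-plus-consensus pattern described above. xAI has not published Grok 4 Heavy's internals, so every name, role, and answer below is invented for illustration; the `agent` function stands in for a call to a specialised model instance.

```python
# Hypothetical sketch of a multi-agent "think tank": several specialised
# agents answer the same question in parallel, then a synthesis step
# picks the majority answer. Purely illustrative, not xAI's design.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def agent(role: str, question: str) -> str:
    # Stand-in for a call to a specialised model instance.
    # Each "agent" returns a canned answer here for demonstration.
    answers = {"algebraist": "42", "geometer": "42", "sceptic": "41"}
    return answers[role]

def heavy_mode(question: str, roles: list[str]) -> str:
    # Run the agents concurrently, then synthesise by majority vote.
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda r: agent(r, question), roles))
    return Counter(candidates).most_common(1)[0][0]

print(heavy_mode("What is 6 * 7?", ["algebraist", "geometer", "sceptic"]))  # 42
```

In a real system the consensus step would itself be a model call that reads every agent's full reasoning, but majority voting is the simplest form of the same idea.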

Though still being refined, Grok 4 now supports voice input/output with expressive responses and image understanding for visual reasoning, with video-based multimodal alignment planned. In effect, xAI is converging toward the multimodal assistant paradigm of OpenAI's GPT‑4o and Google's Gemini 1.5. Where Grok differs is in its personality-layer interactivity: characters like Rudi (a red panda) and Ani (an anime-style chatbot) suggest a roadmap not just for utility, but for emotive, persistent AI identities.

After controversy surrounding Grok 3's unsafe outputs, Grok 4 introduces tighter system prompting, improved content filtering, and a stronger emphasis on guardrails for hallucination, bias, and hate speech. However, xAI has walked a fine line: Grok still retains its "edgy" flavour, offering sarcasm and snark alongside the new safeguards.

Grok 4 is a major step up from Grok 3, shifting from gimmicky personality to serious capability. It improves reasoning, adds real-time web search, supports multimodal input, and introduces a collaborative multi-agent mode. These upgrades, along with stronger safety measures, have earned it far more credibility from academics and the AI community than its predecessor.

While Musk’s media presence continues to take the spotlight, Grok 4 shows that behind the noise, xAI is building something technically serious — and increasingly hard to ignore.

As generative AI evolves, Grok may never be the “most polite” model in the room, but it’s increasingly clear it’s one of the most ambitious. For researchers, engineers, and students watching this space, Grok 4 is an invitation to take xAI seriously — and maybe even build on top of it.

Published by Aditya Shrivastava , July 2025

Windsurf Finds New Home In Cognition

It has been a turbulent few weeks for Windsurf, with leadership seemingly 'abandoning' the company after a potential acquisition by OpenAI fell through. For context, Windsurf is an AI-native coding assistant that runs inside an IDE and is a popular competitor to Cursor, which recently reached a $10 billion valuation. Windsurf's AI tech has caught the eye of many investors and global AI leaders: in April this year, OpenAI was in talks to acquire the company for $3 billion, with the potential acquisition reportedly falling through. Varun Mohan and Douglas Chen, Windsurf's co-founders, have since left the company for Google DeepMind in a package that adds up to $2.4 billion. The move has proven very controversial, as their sudden departure not only left the company directionless, but also left Windsurf and its passionate workers out of any financial compensation that follows such major deals. Interim CEO Jeff Wang described the internal situation: "The mood was very bleak, some people were upset about financial outcomes or colleagues leaving, while others were worried about the future. A few were in tears."

The company did end up finding stability and new leadership: Cognition, the startup behind the controversial yet popular autonomous AI coding agent Devin, announced that it will acquire Windsurf, though it did not disclose the exact terms of the deal. Cognition CEO Scott Wu stated that "Every new employee of Cognition will be treated the same way as existing employees: with transparency, fairness, and deep respect for their abilities and value," adding, "After today, our efforts will be as a united and aligned team. There's only one boat and we're all in it together." While the two companies will now share IP and team members, Cognition and Windsurf currently aim to continue developing their respective products.

Published by Abhishek Moramganti, July 2025

Cloudflare Pay-Per-Crawl

Even if you're not familiar with Cloudflare by name, there's a good chance you've encountered it while browsing the internet. In particular, you might recognise their interface when a website prompts you to complete a security check, such as verifying that you're not a robot.

The company mainly provides internet services that improve website speed and reliability, but it has also become increasingly prominent in developing defensive tools against internet-related abuses, such as cyberattacks and identity theft.

Recently, Cloudflare has expanded its operations to address one of the most pressing challenges of today's internet: the unauthorised appropriation of intellectual property. The explosive growth of artificial intelligence has exposed the internet to unprecedented risks of misuse, amplifying concerns over unremunerated content scraping, which AI models depend on during training, and the role that companies play in exploiting this data.

In response to these issues, Cloudflare has developed the Pay-Per-Crawl service, which enables websites to charge AI bots a fee each time they crawl and extract content. This system acts as both a deterrent against unauthorised scraping and a potential revenue stream for publishers whose content fuels AI training. When an AI crawler attempts to access protected content, Cloudflare can require it to authenticate and pay for that access or otherwise block the request entirely.


Cloudflare operates as a protective layer between websites and visitors, including both human users and automated bots. When a visitor, or an AI crawler, tries to access a website, Cloudflare intercepts the request first. It analyses the incoming traffic to determine whether it is legitimate or potentially harmful.

After Cloudflare identifies AI bots attempting to scrape content, it can enforce rules that require these bots to authenticate themselves and pay a fee before proceeding. This process involves verifying the identity of the crawler and charging it based on the number of requests made to the website.

If the bot does not comply or fails authentication, Cloudflare can block the request altogether, preventing unauthorised scraping. This ensures that content owners retain control over how their data is accessed and used, while also providing a mechanism to monetise AI-driven content consumption.
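The decision flow above (verify, then charge or block) can be sketched as a toy request handler. This is an invented illustration of the idea, not Cloudflare's implementation: the crawler names, the registry, and the fee are all hypothetical, though HTTP does define status 402 ("Payment Required") for exactly this kind of case.

```python
# Toy illustration of a pay-per-crawl gate: verified, paying crawlers get
# the content; verified crawlers without payment get HTTP 402; unverified
# bots are blocked with 403. All names and the price are invented.
PRICE_PER_CRAWL = 0.01  # hypothetical fee in USD per request

REGISTERED_CRAWLERS = {
    "examplebot": {"verified": True, "has_payment_method": True},
    "freeloaderbot": {"verified": True, "has_payment_method": False},
}

def handle_crawl(user_agent: str) -> tuple[int, str]:
    crawler = REGISTERED_CRAWLERS.get(user_agent)
    if crawler is None or not crawler["verified"]:
        return 403, "Forbidden: unverified crawler"
    if not crawler["has_payment_method"]:
        return 402, f"Payment Required: ${PRICE_PER_CRAWL} per request"
    # Payment method on file: serve the content and record the charge.
    return 200, "<html>protected content</html>"

print(handle_crawl("examplebot")[0])    # 200
print(handle_crawl("freeloaderbot")[0]) # 402
print(handle_crawl("mysterybot")[0])    # 403
```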

Cloudflare’s services, including the Pay-Per-Crawl feature, are available through flexible subscription plans, ranging from free tiers with basic protections to advanced paid plans that include enhanced security, performance optimisations, and new tools for managing AI crawlers. This means that both small content creators and large publishers can leverage Cloudflare’s technology to protect their content and monetise AI-driven data scraping.

Published by Victor Velloso, July 2025

Mirage - New Frontier For Video AI

Decart has achieved a huge progression for video-to-video AI tools: Mirage, a transformative technology that lets you build upon existing videos in real time. It allows a user to alter visual content on the go, such as editing a Zoom meeting while on-call or transforming a gaming environment into a different animation style. Although other companies have video tools such as Google's Veo 3 and OpenAI's Sora, Mirage stands out for maintaining high quality and stability throughout the transformation, along with minimal latency.

Many developers and designers have found this tool to be of great assistance in elevating their applications, especially in the world of gaming. Mirage could potentially be used to dynamically add effects and personalise experiences through different visual styles: users choose how they want to alter the video on their screens, and effects are applied immediately and seamlessly. A small demo application called "Oasis", powered by this AI model, was released in late 2024. It caught the attention of many by letting users play in an environment resembling Minecraft, although it was not the actual game. Decart is working on introducing full HD and 4K support in the future, as Mirage currently only supports frames at 768x432 resolution.

Video-to-video AI tools struggle with livestreams because of the autoregressive models they use: each frame is generated using information from previous frames, so any errors compound on top of one another and significantly degrade video quality over longer streams.
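The compounding can be illustrated with a toy calculation. The per-frame error value below is invented (real drift depends on the model), but the mechanism is the same: inherited error grows linearly or worse with every generated frame.

```python
# Toy illustration of autoregressive drift: each frame inherits the
# previous frame's error and adds its own, so degradation accumulates.
def frames_until_drift(error_per_frame: float, threshold: float) -> int:
    drift, frame = 0.0, 0
    while drift < threshold:
        drift += error_per_frame  # each new frame adds to inherited error
        frame += 1
    return frame

# With an invented per-frame error of 1/64, a visible-degradation
# threshold of 1.0 is crossed after just 64 frames -- under 3 seconds
# of video at 24 fps.
print(frames_until_drift(1 / 64, 1.0))  # 64
```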

Mirage addresses these issues with two innovations: 

  • Diffusion: the model is trained to clean up noisy frames without needing full context, allowing accurate frame-by-frame generation.

  • History augmentation: the model is trained to recognise and fix errors in its own past output, so it learns to do the same during generation and quality is maintained.
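A conceptual sketch of the history-augmentation idea: during training, the conditioning history is deliberately corrupted with noise so the model sees the kind of drift it will produce at inference time and learns to correct it rather than inherit it. This is an illustrative stand-in (the function names and noise model are invented), not Decart's actual training code.

```python
# Illustrative history augmentation: build (corrupted history, true next
# frame) training pairs so the model learns to fix its own past errors.
import random

def corrupt_history(frames: list[float], noise: float) -> list[float]:
    # Inject the kind of error the model will see at inference time.
    return [f + random.uniform(-noise, noise) for f in frames]

def training_pair(clean_frames: list[float], noise: float = 0.1):
    # Input: corrupted history; target: the uncorrupted next frame.
    history, target = clean_frames[:-1], clean_frames[-1]
    return corrupt_history(history, noise), target

history, target = training_pair([0.0, 0.5, 1.0])
```

Real frames are tensors rather than single floats, but the training objective is the same shape: predict the clean next frame from an imperfect history.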

Further, speed is optimised as the system is tuned for NVIDIA's Hopper chips using custom low-level GPU code: the steps required to generate each frame are reduced and the models are trimmed to run more efficiently, achieving an overall delay of at most 100 ms.

Published by Arundhathi Madhu, July 2025

Meta’s New Data Center Plans

According to a recent study from the global data centre consultancy BCS, most data centre facilities aren't ready for AI-heavy workloads: in a survey of over 3,000 data centre professionals across 41 countries, 85% of respondents admitted that their facilities are not prepared for those demands.

Meta is expected to have nearly 600 million active users by the end of 2025, continuing to gain traction at a global scale. In response to Meta's growing AI extravagance, Zuckerberg's solution is to build a data centre, and not just any ordinary data centre: one the size of Manhattan, solely for the purpose of AI development. Hundreds of billions of dollars will be invested in AI products, including a multi-gigawatt facility called Prometheus expected to come online in 2026. To staff this huge development, Zuckerberg has been aggressively recruiting top talent and researchers from AI labs such as OpenAI into his Superintelligence Labs, which reports say will be led by former Scale AI CEO Alexandr Wang and ex-GitHub chief Nat Friedman.

Although these structures offer supreme computational power, storage, and networking capabilities, AI-driven data centres are extremely energy- and water-intensive, carrying a heavy environmental cost. If Zuckerberg starts a chain of mass data centre construction, prompting other organisations to follow a similar path for AI development, studies suggest almost 1.7 trillion gallons of water could be consumed globally by 2027.

Many are now asking: with such heavy costs on the environment, is this evolution in the AI boom really worth the encouragement and support despite its detrimental effects?

Published by Arundhathi Madhu, July 2025

📑 Research Spotlight💡

Can Graph Problems Improve LLM Reasoning?

Improving LLMs' Generalized Reasoning Abilities by Graph Problems

Large Language Models (LLMs) have made remarkable strides in reasoning tasks, yet their performance often falters on novel and complex problems. Domain-specific continued pretraining (CPT) methods, such as those tailored for mathematical reasoning, have shown promise but lack transferability to broader reasoning tasks. In this work, we pioneer the use of Graph Problem Reasoning (GPR) to enhance the general reasoning capabilities of LLMs. GPR tasks, spanning pathfinding, network analysis, numerical computation, and topological reasoning, require sophisticated logical and relational reasoning, making them ideal for teaching diverse reasoning patterns. To achieve this, we introduce GraphPile, the first large-scale corpus specifically designed for CPT using GPR data. Spanning 10.9 billion tokens across 23 graph tasks, the dataset includes chain-of-thought, program-of-thought, trace of execution, and real-world graph data. Using GraphPile, we train GraphMind on popular base models Llama 3 and 3.1, as well as Gemma 2, achieving up to 4.9 percent higher accuracy in mathematical reasoning and up to 21.2 percent improvement in non-mathematical reasoning tasks such as logical and commonsense reasoning. By being the first to harness GPR for enhancing reasoning patterns and introducing the first dataset of its kind, our work bridges the gap between domain-specific pretraining and universal reasoning capabilities, advancing the adaptability and robustness of LLMs.
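To see what a "trace of execution" training example for a pathfinding task might look like, here is a minimal sketch in the spirit of the GraphPile corpus described above. The paper's actual data format is not reproduced here; the graph, the narration style, and the function are all illustrative.

```python
# Generate a step-by-step reasoning trace for a pathfinding task by
# running breadth-first search and narrating each step -- the kind of
# "trace of execution" data a graph-reasoning corpus could contain.
from collections import deque

def bfs_with_trace(graph: dict, start: str, goal: str) -> str:
    trace = [f"Find a path from {start} to {goal}."]
    queue, parents = deque([start]), {start: None}
    while queue:
        node = queue.popleft()
        trace.append(f"Visit {node}; neighbours: {graph[node]}.")
        if node == goal:
            break
        for nb in graph[node]:
            if nb not in parents:
                parents[nb] = node  # remember how we reached nb
                queue.append(nb)
    # Reconstruct the path by walking parent pointers back from the goal.
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parents[node]
    trace.append("Path found: " + " -> ".join(reversed(path)))
    return "\n".join(trace)

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_with_trace(graph, "A", "D"))
```

Training on many such traces is one way to expose a model to explicit relational reasoning steps rather than just final answers.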

Thinking Isn't an Illusion

Closing Notes

As always, we welcome any and all feedback and suggestions for future topics here, or email us at [email protected].

Stay curious,

🥫Sauces 🥫

Here, you can find all sources used in constructing this edition of Turing Point:

For the best possible viewing experience, we recommend viewing this edition online.