Google shares TPUv7 details for the first time, at Hot Chips 2025.
Super valuable insight that could not otherwise be easily gleaned.
Ironwood is said to offer 2x the perf-per-watt of Google’s previous generation TPU, Trillium.
With up to 9,216 chips in a pod, Ironwood can scale up to a MASSIVE 42.5 Exaflops of performance.
Though with 10MW of power consumption, that performance doesn’t come cheap.
But, like all of Google’s TPUs, this is solely for Google’s use as part of their Google Cloud services, so Ironwood hardware is not available outside of Google.
🧵 Read on 👇
🧵2/n. Ironwood TPU comes with several innovations.
The big one is how large the SuperPods can scale: now up to 9,216 chips, thanks to the use of optical circuit switches (OCS) to share memory throughout the pod. There’s 1.77 PB of directly addressable HBM altogether.
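A quick back-of-the-envelope on those pod figures, simply dividing the quoted totals evenly across chips (derived arithmetic, not Google-published per-chip specs):

```python
# Per-chip figures implied by the quoted SuperPod totals, assuming an even
# split across all chips (derived arithmetic, not official per-chip specs).
chips_per_pod = 9_216
pod_compute_exaflops = 42.5        # quoted pod performance
pod_hbm_pb = 1.77                  # directly addressable HBM in the pod

flops_per_chip = pod_compute_exaflops * 1e18 / chips_per_pod   # ~4.6 PFLOPS
hbm_per_chip_gb = pod_hbm_pb * 1e15 / chips_per_pod / 1e9      # ~192 GB

print(f"~{flops_per_chip / 1e15:.1f} PFLOPS and ~{hbm_per_chip_gb:.0f} GB HBM per chip")
```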
This generation also brings a focus on RAS features in order to have reliable systems.
Power efficiency also gets a boost, of course. Google is claiming a 2x perf-per-watt improvement – though it’s unclear if this is at iso-datatype.
🧵12/n. Google updated the SoC architecture so that it can scale beyond a single die, so they aren’t reticle limited. Consequently, Ironwood is their first multi-chiplet compute design, with two Ironwood compute dies on each chip.
💼 Finally, a solid 57-page report from Stanford University on AI's effect on the job market.
THE SHIFT HAS STARTED.
Entry‑level workers in the most AI‑exposed jobs are seeing clear employment drops, while older peers and less‑exposed roles keep growing.
Though overall employment continues to grow, employment growth for young workers in particular has been stagnant.
The drop shows up mainly as fewer hires and headcount, not lower pay, and it is sharpest where AI usage looks like automation rather than collaboration.
22‑25 year olds in the most exposed jobs show a 13% relative employment decline after controls.
⚙️ The paper tracked millions of workers and boils recent AI labor effects into 6 concrete facts.
The headline: entry‑level contraction in AI‑exposed occupations and muted wage movement.
AI is replacing the codified knowledge that juniors supply more of, rather than the tacit knowledge that seniors accumulate.
🧵 Read on 👇
🧵2/n. 📊 The Data
The study uses administrative payroll records from ADP, which processes pay for over 25M workers, letting the authors observe monthly headcount and base salary with high granularity.
They build a balanced panel of firms present from 2021‑01 to 2025‑07, restrict to ages 18‑70 with recorded titles mapped to Standard Occupational Classification codes, and end up with 3.5M–5M workers per month in the main sample.
🧵3/n. 🧭 How AI exposure is measured
One exposure signal comes from occupational task links to GPT‑4 capabilities, aggregated to occupations, which ranks jobs by how model‑amenable their tasks look.
A second signal comes from the Anthropic Economic Index that tags millions of Claude chats by occupation tasks and classifies usage as automative or augmentative, which lets the authors separate substitute‑like usage from complement‑like usage.
Top universities from the US, UK, EU, China, Canada, Singapore, and Australia collaborated.
This could completely change research-paper writing.
They show that AI can already draft proposals, run experiments, and write papers.
The authors built aiXiv, a new open-access platform where AI and humans can submit, review, and revise research in a closed-loop system.
The system uses multiple AI reviewers, retrieval-augmented feedback, and defenses against prompt injection to ensure that papers actually improve after review.
And the process worked: AI-generated proposals and papers get much better after iterative review, with acceptance rates jumping from near 0% to 45% for proposals and from 10% to 70% for papers.
🧵 Read on 👇
🧵2/n. Across real experiments it hits 77% proposal ranking accuracy, 81% paper ranking accuracy, blocks prompt‑injection with up to 87.9% accuracy, and pushes post‑revision acceptance for papers from 10% to 70%.
81% paper accuracy, 87.9% injection detection, papers 10%→70% after revision.
🧵3/n. This diagram shows aiXiv’s closed-loop system where AI and humans submit work, get automated reviews, revise, and then publish once quality clears the bar.
It means the platform is not a simple preprint dump, it is a workflow that forces measurable improvement each cycle.
Review agents score novelty, soundness, clarity, and feasibility using retrieval so feedback is grounded, and a prompt-injection detector screens malicious instructions before any model reads the file.
If the revised version looks better in pairwise checks, it moves forward, then a panel of LLMs votes, and 3 of 5 accepts trigger publication.
So the figure is saying aiXiv operationalizes end-to-end research, from idea to accepted paper, with guardrails and iteration built in.
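A minimal sketch of that final decision gate, assuming each reviewer agent can be called as a function returning its scores and an accept/reject vote (the names here are placeholders, not the aiXiv API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Review:
    # The scored axes named above; the reviewer agents stand in for
    # aiXiv's retrieval-grounded review models.
    novelty: float
    soundness: float
    clarity: float
    feasibility: float
    accept: bool

def decide_publication(paper_text: str,
                       reviewers: List[Callable[[str], Review]]) -> bool:
    """Final gate as described above: a panel of LLM reviewers votes on the
    revised paper, and 3 of 5 accept votes trigger publication."""
    votes = [review(paper_text) for review in reviewers]
    accepts = sum(1 for v in votes if v.accept)
    return accepts >= 3
```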
This is that original MIT report that said 95% of AI pilots fail and spooked investors across the US stock market.
The report says most companies are stuck because 95% of GenAI pilots produce zero ROI, while a small 5% win by using systems that learn, plug into real workflows, and improve with use.
Teams keep buying or building static tools that demo well but cannot remember context, adapt, or fit daily operations, and this report maps exactly how the few winners do it differently.
🧪 How they ran the study
They combined a review of 300+ public implementations with 52 structured interviews and 153 senior‑leader surveys across January to June 2025, which gives the patterns below real footing.
🧵 Read on 👇
The big split they call the GenAI Divide is simple, 95% of organizations get nothing from GenAI pilots while a tiny 5% extract millions, and the driver is not the model itself but whether the system can learn, remember, and fit the workflow.
The steep drop from pilots to production for task-specific GenAI tools reveals the GenAI Divide.
The first method to achieve 99.9% on AIME 2025 with open-source models! 🤯
DeepConf uses a model’s own token confidence to keep only its strongest reasoning, reaching that score with GPT-OSS-120B while cutting tokens by up to 84.7% compared to standard parallel thinking.
Most systems still lean on self-consistency with majority voting, which lifts accuracy but hits diminishing returns and burns a lot of tokens.
🧠 The key idea
DeepConf is a test-time method that scores the model’s reasoning locally for confidence, filters weak traces, and often improves accuracy with fewer tokens without any extra training or tuning.
🧱 Why majority voting hits a wall
Parallel thinking samples many chains and votes, accuracy grows slowly as samples rise so compute scales linearly and the benefit flattens, which is exactly the pain DeepConf targets.
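For reference, the baseline being criticized is plain self-consistency: sample many full traces and take the most common final answer. A minimal sketch, assuming each trace’s final answer has already been extracted:

```python
from collections import Counter

def self_consistency_vote(final_answers: list[str]) -> str:
    """Plain majority voting over sampled reasoning traces. Every trace is
    generated to completion, so compute grows linearly with the number of
    samples while accuracy gains flatten out."""
    return Counter(final_answers).most_common(1)[0][0]
```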
🔎 The confidence signals
Token confidence is the negative mean log probability of the top k candidates at each step, which gives a direct signal of how sure the model is at that moment.
Group confidence averages token confidence over a sliding window so local dips are visible without noise from the whole trace.
Tail confidence averages the last chunk of tokens because the ending steps decide the final answer and are where good traces often slip.
Bottom 10% group confidence looks at the worst parts of a trace, which is a strong indicator that the overall reasoning is shaky.
Lowest group confidence picks the single weakest window along a trace, which turns out to be a clean gate for dropping that trace early.
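A minimal sketch of these signals computed from per-step top-k log-probabilities (the window and tail sizes are illustrative choices, not values from the paper):

```python
import numpy as np

def token_confidence(topk_logprobs: np.ndarray) -> np.ndarray:
    """Negative mean log-prob of the top-k candidates at each step.
    Input shape (num_steps, k), output shape (num_steps,)."""
    return -topk_logprobs.mean(axis=1)

def group_confidence(token_conf: np.ndarray, window: int = 1024) -> np.ndarray:
    """Sliding-window average of token confidence, so local dips show up."""
    if len(token_conf) <= window:
        return np.array([token_conf.mean()])
    kernel = np.ones(window) / window
    return np.convolve(token_conf, kernel, mode="valid")

def trace_signals(token_conf: np.ndarray, window: int = 1024,
                  tail: int = 1024) -> dict:
    """Trace-level summaries described above."""
    groups = group_confidence(token_conf, window)
    n_bottom = max(1, int(0.1 * len(groups)))
    return {
        "tail_conf": token_conf[-tail:].mean(),              # ending steps
        "bottom10_conf": np.sort(groups)[:n_bottom].mean(),  # worst 10% of windows
        "lowest_group_conf": groups.min(),                   # single weakest window
    }
```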
✅ Bottom line
DeepConf is a plug-in test-time compression recipe that filters or halts weak reasoning in place, so teams get higher accuracy and a big token cut without retraining or new hyperparameters.
🧮 Offline mode, smarter voting
DeepConf ranks traces by a confidence score and does confidence-weighted majority voting after optionally keeping only the top 10% or the top 90% by confidence.
With 512 traces, GPT-OSS-120B reaches 99.9% on AIME 2025 using tail or lowest-group confidence with filtering, compared to 97.0% for plain voting and 91.8% for pass@1.
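A sketch of the offline recipe, assuming each trace has already been reduced to a final answer plus one scalar confidence (e.g. its tail or lowest-group confidence):

```python
import numpy as np
from collections import defaultdict

def offline_deepconf_vote(answers: list[str], confidences: list[float],
                          keep_frac: float = 0.1) -> str:
    """Keep only the most confident fraction of traces (0.1 mirrors the
    'top 10%' setting, 0.9 the 'top 90%'), then run confidence-weighted
    majority voting over the survivors."""
    order = np.argsort(confidences)[::-1]                # most confident first
    kept = order[:max(1, int(len(answers) * keep_frac))]
    weights: dict[str, float] = defaultdict(float)
    for i in kept:
        weights[answers[i]] += confidences[i]            # vote weight = confidence
    return max(weights, key=weights.get)
```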
⚡ Online mode, early stop while generating
A short warmup of 16 traces sets a stopping threshold s from the confidence distribution for the current problem.
During live generation, a trace stops the moment its lowest group confidence falls below s, so weak lines of thought do not waste tokens.
An adaptive sampling loop adds traces until the consensus is high enough, or a set budget like 512 is reached.
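A sketch of the online loop; `generate_trace` is a placeholder for the actual sampler, assumed to return (final_answer, lowest_group_confidence) and to abort mid-generation (returning None) once the running lowest group confidence drops below the given threshold:

```python
import numpy as np
from collections import Counter

def online_deepconf(problem, generate_trace, n_warmup: int = 16,
                    stop_percentile: float = 10.0, budget: int = 512,
                    consensus: float = 0.95) -> str:
    """Online-mode sketch: calibrate a per-problem stopping threshold from a
    short warmup, then keep sampling with early stopping until the vote is
    confident enough or the trace budget runs out."""
    # 1) Warmup: run a few full traces to set the stopping threshold s.
    warmup = [generate_trace(problem, stop_threshold=None) for _ in range(n_warmup)]
    s = np.percentile([conf for _, conf in warmup], stop_percentile)
    answers = [ans for ans, _ in warmup]

    # 2) Adaptive sampling with early stopping on weak traces.
    attempts = n_warmup
    while attempts < budget:
        top_share = Counter(answers).most_common(1)[0][1] / len(answers)
        if top_share >= consensus:          # consensus reached, stop sampling
            break
        ans, _ = generate_trace(problem, stop_threshold=s)
        attempts += 1
        if ans is not None:                 # None means the trace was cut early
            answers.append(ans)
    return Counter(answers).most_common(1)[0][0]
```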
In short, package stable context up front, give exact instructions and examples, restate the current ask, let the model reason, and demand a strict output format.
🧵 Read on 👇
🧵2/n Start with task context. Tell the model who it is, what domain it is in, and what outcome matters. In the demo, the first try misread the images as a skiing incident. Adding “you are assisting a Swedish car-insurance claims adjuster” fixed that because it anchored the model in the right world and goal.
🧵3/n Add tone context. Specify how to behave, for example “be factual, be confident only when evidence is clear, say you are unsure if you cannot tell.” This reduces guessing and aligns the model’s attitude with the task. The presenters explicitly ask the model not to invent details and to avoid a verdict unless it is sure.
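A minimal sketch of that prompt skeleton; the claims-adjuster wording is just the demo example above, and the section order follows the recipe in this thread:

```python
def build_prompt(background: str, instructions: str, examples: str,
                 current_ask: str, output_format: str) -> str:
    """Assemble a prompt in the recommended order: stable task and tone
    context first, then exact instructions and examples, then a restatement
    of the current ask, then a strict output format."""
    task_and_tone = (
        "You are assisting a Swedish car-insurance claims adjuster. "
        "Be factual, be confident only when the evidence is clear, say you "
        "are unsure if you cannot tell, and do not invent details."
    )
    return "\n\n".join([
        task_and_tone,      # who the model is and how it should behave
        background,         # stable context: policy excerpts, claim details
        instructions,       # exact steps to follow
        examples,           # worked examples of good answers
        current_ask,        # restate what is being asked right now
        "Think through the evidence step by step, then answer strictly in "
        "this format:\n" + output_format,
    ])
```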
A small Qwen2.5 model is fine-tuned to think over retrieved documents, so a single lean setup can answer domain questions on resource-constrained local hardware.
Using summarised NHS pages, retrieval hits the right condition among top‑5 in 76% of queries, and the fine‑tuned model predicts the exact condition correctly 56% of the time, close to larger frontier models.
The whole pipeline is built for private deployments, so teams can run it without sending data to external APIs.
🔒 The problem they tackle
Many teams cannot ship prompts or data outside their network, especially in health and government, so cloud LLM endpoints are off the table.
They aim for a single lean model that can read retrieved evidence and reason over it, all running locally, so answers stay grounded and private.
The target setting is messy queries over a closed corpus, where retrieval constrains facts and the reasoning step interprets symptoms and next actions.
🧩 The pipeline in this paper.
The system indexes a corpus, retrieves the most relevant pieces for each query, then generates an answer that reasons over those pieces.
They use a classic retriever plus generator design, with retrieval first then reasoning, which fits decision tasks better than free‑form answering.
The chat flow lets a conversational agent decide when to call retrieval, then passes the retrieved context to the reasoning model to produce the answer.
🧵 Read on 👇
🧲 The retriever at work
Documents are split into overlapping chunks and embedded with a sentence transformer, then stored in a vector database for fast similarity search.
They use sentence-transformers all‑mpnet‑base‑v2, which maps text into a 768‑dimensional space with a max sequence of 384 tokens, and a Chroma store with L2 similarity.
If any chunk from a document makes the top‑k, the pipeline feeds the full original document to the LLM, so the model sees full context around the hit.
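A sketch of that indexing and retrieval step using the named components (all-mpnet-base-v2 embeddings plus a Chroma collection with L2 distance); chunk sizes and the collection name are illustrative:

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
client = chromadb.Client()
collection = client.create_collection("nhs_pages", metadata={"hnsw:space": "l2"})

def chunk(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-based chunks."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, max(len(words), 1), step)]

def index_documents(docs: dict[str, str]) -> None:
    """Embed each chunk and store it with a pointer back to its source doc."""
    for doc_id, text in docs.items():
        pieces = chunk(text)
        collection.add(
            ids=[f"{doc_id}-{i}" for i in range(len(pieces))],
            documents=pieces,
            embeddings=embedder.encode(pieces).tolist(),
            metadatas=[{"doc_id": doc_id}] * len(pieces),
        )

def retrieve_full_docs(query: str, docs: dict[str, str], k: int = 5) -> list[str]:
    """If any chunk of a document makes the top-k, pass the full document on."""
    hits = collection.query(query_embeddings=embedder.encode([query]).tolist(),
                            n_results=k)
    doc_ids = {m["doc_id"] for m in hits["metadatas"][0]}
    return [docs[d] for d in doc_ids]
```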
Below image shows the whole training loop for their lean, retrieval-augmented reasoning setup.
It starts with a private knowledge base of about 1,000 NHS condition pages. GPT-4o generates about 2,000 synthetic patient queries from those pages, so they have realistic questions tied to known answers.
For each query, a retriever pulls the top 5 likely documents. DeepSeek-R1 reads those documents and the query, then produces a final label plus a step-by-step reasoning trace. That bundle becomes one training example.
They then fine-tune Qwen-32B-Instruct on this data and distill it into a smaller t0-1 reasoning model. The result is a compact model that learns to reason over retrieved evidence from the approved corpus, so it can run locally and stay grounded.
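A sketch of one turn of that data-generation loop; `retrieve` and `teacher_reason` are placeholders for the retriever and the DeepSeek-R1 teacher call, not real APIs:

```python
from typing import Callable

def build_training_example(query: str,
                           retrieve: Callable[[str, int], list[str]],
                           teacher_reason: Callable[[str, list[str]], tuple[str, str]]) -> dict:
    """One pass of the loop in the figure: pull the top-5 documents for a
    synthetic patient query, have the teacher model produce a condition
    label plus a reasoning trace grounded in that evidence, and package the
    pair as one supervised fine-tuning example."""
    evidence = retrieve(query, 5)                        # top-5 retrieved documents
    label, reasoning = teacher_reason(query, evidence)   # placeholder teacher call
    return {
        "prompt": "Patient query:\n" + query
                  + "\n\nRetrieved evidence:\n" + "\n---\n".join(evidence),
        "completion": reasoning + "\n\nFinal condition: " + label,
    }
```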