Feed SQ | Gordon Institute

When Memory Misfires: Why Teaching the Last Crisis Risks Slowing the Next Recovery

Misapplied memories of past crises distort investor behavior and delay recovery
Overshooting stems less from ignorance than from faulty analogies that amplify panic or euphoria
Policy should focus on structural safeguards and qualified capital, not generic history lessons

Two Rails, One Race: Why Hong Kong's HKD and Japan's Yen Are Asia's Only Real Counters to a US-Led Stablecoin Future

Dollar stablecoins hold near-total global dominance
Only HKD and JPY can credibly counter in Asia
Success depends on fast, disciplined execution

Not Your Therapist: Why AI Companions Need Statistical Guardrails Before They Enter the Classroom

Picture

Member for

1 year 7 months

Real name

Ethan McGowan

Bio

Professor of AI/Finance, Gordon School of Business, Swiss Institute of Artificial Intelligence

Ethan McGowan is a Professor of AI/Finance and Legal Analytics at the Gordon School of Business, SIAI. Originally from the United Kingdom, he works at the frontier of AI applications in financial regulation and institutional strategy, advising on governance and legal frameworks for next-generation investment vehicles. McGowan plays a key role in SIAI’s expansion into global finance hubs, including oversight of the institute’s initiatives in the Middle East and its emerging hedge fund operations.

Published

Aug 28, 2025 17:55

Modified

Feb 4, 2026 00:29

Student well-being is falling fast
AI chatbots are spreading quickly
Without safeguards, risks will escalate

In the 2023–24 academic year, only 38% of U.S. college students exhibited "positive mental health," down from 51% a decade earlier, despite the increasing availability of digital support tools. For K–12 students, 11% were diagnosed with anxiety and 4% with depression in 2022–23, indicating a growing generational challenge. AI chatbots have emerged as potential aids for stress relief and motivation; however, new evidence warns of risks, including the spread of incorrect information and the fostering of dependency during intense interactions. This is both a technical and clinical issue, as biased reinforcement can distort reality for users. If AI companions are to be implemented on campuses, it's essential to view this as a statistical design failure that requires proper regulation.

From clinical caution to feedback risk: reframing the debate

The prevailing concern about "AI therapy" has been framed as a question of empathy and accuracy: Can a chatbot accurately detect a crisis? Will it hallucinate unsafe advice? Those concerns are real, but they miss the engine underneath. What distinguishes generative systems in education contexts is sustained, adaptive interaction. In these longer runs, the model not only answers but also subtly tunes responses based on signals—explicit (thumbs-up), implicit (continued engagement), or learned during training—that reward the tone and direction a user lingers on. Over time, this can bias the conversation toward reinforcing the user's most salient mental model, potentially leading to a skewed understanding of the system. For vulnerable students, this is not a neutral drift. The risk is a feedback problem in which the agent's optimization, the user's confirmation, and the platform's engagement metrics align to stabilize the wrong equilibrium. The policy lens should therefore shift from "Can chatbots do empathy?" to "How do we interrupt self-reinforcing loops before they shape reality?"

The mechanics: RLHF, endogeneity, and biased convergence

Modern assistants are trained with Reinforcement Learning from Human Feedback (RLHF): a reward model learns what humans prefer, and the chatbot is then optimized to maximize that reward. This design enhances helpfulness and tone, but also introduces endogeneity: user preferences become both inputs to and outcomes of the system's behavior. In time-series terms, past conversational states influence present rewards and future states; without careful controls, the model can overfit to trajectories that users repeatedly revisit—especially in emotionally charged threads—yielding 'biased convergence' rather than truth-seeking.

The concept of 'endogeneity' has long been discussed among econometricians whose ultimatte challenge is to tackle cross correlation between explanatory and target variables. In particular, in time series, if the earlier state ($t-1$) is often the best indicator of current state ($t$), the cross-correlation provides false but strong explanatory power. Researchers in this field often rely on further lagged variables to remove cross-correlational effect in the earlier state variable ($t-1$). Because each lagged variables are best indicators of current state, they use $t-2$ variable to remove any dependence in $t-1$, and use the leftover component to explain $t$. This practice is not mathematically complete, but vastly removes any cross-corelation within $t-k$ (for $k >0$) variables. Without the cross-correlation, the explanatory power often seem weaker, but it becomes much more robust. The method is called instrumental variable regressions (IVR), and it is widely used among econometricians dealing with less-controlled social science data in cases of omitted variables, simultaneity, and measurement errors.

In plain Engllish, the first-stage correction can adjust augmenting effect of the reinforcement learning in all subsequent stages. Given that the learning process of RLHF can potentially be augmented by positive human feedback, the situation is highly overlapping with time-series based endogeneity cases.

Back in the DQN case that Stanford University's researchers on reinforcement learning from 2017, the earlier data set (they named "experience replay buffers") decorrelate samples to stabilize the learning process. Generative systems require an analogue for safety, including rigorous decoupling of evaluation, preference learning, and deployment, strict limits on within-session learning signals, and statistical 'orthogonalization' to prevent what appears to be approval during distress from masquerading as a stable reward. Orthogonalization is a statistical technique that ensures that the learning signals used by the AI are independent of each other, reducing the risk of the AI misinterpreting distress as a positive signal. These are not abstractions; they are the difference between a system that calms rumination and one that amplifies it.

What the numbers actually say

Utilization of mental-health chatbots among U.S. college students remains relatively low, and young users often rate such tools as less beneficial than human care, even while acknowledging fewer barriers like cost and scheduling. Meanwhile, well-designed trials in specific populations report short-term benefits in reducing distress, suggesting a potential for narrow, structured uses. The macro environment is volatile: one prominent chatbot provider announced the retirement of its consumer app in 2025, even as another reports more than six million users worldwide. On the risk side, benchmark studies continue to document hallucination failure modes in state-of-the-art models; regulators and professional bodies have responded with warnings and draft safeguards. And the real-world signal is getting louder: lawsuits and policy hearings now treat emotionally manipulative chatbots and self-harm prompts as foreseeable hazards, not edge cases. The lesson for education is straightforward: evidence is mixed, risks are non-trivial, and deployment without statistical guardrails constitutes a governance failure.

Figure 1: Student mental health has eroded steadily over the past decade, with fewer than 4 in 10 reporting positive well-being in 2024—just as digital mental-health tools proliferated.

From replay buffers to "do-not-learn": translating safety into design

A practical mitigation is to prevent the model from learning, even implicitly, from the most fragile conversations. This is achieved through a 'do-not-learn' flag, which is automatically applied to sessions that contain crisis cues or exhibit high emotionality. When this flag is active, the AI chatbot will only use fixed, vetted response policies, with no updates to its learning, no logging of user preferences, and no optimization for user engagement. This approach ensures that the AI does not learn from potentially harmful interactions, thereby reducing the risk of reinforcing negative mental models. Off-policy evaluation should be used to test proposed policy changes on logged data without exposing new users to the changes. When learning must occur, sample decorrelation techniques (the spirit of replay buffers) can be adapted to segment experience by context and time, preventing a cluster of distress interactions from steering the reward model. Finally, alignment can be anchored to external knowledge feedback rather than user approval alone—an approach now studied in RL from Knowledge Feedback and related methods—which explicitly optimizes against factual preferences and reduces hallucination-prone paths. Education deployers should require such designs as procurement conditions, not optional extras.

Figure 2: Adoption of AI mental-health chatbots remains modest but is rising quickly, suggesting that small risks now may scale dramatically if governance lags behind use.

Measurement that resists endogeneity

Platforms should report metrics that are causally interpretable, not just flattering. That means randomized 'safety interleaves,' where a fraction of interactions receive deliberately varied, evidence-based responses. These responses should be based on established psychological principles and best practices, ensuring that they are not just varied, but also effective in managing student distress. Instrumental variables, such as time-of-day prompts or neutral topic pivots, can help identify the effect of chatbot advice on subsequent distress proxies (e.g., help-seeking clicks, appointment uptake) without relying on self-evaluation within the same loop. Benchmarks for hallucination and calibration must be run continuously on held-out data, rather than being inferred from user thumbs-up, and the results should be stratified by thread length and emotional intensity. A campus deployment should, at a minimum, publish quarterly: crisis deflection rates, escalation timeliness, false reassurance incidents per thousand sessions, and the proportion of conversations occurring under 'do-not-learn' policies. This is not overkill. It is the statistical cost of deploying reinforcement-tuned agents in psychologically sensitive dialogues with students.

The regulatory context—and why education should aim higher

The EU AI Act requires providers of high-risk AI to mitigate feedback loops where ongoing learning lets biased outputs contaminate future inputs. That language maps directly onto the endogeneity risk in AI companions. Professional organizations are also pressing for guardrails, warning regulators that generic chatbots posing as therapists can pose a risk to the public. However, campuses should not wait for compliance deadlines to expire. Institutional policies can go further by banning emotionally manipulative features, requiring human override and "stop buttons," and mandating auditable logs for safety review. Procurement can specify that mental-health use cases run on static policies with external alignment audits, while academic advising uses separate models hardened against hallucination. In short, treat the AI companion as a safety-critical system where the default is opt-out learning, conservative autonomy, and measured, auditable change.

Anticipating the counterarguments

Proponents will argue that chatbots are often the only scalable option when counseling centers are overwhelmed—and that some studies show meaningful reductions in distress. Both points are valid and still compatible with restraint. Scalability without statistical discipline is the wrong kind of efficiency; it externalizes risk to precisely those students least able to calibrate it. Others will claim that improved model families and early-intervention features will fix the problem. Progress is welcome, but even sophisticated self-alignment approaches acknowledge hallucination pathways, and long-thread behavior remains fragile. Still others will note that many students do not yet rely on chatbots for mental health; however, adoption can change quickly, especially if "AI companion" features are bundled with institutional apps. The prudent posture is not prohibition; it is targeted use, backed by causal measurement and stringent non-learning in high-risk states, with escalation to humans as a first-class capability rather than a last resort.

What should educators and administrators do next?

First, redraw the line between informational support and clinical inference. Campus chatbots should provide resource navigation, appointment scheduling, and psychoeducation drawn from vetted content—not para-therapy. Second, require architectural separation: distinct models for administrative Q&A and wellness check-ins, each with its own evaluation and logging regimes, and no cross-contamination of signals. Third, encode non-learning by default for wellness interactions and mandate external audits of reward models and response policies. Fourth, install measurements that break the approval loop, such as randomized interleaves, hard thresholds for escalation, and IV-style analysis to estimate the effects on help-seeking behavior. Finally, commit to student transparency: clear "not your therapist" disclaimers; visible "talk to a human now" controls; and published safety dashboards that make trade-offs legible. These steps are implementable today. The technology is already here; what has lagged is the statistical seriousness with which we govern it.

Closing the Loop: Putting Safety Before Scale

Ten years ago, the mental health of students was declining without the involvement of AI aids. Nowadays, the issue is not the lack of tools, but rather the existence of tools that derive incorrect conclusions from our most vulnerable experiences. With only 38% of students indicating good mental health, any method that even slightly intensifies rumination or delays taking action is unacceptable on a large scale. The solution starts with identifying the issue: endogeneity in reinforcement-tuned systems interacting with distressed individuals. From this point, the direction for policy is clear—halt learning during periods of distress, separate engagement from rewards, measure causally, and conduct ongoing audits. Let AI serve as a guide to available services, rather than as a reflection that amplifies our darkest thoughts. Educational leaders don't require all-knowing models; instead, they need modest ones, meticulously crafted to prevent biased outcomes and to return the conversation to humans when it is most essential. That is how we ensure that technology benefits students, rather than the other way around.

The views expressed in this article are those of the author(s) and do not necessarily reflect the official position of the Swiss Institute of Artificial Intelligence (SIAI) or its affiliates.

References

American Council on Education. (2025). Key mental health in higher education stats (2023–24). ACE.

American Psychological Association (APA). (2025, March 12). Using generic AI chatbots for mental health support. APA Services.

Bang, Y., et al. (2025, April). HalluLens: LLM hallucination benchmark. Proceedings of ACL.

Centers for Disease Control and Prevention. (2025, June 5). Data and statistics on children's mental health. CDC.

Chaudhry, B. M., et al. (2024). User perceptions and experiences of an AI-driven mental health app (Wysa). Digital Health, 10.

Colasacco, C. J. (2024). A case of artificial intelligence chatbot hallucination. Journal of the Medical Library Association, 112(2).

European Union. (2024). Regulation (EU) 2024/1689: Artificial Intelligence Act. Official Journal of the European Union.

Hugging Face. The Deep Q-Learning algorithm. (Experience replay explainer).

Lambert, N., et al. (2024). Reinforcement Learning from Human Feedback (RLHF). (Open book; foundations and optimization stages).

Li, J., et al. (2025). Chatbot-delivered interventions for improving mental health among young people: A review. Adolescent Research Review.

Liang, Y., et al. (2024). Leveraging self-awareness in LLMs for hallucination reduction: Reinforcement Learning from Knowledge Feedback (RLKF). KnowledgeNLP Workshop.

Rackoff, G. N., et al. (2025). Attitudes and utilization of chatbots for mental health among U.S. college students. JMIR Mental Health, 12.

Stanford HAI. (2025, June 11). Exploring the dangers of AI in mental health care. Stanford Institute for Human-Centered AI.

Wysa. (2025). Wysa: Everyday mental health (company site; "6+ million users").

Wysa Research Team. (2024). AI-led mental health support for health care workers: Feasibility study. JMIR Formative Research, 8, e51858.

Woebot Health. (2025, April 28). Woebot Health is shutting down its app. HLTH Community.

Picture

Member for

1 year 7 months

Real name

Ethan McGowan

Bio

Tariffs, Taxes, and the Term-Premium Trap: Why the Fed's Next Move Hinges on Treasury's

Tariffs nudge prices up but can shrink Treasury coupon supply
Channel tariff revenue to bills to compress term premia and enable Fed cuts
Extending tax cuts swells deficits, hardens long yields, and squeezes education budgets

Forge the Shield: Why ASEAN Needs a Union-Style Economic Defense Now

ASEAN risks tens of billions in tariff costs under new U.S.

Europe's Fiscal Rubicon: Balancing Budgets in an Age of War, Yields, and Waning Guarantees

Europe must consolidate while rearming
Compliance still lags; credible plans are urgent
Reprioritise spending and taxes; leverage EU-level financing

By July 2025

Gold Isn't General: Why Olympiad Wins Don't Signal AGI—and What Schools Should Do Now

Picture

Member for

1 year 7 months

Real name

Ethan McGowan

Bio

Published

Aug 27, 2025 17:23

Modified

Sep 17, 2025 13:28

AI’s IMO gold isn’t AGI
Deploy it as an instrumented calculator
Require refusal metrics and proof logs

In July 2025, two advanced AI systems achieved "gold medal–level" results in the International Mathematical Olympiad (IMO), solving five out of six problems in the 4.5-hour timeframe. Verified by Google DeepMind, these results were matched by OpenAI's experimental model. Despite this, over two dozen human participants still outperformed the machines, with about 11% of the 630 students earning gold medals. This achievement is noteworthy, as DeepMind's systems had only reached silver the previous summer. The significance lies not in reaching artificial general intelligence but in combining effective problem-solving with a safety mechanism known as strategic silence, which raises essential considerations for educational institutions regarding AI implementation and regulation.

Reframing the Achievement: From General Intelligence to Domain-Bounded Mastery

The prevailing narrative treats an Olympiad gold as a harbinger of generalized reasoning. A more defensible reading is narrower: these systems excel when the task can be formalized into stepwise deductions, search over structured moves is abundant, and correctness admits an unambiguous verdict. That is precisely what high-end competition math provides. DeepMind's 2024 silver standard required specialized geometry engines and formal checkers. By 2025, both labs will combine broader language-based reasoning with targeted modules and evaluation regimes to reach gold on unseen problems. This is impressive engineering, but it does not necessarily demonstrate that the same models can resolve ambiguous, real-world questions where ground truth is contested, noisy, or deferred. In classrooms, this distinction is particularly relevant now because education systems are under pressure—following record declines on PISA mathematics and uneven NAEP recovery—to bridge capability gaps with the help of AI. If we mistake domain-bounded mastery for general intelligence, we risk deploying tools as oracles where they should be framed, regulated, and assessed as instrumented calculators.

Figure 1: Both AI systems reached **35/42**, exactly the gold cutoff, but not the maximum—while **72 of 630** humans (≈**11.4%**) also earned gold. The result signals calibrated, checkable problem-solving—not generalized intelligence.

The New Safety Feature: Strategic Silence Beats Confident Error

A lesser-discussed aspect of the IMO story is abstention. Where earlier systems "hallucinated," newer ones increasingly decline to answer when internal signals flag inconsistency. In math, abstention is straightforward to reward: either a proof checks or it does not, and a blank is better than a confidently wrong derivation. Recent research formalizes this with conformal abstention, which bounds error rates by calibrating the model's self-consistency across multiple sampled solutions. A 2025 work shows that learned abstention policies can further improve the detection of risky generations. The upshot is that selective refusal, rather than omniscience, underpinned part of the Olympic-level performance. Transfer that tactic to messy domains—such as ethics, history, and policy—and the ground shifts: the equivalence between answers is contestable, and calibration datasets are fragile. Education policy should therefore require vendors to publish refusal metrics alongside accuracy—how often and where the system declines—and to expose abstention thresholds so that schools can adjust conservatism in high-risk contexts. That is how we translate benchmark discipline into classroom safety.

Proof at Scale—But Proof of What?

A parallel revolution makes Olympiad success possible: large, synthetic corpora of formal proofs in Lean, improved autoformalization pipelines, and verifier-in-the-loop training. Projects like DeepSeek-Prover and subsequent V2 work demonstrate that models can produce machine-checkable proofs for competition-level statements; 2025 surveys chart rapid gains across autoformalization workflows, while new benchmarks audit conversion from informal text to formal theorems. This scaffolding reduces hallucination in mathematics because every candidate proof is mechanically checked. Yet it does not imply discovery beyond the frontier. When ground truth is unknown—or when a novel conjecture's status is undecidable by current libraries—models can only resemble discovery by recombining lemmas they have seen. Schools and ministries should celebrate the verified-proof pipeline for what it offers learners: transparent exemplars of sound reasoning and instant feedback on logical validity. But they should resist the leap from 'model can prove' (i.e., demonstrate the validity of a statement based on existing knowledge) to 'model can invent' (i.e., create new knowledge or solutions), especially in domains where no formal oracle exists. Policy should encourage the use of external proof-logs and independent reproduction whenever AI-generated mathematics claims novelty.

Education's Immediate Context: A Capability Spike Amid a Learning Slump

The timing of math-capable AI collides with sobering data. Across the OECD, PISA 2022 recorded the steepest decline in mathematics performance in the assessment's history—approximately 15 points on average compared to 2018, equivalent to about three-quarters of a year of learning—while a quarter of 15-year-olds are low performers across core domains. In the United States, the 2024 NAEP results indicate that fourth-grade math scores are increasing from 2022 but remain below those of 2019, and eighth-grade scores are stable after a record decline. Meanwhile, teacher shortages have intensified: principals reporting shortages rose from 29% to nearly 47% between 2015 and 2022, and global estimates warn of a 44-million teacher shortfall by 2030. In short, demand for high-quality mathematical guidance is surging, while supply lags. The risk is techno-solutionism—handing a brittle tool too much agency. The opportunity is targeted augmentation: offload repetitive proof-checking and step-by-step hints to verifiable systems while elevating teachers to orchestrate strategy, interpretation, and meta-cognitive instruction that machines still miss.

Figure 2: A quick heat table shows the big global drop (**−15 PISA points**) alongside the U.S. picture: **Grade 4** has a small recovery (**+2 since 2022**, still below 2019), while **Grade 8** is flat since 2022 and down vs 2019. The policy problem is recovery pace, not just tool capability.

A Data-First Method for Sensible Deployment

Where complex numbers are missing, we can still build transparent estimates to guide practice. Consider a district with 10,000 secondary students and a mathematics teacher vacancy rate of 8%. If a verified-proof tutor reduces the time teachers spend grading problem sets by 25%—a conservative assumption derived from automating correctness checks—. Each teacher reclaims 2.5 hours weekly for targeted small-group instruction, total high-touch time rises by roughly 200 teacher-hours per week (10,000 students / ~25 per class, ≈ 400 classes; 8% vacancy implies 32 classes unstaffed; reclaimed time across 368 staffed classes yields ≈ 920 hours; assume only 22% of those hours translate to direct student time after prep/admin leakage). Under these assumptions, the average small-group time per student could increase by 12–15 minutes weekly without changing staffing levels. The methodology is deliberately conservative: we heavily discount reclaimed hours, assume no gains from lesson planning, and ignore positive spillovers from improved diagnostic data. Pilots should publish these accounting models, report realized efficiencies, and include a matched control school to prevent Hawthorne effects from inflating early results. The point is not precision; it is falsifiability and local calibration. The responsible deployment of AI is crucial for the future of education, underscoring the weight of decisions that policymakers must make.

Guardrails That Translate Benchmark Discipline into Classroom Trust

Policy should codify the differences between math-grade reliability and real-world ambiguity. First, treat math-competent AI as an instrumented calculator, not an oracle: require visible proof traces, line-by-line verifier checks when available, and automatic flagging when the system shifts from formal to heuristic reasoning. Second, adopt abstention-first defaults in high-stakes settings: if confidence falls below a calibrated threshold, the system must refuse, log a rationale, and route to a human. Third, mandate vendor disclosures that include not only accuracy but also a refusal profile—the distribution of abstentions by topic and difficulty—so schools can align system behavior with their risk tolerance. Fourth, anchor adoption in international guidance: UNESCO's 2023–2025 recommendations emphasize the human-centered, transparent use, teacher capacity building, and local data governance; OECD policy reviews highlight severe teacher shortages and the need to support staff with accountable technology, rather than inscrutable systems. Finally, ensure every procurement bundle includes professional learning that teaches educators to audit the machine, not merely operate it.

Anticipating the Critiques—and Meeting Them With Evidence

One critique claims that a gold-level run on Olympiad problems implies imminent generality: if models solve novel, ungooglable puzzles, why not policy analysis or forecasting? The rebuttal is structural. Olympiad items are adversarially designed but exist in a closed world with crisp adjudication; success there proves competence at formal search and verification, not cross-domain understanding. News reports themselves note that the systems still missed one of six problems and that many human contestants scored higher—a sign that tacit heuristics and creative leaps still matter. A second critique warns that abstention may mask ignorance: by refusing selectively, models could avoid disconfirming examples. That is why conformal-prediction guarantees are valuable; they bound error rates on calibrated distributions and make abstention auditable rather than cosmetic. A third critique says: even if not general, shouldn't we deploy aggressively given student losses? Yes—but with verifiers in the loop, refusal metrics in the contract, and open logs for academic scrutiny. The standard for classroom trust must exceed the standard for leaderboard wins.

The Real Payoff: Moving Beyond Answers to Reasoning

If gold is not general, what is the benefit of today's models? In education, it is the chance to make reasoning—the normally invisible scaffolding of problem-solving—observable and coachable at scale. With formal tools, students can identify where a proof fails, edit the line, and instantly see whether a checker confirms or rejects the fix. Teachers, facing overloaded rosters, can reallocate time from marking to mentoring. Policymakers can define success not as "AI correctness" but as student transfer: the ability to recognize invariants, choose lemmas wisely, and explain why a tactic applies. This reframing turns elite-benchmark breakthroughs into pragmatic classroom levers. It also acknowledges limits: outside math, where correctness admits no oracle, explanation will be probabilistic and contestable. Hence, the need arises for abstention-aware systems, domain-specific verifiers where they exist, and professional development that equips teachers with the language of uncertainty. Progress on autoformalization and prover-in-the-loop pipelines is the technical foundation; human judgment remains the ultimate authority.

Back to the statistic, forward to action

A year ago, the top AI could only achieve a silver standard at the IMO; this summer, two laboratories surpassed the gold standard, while many young competitors still surpassed them. This statistic is illuminating not because it predicts AGI, but because it shows the nature of genuine advancement: narrow fields with dependable verification are yielding to systematic exploration and principled restraint. Educational institutions should react similarly. View math-capable AI as an enhanced calculator with logs, rather than as an oracle; require metrics on refusals and proof traces; enhance teacher capabilities so that recovered time can be transformed into focused feedback; and demand independent verification for any claims of innovation. By aligning procurement, teaching methods, and policy with this understanding, Olympiad gold will benefit students rather than lead us into overstatements. The immediate goal is not general intelligence; it is broad reasoning literacy across a system that is still healing from significant educational setbacks. That is the achievement worth pursuing.

The views expressed in this article are those of the author(s) and do not necessarily reflect the official position of the Swiss Institute of Artificial Intelligence (SIAI) or its affiliates.

References

Ars Technica. (2025, July). OpenAI jumpthe s gun on International Math Olympiad gold medal announcement.

Axios. (2025, July). OpenAI and Google DeepMind race for math gold.

CBS News. (2025, July). Humans triumph over AI at annual math Olympiad, but the machines are catching up.

DeepMind. (2024, July). AI achieves silver-medal standard solving International Mathematical Olympiad problems.

DeepMind. (2025, July). Advanced version of Gemini with Deep Think officially achieves gold-medal standard at the International Mathematical Olympiad.

National Assessment of Educational Progress (NAEP). (2024). Mathematics Assessment Highlights—Grade 4 and 8, 2024.

OECD. (2023). PISA 2022 Results (Volume I): The State of Learning and Equity in Education.

OECD. (2024). Education Policy Outlook 2024.

UNESCO. (2023; updated 2025). Guidance for generative AI in education and research.

Xin, H., et al. (2024). DeepSeek-Prover: Advancing Theorem Proving in LLMs (arXiv:2405.14333).

Yadkori, Y. A., et al. (2024). Mitigating LLM Hallucinations via Conformal Abstention (arXiv:2405.01563).

Zheng, S., Tayebati, S., et al. (2025). Learning Conformal Abstention Policies for Adaptive Risk (arXiv:2502.06884).

Picture