Why ML Conferences Have Lost Legitimacy: A Case Study in Error Propagation

April 29, 2026

by Rylan Schaeffer

A labmate of mine spent three months trying to improve upon min-p sampling — a method for generating text from language models that had been published as an ICLR 2025 Oral, the 18th highest-scoring submission that year. After months of work, he made a troubling discovery: by failing to control for hyperparameter tuning, he could make almost any sampler look like the state-of-the-art. That’s when we started looking more carefully at the paper.

What we found was worse than we expected. Not just a paper with weak evidence, but a chain reaction:

In 2025, an ICLR Oral claimed a new sampling method was superior to all alternatives. Its evidence included fabricated adoption numbers, omitted baseline data, incorrect statistical tests, and cherry-picked scores. It received the field’s highest distinction.

In 2025, a NeurIPS Best Paper used that method as a premise to draw a faulty conclusion about the limits of LLM decoding. The premise was wrong, but the conclusion entered the scientific record as a guide for future research.

In 2026, another ICLR Oral repeated the exact same evaluation mistakes on the same topic and was again rewarded with an Oral presentation.

When we tried to correct the record, we discovered the system fights back. This is a story about how errors enter the ML research ecosystem, how they propagate, and why the lack of any mechanism for correction or consequence has cost academic machine learning research its legitimacy.

Min-P Sampling: An ICLR 2025 Oral Under Scrutiny

Nguyen et al. (2025) proposed a new method for sampling from language models called min-p. The paper claimed that min-p achieved superior quality and diversity, as well as a Pareto-optimal tradeoff between the two, compared with established sampling methods like basic (temperature-only) sampling, top-k, and top-p. The paper went on to become the 18th highest-scoring submission to ICLR 2025 and was selected for an Oral presentation — a distinction reserved for the top ~1% of submissions.
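For readers unfamiliar with these samplers, here is a minimal sketch of how each one truncates the next-token distribution before sampling. The parameter names and the placement of temperature scaling are illustrative — implementations differ — but the core rules are standard: top-k keeps the k most probable tokens, top-p keeps the smallest set whose cumulative mass reaches p, and min-p keeps every token whose probability is at least p_base times the maximum.

```python
import numpy as np

def truncate_and_sample(logits, method="min_p", k=40, p=0.9, p_base=0.1,
                        temperature=1.0, rng=None):
    """Sample one token id after truncating the distribution.

    Illustrative sketch; real implementations differ in details such as
    whether truncation happens before or after temperature scaling.
    """
    rng = rng or np.random.default_rng()
    z = logits / temperature
    probs = np.exp(z - z.max())          # softmax with overflow protection
    probs /= probs.sum()

    if method == "top_k":                # keep the k most probable tokens
        mask = probs >= np.sort(probs)[-k]
    elif method == "top_p":              # smallest set with mass >= p
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        mask = np.zeros_like(probs, dtype=bool)
        mask[order[: np.searchsorted(csum, p) + 1]] = True
    elif method == "min_p":              # tokens within p_base of the max
        mask = probs >= p_base * probs.max()
    else:                                # basic: temperature-only
        mask = np.ones_like(probs, dtype=bool)

    trunc = np.where(mask, probs, 0.0)
    trunc /= trunc.sum()
    return rng.choice(len(probs), p=trunc)
```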

All four lines of evidence the paper presented — community adoption metrics, human evaluations, LLM-as-a-Judge evaluations, and NLP benchmark evaluations — were deeply flawed.

Fabricated adoption numbers

The paper claimed “over 54,000 GitHub repositories using [min-p], amassing a cumulative 1.1 million stars.” The combined stars of all major LM repositories — transformers, ollama, llama.cpp, vLLM, and others — totaled about 453,000 at the time. The 1.1 million number was calculated by searching GitHub for the string “min-p,” which produces massive numbers of false positives.
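Checking the plausibility of the stars claim takes a few API calls, not a research project. Here is a sketch using GitHub's public REST API; the repository list is illustrative, and star counts drift over time:

```python
import requests

# Sanity check: sum the stars of the major LM-serving repositories that
# actually ship min-p, instead of string-searching all of GitHub.
REPOS = [
    "huggingface/transformers",
    "ollama/ollama",
    "ggerganov/llama.cpp",
    "vllm-project/vllm",
]

total = 0
for repo in REPOS:
    r = requests.get(f"https://api.github.com/repos/{repo}", timeout=10)
    r.raise_for_status()
    stars = r.json()["stargazers_count"]  # documented GitHub REST API field
    print(f"{repo}: {stars:,} stars")
    total += stars

print(f"combined: {total:,} stars")  # ~453k at the time of the critique
```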

These fabricated numbers were not incidental to the paper’s success — they were central to its selection as an Oral. Three of four reviewers and the Area Chair specifically cited them as a key reason for their high scores. The Area Chair wrote that min-p “is already widely adopted by the community” and has “extremely high impact,” directly referencing the 54,000 repositories claim.

There are two failures here. The first is that reviewers did not do the work. No one paused to ask whether the claimed numbers were even remotely plausible — 5 minutes on GitHub would have raised obvious red flags. The second is that after the authors retracted the numbers, nothing happened. The scores stood. The Oral stood. No mechanism exists at ICLR to trigger reevaluation when a primary justification for acceptance turns out to be invented.

Misanalyzed human evaluations

Human evaluations are the gold standard for assessing language model outputs. The paper’s human evaluation had several problems.

Data were omitted. The paper stated that human participants evaluated min-p against one baseline: top-p. But scores for a second baseline — basic sampling — had been collected and silently excluded. The omitted scores made up a full third of the data.

The statistical analysis was wrong. The paper claimed min-p “consistently scored higher than top-p sampling across all settings,” supported by a pooled paired t-test. But “consistently across all settings” requires testing each setting individually. When proper one-sided paired t-tests were run for all 12 comparisons (2 metrics × 3 temperatures × 2 baselines), only 5 of 12 favored min-p without correction. After Bonferroni correction: 1 of 12 at α = 0.05. Zero at α = 0.01.
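For concreteness, here is a sketch of the analysis that "consistently across all settings" actually requires: a one-sided paired t-test per setting, with a Bonferroni-corrected threshold. The data layout is assumed, not the study's actual format:

```python
from scipy.stats import ttest_rel

def per_setting_tests(scores, alpha=0.05):
    """One-sided paired t-test per setting, with Bonferroni correction.

    `scores` maps a setting name (metric, temperature, baseline) to a pair
    of arrays: per-participant scores for min-p and for the baseline.
    Layout is illustrative; the actual study's data format may differ.
    """
    m = len(scores)                        # number of comparisons (12 here)
    wins = 0
    for setting, (minp, baseline) in scores.items():
        # H1: min-p > baseline, tested within participants
        t, p = ttest_rel(minp, baseline, alternative="greater")
        significant = p < alpha / m        # Bonferroni-corrected threshold
        wins += significant
        print(f"{setting}: t={t:.2f}, p={p:.4f}, "
              f"significant after correction: {significant}")
    print(f"{wins} of {m} comparisons favor min-p after Bonferroni")
```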

Qualitative feedback was mischaracterized. The paper claimed participants “frequently noted” preferring min-p outputs. Manual annotation of the actual responses showed more evaluators preferred basic sampling over min-p.

In response to these findings, the authors ran a new human evaluation with a different implementation, task, rubric, hyperparameters, and participant pool. It also failed to show min-p outperforming baselines. Wherever min-p appeared to have an edge, it was in high-temperature conditions where all samplers produced lower absolute scores — conditions no practitioner would choose.

Cherry-picked LLM-as-a-Judge scores

The LLM-as-a-Judge evaluations lacked basic methodological details: no mention of which model was sampled, which model judged, or how hyperparameters were selected. Win rates were reported without uncertainty estimates.

Additionally, the comparison was arranged in min-p’s favor. Min-p received ~2× more hyperparameter tuning than top-p and ~10× more than basic sampling. And the reported numbers appear cherry-picked: the higher of two available scores was reported for min-p (win rate 52.01 at p=0.05, not 50.14 at p=0.01), while the lower of two was reported for top-p (50.07 at p=0.9, not 50.43 at p=0.98).
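Attaching uncertainty to a win rate is nearly a one-liner, which makes its absence hard to excuse. A sketch with a hypothetical sample size (the paper reports none):

```python
from scipy.stats import binomtest

def win_rate_ci(wins, n, confidence=0.95):
    """Report a win rate with its exact binomial confidence interval."""
    ci = binomtest(wins, n).proportion_ci(confidence_level=confidence)
    print(f"win rate {wins / n:.2%}, "
          f"{confidence:.0%} CI [{ci.low:.2%}, {ci.high:.2%}]")

# A headline 52% win rate over a hypothetical 1,000 comparisons:
win_rate_ci(520, 1000)   # CI roughly [48.9%, 55.1%] — it includes 50%
```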

NLP benchmarks that don’t survive a proper sweep

The paper claimed “min-p sampling achieves superior performance across benchmarks and temperatures.” A comprehensive sweep on GSM8K CoT — 9 models, 2 stages, 4 samplers, 31 temperatures, 6 hyperparameter values per sampler, 3 seeds, ~6,000 A100-hours — told a different story. When each sampler was given an equal hyperparameter budget, all converged to roughly the same performance. Min-p’s apparent advantage was an artifact of unequal tuning.
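The protocol behind that sweep is simple to state: every sampler gets an identical tuning budget, and only each sampler's best configuration is compared. A minimal sketch, with names and the evaluation function assumed; the actual sweep also crossed models, temperatures, and seeds:

```python
def equal_budget_best(samplers, grids, evaluate, budget=6):
    """Give every sampler the same tuning budget; return its best result.

    `evaluate(sampler, config)` is assumed to return a validation score;
    `grids` maps each sampler to its candidate hyperparameter values.
    """
    results = {}
    for sampler in samplers:
        configs = grids[sampler][:budget]      # identical budget for all
        best = max(((evaluate(sampler, c), c) for c in configs),
                   key=lambda t: t[0])         # best validation score wins
        results[sampler] = best
    return results
```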

We met with the authors twice. Some corrections were made — the omitted data was added, the adoption numbers retracted — but the paper’s central claims and conclusions were not updated. We contacted the ICLR 2025 Program Chairs. Their response: post a comment on OpenReview.

How Errors Propagate: A Faulty Conclusion in NeurIPS 2025’s Best Paper

Jiang et al. (2025)’s “Artificial Hivemind” — NeurIPS 2025 Best Paper in the Datasets & Benchmarks track — studies a real and important phenomenon: language models produce strikingly homogeneous outputs, both within a single model and across different model families. They call this the “Artificial Hivemind” effect.

One question the paper asks is whether better sampling strategies can mitigate this homogeneity. To test this, the authors turn to min-p, which they describe as “a dynamic strategy for enhancing generation diversity.” They run the same experimental setup with min-p decoding and find that 81% of response pairs still exceed 0.7 similarity and 61.2% exceed 0.8. The paper concludes that “more generalizable solutions are needed at the model training level to robustly preserve output diversity,” since even “diversity-oriented decoding” fails to break the pattern.
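Their homogeneity statistic is worth making concrete. A sketch, assuming responses have already been embedded with some text-embedding model (the Hivemind paper's exact similarity metric may differ):

```python
import numpy as np
from itertools import combinations

def fraction_similar(embeddings, thresholds=(0.7, 0.8)):
    """Fraction of response pairs whose cosine similarity exceeds a threshold.

    `embeddings` is an (n_responses, d) array from any text embedding model;
    this is a sketch of the statistic described in the text, not their code.
    """
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = [X[i] @ X[j] for i, j in combinations(range(len(X)), 2)]
    return {t: float(np.mean([s > t for s in sims])) for t in thresholds}
```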

This conclusion rests entirely on the premise that min-p actually increases diversity. It doesn’t. As discussed in the previous section, min-p does not improve diversity over baseline samplers — not in human evaluations, not in benchmark evaluations, and not in LLM-as-a-Judge evaluations. Min-p is not a “diversity-oriented” sampler that failed to solve homogeneity. It is a sampler that was never shown to increase diversity in the first place.

The correct interpretation of the Artificial Hivemind experiment is therefore not “even diversity-enhancing decoding can’t save us” but rather “we tested a sampler that doesn’t enhance diversity, and unsurprisingly, it didn’t enhance diversity.” The question of whether genuinely diversity-enhancing decoding strategies could mitigate the Artificial Hivemind effect remains open. The conclusion sounds important and actionable: don’t bother with decoding, fix training instead. But it’s built on air.

Weak Baselines, Weak Progress: p-less Sampling at ICLR 2026

Tan et al. (2026)’s “p-less sampling” proposes a hyperparameter-free, information-theoretic approach to token sampling and was selected as an ICLR 2026 Oral. The paper claims that p-less “consistently outperforms existing sampling approaches.” One of its primary baselines is min-p.

This is where error propagation becomes stagnation. If min-p doesn’t outperform basic samplers when given equal hyperparameter tuning — and as discussed above, it doesn’t — then “outperforming min-p” is not evidence of progress. You’re beating a baseline that was never shown to be strong.

But set aside the weak baseline. The p-less paper also repeats the same evaluation patterns that made the min-p paper’s claims unreliable.

The headline results (Table 1) compare p-less against baselines using single default hyperparameters. When the authors themselves sweep hyperparameters more broadly (their Table 8), the gaps shrink — on GSM8K with Llama-2-7b, top-p at p=0.4 achieves an AUC of 0.264 versus p-less’s 0.267. This is exactly the lesson from our min-p analysis: samplers look different when one gets more tuning than the others, and look the same when tuning is equalized. The paper’s own data hints at this, but the conclusion doesn’t reflect it. And the broader hyperparameter sweep is only reported for one of three models.

The paper’s primary metric — area under the accuracy-temperature curve from 0.5 to 2.0 — aggregates across a wide temperature range, and much of p-less’s advantage comes from high-temperature regimes where competing methods degrade sharply. At T=1.0, where most practitioners actually operate, results are mixed and sometimes favor baselines. The human evaluation compares p-less at T=2.0 against default sampling at T=1.0, confounding the sampler’s contribution with the temperature difference. No statistical significance tests are reported for any accuracy claims, despite being reported for efficiency claims.
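To see how this metric can mislead, consider a sketch with made-up accuracy curves: a sampler that is flat across temperatures beats one that wins at T=1.0 but collapses at high temperatures, purely because the AUC integrates over the regime where the competitor degrades. The normalization here is assumed; the paper's exact definition may differ.

```python
import numpy as np
from scipy.integrate import trapezoid

def accuracy_auc(temps, accs, lo=0.5, hi=2.0):
    """Area under the accuracy-temperature curve, normalized by the range."""
    mask = (temps >= lo) & (temps <= hi)
    return trapezoid(accs[mask], temps[mask]) / (hi - lo)

# Illustrative (made-up) curves:
temps = np.linspace(0.5, 2.0, 16)
flat = np.full_like(temps, 0.55)                    # robust at high T
sharp = np.clip(0.80 - 0.40 * (temps - 0.5), 0, 1)  # wins at T=1.0, collapses

# flat: AUC 0.55; sharp: AUC 0.50 — flat wins the metric even though
# sharp scores 0.60 vs 0.55 at T=1.0, where practitioners operate.
print(accuracy_auc(temps, flat), accuracy_auc(temps, sharp))
```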

Preliminary results from extending our hyperparameter-controlled benchmark sweeps to p-less tell the same story we found for min-p: when each sampler is given an equal budget, the differences largely vanish.

The deeper issue is not that p-less is a bad idea — a tuning-free sampler with good high-temperature robustness is a genuinely useful concept. The issue is that the field has learned nothing from min-p’s evaluation failures. The same patterns recur: unequal hyperparameter tuning, aggregation over temperature ranges that flatter the proposed method, missing significance tests, and no serious effort to control for the additional degrees of freedom a new method introduces. These are not novel mistakes. They are the same mistakes, made again, on the same topic, at the same venue, one year later.

Why Flawed Papers Get In and Corrections Don’t

There is a structural asymmetry in ML peer review. Papers making bold claims face a low bar: exciting results, novel method, big numbers. Papers challenging those claims face a much higher one. The result is that errors enter the literature easily and are nearly impossible to correct once they’re in.

We know this because we’ve been on the wrong side of it. Our rebuttal of min-p has been rejected multiple times — and most recently, at ICML 2026, was referred for an ethics review on the grounds that documenting methodological problems in a published paper could constitute “insinuations of misconduct.” The ethics reviewer recommended we “remove or neutralize targeted/accusatory framing” — for a paper that identifies omitted data, incorrect statistical tests, and retracted adoption numbers, all of which the original authors have acknowledged by modifying their manuscript.

The reviewer objections are revealing, not because they are unreasonable in isolation, but because of what they collectively imply about what the system will and won’t tolerate.

One reviewer wrote that our work was “unprofessional” and urged us to contact the original authors — despite our having met with them twice, corresponded extensively, and prompted the corrections they made. When we clarified this, the reviewer replied: “If you had earlier said [you shared the manuscript], I’d have stopped mentioning it.” The concern was never really about process. It was discomfort with the act of public scientific critique.

Another objection: “I do not think it is a good idea to ‘go after’ one particular paper or group.” The suggested alternative was to spread the critique across multiple papers — to make it feel less personal. We took this advice for the ICML submission: we added a second case study (p-less sampling), reframed the paper around general evaluation standards, and added a formalized evaluation protocol. It didn’t matter. Reviewers at ICML raised the same objections. One wrote that the standards we propose are “largely well known” — then rated originality 1 out of 4. The fact that two consecutive ICLR Oral papers violate all four “well-known” standards was apparently not evidence that the standards needed restating.

A third objection: venue fit. Multiple reviewers suggested we submit to the ML Reproducibility Challenge or a workshop instead. The logic is circular: ICLR publishes a flawed Oral, but the correction belongs at a lower-visibility venue. The error gets the spotlight; the correction gets a poster in a side room.

Meanwhile, the min-p paper sailed through review with fabricated adoption numbers that no reviewer checked, omitted baseline data that no reviewer noticed, and a pooled statistical test that no reviewer questioned. It received an Oral. The p-less paper, one year later at the same venue, repeated the same evaluation mistakes and also received an Oral.

The pattern is clear. If you claim a new method is superior based on selectively presented evidence, you get an Oral and community adoption. If you show up with 6,000 GPU-hours of experiments demonstrating those claims don’t hold, you get told it’s “not novel enough” and referred for an ethics review. The incentive structure does not reward careful science. It rewards being first and being loud.

Why Academic Research Is Losing Legitimacy

This post traced a single thread through three papers at three top venues. A flawed ICLR 2025 Oral made unsupported claims about a sampling method. A NeurIPS 2025 Best Paper took those claims at face value and drew a faulty conclusion about the limits of decoding strategies. An ICLR 2026 Oral repeated the same evaluation mistakes on the same topic, one year later, and was again rewarded with the field’s highest distinction. Each paper individually made mistakes. Together, they are a system working as designed.

ML conference publications can no longer be taken at face value. When the same basic errors — unequal hyperparameter tuning, missing significance tests, overclaimed results — produce Oral presentations at consecutive editions of the same venue, the signal that “this paper was accepted at [top venue]” stops meaning what it used to. Practitioners and researchers outside ML already treat conference acceptances with skepticism. The cases documented here suggest that skepticism is well-calibrated.

The field has no mechanism for self-correction. When we tried to publish our findings on min-p, we were rejected twice and referred for an ethics review. In a recent NeurIPS 2025 position paper, we and several colleagues argued that ML conferences should establish a dedicated “Refutations and Critiques” track to give corrections the same institutional legitimacy as the claims they challenge. Without such a mechanism, errors enter the literature at the front door and corrections are told to use the service entrance. The system treats the act of correcting the record as more suspect than the act of polluting it.

There is no punishment for bad work. Right now, there is only upside to overclaiming and zero downside. To the best of my knowledge, no awarded paper has ever been retracted from an ML conference for unsupported claims. No award has been rescinded after the evidence collapsed. No author has faced any professional consequence for publishing fabricated adoption numbers. The system does not just fail to punish bad behavior — it fails to even notice. And so the rational move, if you are optimizing for career advancement, is to claim as much as you can get away with, because “getting away with it” is the default outcome. Our investigation into min-p began because a labmate spent three months trying to improve upon it — building on an ICLR Oral, as you’re supposed to — before discovering that any sampler could be made to look best through selective presentation. Those three months are gone. The person who hyped the method paid nothing for them. Until there are actual consequences for polluting the scientific record, the pollution will continue.

None of this is unique to sampling from language models. Min-p is a case study, not an exception. The evaluation patterns that made these papers’ claims unreliable — unequal tuning, pooled statistics, cherry-picked reporting, unverified adoption metrics — are endemic. The question is whether the field’s institutions will adapt to catch them. Or whether “accepted at a top venue” will continue to be a statement about marketing rather than science.