The Burden of Comparison in ECCV Reviews
Researchers face mounting pressure from comparison requests in the ECCV peer review process, from unvetted arXiv preprints to demands for costly re-implementation.
![AI Research: The Burden of Comparison in ECCV Reviews](https://res.cloudinary.com/dobyanswe/image/upload/c_limit,f_auto,q_auto,w_1200/v1778253904/blog/2026/eccv-reviewer-request-for-comparison-2026.jpg)
The confetti has barely settled from the last major AI conference, and already the whispers of the next submission cycle are echoing through research labs. For many, this isn’t just about presenting cutting-edge work; it’s a high-stakes gauntlet of peer review, a process that, while essential, can often feel like a battle fought on constantly shifting ground. At the forefront of this struggle lies a particularly vexing demand: the pervasive requirement for exhaustive comparisons. This post delves into the intricate, and often frustrating, landscape of comparison requests in the European Conference on Computer Vision (ECCV) review process, dissecting its implications for researchers and the very integrity of scientific discourse.
The relentless march of AI research, coupled with the democratizing power of platforms like arXiv, has created a dynamic where freshly minted ideas can outpace the formal peer review cycle by months, if not years. This speed, while exhilarating, presents a thorny problem for conferences like ECCV. Reviewers, armed with the power to accept or reject, are often under immense pressure to ensure a paper’s novelty and superiority. This pressure frequently translates into a demand for comparisons not just against meticulously published work, but against the latest, unvetted submissions on arXiv.
ECCV, in its attempt to temper this frenzy, has issued explicit guidance: authors are not obligated to compare with recent arXiv reports, and failure to cite or surpass arXiv performance is not grounds for rejection. This is a crucial, yet often overlooked, directive. The reality on the ground, however, can be starkly different. Many a researcher has faced reviews that subtly, or not so subtly, penalize their work for not acknowledging or outperforming a paper that appeared on arXiv mere weeks before the submission deadline. This creates a perverse incentive: authors might feel compelled to dedicate precious rebuttal time to retroactively compare against these ephemeral preprints, diverting focus from defending their core contributions.
The underlying issue is not a desire to stifle progress, but a fundamental challenge in evaluating work within such a fluid ecosystem. How does a reviewer, tasked with assessing a paper’s contribution, objectively compare it to a piece of work that has not undergone any formal scrutiny? The very purpose of peer review is to provide that rigorous vetting. By implicitly or explicitly demanding comparisons with arXiv preprints, the review process risks undermining its own authority. It transforms the formal publication venue into a secondary benchmark, while the ephemeral preprint becomes the de facto gold standard. This is akin to judging a published novel against a rough draft of another work-in-progress – the comparison is inherently unfair and doesn’t reflect the effort and validation that goes into a final, published piece.
Furthermore, this emphasis on arXiv performance can disproportionately disadvantage researchers from institutions with less access to cutting-edge computational resources or those who operate outside the immediate, hyper-connected academic hubs where such preprints proliferate. It can create an uneven playing field, where adherence to these unwritten comparison rules becomes a proxy for being “in the loop” rather than for the intrinsic merit of the research.
Beyond the arXiv deluge, another significant hurdle arises when reviewers demand comparisons with published research that lacks readily available code or data. The ECCV guidelines address this directly, stating that requests for comparison with published research requiring re-implementation must be “appropriately justified” if they influence paper decisions. This is a sensible caveat, recognizing that the burden of re-implementing complex algorithms from scratch can be astronomical, often demanding weeks or months of effort that are simply unavailable during the rebuttal period.
However, “appropriately justified” is a subjective term. What one reviewer deems a critical comparison, another might consider an unnecessary detour. When a paper’s core novelty hinges on a subtle improvement over a previous method, and that previous method exists only in a published paper with no accompanying code, the reviewer’s request to replicate and compare becomes a substantial imposition. The author is then faced with an impossible choice: either attempt the Herculean task of re-implementation, likely introducing new bugs and errors in the process, or risk rejection based on an incomplete comparison.
This issue is exacerbated by the fact that many foundational works in AI, particularly those from earlier eras, may not have had the benefit of modern open-source practices. Yet, their influence persists. Demanding rigorous, empirical replication of such work without providing the necessary resources or time is not conducive to fair evaluation. It can lead to reviewers relying on potentially outdated or incomplete understandings of prior art, or worse, making decisions based on the difficulty of comparison rather than the substance of the presented work.
The ECCV’s stance against mandating comparisons on “withdrawn datasets” is a welcome clarification. This addresses a niche but problematic scenario where outdated or flawed datasets might still be cited, leading to potentially misleading comparisons. The emphasis should always be on current, relevant benchmarks and methodologies. However, the core problem of re-implementation remains a significant bottleneck. A more robust approach might involve encouraging reviewers to clearly articulate why a specific re-implementation is critical for assessing the paper’s contribution, and for reviewers to be willing to accept well-reasoned arguments about the feasibility and necessity of such an undertaking within the conference review timeline.
In an era where Large Language Models (LLMs) are rapidly integrating into every facet of our digital lives, ECCV’s explicit prohibition of their use in the review process is a stark and important declaration. The policy forbids reviewers from using LLMs to write reviews, generate review content, or share substantial paper or review content with such services. This directive is rooted in fundamental concerns about policy violations and, critically, confidentiality.
The allure of LLMs for an overwhelmed reviewer is understandable. Imagine an AI that could summarize a paper, draft a preliminary critique, or even generate comparison tables based on cited works. The temptation to delegate parts of the arduous review process to such tools must be immense. However, the risks are manifold. Firstly, LLMs, while powerful, are not infallible. They can hallucinate, misinterpret nuances, and perpetuate biases present in their training data. A review generated by an LLM could inadvertently introduce factual errors or mischaracterize the paper’s contributions, leading to unjust rejections.
More importantly, the confidentiality of submitted manuscripts is paramount. Research papers often contain novel ideas that are not yet public. Sharing substantial portions of these papers, or the reviews themselves, with an external LLM service, even under the guise of “assistance,” could constitute a breach of this confidentiality. This has serious implications for intellectual property and the trust placed in the peer review system. Researchers submit their work with the understanding that it will be handled with discretion by a select group of peers.
The prohibition of LLMs by reviewers is a powerful affirmation of the irreplaceable role of human intellect, critical thinking, and ethical judgment in the scientific process. While AI can undoubtedly streamline aspects of research management (e.g., automated reviewer assignment via semantic search on platforms like OpenReview, or tools like PeerSubmit, Dryfta, Fourwaves, EasyChair, PROCONF, Leconfe, and OpenWater, which offer sophisticated workflow automation), the core act of evaluation – understanding, critiquing, and contextualizing research – must remain a human endeavor. The ECCV’s firm stance acknowledges that the integrity of the review process depends on the nuanced, critical, and confidential engagement of human experts.
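To make the distinction concrete, here is a minimal sketch of what semantic-search-based reviewer assignment amounts to in principle. It is not the OpenReview implementation; real systems typically use learned embeddings of reviewers’ publication histories, and the reviewer profiles, submission abstract, and scoring below are illustrative assumptions only.

```python
# Illustrative sketch of semantic-search reviewer assignment.
# Not any platform's actual implementation; all data below is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical reviewer profiles: concatenated text of past publications.
reviewer_profiles = {
    "reviewer_a": "self-supervised learning contrastive representation images",
    "reviewer_b": "3d reconstruction multi-view geometry neural radiance fields",
    "reviewer_c": "object detection transformers real-time inference",
}

# Hypothetical submission abstract.
submission = "a transformer-based detector with improved real-time performance"

# Embed reviewers and the submission in the same TF-IDF space.
vectorizer = TfidfVectorizer()
reviewer_matrix = vectorizer.fit_transform(reviewer_profiles.values())
submission_vec = vectorizer.transform([submission])

# Rank reviewers by cosine similarity to the submission.
scores = cosine_similarity(submission_vec, reviewer_matrix)[0]
for name, score in sorted(zip(reviewer_profiles, scores), key=lambda x: -x[1]):
    print(f"{name}: affinity {score:.2f}")
```

The point of the sketch is that matching is a scoring-and-ranking task that a machine can reasonably perform (in practice combined with conflict-of-interest and load constraints), whereas the evaluative judgment the policy reserves for humans is of an entirely different kind.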
The ECCV review process, like many in the AI community, is a complex ecosystem grappling with the rapid pace of innovation, the increasing volume of submissions, and the inherent challenges of subjective evaluation. While platforms like OpenReview and robust internal policies aim to provide structure and fairness, the burden of comparison remains a significant pressure point for researchers.
The sentiments echoed on platforms like Hacker News and Reddit – the “Reviewer 2 is a jerk” trope, the critiques of review quality, and the perception of a “zero-sum game” – are not without foundation. The system, while striving for objectivity, is deeply human and thus prone to inconsistencies, biases, and overload. The explicit guidelines regarding arXiv comparisons and re-implementation are vital steps in mitigating some of these issues. However, their effective implementation relies heavily on reviewer adherence and a shared understanding within the community.
Ultimately, the path forward requires a continuous dialogue between conference organizers, reviewers, and authors. Conferences must not only articulate clear policies but also actively foster a culture where these policies are respected and where the focus remains on the intrinsic scientific merit of the work. For researchers, understanding these policies and being prepared to respectfully address comparison requests, while also advocating for fair evaluation, is key. The burden of comparison is a symptom of a broader challenge in evaluating groundbreaking research in a hyper-accelerated scientific landscape. By acknowledging these challenges and actively working to refine the review process, we can move closer to a system that truly celebrates innovation rather than simply demanding its adherence to an ever-shifting benchmark.