Following considerable anticipation, ChatGPT-5 has arrived. Instead of the much-touted artificial general intelligence (AGI), OpenAI has delivered a smarter orchestrator model that internally selects the appropriate sub-model for a given task. For a simple request, it will choose a smaller, faster model; for a complex one, it will employ models that require more computational time. The system likewise decides whether to access the internet. This change addresses a long-standing issue for OpenAI, whose proliferation of models (4o, 4-mini-high, 3.5, 4.1, 4.1-mini, etc.) created confusion for many users. While this appears to be a positive development, it is too early to offer a definitive general assessment. What can already be stated, however, is that the logic underpinning ChatGPT-5 presents significant challenges for academic research.
Large language models and academia
In previous work, I have discussed some of the principal problems that large language models pose to scientific research, problems that directly undermine the foundations of academic integrity, particularly transparency, reliability, and reproducibility (Sampaio et al., 2024). Here, however, I wish to focus on reproducibility, a foundational principle of scientific inquiry. The premise is straightforward: if the same data and methods are applied under similar conditions, identical or at least highly similar results should be attainable. This is where language models create perhaps the most severe problem, because a traditional academic software package has a version that can be traced. Suppose, for instance, that SPSS had changed its method for calculating linear regression from version 32 onwards due to statistical advancements. To reproduce prior results, one would simply need to download and use SPSS version 31 or earlier, resolving the issue. With language models, this traceability is entirely lost.
The launch of ChatGPT-5
The launch of ChatGPT-5 demonstrates that this is far from a priority for its developers. If I were to receive a request to peer-review a study that used GPT-3 for a thematic qualitative analysis, I would be compelled to decline, as the model is simply no longer available on the OpenAI website. Worse still, language models are frequently updated and ostensibly improved without any corresponding change in their version designation. For example, the GPT-4o available just before the launch of GPT-5 was not the same as the one available on its release day. The available Gemini 2.5 Pro has already undergone at least three improvements. Internally, these may be treated as distinct builds, such as Gemini 2.5.1.3, but this is information to which academics are not privy.
Consequently, in the same situation of peer-reviewing or attempting to reproduce a study, I would be unable to achieve my objective. Even if OpenAI were to reverse its course tomorrow and make its older models available again, the underlying problem would remain. For example, the original versions 3.5 and 4 of GPT have not been accessible for a significant time. They might be available through some third-party APIs, but they are not easily and reliably retrievable in the manner of academic software. This issue must also be raised in the context of academic software that has yielded to industry pressure or hype and embedded AI into its analyses, as is the case with the three most widely used qualitative analysis software packages, namely NVivo, Atlas.ti, and MAXQDA.
All three now feature functionalities for automated summarization and classification using generative AI, and all three currently run on GPT models. We do not know precisely which model. With the disappearance of older models, it is highly probable that they will be using GPT-5. This will create a peculiar situation: we might be using the same version of MAXQDA yet obtaining different AI-generated results because the underlying model has been updated. Again, this is a matter over which researchers will have no control. The question then arises: how does this compare to the prior landscape?
The situation was not ideal, as a “replication crisis” has been discussed for several years in various fields, such as psychology, medicine (Wang, Sreedhara & Schneeweiss, 2022), and economics (Brodeur et al., 2025). Furthermore, it can be acknowledged that significant opacity already existed at the core of scientific practice. One example is web scraping: in practice, different scrapes of the same content can yield markedly different results, introducing biased and non-representative data (Foerderer, 2023). We can also point to Google Scholar, which is quite opaque in its indexing methods and whose searches are not reproducible. Another example is qualitative software itself, which has long produced automated content analyses based on natural language processing models without this practice being widely questioned.
Similarly, a degree of opacity has always been present in academic software. In daily use, we do not know the exact computational procedures of SPSS or Stata. Although these procedures are documented, we assume the companies adhere to best practices, and we trust their outputs. Does this imply an academic legitimacy derived from the institution? Are NVivo or Stata more trustworthy than OpenAI because they are designated as academic software? Is legitimacy, then, vested in the companies themselves? So, has nothing changed? Indeed, science has long contended with problems of opacity and reproducibility. The current situation, however, represents a qualitative shift in the nature of the problem. This change manifests across at least three axes, which have been intensified by recent developments.
Change across three axes
The first axis is the transition from static to dynamic opacity. Traditional software, like SPSS, operated within specific, immutable versions. An algorithm in SPSS 31 would execute the same operation today as it would in two years. Confidence was rooted in the stability of the process, and legitimacy traditionally derived from explicit methodologies, certifications, or peer review; these attributes are largely absent from the proprietary and opaque development of LLMs. Now, a model’s label, such as “GPT-4o,” corresponds to a fluid service, not a static product. The model is continuously adjusted without notice, meaning the methodology described in a scientific paper is obsolete upon publication. Irreproducibility ceases to be an occasional flaw and becomes a structural feature inherent to the system’s design.

The second axis is the shift in the architecture of power, involving the externalization of methodological control. Software like NVivo and MAXQDA, by integrating generative AI APIs, transfers responsibility for analytical stability to a third party (OpenAI, Google). Reproducibility becomes dependent on the commercial policy of a technology company.
The GPT-5 model deepens this by acting as an orchestrator. It becomes a decision-making platform that internally chooses which sub-model or tool to use for each task. This extra layer of abstraction introduces a hidden variance: the same query might yield different results not due to error, but because the platform activated distinct computational pathways based on factors invisible to the user (e.g., system load, cost). The researcher transitions from being a tool-user to being platform-dependent, fundamentally altering the relationship with the scientific method.

The third axis is the ultimate consequence of these changes. The scientific community no longer faces merely a reproducibility crisis, but a crisis of methodological control. The problem is not only the difficulty of replicating a result, but the researcher’s inability to know and guarantee the stability of the tool being used. If the platform can, without transparency, decide when and how to search the web, which sub-models to activate, what filters to apply, and with what parameters, the researcher loses the ability to describe the procedure with sufficient detail for replication. Without an exportable computational logbook, reproducibility becomes a lottery. In this context, it is tempting to substitute brand for method, writing “results were obtained using GPT-5,” as if this constituted a sufficient specification. It does not. It is an abdication of methodological responsibility.
This situation risks normalizing a standard of tolerated irreproducibility, where papers describe objectives rather than auditable procedures. The response requires the scientific community to establish new standards of rigor. Necessary countermeasures include a “trace audit” or “logbook” documenting the internal routes taken by the model, the availability of “frozen models” for replication purposes, and exhaustive documentation of prompts and generated artifacts. These features are not yet available in commercial AI models. This does not mean we have returned to a “primitive” state of science. It does mean, however, that we must establish clear minimum rules for the use of generative AI in research.
Rules for how to treat AI in scientific production
The first is to treat the AI workflow as part of the research data itself. This includes recording and archiving the prompts used, system messages, exact date and time, configured parameters (such as temperature and token limits), context files, and the entire interaction history. The second is to explicitly describe how the system was configured, including whether model selection was automatic, whether web browsing was enabled, and which external services were utilized. If the platform does not provide this level of detail, this limitation must be reported in the body of the article, not merely in footnotes. The third is to verify important findings with more stable and open models, even if they are less advanced, to establish a comparative baseline and facilitate cross-validation.
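To make the first two rules concrete, the following is a minimal sketch of what such archiving might look like, assuming the openai Python client (version 1.x); the wrapper name, the logbook file, and the default parameters are illustrative, not a standard.

```python
# A minimal archiving sketch, assuming the openai Python client (v1.x).
# The wrapper name, file name, and default parameters are illustrative.
import json
from datetime import datetime, timezone

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def logged_completion(messages, model="gpt-4o", temperature=0.0,
                      max_tokens=1024, logbook_path="ai_workflow_log.jsonl"):
    """Run a chat completion and archive prompt, parameters, and output."""
    timestamp = datetime.now(timezone.utc).isoformat()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=temperature,
        max_tokens=max_tokens,
    )
    record = {
        "timestamp_utc": timestamp,
        "requested_model": model,
        # Model identifier actually reported by the service; it may be more
        # specific than, or differ from, the one requested.
        "reported_model": response.model,
        # Backend configuration fingerprint, where the provider exposes one.
        "system_fingerprint": getattr(response, "system_fingerprint", None),
        "temperature": temperature,
        "max_tokens": max_tokens,
        "messages": messages,
        "output": response.choices[0].message.content,
    }
    with open(logbook_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record
```

The resulting log file can be archived and deposited alongside the research data, so that at least the researcher-side half of the interaction is documented even when the provider-side configuration is not.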
As for the third rule, testing the same phenomena or hypotheses across different architectures and datasets (even less advanced ones) allows us to assess the robustness and generalizability of the findings. If a result is replicable or observable across distinct models, this significantly increases confidence in its validity, mitigating the risk that the findings are mere artifacts of a specific model or training process. The fourth is to recognize that when the method depends on a constantly changing service, exact replication of results can only be expected within a short timeframe. Thereafter, comparisons must be made conceptually, using similarity indicators or the cross-validation suggested above.
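As a rough illustration of such similarity indicators, the sketch below compares outputs from two different models using a simple lexical ratio from the Python standard library; the example outputs and the 0.8 threshold are purely illustrative, and in practice one would likely prefer a validated semantic measure.

```python
# A rough cross-model comparison sketch using only the standard library.
# The example outputs and the 0.8 threshold are illustrative, not a
# recommended metric.
from difflib import SequenceMatcher


def lexical_similarity(text_a: str, text_b: str) -> float:
    """Approximate lexical agreement between two model outputs (0.0 to 1.0)."""
    return SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()


# Hypothetical outputs obtained for the same prompt from a proprietary
# orchestrator model and from a locally run open model.
output_proprietary = ("The interviews emphasise precarious employment "
                      "and distrust of public institutions.")
output_open_model = ("Respondents stress precarious work and a lack of "
                     "trust in public institutions.")

score = lexical_similarity(output_proprietary, output_open_model)
print(f"Lexical similarity: {score:.2f}")
if score < 0.8:
    print("Low agreement: inspect both outputs before treating the finding as robust.")
```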
Still, these measures do not resolve the core problem, which is the misalignment between the commercial interests of the companies providing these systems and the requirements of scientific inquiry. It is therefore fundamental to develop open and public alternatives that can run in controlled environments, even with lesser capabilities, to prevent the scientific method from becoming entirely dependent on big tech.
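One way to build such a controlled environment, sketched below under the assumption that the Hugging Face transformers library is used, is to run an openly released model locally and pin it to an exact revision; the model identifier and revision hash are placeholders, not recommendations.

```python
# A minimal sketch of running an openly released model locally with a pinned
# revision, assuming the Hugging Face transformers library. The model name
# and revision hash below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "example-org/example-open-llm"  # placeholder identifier
REVISION = "abc123def456"                    # exact commit hash to pin

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, revision=REVISION)

prompt = "Summarise the main themes in the following interview excerpt: ..."
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding (do_sample=False) keeps generation deterministic for a
# fixed model revision, which is what makes local replication feasible.
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the weights are pinned and the decoding is deterministic, another researcher can, in principle, obtain the same output years later, which is precisely what a commercial, continuously updated service cannot guarantee.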
In sum, something has changed. We have moved from a regime where replication was difficult in practice to one where it is impossible in principle. The gain in usability for the general public has been purchased at the cost of methodological control for the scientific community. The question, then, shifts from “has nothing changed?” to “how will the scientific community adapt to ensure research integrity in this new regime?”





This reminds me of a non-linear curve fitting program that suddenly started giving different standard deviations for the fitted parameters. I was asked to find out why. Comparison with my own programs showed that they’d changed the way that they estimated the covariance matrix, from using the expected Fisher information matrix to using the observed information matrix. Nobody contactable in the company seemed to be aware of this, and there was no hint in the printout of how the (approximate) standard deviations were calculated.
This is a good article laying out an important challenge for contemporary science. But the author does not mention what seems to be the most obvious short-term solution: scientists should not use ChatGPT or other AI tools as part of their research workflow (other than as a crutch to help write up text parts of a paper – and when used for that, the generated text needs to be double-checked very carefully!). If a software package now uses AI tools in a non-defeatable and non-reproducible way, then don’t use that package. Period.
The reproducibility crisis was bad enough already; we don’t need this extra element of randomness inserted into the process!