A Cautionary Tale for Open Science: AlphaFold3

S Krishnaswamy

A NEW AI (Artificial Intelligence) model, AlphaFold3, has excited the scientific community. Developed by Google DeepMind, the group's Anglo-US AI subsidiary, together with Isomorphic Labs (a subsidiary of Google's parent company, Alphabet), AlphaFold3 made headlines when it was published in the journal Nature on May 9, 2024, for its ability to predict the structures of proteins and their interactions with other molecules such as DNA and RNA. This holds immense promise for drug discovery and medical treatments.

However, a cloud hangs over this excitement: limited access to the technology. DeepMind has not released the full code or the inner workings of the model, opting instead for a closed-source approach. They have provided a simplified algorithm description and a web server for limited use. This decision has reignited the debate about open science in an era dominated by private funding and AI.

Open science champions the idea of freely sharing research data, methods, and code. This openness allows others to verify findings, fosters collaboration between researchers, and ultimately speeds up scientific progress. Traditionally, research was fuelled by public funding, making open science a cornerstone of academic integrity. However, the rise of massively funded private companies like DeepMind, whose research has a strong commercial focus, has raised concerns. DeepMind's investment in AlphaFold3, like other similar cases, illustrates this tension.

OPEN SCIENCE AND PROTEIN STRUCTURES

Proteins are the tireless workhorses of our cells, performing essential functions and building the very infrastructure of life. To function properly, they must fold into a particular three-dimensional structure. This folding is constrained by the physical properties of atoms and the way amino acids link together in a protein chain. Traditionally, determining this structure has relied on experimental techniques like X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, and more recently, Cryo-Electron Microscopy. These methods can be time-consuming (days to years) and expensive, especially for complex proteins interacting with other molecules like DNA, RNA, or other proteins. Since the first protein structure was solved in 1957, only around 200,000 structures have been experimentally determined. Significantly, all these structures and the underlying experimental data are freely available in public databases. Research journals typically require this data to be made accessible as part of the publication process.

Unveiling protein structures and their interactions with other molecules is crucial for rapidly developing new drugs and therapies, including antibody-based treatments. This has fuelled immense interest from the pharmaceutical industry.

Since the 1950s, researchers have strived to predict protein structures solely from their amino acid sequences, with limited success until recently. In 2018, Google's DeepMind introduced AlphaFold, and significantly improved upon it in 2020 with AlphaFold2. These advancements stemmed from collaboration with the publicly funded European Bioinformatics Institute.

Both of these AlphaFold versions were trained on a massive public database containing over 170,000 protein sequences and their corresponding structures. The programme uses a type of deep learning called an attention network, which works something like solving a complex jigsaw puzzle. The attention network allows the AI to focus on specific pieces and decide which of them belong together, progressively assembling them to form the complete protein structure.
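To make the jigsaw analogy concrete, here is a minimal sketch of the general idea behind attention (so-called scaled dot-product attention), written in plain Python. It is not AlphaFold's actual code or architecture. The three "residues" and their two-number feature vectors are hypothetical values chosen only for illustration, and the function names are our own.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: a simple measure of how similar two vectors are."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    For each query, score every key, convert the scores to weights,
    and return the weighted average of the values.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        mixed = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical features for three "puzzle pieces" (illustrative only).
residues = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = attention(residues, residues, residues)

for i, row in enumerate(weights):
    print(f"piece {i} attends with weights", [round(w, 3) for w in row])
```

In this toy example, the first piece gives more weight to the second piece, whose features are similar, than to the third, whose features are not. That is the sense in which attention "focuses" on the pieces most relevant to each other. Real models learn the feature vectors from data and stack many such layers.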

ONLY PRIVATE FUNDED RESEARCH

AlphaFold3 marks a significant departure from its predecessors. It incorporates a technique called Diffusion Networks, similar to those used in image generation programs like DALL-E. Notably, this development was a fully private venture by Google's DeepMind, collaborating with its sister company Isomorphic Labs.

DeepMind, of course, justifies its limited access model by arguing for a good return on investment. They offer a user-friendly web server as a way to democratise access to AlphaFold3's functionalities, even if the underlying technology remains undisclosed.

However, as many researchers point out, there are significant drawbacks. The web server restricts use to non-commercial research, limiting the ability of independent researchers and start-ups to innovate. Additionally, it cannot handle complex protein structures or those bound to potential drugs, both crucial aspects of drug discovery. Most importantly, the lack of access to the code hinders scientific progress. Researchers cannot fully understand, improve, or adapt AlphaFold3 for specific needs, which slows down potential advancements.

OPPOSITION FROM RESEARCHERS

The publication of AlphaFold3 in Nature sparked a significant response from the scientific community. A review team member and a group of biologists co-authored an open letter to Nature raising concerns about several deviations from standard practices and potential policy violations. The letter garnered over 600 endorsements. Media outlets echoed these concerns, praising the technology's potential while criticising the lack of openness.

In response, Max Jaderberg (AI chief of Isomorphic Labs) and Pushmeet Kohli (DeepMind's VP of research) announced on X (formerly Twitter) a planned release of the code within six months for academic use. Nature, in a May 22nd editorial, acknowledged the debate and solicited reader feedback on promoting open science practices. The journal emphasized its existing open science policies, but conceded the challenges posed by private sector funding and potential proprietary research outcomes.

In a significant development, researchers from Columbia University and Harvard Medical School unveiled OpenFold in mid-May. This open-source tool offers an alternative to AlphaFold2, giving researchers more transparency into the underlying processes. OpenFold allows labs to train their own customised versions, potentially incorporating proprietary data and tackling specific research problems. This approach could provide functionality similar to AlphaFold2 without relying on Google's servers.

The scientific community is actively pursuing open-source alternatives to AlphaFold3. While some are developing new tools, others are attempting to extract more information from the existing web server. However, the challenge lies not in replicating the code (estimated to take about a month), but in the immense computational resources required to train the AI model, which pose a significant time and cost barrier.

OPEN SCIENCE VS INTELLECTUAL PROPERTY

The AlphaFold3 case exemplifies a longstanding conflict: open science versus intellectual property (IP) rights. Companies have a valid interest in protecting their investments, but excessive IP restrictions can hinder scientific progress and limit the societal benefits of innovation.

Potential solutions exist. Alternative licensing models, like those used in open-source software, could grant public access to the code while safeguarding commercially sensitive aspects. Restricting commercial use or sub-licensing could be other options. Additionally, data sharing agreements could allow independent researchers to analyse specific datasets used to train AlphaFold3, without revealing core algorithms.

Public-private partnerships can leverage resources from both sectors while promoting open access. Governments can incentivise open science by directing funding towards projects with data sharing plans. Journals can require authors to disclose data access policies and encourage open-source code repositories.

The Human Genome Project serves as a successful example. This international effort mapped human genes using a hybrid model of private and public funding, while ensuring open access to the sequenced data. This approach accelerated research advancements in genomics and personalised medicine.

The limitations of AlphaFold3 with respect to access raise ethical concerns. A tool with such potential to revolutionise healthcare should be more widely available to the scientific community. Restricted access risks creating a scenario where only well-funded institutions and companies can harness AlphaFold3 for drug discovery. This could significantly delay development of life-saving treatments, particularly for diseases affecting developing nations. Imagine a situation where a protein structure critical for a rare disease in a developing country is too complex for the web server. Without access to the underlying code, researchers there would be unable to contribute to finding a cure, creating a significant ethical barrier.

NEED FOR OPEN SCIENCE IN AI DEVELOPMENT

The AlphaFold3 story sheds light on the challenges posed by AI in scientific research. AI models are often "black boxes," meaning their inner workings are complex and difficult to understand. This opacity can make it difficult to verify their results and replicate their successes. Open access to the code and training data used in AI models like AlphaFold3 would allow researchers to understand the model's biases, improve its accuracy, and adapt it for specific applications.

The AlphaFold3 case serves as a cautionary tale. While private funding can accelerate scientific progress, it must not come at the expense of open science. Finding a balance between protecting intellectual property and fostering transparency is crucial for maximising the societal benefits of scientific breakthroughs. Embracing open science principles ensures that advances like AlphaFold3 are truly accessible to everyone.
