The ability of artificial intelligence (AI) to sift through mountains of information and deliver useful results is rapidly reshaping the way people learn, work, and handle numerous tasks. Yet, for all the convenience and value Generative AI and large language models (LLMs) deliver, they have a problem. Despite delivering text, video, and images that appear accurate and convincing, they sometimes hallucinate.
These fabrications—which can range from minor, plausible errors to utterly absurd assertions—are a legitimate cause for concern. At the very least, the resulting misinformation or botched image can rank as mildly amusing or annoying. In a worst-case scenario, however, LLMs can dispense bad medical or legal advice or lead to biased, discriminatory, or dangerous decision-making.
As a result, researchers and data scientists are actively exploring ways to rein in AI hallucinations through improved training and model refinements. One method, Retrieval Augmented Generation (RAG), cross-checks data using outside databases and other sources. Another approach relies on novel statistical methods to spot possible semantic glitches. Still another plugs physical data and other information into a model.
At the same time, data scientists are introducing robust filters, parameter controls, and human feedback loops that can serve as guardrails. “The challenge is to produce useful and accurate results without undermining the ability of these models to support creative ideation and brainstorming,” said Chris Callison-Burch, a professor in the School of Engineering and Applied Science at the University of Pennsylvania.
Just the Facts
No one disputes the cause of the hallucination problem: the size of today’s Generative AI and LLM models, along with the way they work, introduces opportunities for errors, sometimes on a grand scale. A model like OpenAI’s GPT-4 reportedly tops out at more than 1.76 trillion parameters. Data scientists build LLMs using a method called autoregressive generation. “It uses a probability distribution to predict the next word,” Callison-Burch said.
In other words, autoregressive generation doesn’t set out to produce accurate information per se; it’s designed to spit out the most probable outcome based on past data and surrounding words. Larger models trained on high-quality, more-diverse datasets generally increase the odds of accurate results because they can consider more words in context.
However, they can’t expunge hallucinations.
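The next-word mechanism Callison-Burch describes can be illustrated with a toy model. Everything below is invented for illustration: a real LLM conditions on thousands of preceding tokens and a vocabulary of tens of thousands of words, but the principle is the same, at each step the model emits a probable continuation, whether or not the result is true.

```python
# Toy autoregressive generation: the "model" is just a table of
# next-word probabilities conditioned on the previous word.
# All words and probabilities here are invented for illustration.
NEXT_WORD = {
    "the": {"cat": 0.5, "moon": 0.3, "answer": 0.2},
    "cat": {"sat": 0.7, "flew": 0.3},
    "sat": {"down": 0.9, "up": 0.1},
}

def generate(start, max_words=4):
    """Greedily pick the most probable next word at each step."""
    words = [start]
    while len(words) < max_words and words[-1] in NEXT_WORD:
        dist = NEXT_WORD[words[-1]]
        words.append(max(dist, key=dist.get))  # argmax of P(next | prev)
    return " ".join(words)

print(generate("the"))  # → "the cat sat down"
```

Nothing in this loop checks facts; it only chases probability mass, which is why fluency and truth can come apart.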
Plenty of mind-bending examples of AI hallucinations have appeared over the last few years. Among them: adding glue to pizza to make toppings stick better, and eating rocks to promote better nutrition and digestive health. Gen AI models produce extra fingers and sometimes eliminate body parts or key elements in an image. While it is easy for any reasonable person to dismiss the most egregious examples, things get far more complicated when LLMs veer into health and wellness advice and lawyers rely on them to prepare court briefs and cite case law.
Stories of these AI systems going off the rails have become the new normal. Yet, for all the chaos, the cause is relatively simple. “Because a language model contains massive amounts of Internet data, it may faithfully represent a concept or the overall knowledge it acquired during training—even if the information isn’t accurate,” said Daphne Ippolito, an assistant professor in the Language Technologies Institute at Carnegie Mellon University.
The bottom line? As words become numeric vectors and numerical values multiply by an order of magnitude inside a model, things get fuzzy and meanings get blurred. The model simply regurgitates words based on probability. “AI hallucinations are a feature rather than a bug,” said Anima Anandkumar, Bren Professor of Computing and Mathematical Sciences at the California Institute of Technology (Caltech). “Models are incentivized to produce plausible text, rather than factual text.”
Model Behavior
Addressing the AI hallucination problem is no simple feat. Yet, several methods have emerged that take direct aim at the problem. For example, RAG prompts the LLM to check the Web or a proprietary database to verify that the information it generates is correct. This process can take place in real time. For instance, an LLM could check to see who now serves as the governor of a state, or the current price of gasoline, rather than relying on outdated training data.
RAG extracts the desired data, plugs it into the original query, and generates a new, more-accurate response. In some cases, the system undergoes an additional self-evaluation process. It examines its answer to determine if it abides by parameters that make sense. If the system is uncertain about whether the information is correct, it can conduct additional searches to support or contradict the original response. This might include the use of specialized databases.
The result is a more comprehensive and accurate response, with fewer hallucinations. “Retrieval Augmented Generation doesn’t change the base LLM. It runs atop it by providing additional context in the prompt—using an API or similar tool to retrieve relevant information,” Callison-Burch explained.
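The retrieve-then-augment flow Callison-Burch describes can be sketched in a few lines. Everything below, from the toy knowledge base to the naive keyword retriever and prompt template, is invented for illustration; a production system would query a live search API or vector database and pass the augmented prompt to an actual model.

```python
# Minimal sketch of Retrieval Augmented Generation (RAG).
# The knowledge base entries are placeholders, not live data.
KNOWLEDGE_BASE = {
    "governor of california": "Gavin Newsom is the governor of California.",
    "price of gasoline": "The average gasoline price is $X.XX/gallon (placeholder).",
}

def retrieve(query):
    """Naive keyword retrieval: return passages whose key words all appear in the query."""
    q = query.lower()
    return [text for key, text in KNOWLEDGE_BASE.items()
            if all(word in q for word in key.split())]

def rag_prompt(query):
    """Build a prompt that grounds the model in retrieved context.

    Note the base LLM is unchanged: we only prepend retrieved
    passages to the prompt, as described above.
    """
    context = "\n".join(retrieve(query))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer using only the context above.")

print(rag_prompt("Who is the governor of California?"))
```

The self-evaluation step mentioned earlier would wrap this in a loop: if the model flags its own answer as uncertain, retrieve again with a refined query before responding.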
Another promising method emerged in June 2024, when researchers at the University of Oxford in the U.K. published a paper in Nature outlining a novel way to detect AI hallucinations.k “If you ask an LLM the same question several times, you are likely to get different answers back,” explained Sebastian Farquhar, a Google DeepMind researcher who co-authored the paper as a Ph.D. student. “It can be extraordinarily difficult to determine whether a model is uncertain about what to say versus being uncertain about how to say it.”
The solution—referred to as semantic entropy—is rooted in the inherent vagueness of language and the numerous ways a model can interpret words, phrases, concepts, symbols, and other data. For example, the words “fly” and “bank” have multiple meanings, and a word like “thing” refers to almost any object. The Oxford researchers developed a method that uses probability and statistical analysis to score word patterns based on potential meaning versus correct phrasing.
The technique identifies likely errors and hallucinations through subtle inconsistencies in language that occur across different versions of a response. For example, the method found 45 LLM errors across 150 factual claims when it examined text generated from a group of Wikipedia biographies. The entropy approach works across six LLMs, including GPT-4 and LLaMA 2, and it requires no prior knowledge, task-specific training, or special instructions to work across different types of datasets.
However, this statistical technique addresses only a particular type of hallucination, Farquhar emphasized. It is not effective in detecting errors when an LLM consistently spits out incorrect results. “There are a lot of ways a model can go wrong and spew garbage. This is just a way to address the problem of hallucination,” Farquhar said.
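The core idea behind semantic entropy can be sketched as follows: sample several answers to the same question, cluster them by meaning rather than exact wording, then measure the entropy of the cluster sizes. This is a toy sketch; the word-set equivalence check below is a crude stand-in for the bidirectional entailment model the Oxford researchers actually use to decide whether two answers mean the same thing.

```python
import math

def semantically_equivalent(a, b):
    # Stand-in for bidirectional entailment: treat two answers as the
    # same meaning if they contain the same set of (lowercased) words.
    return set(a.lower().split()) == set(b.lower().split())

def semantic_entropy(answers):
    """Cluster sampled answers by meaning, then compute entropy over clusters."""
    clusters = []
    for ans in answers:
        for cluster in clusters:
            if semantically_equivalent(ans, cluster[0]):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    n = len(answers)
    # High entropy = the model keeps changing its meaning, a hallucination signal.
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)

# Same meaning, different wording → low entropy (model is confident).
print(semantic_entropy(["Paris is the capital", "the capital is Paris"]))
# Three conflicting meanings → high entropy (likely confabulation).
print(semantic_entropy(["Paris is the capital",
                        "Lyon is the capital",
                        "Nice is the capital"]))
```

Note how rephrasings land in one cluster and contribute no entropy: the method penalizes uncertainty about *what* to say, not uncertainty about *how* to say it, which is exactly the distinction Farquhar draws.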
Active Imagination
To be sure, tamping down AI hallucinations will likely require multiple techniques—often used simultaneously or to address specific needs. At Caltech, Anandkumar is exploring ways to build LLMs that understand real-world properties. “One of the biggest problems with these models,” she said, “is that they lack physical grounding.” For example, “If you ask LLMs to play tennis, they provide a lot of theories about it, but they cannot actually play tennis.”
Keeping such limitations in mind, and avoiding the urge to anthropomorphize models, is vital. Consequently, Anandkumar and other researchers at Caltech are now constructing AI models that incorporate detailed physical data. These so-called neural operators span biology, chemistry, and physics. They focus on everything from soil and plants to clouds and objects in outer space.
“We teach the models the laws of physics from the ground up,” Anandkumar said. “So, instead of training the model with Internet data of varying factuality, it is equipped to generate answers and simulate information that are physically valid and factually correct.” It’s possible to apply the technique to both classical and generative AI models.
Anandkumar also has developed an open-source toolkit, LeanDojo, that plugs mathematical reasoning into LLM training. This allows an LLM to avoid certain types of hallucinations by verifying every proof step the model proposes. “The training process typically takes place over the course of a GPU-week and leads to 100% accuracy,” Anandkumar said.
Meanwhile, publicly available LLMs such as ChatGPT, Gemini, and Copilot are turning to still other methods to reduce hallucinations. They are introducing alignment training techniques that insert fact-checking and feedback from users into the training loop. Some also tap a method called “Causal AI” that forces a system to examine a query or issue from multiple perspectives simultaneously; this increases the odds the LLM will catch a potential hallucination before releasing it.
The Dream of Better AI
The inherent complexity of large probabilistic AI models makes it nearly impossible to eradicate every hallucination or factual error. In addition, creators may want to prioritize certain functions, capabilities, and outcomes, which can lead to edge cases where hallucinations pop up. “The challenge is to build models that do things that different people desire. One person may want an answer that is wonky and creative, and another might want a reply that is factual,” Ippolito said.
Further complicating matters, hallucinations sometimes have value. They can lead to new and revolutionary ways to think—and see the world. Ippolito, who explored the topic in a Ph.D. dissertation while attending the University of Pennsylvania, argued that it is important to avoid snuffing out the creative sparks that Generative AI and LLMs sometimes deliver. The same qualities that lead to hallucinations can help people produce new art, music, and product designs. Already, companies like Nvidia and Intel are using Generative AI and LLMs to produce new and more efficient chip designs.
For now, the march toward more lucid and accurate AI models continues. Although no single approach will solve the core problem—a base model that hallucinates—the sum of these techniques offers a way to get hallucinations under control. “There’s a clear need to improve LLMs,” Callison-Burch said. “But ultimately, humans must understand there’s a need to stay in control of the technology and learn to use it responsibly.”
Further Reading
- Izacard, G., Lewis, P., Lomeli, M., Hosseini, L., Petroni, F., Schick, T., Dwivedi-Yu, J., Joulin, A., Riedel, S., and Grave, E. Atlas: Few-shot Learning with Retrieval Augmented Language Models. November 16, 2022. https://arxiv.org/pdf/2208.03299
- Farquhar, S., Kossen, J., Kuhn, L., and Gal, Y. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630, June 19, 2024. https://www.nature.com/articles/s41586-024-07421-0
- Ji, Z., Yu, T., Xu, Y., Lee, N., Ishii, E., and Fung, P. Towards Mitigating Hallucination in Large Language Models via Self-Reflection. Findings of the Association for Computational Linguistics: EMNLP 2023, pages 1827–1843, December 6–10, 2023. https://aclanthology.org/2023.findings-emnlp.123.pdf
- Yang, K., Swope, A.M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. NeurIPS 2023, June 27, 2023. https://arxiv.org/abs/2306.15626
- Wang, C., Zhou, H., Chang, K., Li, B., Mu, Y., Xiao, T., Liu, T., and Zhu, J. Hybrid Alignment Training for Large Language Models. June 21, 2024. https://arxiv.org/html/2406.15178v1
- Ippolito, D. Understanding the Limitations of Using Large Language Models for Text Generation. Ph.D. dissertation, University of Pennsylvania, 2021. https://www.cis.upenn.edu/~ccb/publications/dissertations/daphne-ippolito-thesis.pdf