AI Models: Stochastic Parrots doing Tamasha?
Bappa Sinha
THE unprecedented popularity of ChatGPT has turbocharged the AI hype machine. We are bombarded daily by news articles announcing humankind’s greatest invention – Artificial Intelligence (AI). AI is “qualitatively different”, “transformational”, “revolutionary”, “will change everything” – they say. OpenAI, the company behind ChatGPT, recently announced GPT-4, a major upgrade of the technology underlying ChatGPT. Microsoft researchers are already claiming that GPT-4 shows “sparks of Artificial General Intelligence”, or human-like intelligence – the holy grail of AI research. Fantastic claims are made about reaching the point of “AI Singularity”, of machines equalling and then surpassing human intelligence. The business press talks of hundreds of millions of job losses as AI replaces humans in a whole host of professions. Others worry about a sci-fi-like near future in which super-intelligent AI goes rogue and destroys or enslaves humankind. Are these predictions grounded in reality, or is this just the over-the-top hype that the tech industry and the VC machine are so good at selling?
The current breed of AI models is based on a specific kind of statistical tool known as the “neural network”. While the term conjures up images of an artificial brain simulated on computer chips, these so-called neural networks bear no real resemblance to the networks of neurons in the human brain. The terminology was, however, a major reason for artificial “neural networks” becoming popular and widely adopted despite their serious limitations and flaws.
The “machine learning” algorithms in current use extend statistical methods in ways that lack theoretical justification. Traditional statistical methods have the virtue of simplicity: it is easy to understand what they do, and when and why they work. They come with mathematical assurances that the results of their analysis are meaningful, given a well-defined set of variables and very specific assumptions. Since the real world is complicated, those conditions never quite hold, and statistical predictions are seldom exactly accurate. Economists, epidemiologists and statisticians acknowledge this and use intuition or informal reasoning to apply statistics for approximate guidance in very specific contexts. This also requires regular monitoring, to check whether the methods continue to work well enough as circumstances change. These caveats are often overlooked, and the resulting misuse of statistical methods has sometimes had catastrophic consequences, as in the great financial crisis of 2008 or the LTCM blowup of 1998, which almost brought down the global financial system. Remember the famous quote popularised by Mark Twain: “Lies, damned lies and statistics.”
Machine learning abandons even this caution, which should accompany any judicious use of statistical methods. The real world is messy and chaotic, and hence impossible to model fully using traditional statistics. The answer from the world of AI is to drop any pretence of theoretical justification for why and how these models – many orders of magnitude more complicated than traditional statistical methods – should work at all. Freedom from such principled constraints is what makes AI models “more powerful”. They are, in effect, elaborate curve-fitting exercises that empirically fit the observed data without our understanding the relationships behind why the data fit.
Let us illustrate this with a concrete but simple example. If we measure the volume of water in a can while heating it, we will observe that the volume increases as the temperature rises. Suppose we measure the volume at every 10°C increase in temperature, from 20°C to 100°C. It is then possible to guess the volume at a temperature we have not measured, say 35°C. This is called interpolation. A statistician may assume that water expands linearly with temperature and guess the volume at 35°C to be halfway between the volumes at 30°C and 40°C. It turns out that water does not expand exactly linearly, so our statistician may have to find a better curve to fit the observations. Machine learning automates this process of finding “better curves”. If predictions are made on inputs reasonably close to the training data, they are pretty good. The problem is that the real world is not mathematically “well behaved”. Even in our simple example, what happens if we predict the volume of water near its freezing point, 0°C, from our past observations? The prediction goes awry: while water generally expands with rising temperature, below 4°C it expands as the temperature falls. Guesses beyond the training range are called extrapolation, and they can be very deceptive. Making predictions purely from empirical observations or “training data”, without understanding why things work the way they do, is tricky. AI models try to compensate by sampling the space densely. But for tasks like recognising images or composing sentences, the universe of possible inputs is unimaginably huge, weird and highly non-linear.
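The water example can be run as a small numerical sketch. The Python below fits a straight line to approximate handbook values for the specific volume of water (the exact figures are assumptions for illustration only): interpolation at 35°C comes out close to the truth, while extrapolation to 0°C predicts the water keeps shrinking, the opposite of what real water does below 4°C.

```python
import numpy as np

# Approximate specific volume of water (cm^3 per gram) at each temperature
# (degrees C). Rough handbook figures, adequate for illustration only.
temps = np.array([20, 30, 40, 50, 60, 70, 80, 90, 100])
vols = np.array([1.0018, 1.0043, 1.0079, 1.0121, 1.0171,
                 1.0227, 1.0290, 1.0359, 1.0434])

# Fit a straight line (the statistician's first guess) to the training range.
slope, intercept = np.polyfit(temps, vols, 1)

# Interpolation inside the training range works reasonably well ...
v_35 = slope * 35 + intercept  # close to the true value of about 1.006

# ... but extrapolation predicts the volume keeps shrinking towards 0 deg C,
# while real water turns around and expands below 4 deg C (true value ~1.0001).
v_0_pred = slope * 0 + intercept
print(v_35, v_0_pred)
```

The fitted line lands near 0.988 cm³/g at 0°C – roughly one per cent off and, more importantly, wrong in direction: the curve learned from the 20–100°C range has no way of knowing that the physics changes below 4°C.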
It is for this reason that these models are trained on millions of images or petabytes of text data covering almost the entire publicly available internet and books, vastly more than a child requires to learn anything.
Without any theory of why they work when they do, it is impossible to know whether these models will be adequate for a specific task, or when and how they will spectacularly fail. Given the complicated tasks they are put to, it is also very difficult to tell whether they are working properly or just well enough to fool observers – and, when they do seem to work, whether it is because of spurious correlations in their vast training data.
But it is also true that these AI models can sometimes do things no other technology can do at all. Some outputs are astonishing: the passages ChatGPT can generate, for instance, or the images DALL-E can create. This is fantastic for wowing people and generating hype. The reason they work “so well” is the mind-boggling quantity of training data – enough to cover almost all text and images created by humans. Even at this scale, and with billions of parameters, the models do not work out of the box but require kludgy, ad-hoc workarounds to produce desirable results. They are adjusted by turning knobs called hyperparameters, which have no theoretical justification and whose effects are poorly understood.
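The knob problem can be seen even in the simplest possible setting. The sketch below (the function and the specific learning rates are invented for illustration) runs gradient descent on f(x) = x², where the learning rate is a hyperparameter: one value converges to the minimum, a slightly larger one diverges.

```python
def gradient_descent(lr, steps=50, x=1.0):
    """Minimise f(x) = x^2 by gradient descent; the gradient is f'(x) = 2x."""
    for _ in range(steps):
        x = x - lr * 2 * x  # standard gradient-descent update
    return x

good = gradient_descent(lr=0.1)  # shrinks towards the minimum at x = 0
bad = gradient_descent(lr=1.1)   # overshoots further on every step and blows up
print(good, bad)
```

For this toy function the safe range of the knob can be worked out on paper; for a model with billions of parameters and dozens of interacting hyperparameters, it cannot, and tuning becomes trial and error.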
Even with all these hacks, the models often develop spurious correlations, i.e., they work for the wrong reasons. It has been reported, for example, that many vision models seem to work by exploiting correlations in image texture, background, camera angle and other incidental features. Such models then give bad results in uncontrolled situations: a leopard-print sofa gets identified as a leopard; models fail when a tiny amount of fixed-pattern noise, undetectable by humans, is added to the images, or when images are rotated, as with a photograph of an upside-down car after an accident. On the other hand, images whited out in the middle, with only the borders and backgrounds of objects left intact, get correctly identified. And ChatGPT, for all its impressive prose, poetry and essays – even now, after all the tweaks made on the basis of interactions with millions of subscribers – cannot reliably do a simple multiplication of two large numbers, something well within the capabilities of a middle-school student or a calculator from the 1970s.
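The contrast with conventional software is stark. Exact multiplication of large integers is a long-solved problem that any ordinary program handles deterministically, as two lines of Python show:

```python
# Arbitrary-precision integer arithmetic is built into Python:
# the same exact answer, every single time.
a = 123456789
b = 987654321
product = a * b
print(product)  # 121932631112635269
```

A large language model, by contrast, produces digits by statistical pattern-matching over its training text, which is why its answers to such questions are unreliable even when its prose is fluent.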
These AI models do not really have any human-like understanding; they are great at mimicry, fooling people into believing they are intelligent by parroting the vast trove of text they have ingested. It is for this reason that computational linguist Emily Bender and her co-authors called large language models such as ChatGPT and Google’s BERT “stochastic parrots” in a 2021 paper. Her Google co-authors – Timnit Gebru and Margaret Mitchell – were asked to take their names off the paper; when they refused, Google fired them.
This criticism is directed not just at the current large language models but at the entire paradigm of trying to develop artificial intelligence through ad-hoc statistical correlation. We do not get good at things just by reading about them; that comes from practice, from seeing what works and what does not. This is true even of purely intellectual tasks such as reading and writing, and even of formal disciplines: one cannot get good at maths without practising it. These AI models have no purpose of their own, and therefore cannot understand meaning or produce genuinely meaningful text or images. Many AI critics have argued that real intelligence requires social “situatedness”.
Doing physical things in the real world requires dealing with complexity, non-linearity and chaos – and it requires practice in actually doing those things. It is for this reason that progress in robotics has been exceedingly slow: current robots can handle only fixed, repetitive tasks involving identical rigid objects, as on an assembly line. Even after years of hype about driverless cars and huge amounts of research funding, fully automated driving still does not appear feasible in the near future. It is perhaps ironic that, instead of the widely expected replacement of blue-collar jobs, it is specific white-collar jobs – transcribing speech, doing passable language translation, call-centre work – that are most likely to be replaced by AI.
Current AI development, based on detecting statistical correlations using “neural networks” treated as black boxes, promotes a pseudoscientific myth of creating intelligence – at the cost of developing a scientific understanding of how and why these networks work, and of the engineering needed to make them safe, reliable and foolproof. It emphasises spectacle instead: impressive demos, and scores on standardised tests based on memorised data.
The only significant commercial use case of the current versions of AI is advertising: targeting buyers for social media and video-streaming platforms. This does not demand the high degree of reliability expected of other engineering solutions; it just needs to be “good enough”. And bad outputs go unpunished under the existing legal framework, even though we know the havoc social media algorithms have caused through the propagation of fake news and the creation of hate-filled filter bubbles.
Perhaps a silver lining in all this is that, given the bleak prospects for AI singularity, the fear of super-intelligent malicious AIs destroying humankind can be seen as overblown. That is little comfort, though, for those at the receiving end of “AI decision systems”: the poor, and ethnic and religious minorities. We already have numerous examples, the world over, of AI decision systems denying people legitimate insurance claims, medical and hospitalisation benefits, and state welfare benefits. AI systems in the US have been implicated in sentencing minorities to longer prison terms. There have even been reports of parental rights being withdrawn from minority parents on the basis of spurious statistical correlations, which often boil down to their not having enough money to feed and take care of their children properly. And, of course, there is the fostering of hate speech on social media. As the noted linguist Noam Chomsky wrote in a recent article, “ChatGPT exhibits something like the banality of evil: plagiarism and apathy and obviation.”