
Heather Ford: Is the Web Eating Itself? LLMs versus verifiability

One of my favorite things in academia is that you can go a decade without seeing a friend and remain at least somewhat in touch with what they’re doing and thinking by reading their work. Dr. Heather Ford, Associate Professor of Communication at the University of Technology Sydney, where she leads the cluster on Data and AI Ethics, is in the US for two weeks, and we were lucky enough to bring her to UMass Amherst for a talk today. I haven’t seen Heather in person for over a decade, but I’ve had the chance to follow her writing, particularly her new book, Writing the Revolution: Wikipedia and the Survival of Facts in the Digital Age.

I knew Heather as one of the leading lights of the open knowledge world, co-founder of Creative Commons South Africa, leader of iCommons (which worked to create open educational resources from the Global South), scholar and critic of Wikimedia. In the last decade, she’s also become one of the most interesting and provocative thinkers about how knowledge gets created and disseminated in a digital age.

Heather Ford speaking at UMass Amherst

Her talk at UMass today, “Is the Web Eating Itself?”, asks whether Wikipedia and other Wikimedia projects can survive the rise of generative AI. Heather characterizes our current moment as coming after the “rise of extractivism”, a moment in which technology companies have tried to extract and synthesize almost the entire Web as a proxy for all human knowledge. These systems, exemplified by ChatGPT, suggest a very different way in which we will search for knowledge, using virtual assistants, chatbots and smart speakers as intermediaries. While there’s some good writing about the security and privacy implications of smart speakers, what does the experience of asking a confident, knowledgeable “oracle” and receiving an unambiguous answer do to our understanding of the Web? What happens when we shift from a set of webpages likely to answer our question to this singular answer?

This shift began, Heather argues, with synthesis technologies like the Google Knowledge Graph. The Knowledge Graph aggregates data about entities – people, places and things – from “reliable” sources like Wikipedia, the CIA World Factbook and Google products like Books and Maps. The Knowledge Graph allowed Google to try to understand the context of a search, but it also allowed Google to direct users towards a singular answer, presented in Google’s “voice”, rather than presenting a list of possible sources. For Heather, this was a critical moment when both Google and our relationship to it changed.
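
(A toy illustration from me, not from Heather’s talk: here’s a minimal Python sketch of the difference between the two models of search. The entity record, field names and source list are invented for the example – this is not Google’s actual schema.)

```python
# Illustrative only: a toy "knowledge graph" record. The schema, entity
# and source list are invented for this sketch, not Google's actual data.
entity = {
    "name": "Nelson Mandela",
    "facts": {
        "born": {
            "value": "1918-07-18",
            "sources": ["Wikipedia", "CIA World Factbook"],
        },
    },
}

def answer_as_source_list(field):
    """The older search model: candidate sources the user can inspect."""
    fact = entity["facts"][field]
    return [(source, fact["value"]) for source in fact["sources"]]

def answer_as_oracle(field):
    """The knowledge-graph model: one answer, in the engine's own voice."""
    return entity["facts"][field]["value"]  # provenance is dropped here

print(answer_as_source_list("born"))  # [('Wikipedia', '1918-07-18'), ...]
print(answer_as_oracle("born"))       # 1918-07-18
```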

The rise of ChatGPT, embodied in Bing Chat and many AI text generators, is another profound shift. These technologies, at root, have been “fed almost the entire internet to predict the next chunk of text – the game is to finish your sentence.” She quotes Harry Guinness as explaining that ChatGPT is extremely different from knowledge graphs: “It doesn’t actually know anything about it. It’s not even copy/pasting from the internet and trusting the source of information. Instead, it’s simply predicting a string of words that will come next based on the billions of data points it has…”
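
(To make “the game is to finish your sentence” concrete, here’s a toy next-word predictor of my own – a bigram frequency table over a tiny corpus rather than a neural network trained on billions of documents, a radical simplification of what GPT-style models do, but playing the same game.)

```python
import random
from collections import defaultdict

# A toy next-word predictor: record which words follow which (a bigram
# model), then extend a prompt by repeatedly sampling a likely next word.
corpus = "the cat sat on the mat and the cat slept on the mat".split()

following = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev].append(nxt)

def finish_sentence(prompt, length=5):
    """Repeatedly sample a plausible next word, given the last one."""
    words = prompt.split()
    for _ in range(length):
        candidates = following.get(words[-1])
        if not candidates:
            break  # no known continuation
        words.append(random.choice(candidates))
    return " ".join(words)

print(finish_sentence("the cat"))  # e.g. "the cat sat on the mat and"
```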

AI tools require “datafied facts”: text from Quora, Reddit, Stack Exchange and Wikipedia. Sites where people answer questions, like Quora and Stack Exchange, are especially helpful. AI systems strip away the particularities of these sources, eliminating local nuance and context, in ways that can be very dangerous. Brian Hood, mayor of a small city near Melbourne, is exploring suing OpenAI for defamation because ChatGPT routinely produces text identifying him as having committed financial fraud. In fact, Hood was the whistleblower who revealed the fraud, but that pertinent detail is not one ChatGPT has grasped. If Hood sues, the question of how the algorithm comes up with these false answers will be litigated in either a US or an Australian court.

Heather explains that the “authoritative synthesis that comes from ChatGPT coincides with a moment of increasing distrust in institutions – and a rise of trust in automated processes that seem not to participate in truth battles because they are seemingly unpolitical.” (Indeed, readers of her most recent book understand that there’s little more political than the processes through which events in the world find their way into Wikipedia, Wikidata and from there into the Google Knowledge Graph. These politics don’t map neatly onto our understandings of liberal/conservative, but they merit study and critique in their own right.)

What happens if Wikipedia starts ingesting itself through generative AI? Wikipedia is one of the most important sources for training generative AIs. Will we experience model collapse, where models trained on their own data degrade in performance? Will projects like Wikipedia collapse under the weight of unverifiable information?
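
(A small, illustrative simulation of my own – not from the talk – of what model collapse looks like in miniature: fit a trivial “model” to data, sample new data from it, refit on that output alone, and repeat.)

```python
import random
import statistics

# A miniature "model collapse": the model here is just a fitted mean and
# standard deviation. Each generation is trained only on synthetic data
# sampled from the previous generation's model, so estimation error
# compounds and, over the long run, the variance drifts toward zero.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(10)]  # the "human-written" data

for generation in range(31):
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    if generation % 5 == 0:
        print(f"gen {generation:2d}: mean={mu:+.3f}  stdev={sigma:.3f}")
    # retrain on the previous model's own output
    data = [random.gauss(mu, sigma) for _ in range(10)]
```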

These concerns lead to real questions: should Wikimedia ban or embrace generative AI tools? Stack Exchange has banned ChatGPT-generated answers because of their tendency to be convincing, hard to debug, and wrong. Should Wikimedia sue new intermediaries like OpenAI? Or merge with them? Call for regulation or reforms?

Core to the Wikipedia project is verifiability, a principle that Heather extends far beyond authorship and copyright. Across the Wikimedia universe, a “nation of states” that Heather reminds us includes everything from Wikimedia Commons, Wikidata and Wikipedia to Wikifunctions (an open library of computer code), verifiability is a central principle:

“(WP:VER) Readers must be able to check that any of the information within Wikipedia articles is not just made up. This means all material must be attributable to reliable, published sources.”

Verifiability, Heather explains, isn’t just an attribute of the content – it contains the idea of provenance, the notion that a piece of information can be traced back to its source. Heather suggests we think of verifiability as a set of rights and responsibilities: a right for users to have information that is meaningfully sourced, and a responsibility for editors to attribute their sources. These rights enable important practices: accountability, accuracy, critical digital literacy and, most importantly, agency. If you don’t know where a piece of information came from, you cannot challenge or change it. Verifiability enables ordinary users to change and correct inaccurate information.
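
(Another aside from me: if you think of verifiability as a property carried by the data itself, the idea fits in a few lines of code. The record type and field names below are hypothetical – this is not a Wikimedia schema – but it shows why an unsourced claim leaves a reader with nothing to challenge.)

```python
from dataclasses import dataclass, field

# Hypothetical record type, not a Wikimedia schema: a claim that carries
# its own provenance chain.
@dataclass
class Claim:
    statement: str
    sources: list = field(default_factory=list)

    def is_verifiable(self) -> bool:
        """A claim can be checked only if it can be traced to a source."""
        return len(self.sources) > 0

sourced = Claim(
    statement="Brian Hood was the whistleblower in a financial fraud case.",
    sources=["https://example.org/news-report"],  # placeholder citation
)
orphan = Claim(statement="Brian Hood committed financial fraud.")

print(sourced.is_verifiable())  # True: a reader can follow the chain back
print(orphan.is_verifiable())   # False: nothing to challenge or correct
```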

(I’ve been thinking about this all day, because it’s such an important point. In an academic setting, citation is about provenance – it’s about demonstrating that you’re acknowledging the contribution of other thinkers and claiming only your contributions. But this is a version of citation and provenance that is linked to challenging inaccuracy and rewriting what came before, if necessary. It lets go of the academic default that everything that came before is true, and sees the verifiability of facts as a source of power. It’s a very cool and powerful idea.)

So what happens to Wikipedia in an age of large language models, which threaten both the sustainability of peer knowledge projects and their verifiability? Heather sees several disaster scenarios we need to consider.

– Editors might poison the well, filling Wikipedia with inaccurate articles, leading to higher maintenance costs and lower reliability
– Users may turn away from Wikipedia toward new interfaces like ChatGPT. A problem that began with the Google Knowledge Graph – where people attributed facts to Google, not Wikipedia – may erode Wikipedia’s brand and participation further.
– LLMs may improve Wikipedia in English, but are less likely to help it improve in smaller languages. That, in turn, may make large languages like English even more central and increase problems of information inequality.
– Improved translation – one of the areas where AI has developed most quickly – may be a force for cultural imperialism. Heather remembers Jimmy Wales visiting South Africa and asking South Africans to write Wikipedia articles in the nation’s 11 official languages. But automatic translation has meant that some small-language Wikipedias are essentially machine translations of the English-language Wikipedia: the Cebuano Wikipedia is the second-largest Wikipedia in terms of articles, but is almost entirely produced by automated scripts. As a result, it reflects the priorities of the authors who wrote the English Wikipedia, not the priorities of Cebuano speakers.

Despite these concerns, Heather reports from the most recent Wikimania conference that Wikimedians are generally optimistic. To the extent that Wikipedia is a project that seeks to compile all human knowledge, perhaps the more knowledge the better! And the ability to generate coherent English may be very useful for non-native English speakers. There’s also the thought that perhaps Wikipedia can create a better LLM, one built on verifiable information. And Heather reports that the Wikimedia Foundation tends to feel that it has been using machine learning since 2017 – perhaps LLMs aren’t that threatening? And since Google paid Wikimedia a license fee for Wikidata, perhaps OpenAI and others are customers for Wikipedia – why sue your potential customers?

On the other side of the fence are those who worry that Wikipedia labor will become “mere fodder” to feed AI systems, invisible and unappreciated. Wikipedia, they fear, will never win the interface war – as an open source project, it will always be uglier and rawer than the slick interfaces developed by Silicon Valley. (And, some grumble, perhaps Wikimedia Foundation is too close to Silicon Valley ideology and capital to see the possible harms.)

Heather tells us that she sees two concerns as overriding:

– LLMs aren’t like grammar checkers or other systems that have been used to improve human writing. Instead, they represent knowledge in an entirely different way than even the knowledge graph did.
– The trajectory of LLMs undermines verifiability, and with it the practices that depend on verifiability: accountability, accuracy and critical digital literacy.

The answer, Heather believes, is a campaign for verifiability. This means we need new methods for meaningful attribution, ones that fulfill the rights and obligations that come with factual data. These methods need to make sense on new interfaces – we need to understand how an oracular voice like ChatGPT’s “knows” what it knows, and how we could trace that information back to challengeable sources. And we need in-depth research on how verifiability relates to the sustainability of projects like Wikipedia.

But verifiability is in decay, even in Wikimedia, she warns. Much of the data in Wikidata lacks meaningful citation. And because Wikidata is licensed under CC0, a very permissive license that puts the work in the public domain, there’s no obligation for those who use it to trace provenance back to Wikidata.
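
(You can get a rough feel for this yourself. The sketch below – mine, not Heather’s – uses the public Wikidata API to count how many statements on a single item carry references at all. Even that overstates matters: a reference that merely says “imported from Wikipedia” is not a meaningful citation.)

```python
import json
import urllib.request

# Fetch every statement on one Wikidata item (Q42, Douglas Adams, a
# well-tended item; sparser items fare much worse) and count how many
# carry at least one reference. Requires network access.
url = ("https://www.wikidata.org/w/api.php"
       "?action=wbgetclaims&entity=Q42&format=json")
with urllib.request.urlopen(url) as response:
    claims = json.load(response)["claims"]

total = referenced = 0
for statements in claims.values():
    for statement in statements:
        total += 1
        if statement.get("references"):
            referenced += 1

print(f"{referenced}/{total} statements carry at least one reference")
```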

Wikipedia’s OpenAI plugin does not provide citations, and Wikimedia Enterprise – the commercial service through which Wikimedia content is provided to the Google Knowledge Graph and others – offers little citation guidance. Beyond citing Wikipedia on Google’s knowledge panels, most virtual assistants provide little or no attribution.

Open licenses like Creative Commons don’t currently signal anything to extractive AI companies. They should – they should signal that the information in question is public knowledge and that certain patterns of verifiability should travel with it. Right now, Wikipedia is simply another pile of fodder for OpenAI – according to the Washington Post, it’s the #2 source of training data for many large LLMs (Google’s patent database is #1), and the #3 source is Scribd, a large pile of copyrighted materials being used in (likely) violation of copyright. b-ok.org, a legendary book-pirating site, is #190 on the WaPo list – ChatGPT and other LLMs are a complex pile of open knowledge and various forms of copyright violation.

We need a new definition and new terms for openness in the age of AI. Right now, the focus is on developers rather than content producers, on data for computers rather than for people. Data producers like Wikipedians need to be at the center of this debate. Unverifiable information should be flagged and steered away from rather than being the default for these new systems. What’s at stake is not just attribution, payment and copyright: it’s reclaiming agency in the face of AI through maintaining the production of verifiable information.


Heather is continuing to work through these ideas – I predict lots more in this direction, and look forward to seeing how she and others develop this key idea of verifiability as a response to the rise of the LLM. It was such a pleasure to have her at UMass.

2 thoughts on “Heather Ford: Is the Web Eating Itself? LLMs versus verifiability”


  1. Thank you for the summary. One quick correction: “Google paid Wikimedia a license fee for Wikidata”

    That’s incorrect. Wikidata is CC0, as mentioned later, and thus no license fee is required for it. Google does have a contract with Wikimedia Enterprise, but that does not, and could not, involve a license fee for Wikidata.

  2. Insightful and informative. My own practice with ChatGPT and Google Bard is to limit my queries to non-factual topics, e.g. suggestions for storylines, scenario-building, and subjective judgments and interpretations of content (books, movies, poetry, etc.). My favourite example is to ask which English translations of Don Quixote are most popular, or how faithfully the movie musical Man of La Mancha follows the book, and then to ask for sample passages to back up the answers. The answers spare me the tedium of checking all the English translations of Quixote myself, which would take me a few years!

