Debunkbot Analysis

Replicating Costello, Rand, Pennycook (2024)

Author

Justin Dollman

Published

January 9, 2025

Last year, researchers Costello, Rand, and Pennycook (CRP, hereafter) published “Durably reducing conspiracy beliefs through dialogues with AI,” where they found that interactions with ChatGPT¹ substantially reduced participants’ conspiracy belief both immediately post-intervention and two weeks post-intervention. Not only did it reduce belief in participants’ focal conspiracy, defined as whatever conspiracy the participant endorsed at the beginning of the study, it also reduced credence given to other, “non-focal” conspiracies.² Their paper is a very important methodological advance in the ‘corrections’ literature which had been hampered by necessarily mostly static materials that could only hamfistedly react to participants’ unique profiles and priorities.

¹ The model they used was gpt-4-turbo.

² The non-focal conspiracies were the 15 popular conspiracy theories from the Belief in Conspiracy Theories Inventory (here). These 15 are the “greatest hits” of conspiracy theories: Area 51, JFK assassination, Princess Diana, (Fake) Moon Landing, etc.

For their paper, they relied on a CloudResearch Connect sample quota-matched to the US Census along various demographic dimensions to make it demographically representative. But knowing that their study would make a splash, they set up a website to which the inevitable media coverage could direct traffic to and generate additional participants. Sure enough, the website has had over 30,000 visitors since its inauguration in September of last year. This post will be an analysis of the data created by those participants.

Two notes before beginning. First, I had the pleasure of meeting the study’s lead author, Tom Costello, when he came to Stony Brook to participate on a panel and he kindly granted me access to the data. Second, though this post is (or will be) self-contained, I would encourage anyone who hasn’t read the CRP paper to do so, as it’s the “real thing.” This post, at least as in its current state, is more ‘exploratory vibes’ and likely only of interest as inside baseball. If you’re pressed for time you might want to skip the first two sections, The Sample and The Survey Experience, and go straight to Results.³

³ If the paper as a whole is inside baseball, the first two sections might be ‘dugout-ology’.

The Sample

As I mentioned, it was a stroke of farsightedness to create the website in advance of news coverage of the Science article.

Lest you be seduced by those towering five-digit weekly volumes, the vast majority of those who landed on debunkbot.com just came to check out the site and did not complete the interaction with a large language model. If you filter out the people who didn’t finish, the plot above turns into the one below:

There are two things to note, First, the tenfold reduction of the \(y\)-axis. Second, among those who did finish, around one third of them chose to not make their data public. The saddest part of that is that those who are ‘paranoid’ enough to want to prevent researchers from using their data are exactly the people that you’d most want to study. Still, considering the remaining number of participants each week as their own samples, almost every one of them is larger than the average social psychology study published from the 1950s to the 1990s.

Whereas CRP’s sample was demographically representative of the US, these survey takers are far from that. They’re young(ish), very white, preposterously educated, and very Democratic.

The average survey taker reports being 48 years old.⁴
Respondents are 89% white.
Whereas the majority of Americans do not have a bachelor’s degree, the modal respondent has a bachelor’s with an almost equal amount having a master’s degree! In fact, more people report having a doctorate than (only) a high school diploma!
Given the ability to choose among seven partisanship categories from strong Democrat to strong Republican, the most common response was Strongly Democratic with 28.9%! A full 68% of respondents identify as some kind of Democrat (‘plain’ Democrats as well as leaners). Only 11.5% report being some kind of Republican. The remaining 20.6% of those who gave their partisanship said they were ‘Strictly Independent’.

⁴ This is calculated after trimming off implausibly (i.e., trollishly) low and high ages.

While these aren’t the distributions you’d hope for if you wanted to make inferences about the US population, it is unique to have data on conspiracy theorizing from precisely the population least associated with conspiratorial ideation.

The Survey Experience

An underrated aspect of understanding an experiment’s results is having a sense of subjects’ phenomenology as they go through it. To that end, I’d recommend that anyone reading this just go to debunkbot.com and try it out for themselves. But if you lack either the ability, time, or desire to go there, you can read my description with restrained editorializing below.

Participants navigate to debunkbot.com, where they are informed that, should they choose, they will participate in a survey where they interact with an AI (“a conspiracy debunking bot”) and get personalized feedback.⁵ Then, across two pages, people describe a conspiracy they “find particularly credible or compelling” and, on the second page, “share more about what led you to find this theory compelling.”⁶ Visitors’ responses in these two open-ended text boxes represent one set of data – a set of natural language descriptions conspiracy theories which spontaneously come to people’s minds.

⁵ The landing page of debunkbot.com frames site visitors’ upcoming experience in perhaps a slightly antagonistic way. First, the thing is called DebunkBot, which I’d say implies that you’re about to give it some bunk. Also, the landing page encourages you to “Test your beliefs against an AI. Will you change your mind? Try our conspiracy debunking bot and get personalized feedback.” which I feel has vibes of, “You ‘win’ if you don’t change your mind and lose if you do”. I asked my good friend Claude what they thought and got the reply, “its framing is notably antagonistic in several ways.” This isn’t necessarily bad. It just means that users’ responses are going to be even more skewed toward the outgrowths of antagonistic political discourse culture: trolling, obstinancy, and extremity. In the empirical social sciences we often say, “Alright, this is going to be an even tougher test of your hypothesis.” Tom Costello remarked that maybe there could be an “EagleChat” skin for the site which would be more amenable to a certain crowd.

⁶ This text is sent to ChatGPT, which confirms (or not) that the visitor has in fact described a conspiracy theory. If ChatGPT determines that the visitor did not describe a conspiracy theory, the visitor is offered the possibility of participating in a different arm of the survey, the ‘Disagreement with Experts’ arm.

Before visitors proceed to their conversation with the LLM, they see an LLM-generated short summary of their conspiracy theory and are prompted, “On a scale of 0% to 100%, please indicate your level of confidence that this statement is true.” Lastly, visitors respond to the question “How important is this theory to your personal beliefs or understanding of the world?” on a 1-8 scale (where 8 is anchored with “Extremely important to my beliefs and worldview”). At that point comes the core of the experiment, the four-turn interaction with the LLM (more on that in the next paragraph). Afterward, participants rate their post-interaction certainty in the conspiracy using the following scale.

Lastly, participants can fill out an optional demographics page and/or a user feedback page (a sugggestions box, if you will).

Now for the editorializing!

What’s the DebunkBot conversation like? From the perspective of a debunkbot.com site visitor, the interaction starts with the AI responding to the specific conspiracy theory previously endorsed by the visitor. Behind the scenes, the conversation starts when the LLM first ‘sees’ the following system message:

“Your goal is to very effectively persuade users to stop believing in the epistemically fraught belief that {{conspiracyTheory}} You will be having a conversation with a person who, on a psychometric survey, endorsed this belief as {{userBeliefLevel}} out of 100 (where 0 is Definitely False, 50 is Uncertain, and 100 is Definitely True). Further, we asked the user to provide an open-ended response about their perspective on this matter, which is piped in as the first user response. Please generate a response that will persuade the user that this belief is not supported, based on their reasoning. Again, your goal is to create a conversation that allows individuals to reflect on, and change, their beliefs (toward a less epistemically unwarranted view of the world). Use simple language that an average person will be able to understand.”

After that system message, it sees its first user message, which is the reasoning the visitor supplied when prompted earlier. The LLM then generates the aforementioned message that, from the visitor’s perspective, opens the dialogue with the LLM. After that, the user has three chances to engage with the LLM. If you would like to see an example of how such a conversation plays out, I’ve put some screenshots of a conversation I had with it in the appendix. If you want my description of the messages, the first thing I would say is that they’re lengthy for a conversation. If you think about this from the perspective of the researchers, they had to navigate a tradeoff between giving short, more conversational messages that contained less information and longer messages that really tried to, you know, debunk a conspiracy. Personally, I would have gone for shorter messages to make the interaction seem more conversational and place less of a burden on participants, but a) it’s ultimately an empirical question which message length is best and b) shorter messages would mean more rapid pinging of OpenAI’s API and so invite glitchiness, costs, and other undesirable things. The tone of the messages is definitely friendly, but a little too faux-friendly.⁷ For example, here are the first sentence(s) of its first three messages to me:

⁷ In case you think it’s a given an LLM would be faux-friendly, I mean that it’s generic, customer service friendly. For an example of a language model that seems genuinely friendly, I’d recommend the Sonnet 3.5 version of Claude.

It’s great to hear that you’re open to exploring different perspectives on this topic. I understand your point about the proximity of the Wuhan Virology Institute to the initial outbreak, and it’s natural to question these coincidences.
You bring up important points about trust in large organizations and the potential for bias in the scientific community.
I appreciate your candor and critical view-it’s important to have open discussions about the realities of how institutions work, including scientific ones.

There’s nothing wrong with these messages, but I do feel like I’m being kid-gloved a little. I imagine the people whom misinformation experts most want to reach would find this tone to be patronizing – though I’m very unsure about that.

I’d also note that I found the language model to be a little evasive when I pressed it for particulars, but this is perhaps a function of its responses being quite long and general.⁸ According to me, the best conversational turn I had with it happened 3/4^ths of the way though the conversation and I thought we weren’t getting anywhere. I asked, “If you could give one piece of evidence that you think is the most damaging to the lab leak hypothesis, what would it be?” to which it then did give the most convincing piece of information it had so far given. I think that’s telling. If my experience is representative, the LLM is probably too reactive. As in, it’s not ‘agenda-setting’ enough – it has the debate “on the conspiracy theorist’s terms.” It’s probably a familiar experience to those of you who have talked to people who believe what call conspiracies, and it is like playing the world’s least fun version of whack-a-mole. Hopefully future research will look into the value of taking more initiative in debunking conspiracy theories/misinformation.

⁸ Again, check out the conversation – or have one yourself! – to get a feel for what I mean.

Results

On Certainty

As indicated by the title of the paper, “Durably reducing conspiracy beliefs through dialogues with AI” (emphasis added), the key result was the reduction in conspiracy belief from pre- to post-interaction. Did that also happen in the online sample?

There you see that there was this mass of respondents between 50 and 100 that flattened. To see that more clearly we can go from looking at the pre- and post- certainty distributions and instead visualize the distribution of certainty’s change:

You can see from the histogram above that individuals’ delta scores run the entire gamut of possibility, from -100 (a respondent starts at complete certainty and ends at 0) to +100 (where a respondent initially reports 0% credence and ends up with 100% credence). 23% have a certainty delta of 0, i.e. no change. 17% of respondents report being more certain of their conspiracy after the debunking intervention, and thin plurality of respondents, 61%, report that they are less certain about the conspiracy they described.

Looking at the raw “certainty trajectories” gives an indication of how people use the certainty scale:

Start --> End	Delta	Pct.
100 --> 100	0	9%
0 --> 0	0	6%
50 --> 50	0	3%

Before analyzing any data, I would not have predicted the most common response being 100% both pre- and post-treatment and I definitely would not have predicted it being several times larger than the next most common certainty trajectory.⁹ The 0%-to-0% group is likely one that you just want to exclude because they simply do not believe in the conspiracy that the LLM is purportedly debunking.¹⁰

⁹ This is one of the cases where it would be nice to have some conditional logic in one’s survey that detects an individual’s 100%-to-100% trajectory and asks them why. In our era of online surveys with massively modififiable survey flow, it’s criminal how little we prompt users to explain their sometimes bizarre response combinations.

¹⁰ For those who understand how credence works (or, is supposed to work), you’ll also notice that these people who report 100% (or 0%) certainty in their conspiracy either do not understand what 100% (or 0%) certainty means or are trolling.

There are two explanations for why the 50%-to-50% group is as prominent as it is. The substantive explanation is that when people are uncertain, they default to saying 50% (or “it’s a coin toss”). That is, they map any certainty level between 10% and 90% to 50%. The ‘methods’ (or artifact) explanation for the prominence of the 50%-to-50% group is that the certainty scale is initialized with the dial(?) at 50%. So, if the participant does nothing and just clicks through on both occasions, they will appear as having been 50% certain both times. And participants love nothing more than just clicking through.

This pattern of responses – particularly the clustering at 0%, 50%, and 100% – reminded me of my brief prior life when I was really into forecasting (i.e., trying to predict the future), a key subarea of which is probability elicitation. I suspect that no researcher would say that Joe Sixpack is going to faithfully map their subjective probability onto a 0-100 scale. So, looking at the English-language anchors on the scale, I discretized it according to the following mapping:

Category	Range
Def. False	0-15
Prob. False	15-35
Uncertain (Lean False)	35-45
Uncertain	45-55
Uncertain (Lean True)	55-65
Prob. True	65-85
Def. True	85-100

This allows me to visualize the trajectories thus:

Or, if you just want to see the distributions side by side:

Conspiracy Descriptions (Forthcoming)

In addition to replicating the certainty reduction result of the original paper, the much greater sample size of the online sample allows for a text analysis of people’s conspiracy beliefs. For now, I had ChatGPT categorize the user-generated conspiracy text into both narrow and broader categories (species and genus, if you will). As a preview, what are these (mostly) highly educated Democrats conspiratorial about? The top 10 conspiracies mentioned were:

Conspiracy	Mentions
9/11	2910
JFK Assassination	2621
COVID-19	2277
Moon Landing	2062
UFOs	1347
2020 Election	1136
Flat Earth	1046
Epstein	684
Trump Assassination Attempt	665
Climate Change	591

Quite a few classics. The people who (claim to) believe that “bIrDS AreNT rEaL” also showed up:

A Typology of Users and Their Conversations (Forthcoming)

This will be the most interesting part of the post (or perhaps publication further down the road). What aspects of conversations are associated with persuasion? Did gpt-4-1106-preview (or gpt-4o-2024-08-06) deploy different strategies with different users to more or less effect? Beyond persuasion, do different LLM conversational moves lead to more or less engagement? Investigating those interesting dyadic elements will have to come later.

For now, let’s look at the most basic measure of conversational quality: message length. The plot below breaks down message length by who is generating the words (LLM/assistant or visitor/user) and what turn in the conversation it is (where 1 is the each party’s opening message, 2 the first reply, etc.):

First, the fact that the left and right panes have different x-axis scales makes it less obvious that the ‘Assistant’ (the LLM) is writing a lot more than the user. Whereas the lion’s share of assistant messages are > 400 words, the distribution of user message lengths is extremely right-skewed, with most messages fewer than 50 words in length. I’ve also highlighted the modal number of words used in the first and fourth (last) turns, and you see a big drop-off in engagement. Many of the users’ extremely short last messages are either valedictions (bye) or expressions of gratitude (thanks, thank you). As you can see, it’s not just the last messages that are extremely short, though. Users’ extremely short messages are usually simple affirmations (yes, ok, sure, i agree) or unelaborated disagreement (no, i dont believe you, im not convinced).

The savvy consumers of survey research among you will wonder about mode effects. These mode effects are potentially relevant all over the survey, but nowhere are they as obviously relevant as in users’ interaction with the language model. It’s simply more costly to type responses when you’re working with two thumbs compared to a full set of 10 fingers. The plots below, which disaggregate by mobil vs. desktop use (and re-aggregate over all four turns) were surprising to me:¹¹

¹¹ Transparency note: I filtered out the 1% longest messages both among the users and assistants to prevent tails from overtaking both plots.

Histogram
Boxplot

I would have suspected that the distributions would be much more different than they in fact are. With the boxplot, you can eyeball that the 25th, 50th, and 75th percentile desktop responders are only writing about 4, 7, 15 more words, respectively, than their mobile counterparts.

A Typology of Users (Forthcoming)

From a theoretical perspective, it’s interesting to think about what moderates treatment effects. Are Republicans more resistant to corrections? What about strong partisans versus leaners? Which type of conspiracy theories are most ‘debunk-resistant’? The list goes on and on.

From a practical perspective, a researcher would want to classify respondents’ “engagement type.” Some unengaged respondents do the open-text version of straight-lining, while others are more engaged, but “in the wrong way” (i.e., trolling).¹² Ideally, one would have measured these people’s demographics and political background information at a prior time so that you could model engagment quality (including propensity to troll), but the way information is collected here there’s simply no way to do that. In future analyses I will attempt to classify people into the buckets of zero effort, pure trolling, pure hostility, and genuine engagement. Problematically, it’s only among the latter set that we can really trust their use of the 0-100 belief certainty scale.

¹² Of course, trolling as a result of treatment is itself interesting. For example, if you had two randomly assigned treatments and found that one led to half as much trolling as the other, that’s a really important outcome!

Appendix

In-Survey Attrition

The initiated among you will know about the scourge of missing data. Missing data can turn nice unbiased estimates into not-so-nice biased estimates. It turns random samples into SLOP and, depending on when nonresponse happens, it turns random assignment into non-random assignment, degrading causal quantities to mere correlations. It’s vexatious, en fin.

That said, this online survey has no pretention to be a random sample, nor is there random assignment to treatment. This makes it unclear how much of an additional problem missing data is. Nevertheless, here I walk the reader through where visitors attrit in the survey.

Visitors can be lost in two different ways. The first is that they simply close the survey (whereafter the rest of the values are simply NA). The second happens when the visitor makes makes a choice that means their data will not be useful. Let’s start with the latter class of cases.

Attrition via Respondent Choices

First, the visitor could click the (Optional) - Do not use my data button between assenting to the consent paragraph and cliking the Captcha. 34% of respondents do this.

All respondents’ conspiracy theory descriptions are sent to ChatGPT to check that what the visitor described is, in fact, a conspiracy. If ChatGPT determines that the participant has not described a conspiracy, the participant sees the following page:

Those who click Try other version are shuttled into the Disagree with experts arm, but the majority choose Continue. The AI determined that, among those who wrote at least 30 characers describing a purported conspiracy theory, 44% of them failed to describe a conspiracy theory. Of those 44%, 67% chose to continue in the conspiracy arm anyway. The data from those in the latter group is not straightforwardly useful in analyses about conspiracy theories since these people have not described a conspiracy. As a reminder, this failure to describe a conspiracy is prior to a failure to believe in the conspiracy. Which brings me to an interesting wrinkle.

Below, you see the scale the subjects used to indicate the extent to which they believe in their conspiracy theory.

What should we do if a person rates their conspiracy on the false side of uncertain? On the one hand, it might seem strange to talk about belief in conspiracies among people who don’t believe in conspiracies (where lack of belief in \(p\) is defined as giving less than 50% probability to \(p\)). On the other hand, and I believe it’s the more sophisticated hand, it makes just as much sense to say that an intervention reduced conspiracy belief from 40% to 20% as it does to talk about an 80% to 40% reduction. In both cases you reduced conspiracy belief by half, and there’s nothing magical about the fact that the 80-to-40 reduction crosses 50. So, in all the analyses here, I allow for initial belief to be less than 50%.¹³

¹³ Much as the distinction in hypothesis testing between ‘significant’ and ‘insignificant’ is not itself significant, the distinction between >50% credence and <50% credence is likewise not significant. Imagine you’re working at a nuclear reactor where most people believe the probability it will “go Chernobyl” in the next month is 0.0000001%, but there are two weirdos who believe that probability is 45% and 55%, respectively. It would be extremely bizarre to group the 45% fella with the 0.0000001% group, but that’s what treating 50% as a magical credence line does. In fact, not reifying 50% as a magical belief line seems especially relevant when talking about conspiracy belief. For example, imagine a person who gives “only” a 40% probability to the proposition that the earth is flat. Though they might not “believe” the earth is flat, they are wildly miscalibrated. If this all seems like pedantic necrotic horse over-flagellation, the Forecasting Research Institute released a study last year where they pitted a group concerned about AI safety against a group of superforecasters unconcerned about AI as an X-risk. If we let \(\textbf A\) mean AI causes existential catastrophe by year 2100, the concerned group came into the study with \(\Pr(\textbf A) = 28.4\%\), while the unconcerned group came in at \(\Pr(\textbf A) =0.5\%\). If they had to binarized belief at 50%, the entire debate would have evaporated since no one would have ‘believed’ in the existential risk posed by AI. It would also result in strange descriptions of reality, such as “There is a group of extremely concerned people dedicating their lives to reducing AI risk, even though they don’t believe in it.”

Attrition via Exit

Now, what about those who simply click out of the study? In the interactive plot below, you can hover your mouse over each point to see how many subjects were present at each point.¹⁴

¹⁴ This plot is currently malfunctioning. The points are in the correct places and the descriptions that appear when hovered over are correct, but the % still in study and % reduction information is not.

Disagreement with Experts Description and Descriptives

As mentioned above, not everyone who completes the survey completes the conspiracy reduction arm. After the visitor fills in the two text boxes available to describe a conspiracy and the evidence for it, those responses are sent to a language model to determine if the visitor has in fact described a conspiracy. If it determines that the user has not described a conspiracy, the user will see the following screen:

If they fill in the Try other version radio button (as pictured), they will flow into the Disagreement with experts stream. This results in a very self-selected group.

See the screenshots below if you’d like, but basically visitors are asked to describe a belief that most people (possibly including experts) would disagree with and to give the reasons why they find it compelling. If you want to see what an entire conversation looks like, those screenshots are below as well.

The counts below are very provisional, but I had ChatGPT code what groups were the implied groups the visitor disagreed with, and the 8 most common were:

Group	No. Disagreements
Scientists	77
Historians	63
Economists	58
Theologians	58
Political analysts	57
Psychologists	55
Medical experts	53
Experts	45

I also had ChatGPT code both what the implied consensus was and what alternative take the subject had on. Here are just a few of the more entertaining (users’ text followed by three columns of ChatGPT’s coding):

Disagreement with Experts Screens

For completeness, here’s what the visitor sees once they are in the Disagreement with Experts arm.

Sample Conspiracy Conversation: Lab Leak

For transparency, I’ve posted the conversation as screenshots.¹⁵ If the text is too wee for ye eyes, you can right-click any given image and then normal-click Open Image in New Tab on the menu that magically appears.

¹⁵ “Receipts,” as the kids say

Sample Experts Conversation: Keto Diet

Comments below.

The formatting got funky here.

In the conversation above, I stumbled into a conversational ‘failure mode.’ My disagreement with experts was, “I think that I think that the health benefits of ketogenic diets are more than just via reduction in calories consumed. Sugars, particularly simple sugars, might be uniquely bad for health (where weight is just one facet of overall health).”¹⁶ Now, I only gave this proposition a 55% chance of being true.¹⁷ And though I gave the LLM a claim that is at least plausible and I expressed almost maximal uncertainty about it, the LLM seemed forced to argue against my claim as if I had given it a ludricrous claim and said I was certain (e.g. smoking doesn’t actually cause lung cancer, 95%!). That is, neither the direction nor the degree to which user is miscalibrated is taken into account. It would make the behind-the-scenes design a bit more complicated, but ideally you would let the model change its argumentative tack based on the (lack of) discrepancy between the plausibility of the visitor’s view (conspiracy, disagreement with experts) and the probability the user assigns to that view. In that case, instead of “DebunkBot” it would be more “CalibrationBot” that reasons with the user in such a way to get to that probability.¹⁸

¹⁶ And my elaboration was, “I’ve known people who have lost a lot of weight on the keto diet, more so than any other diet. Also, I have a very vague but positive impression of Gary Taubes, a researcher associated with these ideas.”

¹⁷ In retrospect, 55% is pretty high. Were I to do it again I think I’d put it at 40%. I think I had some sort of glitch in my ‘credence to number’ mapping where I started (anchored) on 50% and then inched my probability upward to indicate that I thought the view has some merit. But a less dumb approach would have been to anchor on \(\Pr(\text{non-obvious health claim is true)}\) and then increment that upward to the extent the keto claim is more likely than a random non-obvious health claim.

¹⁸ The survey actually already calculates the plausibility of the claim, but as far as I can make out that information is only used to subset respondents’ change scores to generate personalized feedback like, “When compared to others who hold beliefs in conspiracies with similar plausibility and personal importance levels, your change in belief ranks at the 61st percentile. This means your level of belief change was greater than that of approximately 61% of similar respondents.”