Kam and Burge Replication

In progress

Author: Justin Dollman
Published: August 23, 2025

In the following, the reader will find a (data) replication and LLM extension of Kam and Burge (2018), a qualitative investigation into respondents’ reactions to the four items of the racial resentment scale. I do not expect this to be of wide interest – it is mainly an internal document, a “proof of concept” for a project that will involve LLMs coding unstructured text on similar topics.

The contribution of the Kam and Burge (2018) paper was primarily its deployment of the “stop and reflect” intervention (Ericsson & Simon, 1980) on the RR scale. This entails stopping respondents immediately after they give their agree/disagree response to an RR item and asking, “Thinking about the question you just answered, exactly what things went through your mind? Please type your response below.” TODO: Finish introductory spiel.

As a reminder, the four-item battery contains the following items:

• “Irish, Italians, Jewish, and many other minorities overcame prejudice and worked their way up. Blacks should do the same without any special favors.”
• “Generations of slavery and discrimination have created conditions that make it difficult for blacks to work their way out of the lower class.”
• “Over the past few years, blacks have gotten less than they deserve.”
• “It’s really a matter of some people not trying hard enough; if blacks would only try harder they could be just as well off as whites.”

Quantitative Analysis

As mentioned in the introduction, the main contribution of KB18 was its qualitative approach to the RR scale. Because I am bad at time management, I started carrying out a small exploratory analysis on the quantitative part as well. When I come back to this, I will likely reproduce their regression results just for completeness’ sake, but for now here are some plots.

In case you’re looking at that peak at 0.5 and wondering whether a plurality of respondents straight-lined: only 5.5% of respondents “straight-lined” the middle option. But if we look at the most common response patterns, the all-middle pattern is the plurality!

responses percent n
3, 3, 3, 3 5.5% 99
5, 1, 1, 5 4.5% 81
1, 5, 5, 1 2.1% 38
3, 4, 3, 3 1.4% 26
4, 2, 2, 4 1.4% 26

Interestingly, after the straight-liners, we get opposing extremophiles, the two ideologically pure groups. The “5, 1, 1, 5” group represents a perfectly racially resentful response set, and its inverse “1, 5, 5, 1” represents the antiracist response set.
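To reproduce a table like the one above, something like the following dplyr sketch works. The agree-disagree column names q42/q44/q46/q48 are my assumption (only q44 is confirmed as an agreement item later in this document), so treat this as illustrative rather than as the exact code behind the table:

    # Sketch: tally the most common four-item response patterns.
    # The agree-disagree column names q42/q44/q46/q48 are assumed; only q44
    # is confirmed below. `data_raw` is the survey data used later on.
    library(dplyr)
    library(tidyr)

    data_raw %>%
      select(q42, q44, q46, q48) %>%
      drop_na() %>%
      unite("responses", everything(), sep = ", ") %>%
      count(responses, sort = TRUE) %>%
      mutate(percent = scales::percent(n / sum(n), accuracy = 0.1)) %>%
      head(5)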

Here are the univariate response profiles to the four items:

These are coded so that 0 is the least “racially resentful” option and 1 is the most “racially resentful.” I don’t think I would have predicted that the “Italians, Irish, and Jewish …” item would be the one that is the “easiest” (in item response theory terms).
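For concreteness, the 0-1 coding amounts to reverse-scoring the two items where agreement is the least resentful response (the “generations of slavery” and “gotten less than they deserve” items) and rescaling to the unit interval. A sketch, with the same assumed column names as above:

    # Sketch: map 1-5 agreement onto a 0-1 "racial resentment" score per item.
    # Agreeing with the "slavery" and "deserve less" items is the *least*
    # resentful response, so those two are reverse-scored. Column names assumed.
    rr_scored <- data_raw %>%
      mutate(
        rr_work_way_up = (q42 - 1) / 4,  # agree = more resentful
        rr_slavery     = (5 - q44) / 4,  # reversed
        rr_deserve     = (5 - q46) / 4,  # reversed
        rr_try_harder  = (q48 - 1) / 4   # agree = more resentful
      )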

Lastly, if you’re wondering about items’ intercorrelations:
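A quick way to compute these yourself, using the rescaled items from the sketch above:

    # Sketch: pairwise correlations among the four rescaled items.
    rr_scored %>%
      select(starts_with("rr_")) %>%
      cor(use = "pairwise.complete.obs") %>%
      round(2)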

TODO: Re-run the regressions featured in the paper; check for measurement equivalence between the white and Black respondents.1

  • 1 A key aspect of the paper was the (intentional) over-sampling of Black respondents to obtain a sample of 900 white and 900 Black respondents.

Qualitative Analysis

    Before getting into the coding of the data, the minimal-trust reader is likely curious what the actual data looks like. Here you can see the first 15 rows of the data, subset to two columns:

    • Q44: respondents’ 1-5 agreement level to the question, “Generations of slavery and discrimination have created conditions that make it difficult for blacks to work their way out of the lower class” (1 being ‘Disagree strongly’ and 5 being ‘Agree strongly’)
    • Q45: the free text that ensued after, “Thinking about the question you just answered, exactly what things went through your mind? Please type your response below.”
    q44 q45
    3 none
    4 this is true, but we should all try a little harder
    4 drug dealers
    4 Stereotypes
    1 they just need to have the will to get ahead. no more excuses
    4 In my opinion, other races look down on the black community and in others' eyes, we are and probably always will be considered " low class".
    4 Although America is better I is hard to become equal the dominant people will always hire the people they are most comfortable with. Its hard to close the economic gap
    1 We are so far past those days. If someone has bad experiences they should move to new groups of friends or create their own path
    3 best
    3 moderately
    1 Why are you asking me racist questions?
    1 none
    2 no decreasing
    4 people can be racest, but they cando a lot.
    2 happy

    As you can see, response quality has … high variance.2

  • 2 This is actually a great use-case for LLMs intervening in surveys. Between submitting their free-text response and seeing the next item, an LLM could be asked to rate the effort/intelligibility of the free-text response and, if lacking, could re-prompt them in the same way that certain fields are mandatory when filling out surveys. That said, one- or two-word responses are so low effort that you don’t even need an LLM; you just need to count the number of spaces in their response.
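    To make that footnote concrete, here is the zero-LLM version of the filter. It is a sketch only: the threshold is arbitrary, and qual_data is assumed to be the long-format data with one free-text response per row, as used in the LLM-call code below.

    # Sketch: flag ultra-low-effort responses by word count alone (no LLM needed).
    # The < 3 threshold is arbitrary and purely illustrative.
    library(dplyr)
    library(stringr)

    qual_data %>%
      mutate(
        n_words    = str_count(str_squish(response), "\\S+"),
        low_effort = n_words < 3
      )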

  • Perhaps because of the substantial frequency of low-quality responses, the very first feature of the text that KB’s RAs coded was:

    Codeable content? (code)

    • 0 = no: (use if: blank/R writes don’t know/R agrees or disagrees with question/general rule-one word answers are not code-able with the exceptions of equality or racism)
    • 1 = yes: (use if: R provides any idea that is understandable and suggests some potentially code-able reaction to the question-including remarks about the question wording/bias in the questions/vagueness in the questions)3
  • 3 I won’t go over the entire codebook here; you can check it out in the appendix!

  • TODO: Decide how much more framing/prose you want here. For now I’m just diving straight into the “results.”

    Human IRR

    Briefly, before examining how well GPT-4o labelled this data, we should see just how “gold” our “gold standard” human data is. For each of the eight substantive dimensions as well as the “is this even codeable?” dimension, we can calculate the interrater reliability of the two human raters. Due to resource constraints, the two raters did not both code the entire dataset. We will have to make do with the 783 cells that both coded.
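    One natural IRR statistic for categorical codes like these is Cohen’s kappa; per dimension, the computation would look something like the sketch below. The double_coded object and its “_r1” / “_r2” column suffixes are hypothetical stand-ins for however the doubly coded subset is stored.

    # Sketch: per-dimension Cohen's kappa for the two human coders.
    # `double_coded` and the "_r1" / "_r2" column suffixes are hypothetical.
    library(irr)
    library(purrr)

    dims <- c("code", "bexpment", "wexpment", "tblk",
              "individ", "discrim", "welf", "obama", "rxn")

    human_irr <- map_dfr(dims, \(d) {
      ratings <- double_coded[, paste0(d, c("_r1", "_r2"))]
      tibble::tibble(dimension = d, kappa = irr::kappa2(ratings)$value)
    })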

    How did the human raters do?

    AI-AI IRR

    Now, let’s see how our LLMs did. The “hopefully irrelevant” variation I injected into the labelling task was how many items the LLM saw at a time.4 Because each respondent gave four responses, I had two ways to present the text to the LLM:

  • 4 You could call this “batch size.”

    • I could present the LLM with all four of a respondent’s responses at the same time and ask it to code each (giving a total of \(4 \times 9 = 36\) codes), or
    • I could present the LLM with just one response at a time.

    The ‘respondent-level batch’ approach has two benefits. First, it results in approximately 25% of the API calls, thereby saving some serious dollars. Second, the respondent might reference their previous responses. If that is the case, siloing each response creates an unresolvable anaphora, and we all hate unresolvable anaphoras. On the other hand, one benefit of siloing individual responses is that we might not trust LLMs to code each response’s content independent of the surrounding content.5 We also might worry that quality will degrade if we ask for too many responses at once. That is, somewhere between trusting an LLM for 9 pieces of information and 36 pieces of information, our trust runs out.

  • 5 For concreteness, imagine with me that a respondent mentions Obama in their second open-ended response. The model then might carry that over to the third and/or fourth codes and incorrectly say that the respondent mentioned Obama once or twice more.

  • Without further ado, here are the results. I’ve (again) plotted the humans’ IRRs, this time as benchmarks for the silicon IRRs.6

  • 6 Upon my second draft, I’ll incorporate this into the main text, but for now a marginalium. The IRR for the LLMs is the reliability between one instance of GPT-4o seeing just one response at a time and another instance seeing all four responses at a time. Given a less limited time and money budget, I would have run each of those codings twice, which would have given me “pure noise” IRRs that would serve as baselines against which I could judge the IRR I actually observe. Also of note, whereas the humans’ IRR is based on a relatively small subset of the data, the LLMs’ IRR is based on the entire dataset.

  • As you can see, there isn’t a universal winner on the dimension of IRR. To this researcher, the most curious (and troubling) aspect was the LLMs’ lack of reliability on two really simple binary dimensions: bexpment and wexpment, explicit mentions of Blacks and whites, respectively.

    Reminder: high IRR doesn’t necessarily mean high quality!

    Below is a little app I’m working on that allows researchers to efficiently get below the IRR’s single-number summary and uncover the source of a low IRR. You can select which of the nine dimensions you want to see raw data on and how many rows (examples) you want; you can also resample disagreements to see another batch of examples. For each dimension you select, you’ll also get a confusion matrix that shows the overall structure of (dis)agreement between the two coders.
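    Since the app itself only runs in the interactive document, here is the non-interactive core of what it does. The merged data frame and the “_coder1” / “_coder2” column suffixes are hypothetical names for however the two coders’ labels sit side by side.

    # Sketch: the non-interactive core of the disagreement explorer.
    # `merged` (one row per coded response) and the "_coder1"/"_coder2"
    # column suffixes are hypothetical.
    library(dplyr)

    show_disagreements <- function(merged, dim, n = 10) {
      cols <- paste0(dim, c("_coder1", "_coder2"))
      # confusion matrix: the overall structure of (dis)agreement
      print(table(coder1 = merged[[cols[1]]], coder2 = merged[[cols[2]]]))
      # a (re)sampleable batch of raw disagreements on this dimension
      merged %>%
        filter(.data[[cols[1]]] != .data[[cols[2]]]) %>%
        slice_sample(n = n)
    }

    # e.g., show_disagreements(merged, "tblk", n = 5)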


    Human-AI Discrepancies

    Now that we’ve seen that it would take a little more prompt engineering to have an LLM correctly segment the text it’s looking at, let’s go with the single-coding and see the extent to which GPT-4o (dis)agreed with human raters. For each dimension, I’ll calculate two IRRs, one between human coder one and GPT-4o, and the other between human coder two and GPT-4o.


    Assorted TODOs

    • After building some nice automations into my callm package, I’ll subset the data and carry out the following experiments:
      • see if Claude’s labelling behavior is substantially different
      • given the overzealousness with which GPT-4o perceived the presence of both codeable content and mentions of characteristics (both when coding isolated responses and when simultaneously coding all four responses), I’ll want to see if some chastisement in the system prompt prevents this
    • The most interesting extensions will be
      • Seeing how the authors’ regression results change after accounting for differential usefulness of responses. That is, use (something like) a selection model to model the conditional associations7 after taking into account that some responses didn’t even have the chance to e.g. mention individualism or traits of blacks.
      • Examining if different items evoked different values (since that was the purpose of the study, after all). That is, do items that prima facie’ly load onto individualism differentially precede mentions of the value of individualism?8
  • 7 A sloppier person would call these regression coefficients effects. But remember, the first rule is to be mostly harmless.

  • 8 I’m honestly not sure how the authors got away with not doing this given its centrality to the paper’s core contention!

Appendix

    In the appendix you’ll find a Markdown version of the Kam and Burge qualitative codebook. This is presumably what the two coders used to label responses. In the sections where I create the prompt, you’ll see the code read in open-ended-codebook.txt. That is simply a markdown file with the codebook.

    Codebook

    Codeable content? (code)

    • 0 = no: (use if: blank/R writes don’t know/R agrees or disagrees with question/general rule-one word answers are not code-able with the exceptions of equality or racism)
    • 1 = yes: (use if: R provides any idea that is understandable and suggests some potentially code-able reaction to the question-including remarks about the question wording/bias in the questions/vagueness in the questions)

    Blacks explicitly Mentioned (bexpment)

    • 0 = no blacks explicitly mentioned
    • 1 = blacks explicitly mentioned

    Whites explicitly mentioned (wexpment)

    • 0 = no whites explicitly mentioned
    • 1 = whites explicit (including Irish, Italians, Jews)

    Traits of Blacks (tblk)

    We’re looking for explicit mentions about disposition or characteristics

    • 0 = no mention
    • 1 = positive (endorsed or mentioned with no sign of disagreement: hardworking, self-reliant; suffering/struggling/facing/overcoming adversity overcoming race, and or smart)
    • 2 = positive & negative (endorsed or mentioned with no sign of disagreement)
    • 3 = negative (endorsed or mentioned with no sign of disagreement/ lazy, dependent / not taking responsibility for oneself / making excuses/ not overcoming adversity/taking advantage of “race” card/ angry/ dwelling on the past/ and or poor, low SES, uneducated)
    • 11 = Society – positive (but R disagrees)
    • 12 = Society – negative (but R disagrees)
    • 21 = blacks are like other groups in disposition; “blacks and whites are”

    Individualism / American Dream / “work your way up”/ self-determination (individ)

    • 0 = no individualism mentioned
    • 1 = affirmation of individualism as a principle-mentioned as an existing rule for all/universal (need a specific reference to working hard/helping oneself/earning one’s way/doing it on one’s own/other minorities worked their way up, blacks should too/ explicit mention of another group)
    • 2 = mentioned as “not enough” (“trying is not enough” for blacks/blacks are trying harder/blacks have to try harder/reference to the American dream being broken)
    • 3 = some people (any race) try to flout individualism (“get something for nothing”/ being lazy/not trying hard/blacks get more than they deserve/blacks are the favored race)

    Discrimination (discrim)

    • 0 = no mention
    • 1 = against discrimination; it should not exist; for equality
    • 2 = against discrimination but not for special treatment; against special treatment/privilege based on race
    • 10 = discrimination mentioned broadly
    • 11 = exists / is a problem for blacks (is a problem/blacks face stereotypes/racial pigment makes a difference/legacy of slavery)
    • 21 = exists and is a problem for whites (discrimination against whites/reverse racism/ special favors)
    • 31 = discrimination not as big of a problem as before and or slavery happened but is not the source of current problems/ blaming the past is not taking responsibility for individual actions/denial of continuing discrimination against blacks/ denial of a discriminatory societal feature that has an effect on blacks
    • 41 = discrimination is a problem for everyone; “everyone is a little bit racist”
    • 51 = “this is not a race thing” (misc, could be disposition, could be situation)

    Welfare (welf)

    explicit mentions of welfare (this code should not capture affirmative action)

    • 0 = no mention
    • 1 = discussion of welfare/welfare programs/government programs; checks; dependency; handouts; public aid; food stamps; TANF

    Obama explicitly mentioned (obama)

    • 0 = no mention of Obama
    • 1 = mentioned Obama

    Reactions to the questions themselves (rxn)

    • 0 = no reaction
    • 1 = racist /biased/prejudice
    • 2 = emotionally upset/angry
    • 3 = dispute accuracy of statement/question statement
    • 4 = state that it is a stupid question
    • 5 = don’t understand question

    Prepping Open-Ended Response Data, Creating the Prompt, LLM Calls

    Below is the code that creates the prompts I sent to the LLMs.

    # Note that these numbers (e.g., 43) correspond to respondents' open-ended text, not the agree-disagree response to that question
    q43_text <- "Irish, Italians, Jewish, and many other minorities overcame prejudice and worked their way up. Blacks should do the same without any special favors"
    q45_text <- "Generations of slavery and discrimination have created conditions that make it difficult for blacks to work their way out of the lower class"
    q47_text <- "Over the past few years, blacks have gotten less than they deserve"
    q49_text <- "It’s really a matter of some people not trying hard enough; if blacks would only try harder they could be just as well off as whites"

    One Q at a time

    # Packages used throughout the prompt-construction and coding pipeline
    library(tidyverse)  # %>% pipe, case_when(), pivot_longer(), map2_chr(), ...
    library(glue)       # glue(), glue_collapse()

    markdown_codebook <- 
      readLines('open-ended-codebook.txt') %>% 
      glue_collapse(sep = '\n')
    
    response_skeleton <- '{
      "code": "...",
      "bexpment": "...",
      "wexpment": "...",
      "tblk": "...",
      "individ": "...",
      "discrim": "...",
      "welf": "...",
      "obama": "...",
      "rxn": "..."
    }'
    
    system_promptify <- function(item) {
      
      question_text <- case_when(
        item == 'q43' ~ q43_text,
        item == 'q45' ~ q45_text,
        item == 'q47' ~ q47_text,
        item == 'q49' ~ q49_text
      )
      
      single_response_prompt <- glue(
        'You are an expert coder of open-ended survey responses. Your task is to analyze and categorize a response according to the codebook provided below:
    
    <CODEBOOK>
    {markdown_codebook}
    </CODEBOOK>
    
    Return **only** a single JSON object, with no additional text or markdown fences.
    
    Required keys (in this order):
      "code", "bexpment", "wexpment", "tblk",
      "individ", "discrim", "welf", "obama", "rxn"
    
    For each key:
    
    - Use an integer drawn from the permitted set listed in the codebook.
    - If the response is uncodeable, use each key\'s "0 = not present / NA" value.
    
    Response skeleton:
    
    {response_skeleton}
    
    After survey-takers rated their agreement with the question, "{question_text}," they then saw "Thinking about the question you just answered, exactly what things went through your mind? Please type your response below."'
      )
      
      single_response_prompt
    }
    single_responses <- callm::single_turns(
      user_msgs = paste("Survey-taker's response:", qual_data$response),
      system_msgs = system_promptify(qual_data$item),
      org = 'openai',
      model = 'gpt-4o-2024-11-20' # most recent snapshot available at the time
      )
    
    saveRDS(single_responses, 'llm_data/coded_llm_responses.rds')

    Four Q at a time

    question_lookup <- c(
      q43 = q43_text,
      q45 = q45_text,
      q47 = q47_text,
      q49 = q49_text
    )
    
    qual_multi <- data_raw %>% 
      select(id, q43, q45, q47, q49) %>%
      pivot_longer(-id, names_to = "item", values_to = "response",
                   values_drop_na = TRUE) %>% 
      arrange(id, item) %>% 
      group_by(id) %>% 
      summarise(
        user_msg = glue_collapse(
          map2_chr(item, response, \(qid, txt) {
            glue("[{qid}] {question_lookup[[qid]]}\nResponse: {trimws(txt)}")
          }),
          sep = "\n\n"          # blank line between blocks
        ),
        .groups = "drop"
      )
    
    response_skeleton <- '{
      "code": <int>,
      "bexpment": <int>,
      "wexpment": <int>,
      "tblk": <int>,
      "individ": <int>,
      "discrim": <int>,
      "welf": <int>,
      "obama": <int>,
      "rxn": <int>
    }'
    
    multi_system_prompt <- glue(
    'You are an expert coder of open‑ended survey responses.  
    Each API call will give you **all of one respondent\'s answers** (up to four).  
    Your task is to code **each answer individually** using the codebook below.
    
    Every answer block begins with its question ID in square brackets, e.g. `[q43]`, followed by the full wording, then a line starting `Response:` with the respondent’s text.
    
    <CODEBOOK>
    {markdown_codebook}
    </CODEBOOK>
    
    Return **one JSON object** whose top‑level keys are only the IDs that actually appear (omit keys for unanswered questions). **Note**: even if you see e.g. "Response: nothing" this means that the respondent _did_ respond, but that their response was "nothing" (this would be a `"code": 0` situation). In sum, if you see a question ID in square brackets, the survey-taker responded to that question.
    
    For every present ID supply this exact nine‑field sub‑object, keeping the fields in order:
    
    {response_skeleton}
    
    Do **not** wrap the JSON in markdown fences or add commentary.')
    
    multi_responses <- callm::single_turns(
      user_msgs   = qual_multi$user_msg,
      system_msgs = rep(multi_system_prompt, nrow(qual_multi)),
      org         = "openai",
      model       = "gpt-4o-2024-11-20"
    )
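
    To close the loop between these calls and the IRR analysis above, the replies have to be parsed back into a coder-by-dimension data frame. Below is a minimal sketch under the assumption that single_turns() hands back one raw JSON string per call; if it returns a richer object, pull the completion text out first.

    # Sketch: parse the four-at-a-time JSON replies into one row per (id, item).
    # Assumes `multi_responses` holds one raw JSON string per respondent, in the
    # same order as `qual_multi`; adjust if callm::single_turns() returns a
    # richer object.
    library(jsonlite)
    library(purrr)
    library(dplyr)

    multi_coded <- map_dfr(seq_along(multi_responses), \(i) {
      parsed <- fromJSON(multi_responses[[i]])   # named list keyed by q43/q45/q47/q49
      imap_dfr(parsed, \(codes, qid) {
        as_tibble(codes) %>%
          mutate(id = qual_multi$id[i], item = qid, .before = 1)
      })
    })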