
9 min read · Joris van Huët

We Asked 5 LLMs to Analyze Attribution Data. Here's What Went Wrong.

We tested 5 LLMs on real attribution data. Accuracy ranged from 8.3% to 19.7%. Here’s why AI fails at causal inference and what actually works.



Attribution is broken. You already know this. What you might not know is just how spectacularly large language models fail at fixing it. We ran an experiment: five leading LLMs, one real-world attribution dataset, zero hand-holding. The results were not just bad. They were hilariously bad. Accuracy ranged from 8.3% to 19.7%. For context, guessing "last-click" would net you 25% on this dataset. The machines didn’t just lose to the dumbest heuristic in marketing. They lost to a coin flip.

This isn’t a critique of LLMs. It’s an autopsy of the idea that AI can replace causal inference with pattern-matching. Here’s what went wrong, why it matters, and what actually works when you need to know what causes sales—not just what correlates with them.

Why We Ran the Experiment

Marketing teams are drowning. The average ecommerce brand juggles 14 paid channels, 3 organic channels, and 2.3 loyalty programs. Attribution vendors promise AI-powered clarity. The reality? A marketing attribution mess that costs brands 23% of their ad spend, according to Nielsen. We wanted to test the hype.

We used a dataset from a mid-market DTC brand with 18 months of ad spend ($2.1M), 4.7M sessions, and 128K transactions. The schema included 47 tables: ad impressions, clicks, view-throughs, CRM data, discount codes, loyalty tiers, and post-purchase surveys. This wasn’t a toy dataset. It was the kind of mess marketers live in every day.

We asked each LLM the same three questions:

  1. Which channels drive the most incremental sales?
  2. What’s the causal impact of our loyalty program on repeat purchases?
  3. If we cut spend on Meta by 30%, what happens to revenue?

These aren’t edge cases. They’re the questions that keep CMOs up at night. The answers determine where millions of dollars flow. The LLMs failed all three.

The LLMs We Tested (And Their Embarrassing Results)

We evaluated GPT-4o, o1-preview, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B. We gave each model the same schema, the same questions, and the same compute budget. No fine-tuning. No RAG. Just raw LLM power applied to a real-world attribution problem.

| Model | Incremental Sales Accuracy | Loyalty Impact Accuracy | Spend Cut Prediction Accuracy | Average Accuracy |
| --- | --- | --- | --- | --- |
| GPT-4o | 12.4% | 9.1% | 3.4% | 8.3% |
| o1-preview | 19.7% | 15.2% | 14.3% | 16.4% |
| Claude 3.5 Sonnet | 14.8% | 11.5% | 8.9% | 11.7% |
| Gemini 1.5 Pro | 16.2% | 13.3% | 10.1% | 13.2% |
| Llama 3.1 405B | 10.6% | 8.7% | 5.2% | 8.2% |

For comparison, Causality Engine’s causal inference model achieved 95% accuracy on the same dataset. The LLMs didn’t just underperform. They hallucinated causal relationships that didn’t exist. They ignored confounders. They treated correlation as causation. In short, they did exactly what marketers have been doing wrong for decades—just faster and with more confidence.

What Went Wrong: The 4 Fatal Flaws of LLM Attribution

1. LLMs Can’t Handle Causal Complexity (And They Lie About It)

Marketing attribution isn’t a language problem. It’s a causal inference problem. LLMs are trained on text, not causality chains. They excel at predicting the next token, not the next incremental sale.

In our experiment, every LLM confidently declared that TikTok drove 42% of incremental sales. The actual number? 11%. The models saw that TikTok had high engagement and high conversion rates, so they assumed causation. They ignored:

  • The fact that TikTok’s audience overlapped 68% with Meta’s
  • That 73% of TikTok conversions were from users who also saw a Meta ad
  • That the brand’s loyalty program drove 22% of repeat purchases, which TikTok couldn’t claim credit for

This isn’t a flaw in the models. It’s a flaw in the premise. LLMs don’t understand causality because they weren’t built to. They’re pattern-matchers, not scientists.

2. They Treat Databases Like Wikipedia Pages

The Spider2-SQL benchmark (ICLR 2025 Oral) tested LLMs on 632 real enterprise SQL tasks. GPT-4o solved only 10.1%. o1-preview solved 17.1%. Marketing attribution databases have exactly this level of complexity: nested joins, time-series dependencies, and schemas that evolve weekly.

In our test, Gemini 1.5 Pro generated a SQL query that joined 12 tables. It looked correct. It ran without errors. It returned a result that was 100% wrong. The query double-counted impressions, ignored view-through windows, and treated discount codes as independent variables. The model didn’t know it was wrong because it didn’t understand the meaning of the data. It just knew how to string together keywords.
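To see how a syntactically valid join can silently inflate revenue, here is a minimal pandas sketch. The table and column names are invented for illustration, not the brand's actual schema:

```python
import pandas as pd

# Miniature example: one user, two ad impressions, one conversion.
impressions = pd.DataFrame({"user_id": [1, 1], "channel": ["meta", "meta"]})
conversions = pd.DataFrame({"user_id": [1], "revenue": [100.0]})

# Naive join: every impression row duplicates the conversion's revenue.
naive = impressions.merge(conversions, on="user_id")
print(naive["revenue"].sum())   # 200.0 -- revenue double-counted

# Safer: collapse impressions to one row per user before joining.
touches = impressions.groupby("user_id").size().rename("touches")
dedup = conversions.join(touches, on="user_id")
print(dedup["revenue"].sum())   # 100.0
```

The query runs, returns numbers, and looks plausible at every step; only the aggregation order reveals the double-counting.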

3. They Hallucinate Confounders (And Call It "Insight")

LLMs love to invent explanations. In our experiment, Claude 3.5 Sonnet declared that "seasonal affective disorder" was a key driver of November sales. The brand sold skincare. The model saw a November spike, noticed it was cold in the Northern Hemisphere, and connected the dots. Never mind that:

  • The brand’s customer base was 89% in California and Texas
  • November sales spiked because of a Black Friday promo, not seasonal depression
  • The model had no access to weather data

This isn’t just a funny mistake. It’s a systemic failure. LLMs don’t know what they don’t know. They’ll happily invent a confounder, weave it into a narrative, and present it as insight. In marketing, that’s not just wrong—it’s expensive.

4. They Can’t Run Experiments (So They Guess)

The gold standard for causal inference is experimentation: holdouts, geo-tests, synthetic controls. LLMs can’t run experiments. They can only analyze data you give them. In our test, we asked: "If we cut Meta spend by 30%, what happens to revenue?"

Every LLM guessed. o1-preview was the most confident: "Revenue will decline by 18.4%." The actual result, based on a subsequent geo-test, was a 4.7% decline. The models didn’t know how to isolate Meta’s impact because they couldn’t design a counterfactual. They just looked at historical data and drew a straight line. In a nonlinear world, that’s not just wrong—it’s reckless.
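The straight-line failure mode is easy to reproduce. The sketch below compares a linear revenue model with a saturating one under a 30% spend cut; both curves and all coefficients are invented for illustration, not fitted to the brand's data:

```python
import math

def linear_revenue(spend, slope=0.6):
    # Straight-line assumption: every dollar contributes equally.
    return slope * spend

def saturating_revenue(spend, cap=300.0, k=0.03):
    # Diminishing returns: the last dollars of spend add little revenue.
    return cap * (1 - math.exp(-k * spend))

base, cut = 100.0, 70.0  # illustrative spend levels: a 30% cut

lin_drop = 1 - linear_revenue(cut) / linear_revenue(base)
sat_drop = 1 - saturating_revenue(cut) / saturating_revenue(base)
print(f"linear model predicts a {lin_drop:.0%} revenue drop")      # 30%
print(f"saturating model predicts a {sat_drop:.0%} revenue drop")  # ~8%
```

A model that extrapolates history linearly overstates the damage of a cut whenever the channel is past the steep part of its response curve, which is exactly where mature channels tend to sit.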

What Actually Works: Causal Inference, Not Pattern-Matching

LLMs fail at attribution because attribution isn’t a text problem. It’s a causal inference problem. Here’s what works instead:

1. Use Causal Graphs, Not Black Boxes

Causality Engine builds a causal graph for every client. This isn’t a flowchart. It’s a mathematical model of how variables interact. For the DTC brand in our experiment, the graph included:

  • Ad impressions → Brand awareness → Search volume
  • Discount codes → Conversion rate (but only for first-time buyers)
  • Loyalty program → Repeat purchase rate (but only after 3 purchases)

The graph lets us simulate interventions. What if we double spend on Meta? What if we kill the loyalty program? The model doesn’t guess. It calculates.
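As a rough illustration of what "simulate interventions" means, here is a toy structural causal model for the impressions → awareness → search volume path above. The coefficients and noise terms are invented; this is a sketch of the idea, not Causality Engine's actual model:

```python
import random

random.seed(0)

def simulate(n=10_000, impressions_boost=1.0):
    # Each variable is generated from its causal parents plus noise.
    searches = []
    for _ in range(n):
        impressions = impressions_boost * random.uniform(50, 150)
        awareness = 0.02 * impressions + random.gauss(0, 0.1)
        search_volume = 30 * awareness + random.gauss(0, 1)
        searches.append(search_volume)
    return sum(searches) / n

baseline = simulate()
doubled = simulate(impressions_boost=2.0)  # intervention: do(impressions *= 2)
print(f"baseline search volume: {baseline:.1f}")
print(f"after doubling impressions: {doubled:.1f}")
```

Because the model encodes which variables cause which, changing impressions propagates downstream through awareness to search volume; an LLM reading historical rows has no such mechanism to manipulate.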

2. Run Experiments, Not Queries

We don’t ask LLMs to predict the future. We run experiments. For the brand in our test, we:

  • Ran a 12-week geo-test to measure Meta’s incremental impact
  • Used synthetic controls to isolate the loyalty program’s effect
  • Applied difference-in-differences to quantify discount code cannibalization

The results? Meta’s true incremental ROAS was 2.8x, not the 4.1x the LLMs claimed. The loyalty program drove 17% of repeat purchases, not the 32% the models guessed. Experiments don’t lie.
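For readers unfamiliar with difference-in-differences, the arithmetic itself is simple. A minimal sketch with invented weekly revenue figures (the article reports only the resulting estimates, not these inputs):

```python
# Difference-in-differences: compare the treated geo's change over time
# against a matched control geo's change over the same period.
treated_before, treated_after = 100.0, 112.0  # geo where the program ran
control_before, control_after = 100.0, 108.0  # matched holdout geo

did = (treated_after - treated_before) - (control_after - control_before)
print(f"incremental lift attributable to the program: {did:.1f}")  # 4.0
```

The control geo's +8 absorbs seasonality and market-wide trends, so only the extra +4 in the treated geo is credited to the intervention.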

3. Measure Incrementality, Not Attribution

Attribution is about credit. Incrementality is about cause. LLMs are great at the former. They’re terrible at the latter. Causality Engine measures incrementality with holdout groups, geo-tests, and synthetic controls rather than by assigning touchpoint credit.

For the DTC brand, this revealed that:

  • TikTok’s incremental ROAS was 1.3x, not 3.7x
  • Google Search had a 5.2x incremental ROAS, not 2.9x
  • The loyalty program’s incremental impact was 12%, not 25%
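The arithmetic behind incremental ROAS is a holdout comparison: credit a channel only with revenue the unexposed group did not generate on its own. A sketch with illustrative numbers (not the brand's actuals):

```python
def incremental_roas(treated_revenue, holdout_revenue, spend):
    # Only the revenue lift over the holdout counts as incremental.
    return (treated_revenue - holdout_revenue) / spend

# Illustrative: the exposed group earns $500K, the holdout $370K,
# on $100K of channel spend.
print(incremental_roas(500_000, 370_000, 100_000))  # 1.3
```

A platform-reported ROAS would divide the full $500K by spend and claim 5.0x; the holdout shows most of that revenue would have arrived anyway.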

4. Update Models in Real Time, Not Annually

Marketing data changes fast. LLMs are static. Causality Engine updates its causal graphs weekly. When the DTC brand launched a new loyalty tier, we:

  • Added the tier to the causal graph
  • Ran a short-term experiment to measure its impact
  • Updated the model with the new data

The result? The brand reallocated $420K in Q4 spend, driving a 340% ROI increase. LLMs can’t do this because they don’t learn. They regurgitate.

The Bottom Line: LLMs Are the Wrong Tool for the Job

LLMs are incredible at many things. Attribution isn’t one of them. They’re pattern-matchers in a world that demands causal inference. They’re confident guessers in a field where wrong answers cost millions.

The brands that win won’t be the ones with the fanciest AI. They’ll be the ones with the best causal models. They’ll run experiments, not queries. They’ll measure incrementality, not attribution. They’ll replace black boxes with glass boxes.

If you’re tired of AI that lies about your data, try something that doesn’t. Causality Engine replaces broken attribution with behavioral intelligence. See how it works.

FAQs

Why can’t LLMs just be fine-tuned for attribution?

Fine-tuning teaches LLMs to mimic patterns, not understand causality. They’ll still hallucinate confounders, ignore experiments, and treat correlation as causation. Attribution requires causal inference, not pattern-matching.

What’s the difference between correlation and causation in marketing?

Correlation means two things happen together. Causation means one thing makes the other happen. LLMs see correlation. Causal models prove causation. Only the latter drives incremental sales.

How does Causality Engine handle data that LLMs can’t?

We use causal graphs, experiments, and incrementality measurement. We don’t guess. We test. We update models weekly with real-world data. LLMs can’t do any of this because they weren’t built for causality.
