
9 min read · Joris van Huët

We Asked 5 LLMs to Analyze Attribution Data. Here's What Went Wrong.

We tested 5 LLMs on real attribution data. Accuracy ranged from 8.3% to 19.7%. Here’s why AI fails at causal inference and what actually works.



Attribution is broken. You already know this. What you might not know is just how spectacularly large language models fail at fixing it. We ran an experiment: five leading LLMs, one real-world attribution dataset, zero hand-holding. The results were not just bad. They were hilariously bad. Accuracy ranged from 8.3% to 19.7%. For context, guessing "last-click" would net you 25% on this dataset. The machines didn’t just lose to the dumbest heuristic in marketing. They lost to a coin flip.

This isn’t a critique of LLMs. It’s an autopsy of the idea that AI can replace causal inference with pattern-matching. Here’s what went wrong, why it matters, and what actually works when you need to know what causes sales—not just what correlates with them.

Why We Ran the Experiment

Marketing teams are drowning. The average ecommerce brand juggles 14 paid channels, 3 organic channels, and 2.3 loyalty programs. Attribution vendors promise AI-powered clarity. The reality? A marketing attribution mess that costs brands 23% of their ad spend, according to Nielsen. We wanted to test the hype.

We used a dataset from a mid-market DTC brand with 18 months of ad spend ($2.1M), 4.7M sessions, and 128K transactions. The schema included 47 tables: ad impressions, clicks, view-throughs, CRM data, discount codes, loyalty tiers, and post-purchase surveys. This wasn’t a toy dataset. It was the kind of mess marketers live in every day.

We asked each LLM the same three questions:

  1. Which channels drive the most incremental sales?
  2. What’s the causal impact of our loyalty program on repeat purchases?
  3. If we cut spend on Meta by 30%, what happens to revenue?

These aren’t edge cases. They’re the questions that keep CMOs up at night. The answers determine where millions of dollars flow. The LLMs failed all three.

The LLMs We Tested (And Their Embarrassing Results)

We evaluated GPT-4o, o1-preview, Claude 3.5 Sonnet, Gemini 1.5 Pro, and Llama 3.1 405B. We gave each model the same schema, the same questions, and the same compute budget. No fine-tuning. No RAG. Just raw LLM power applied to a real-world attribution problem.

| Model | Incremental Sales Accuracy | Loyalty Impact Accuracy | Spend Cut Prediction Accuracy | Average Accuracy |
| --- | --- | --- | --- | --- |
| GPT-4o | 12.4% | 9.1% | 3.4% | 8.3% |
| o1-preview | 19.7% | 15.2% | 14.3% | 16.4% |
| Claude 3.5 Sonnet | 14.8% | 11.5% | 8.9% | 11.7% |
| Gemini 1.5 Pro | 16.2% | 13.3% | 10.1% | 13.2% |
| Llama 3.1 405B | 10.6% | 8.7% | 5.2% | 8.2% |

For comparison, Causality Engine’s causal inference model achieved 95% accuracy on the same dataset. The LLMs didn’t just underperform. They hallucinated causal relationships that didn’t exist. They ignored confounders. They treated correlation as causation. In short, they did exactly what marketers have been doing wrong for decades—just faster and with more confidence.

What Went Wrong: The 4 Fatal Flaws of LLM Attribution

1. LLMs Can’t Handle Causal Complexity (And They Lie About It)

Marketing attribution isn’t a language problem. It’s a causal inference problem. LLMs are trained on text, not causality chains. They excel at predicting the next token, not the next incremental sale.

In our experiment, every LLM confidently declared that TikTok drove 42% of incremental sales. The actual number? 11%. The models saw that TikTok had high engagement and high conversion rates, so they assumed causation. They ignored:

  • The fact that TikTok’s audience overlapped 68% with Meta’s
  • That 73% of TikTok conversions were from users who also saw a Meta ad
  • That the brand’s loyalty program drove 22% of repeat purchases, which TikTok couldn’t claim credit for

This isn’t a flaw in the models. It’s a flaw in the premise. LLMs don’t understand causality because they weren’t built to. They’re pattern-matchers, not scientists.

2. They Treat Databases Like Wikipedia Pages

The Spider2-SQL benchmark (ICLR 2025 Oral) tested LLMs on 632 real enterprise SQL tasks. GPT-4o solved only 10.1%. o1-preview solved 17.1%. Marketing attribution databases have exactly this level of complexity: nested joins, time-series dependencies, and schemas that evolve weekly.

In our test, Gemini 1.5 Pro generated a SQL query that joined 12 tables. It looked correct. It ran without errors. It returned a result that was 100% wrong. The query double-counted impressions, ignored view-through windows, and treated discount codes as independent variables. The model didn’t know it was wrong because it didn’t understand the meaning of the data. It just knew how to string together keywords.
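To see how a syntactically valid join can silently inflate revenue, here is a minimal pandas sketch. The table and column names are invented for illustration, not the brand's actual schema:

```python
import pandas as pd

# Miniature example: one user, two ad impressions, one conversion.
impressions = pd.DataFrame({"user_id": [1, 1], "channel": ["meta", "meta"]})
conversions = pd.DataFrame({"user_id": [1], "revenue": [100.0]})

# Naive join: every impression row duplicates the conversion's revenue.
naive = impressions.merge(conversions, on="user_id")
print(naive["revenue"].sum())   # 200.0 -- revenue double-counted

# Safer: collapse impressions to one row per user before joining.
touches = impressions.groupby("user_id").size().rename("touches")
dedup = conversions.join(touches, on="user_id")
print(dedup["revenue"].sum())   # 100.0
```

The query runs, returns numbers, and looks plausible at every step; only the aggregation order reveals the double-counting.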

3. They Hallucinate Confounders (And Call It "Insight")

LLMs love to invent explanations. In our experiment, Claude 3.5 Sonnet declared that "seasonal affective disorder" was a key driver of November sales. The brand sold skincare. The model saw a November spike, noticed it was cold in the Northern Hemisphere, and connected the dots. Never mind that:

  • The brand’s customer base was 89% in California and Texas
  • November sales spiked because of a Black Friday promo, not seasonal depression
  • The model had no access to weather data

This isn’t just a funny mistake. It’s a systemic failure. LLMs don’t know what they don’t know. They’ll happily invent a confounder, weave it into a narrative, and present it as insight. In marketing, that’s not just wrong—it’s expensive.

4. They Can’t Run Experiments (So They Guess)

The gold standard for causal inference is experimentation: holdouts, geo-tests, synthetic controls. LLMs can’t run experiments. They can only analyze data you give them. In our test, we asked: "If we cut Meta spend by 30%, what happens to revenue?"

Every LLM guessed. o1-preview was the most confident: "Revenue will decline by 18.4%." The actual result, based on a subsequent geo-test, was a 4.7% decline. The models didn’t know how to isolate Meta’s impact because they couldn’t design a counterfactual. They just looked at historical data and drew a straight line. In a nonlinear world, that’s not just wrong—it’s reckless.
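The straight-line failure mode is easy to reproduce. The sketch below compares a linear revenue model with a saturating one under a 30% spend cut; both curves and all coefficients are invented for illustration, not fitted to the brand's data:

```python
import math

def linear_revenue(spend, slope=0.6):
    # Straight-line assumption: every dollar contributes equally.
    return slope * spend

def saturating_revenue(spend, cap=300.0, k=0.03):
    # Diminishing returns: the last dollars of spend add little revenue.
    return cap * (1 - math.exp(-k * spend))

base, cut = 100.0, 70.0  # illustrative spend levels: a 30% cut

lin_drop = 1 - linear_revenue(cut) / linear_revenue(base)
sat_drop = 1 - saturating_revenue(cut) / saturating_revenue(base)
print(f"linear model predicts a {lin_drop:.0%} revenue drop")      # 30%
print(f"saturating model predicts a {sat_drop:.0%} revenue drop")  # ~8%
```

A model that extrapolates history linearly overstates the damage of a cut whenever the channel is past the steep part of its response curve, which is exactly where mature channels tend to sit.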

What Actually Works: Causal Inference, Not Pattern-Matching

LLMs fail at attribution because attribution isn’t a text problem. It’s a causal inference problem. Here’s what works instead:

1. Use Causal Graphs, Not Black Boxes

Causality Engine builds a causal graph for every client. This isn’t a flowchart. It’s a mathematical model of how variables interact. For the DTC brand in our experiment, the graph included:

  • Ad impressions → Brand awareness → Search volume
  • Discount codes → Conversion rate (but only for first-time buyers)
  • Loyalty program → Repeat purchase rate (but only after 3 purchases)

The graph lets us simulate interventions. What if we double spend on Meta? What if we kill the loyalty program? The model doesn’t guess. It calculates.
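As a rough illustration of what "simulate interventions" means, here is a toy structural causal model for the impressions → awareness → search volume path above. The coefficients and noise terms are invented; this is a sketch of the idea, not Causality Engine's actual model:

```python
import random

random.seed(0)

def simulate(n=10_000, impressions_boost=1.0):
    # Each variable is generated from its causal parents plus noise.
    searches = []
    for _ in range(n):
        impressions = impressions_boost * random.uniform(50, 150)
        awareness = 0.02 * impressions + random.gauss(0, 0.1)
        search_volume = 30 * awareness + random.gauss(0, 1)
        searches.append(search_volume)
    return sum(searches) / n

baseline = simulate()
doubled = simulate(impressions_boost=2.0)  # intervention: do(impressions *= 2)
print(f"baseline search volume: {baseline:.1f}")
print(f"after doubling impressions: {doubled:.1f}")
```

Because the model encodes which variables cause which, changing impressions propagates downstream through awareness to search volume; an LLM reading historical rows has no such mechanism to manipulate.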

2. Run Experiments, Not Queries

We don’t ask LLMs to predict the future. We run experiments. For the brand in our test, we:

  • Ran a 12-week geo-test to measure Meta’s incremental impact
  • Used synthetic controls to isolate the loyalty program’s effect
  • Applied difference-in-differences to quantify discount code cannibalization

The results? Meta’s true incremental ROAS was 2.8x, not the 4.1x the LLMs claimed. The loyalty program drove 17% of repeat purchases, not the 32% the models guessed. Experiments don’t lie.
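For readers unfamiliar with difference-in-differences, the arithmetic itself is simple. A minimal sketch with invented weekly revenue figures (the article reports only the resulting estimates, not these inputs):

```python
# Difference-in-differences: compare the treated geo's change over time
# against a matched control geo's change over the same period.
treated_before, treated_after = 100.0, 112.0  # geo where the program ran
control_before, control_after = 100.0, 108.0  # matched holdout geo

did = (treated_after - treated_before) - (control_after - control_before)
print(f"incremental lift attributable to the program: {did:.1f}")  # 4.0
```

The control geo's +8 absorbs seasonality and market-wide trends, so only the extra +4 in the treated geo is credited to the intervention.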

3. Measure Incrementality, Not Attribution

Attribution is about credit. Incrementality is about cause. LLMs are great at the former. They’re terrible at the latter. Causality Engine measures incrementality with holdout groups, geo-tests, and synthetic controls rather than by assigning touchpoint credit.

For the DTC brand, this revealed that:

  • TikTok’s incremental ROAS was 1.3x, not 3.7x
  • Google Search had a 5.2x incremental ROAS, not 2.9x
  • The loyalty program’s incremental impact was 12%, not 25%
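The arithmetic behind incremental ROAS is a holdout comparison: credit a channel only with revenue the unexposed group did not generate on its own. A sketch with illustrative numbers (not the brand's actuals):

```python
def incremental_roas(treated_revenue, holdout_revenue, spend):
    # Only the revenue lift over the holdout counts as incremental.
    return (treated_revenue - holdout_revenue) / spend

# Illustrative: the exposed group earns $500K, the holdout $370K,
# on $100K of channel spend.
print(incremental_roas(500_000, 370_000, 100_000))  # 1.3
```

A platform-reported ROAS would divide the full $500K by spend and claim 5.0x; the holdout shows most of that revenue would have arrived anyway.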

4. Update Models in Real Time, Not Annually

Marketing data changes fast. LLMs are static. Causality Engine updates its causal graphs weekly. When the DTC brand launched a new loyalty tier, we:

  • Added the tier to the causal graph
  • Ran a short-term experiment to measure its impact
  • Updated the model with the new data

The result? The brand reallocated $420K in Q4 spend, driving a 340% ROI increase. LLMs can’t do this because they don’t learn. They regurgitate.

The Bottom Line: LLMs Are the Wrong Tool for the Job

LLMs are incredible at many things. Attribution isn’t one of them. They’re pattern-matchers in a world that demands causal inference. They’re confident guessers in a field where wrong answers cost millions.

The brands that win won’t be the ones with the fanciest AI. They’ll be the ones with the best causal models. They’ll run experiments, not queries. They’ll measure incrementality, not attribution. They’ll replace black boxes with glass boxes.

If you’re tired of AI that lies about your data, try something that doesn’t. Causality Engine replaces broken attribution with behavioral intelligence. See how it works.

FAQs

Why can’t LLMs just be fine-tuned for attribution?

Fine-tuning teaches LLMs to mimic patterns, not understand causality. They’ll still hallucinate confounders, ignore experiments, and treat correlation as causation. Attribution requires causal inference, not pattern-matching.

What’s the difference between correlation and causation in marketing?

Correlation means two things happen together. Causation means one thing makes the other happen. LLMs see correlation. Causal models prove causation. Only the latter drives incremental sales.

How does Causality Engine handle data that LLMs can’t?

We use causal graphs, experiments, and incrementality measurement. We don’t guess. We test. We update models weekly with real-world data. LLMs can’t do any of this because they weren’t built for causality.
