Intent scoring on Shopify stores — 40 signals that actually predict reply

2025-06-04 · product · 7 min read · Pierre Richard

We score every prospect that enters the pipeline on a 0-100 intent scale before any email gets generated. That score is the difference between a 3.6% reply rate and a spam complaint. Today I want to walk through the 40-something features the model actually looks at, ranked by how predictive they turned out to be once we had real reply data to backtest against.

A bunch of the "obvious" signals turned out to be noise. A few non-obvious ones turned out to be the bulk of the signal. The good news for anyone trying to do this without our pipeline: you only need about 8 features to get most of the way there.

The dataset

15,400 Shopify-merchant prospects sent over Mar–May 2025, each annotated with the eventual outcome:

positive reply (3.6% — the prospect engaged constructively)
neutral reply (1.1% — bounce, OOO, mild interest)
negative reply (0.4% — unsubscribe, complaint)
no reply (94.9%)

We're predicting the positive class. With only 3.6% positives, a model that predicts "no reply" for every prospect is 96.4% accurate and useless, so we evaluate with AUC rather than accuracy. Baseline AUC with just "store has a Shopify install" is 0.51, barely above a coin flip. With the full feature set we hit 0.79.
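To make the accuracy-versus-AUC point concrete, here is a minimal sketch in plain Python. It computes AUC as the probability that a random positive outranks a random negative. The toy data (1,000 prospects at the 3.6% positive rate) and the `roc_auc` helper are illustrative assumptions, not our production evaluation code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1,000 prospects, 36 positive replies (3.6%).
labels = [1] * 36 + [0] * 964
constant = [0.0] * 1000  # a "model" that gives everyone the same score

# Predicting "no reply" for everyone is right for every negative.
accuracy = sum(1 for y in labels if y == 0) / len(labels)
print(accuracy)                    # 0.964
print(roc_auc(labels, constant))   # 0.5
```

The do-nothing model looks excellent on accuracy and is exactly chance on AUC, which is why AUC is the number to watch. The pairwise loop is quadratic, so for a full-size list a rank-based implementation from a stats library is the better choice.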

The features that did the heavy lifting

Ranked by SHAP importance on the production model:

1. Product catalog depth (importance: 0.18)

Stores with 50–500 SKUs reply at 3× the rate of stores with under 20 SKUs. Below 20 SKUs, the merchant is usually a hobbyist who hasn't hit the operational pain point our product solves. Above ~500 SKUs they tend to be enterprise and route us to procurement (which is its own pain). The mid-band is the sweet spot.
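As a rough sketch, the banding above could be turned into a feature like this. The band names and the `catalog_band` function are hypothetical, and the 20-49 SKU range is left as "unspecified" because nothing above says how those stores behave.

```python
def catalog_band(sku_count: int) -> str:
    """Map a SKU count to the bands described above (names are illustrative)."""
    if sku_count < 20:
        return "hobbyist"
    if 50 <= sku_count <= 500:
        return "sweet_spot"
    if sku_count > 500:
        return "enterprise"
    return "unspecified"  # 20-49 SKUs: not characterized in this post


# Boolean feature for a simple additive score.
catalog_size_in_band = catalog_band(120) == "sweet_spot"
print(catalog_size_in_band)  # True
```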

2. Recency of homepage updates (importance: 0.14)

We hash the homepage and re-fetch every 14 days. If the hash hasn't changed in 90+ days the store is essentially dormant from a buying-decision perspective — the merchant has stopped iterating, which means they've also stopped buying tools. Fresh homepages predict reply.
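A minimal sketch of the staleness check, assuming snapshots are stored as (fetch date, hash) pairs. The function names, the SHA-256 choice, and the snapshot layout are assumptions for illustration. In practice you would probably want to strip dynamic fragments (timestamps, session tokens) before hashing, or every fetch will look like a change.

```python
import hashlib
from datetime import date


def content_hash(html: str) -> str:
    """Hash the fetched page; SHA-256 is an arbitrary but reasonable choice."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def days_since_change(snapshots: list[tuple[date, str]], today: date) -> int:
    """snapshots: (fetch_date, hash), oldest first.

    Returns days since the hash last changed. If it never changed, this counts
    from the first snapshot, which is a lower bound on the true staleness.
    """
    if not snapshots:
        raise ValueError("need at least one snapshot")
    last_change = snapshots[0][0]
    for (_, prev_hash), (fetched, cur_hash) in zip(snapshots, snapshots[1:]):
        if cur_hash != prev_hash:
            last_change = fetched
    return (today - last_change).days


# Fetches 14 days apart; the homepage changed once, on the third fetch.
snapshots = [
    (date(2025, 1, 1), content_hash("<h1>Winter sale</h1>")),
    (date(2025, 1, 15), content_hash("<h1>Winter sale</h1>")),
    (date(2025, 1, 29), content_hash("<h1>Spring drop</h1>")),
    (date(2025, 2, 12), content_hash("<h1>Spring drop</h1>")),
]
stale_days = days_since_change(snapshots, today=date(2025, 5, 1))
is_dormant = stale_days >= 90
print(stale_days, is_dormant)  # 92 True
```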

3. Presence of an email-capture popup (importance: 0.11)

A sign the merchant cares about list-building. Stores with a working capture popup reply at 2.4× the no-popup rate. We detect this by rendering the homepage in a headless browser and checking for <dialog>, common popup library selectors (Klaviyo, Privy, Justuno), and z-index-stacked overlays.
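Here is a simplified sketch of the markup side of that check, run against HTML that a headless browser has already rendered. It uses only the standard-library parser. The vendor substrings are placeholder assumptions, not verified selectors for those libraries, and the z-index overlay check is omitted because it needs computed styles from the browser.

```python
from html.parser import HTMLParser

# Assumed substrings to look for in class/id/src attributes; real popup
# libraries may use different markers, so treat these as placeholders.
VENDOR_HINTS = ("klaviyo", "privy", "justuno")


class PopupSniffer(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.has_dialog = False
        self.vendor_hits: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "dialog":
            self.has_dialog = True
        for name, value in attrs:
            if name in ("class", "id", "src") and value:
                lowered = value.lower()
                for hint in VENDOR_HINTS:
                    if hint in lowered:
                        self.vendor_hits.add(hint)


def has_email_capture(rendered_html: str) -> bool:
    sniffer = PopupSniffer()
    sniffer.feed(rendered_html)
    return sniffer.has_dialog or bool(sniffer.vendor_hits)


print(has_email_capture('<dialog open><form>Join our list</form></dialog>'))  # True
print(has_email_capture('<p>Just a paragraph</p>'))                           # False
```

A bare `<dialog>` match will also fire on cookie banners and age gates, so a real detector would want a second check (for example, an email input inside the dialog) before counting it.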

4. Klaviyo / Mailchimp pixel detection (importance: 0.09)

Stores with an active email-marketing platform installed are already in our buying motion: they've conceded "email marketing is a category I spend on." That makes "now make it AI-personalized" an easy upsell. Stores without any email service provider (ESP) are usually 12 months earlier in the funnel than we want.

5. Checkout-flow signals (importance: 0.08)

Specifically: presence of express checkout (Shop Pay, Apple Pay), one-page vs multi-step checkout, post-purchase upsell apps detected. Stores investing in checkout optimization care about conversion at every step — they're our buyer.

6. Press / blog content velocity (importance: 0.07)

Stores publishing blog posts or press at >1/month rate reply at 2.1× the dormant rate. The blog signals "we have a marketing team that ships."

7. Domain age (importance: 0.06)

Counter-intuitive: 1–3 year old domains reply best. Brand-new (<6 months) stores haven't found product-market fit yet. Old (>5 years) stores are entrenched with their existing tooling.
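One way to encode those age bands, sketched with approximate day counts (183 days for six months, 365 days per year). The bucket names and `domain_age_bucket` are hypothetical. The ranges between 6 and 12 months and between 3 and 5 years are not characterized above, so the sketch labels them "unspecified" instead of guessing.

```python
from datetime import date


def domain_age_bucket(registered: date, today: date) -> str:
    """Bucket a domain by age; thresholds approximate the bands in the text."""
    age_days = (today - registered).days
    if age_days < 183:               # under ~6 months
        return "brand_new"
    if 365 <= age_days <= 3 * 365:   # roughly 1-3 years
        return "prime"
    if age_days > 5 * 365:           # over ~5 years
        return "entrenched"
    return "unspecified"             # gaps the post does not describe


print(domain_age_bucket(date(2023, 6, 4), date(2025, 6, 4)))  # prime
```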

8. Job postings detected (importance: 0.05)

If the store has open marketing or growth roles posted on LinkedIn or their /careers page, they're spending. Spending stores buy tools.

That's the top 8. Their importances sum to 0.78, so together they account for ~78% of the model's total SHAP importance.
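The ~78% figure is just the sum of the eight importances listed above, assuming they are normalized shares (each feature's importance divided by the total across all features). A quick check, with dictionary keys that are shorthand labels of my choosing:

```python
# Importances as listed above; key names are illustrative shorthand.
top8 = {
    "catalog_depth": 0.18,
    "homepage_recency": 0.14,
    "email_capture_popup": 0.11,
    "esp_pixel": 0.09,
    "checkout_signals": 0.08,
    "content_velocity": 0.07,
    "domain_age": 0.06,
    "job_postings": 0.05,
}

top8_share = round(sum(top8.values()), 2)
long_tail_share = round(1 - top8_share, 2)
print(top8_share, long_tail_share)  # 0.78 0.22
```

Under that normalization assumption, the remaining 30-odd features split the other 22% between them.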

The features that turned out to be noise

A few we expected to matter that didn't:

Social-media follower counts

We thought IG/TikTok follower count would correlate with revenue and therefore willingness to pay. It doesn't, at least within the Shopify population. A 200K-follower fashion brand and a 4K-follower DTC supplement brand reply at statistically indistinguishable rates. Followers measure brand-marketing effort, not buying maturity.

Number of product reviews

Same story. Lots of reviews = the merchant has been around a while, but it doesn't separate the buyers from the non-buyers.

Tech-stack breadth (number of apps installed)

We thought "merchants with 30+ apps installed are app-buyers" would be predictive. The coefficient is actually slightly negative: those merchants often suffer from app fatigue and are reluctant to add another tool. Whatever signal app count carries seems to run through marketing-team maturity, not the raw number of apps.

Geography

Within the US/EU/UK English-speaking population (which is ~85% of our list), country of origin has near-zero coefficient once you control for store-size signals. Don't waste time geo-filtering.

The minimum viable scoring stack

If you're trying to roll your own and don't have the engineering bandwidth for 40 features, here's the 80/20:

score = (
    0.30 * int(catalog_size_in_band)            # 50-500 SKUs
    + 0.25 * int(homepage_changed_within_90d)
    + 0.20 * int(has_email_capture)
    + 0.15 * int(has_klaviyo_or_mailchimp)
    + 0.10 * int(has_express_checkout)
)  # each input is a bool; the weights sum to 1.0, so score falls in [0, 1]

That gets you to about AUC 0.71, most of the way from the 0.51 baseline to the full model's 0.79. The rest of the gain comes from the long-tail features and from learned interactions between them.
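If you want the weighted sum above as something you can call, here is one possible wrapper. It expresses the same weights as integer points so the result lands on a 0-100 scale; that rescaling, the `WEIGHT_POINTS` table, and the `intent_score` name are my assumptions for the sketch. Missing signals are treated as False.

```python
# Same weights as the formula above, expressed as points out of 100.
WEIGHT_POINTS = {
    "catalog_size_in_band": 30,         # 50-500 SKUs
    "homepage_changed_within_90d": 25,
    "has_email_capture": 20,
    "has_klaviyo_or_mailchimp": 15,
    "has_express_checkout": 10,
}


def intent_score(signals: dict[str, bool]) -> int:
    """Sum the points for every signal that is present and True."""
    return sum(
        points
        for name, points in WEIGHT_POINTS.items()
        if signals.get(name, False)
    )


prospect = {
    "catalog_size_in_band": True,
    "homepage_changed_within_90d": False,
    "has_email_capture": True,
    "has_klaviyo_or_mailchimp": True,
    "has_express_checkout": False,
}
print(intent_score(prospect))  # 65
```

Integer points avoid floating-point rounding noise in the total, which keeps thresholds like "score >= 65" exact.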

The right mental model

A good intent score is an answer to the question "is this prospect earlier or later in the buying funnel than the median lead I'm sending to?" Templates ignore the question. Scoring lets you concentrate spend on the specific 20% of the list where reply probability is 5× the average — and skip or warm the rest.
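To check how concentrated your own list is, you can measure lift directly: the reply rate in the top-scored slice divided by the list-wide rate. A small sketch, where the function name and the toy data are illustrative assumptions:

```python
from typing import Sequence


def top_fraction_lift(
    labels: Sequence[int], scores: Sequence[float], fraction: float = 0.2
) -> float:
    """Reply rate in the top `fraction` of prospects by score, over the base rate."""
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no positive replies in the sample")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    return top_rate / base_rate


# Toy list: 10 prospects, 2 positive replies, one of them in the top 20% by score.
labels = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [95, 90, 70, 65, 50, 40, 30, 20, 10, 5]
lift = top_fraction_lift(labels, scores, fraction=0.2)
print(round(lift, 2))  # 2.5
```

One thing the arithmetic makes clear: lift on a 20% slice is capped at 5×, and it only reaches that cap when every positive reply falls inside the slice. Treat 5× as the ceiling to aim for and measure what your own scores actually deliver.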

That's the multiplicative gain. Personalization × qualification, not just personalization.

Want this on your prospect list? pierre@parisai.click.