Using Data Engineering to Predict Which Customers Will Buy

To predict which customers are most likely to convert, companies must move away from static lead scoring and toward dynamic propensity modeling. A propensity model is a statistical model that assigns every user in your database a probability score between 0 and 1, based on their historical behavior, demographic data, and product interactions.

In our work with mid-market SaaS companies, we often see teams relying on arbitrary point systems in their CRM. For example, a lead might get 10 points for downloading a whitepaper and 5 points for visiting the pricing page. The problem with this manual approach is that it assumes every action has a fixed, linear value. In reality, a customer who visits the pricing page three times in 24 hours is significantly more likely to buy than someone who visited once three months ago. To truly predict which customers will buy, you need a system that learns from these non-linear patterns.

By using an automated machine learning pipeline, your data team can identify the specific "signals" that actually lead to revenue. This allows your sales and marketing teams to stop chasing every lead and instead focus their resources on the top 10% of users who have a high probability of conversion.

| Component | Traditional Lead Scoring | AI Propensity Modeling |
| --- | --- | --- |
| Logic | Manual, rule-based points | Statistical probability scores |
| Data Scope | Limited to CRM fields | Cross-functional (Product, Web, CRM) |
| Adaptability | Hardcoded; requires manual updates | Self-learning as new data arrives |
| Accuracy | High false-positive rate | High precision and recall |
| Outcome | "Warm" vs "Cold" labels | Percentage likelihood of purchase |

Building the Data Foundation to Predict Customer Purchase Likelihood

Before we can train a model, we need a clean, centralized dataset. Most organizations fail at predictive analytics because their data is trapped in silos across HubSpot, Stripe, Google Analytics, and their internal application database. We call this the "context gap." If your model cannot see that a user who just opened a marketing email also just hit a specific usage limit in your product, the prediction will be flawed.

The first step in our process involves using an ELT (Extract, Load, Transform) architecture. We typically use tools like Fivetran or Airbyte to land raw data into BigQuery or Snowflake. Once the data is in the warehouse, we use dbt to create a "Unified Customer Table." This table serves as the single source of truth for the model.

If you are unsure whether your current infrastructure can handle this level of complexity, our AI Stack Audit provides a scored assessment of your data foundation. We look at your existing pipelines and identify the gaps that would prevent a propensity model from being accurate or reliable.

To prepare the data, we focus on three main categories of features:

  1. Firmographics and Demographics: Company size, industry, job title, and geographic location.
  2. Behavioral Activity: Number of logins, specific features used, time spent in the app, and support tickets submitted.
  3. Marketing Engagement: Email open rates, webinar attendance, and ad clicks.

Engineering Features for an AI Customer Propensity Model

Feature engineering is the process of transforming raw data into meaningful inputs for a machine learning model. This is where the most value is created in any predictive project. For example, instead of just using "total logins," we might create a feature that compares logins in the last 7 days with the weekly average over the last 30 days. A ratio above 1 means the user is logging in more often than usual, so the feature captures the momentum of the user.
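
As a rough illustration of that momentum feature, here is a minimal Python sketch. The function name `login_momentum` and the input shape (a plain list of login dates) are assumptions made for this example; in a real pipeline the same logic would more likely live in the warehouse.

```python
from datetime import date, timedelta

def login_momentum(login_dates, as_of):
    """Ratio of logins in the last 7 days to the 7-day rate implied by the last 30 days."""
    last_7 = sum(1 for d in login_dates if as_of - timedelta(days=7) < d <= as_of)
    last_30 = sum(1 for d in login_dates if as_of - timedelta(days=30) < d <= as_of)
    if last_30 == 0:
        return 0.0  # no activity in the window, so no momentum to measure
    expected_7 = last_30 * 7 / 30  # logins per 7 days at the 30-day average rate
    return last_7 / expected_7

as_of = date(2024, 6, 30)
# Accelerating user: two logins earlier in the month, six in the final week
accelerating = [as_of - timedelta(days=n) for n in (25, 18, 5, 4, 3, 2, 1, 0)]
# Steady user: exactly one login per day for 30 days
steady = [as_of - timedelta(days=n) for n in range(30)]

print(login_momentum(accelerating, as_of))  # greater than 1: usage is picking up
print(login_momentum(steady, as_of))        # 1.0: usage matches the 30-day average
```

A value well above 1 flags a user whose activity is accelerating, which a raw "total logins" count would hide.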

In our experience, the most predictive features for SaaS companies are often related to "Product Qualified Lead" (PQL) metrics. These are actions that correlate strongly with a user realizing the value of your product.

Here is a simplified SQL block showing how we might calculate these features in a dbt model:

sql
-- models/marts/predictive/fct_user_propensity_features.sql

WITH user_activity AS (
    SELECT
        user_id,
        -- Sessions in the trailing 7 days. BigQuery syntax; event_timestamp is assumed to be a TIMESTAMP.
        COUNT(CASE WHEN event_type = 'session_start' AND event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) THEN 1 END) AS sessions_last_7d,
        COUNT(CASE WHEN event_type = 'feature_alpha_used' THEN 1 END) AS total_feature_alpha_usage,
        MAX(event_timestamp) AS last_active_at
    FROM {{ ref('stg_events') }}
    GROUP BY 1
),

crm_data AS (
    SELECT
        -- Assumes the staging model maps each contact to its product user_id (the join key below).
        user_id,
        company_size,
        industry,
        is_trial_user
    FROM {{ ref('stg_hubspot_contacts') }}
)

SELECT
    u.user_id,
    c.company_size,
    c.industry,
    c.is_trial_user,
    u.sessions_last_7d,
    u.total_feature_alpha_usage,
    DATE_DIFF(CURRENT_DATE, CAST(u.last_active_at AS DATE), DAY) AS days_since_last_session,
    -- Target variable for training: did they upgrade in the next 30 days?
    -- For training rows, compute the features as of a past snapshot date rather than
    -- today, so that the 30-day label window falls after the feature window.
    COALESCE(t.has_converted, FALSE) AS target_converted
FROM user_activity u
JOIN crm_data c ON u.user_id = c.user_id
LEFT JOIN {{ ref('dim_conversions') }} t ON u.user_id = t.user_id

By structuring the data this way, we provide the model with a clear "snapshot" of what a customer looked like right before they decided to buy. This allows the algorithm to find the common patterns among successful conversions.

Selecting the Right Model to Forecast Which Leads Will Convert

Once the features are ready, we must choose a model. For most mid-market companies, a complex neural network is overkill. We usually recommend starting with a Gradient Boosted Tree model like XGBoost or LightGBM. These models are excellent at handling tabular data and can deal with missing values effectively.
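
To give a feel for what a gradient boosted tree model does under the hood, here is a deliberately tiny from-scratch sketch in plain Python. It is not XGBoost or LightGBM and should not be used in place of them: it boosts one-split "stumps" on a single feature, and the data, function names, and hyperparameters are all invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Pick the threshold split that best fits the residuals (least squares).

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted_stumps(xs, ys, rounds=50, learning_rate=0.3):
    """Gradient boosting on log loss: each stump fits the current residuals."""
    base_rate = sum(ys) / len(ys)
    bias = math.log(base_rate / (1 - base_rate))  # start from the overall conversion rate
    scores = [bias] * len(xs)
    stumps = []
    for _ in range(rounds):
        # y - p is the negative gradient of log loss with respect to the raw score
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        scores = [
            s + learning_rate * (left_value if x <= threshold else right_value)
            for x, s in zip(xs, scores)
        ]
    return bias, stumps, learning_rate

def predict_propensity(model, x):
    bias, stumps, learning_rate = model
    raw = bias + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)
    return sigmoid(raw)

# Invented training data: sessions in the last 7 days and whether the user converted
sessions = [0, 1, 1, 2, 3, 5, 6, 8]
converted = [0, 0, 0, 0, 0, 1, 1, 1]
model = fit_boosted_stumps(sessions, converted)
print(predict_propensity(model, 7))  # high score for an active user
print(predict_propensity(model, 1))  # low score for an inactive user
```

Production libraries add multi-feature trees, regularization, and native handling of missing values, which is why we reach for them rather than hand-rolled code.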

The goal is to produce a "Propensity Score." This score allows you to segment your database into tiers:

  • High Propensity (Score > 0.8): Send these directly to an Account Executive for immediate outreach.
  • Medium Propensity (Score 0.4 - 0.8): Enroll these in a high-touch automated email sequence or offer a 1-on-1 demo.
  • Low Propensity (Score < 0.4): Keep in a standard nurture track or low-cost remarketing.
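
The tiering above can be expressed as a small routing function. This is a sketch using the thresholds listed; the function name and tier labels are placeholders, and a score of exactly 0.8 or 0.4 is treated as medium.

```python
def propensity_tier(score):
    """Map a propensity score in [0, 1] to the tiers described above."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    if score > 0.8:
        return "high"    # route to an Account Executive
    if score >= 0.4:
        return "medium"  # high-touch email sequence or demo offer
    return "low"         # standard nurture or remarketing

print(propensity_tier(0.85))  # high
print(propensity_tier(0.55))  # medium
print(propensity_tier(0.10))  # low
```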

When evaluating the model, we do not just look at "accuracy." Accuracy can be misleading if only 5% of your leads actually convert: a model that predicts that nobody will ever buy is 95% accurate and completely useless. Instead, we focus on Precision and Recall. Precision tells us what percentage of our "High Propensity" predictions were actually correct. Recall tells us what percentage of all actual buyers we successfully identified.
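
A short worked example makes the difference concrete. The lead counts below are made up for illustration, using the 5% conversion rate from the paragraph above; the function name is our own.

```python
def precision_recall_accuracy(actual, predicted):
    """Return (precision, recall, accuracy) for two equal-length lists of booleans."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(actual)
    return precision, recall, accuracy

# 100 leads, of which the first 5 actually buy
actual = [True] * 5 + [False] * 95

# A "model" that never flags anyone as a buyer
never_flags = [False] * 100
# A model that flags 8 leads: 4 real buyers and 4 non-buyers, and misses 1 buyer
model_flags = [True] * 4 + [False] * 1 + [True] * 4 + [False] * 91

print(precision_recall_accuracy(actual, never_flags))  # (0.0, 0.0, 0.95)
print(precision_recall_accuracy(actual, model_flags))  # (0.5, 0.8, 0.95)
```

Both cases report 95% accuracy, but only the second finds any buyers, which is exactly what precision and recall expose.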

Deploying Predictions into Production Pipelines

A model is useless if the results stay in a notebook or a spreadsheet. To make the predictions actionable, we push the scores back into the tools your team uses every day. This process is often called "Reverse ETL."

We automate the pipeline so that every morning, the model runs against the latest data in the warehouse. The updated scores are then synced to HubSpot or Salesforce. This ensures that when a sales representative logs in, they see a "Propensity Score" field on every contact record, sorted from highest to lowest.
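
The shape of that daily sync can be sketched without tying it to any particular tool. The record layout and the field names `propensity_score` and `propensity_scored_on` below are hypothetical; a Reverse ETL tool or the CRM's own API would define the real format.

```python
from datetime import date

def build_crm_updates(scores_by_contact, scored_on):
    """Turn {contact_id: score} into update records, highest score first."""
    ranked = sorted(scores_by_contact.items(), key=lambda item: item[1], reverse=True)
    return [
        {
            "contact_id": contact_id,
            "properties": {
                "propensity_score": round(score, 3),
                "propensity_scored_on": scored_on.isoformat(),
            },
        }
        for contact_id, score in ranked
    ]

# Example scores, invented for illustration
updates = build_crm_updates(
    {"c-101": 0.42, "c-102": 0.91, "c-103": 0.07},
    scored_on=date(2024, 6, 30),
)
for update in updates:
    print(update["contact_id"], update["properties"]["propensity_score"])
```

Stamping each record with the scoring date lets the sales team see at a glance whether a score is fresh.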

We teach these production workflows in our Learn AI Bootcamp, where we show data teams how to move beyond static dashboards and into building operational AI systems. We cover everything from model registration to monitoring for "feature drift," which happens when customer behavior changes over time and the model becomes less accurate.

Monitoring is critical. If your product releases a major new feature, the old "signals" for buying might change. Our team sets up automated UAT (User Acceptance Testing) and data quality monitors to alert us if the distribution of propensity scores shifts significantly. This ensures the sales team never loses trust in the data.
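
One common way to quantify a shift in the score distribution is the Population Stability Index (PSI). The sketch below is our own illustration rather than a description of any specific monitoring tool; it assumes scores lie between 0 and 1, and the alert thresholds mentioned afterwards are a widely used rule of thumb, not a fixed standard.

```python
import math

def population_stability_index(baseline_scores, current_scores, bins=10):
    """PSI between two sets of propensity scores, using equal-width buckets on [0, 1]."""
    def bucket_shares(scores):
        counts = [0] * bins
        for s in scores:
            index = min(int(s * bins), bins - 1)  # a score of 1.0 goes in the last bucket
            counts[index] += 1
        # A small floor avoids log(0) and division by zero for empty buckets
        return [max(c / len(scores), 1e-6) for c in counts]

    base = bucket_shares(baseline_scores)
    curr = bucket_shares(current_scores)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))

# Invented data: last month's scores, and a version shifted upward by 0.3
baseline = [i / 100 for i in range(100)]
shifted = [min(s + 0.3, 0.99) for s in baseline]

print(population_stability_index(baseline, baseline))  # 0.0: no change
print(population_stability_index(baseline, shifted))   # large value: distribution moved
```

As a rule of thumb, a PSI below roughly 0.1 is usually read as stable and above roughly 0.25 as a shift worth investigating before the scores reach the sales team.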

Measuring the Return on Investment of Predictive Analytics

The ultimate metric for success is the impact on your CAC (Customer Acquisition Cost) and your sales cycle length. When we implement these systems for clients, we typically see a significant improvement in Sales Development Representative (SDR) efficiency. Instead of making 100 random calls, they make 20 targeted calls to leads with a propensity score of 0.85 or higher.
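
The arithmetic behind that claim is simple expected value. The numbers below are illustrative assumptions, not client results: a 5% conversion rate for random calls, and scores that are well calibrated, meaning a lead scored 0.85 really does convert about 85% of the time.

```python
def expected_conversions(calls, conversion_probability):
    """Expected number of conversions if each call converts independently."""
    return calls * conversion_probability

random_outreach = expected_conversions(100, 0.05)   # assumed 5% base rate
targeted_outreach = expected_conversions(20, 0.85)  # assumes calibrated scores

print(random_outreach)    # 5.0 expected conversions from 100 calls
print(targeted_outreach)  # 17.0 expected conversions from 20 calls
```

Under those assumptions, a fifth of the calls yields more than three times the conversions. Real-world calibration is rarely this clean, so it is worth checking predicted scores against observed conversion rates.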

Furthermore, these models help identify "hidden gems" in your database. These are users who might not have downloaded a whitepaper or filled out a form recently, but their product usage patterns indicate they are reaching a "breaking point" where they need the paid version of your software.

By closing the loop between data engineering and sales operations, you create a revenue engine that is driven by evidence rather than intuition. This is the difference between a team that "hopes" to hit their numbers and a team that has a predictable roadmap for growth.

Frequently Asked Questions About Customer Propensity

How much data do I need to predict customer purchase likelihood?

While more data is generally better, you do not need millions of rows to start. For a B2B SaaS company, we typically look for at least 500 to 1,000 historical conversion events to train a reliable model. If you have fewer than that, the model may struggle to find statistically significant patterns. In those cases, we often start by predicting "micro-conversions," such as a user inviting a teammate or setting up an integration, which serve as leading indicators for an eventual purchase.

Which machine learning model is best for predicting which customers will buy?

For most business use cases involving structured data from a CRM or SQL database, we recommend a tree-based model such as XGBoost or Random Forest. These models are robust, handle non-linear relationships well, and provide "feature importance" scores, which show which behaviors the model relies on most when making predictions. While LLMs are useful for analyzing text data like support tickets or call transcripts, they are usually not the primary tool for numerical propensity scoring.

How often should a customer propensity model be retrained?

In a fast-growing company, we recommend retraining your model at least once a month. Customer behavior can change rapidly due to new product launches, seasonal trends, or shifts in the competitive landscape. If you notice that your model's precision is dropping, it is a sign of "model decay," and it is time to refresh the training dataset with the most recent three to six months of customer activity.
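
A retraining trigger along these lines might combine a schedule with a precision check. This is a sketch only: the function name, the 30-day maximum age, and the 0.05 tolerance for a precision drop are assumptions chosen for the example.

```python
from datetime import date

def needs_retraining(baseline_precision, recent_precision, last_trained, today,
                     max_drop=0.05, max_age_days=30):
    """Retrain when the model is older than the schedule allows or precision has decayed."""
    too_old = (today - last_trained).days >= max_age_days
    decayed = (baseline_precision - recent_precision) > max_drop
    return too_old or decayed

today = date(2024, 6, 15)
print(needs_retraining(0.60, 0.58, date(2024, 6, 1), today))  # False: recent and stable
print(needs_retraining(0.60, 0.50, date(2024, 6, 1), today))  # True: precision dropped
print(needs_retraining(0.60, 0.60, date(2024, 5, 1), today))  # True: older than 30 days
```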

Can we build this if our CRM data is currently messy?

Yes, but the first phase of the project must focus on data cleaning. We often use dbt to standardize messy CRM fields (like inconsistent industry names or job titles) before passing them to the model. Predictive modeling actually serves as a great forcing function to finally fix your data quality issues, because the ROI of clean data becomes immediately apparent once you see the accuracy of the predictions.
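
The kind of standardization involved looks roughly like the sketch below. It is written in Python for illustration; in a dbt project the same mapping would typically be a SQL CASE statement or a seed table. The alias table and function name are hypothetical.

```python
# Hypothetical mapping from messy CRM values to canonical industry names
INDUSTRY_ALIASES = {
    "saas": "Software",
    "software": "Software",
    "software & it": "Software",
    "fintech": "Financial Services",
    "finance": "Financial Services",
    "financial services": "Financial Services",
}

def standardize_industry(raw_value):
    """Normalize case and whitespace, then map known aliases to one canonical name."""
    if raw_value is None or not raw_value.strip():
        return "Unknown"
    cleaned = " ".join(raw_value.lower().split())
    # Unmapped values are kept, just tidied, so no information is lost
    return INDUSTRY_ALIASES.get(cleaned, cleaned.title())

print(standardize_industry("  SaaS "))        # Software
print(standardize_industry("FINTECH"))        # Financial Services
print(standardize_industry("health   care"))  # Health Care
print(standardize_industry(None))             # Unknown
```

Collapsing variants like these matters because a model treats "SaaS" and "saas " as two unrelated categories unless they are merged first.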

How do we get the predictions back into the hands of the sales team?

We use a process called Reverse ETL. Once the propensity scores are calculated in your data warehouse, tools like Hightouch or Census can automatically sync those values back into specific fields in HubSpot or Salesforce. This ensures the sales team can see the scores directly on the lead or contact record without ever leaving their CRM.

Ready to build your predictive revenue engine?

If you are ready to move beyond static reporting and start using AI to drive your sales strategy, we can help. Our AI Stack Audit is the perfect starting point to identify the gaps in your data foundation and create a roadmap for predictive analytics. For teams that want to build these systems themselves, our Learn AI Bootcamp provides the hands-on training needed to deploy production-grade AI models.

Want to discuss your specific data architecture and goals? Book a free consultation with our team to see how we can help you build a system to predict which customers will buy.