How to Build a Custom "Cross-Sell" Engine Based on Real Purchase Data in Magento 2

How to Build a Custom "Cross-Sell" Engine Based on Real Purchase Data in Magento 2

Let’s be honest: Magento’s native cross-sell, related and upsell fonctionnalités are useful, but they’re often static and manual. Si vous want smarter, revenue-driving suggestions that reflect how your clients actually buy, you need a cross-sell engine built on real purchase data. In this post I’ll walk you through a practical, étape-by-étape approche to build a custom cross-sell engine for Magento 2. I’ll keep the tone casual — think of this as a lunchtime chat with a teammate — and I’ll include concrete code snippets, SQL, and architecture conseils you can reuse on magefine.com stores or any Magento 2 installation. You’ll learn comment: - Analyze existing purchase data to identify cross-selling patterns - Build an algorithm using co-purchase counts and simple normalization - Integrate recommendations with Magento 2 native cross-sell links or a custom UI block - Optimize performance for large datasets - Run A/B tests and measure the impact on Average Order Value (AOV) Why build this? Parce que data-driven cross-sells typically beat manual rules: they adapt to product bundles that naturally occur, and they find niche pairings you wouldn’t guess by hand.

Aperçu of the design

High level, the engine has these parts: - Data layer: aggregate co-purchase counts from sales_commande_item - Scoring layer: convert counts into scores (e.g., normalized frequency or PMI) - Storage layer: store recommendations (custom table or product links) - Delivery layer: display recommendations on page produit, cart page, or paiement - Measurement + experimentation: A/B tests and instrumentation to measure AOV lift We’ll build a simple, robust version and highlight places to improve (machine learning, session-based models, real-time updates) as your data and traffic gligne.

1) Analyze your purchase data to find cross-sell patterns

First étape: explore the data. Vous devez know how often products are bought together. Magento stores commande items in sales_commande_item — this table is your friend. A quick SQL to get raw co-purchase counts (commandes where both items appear):
SELECT
  oi1.product_id AS product_id,
  oi2.product_id AS co_product_id,
  COUNT(DISTINCT oi1.order_id) AS orders_together
FROM sales_order_item oi1
JOIN sales_order_item oi2
  ON oi1.order_id = oi2.order_id
  AND oi1.product_id <> oi2.product_id
WHERE oi1.parent_item_id IS NULL AND oi2.parent_item_id IS NULL
GROUP BY oi1.product_id, oi2.product_id
ORDER BY orders_together DESC
LIMIT 100;
Notes: - Filter out parent_item_id to ignore bundle/child items duplicates. - Use COUNT(DISTINCT commande_id) to count unique commandes. - This raw table peut être huge for stores with many commandes: we’ll discuss scaling below. This vous donne quick insight: which products co-occur most often. But counts alone are biased by very popular products — we’ll normalize later.

Exploratory checks

- Distribution: how many pairs have more than N co-purchases? - Lifetime vs Windowed: use last 12 months or last 90 days depending on seasonality - Category cross-sells: are co-purchases mostly in the same category or across categories? Run a few queries and create charts. Export the top pairs and verify if they make commercial sense. If a weird pairing shows up, investigate data quality (returns, test commandes).

2) Create a recommendation algorithm basé sur commande history

We’ll start with a simple, effective algorithm: use co-purchase counts and normalize by product popularity to get a score. Two common scores: - Normalized co-purchase frequency: count(product A with B) / count(product A) - Pointwise Mutual Information (PMI) approximation (measures association beyond chance) I’ll show a practical implémentation for both. The first is easier and often enough for cross-sells.

Algorithm 1 — Conditional frequency (easy & interpretable)

score(A→B) = co_count(A,B) / total_commandes_with_A C'est intuitive: of commandes containing A, how many contain B? Vous pouvez also compute symmetric scores if needed. SQL to compute conditional frequency for the last 12 months:
WITH co AS (
  SELECT
    oi1.product_id AS product_id,
    oi2.product_id AS co_product_id,
    COUNT(DISTINCT oi1.order_id) AS co_count
  FROM sales_order_item oi1
  JOIN sales_order_item oi2
    ON oi1.order_id = oi2.order_id
    AND oi1.product_id <> oi2.product_id
  WHERE oi1.parent_item_id IS NULL
    AND oi2.parent_item_id IS NULL
    AND oi1.created_at >= DATE_SUB(NOW(), INTERVAL 12 MONTH)
  GROUP BY oi1.product_id, oi2.product_id
), totals AS (
  SELECT product_id, COUNT(DISTINCT order_id) AS total_order_count
  FROM sales_order_item
  WHERE parent_item_id IS NULL
    AND created_at >= DATE_SUB(NOW(), INTERVAL 12 MONTH)
  GROUP BY product_id
)
SELECT
  c.product_id,
  c.co_product_id,
  c.co_count,
  t.total_order_count,
  (c.co_count / t.total_order_count) AS score
FROM co c
JOIN totals t ON t.product_id = c.product_id
ORDER BY score DESC
LIMIT 100;
This yields a score between 0 and 1. Vous pouvez threshold it to avoid showing weak pairs.

Algorithm 2 — Simple PMI-ish score

PMI = log(P(A,B) / (P(A) * P(B))). In practice, we use counts and logs to avoid tiny probabilities. PMI surfaces surprising associations not explained purely by popularity. Pseudo-SQL outline (simplified):
-- compute counts: co_count, count_a, count_b, total_orders
-- then compute pmi = log((co_count / total_orders) / ((count_a/total_orders) * (count_b/total_orders)))
PMI is more computationally heavy, but good if you want unusual pairings.

From scores to recommendations

For each product A, tri candidate B by score descending and keep top N (e.g., 10). Vous pouvez apply entreprise rules: - Exclude items without stock - Exclude the same SKU - Promote higher-margin items by mulconseillying the score by margin factor - Exclude items with zero visibility or disabled status

3) Implementing the engine: Magento 2 module walkthrough

I’ll show a minimal structure du module and a service that builds recommendations nightly. Module fichiers (high-level): - app/code/Magefine/CrossSell/registration.php - app/code/Magefine/CrossSell/etc/module.xml - app/code/Magefine/CrossSell/etc/crontab.xml - app/code/Magefine/CrossSell/Setup/Patch/Schema/CreateRecommendationTable.php - app/code/Magefine/CrossSell/Model/RecommendationBuilder.php - app/code/Magefine/CrossSell/Console/Command/BuildRecommendations.php (optional) Example registration.php:
<?php
use Magento\Framework\Component\ComponentRegistrar;
ComponentRegistrar::register(
    ComponentRegistrar::MODULE,
    'Magefine_CrossSell',
    __DIR__
);
module.xml:
<?xml version="1.0"?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:Module/etc/module.xsd">
  <module name="Magefine_CrossSell" setup_version="1.0.0" />
</config>
crontab.xml (run nightly):
<?xml version="1.0"?>
<config xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urn:magento:framework:Module/etc/crontab.xsd">
  <group id="default">
    <job name="magefine_crosssell_build" instance="Magefine\CrossSell\Model\RecommendationBuilder" method="execute">
      <schedule>0 3 * * *</schedule>
    </job>
  </group>
</config>
CreateRecommendationTable correctif (simplified):
<?php
namespace Magefine\CrossSell\Setup\Patch\Schema;

use Magento\Framework\Setup\ModuleDataSetupInterface;
use Magento\Framework\Setup\Patch\PatchInterface;
use Magento\Framework\Setup\Patch\SchemaPatchInterface;

class CreateRecommendationTable implements SchemaPatchInterface
{
    private $moduleDataSetup;

    public function __construct(ModuleDataSetupInterface $moduleDataSetup)
    {
        $this->moduleDataSetup = $moduleDataSetup;
    }

    public function apply()
    {
        $installer = $this->moduleDataSetup;
        $installer->getConnection()->startSetup();

        if (!$installer->getConnection()->isTableExists($installer->getTable('magefine_cross_sell_recommendations'))) {
            $table = $installer->getConnection()
                ->newTable($installer->getTable('magefine_cross_sell_recommendations'))
                ->addColumn('id', \Magento\Framework\DB\Ddl\Table::TYPE_INTEGER, null, ['identity' => true, 'unsigned' => true, 'nullable' => false, 'primary' => true], 'ID')
                ->addColumn('product_id', \Magento\Framework\DB\Ddl\Table::TYPE_INTEGER, null, ['unsigned' => true, 'nullable' => false], 'Product ID')
                ->addColumn('linked_product_id', \Magento\Framework\DB\Ddl\Table::TYPE_INTEGER, null, ['unsigned' => true, 'nullable' => false], 'Linked Product ID')
                ->addColumn('score', \Magento\Framework\DB\Ddl\Table::TYPE_DECIMAL, '12,4', ['nullable' => false, 'default' => '0.0000'], 'Score')
                ->addIndex($installer->getIdxName('magefine_cross_sell_recommendations', ['product_id']), ['product_id'])
                ->setComment('Magefine Cross-sell Recommendations');

            $installer->getConnection()->createTable($table);
        }

        $installer->getConnection()->endSetup();
    }

    public static function getDependencies()
    {
        return [];
    }

    public function getAliases()
    {
        return [];
    }
}

RecommendationBuilder.php (core logic)

Cette classe runs the SQL aggregation and writes top N recommendations to your table.
<?php
namespace Magefine\CrossSell\Model;

use Magento\Framework\App\ResourceConnection;
use Psr\Log\LoggerInterface;

class RecommendationBuilder
{
    private $resource;
    private $logger;

    public function __construct(ResourceConnection $resource, LoggerInterface $logger)
    {
        $this->resource = $resource;
        $this->logger = $logger;
    }

    public function execute()
    {
        $connection = $this->resource->getConnection();
        $table = $connection->getTableName('sales_order_item');
        $recTable = $connection->getTableName('magefine_cross_sell_recommendations');

        // 1) compute co-purchase counts in SQL
        $sql = "SELECT oi1.product_id as product_id, oi2.product_id as co_product_id, COUNT(DISTINCT oi1.order_id) as co_count
                FROM {$table} oi1
                JOIN {$table} oi2 ON oi1.order_id = oi2.order_id AND oi1.product_id <> oi2.product_id
                WHERE oi1.parent_item_id IS NULL AND oi2.parent_item_id IS NULL
                /* and oi1.created_at >= DATE_SUB(NOW(), INTERVAL 12 MONTH) */
                GROUP BY oi1.product_id, oi2.product_id";

        $rows = $connection->fetchAll($sql);

        // 2) compute totals per product
        $totalsSql = "SELECT product_id, COUNT(DISTINCT order_id) as total_count FROM {$table} WHERE parent_item_id IS NULL GROUP BY product_id";
        $totals = $connection->fetchAll($totalsSql);
        $totalMap = [];
        foreach ($totals as $t) {
            $totalMap[$t['product_id']] = (int)$t['total_count'];
        }

        // prepare insert batch
        $connection->delete($recTable); // replace old recommendations
        $insertData = [];

        // compute conditional score
        foreach ($rows as $r) {
            $pid = (int)$r['product_id'];
            $cid = (int)$r['co_product_id'];
            $coCount = (int)$r['co_count'];
            if (!isset($totalMap[$pid]) || $totalMap[$pid] == 0) continue;
            $score = $coCount / $totalMap[$pid];
            // keep threshold
            if ($score < 0.01) continue; // configurable
            $insertData[] = ['product_id' => $pid, 'linked_product_id' => $cid, 'score' => $score];
            if (count($insertData) >= 1000) {
                $connection->insertMultiple($recTable, $insertData);
                $insertData = [];
            }
        }
        if (count($insertData)) {
            $connection->insertMultiple($recTable, $insertData);
        }

        $this->logger->info('Magefine CrossSell: recommendations rebuilt.');
    }
}
Notes: - This naive builder reads aggregated lignes into memory — it’s fine for mid-size catalogs, but we’ll optimize below for very large datasets. - Vous pouvez want to run this as a CLI command first to test before enabling the cron.

4) Integrate with Magento 2 native cross-sell fonctionnalités

Il y a two main ways to surface recommendations in the vitrine: 1. Push recommendations into Magento product links (native cross-sell), so the standard block shows them. 2. Build a custom block that reads magefine_cross_sell_recommendations and renders cards. Pushing into product links is nice because it integrates with existing blocks. I’ll show comment update product links programmatically.

Write recommendations into product links

Vous pouvez use the ProductRepository and ProductLink management to create cross-sell links. Example snippet:
// inside a service that has ProductRepositoryInterface $productRepository
// and \Magento\Catalog\Api\Data\ProductLinkInterfaceFactory $productLinkFactory

$product = $this->productRepository->getById($productId);

$links = $product->getProductLinks() ?: [];

$link = $this->productLinkFactory->create();
$link->setSku($product->getSku());
$link->setLinkedProductSku($linkedSku); // linked product SKU
$link->setLinkType('crosssell');
$link->setPosition(0);

$links[] = $link;
$product->setProductLinks($links);
$this->productRepository->save($product);
Important: - Work in batches and respect indexation and cache. Saving full Product objets frequently peut être slow. - For big catalogs, prefer a direct insert into catalog_product_link table using resource model to avoid loading product objets repeatedly. Use Magento APIs where possible for consistency.

Alternative: custom block reading recommendations

Custom block pros: - Faster for reads (you control queries) - No heavy writes into product tables - Easier to A/B test — show magefine list for variant A or native cross-sell for variant B A simple block template fetches top N entries from magefine_cross_sell_recommendations for current product, joins to catalog_product_entity for SKU, and uses product repository for rendering minimal data (or uses the product collection with minimal attributes).

5) Performance: processing large volumes of commandes and many SKUs

Si vous have a big store (100k+ SKUs, millions of commandes), the naive SQL + PHP loop will not scale. Voici production conseils. 1) Use SQL aggregation as much as possible - The single SQL join we used is powerful: let MySQL/Percona do the heavy lifting. - Add proper indexes: sales_commande_item(commande_id, product_id, created_at, parent_item_id). 2) Windowed processing and time-based rollups - Compute recommendations on a rolling window (last 90 days) to keep the dataset manageable and relevant. - Use incremental aggregation: process only new commandes since last run, update co-purchase counters, and re-score. 3) Batch inserts and avoid loading full product models - Use insertMulconseille on the resource connection. - Avoid productRepository->save in the main loop; batch later or use direct DB writes. 4) Precompute and cache results - Store recommendations in a table and cache the UI output in Redis. - Use Varnish for full page caching; set ESI for blocks that need dynamic recommendations per product. 5) Use fichier de messagess for heavy work - For very big workloads, split the job into mulconseille workers via fichier de messages (RabbitMQ). Each worker processes a subset of products and writes partial results. 6) Use approximate counting for very high volume - Use HyperLogLog for distinct counts in streaming scenarios; for cross-sell you usually need exact counts, but approximate counters peut être a pragmatic tradeoff. 7) Use read replicas for heavy analytic queries - Route the aggregation reads to a replica to avoid impacting the write DB.

6) A/B test and measuring impact on AOV

You’ll want to validate that the engine actually increases Average Order Value (AOV) and conversion rate. Here’s a simple A/B plan. Test design: - Divide traffic randomly: Control (existing cross-sell) vs Variant (data-driven magefine cross-sell) - 1:1 split or 40:60 if you’re cautious - Run until you reach statistical significance on primary metric: AOV or revenue per session Metrics to track: - Average Order Value (AOV) - Conversion Rate (sessions → commandes) - Attach rate: % of commandes where a recommended item was added - Revenue uplift and statistical significance (t-test or Bayesian) Implementation approchees: Option A: Client-side experiment - Serve variant decisions in JavaScript (fonctionnalité flag + API that returns recommendations) - Easy to implement but can cause flicker if not handled well Option B: Server-side experiment in Magento - Use a small middleware that picks variant and renders the appropriate block - Record which variant the utilisateur saw in a cookie and persist in commande (so you can attribute revenue) Instrumentation: always record exposure and action with a small event table, for exemple magefine_cross_sell_events with colonnes (session_id, product_id, shown_variant, clicked_product_id, added_to_cart, commande_id, timestamp). Recording exposures & conversions (simple approche): - On page produit render, fire a server-side event or enqueue a small JS beacon: {product_id, variant} - On add-to-cart, record whether the added SKU came from recommendation - On commande completion, connect the commande to the variant cookie and compute AOV

Measuring results and sample size

- Calculate required sample size basé sur baseline AOV and expected uplift. - Avoid running underpowered tests — small traffic leads to false negatives. - Use daily aggregation and don’t stop early; watch for seasonality.

7) Advanced ideas and improvements

Une fois the simple engine proves valeur, consider: - Time decay: weight recent co-purchases more than old ones - Personalized recommendations: incorporate utilisateur purchase history or segments - Session-based recommendations: use last viewed / last added to cart to rank - Use machine learning (collaborative filtreing, matrix factorization, or modern embeddings) for better accuracy - Use Elasticrecherche or Redis for fast lookups and similarity queries - Promote higher margin items or items with inventaire constraints

8) Example: End-to-end walkthrough (practical run) Here’s a compact, runnable flow you can test: 1. Create the recommendation table (apply the schema correctif above). 2. Run RecommendationBuilder execute() from CLI to generate recommendations locally. 3. Inspect top recommendations via SQL:
SELECT r.product_id, r.linked_product_id, r.score, p.sku
FROM magefine_cross_sell_recommendations r
JOIN catalog_product_entity p ON p.entity_id = r.linked_product_id
WHERE r.product_id = 12345
ORDER BY r.score DESC
LIMIT 10;
4. Build a small block that queries the table and renders the SKUs as links with thumbnails. Use catalog/product repository or product image helper to render images. 5. Implement a simple A/B test: variant A shows existing Magento cross-sell, variant B shows magefine recommendations. Store the variant in a cookie and record it with the commande.

9) Security, privacy and compliance

- Personalization may use utilisateur identifiers; respect GDPR and local privacy rules. - Keep PII out of analytic exports unless you have consent. - Si vous export data to tiers ML services, ensure secure connections and contracts.

10) Operational checklist before going live

- Validate recommendations manually for the top 1,000 products - Assurez-vous no disabled or out-of-stock items are shown (or mark them) - Create a cleanup job to remove stale records - Add monitoring: job success/failure logs, number of recommendations generated, average score - Monitor entreprise KPIs after déploiement: AOV, attach rate, returns

Résumé and final conseils

- Start simple: conditional frequency is robust and interpretable. - Use SQL aggregation; push heavy work to the DB where possible. - Store recommendations in a table for fast reads and simple A/B test. - Integrate with Magento natively if you want low-friction display; use custom block for speed and flexibility. - Scale with batching, fichier de messagess, and read replicas when needed. - Measure with carefully designed A/B tests and track AOV, conversion, and attach rates. If you’re running Magento on magefine.com hosting or using a Magefine extension, this architecture fits nicely: keep the recommendation build as a nightly job (or asynchronous workers) and use Redis/Varnish to keep vitrine latency low.

Appendix: Useful SQL & code snippets

1) Fast co-purchase counts using dates and indexation
SELECT
  oi1.product_id,
  oi2.product_id AS co_product,
  COUNT(DISTINCT oi1.order_id) AS co_count
FROM sales_order_item oi1
JOIN sales_order_item oi2
  ON oi1.order_id = oi2.order_id AND oi1.product_id <> oi2.product_id
WHERE oi1.parent_item_id IS NULL
  AND oi2.parent_item_id IS NULL
  AND oi1.created_at >= '2024-01-01'  -- window
GROUP BY oi1.product_id, oi2.product_id
HAVING co_count >= 3; -- threshold
2) Batch update to product link table (direct DB, use carefully)
INSERT INTO catalog_product_link (product_id, linked_product_id, link_type_id)
VALUES (123, 456, (SELECT link_type_id FROM catalog_product_link_type WHERE code = 'crosssell'))
ON DUPLICATE KEY UPDATE product_id = product_id;
3) Minimal JS beacon to record exposures (for experiments)
fetch('/rest/V1/magefine/crosssell/expose', {
  method: 'POST',
  credentials: 'same-origin',
  headers: {'Content-Type': 'application/json'},
  body: JSON.stringify({product_id: 12345, variant: 'magefine_v2'})
});

Closing note

Building a custom cross-sell engine is one of the highest-impact engineering investments for an e-commerce store: it combines relatively simple data work with measurable revenue uplift. Start with a conservative rollout, monitor AOV and attach rates, and iterate. Si vous want, I can: - Provide a full ready-to-install Magefine_CrossSell module with test unitaires - Help tune the scoring fonction for your catalog - Draft an A/B test plan and sample-size calculator for your traffic levels Drop me details about your catalog size and traffic and I’ll adapt the code to your scale.