Overview
At GoGlocal I owned the full data product - not a component of someone else’s system, the whole thing - and the people who used its output every day were the merchandising team.
The problem was cross-border e-commerce at scale: 1,000+ SKUs selling across Amazon, eBay, Walmart, and Lazada, each with different listing requirements, pricing dynamics, and buyer behavior. The merchandising team was handling all of it by hand. Every hour spent researching a competitor’s price or classifying a product was an hour not spent on strategy. Margins in cross-border e-commerce are thin; operational inefficiency is a competitive disadvantage that compounds.
As Manager of Data Science, I directed the strategy and built the core machine learning systems that fed the team the two things they spent the most time producing manually: where a product belongs and what it should sell for. The result was a 50% reduction in manual effort and a 30% improvement in revenue-estimation efficiency.
The Problem
Cross-border e-commerce at the SKU level is repetitive and data-intensive. Product classification, keyword optimization, pricing research, competitor checks, and listing creation were all done by hand, across four marketplaces, for over a thousand products. At that volume the work was operationally untenable: the workload grew with the catalog, but capacity grew only by adding headcount.
The compounding challenge: different marketplaces have different taxonomies, listing formats, and pricing signals. A price or classification that worked on Amazon did not transfer to Lazada. Every output the team consumed had to be marketplace-aware, or it was wrong on three marketplaces out of four.
Data & Inputs
- Marketplace data: Amazon, eBay, Walmart, Lazada product listings - scraped at scale across multiple regions
- Competitor pricing data - scraped and normalized across different currency, tax, and fee structures
- Historical sales data: SKU-level sell-through rates, velocity, seasonality
- Product attribute data: descriptions, images, weights, dimensions, categories
- Demand signals: search volume, bestseller rank, review velocity
Data scraping was itself a significant engineering challenge. Scraping at scale across regions with anti-bot measures required a robust, distributed pipeline, and that infrastructure had to exist before any machine learning could run or any usable output could reach the team.
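To make the normalization step concrete, here is a minimal sketch of converting a scraped competitor price into a tax-exclusive figure in one base currency. The exchange rates, field names, and the `normalize_price` helper are illustrative assumptions, not the production schema. Marketplace fees are left out because they apply to our own listings rather than to a competitor's displayed price.

```python
from dataclasses import dataclass

# Hypothetical static rates to USD; a real pipeline would read these from a rates feed.
FX_TO_USD = {"USD": 1.0, "GBP": 1.25, "SGD": 0.75}


@dataclass(frozen=True)
class ScrapedPrice:
    marketplace: str
    amount: float        # price as displayed on the listing
    currency: str        # ISO currency code of the displayed price
    tax_rate: float      # e.g. 0.20 for 20% VAT
    tax_inclusive: bool  # True if the displayed price already includes tax


def normalize_price(p: ScrapedPrice) -> float:
    """Return the tax-exclusive price in USD, rounded to cents."""
    net = p.amount / (1 + p.tax_rate) if p.tax_inclusive else p.amount
    return round(net * FX_TO_USD[p.currency], 2)


uk = ScrapedPrice("eBay", 24.00, "GBP", 0.20, True)
sg = ScrapedPrice("Lazada", 40.00, "SGD", 0.09, False)
comparable = {p.marketplace: normalize_price(p) for p in (uk, sg)}
```

Once every price is on the same basis, listings from different regions can be compared directly.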
Approach
Three layers. Each one feeds the next, and the team consumed the output of all three.
Layer 1: SKU Intelligence (classification & attribute extraction)
NLP-driven product classification using semantic similarity and attribute extraction. Given a product with its description and attributes, the system automatically placed it in the correct marketplace taxonomy, extracted the key searchable attributes, and generated optimized listing copy using fine-tuned OpenAI models. This is what replaced the team manually deciding which category a SKU belonged to on each marketplace.
The same layer handled image optimization, using Stable Diffusion for background replacement and product visualization. Listing image quality is a real commercial requirement on marketplaces, and most teams handle it by hand.
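The classification step in this layer can be sketched as a nearest-category lookup by cosine similarity. To keep the sketch dependency-free, `embed` is a toy bag-of-words stand-in for a sentence-transformer embedding, and the category labels in `TAXONOMIES` are hypothetical, not the real marketplace taxonomies.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a sentence embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical category labels, one list per marketplace.
TAXONOMIES = {
    "amazon": ["Home & Kitchen > Kitchen Knives", "Sports & Outdoors > Yoga Mats"],
    "lazada": ["Kitchen & Dining > Knives & Cutlery", "Sports > Exercise & Fitness > Yoga"],
}


def classify(description: str, marketplace: str) -> tuple[str, float]:
    """Return the best-matching category for this marketplace and its score."""
    query = embed(description)
    scored = [(cat, cosine(query, embed(cat))) for cat in TAXONOMIES[marketplace]]
    return max(scored, key=lambda pair: pair[1])


description = "Non-slip yoga mat for home exercise"
placements = {m: classify(description, m)[0] for m in TAXONOMIES}
```

The same description resolves to a different label on each marketplace, which is the marketplace-aware behavior the team needed. A dense embedding model would also match wording that shares no tokens with the category name, which this toy version cannot do.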
Layer 2: Pricing Intelligence & Competitor Analysis
Scraped competitor pricing was normalized and used to train price-response models at the category level. The system recommended an optimal price point per SKU per marketplace, accounting for marketplace fees, currency, and competitive position. Combined with sales forecasting and inventory levels, it produced an expected profit margin per SKU per marketplace, so the team could prioritize the most valuable listing opportunities rather than guess. This pricing output is what the merchandising team used to set prices across marketplaces.
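As a rough illustration of how an expected margin per SKU per marketplace can be derived, the sketch below combines a recommended price, an exchange rate, a marketplace fee rate, a unit cost, a sales forecast, and stock on hand. The `Offer` fields, the flat percentage fee, and the rule of capping sellable units at inventory are simplifying assumptions for the example, not the production model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    sku: str
    marketplace: str
    price_local: float     # recommended listing price in local currency
    fx_to_usd: float       # conversion rate from local currency to USD
    fee_rate: float        # marketplace fee as a fraction of price (assumed flat)
    unit_cost_usd: float   # landed unit cost in USD
    forecast_units: float  # forecast demand over the planning window
    units_on_hand: float   # inventory available to sell


def expected_margin_usd(o: Offer) -> float:
    """Expected profit in USD: net revenue per unit minus cost, times sellable units."""
    net_revenue_per_unit = o.price_local * o.fx_to_usd * (1.0 - o.fee_rate)
    sellable_units = min(o.forecast_units, o.units_on_hand)
    return round((net_revenue_per_unit - o.unit_cost_usd) * sellable_units, 2)


def rank_offers(offers: list[Offer]) -> list[Offer]:
    """Order listing opportunities from most to least valuable."""
    return sorted(offers, key=expected_margin_usd, reverse=True)


offers = [
    Offer("A1", "lazada", 30.0, 0.75, 0.10, 10.0, 40, 30),
    Offer("A1", "amazon", 20.0, 1.00, 0.15, 10.0, 100, 200),
]
ranked = rank_offers(offers)
```

Ranking by this number is what lets a team work the highest-value listings first.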
Layer 3: Demand Forecasting & Inventory Planning
SKU-level demand was forecast with XGBoost and LightGBM using marketplace-specific features. Trend tracking caught velocity changes early, before stockouts or overstock developed, and replenishment recommendations combined the forecast with lead time and safety stock.
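The replenishment logic can be illustrated with the textbook reorder-point formula: forecast demand over the lead time plus a safety stock buffer. This is a sketch under stated assumptions rather than the exact rule used. It assumes roughly normal daily demand, and the function names and the default `z` (about a 95% service level) are illustrative.

```python
import math


def safety_stock(daily_demand_std: float, lead_time_days: float, z: float = 1.65) -> float:
    """Buffer for demand variability over the lead time (normal-demand assumption)."""
    return z * daily_demand_std * math.sqrt(lead_time_days)


def reorder_quantity(
    forecast_daily_demand: float,
    lead_time_days: float,
    daily_demand_std: float,
    on_hand: float,
    on_order: float = 0.0,
    z: float = 1.65,
) -> int:
    """Units to order so on-hand plus on-order stock reaches the reorder point."""
    reorder_point = forecast_daily_demand * lead_time_days + safety_stock(
        daily_demand_std, lead_time_days, z
    )
    shortfall = reorder_point - (on_hand + on_order)
    return max(0, math.ceil(shortfall))


# 10 units/day forecast, 16-day lead time, std of 5 units/day, z = 2.0:
# reorder point = 10 * 16 + 2.0 * 5 * sqrt(16) = 200, so 150 on hand leaves 50 to order.
qty = reorder_quantity(10, 16, 5, on_hand=150, z=2.0)
```

In a setup like the one described, the forecast would come from the XGBoost or LightGBM model for that marketplace.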
Engineering & Implementation
- Scraping infrastructure: Distributed scraping using Ray on AWS - horizontal scaling across multiple scrapers, with rotation logic and rate limiting
- NLP pipeline: Semantic similarity using sentence transformers for product matching across marketplace taxonomies
- LLM integration: Fine-tuned OpenAI models for listing copy generation - prompts engineered for marketplace-specific requirements
- Image pipeline: Stable Diffusion for product image background replacement and enhancement
- Serving: FastAPI endpoints for the merchant-facing analytics product - real-time SKU recommendations and forecast updates that the merchandising team queried directly
- Data store: PostgreSQL for normalized product and pricing data; Redis for hot pricing data
The architecture was multi-tenant, supporting different merchant accounts with isolated data and compute.
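The rotation logic and rate limiting mentioned in the scraping bullet can be sketched with a per-domain token bucket and a round-robin proxy pool. This is a standalone illustration that does not depend on Ray. The class names, the proxy labels, and the injectable `clock` parameter (included so the behavior can be tested without sleeping) are assumptions made for the example.

```python
import itertools
import time
from typing import Callable, Optional


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RotatingFetchPlanner:
    """Decides whether a request may go out now, and through which proxy."""

    def __init__(self, proxies: list[str], rate_per_sec: float, burst: int,
                 clock: Callable[[], float] = time.monotonic):
        self._proxies = itertools.cycle(proxies)
        self._rate, self._burst, self._clock = rate_per_sec, burst, clock
        self._buckets: dict[str, TokenBucket] = {}  # one bucket per target domain

    def next_request(self, domain: str, url: str) -> Optional[dict]:
        if domain not in self._buckets:
            self._buckets[domain] = TokenBucket(self._rate, self._burst, self._clock)
        if not self._buckets[domain].try_acquire():
            return None  # throttled: caller should retry later
        return {"url": url, "proxy": next(self._proxies)}
```

A throttled call returns `None` without consuming a proxy, so rotation stays even across the requests that are actually sent.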
What Changed for the Team
The merchandising team stopped doing per-SKU pricing research and manual category mapping and started working from the system’s output:
- Pricing: prices across Amazon, eBay, Walmart, and Lazada were set from the model’s recommended, fee- and currency-adjusted price points instead of one-off manual research per SKU - turning pricing from pure manual judgment into a data-driven decision.
- Listing: products were classified, attribute-tagged, and given listing copy automatically, cutting listing-creation time from hours to minutes per product.
- Prioritization: expected margin per SKU per marketplace let the team work the most valuable listings first rather than processing the catalog blindly.
The net effect, measured across 1,000+ SKUs on 4 marketplaces: 50% reduction in manual effort and 30% improvement in revenue-estimation efficiency - faster, more accurate per-SKU forecasts.
Limitations & What I’d Do Differently
The marketplace taxonomy mapping was hand-crafted for the initial set of categories. A proper learned taxonomy alignment would scale better as new product categories are added - without requiring manual extension every time the catalog expands.
The demand forecasting models were trained per-marketplace in isolation. A hierarchical model that shares information across marketplaces would improve accuracy for low-velocity SKUs where per-marketplace data is sparse.
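The idea of sharing information across marketplaces can be illustrated with a simple shrinkage estimator. It is a much lighter stand-in for a full hierarchical model, and it was not part of the delivered system. Each marketplace's mean demand is pulled toward the SKU's cross-marketplace mean, and the pull is stronger where observations are sparse. The function name and `prior_strength` value are hypothetical.

```python
def pooled_demand(
    sales_by_marketplace: dict[str, list[float]], prior_strength: float = 8.0
) -> dict[str, float]:
    """Shrink each marketplace's mean toward the pooled mean.

    weight = n / (n + prior_strength), so a marketplace with few observations
    leans mostly on the pooled mean, while one with many keeps its own mean.
    """
    all_obs = [x for obs in sales_by_marketplace.values() for x in obs]
    if not all_obs:
        raise ValueError("need at least one observation")
    global_mean = sum(all_obs) / len(all_obs)

    estimates = {}
    for marketplace, obs in sales_by_marketplace.items():
        n = len(obs)
        local_mean = sum(obs) / n if n else global_mean
        weight = n / (n + prior_strength)
        estimates[marketplace] = weight * local_mean + (1 - weight) * global_mean
    return estimates


# Eight weeks of data on one marketplace, two on another.
history = {"amazon": [10.0] * 8, "lazada": [0.0, 0.0]}
estimates = pooled_demand(history)
```

With this toy history the pooled mean is 8.0. The data-rich marketplace moves only from 10.0 to 9.0, while the sparse one moves from 0.0 to 6.4. That is the behavior that would help low-velocity SKUs.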
Stack
Python, Ray, AWS (EC2, S3, Lambda), OpenAI API (fine-tuned), Stable Diffusion, sentence-transformers, XGBoost, LightGBM, FastAPI, PostgreSQL, Redis