Overview
At GoGlocal, I owned the full data product, from scraping raw marketplace data to ML-driven SKU automation. The problem was cross-border e-commerce at scale: 1,000+ SKUs selling across Amazon, eBay, Walmart, and Lazada, each with different listing requirements, pricing dynamics, and buyer behavior patterns. Manual management was the bottleneck.
As Manager of Data Science, I directed the strategy and built the core machine learning systems that automated the most labor-intensive workflows — cutting manual effort by 50% and improving revenue estimation efficiency by 30%.
The Problem
Cross-border e-commerce at the SKU level is repetitive and data-intensive work: product categorization, keyword optimization, pricing research, inventory planning, and listing creation, all done by hand, across multiple marketplaces, for hundreds of products. At 1,000+ SKUs, this was operationally untenable.
The additional challenge was the diversity of the data: different marketplaces have different taxonomies, different listing formats, different pricing signals. A solution that worked on Amazon didn’t automatically work on Lazada.
Why It Mattered
Every hour spent manually listing a product was an hour not spent on pricing strategy or inventory positioning. The margin in cross-border e-commerce is thin — operational efficiency is a competitive advantage. Getting forecasts right at the SKU level meant the difference between optimal inventory positions and costly overstock or stockout situations.
Data & Inputs
- Marketplace data: Amazon, eBay, Walmart, Lazada product listings — scraped at scale across multiple regions
- Competitor pricing data — scraped and normalized across different currency, tax, and fee structures
- Historical sales data: SKU-level sell-through rates, velocity, seasonality
- Product attribute data: descriptions, images, weights, dimensions, categories
- Demand signals: search volume, bestseller rank, review velocity
Data scraping was itself a significant engineering challenge: collecting data at scale across regions, in the face of anti-bot measures, required a robust, distributed pipeline.
Approach
The system was built in three layers:
Layer 1: SKU Intelligence
NLP-driven product classification using semantic similarity and attribute extraction. Given a product with its description and attributes, the system automatically placed it in the correct marketplace taxonomy, extracted key searchable attributes, and generated optimized listing copy using fine-tuned OpenAI models.
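The taxonomy-placement step can be sketched as nearest-neighbor matching by cosine similarity. This is a minimal illustration, not the production code: the `embed` function is a toy bag-of-words stand-in for a sentence-transformer embedding, and the taxonomy entries and product description are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a sentence-transformer embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(description: str, taxonomy: Dict[str, str]) -> Tuple[str, float]:
    """Return the taxonomy node whose reference text is most similar to the description."""
    query = embed(description)
    scores = {node: cosine(query, embed(text)) for node, text in taxonomy.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical taxonomy nodes with short reference texts.
TAXONOMY = {
    "Home & Kitchen > Cookware": "cookware pan pot skillet kitchen cooking",
    "Electronics > Headphones": "headphones earbuds wireless audio bluetooth",
    "Sports > Yoga": "yoga mat fitness exercise non-slip",
}

category, score = classify("wireless bluetooth earbuds with charging case", TAXONOMY)
```

With real embeddings, the same structure applies: encode each taxonomy node once, encode the incoming product, and pick the highest-scoring node, optionally routing low-score matches to manual review.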
Image optimization used Stable Diffusion for background replacement and product visualization, since image quality is a significant commercial requirement for marketplace listings.
Layer 2: Pricing Intelligence
Scraped competitor pricing data was normalized and used to train price-response models at the category level. The system recommended optimal price points per marketplace, accounting for marketplace fees, currency, and competitive position.
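To make the normalization step concrete, the sketch below converts a scraped listing price into a comparable net figure in one currency. Everything numeric here is an assumption for illustration: the exchange rates, fee percentages, and the `Listing` fields are hypothetical and do not reflect real marketplace fee schedules or the original pipeline.

```python
from dataclasses import dataclass

# Hypothetical illustration values, not real exchange rates or fee schedules.
FX_TO_USD = {"USD": 1.0, "SGD": 0.75, "GBP": 1.25}
REFERRAL_FEE = {"amazon": 0.15, "ebay": 0.13, "walmart": 0.15, "lazada": 0.04}


@dataclass
class Listing:
    marketplace: str
    price: float      # listed price in local currency, tax included
    currency: str
    tax_rate: float   # tax embedded in the listed price, e.g. 0.20 for 20% VAT


def net_usd(listing: Listing) -> float:
    """Strip embedded tax, convert to USD, then subtract the marketplace fee."""
    pre_tax = listing.price / (1.0 + listing.tax_rate)
    usd = pre_tax * FX_TO_USD[listing.currency]
    return round(usd * (1.0 - REFERRAL_FEE[listing.marketplace]), 2)


uk_listing = Listing("amazon", 24.0, "GBP", 0.20)
sg_listing = Listing("lazada", 20.0, "SGD", 0.0)
comparable = {"amazon_uk": net_usd(uk_listing), "lazada_sg": net_usd(sg_listing)}
```

Putting every competitor price on the same net basis is what makes a category-level price-response model meaningful across marketplaces.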
Real-time sales forecasts (demand at the current price), combined with inventory levels, produced an expected profit margin per SKU per marketplace, which made it possible to prioritize the most valuable listing opportunities.
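One simple way to express that prioritization: cap forecast demand at available inventory, multiply by the per-unit margin after fees, and sort. The field names, the figures, and the assumption that inventory is allocated per marketplace are all hypothetical; the original scoring logic may have differed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Opportunity:
    sku: str
    marketplace: str
    forecast_units: float  # forecast demand at the current price
    on_hand: int           # units available for this marketplace (assumed allocation)
    price: float
    unit_cost: float
    fee_rate: float        # marketplace fee as a fraction of price


def expected_profit(o: Opportunity) -> float:
    """Sellable units (demand capped by stock) times per-unit margin after fees."""
    sellable = min(o.forecast_units, o.on_hand)
    unit_margin = o.price * (1.0 - o.fee_rate) - o.unit_cost
    return sellable * unit_margin


def rank(opportunities: List[Opportunity]) -> List[Opportunity]:
    """Highest expected profit first."""
    return sorted(opportunities, key=expected_profit, reverse=True)


candidates = [
    Opportunity("A1", "lazada", 40, 60, 18.0, 8.0, 0.05),
    Opportunity("A1", "amazon", 100, 60, 20.0, 8.0, 0.15),
]
ranked = rank(candidates)
```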
Layer 3: Demand Forecasting & Inventory Planning
SKU-level demand was forecast with XGBoost and LightGBM using marketplace-specific features. Trend tracking identified velocity changes early, and replenishment recommendations combined the forecast, lead time, and safety stock.
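The replenishment logic (forecast plus lead time plus safety stock) can be illustrated with a common textbook formulation: order up to the forecast demand over the lead time plus a safety buffer of z times the daily demand standard deviation times the square root of the lead time. This is a sketch under that assumption; the service level, parameter names, and the formula choice itself are illustrative rather than taken from the original system.

```python
import math
from statistics import NormalDist
from typing import Sequence


def safety_stock(daily_std: float, lead_time_days: int, service_level: float = 0.95) -> float:
    """z * sigma_daily * sqrt(lead time), assuming independent daily demand errors."""
    z = NormalDist().inv_cdf(service_level)
    return z * daily_std * math.sqrt(lead_time_days)


def replenishment_qty(
    daily_forecast: Sequence[float],
    daily_std: float,
    lead_time_days: int,
    on_hand: float,
    on_order: float = 0.0,
    service_level: float = 0.95,
) -> int:
    """Units to order so stock covers lead-time demand plus safety stock."""
    if len(daily_forecast) < lead_time_days:
        raise ValueError("forecast horizon is shorter than the lead time")
    lead_time_demand = sum(daily_forecast[:lead_time_days])
    target = lead_time_demand + safety_stock(daily_std, lead_time_days, service_level)
    return max(0, math.ceil(target - on_hand - on_order))


# Hypothetical SKU: 10 units/day forecast, std of 3 units/day, 9-day lead time.
qty = replenishment_qty([10.0] * 14, daily_std=3.0, lead_time_days=9, on_hand=50)
```

In practice the `daily_forecast` input would come from the gradient-boosted models, and `daily_std` from their residuals on held-out data.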
Engineering & Implementation
- Scraping infrastructure: Distributed scraping using Ray on AWS — horizontal scaling across multiple scrapers, with rotation logic and rate limiting
- NLP pipeline: Semantic similarity using sentence transformers for product matching across marketplace taxonomies
- LLM integration: Fine-tuned OpenAI models for listing copy generation — prompts engineered for marketplace-specific requirements
- Image pipeline: Stable Diffusion for product image background replacement and enhancement
- Serving: FastAPI endpoints for the merchant-facing analytics product — real-time SKU recommendations and forecast updates
- Data store: PostgreSQL for normalized product and pricing data; Redis for hot pricing data
The architecture was multi-tenant, supporting different merchant accounts with isolated data and compute.
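The rotation logic and rate limiting mentioned for the scraping infrastructure can be sketched with a token bucket and a round-robin rotator. This is a single-process illustration using only the standard library; in a Ray-based setup, each worker could hold its own instances. The class names, the proxy labels, and the injectable `clock` parameter are assumptions made for this example.

```python
import itertools
import time
from typing import Callable, Sequence


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self) -> bool:
        """Consume one token if available; return False when the caller should wait."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ProxyRotator:
    """Cycles through a fixed pool of proxy identifiers in round-robin order."""

    def __init__(self, proxies: Sequence[str]):
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)

    def next_proxy(self) -> str:
        return next(self._cycle)


# Example wiring: at most 2 requests per second per target, rotating two proxies.
bucket = TokenBucket(rate_per_sec=2.0, capacity=2)
rotator = ProxyRotator(["proxy-a", "proxy-b"])
if bucket.try_acquire():
    proxy = rotator.next_proxy()  # a real worker would issue the request via this proxy
```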
Results & Impact
- 50% reduction in manual effort through workflow automation
- 30% improvement in revenue estimation efficiency — faster, more accurate forecasts per SKU
- 1,000+ SKUs automated across 4 marketplaces in multiple regions
- Listing creation time reduced from hours to minutes per product by the NLP pipeline
- Data-driven pricing decisions enabled by the pricing intelligence system, replacing what had been pure manual judgment
Limitations & What I’d Do Differently
The marketplace taxonomy mapping was hand-crafted for the initial set of categories — a proper learned taxonomy alignment would scale better as new product categories are added.
The demand forecasting models were trained per-marketplace in isolation. A hierarchical model that shares information across marketplaces would improve accuracy for low-velocity SKUs where per-marketplace data is sparse.
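As a rough illustration of what sharing information across marketplaces means, the sketch below shrinks each marketplace's average demand toward the cross-marketplace average, with less shrinkage where more data exists. It is a deliberately simplified stand-in for a full hierarchical model, and `prior_strength` is a hypothetical tuning parameter.

```python
from statistics import fmean
from typing import Dict, Sequence


def pooled_estimates(
    sales_by_marketplace: Dict[str, Sequence[float]],
    prior_strength: float = 5.0,
) -> Dict[str, float]:
    """Partial pooling: weight each local mean by its sample size against a global mean."""
    all_obs = [x for obs in sales_by_marketplace.values() for x in obs]
    if not all_obs:
        raise ValueError("no observations to pool")
    global_mean = fmean(all_obs)
    estimates = {}
    for marketplace, obs in sales_by_marketplace.items():
        n = len(obs)
        local_mean = fmean(obs) if n else global_mean
        # With few observations the estimate leans on the global mean.
        estimates[marketplace] = (n * local_mean + prior_strength * global_mean) / (n + prior_strength)
    return estimates


# Hypothetical weekly unit sales for one SKU: rich history on one marketplace, one data point on another.
history = {"amazon": [10.0] * 20, "lazada": [2.0]}
estimates = pooled_estimates(history)
```

Here the sparse marketplace's estimate moves substantially toward the shared average, while the data-rich one barely changes, which is the behavior that would help low-velocity SKUs.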
Stack
Python, Ray, AWS (EC2, S3, Lambda), OpenAI API (fine-tuned), Stable Diffusion, sentence-transformers, XGBoost, LightGBM, FastAPI, PostgreSQL, Redis