Data Cleaning and Quality
Why Data Cleaning Is Critical
AI agents rely on structured, high-quality product data to make purchasing decisions. A product with a vague description, incorrect category, or missing attributes will be difficult for AI agents to understand and recommend correctly, even if it is exposed through all protocol endpoints. ORBEXA includes a built-in AI data cleaning engine, the Refinery Pipeline, that automatically improves data quality before it reaches protocol endpoints.
Unified Refinery Pipeline
The core implementation lives in UnifiedRefineryPipeline.ts and is a multi-stage data processing pipeline:
Stage 1: Field Mapping
The Problem
Different platforms use different field names: Shopify uses body_html for product descriptions, WooCommerce uses description, and a CSV import might use “Product Desc” or “Item Description.”
The Solution
ORBEXA’s field mapping engine (dataQualityService.ts) automatically maps fields from various platforms to UCP/ACP standard fields:
- Platform preset mappings: Built-in mapping rules for major platforms like Shopify and WooCommerce
- Smart inference: Uses semantic analysis to infer the corresponding standard field for non-standard field names
- Manual overrides: Merchants can manually specify mapping relationships in the console
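The three mapping sources above can be combined in a single lookup. The following is a minimal sketch, not the contents of dataQualityService.ts: the StandardField type, the preset tables, the synonym table standing in for semantic inference, and the precedence order (manual override, then platform preset, then inference) are all assumptions made for illustration.

```typescript
// Hypothetical subset of standard field names; the real UCP/ACP schema may differ.
type StandardField = "title" | "description" | "price" | "sku";

// Assumed preset tables. Only the body_html and description entries come from
// the text above.
const PLATFORM_PRESETS = new Map<string, Map<string, StandardField>>([
  ["shopify", new Map<string, StandardField>([["body_html", "description"]])],
  ["woocommerce", new Map<string, StandardField>([["description", "description"]])],
]);

// Stand-in for smart inference: a synonym table keyed by normalized name.
// Real semantic analysis would replace this lookup.
const SYNONYMS = new Map<string, StandardField>([
  ["product desc", "description"],
  ["item description", "description"],
]);

// Lowercase, trim, and collapse underscores, hyphens, and spaces into one space.
function normalizeFieldName(name: string): string {
  return name.trim().toLowerCase().replace(/[_\-\s]+/g, " ");
}

// Resolve a source field name to a standard field, or undefined if unknown.
// Assumed precedence: manual override, then platform preset, then inference.
function mapField(
  sourceField: string,
  platform: string,
  overrides: Map<string, StandardField> = new Map(),
): StandardField | undefined {
  return (
    overrides.get(sourceField) ??
    PLATFORM_PRESETS.get(platform)?.get(sourceField) ??
    SYNONYMS.get(normalizeFieldName(sourceField))
  );
}
```

Under these assumptions, mapField("body_html", "shopify") resolves through the preset table, while mapField("Product Desc", "csv") falls through to the synonym lookup. A field that matches nothing returns undefined, which a caller could treat as a candidate for manual mapping.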
Stage 2: AI Description Optimization
AI is used to automatically improve product description quality:
- Fill in missing information: If key details like material or dimensions are absent from the description but present in images or attributes, they are automatically added
- Standardize expressions: Unifies units of measurement, color terminology, and sizing formats
- Improve AI readability: Rewrites casual or ambiguous descriptions into clearly structured text
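As a small illustration of the "standardize expressions" step, unit spellings can be rewritten to one canonical form. This sketch uses plain regular expressions rather than an AI model, and both the chosen canonical abbreviations (cm, kg, in) and the function name standardizeUnits are assumptions, not ORBEXA's documented behavior.

```typescript
// Each entry pairs a pattern for "<number><optional space><unit spelling>"
// with a canonical replacement. The canonical forms are assumptions.
const UNIT_RULES: Array<[RegExp, string]> = [
  [/(\d+(?:\.\d+)?)\s*(?:centimeters?|centimetres?|cms?)\b/gi, "$1 cm"],
  [/(\d+(?:\.\d+)?)\s*(?:kilograms?|kgs?)\b/gi, "$1 kg"],
  [/(\d+(?:\.\d+)?)\s*(?:inches|inch)\b/gi, "$1 in"],
];

// Apply every rule in turn; text without a recognized unit is returned unchanged.
function standardizeUnits(text: string): string {
  return UNIT_RULES.reduce(
    (acc, [pattern, canonical]) => acc.replace(pattern, canonical),
    text,
  );
}
```

The rules are idempotent for the units listed, so re-running the step on already-cleaned text leaves it unchanged.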
Stage 3: Visual Attribute Extraction
Attributes are automatically identified and extracted from product images:
- Color recognition: Extracts dominant colors and color palettes from images
- Material identification: Identifies fabric, metal, wood, and other material types
- Style classification: Recognizes clothing styles, furniture aesthetics, and other visual characteristics
- Defect detection: Flags image quality issues (blurry, poorly cropped, etc.)
Stage 4: AIO Standardization
AIO (AI Optimization) standardization ensures product data fully meets the consumption requirements of AI agents:
- Unified field formatting (dates, prices, weights, etc.)
- Category taxonomy alignment
- Multi-language field handling
- Data completeness validation
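Two of the checks listed above, unified price formatting and completeness validation, can be sketched as follows. The required-field list, the function names, and the assumption that prices use a dot as the decimal separator are illustrative only; the actual AIO rules are not specified here.

```typescript
// Hypothetical set of fields a product must carry to count as complete.
const REQUIRED_FIELDS: readonly string[] = ["title", "description", "price", "sku"];

// Return the required fields that are absent, null, or blank strings.
function findMissingFields(product: Record<string, unknown>): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = product[field];
    return (
      value === undefined ||
      value === null ||
      (typeof value === "string" && value.trim() === "")
    );
  });
}

// Parse a price string such as "$1,299.00" into a number rounded to two decimals.
// Assumes "." is the decimal separator; locale-specific formats such as
// "1.299,00" would need separate handling.
function normalizePrice(raw: string): number | undefined {
  const cleaned = raw.replace(/[^0-9.]/g, "");
  if (cleaned === "") return undefined;
  const value = Number(cleaned);
  return Number.isFinite(value) ? Math.round(value * 100) / 100 : undefined;
}
```

A product that returns a non-empty list from findMissingFields could be held back or routed to review rather than published to protocol endpoints, though how ORBEXA handles incomplete records is not described in this chapter.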
Hierarchical Flywheel Learning
The Refinery Pipeline’s mapping rules employ a three-tier flywheel learning mechanism:
Merchant-Level Rules
Mapping rules specific to a particular merchant. For example, if a merchant’s CSV file always uses a “Model No.” column for the value that maps to the UCP sku field, this rule applies only to that merchant.
Category-Level Rules
Cross-merchant rules within a specific product category. For example, “Size” fields in apparel products typically map to the size attribute, and this rule applies to all apparel merchants.
Global Rules
Universal rules that apply to all merchants. For example, “Price,” “Retail Price,” and “Unit Price” all map to the price field.
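One way to picture how the three tiers interact is a lookup that tries the most specific tier first. The sketch below assumes that precedence (merchant, then category, then global), which the text does not state explicitly. The MappingRule shape and the identifiers "merchant-123" and "apparel" are hypothetical; the example rules mirror the ones described above.

```typescript
type Tier = "merchant" | "category" | "global";

interface MappingRule {
  tier: Tier;
  scopeId?: string; // merchant ID or category ID; omitted for global rules
  sourceField: string; // compared case-insensitively
  targetField: string;
}

interface ResolveContext {
  merchantId: string;
  categoryId: string;
}

// Assumed precedence: the most specific tier wins.
const TIER_ORDER: Tier[] = ["merchant", "category", "global"];

// Rules taken from the examples in the text; the IDs are made up.
const EXAMPLE_RULES: MappingRule[] = [
  { tier: "merchant", scopeId: "merchant-123", sourceField: "Model No.", targetField: "sku" },
  { tier: "category", scopeId: "apparel", sourceField: "Size", targetField: "size" },
  { tier: "global", sourceField: "Price", targetField: "price" },
  { tier: "global", sourceField: "Retail Price", targetField: "price" },
  { tier: "global", sourceField: "Unit Price", targetField: "price" },
];

// A rule is in scope if it is global or its scopeId matches the context.
function inScope(rule: MappingRule, ctx: ResolveContext): boolean {
  if (rule.tier === "merchant") return rule.scopeId === ctx.merchantId;
  if (rule.tier === "category") return rule.scopeId === ctx.categoryId;
  return true;
}

// Return the first matching rule, searching tiers from most to least specific.
function resolveMapping(
  rules: MappingRule[],
  ctx: ResolveContext,
  sourceField: string,
): MappingRule | undefined {
  const key = sourceField.trim().toLowerCase();
  for (const tier of TIER_ORDER) {
    const hit = rules.find(
      (r) => r.tier === tier && r.sourceField.toLowerCase() === key && inScope(r, ctx),
    );
    if (hit) return hit;
  }
  return undefined;
}
```

With this ordering, a merchant-specific rule can override a global one for the same column name without affecting any other merchant.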
The Flywheel Effect
Human-in-the-Loop (HITL)
AI cleaning cannot be 100% accurate. ORBEXA provides an HITL review dashboard:
- Low-confidence flagging: When the AI has low confidence in a mapping or cleaning result, it is automatically flagged for review
- Manual review interface: Merchants or operators can view before/after comparisons of AI cleaning, and accept or correct results
- Feedback loop: Manual corrections are fed back to the AI model to improve subsequent cleaning accuracy
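The flagging and feedback steps above can be sketched as a confidence threshold plus a record of reviewer corrections. The CleaningResult shape, the 0.8 threshold, and the function names are assumptions for illustration; the chapter does not document how ORBEXA scores confidence or stores feedback.

```typescript
interface CleaningResult {
  productId: string;
  field: string;
  before: string; // value before AI cleaning
  after: string; // value proposed by AI cleaning
  confidence: number; // assumed to be a score between 0 and 1
}

// Hypothetical cutoff below which a result is sent to human review.
const REVIEW_THRESHOLD = 0.8;

// Split results into those accepted automatically and those needing review.
function triage(
  results: CleaningResult[],
  threshold: number = REVIEW_THRESHOLD,
): { autoAccepted: CleaningResult[]; needsReview: CleaningResult[] } {
  const autoAccepted: CleaningResult[] = [];
  const needsReview: CleaningResult[] = [];
  for (const result of results) {
    (result.confidence >= threshold ? autoAccepted : needsReview).push(result);
  }
  return { autoAccepted, needsReview };
}

type ReviewDecision = { kind: "accept" } | { kind: "correct"; value: string };

interface ReviewOutcome {
  finalValue: string;
  // Present only when the reviewer changed the AI output; this is the
  // record that could be fed back to improve later cleaning.
  correction?: { productId: string; field: string; aiValue: string; humanValue: string };
}

// Apply a reviewer's decision to a flagged result.
function applyReview(result: CleaningResult, decision: ReviewDecision): ReviewOutcome {
  if (decision.kind === "accept") return { finalValue: result.after };
  return {
    finalValue: decision.value,
    correction: {
      productId: result.productId,
      field: result.field,
      aiValue: result.after,
      humanValue: decision.value,
    },
  };
}
```

Keeping the correction record separate from the final value means that accepted results generate no feedback noise, while every human edit is captured with both the AI output and the corrected value.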
Product AI Service
Merchants can perform on-demand operations on individual products through the Product AI Service:
- Regenerate description: Use AI to rewrite the product description
- Re-clean: Re-run the complete Refinery Pipeline on an already-cleaned product
- Attribute completion: Trigger visual extraction for products with missing attributes
Summary
The Refinery Pipeline ensures the quality of data entering protocol endpoints through four stages: field mapping, AI description optimization, visual attribute extraction, and AIO standardization. Three-tier flywheel learning continuously accumulates mapping knowledge, and HITL review provides a human quality safety net.
Next chapter: API Reference and Rate Limiting (complete endpoint catalog, authentication methods, and rate limiting policies)