
Data Cleaning and Quality

Why Data Cleaning Is Critical

AI agents rely on structured, high-quality product data to make purchasing decisions. A product with a vague description, an incorrect category, or missing attributes is hard for an agent to understand and recommend correctly, even when it is exposed through every protocol endpoint. ORBEXA includes a built-in AI data cleaning engine, the Refinery Pipeline, that automatically improves data quality before it reaches those endpoints.

Unified Refinery Pipeline

The core implementation lives in UnifiedRefineryPipeline.ts and is a multi-stage data processing pipeline:
Raw data -> Field mapping -> AI description optimization -> Visual attribute extraction -> AIO standardization -> Quality scoring -> Output
Each stage can run independently or be chained sequentially.
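
To make "run independently or be chained" concrete, here is a minimal sketch that models each stage as a function from product record to product record. The names ProductRecord, Stage, mapFields, scoreQuality, and runPipeline are hypothetical and are not taken from UnifiedRefineryPipeline.ts; the two toy stages only stand in for the real ones.

```typescript
// Hypothetical shapes for illustration; not taken from UnifiedRefineryPipeline.ts.
interface ProductRecord {
  [field: string]: unknown;
}
type Stage = (product: ProductRecord) => ProductRecord;

// Toy field-mapping stage: renames Shopify's body_html to description.
const mapFields: Stage = (p) => {
  const { body_html, ...rest } = p;
  return body_html === undefined ? p : { ...rest, description: body_html };
};

// Toy quality-scoring stage: fraction of a few key fields that are filled in.
const scoreQuality: Stage = (p) => {
  const keyFields = ["title", "description", "price"];
  const filled = keyFields.filter((f) => p[f] !== undefined && p[f] !== "").length;
  return { ...p, qualityScore: filled / keyFields.length };
};

// Chaining is a left-to-right fold over the list of stages.
function runPipeline(stages: Stage[], input: ProductRecord): ProductRecord {
  return stages.reduce((acc, stage) => stage(acc), input);
}

// A stage run on its own...
const mappedOnly = mapFields({ title: "Linen shirt", body_html: "Soft linen" });

// ...and the same stage chained with scoring.
const cleaned = runPipeline([mapFields, scoreQuality], {
  title: "Linen shirt",
  body_html: "Soft linen",
});
```

Because every stage shares one signature, a subset of stages can be run by passing a shorter list to runPipeline.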

Stage 1: Field Mapping

The Problem

Different platforms use different field names: Shopify uses body_html for product descriptions, WooCommerce uses description, and a CSV import might use “Product Desc” or “Item Description.”

The Solution

ORBEXA’s field mapping engine (dataQualityService.ts) automatically maps fields from various platforms to UCP/ACP standard fields:
  • Platform preset mappings: Built-in mapping rules for major platforms like Shopify and WooCommerce
  • Smart inference: Uses semantic analysis to infer the corresponding standard field for non-standard field names
  • Manual overrides: Merchants can manually specify mapping relationships in the console
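
The three mechanisms above can be sketched as a single lookup. This is an illustration under assumptions, not the logic in dataQualityService.ts: the precedence (manual override, then platform preset, then inference) is assumed, the preset tables contain only the examples mentioned in this chapter, and simple keyword matching stands in for the semantic analysis. All identifiers are hypothetical.

```typescript
type StandardField = "description" | "title" | "price" | "sku";
type MappingTable = { [sourceField: string]: StandardField };

// Illustrative presets only; the actual built-in rules are not listed here.
const PLATFORM_PRESETS: { [platform: string]: MappingTable } = {
  shopify: { body_html: "description" },
  woocommerce: { description: "description" },
};

// Own-property lookup so keys like "constructor" never hit the prototype.
function own<T>(table: { [key: string]: T }, key: string): T | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Stand-in for semantic inference: normalize the name, then match keywords.
function inferField(sourceField: string): StandardField | undefined {
  const normalized = sourceField.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
  if (/\bdesc(ription)?\b/.test(normalized)) return "description";
  if (/\bprice\b/.test(normalized)) return "price";
  if (/\bsku\b/.test(normalized)) return "sku";
  if (/\b(title|name)\b/.test(normalized)) return "title";
  return undefined;
}

// Assumed precedence: manual override > platform preset > inference.
function resolveField(
  platform: string,
  sourceField: string,
  overrides: MappingTable = {}
): StandardField | undefined {
  const preset = own(PLATFORM_PRESETS, platform) || {};
  return own(overrides, sourceField) || own(preset, sourceField) || inferField(sourceField);
}

const fromPreset = resolveField("shopify", "body_html");      // "description"
const fromInference = resolveField("csv", "Product Desc");    // "description"
const fromOverride = resolveField("csv", "Model No.", { "Model No.": "sku" }); // "sku"
```

A field that matches nothing returns undefined, which is the natural point to ask the merchant for a manual mapping.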

Stage 2: AI Description Optimization

This stage uses AI to improve product description quality in three ways:
  • Fill in missing information: If key details like material or dimensions are absent from the description but present in images or attributes, they are automatically added
  • Standardize expressions: Unifies units of measurement, color terminology, and sizing formats
  • Improve AI readability: Rewrites casual or ambiguous descriptions into clearly structured text
Merchants can manually trigger description regeneration or re-cleaning of individual products through the Product AI Service.
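
As one narrow illustration of "standardize expressions", the sketch below unifies a few length units to centimetres. It is a rule-based stand-in, not the AI-driven logic of the pipeline; the target unit, the supported unit list, and the function name normalizeLength are all assumptions made for the example.

```typescript
// Conversion factors to centimetres for a few common length units (assumed set).
const CM_PER_UNIT: { [unit: string]: number } = {
  mm: 0.1,
  cm: 1,
  m: 100,
  in: 2.54,
  inch: 2.54,
  inches: 2.54,
};

// Parses strings such as "10 in" or "30cm" and returns a value in centimetres,
// rounded to two decimals. Returns undefined for anything it cannot parse.
function normalizeLength(raw: string): string | undefined {
  const match = /^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$/i.exec(raw);
  if (!match) return undefined;
  const unit = (match[2] || "").toLowerCase();
  const factor = Object.prototype.hasOwnProperty.call(CM_PER_UNIT, unit)
    ? CM_PER_UNIT[unit]
    : undefined;
  if (factor === undefined) return undefined;
  const cm = parseFloat(match[1] || "") * factor;
  return `${Math.round(cm * 100) / 100} cm`;
}

const inches = normalizeLength("10 in");   // "25.4 cm"
const compact = normalizeLength("30cm");   // "30 cm"
const unknown = normalizeLength("large");  // undefined
```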

Stage 3: Visual Attribute Extraction

Attributes are automatically identified and extracted from product images:
  • Color recognition: Extracts dominant colors and color palettes from images
  • Material identification: Identifies fabric, metal, wood, and other material types
  • Style classification: Recognizes clothing styles, furniture aesthetics, and other visual characteristics
  • Defect detection: Flags image quality issues (blurry, poorly cropped, etc.)
These visual attributes are added to the product information as supplementary data, improving AI agent comprehension accuracy.

Stage 4: AIO Standardization

AIO (AI Optimization) standardization ensures product data arrives in the form AI agents expect to consume:
  • Unified field formatting (dates, prices, weights, etc.)
  • Category taxonomy alignment
  • Multi-language field handling
  • Data completeness validation
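
Two of the checks above, price formatting and completeness validation, can be sketched as follows. The required-field list, the StandardProduct shape, and the two-decimal price format are assumptions for illustration; this chapter does not specify them. The price parser also assumes commas are thousands separators, so a European-style value such as "1.299,50" would need different handling.

```typescript
// Hypothetical standardized product shape.
interface StandardProduct {
  title?: string;
  description?: string;
  price?: string;
  category?: string;
}

// Assumed set of fields that must be present and non-blank.
const REQUIRED_FIELDS: (keyof StandardProduct)[] = [
  "title",
  "description",
  "price",
  "category",
];

// Completeness validation: lists required fields that are absent or blank.
function missingFields(product: StandardProduct): string[] {
  return REQUIRED_FIELDS.filter((field) => {
    const value = product[field];
    return value === undefined || value.trim() === "";
  });
}

// Price formatting: strips currency symbols and thousands separators,
// then renders a fixed two-decimal string. Returns undefined if not numeric.
function normalizePrice(raw: string): string | undefined {
  const cleaned = raw.replace(/[^0-9.]/g, "");
  const value = Number(cleaned);
  if (cleaned === "" || !isFinite(value)) return undefined;
  return value.toFixed(2);
}

const price = normalizePrice("$1,299.5"); // "1299.50"
const gaps = missingFields({ title: "Mug", price: "9.00", description: "  " });
// gaps: ["description", "category"]
```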

Hierarchical Flywheel Learning

The Refinery Pipeline’s mapping rules employ a three-tier flywheel learning mechanism:

Merchant-Level Rules

Mapping rules specific to a particular merchant. For example, if a merchant’s CSV file always uses a “Model No.” column for the value that belongs in the UCP sku field, this rule applies only to that merchant.

Category-Level Rules

Cross-merchant rules within a specific product category. For example, “Size” fields in apparel products typically map to the size attribute — this rule applies to all apparel merchants.

Global Rules

Universal rules that apply to all merchants. For example, “Price,” “Retail Price,” and “Unit Price” all map to the price field.

The Flywheel Effect

Merchant-level rules (highest precision, narrowest coverage)
      | Promoted after accumulating enough samples
Category-level rules (moderate precision, moderate coverage)
      | Promoted after continuous validation
Global rules (highest universality, broadest coverage)
Every time a merchant manually corrects a mapping, the system learns the rule. When multiple merchants in the same category produce similar corrections, the rule is automatically promoted to category level. When a category-level rule passes global validation, it is promoted to a global rule. The more merchants that connect to ORBEXA, the faster the flywheel spins and the more accurate mappings become.
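
A rough sketch of the two ideas in this section: looking up a rule across the three tiers, and promoting a merchant correction to category level once several merchants agree. The precedence order (the most specific tier wins), the threshold of three distinct merchants, and every identifier below are assumptions; the document does not state the actual promotion criteria.

```typescript
type Tier = "merchant" | "category" | "global";

interface Rule {
  tier: Tier;
  scope: string; // merchant id, category name, or "*" for global
  sourceField: string;
  targetField: string;
}

interface Correction {
  merchantId: string;
  category: string;
  sourceField: string;
  targetField: string;
}

// Assumed precedence: the most specific tier wins.
const TIER_ORDER: Tier[] = ["merchant", "category", "global"];

function resolveRule(
  rules: Rule[],
  merchantId: string,
  category: string,
  sourceField: string
): Rule | undefined {
  const scopeFor = (tier: Tier): string =>
    tier === "merchant" ? merchantId : tier === "category" ? category : "*";
  const wanted = sourceField.toLowerCase();
  const candidates = rules.filter(
    (r) => r.sourceField.toLowerCase() === wanted && r.scope === scopeFor(r.tier)
  );
  candidates.sort((a, b) => TIER_ORDER.indexOf(a.tier) - TIER_ORDER.indexOf(b.tier));
  return candidates[0];
}

// Hypothetical threshold: distinct merchants needed before category promotion.
const MIN_MERCHANTS_FOR_CATEGORY = 3;

// Returns "category|sourcefield|targetfield" keys that enough merchants agree on.
function categoryPromotions(corrections: Correction[]): string[] {
  const merchantsByRule: { [key: string]: string[] } = {};
  corrections.forEach((c) => {
    const key = [c.category, c.sourceField.toLowerCase(), c.targetField].join("|");
    const seen = merchantsByRule[key] || (merchantsByRule[key] = []);
    if (seen.indexOf(c.merchantId) === -1) seen.push(c.merchantId);
  });
  return Object.keys(merchantsByRule).filter(
    (key) => (merchantsByRule[key] || []).length >= MIN_MERCHANTS_FOR_CATEGORY
  );
}
```

Counting distinct merchants rather than raw corrections keeps a single merchant who repeats the same fix from triggering a promotion on their own.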

Human-in-the-Loop (HITL)

AI cleaning cannot be 100% accurate. ORBEXA provides an HITL review dashboard:
  • Low-confidence flagging: When the AI has low confidence in a mapping or cleaning result, it is automatically flagged for review
  • Manual review interface: Merchants or operators can view before/after comparisons of AI cleaning, and accept or correct results
  • Feedback loop: Manual corrections are fed back to the AI model to improve subsequent cleaning accuracy
HITL is not optional. It is a core component of the Refinery Pipeline that gives data quality a human safety net.
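
Low-confidence flagging can be pictured as a simple triage step. The sketch below assumes each cleaning result carries a confidence score between 0 and 1 and uses a made-up threshold of 0.8; neither the score scale nor the threshold nor the type names come from ORBEXA.

```typescript
// Hypothetical record of one AI cleaning change, with before/after values.
interface CleaningResult {
  productId: string;
  field: string;
  before: string;
  after: string;
  confidence: number; // assumed to be in the range 0..1
}

// Hypothetical cutoff below which a result goes to human review.
const REVIEW_THRESHOLD = 0.8;

function triage(results: CleaningResult[]): {
  autoAccepted: CleaningResult[];
  needsReview: CleaningResult[];
} {
  const autoAccepted: CleaningResult[] = [];
  const needsReview: CleaningResult[] = [];
  results.forEach((r) => {
    (r.confidence >= REVIEW_THRESHOLD ? autoAccepted : needsReview).push(r);
  });
  // Show reviewers the least certain results first.
  needsReview.sort((a, b) => a.confidence - b.confidence);
  return { autoAccepted, needsReview };
}

const queue = triage([
  { productId: "p1", field: "color", before: "blu", after: "blue", confidence: 0.95 },
  { productId: "p2", field: "material", before: "", after: "cotton", confidence: 0.55 },
  { productId: "p3", field: "category", before: "misc", after: "apparel", confidence: 0.7 },
]);
```

Keeping both before and after on each result is what lets a review interface show the comparison described above.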

Product AI Service

Merchants can perform on-demand operations on individual products through the Product AI Service:
  • Regenerate description: Use AI to rewrite the product description
  • Re-clean: Re-run the complete Refinery Pipeline on an already-cleaned product
  • Attribute completion: Trigger visual extraction for products with missing attributes

Summary

The Refinery Pipeline ensures the quality of data entering protocol endpoints through four stages: field mapping, AI description optimization, visual attribute extraction, and AIO standardization, followed by quality scoring before output. Three-tier flywheel learning continuously accumulates mapping knowledge, and HITL review provides a human safety net.
Next chapter: API Reference and Rate Limiting — Complete endpoint catalog, authentication methods, and rate limiting policies