Data Cleaning and Quality

Why Data Cleaning Is Critical

AI agents rely on structured, high-quality product data to make purchasing decisions. Even when a product is exposed through every protocol endpoint, a vague description, an incorrect category, or missing attributes make it difficult for AI agents to understand and recommend it correctly. ORBEXA includes a built-in AI data cleaning engine, the Refinery Pipeline, that automatically improves data quality before it reaches protocol endpoints.

Unified Refinery Pipeline

ORBEXA’s data quality engine uses a multi-stage processing pipeline:

Raw data -> Field mapping -> AI description optimization -> Visual attribute extraction -> AI-powered standardization -> Quality scoring -> Output

Each stage can run independently or be chained sequentially.
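
As a rough illustration of "independent or chained" stages, the sketch below models each stage as a plain function from a product record to a product record. The stage bodies, field names, and the run_pipeline helper are hypothetical stand-ins, not ORBEXA's actual implementation.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Product = Dict[str, Any]
Stage = Callable[[Product], Product]


def map_fields(product: Product) -> Product:
    # Hypothetical stand-in for field mapping: rename a Shopify-style field.
    out = dict(product)
    if "body_html" in out:
        out["description"] = out.pop("body_html")
    return out


def standardize(product: Product) -> Product:
    # Hypothetical stand-in for standardization: prices become 2-decimal floats.
    out = dict(product)
    if "price" in out:
        out["price"] = round(float(out["price"]), 2)
    return out


def score_quality(product: Product) -> Product:
    # Assumed scoring rule: share of required fields that are non-empty.
    required = ("title", "description", "price")
    present = sum(1 for f in required if product.get(f) not in (None, ""))
    return {**product, "quality_score": present / len(required)}


def run_pipeline(product: Product, stages: List[Stage]) -> Product:
    # Chain stages left to right; each stage receives the previous output.
    return reduce(lambda acc, stage: stage(acc), stages, product)


raw = {"title": "Oak Desk", "body_html": "Solid oak desk", "price": "199.5"}
cleaned = run_pipeline(raw, [map_fields, standardize, score_quality])
single = standardize({"price": "10"})  # a stage can also run on its own
```

Because every stage shares the same signature, any subset can be run alone or composed in a different order without changing the stages themselves.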

Stage 1: Field Mapping

The Problem

Different platforms use different field names: Shopify uses body_html for product descriptions, WooCommerce uses description, and a CSV import might use “Product Desc” or “Item Description.”

The Solution

ORBEXA’s field mapping engine automatically maps fields from various platforms to UCP/ACP standard fields:

Platform preset mappings: Built-in mapping rules for major platforms like Shopify and WooCommerce
Smart inference: Uses semantic analysis to infer the corresponding standard field for non-standard field names
Manual overrides: Merchants can manually specify mapping relationships in the console
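
The three mechanisms above can be sketched as a single lookup function. This is a minimal sketch under stated assumptions: the preset and alias tables are invented for illustration, the precedence order (override, then preset, then inference) is an assumption, and fuzzy string matching from Python's difflib stands in for the semantic analysis the engine actually uses.

```python
import difflib
from typing import Dict, Optional

# Illustrative presets built from the field names mentioned above.
PRESETS: Dict[str, Dict[str, str]] = {
    "shopify": {"body_html": "description"},
    "woocommerce": {"description": "description"},
}

# Hypothetical alias table used by the inference step.
KNOWN_ALIASES: Dict[str, str] = {
    "product desc": "description",
    "item description": "description",
    "description": "description",
}


def normalize(name: str) -> str:
    return name.strip().lower().replace("_", " ")


def resolve_field(
    source_field: str,
    platform: Optional[str] = None,
    overrides: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    # 1. Manual overrides win (assumed precedence).
    if overrides and source_field in overrides:
        return overrides[source_field]
    # 2. Platform preset mappings.
    if platform and source_field in PRESETS.get(platform, {}):
        return PRESETS[platform][source_field]
    # 3. Inference: fuzzy matching as a stand-in for semantic analysis.
    match = difflib.get_close_matches(
        normalize(source_field), list(KNOWN_ALIASES), n=1, cutoff=0.8
    )
    return KNOWN_ALIASES[match[0]] if match else None
```

For example, resolve_field("body_html", platform="shopify") and resolve_field("Product Desc") both resolve to "description", while an unrecognized name returns None so it can be routed to manual mapping.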

Stage 2: AI Description Optimization

AI is used to automatically improve product description quality:

Fill in missing information: If key details like material or dimensions are absent from the description but present in images or attributes, they are automatically added
Standardize expressions: Unifies units of measurement, color terminology, and sizing formats
Improve AI readability: Rewrites casual or ambiguous descriptions into clearly structured text
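
To make "standardize expressions" concrete, here is a small rule-based sketch that rewrites length measurements into centimeters. It only illustrates the idea of unit unification; the conversion table, the choice of centimeters as the target unit, and the function name are assumptions, and the real engine is described as AI-driven rather than regex-driven.

```python
import re

# Assumed conversion factors to centimeters.
UNIT_TO_CM = {
    "cm": 1.0,
    "mm": 0.1,
    "m": 100.0,
    "in": 2.54,
    "inch": 2.54,
    "inches": 2.54,
}

# Longer unit names are listed before their prefixes ("inches" before "in").
_LENGTH = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm|m|inches|inch|in)\b", re.IGNORECASE)


def standardize_lengths(text: str) -> str:
    """Rewrite every recognized length in `text` as a value in cm."""

    def to_cm(match: "re.Match[str]") -> str:
        value = float(match.group(1)) * UNIT_TO_CM[match.group(2).lower()]
        # ":g" trims trailing zeros and float noise (25.4, not 25.400000000000002).
        return f"{value:g} cm"

    # Limitation: "in" used as a preposition after a number would also match.
    return _LENGTH.sub(to_cm, text)
```

A description such as "Width 10 in, height 30cm" becomes "Width 25.4 cm, height 30 cm", giving downstream agents one unit to compare against.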

Merchants can manually trigger description regeneration or re-cleaning of individual products through the Product AI Service.

Stage 3: Visual Attribute Extraction

Attributes are automatically identified and extracted from product images:

Color recognition: Extracts dominant colors and color palettes from images
Material identification: Identifies fabric, metal, wood, and other material types
Style classification: Recognizes clothing styles, furniture aesthetics, and other visual characteristics
Defect detection: Flags image quality issues (blurry, poorly cropped, etc.)

These visual attributes are added to the product information as supplementary data, improving AI agent comprehension accuracy.
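
Color recognition is the easiest of these extractions to illustrate without a vision model. The sketch below finds dominant colors by bucketing RGB pixels and counting them. It is a simplified stand-in, not ORBEXA's method: the bucket size and function name are assumptions, and in practice the pixel values would come from an image decoding library.

```python
from collections import Counter
from typing import Iterable, List, Tuple

RGB = Tuple[int, int, int]


def dominant_colors(pixels: Iterable[RGB], top_n: int = 3, bucket: int = 32) -> List[RGB]:
    """Return the `top_n` most frequent colors after coarse quantization."""

    def quantize(color: RGB) -> RGB:
        # Snap each channel to the center of its bucket so near-identical
        # shades are counted as one color.
        r, g, b = (
            (channel // bucket) * bucket + bucket // 2 for channel in color
        )
        return (r, g, b)

    counts = Counter(quantize(p) for p in pixels)
    return [color for color, _ in counts.most_common(top_n)]


# Mostly red pixels with a few blue ones.
sample = [(250, 10, 10)] * 5 + [(255, 5, 5)] + [(10, 10, 250)] * 2
palette = dominant_colors(sample, top_n=2)
```

The resulting palette could then be attached to the product record as supplementary attributes alongside the merchant-supplied data.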

Stage 4: AI-Powered Standardization

AI-powered standardization ensures product data fully meets the consumption requirements of AI agents:

Unified field formatting (dates, prices, weights, etc.)
Category taxonomy alignment
Multi-language field handling
Data completeness validation
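
A minimal sketch of two of these checks, unified field formatting and completeness validation, might look like the following. The required-field list, the accepted input date formats, and the function names are assumptions chosen for illustration.

```python
from datetime import datetime
from decimal import ROUND_HALF_UP, Decimal
from typing import Any, Dict, List

# Assumed set of fields a product must carry to count as complete.
REQUIRED_FIELDS = ("title", "description", "price", "sku")


def normalize_price(raw: str) -> str:
    # Strip a currency symbol and thousands separators, keep two decimals.
    cleaned = raw.replace("$", "").replace(",", "").strip()
    return str(Decimal(cleaned).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))


def normalize_date(raw: str) -> str:
    # Try a few assumed input formats and emit ISO 8601 (YYYY-MM-DD).
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def missing_fields(product: Dict[str, Any]) -> List[str]:
    # A field counts as missing when absent, None, or blank.
    missing = []
    for field in REQUIRED_FIELDS:
        value = product.get(field)
        if value is None or str(value).strip() == "":
            missing.append(field)
    return missing
```

Using Decimal rather than float for prices avoids binary rounding surprises, and returning the list of missing fields (instead of a boolean) tells a reviewer exactly what to fill in.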

Hierarchical Flywheel Learning

The Refinery Pipeline’s mapping rules employ a three-tier flywheel learning mechanism:

Merchant-Level Rules

Mapping rules specific to a particular merchant. For example, if a merchant’s CSV file always uses a “Model No.” column for the value that maps to the UCP sku field, this rule applies only to that merchant.

Category-Level Rules

Cross-merchant rules within a specific product category. For example, “Size” fields in apparel products typically map to the size attribute — this rule applies to all apparel merchants.

Global Rules

Universal rules that apply to all merchants. For example, “Price,” “Retail Price,” and “Unit Price” all map to the price field.

The Flywheel Effect

Merchant-level rules (highest precision, narrowest coverage)
      | Promoted after accumulating enough samples
Category-level rules (moderate precision, moderate coverage)
      | Promoted after continuous validation
Global rules (highest universality, broadest coverage)

Every time a merchant manually corrects a mapping, the system learns the rule. When multiple merchants in the same category produce similar corrections, the rule is automatically promoted to category level. When a category-level rule passes global validation, it is promoted to a global rule. The more merchants that connect to ORBEXA, the faster the flywheel spins and the more accurate mappings become.
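
The lookup order and the merchant-to-category promotion described above can be sketched as follows. The RuleStore class, the promote_after threshold, and the "most specific tier wins" resolution order are assumptions; the source does not specify how many samples trigger promotion or how global validation works, so global rules are simply seeded by hand here.

```python
from collections import defaultdict
from typing import Dict, Optional, Set, Tuple


class RuleStore:
    """Hypothetical three-tier mapping rule store."""

    def __init__(self, promote_after: int = 3) -> None:
        self.merchant_rules: Dict[Tuple[str, str], str] = {}
        self.category_rules: Dict[Tuple[str, str], str] = {}
        self.global_rules: Dict[str, str] = {}
        # (category, source_field, target) -> merchants who made that correction
        self._supporters: Dict[Tuple[str, str, str], Set[str]] = defaultdict(set)
        self.promote_after = promote_after  # assumed promotion threshold

    def record_correction(
        self, merchant_id: str, category: str, source_field: str, target: str
    ) -> None:
        # Every manual correction becomes a merchant-level rule.
        self.merchant_rules[(merchant_id, source_field)] = target
        key = (category, source_field, target)
        self._supporters[key].add(merchant_id)
        # Promote once enough distinct merchants in the category agree.
        if len(self._supporters[key]) >= self.promote_after:
            self.category_rules[(category, source_field)] = target

    def resolve(
        self, merchant_id: str, category: str, source_field: str
    ) -> Optional[str]:
        # Most specific tier wins: merchant, then category, then global.
        return (
            self.merchant_rules.get((merchant_id, source_field))
            or self.category_rules.get((category, source_field))
            or self.global_rules.get(source_field)
        )
```

With promote_after=2, two apparel merchants correcting "Size" to size is enough for a third apparel merchant to get that mapping automatically, while a single merchant's "Model No." correction stays private to that merchant.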

Human-in-the-Loop (HITL)

AI cleaning cannot be 100% accurate. ORBEXA provides an HITL review dashboard:

Low-confidence flagging: When the AI has low confidence in a mapping or cleaning result, it is automatically flagged for review
Manual review interface: Merchants or operators can view before/after comparisons of AI cleaning, and accept or correct results
Feedback loop: Manual corrections are fed back to the AI model to improve subsequent cleaning accuracy

HITL is not an optional feature. It is a core component of the Refinery Pipeline and gives data quality a human safety net.
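
Low-confidence flagging amounts to routing each cleaning result by a confidence threshold. The sketch below assumes a CleaningResult record and a threshold of 0.8; both are hypothetical, since the source does not state how confidence is represented or where the cutoff lies.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CleaningResult:
    product_id: str
    field: str
    before: str        # value before AI cleaning
    after: str         # value proposed by AI cleaning
    confidence: float  # assumed to be a score in [0, 1]


def split_for_review(
    results: List[CleaningResult], threshold: float = 0.8
) -> Tuple[List[CleaningResult], List[CleaningResult]]:
    """Return (auto_accepted, needs_review) based on confidence."""
    accepted: List[CleaningResult] = []
    review: List[CleaningResult] = []
    for result in results:
        if result.confidence >= threshold:
            accepted.append(result)
        else:
            review.append(result)
    return accepted, review


results = [
    CleaningResult("p1", "color", "navy-ish", "navy blue", 0.95),
    CleaningResult("p2", "material", "soft stuff", "cotton", 0.42),
]
auto_accepted, needs_review = split_for_review(results)
```

Keeping the before and after values on each record is what allows a review interface to show the side-by-side comparison and feed the reviewer's decision back as a correction.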

Product AI Service

Merchants can perform on-demand operations on individual products through the Product AI Service:

Regenerate description: Use AI to rewrite the product description
Re-clean: Re-run the complete Refinery Pipeline on an already-cleaned product
Attribute completion: Trigger visual extraction for products with missing attributes

Summary

The Refinery Pipeline ensures the quality of data entering protocol endpoints through four stages: field mapping, AI description optimization, visual attribute extraction, and AI-powered standardization. Three-tier flywheel learning continuously accumulates mapping knowledge, and HITL review provides a human quality safety net.

Next chapter: API Reference and Rate Limiting — Complete endpoint catalog, authentication methods, and rate limiting policies
