AgenticPay

Agents based on large language models are increasingly expected to autonomously handle negotiation and transactions. However, existing benchmarks predominantly focus on text-only, bilateral, zero-sum price haggling, failing to capture the complexity of real-world commerce. To address this gap, we introduce AgenticPay, a unified framework and benchmark for evaluating how well multimodal agents reach high-welfare, multi-dimensional agreements in realistic markets. Built around four core components (Environments, Tasks, Agents, and Metrics), AgenticPay comprises 160 multimodal tasks spanning 4 real-world business scenarios (E-commerce, Food Delivery, Ride-hailing, and Apartment Rental) and 8 market topologies, scaling from 1-to-1 bargaining to many-to-many (N-to-N) competitive markets.

Beyond price haggling, agents must read product images, infer each opponent's hidden preferences, and trade off multiple binding contract terms (e.g., price, lease duration, return policy, delivery speed). Agents communicate through multi-round natural-language dialogue in which each turn proposes or revises a full contract, and outcomes are scored by a utility-based framework that rewards agreements maximizing joint welfare. Evaluations on state-of-the-art proprietary (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro Preview) and open-weight (Qwen3-VL-32B-Instruct, InternVL3-38B) multimodal models reveal substantial gaps in non-zero-sum value creation and market reasoning. Even the strongest agent (Gemini 3.1 Pro Preview) reaches a GlobalScore of only 42.3/100. Performance also degrades consistently as markets scale from bilateral to multi-sided: the average GlobalScore drops by 5.7 points, and open-weight models suffer the largest declines (up to 11.5 points). These findings establish AgenticPay as a foundational testbed for multimodal agentic commerce.
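To illustrate why multi-dimensional contracts make negotiation non-zero-sum, here is a minimal sketch that scores a two-term contract (price and delivery time) under additive utilities. The utility forms, the numeric valuations, and the names `Contract`, `buyer_utility`, `seller_utility`, and `joint_welfare` are illustrative assumptions, not AgenticPay's actual scoring functions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """A full contract: price plus one non-price binding term (hypothetical)."""
    price: float
    delivery_days: int


def buyer_utility(c: Contract) -> float:
    # Assumed private valuation: the buyer values the item at 120
    # and loses 5 units of utility per day of waiting.
    return 120.0 - c.price - 5.0 * c.delivery_days


def seller_utility(c: Contract) -> float:
    # Assumed private cost: a base cost of 60, plus a rush cost of 2
    # for every day the delivery is faster than 7 days.
    rush_cost = 2.0 * max(0, 7 - c.delivery_days)
    return c.price - 60.0 - rush_cost


def joint_welfare(c: Contract) -> float:
    # Price cancels in the sum, so only the non-price term changes welfare.
    return buyer_utility(c) + seller_utility(c)


slow = Contract(price=70.0, delivery_days=7)
fast = Contract(price=82.0, delivery_days=2)

for name, c in [("slow", slow), ("fast", fast)]:
    print(name, buyer_utility(c), seller_utility(c), joint_welfare(c))
```

Under these assumed numbers, moving from `slow` to `fast` raises both parties' utilities at once, because the buyer values speed more than it costs the seller to provide. Pure price haggling cannot produce such a gain, since price only shifts surplus from one side to the other.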

Unified framework: Four core components (Environments, Tasks, Agents, and Metrics) are exposed through a Gymnasium-like API for reproducible evaluation and easy extension.
Multimodal product grounding: Tasks include product images, visual route context, listings, menus, and rich text attributes that agents must read and reason over.
Multi-dimensional binding contracts: Agents negotiate full contracts with price plus binding terms such as lease duration, return policy, delivery speed, and packaging, rather than a single scalar price.
Opponent modeling under hidden preferences: Buyers and sellers have private valuations and constraints, requiring explicit mental models of the opponent for non-zero-sum value creation.
Multi-turn natural-language dialogue: Each turn proposes or revises a complete contract through structured natural-language messages.
Utility-based metrics: GlobalScore, BuyerScore, and SellerScore evaluate feasibility, welfare, surplus split, and negotiation efficiency.
160 multimodal tasks across 4 scenarios and 8 topologies: E-commerce, Food Delivery, Ride-hailing, and Apartment Rental scenarios, scaling from 1-to-1 bargaining to many-to-many (N-to-N) markets.
Broad model support: Buyer and seller agents can be powered by text-only or vision-language models (e.g., Qwen3-VL) served through OpenAI-compatible APIs, vLLM, or SGLang.
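As a rough picture of what a Gymnasium-like negotiation loop could look like, the sketch below implements a toy 1-to-1 environment in which each turn either proposes a full contract or accepts the standing one. Only the `reset`/`step` return convention follows Gymnasium. Everything else (`ToyNegotiationEnv`, `Offer`, the action dictionary, and the scripted agents with their price limits) is hypothetical and does not reflect AgenticPay's real classes or signatures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Offer:
    """A complete contract proposal (hypothetical terms)."""
    price: float
    lease_months: int


class ToyNegotiationEnv:
    """Hypothetical bilateral environment with a Gymnasium-style interface."""

    def __init__(self, max_rounds: int = 10):
        self.max_rounds = max_rounds
        self.round = 0
        self.last_offer: Optional[Offer] = None

    def reset(self):
        self.round = 0
        self.last_offer = None
        return {"round": 0, "last_offer": None}, {}

    def step(self, action):
        self.round += 1
        # Accepting only ends the episode if there is a standing offer.
        terminated = bool(action["accept"]) and self.last_offer is not None
        if not terminated:
            self.last_offer = action["offer"]
        truncated = not terminated and self.round >= self.max_rounds
        obs = {"round": self.round, "last_offer": self.last_offer}
        info = {"agreement": self.last_offer if terminated else None}
        return obs, 0.0, terminated, truncated, info  # reward scoring omitted


def scripted_buyer(obs):
    # Assumed private limit: accept any standing offer priced at most 1500.
    last = obs["last_offer"]
    if last is not None and last.price <= 1500.0:
        return {"offer": None, "accept": True}
    return {"offer": Offer(1300.0 + 50.0 * obs["round"], 12), "accept": False}


def scripted_seller(obs):
    # Assumed private limit: accept any standing offer priced at least 1450.
    last = obs["last_offer"]
    if last is not None and last.price >= 1450.0:
        return {"offer": None, "accept": True}
    return {"offer": Offer(1700.0 - 50.0 * obs["round"], 12), "accept": False}


def run_episode(env, agents):
    obs, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        agent = agents[obs["round"] % len(agents)]  # alternate speakers
        obs, _, terminated, truncated, info = env.step(agent(obs))
    return info["agreement"], obs["round"]


agreement, rounds = run_episode(ToyNegotiationEnv(), [scripted_buyer, scripted_seller])
print(agreement, rounds)
```

In a real evaluation the scripted functions would be replaced by model-backed agents that read the task's images and text, and the reward slot would carry utility-based scores rather than a constant.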

BibTeX

@article{liu2026agenticpay,
  title={From Chats to Markets: AgenticPay for LLM-Powered Negotiation in Multi-Agent Commerce},
  author={Liu, Xianyang and Gu, Shangding and Song, Dawn},
  journal={arXiv preprint arXiv:2602.06008},
  year={2026}
}

