From Chats to Markets: AgenticPay for LLM-Powered Negotiation in Multi-Agent Commerce

Xianyang Liu¹, Shangding Gu¹, Dawn Song¹
¹University of California, Berkeley

Figure 1: AgenticPay framework overview.

Abstract

Agents based on large language models are increasingly expected to autonomously handle negotiation and transactions. However, existing benchmarks predominantly focus on text-only, bilateral, zero-sum price haggling, failing to capture the complexity of real-world commerce. To address this gap, we introduce AgenticPay, a unified framework and benchmark for evaluating how well multimodal agents reach high-welfare, multi-dimensional agreements in realistic markets. Built around four core components (Environments, Tasks, Agents, and Metrics), AgenticPay comprises 160 multimodal tasks spanning 4 real-world business scenarios (E-commerce, Food Delivery, Ride-hailing, and Apartment Rental) and 8 market topologies, scaling from 1-to-1 bargaining to many-to-many (N-to-N) competitive markets.

Beyond price haggling, agents must read product images, infer each opponent's hidden preferences, and trade off multiple binding contract terms (e.g., price, lease duration, return policy, delivery speed). Agents communicate through multi-round natural-language dialogue in which each turn proposes or revises a full contract, and outcomes are scored by a utility-based framework that rewards agreements maximizing joint welfare. Evaluations on state-of-the-art proprietary (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro Preview) and open-weight (Qwen3-VL-32B-Instruct, InternVL3-38B) multimodal models reveal substantial gaps in non-zero-sum value creation and market reasoning. Even the strongest agent (Gemini 3.1 Pro Preview) reaches a GlobalScore of only 42.3/100. Performance also degrades consistently as markets scale from bilateral to multi-sided: the average GlobalScore drops by 5.7 points, and open-weight models suffer the largest declines (up to 11.5 points). These findings establish AgenticPay as a foundational testbed for multimodal agentic commerce.

Benchmark Features

  • Unified framework with four core components: Environments, Tasks, Agents, and Metrics, exposed through a Gymnasium-like API for reproducible evaluation and easy extension.
  • Multimodal product grounding: Tasks include product images, visual route context, listings, menus, and rich text attributes that agents must read and reason over.
  • Multi-dimensional binding contracts: Agents negotiate full contracts with price plus binding terms such as lease duration, return policy, delivery speed, and packaging, rather than a single scalar price.
  • Opponent modeling under hidden preferences: Buyers and sellers have private valuations and constraints, requiring explicit mental models of the opponent for non-zero-sum value creation.
  • Multi-turn natural-language dialogue: Each turn proposes or revises a complete contract through structured natural-language messages.
  • Utility-based metrics: GlobalScore, BuyerScore, and SellerScore evaluate feasibility, welfare, surplus split, and negotiation efficiency.
  • 160 multimodal tasks across 4 scenarios & 8 topologies: E-commerce, Food Delivery, Ride-hailing, and Apartment Rental scenarios, scaling from 1-to-1 bargaining to many-to-many (N-to-N) markets.
  • Broad model support: Buyer and seller agents can be powered by text and vision-language models such as Qwen3-VL, served through OpenAI-compatible APIs, vLLM, or SGLang.
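
As a rough illustration of what a Gymnasium-like negotiation loop could look like, the Python sketch below implements a toy alternating-offers environment with the familiar reset()/step() shape. It is not AgenticPay's actual API: ToyNegotiationEnv, Contract, the action dictionaries, and the scripted buyer_policy and seller_policy are hypothetical names chosen for this example, and the LLM agents are replaced by fixed concession rules so the code runs without any model backend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    """A full contract: price plus binding terms (hypothetical schema)."""
    price: int
    terms: tuple = ()  # e.g. (("delivery_speed", "standard"),)


class ToyNegotiationEnv:
    """Hypothetical environment mirroring the Gymnasium reset()/step() shape."""

    def __init__(self, max_rounds=6):
        self.max_rounds = max_rounds

    def reset(self, seed=None):
        self.round = 0
        self.speaker = "buyer"   # the buyer opens the dialogue
        self.last_offer = None
        return self._obs(), {}

    def _obs(self):
        return {"round": self.round, "speaker": self.speaker,
                "last_offer": self.last_offer}

    def step(self, action):
        terminated, info = False, {}
        if action["type"] == "accept" and self.last_offer is not None:
            terminated = True                  # deal closes on the standing offer
            info["deal"] = self.last_offer
        elif action["type"] == "propose":
            self.last_offer = action["contract"]  # each turn carries a full contract
        self.round += 1
        self.speaker = "seller" if self.speaker == "buyer" else "buyer"
        truncated = (not terminated) and self.round >= self.max_rounds
        return self._obs(), 0.0, terminated, truncated, info


TERMS = (("delivery_speed", "standard"),)


def buyer_policy(obs):
    """Scripted stand-in for an LLM buyer: accept at 100 or less, else raise the bid."""
    offer = obs["last_offer"]
    if offer is not None and offer.price <= 100:
        return {"type": "accept"}
    return {"type": "propose",
            "contract": Contract(90 + 5 * (obs["round"] // 2), TERMS)}


def seller_policy(obs):
    """Scripted stand-in for an LLM seller: accept at 100 or more, else lower the ask."""
    offer = obs["last_offer"]
    if offer is not None and offer.price >= 100:
        return {"type": "accept"}
    return {"type": "propose",
            "contract": Contract(110 - 5 * (obs["round"] // 2), TERMS)}


def run_episode(env):
    """Alternate turns until a deal is accepted or the round budget runs out."""
    policies = {"buyer": buyer_policy, "seller": seller_policy}
    obs, info = env.reset()
    terminated = truncated = False
    while not (terminated or truncated):
        action = policies[obs["speaker"]](obs)
        obs, _, terminated, truncated, info = env.step(action)
    return info.get("deal"), obs["round"]
```

With the default six-round budget, the scripted agents converge on a price of 100 in the final round; with a tighter budget such as ToyNegotiationEnv(max_rounds=3), the episode is truncated without a deal.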

Scenarios and Tasks

AgenticPay covers 4 real-world business scenarios (E-commerce, Food Delivery, Ride-hailing, and Apartment Rental) instantiated across 8 market topologies. These range from 1-to-1 bilateral bargaining, through 1-to-N competitive markets in which one buyer or seller faces multiple counterparties, to N-to-N matching markets in which many buyers and sellers interact in parallel.
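
To make the topology families concrete, the short Python sketch below labels a market by its buyer and seller counts. The helper names topology_label and pairwise_channels are hypothetical, the buyers-to-sellers ordering of the label is an assumption, and the channel count assumes every buyer may talk to every seller. The 8 specific topologies in the benchmark are not enumerated here, so none are listed.

```python
def topology_label(num_buyers: int, num_sellers: int) -> str:
    """Return a 'buyers-to-sellers' label such as '1-to-1' or 'N-to-N'."""
    if num_buyers < 1 or num_sellers < 1:
        raise ValueError("a market needs at least one buyer and one seller")
    buyers = "1" if num_buyers == 1 else "N"
    sellers = "1" if num_sellers == 1 else "N"
    return f"{buyers}-to-{sellers}"


def pairwise_channels(num_buyers: int, num_sellers: int) -> int:
    """Number of buyer-seller dialogues if every pair can negotiate (an assumption)."""
    return num_buyers * num_sellers
```

Under these assumptions, a market with 3 buyers and 3 sellers is labeled N-to-N and has 9 possible negotiation channels, compared with a single channel in the bilateral case, which gives a sense of how quickly the reasoning load can grow.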

Figure 2: Scenario and task categories.

Evaluation

AgenticPay scores negotiations with a utility-based framework that rewards agreements maximizing joint welfare, reporting feasibility, buyer and seller utility, surplus split, and negotiation efficiency. We evaluate state-of-the-art proprietary models (GPT-5.4, Claude Sonnet 4.6, Gemini 3.1 Pro Preview) and open-weight multimodal models (Qwen3-VL-32B-Instruct, InternVL3-38B).
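
The idea of rewarding joint welfare rather than price alone can be sketched with a simple quasi-linear utility model. The Python example below is an assumption made for illustration and is not the formula behind GlobalScore, BuyerScore, or SellerScore: the function score_contract, its parameters, and the example valuations are all hypothetical.

```python
def score_contract(price, terms, base_value, base_cost, buyer_values, seller_costs):
    """Score one agreed contract under an assumed quasi-linear utility model.

    buyer_values / seller_costs map a term name to the private value the buyer
    places on it and the private cost the seller incurs to provide it.
    """
    buyer_u = base_value + sum(buyer_values.get(t, 0) for t in terms) - price
    seller_u = price - base_cost - sum(seller_costs.get(t, 0) for t in terms)
    welfare = buyer_u + seller_u             # price cancels: welfare depends on terms only
    feasible = buyer_u >= 0 and seller_u >= 0  # neither side is worse off than no deal
    buyer_share = buyer_u / welfare if feasible and welfare > 0 else None
    return {"buyer_utility": buyer_u, "seller_utility": seller_u,
            "welfare": welfare, "feasible": feasible, "buyer_share": buyer_share}


# Hypothetical private preferences for a single item.
prefs = dict(base_value=100, base_cost=60,
             buyer_values={"express": 15}, seller_costs={"express": 5})

plain = score_contract(80, [], **prefs)           # price-only haggling
rich = score_contract(85, ["express"], **prefs)   # adds a term the buyer values highly
```

In this toy setting the price-only deal yields a joint welfare of 40, while adding express delivery (worth 15 to the buyer, costing the seller 5) raises it to 50 without lowering the seller's utility. Price only moves surplus between the two sides; the binding terms are what create it, which is why discovering the opponent's hidden preferences matters.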

Results reveal substantial gaps in non-zero-sum value creation and market reasoning: even the strongest agent, Gemini 3.1 Pro Preview, reaches a GlobalScore of only 42.3/100. Performance also consistently degrades as markets scale from bilateral to multi-sided, with the average GlobalScore dropping by 5.7 points, and open-weight models suffering the largest declines (up to 11.5 points).

BibTeX

@article{liu2026agenticpay,
  title={From Chats to Markets: AgenticPay for LLM-Powered Negotiation in Multi-Agent Commerce},
  author={Liu, Xianyang and Gu, Shangding and Song, Dawn},
  journal={arXiv preprint arXiv:2602.06008},
  year={2026}
}