Polymarket Data

Complete Data Infrastructure for Polymarket — Fetch, Process, Analyze

A comprehensive toolkit and dataset for Polymarket prediction markets. Fetch trading data directly from Polygon blockchain and Gamma API, process into multiple analysis-ready formats, and analyze with ease.

Zhengjie Wang^1,2, Leiyu Chao^1,3, Yu Bao^1,4, Lian Cheng^1,3, Jianhan Liao^1,5, Yikang Li^1,†

¹Shanghai Innovation Institute ²Westlake University ³Shanghai Jiao Tong University
⁴Harbin Institute of Technology ⁵Fudan University

^†Corresponding author

TL;DR

We provide 107GB of trading data from Polymarket containing 1.1 billion records across 268K+ markets, along with a complete toolkit to fetch, process, and analyze the data. Perfect for market research, behavioral studies, and quantitative analysis.

Get all historical data before 2026: Download the complete dataset from HuggingFace, or use this toolkit to fetch the latest data yourself.

Highlights

Complete Data: 1.1 billion trading records from Polymarket's inception to present
Direct Data Access: Fetch data directly from Polygon blockchain, no third-party dependencies
Multiple Formats: 5 analysis-ready datasets for different research needs
Real-time Updates: Continuous mode to sync new data every 2 seconds
Resume Support: Auto-save progress, restart anytime without data loss
Efficient Storage: Parquet format with compression, supports incremental writes

vs Third-party Data Sources

Field	Polymarket Data	Third-party
block_number	Yes	No
contract name	Yes	No
maker_fee / taker_fee / protocol_fee	Yes	No
order_hash	Yes	No
market_id (auto-linked)	Yes	Yes
Missing token auto-fill	Yes	Yes

Dataset Overview

File	Size	Records	Description
`orderfilled.parquet`	31GB	293.3M	Raw blockchain events from OrderFilled logs
`trades.parquet`	32GB	293.3M	Processed trades with market metadata linkage
`markets.parquet`	68MB	268,706	Market information and metadata
`quant.parquet`	21GB	170.3M	Clean market data with unified YES perspective
`users.parquet`	23GB	340.6M	User behavior data split by maker/taker roles

Total: 107GB, 1.1 billion records

Download from HuggingFace: SII-WANGZJ/Polymarket_data

Use Cases

Market Research & Analysis

Study prediction market dynamics and price discovery mechanisms
Analyze market efficiency and information aggregation
Research crowd wisdom and forecasting accuracy

Behavioral Studies

Track individual user trading patterns and decision-making
Study market participant behavior under different conditions
Analyze risk preferences and trading strategies

Data Science & Machine Learning

Train models for price prediction and market forecasting
Feature engineering for time-series analysis
Develop algorithms for market analysis

Academic Research

Economics and finance research on prediction markets
Social science studies on collective intelligence
Computer science research on blockchain data analysis

Quick Start

Installation

# Clone repository
git clone https://github.com/SII-WANGZJ/Polymarket_data.git
cd Polymarket_data

# Install dependencies
pip install -r requirements.txt

# Or install as package
pip install -e .

Download Dataset

# Install HuggingFace CLI
pip install huggingface_hub

# Download specific file
hf download SII-WANGZJ/Polymarket_data quant.parquet --repo-type dataset

# Download all files
hf download SII-WANGZJ/Polymarket_data --repo-type dataset

Usage

1. Continuous Real-time Mode (Recommended)

Automatically fetch new blocks and keep running 24/7:

# Start continuous fetching
./scripts/continuous_start.sh

# View logs
tail -f logs/continuous_fetch.log

# Stop gracefully
./scripts/continuous_stop.sh

Features:

Batch mode: When behind by ≥100 blocks, fetch 100 blocks at once
Real-time mode: When caught up, fetch 1 block every 2 seconds
Auto data cleaning: Generate 4 parquet files in real-time
Graceful shutdown: Ensures all files are properly closed on exit

2. Batch Historical Data

Fetch specific range of historical blocks:

# Fetch last 10,000 blocks
python -m polymarket.cli fetch-onchain --blocks 10000

# Resume from last checkpoint
python -m polymarket.cli fetch-onchain --continue

# Fetch specific block range
python -m polymarket.cli fetch-onchain --start 80000000 --end 80010000

3. Full Pipeline

Complete workflow: fetch markets → fetch on-chain → process data:

# Run full pipeline
./scripts/update_all.sh

# Or step by step
./scripts/fetch_markets.sh        # Fetch market metadata
./scripts/fetch_onchain.sh 5000   # Fetch on-chain data
./scripts/clean_data.sh           # Clean and process data

4. Python API

Use as a library in your Python code:

from polymarket import LogFetcher, EventDecoder, extract_trades
from polymarket import load_token_mapping

# 1. Fetch on-chain logs
fetcher = LogFetcher()
logs = fetcher.fetch_range_in_batches(start_block, end_block)

# 2. Decode events
decoder = EventDecoder()
decoded = decoder.decode_batch(logs)
events = decoder.format_batch(decoded)

# 3. Load token mapping and extract trades
token_mapping = load_token_mapping()
trades_df = extract_trades(events, token_mapping)

# 4. Save to parquet
trades_df.to_parquet('trades.parquet')

Project Structure

Polymarket_data/
├── polymarket/              # Core Python package
│   ├── cli/                 # Command-line interface
│   ├── fetchers/            # Data fetchers (RPC, Gamma API)
│   ├── processors/          # Data processors (decoder, cleaner)
│   └── tools/               # Utility tools (merge, sort, etc.)
├── scripts/                 # Shell scripts for common tasks
├── polymarket_data/         # Dataset documentation
├── data/                    # Data storage (gitignored)
├── logs/                    # Logs (gitignored)
├── README.md
├── LICENSE
└── requirements.txt

Data Schema

OrderFilled Events (Raw)

Field	Description
timestamp	Unix timestamp
block_number	Block number
transaction_hash	Transaction hash
contract	Contract name (CTF_EXCHANGE or NEGRISK_CTF_EXCHANGE)
maker / taker	Trading parties' addresses
maker_asset_id / taker_asset_id	Asset IDs
maker_amount_filled / taker_amount_filled	Filled amounts
maker_fee / taker_fee / protocol_fee	Fees (in wei)
order_hash	Order hash

Trades (Processed)

Field	Description
market_id	Market ID (auto-linked from token)
answer	Option name (YES/NO/etc.)
price	Trade price (0-1)
usd_amount / token_amount	USDC and token amounts
maker_direction / taker_direction	Buy/sell direction

quant.parquet - Clean Market Data

Filtered and normalized trade data with unified token perspective (YES token).

Key Features:

Unified perspective: All trades normalized to YES token (token1)
Clean data: Contract trades filtered out, only real user trades
Complete information: Maker/taker roles preserved
Best for: Market analysis, price studies, time-series forecasting

Schema:

{
    'transaction_hash': str,      # Blockchain transaction hash
    'block_number': int,          # Block number
    'datetime': datetime,         # Transaction timestamp
    'market_id': str,             # Market identifier
    'maker': str,                 # Maker wallet address
    'taker': str,                 # Taker wallet address
    'token_amount': float,        # Amount of tokens traded
    'usd_amount': float,          # USD value
    'price': float,               # Trade price (0-1)
}

users.parquet - User Behavior Data

Split maker/taker records with unified buy direction for user analysis.

Key Features:

Split records: Each trade becomes 2 records (one maker, one taker)
Unified direction: All converted to BUY (negative amounts = selling)
User sorted: Ordered by user for trajectory analysis
Best for: User profiling, PnL calculation, wallet analysis

Schema:

{
    'transaction_hash': str,      # Transaction hash
    'block_number': int,          # Block number
    'datetime': datetime,         # Timestamp
    'market_id': str,             # Market identifier
    'user': str,                  # User wallet address
    'role': str,                  # 'maker' or 'taker'
    'token_amount': float,        # Signed amount (+ buy, - sell)
    'usd_amount': float,          # USD value
    'price': float,               # Trade price
}

markets.parquet - Market Metadata

Market information and outcome token details.

Best for: Linking trades to market context, filtering by market attributes

See DATA_DESCRIPTION.md for complete schema documentation.

Data Processing Pipeline

Polygon Blockchain (RPC)    Gamma API
         ↓                      ↓
  orderfilled.parquet    markets.parquet
         ↓
  trades.parquet (+ Market linkage)
         ↓
         ├─→ quant.parquet (Unified YES perspective)
         │   └─→ Filter contracts + Normalize tokens
         │
         └─→ users.parquet (Split maker/taker)
             └─→ Split records + Unified BUY direction

Key Transformations:

quant.parquet:
- Filter out contract trades (keep only user trades)
- Normalize all trades to YES token perspective
- Preserve maker/taker information
- Result: 170.3M records (from 293.3M)
users.parquet:
- Split each trade into 2 records (maker + taker)
- Convert all to BUY direction (signed amounts)
- Sort by user for easy querying
- Result: 340.6M records (from 293.3M × 2, some filtered)

Example Analysis

1. Calculate Market Statistics

import pandas as pd

df = pd.read_parquet('quant.parquet')

# Market-level statistics
market_stats = df.groupby('market_id').agg({
    'usd_amount': ['sum', 'mean'],     # Total volume and average trade size
    'price': ['mean', 'std', 'min', 'max'],  # Price statistics
    'transaction_hash': 'count'         # Number of trades
}).round(4)

print(market_stats.head())

2. Track Price Evolution

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet('quant.parquet')
df['datetime'] = pd.to_datetime(df['datetime'])

# Select a specific market
market_id = 'your-market-id'
market_data = df[df['market_id'] == market_id].sort_values('datetime')

# Plot price over time
plt.figure(figsize=(12, 6))
plt.plot(market_data['datetime'], market_data['price'])
plt.title(f'Price Evolution - Market {market_id}')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

3. Analyze User Behavior

import pandas as pd

df = pd.read_parquet('users.parquet')

# Calculate net position per user per market
user_positions = df.groupby(['user', 'market_id']).agg({
    'token_amount': 'sum',          # Net position (positive = long, negative = short)
    'usd_amount': 'sum',            # Total USD traded
    'transaction_hash': 'count'     # Number of trades
}).reset_index()

# Find most active users
active_users = user_positions.groupby('user').agg({
    'market_id': 'count',           # Number of markets traded
    'usd_amount': 'sum'             # Total volume
}).sort_values('usd_amount', ascending=False)

print(active_users.head(10))

4. Market Volume Analysis

import pandas as pd

df = pd.read_parquet('quant.parquet')
markets = pd.read_parquet('markets.parquet')

# Join with market metadata
df = df.merge(markets[['market_id', 'question']], on='market_id', how='left')

# Top markets by volume
top_markets = df.groupby(['market_id', 'question']).agg({
    'usd_amount': 'sum'
}).sort_values('usd_amount', ascending=False).head(20)

print(top_markets)

Data Quality

Complete History: No missing blocks or gaps in blockchain data
Verified Sources: All OrderFilled events from 2 official exchange contracts
Blockchain Verified: Cross-checked against Polygon RPC nodes
Regular Updates: Automated daily pipeline for fresh data
Open Source: Fully reproducible collection process

Contracts Tracked:

Exchange Contract 1: 0x4bFb41d5B3570DeFd03C39a9A4D8dE6Bd8B8982E
Exchange Contract 2: 0xC5d563A36AE78145C45a50134d48A1215220f80a

CLI Commands

# Fetch market metadata
python -m polymarket.cli fetch-markets

# Fetch on-chain data
python -m polymarket.cli fetch-onchain --blocks 1000
python -m polymarket.cli fetch-onchain --continue

# Process data
python -m polymarket.cli process
python -m polymarket.cli clean

# Full update
python -m polymarket.cli update

Utility Tools

# Merge multiple parquet files
python -m polymarket.tools.merge_parquet file1.parquet file2.parquet -o merged.parquet

# Sort parquet by timestamp
python -m polymarket.tools.sort_parquet input.parquet -o sorted.parquet

# Refetch failed blocks
python -m polymarket.tools.refetch_failed_blocks --start 80000000 --end 80100000

Configuration

Environment Variables

# Optional: Alchemy API key for faster RPC access
export ALCHEMY_API_KEY=your_key_here

Custom RPC Endpoint

Edit polymarket/config.py:

RPC_ENDPOINTS = [
    "https://polygon-rpc.com",
    "your_custom_endpoint",
]

Contributing

We welcome contributions to improve the dataset and tools:

Report Issues: Found bugs or data quality issues? Open an issue
Suggest Features: Ideas for new features? Let us know!
Contribute Code: Improve our pipeline via pull requests

License

MIT License - Free for commercial and research use.

See LICENSE file for details.

Contact & Support

Email: wangzhengjie@sii.edu.cn
Issues: GitHub Issues
Dataset: HuggingFace

Citation

If you use this dataset or toolkit in your research, please cite:

@misc{polymarket_data_2026,
  title={Polymarket Data: Complete Data Infrastructure for Polymarket},
  author={Wang, Zhengjie and Chao, Leiyu and Bao, Yu and Cheng, Lian and Liao, Jianhan and Li, Yikang},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/SII-WANGZJ/Polymarket_data}},
  note={A comprehensive dataset and toolkit for Polymarket prediction markets}
}

Acknowledgments

Polymarket for building the leading prediction market platform
Polygon for providing reliable blockchain infrastructure
HuggingFace for hosting and distributing large datasets
The open-source community for tools and libraries

Disclaimer

This tool is for research and educational purposes. Users are responsible for complying with Polymarket's terms of service and applicable regulations.

Built for the research and data science community

HuggingFace • GitHub • Documentation

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
data		data
logs		logs
polymarket		polymarket
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

Polymarket Data

Complete Data Infrastructure for Polymarket — Fetch, Process, Analyze

TL;DR

Highlights

vs Third-party Data Sources

Dataset Overview

Use Cases

Market Research & Analysis

Behavioral Studies

Data Science & Machine Learning

Academic Research

Quick Start

Installation

Download Dataset

Usage

1. Continuous Real-time Mode (Recommended)

2. Batch Historical Data

3. Full Pipeline

4. Python API

Project Structure

Data Schema

OrderFilled Events (Raw)

Trades (Processed)

quant.parquet - Clean Market Data

users.parquet - User Behavior Data

markets.parquet - Market Metadata

Data Processing Pipeline

Example Analysis

1. Calculate Market Statistics

2. Track Price Evolution

3. Analyze User Behavior

4. Market Volume Analysis

Data Quality

CLI Commands

Utility Tools

Configuration

Environment Variables

Custom RPC Endpoint

Contributing

License

Contact & Support

Citation

Acknowledgments

Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages