Skip to content

AiDreamWorld/Polymarket_data

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Polymarket Data

Complete Data Infrastructure for Polymarket — Fetch, Process, Analyze

A comprehensive toolkit and dataset for Polymarket prediction markets. Fetch trading data directly from Polygon blockchain and Gamma API, process into multiple analysis-ready formats, and analyze with ease.

Zhengjie Wang1,2, Leiyu Chao1,3, Yu Bao1,4, Lian Cheng1,3, Jianhan Liao1,5, Yikang Li1,†

1Shanghai Innovation Institute    2Westlake University    3Shanghai Jiao Tong University
4Harbin Institute of Technology    5Fudan University

Corresponding author

HuggingFace Dataset GitHub Repository License Python 3.12+


TL;DR

We provide 107GB of trading data from Polymarket containing 1.1 billion records across 268K+ markets, along with a complete toolkit to fetch, process, and analyze the data. Perfect for market research, behavioral studies, and quantitative analysis.

Get all historical data before 2026: Download the complete dataset from HuggingFace, or use this toolkit to fetch the latest data yourself.

Highlights

  • Complete Data: 1.1 billion trading records from Polymarket's inception to present
  • Direct Data Access: Fetch data directly from Polygon blockchain, no third-party dependencies
  • Multiple Formats: 5 analysis-ready datasets for different research needs
  • Real-time Updates: Continuous mode to sync new data every 2 seconds
  • Resume Support: Auto-save progress, restart anytime without data loss
  • Efficient Storage: Parquet format with compression, supports incremental writes

vs Third-party Data Sources

Field Polymarket Data Third-party
block_number Yes No
contract name Yes No
maker_fee / taker_fee / protocol_fee Yes No
order_hash Yes No
market_id (auto-linked) Yes Yes
Missing token auto-fill Yes Yes

Dataset Overview

File Size Records Description
orderfilled.parquet 31GB 293.3M Raw blockchain events from OrderFilled logs
trades.parquet 32GB 293.3M Processed trades with market metadata linkage
markets.parquet 68MB 268,706 Market information and metadata
quant.parquet 21GB 170.3M Clean market data with unified YES perspective
users.parquet 23GB 340.6M User behavior data split by maker/taker roles

Total: 107GB, 1.1 billion records

Download from HuggingFace: SII-WANGZJ/Polymarket_data

Use Cases

Market Research & Analysis

  • Study prediction market dynamics and price discovery mechanisms
  • Analyze market efficiency and information aggregation
  • Research crowd wisdom and forecasting accuracy

Behavioral Studies

  • Track individual user trading patterns and decision-making
  • Study market participant behavior under different conditions
  • Analyze risk preferences and trading strategies

Data Science & Machine Learning

  • Train models for price prediction and market forecasting
  • Feature engineering for time-series analysis
  • Develop algorithms for market analysis

Academic Research

  • Economics and finance research on prediction markets
  • Social science studies on collective intelligence
  • Computer science research on blockchain data analysis

Quick Start

Installation

# Clone repository
git clone https://github.com/SII-WANGZJ/Polymarket_data.git
cd Polymarket_data

# Install dependencies
pip install -r requirements.txt

# Or install as package
pip install -e .

Download Dataset

# Install HuggingFace CLI
pip install huggingface_hub

# Download specific file
hf download SII-WANGZJ/Polymarket_data quant.parquet --repo-type dataset

# Download all files
hf download SII-WANGZJ/Polymarket_data --repo-type dataset

Usage

1. Continuous Real-time Mode (Recommended)

Automatically fetch new blocks and keep running 24/7:

# Start continuous fetching
./scripts/continuous_start.sh

# View logs
tail -f logs/continuous_fetch.log

# Stop gracefully
./scripts/continuous_stop.sh

Features:

  • Batch mode: When behind by ≥100 blocks, fetch 100 blocks at once
  • Real-time mode: When caught up, fetch 1 block every 2 seconds
  • Auto data cleaning: Generate 4 parquet files in real-time
  • Graceful shutdown: Ensures all files are properly closed on exit

2. Batch Historical Data

Fetch specific range of historical blocks:

# Fetch last 10,000 blocks
python -m polymarket.cli fetch-onchain --blocks 10000

# Resume from last checkpoint
python -m polymarket.cli fetch-onchain --continue

# Fetch specific block range
python -m polymarket.cli fetch-onchain --start 80000000 --end 80010000

3. Full Pipeline

Complete workflow: fetch markets → fetch on-chain → process data:

# Run full pipeline
./scripts/update_all.sh

# Or step by step
./scripts/fetch_markets.sh        # Fetch market metadata
./scripts/fetch_onchain.sh 5000   # Fetch on-chain data
./scripts/clean_data.sh           # Clean and process data

4. Python API

Use as a library in your Python code:

from polymarket import LogFetcher, EventDecoder, extract_trades
from polymarket import load_token_mapping

# 1. Fetch on-chain logs
fetcher = LogFetcher()
logs = fetcher.fetch_range_in_batches(start_block, end_block)

# 2. Decode events
decoder = EventDecoder()
decoded = decoder.decode_batch(logs)
events = decoder.format_batch(decoded)

# 3. Load token mapping and extract trades
token_mapping = load_token_mapping()
trades_df = extract_trades(events, token_mapping)

# 4. Save to parquet
trades_df.to_parquet('trades.parquet')

Project Structure

Polymarket_data/
├── polymarket/              # Core Python package
│   ├── cli/                 # Command-line interface
│   ├── fetchers/            # Data fetchers (RPC, Gamma API)
│   ├── processors/          # Data processors (decoder, cleaner)
│   └── tools/               # Utility tools (merge, sort, etc.)
├── scripts/                 # Shell scripts for common tasks
├── polymarket_data/         # Dataset documentation
├── data/                    # Data storage (gitignored)
├── logs/                    # Logs (gitignored)
├── README.md
├── LICENSE
└── requirements.txt

Data Schema

OrderFilled Events (Raw)

Field Description
timestamp Unix timestamp
block_number Block number
transaction_hash Transaction hash
contract Contract name (CTF_EXCHANGE or NEGRISK_CTF_EXCHANGE)
maker / taker Trading parties' addresses
maker_asset_id / taker_asset_id Asset IDs
maker_amount_filled / taker_amount_filled Filled amounts
maker_fee / taker_fee / protocol_fee Fees (in wei)
order_hash Order hash

Trades (Processed)

Field Description
market_id Market ID (auto-linked from token)
answer Option name (YES/NO/etc.)
price Trade price (0-1)
usd_amount / token_amount USDC and token amounts
maker_direction / taker_direction Buy/sell direction

quant.parquet - Clean Market Data

Filtered and normalized trade data with unified token perspective (YES token).

Key Features:

  • Unified perspective: All trades normalized to YES token (token1)
  • Clean data: Contract trades filtered out, only real user trades
  • Complete information: Maker/taker roles preserved
  • Best for: Market analysis, price studies, time-series forecasting

Schema:

{
    'transaction_hash': str,      # Blockchain transaction hash
    'block_number': int,          # Block number
    'datetime': datetime,         # Transaction timestamp
    'market_id': str,             # Market identifier
    'maker': str,                 # Maker wallet address
    'taker': str,                 # Taker wallet address
    'token_amount': float,        # Amount of tokens traded
    'usd_amount': float,          # USD value
    'price': float,               # Trade price (0-1)
}

users.parquet - User Behavior Data

Split maker/taker records with unified buy direction for user analysis.

Key Features:

  • Split records: Each trade becomes 2 records (one maker, one taker)
  • Unified direction: All converted to BUY (negative amounts = selling)
  • User sorted: Ordered by user for trajectory analysis
  • Best for: User profiling, PnL calculation, wallet analysis

Schema:

{
    'transaction_hash': str,      # Transaction hash
    'block_number': int,          # Block number
    'datetime': datetime,         # Timestamp
    'market_id': str,             # Market identifier
    'user': str,                  # User wallet address
    'role': str,                  # 'maker' or 'taker'
    'token_amount': float,        # Signed amount (+ buy, - sell)
    'usd_amount': float,          # USD value
    'price': float,               # Trade price
}

markets.parquet - Market Metadata

Market information and outcome token details.

Best for: Linking trades to market context, filtering by market attributes

See DATA_DESCRIPTION.md for complete schema documentation.

Data Processing Pipeline

Polygon Blockchain (RPC)    Gamma API
         ↓                      ↓
  orderfilled.parquet    markets.parquet
         ↓
  trades.parquet (+ Market linkage)
         ↓
         ├─→ quant.parquet (Unified YES perspective)
         │   └─→ Filter contracts + Normalize tokens
         │
         └─→ users.parquet (Split maker/taker)
             └─→ Split records + Unified BUY direction

Key Transformations:

  1. quant.parquet:

    • Filter out contract trades (keep only user trades)
    • Normalize all trades to YES token perspective
    • Preserve maker/taker information
    • Result: 170.3M records (from 293.3M)
  2. users.parquet:

    • Split each trade into 2 records (maker + taker)
    • Convert all to BUY direction (signed amounts)
    • Sort by user for easy querying
    • Result: 340.6M records (from 293.3M × 2, some filtered)

Example Analysis

1. Calculate Market Statistics

import pandas as pd

df = pd.read_parquet('quant.parquet')

# Market-level statistics
market_stats = df.groupby('market_id').agg({
    'usd_amount': ['sum', 'mean'],     # Total volume and average trade size
    'price': ['mean', 'std', 'min', 'max'],  # Price statistics
    'transaction_hash': 'count'         # Number of trades
}).round(4)

print(market_stats.head())

2. Track Price Evolution

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_parquet('quant.parquet')
df['datetime'] = pd.to_datetime(df['datetime'])

# Select a specific market
market_id = 'your-market-id'
market_data = df[df['market_id'] == market_id].sort_values('datetime')

# Plot price over time
plt.figure(figsize=(12, 6))
plt.plot(market_data['datetime'], market_data['price'])
plt.title(f'Price Evolution - Market {market_id}')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

3. Analyze User Behavior

import pandas as pd

df = pd.read_parquet('users.parquet')

# Calculate net position per user per market
user_positions = df.groupby(['user', 'market_id']).agg({
    'token_amount': 'sum',          # Net position (positive = long, negative = short)
    'usd_amount': 'sum',            # Total USD traded
    'transaction_hash': 'count'     # Number of trades
}).reset_index()

# Find most active users
active_users = user_positions.groupby('user').agg({
    'market_id': 'count',           # Number of markets traded
    'usd_amount': 'sum'             # Total volume
}).sort_values('usd_amount', ascending=False)

print(active_users.head(10))

4. Market Volume Analysis

import pandas as pd

df = pd.read_parquet('quant.parquet')
markets = pd.read_parquet('markets.parquet')

# Join with market metadata
df = df.merge(markets[['market_id', 'question']], on='market_id', how='left')

# Top markets by volume
top_markets = df.groupby(['market_id', 'question']).agg({
    'usd_amount': 'sum'
}).sort_values('usd_amount', ascending=False).head(20)

print(top_markets)

Data Quality

  • Complete History: No missing blocks or gaps in blockchain data
  • Verified Sources: All OrderFilled events from 2 official exchange contracts
  • Blockchain Verified: Cross-checked against Polygon RPC nodes
  • Regular Updates: Automated daily pipeline for fresh data
  • Open Source: Fully reproducible collection process

Contracts Tracked:

  • Exchange Contract 1: 0x4bFb41d5B3570DeFd03C39a9A4D8dE6Bd8B8982E
  • Exchange Contract 2: 0xC5d563A36AE78145C45a50134d48A1215220f80a

CLI Commands

# Fetch market metadata
python -m polymarket.cli fetch-markets

# Fetch on-chain data
python -m polymarket.cli fetch-onchain --blocks 1000
python -m polymarket.cli fetch-onchain --continue

# Process data
python -m polymarket.cli process
python -m polymarket.cli clean

# Full update
python -m polymarket.cli update

Utility Tools

# Merge multiple parquet files
python -m polymarket.tools.merge_parquet file1.parquet file2.parquet -o merged.parquet

# Sort parquet by timestamp
python -m polymarket.tools.sort_parquet input.parquet -o sorted.parquet

# Refetch failed blocks
python -m polymarket.tools.refetch_failed_blocks --start 80000000 --end 80100000

Configuration

Environment Variables

# Optional: Alchemy API key for faster RPC access
export ALCHEMY_API_KEY=your_key_here

Custom RPC Endpoint

Edit polymarket/config.py:

RPC_ENDPOINTS = [
    "https://polygon-rpc.com",
    "your_custom_endpoint",
]

Contributing

We welcome contributions to improve the dataset and tools:

  1. Report Issues: Found bugs or data quality issues? Open an issue
  2. Suggest Features: Ideas for new features? Let us know!
  3. Contribute Code: Improve our pipeline via pull requests

License

MIT License - Free for commercial and research use.

See LICENSE file for details.

Contact & Support

Citation

If you use this dataset or toolkit in your research, please cite:

@misc{polymarket_data_2026,
  title={Polymarket Data: Complete Data Infrastructure for Polymarket},
  author={Wang, Zhengjie and Chao, Leiyu and Bao, Yu and Cheng, Lian and Liao, Jianhan and Li, Yikang},
  year={2026},
  howpublished={\url{https://huggingface.co/datasets/SII-WANGZJ/Polymarket_data}},
  note={A comprehensive dataset and toolkit for Polymarket prediction markets}
}

Acknowledgments

  • Polymarket for building the leading prediction market platform
  • Polygon for providing reliable blockchain infrastructure
  • HuggingFace for hosting and distributing large datasets
  • The open-source community for tools and libraries

Disclaimer

This tool is for research and educational purposes. Users are responsible for complying with Polymarket's terms of service and applicable regulations.


Built for the research and data science community

HuggingFaceGitHubDocumentation

About

A comprehensive dataset of 1.1 billion trading records from Polymarket, processed into multiple analysis-ready formats. Features cleaned data, unified token perspectives, and user-level transformations — ready for market research, behavioral studies, and quantitative analysis.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • Python 98.3%
  • Shell 1.7%