Welcome to Science-Star! Get up and running in minutes with this simple guide. ⚡
Make sure you have completed the Installation first! 🔧
Science-Star supports both Humanity's Last Exam (HLE) and GAIA benchmarks. Choose your dataset:
HLE:
```sh
sh scripts/run_hle.sh
```

GAIA:

```sh
sh scripts/run_gaia.sh
```

These scripts will:
- 🤖 Launch the evaluation with gpt-4o-mini (multi-agent: Manager Agent + Sub-Agent for Search)
- 📊 Process the dataset using LLM agents with search, crawl, and inspector tools
- 💾 Save results to `output/<run_name>/<timestamp>/answers.jsonl`
- ⏱️ Use configuration from `configs/hle.yaml` or `configs/gaia.yaml`
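Because `answers.jsonl` stores one JSON record per line, results are easy to inspect programmatically. Here is a minimal sketch of a JSONL loader; the field names in the sample (`task_id`, `prediction`, `is_correct`) are illustrative assumptions, not the exact Science-Star schema:

```python
import json

def load_answers(lines):
    """Parse JSONL content: one JSON record per non-blank line.
    Works on an open file handle or any iterable of strings."""
    records = []
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records

# Tiny inline sample; a real file lives at output/<run_name>/<timestamp>/answers.jsonl
# and its field names may differ from these assumed ones.
sample = [
    '{"task_id": 1, "prediction": "42", "is_correct": true}',
    '{"task_id": 2, "prediction": "DNA", "is_correct": false}',
]
records = load_answers(sample)
print(len(records))  # 2
```

To read a real results file, pass an open file object: `load_answers(open("output/.../answers.jsonl"))`.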
Want to customize your run? Use dot notation to override config parameters:
```sh
# Multi-agent (CodeAgent + search agent) — recommended
PYTHONPATH=.:science_star python3 science_star/run_multi_agent.py \
  config=configs/hle.yaml \
  models.name=gpt-4o-mini \
  dataset.subset=small \
  agents.max_steps=12 \
  runtime.run_name=my-custom-run

# Single-agent (ToolCallingAgent only)
PYTHONPATH=.:science_star python3 science_star/run_single_agent.py \
  config=configs/gaia.yaml \
  models.name=gpt-4o-mini \
  runtime.run_name=my-single-agent-run
```

Key Parameters (dot notation):
- `config`: Config file path (`configs/hle.yaml` or `configs/gaia.yaml`)
- `models.name`: LLM model (`gpt-4o-mini`, `gpt-4o`, etc.)
- `dataset.subset`: Dataset size (`small` | `medium` | `large` | `null` for the full set)
- `dataset.category`: (HLE) Focus on domains: bio, chem, cs, math, physics, etc.
- `dataset.level`: (GAIA) Difficulty: `level1`, `level2`, `level3`
- `dataset.selected_tasks`: (GAIA) 1-based task IDs for quick tests, e.g. `[1,2,3]`
- `agents.max_steps`: Maximum reasoning steps per problem
- `runtime.run_name`: Custom name for your experiment
- `runtime.concurrency`: Parallel task execution (default: 1)
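Conceptually, each dot-notation argument walks the nested config and sets one leaf value. The sketch below illustrates that mechanism only; it is not Science-Star's actual config-parsing code:

```python
def apply_override(config: dict, dotted_key: str, value):
    """Set config['a']['b']['c'] = value for dotted_key 'a.b.c',
    creating intermediate dicts as needed."""
    keys = dotted_key.split(".")
    node = config
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return config

# e.g. the CLI argument models.name=gpt-4o-mini becomes:
cfg = {"models": {"name": "gpt-4o"}, "runtime": {"concurrency": 1}}
apply_override(cfg, "models.name", "gpt-4o-mini")
apply_override(cfg, "agents.max_steps", 12)
print(cfg["models"]["name"])  # gpt-4o-mini
```

Keys not mentioned on the command line (here `runtime.concurrency`) keep their values from the YAML file.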
Watch your agents work in real-time! The console will show:
- ✅ Completed tasks
- 🔄 Current reasoning steps
- 📊 Success rates
- ⏱️ Execution times
Launch the interactive dataset explorer (supports both HLE and GAIA):
```sh
streamlit run visualization/vis_dataset.py
```

Run from the project root so paths resolve correctly. This opens a web interface where you can:
- 📖 Browse Questions: Navigate through dataset entries with LaTeX/Markdown rendering
- 🏷️ Filter by Category: Explore Biology, Chemistry, Math, Physics, and more (HLE)
- 🔎 Search & Jump: Quick navigation to specific entries
- 📊 View Metadata: See question types, difficulty levels, and annotations
- 📁 Multi-Dataset: Switch between HLE and GAIA subsets
- Smart Rendering: Mathematical equations and scientific notation render cleanly
- Multi-Format Support: Works with both HLE and GAIA dataset schemas
- Real-time Navigation: Instant switching between questions
- Responsive Design: Works on desktop and mobile
```sh
streamlit run visualization/vis_output.py
```

Perfect for analyzing experiment results! Supports:
- HLE and GAIA answer formats
- Filtering by category, parsing error, iteration limit, correctness
- Task details with reasoning steps, agent errors, and judgment results
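You can reproduce a simple correctness filter outside the UI as well. A minimal sketch, assuming each parsed answer record carries an `is_correct` field (a hypothetical key, not a guaranteed part of the schema):

```python
def success_rate(records):
    """Fraction of answer records judged correct.
    Assumes a boolean is_correct field; missing keys count as incorrect."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("is_correct")) / len(records)

# Hypothetical records parsed from an answers.jsonl file:
records = [
    {"task_id": 1, "is_correct": True},
    {"task_id": 2, "is_correct": False},
    {"task_id": 3, "is_correct": True},
    {"task_id": 4, "is_correct": True},
]
print(f"{success_rate(records):.0%}")  # 75%
```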
- 📖 Check Project Structure for detailed architecture
- 🐛 Found a bug? Open an issue on GitHub
Happy experimenting! 🚀✨ Your scientific AI agents are ready to tackle complex problems!