
Add Schrag2026Pediatric SSVEP dataset (n=47, ages 5-18) #1069

Open
bruAristimunha wants to merge 2 commits into develop from add-schrag2026-pediatric-ssvep

@bruAristimunha (Collaborator) commented May 7, 2026

Summary

Adds Schrag2026Pediatric — pediatric SSVEP-BCI dataset from Schrag et al. (2026) hosted on Zenodo (CC-BY-ND-4.0). 47 children (ages 5–18, 40.4% female) recorded at 256 Hz on 16 channels with the g.tec g.GAMMAsys + g.USBamp + g.GAMMAcap system, ground Fpz, earlobe reference.

Each subject contributes:

  • Personalization (T1) — 12 visual stimuli (4 contrasts × 3 sizes), all flickering at 10 Hz, used by the original authors to pick a per-child personalized stimulus.
  • SSVEP game (T2 + T3) — 4-target online game at 6.25 / 10 / 11.11 / 14.28 Hz, played twice (once with each subject's personal stimulus, once with a high-contrast standard) across two themed maps.

By default the loader exposes the SSVEP game runs only — two sessions per subject ("0standard", "1personal"), 5 s trials at four target frequencies. include_personalization=True also loads T1 as a third session ("2personalization"); all its trials carry the "10" event since every personalization stimulus flickers at 10 Hz.

Important caveat (also in the class docstring): trial labels for the game sessions come from the recorded fbCCA classifier output (the Selected SPO column of the per-game movement CSV), i.e. the frequency the system identified during the live game, which then drove avatar movement. They are not ground-truth target frequencies, and treating y as such biases benchmarks toward fbCCA's behaviour. When the number of detected trials and the number of CSV rows differ by a small amount (≤ 10 percent), both are truncated to the shorter length; otherwise the run's labels are dropped to avoid silently shifted labels.
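The count-drift rule can be sketched as follows. This is a minimal illustration, not the code in schrag2026.py: the function name `align_labels`, the drift definition (absolute count difference divided by the larger count), and returning `None` for a dropped run are all assumptions made for the example.

```python
import logging

log = logging.getLogger(__name__)

MAX_DRIFT = 0.10  # threshold quoted in the PR description (<= 10 percent)


def align_labels(n_trials, csv_labels, max_drift=MAX_DRIFT):
    """Return labels aligned to the EEG trials, or None if drift is too large.

    Assumption: drift = |n_trials - n_labels| / max(n_trials, n_labels).
    """
    n_labels = len(csv_labels)
    if n_trials == 0 or n_labels == 0:
        return None
    drift = abs(n_trials - n_labels) / max(n_trials, n_labels)
    if drift > max_drift:
        # Too many missing rows: any alignment would risk shifted labels.
        log.error(
            "trial/CSV drift %.0f%% exceeds %.0f%%; dropping labels",
            100 * drift,
            100 * max_drift,
        )
        return None
    # Small drift: keep only the first min(n_trials, n_labels) entries.
    n_keep = min(n_trials, n_labels)
    return list(csv_labels[:n_keep])
```

A caller using this sketch would also keep only the first `len(result)` trials so that trials and labels stay paired.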

Files

  • moabb/datasets/schrag2026.py — new dataset class + helpers
  • moabb/datasets/__init__.py — registration
  • moabb/datasets/summary_ssvep.csv — summary row
  • docs/source/api.rst — autosummary entry under SSVEP datasets
  • docs/source/whats_new.rst — changelog entry

Implementation notes

  • XDF + Unity markers via pyxdf (soft-imported). Modelled after aguilera_rodriguez2025.py (XDF) and kumar2024.py (single-zip Zenodo). Marker stream is selected by name (UnityMarkerStream) — each XDF also carries an empty gUSBamp-1Markers stream that wins a type-based match in some files.
  • Single 1.2 GB DatasetData.zip is downloaded once and extracted per-subject on demand via safe_extract_zip(... members=...) so first-use latency stays in seconds for one-subject runs. Extraction is staged into a sibling temp dir, then moved into place with os.replace, which keeps it race-safe under concurrent pytest workers.
  • Demographics (_AGES, _SEXES) hardcoded from Participant_Demographic_Info.csv in the deposit; verified to match byte-for-byte.
  • License set to CC-BY-ND-4.0 per the live Zenodo deposit (the preprint PDF says CC-BY-4.0; Zenodo metadata is authoritative for the data — comment in the source explains the discrepancy).
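To illustrate why the marker stream is selected by name rather than by type, here is a small sketch on toy data. It assumes the stream layout that pyxdf.load_xdf returns (each stream is a dict whose "info" entry maps field names to single-element lists). The helper name `pick_stream_by_name`, the "Markers" type value, and the "trial_start" payload are hypothetical.

```python
def pick_stream_by_name(streams, name):
    """Return the first stream whose info name equals `name`, else None."""
    for stream in streams:
        if stream["info"]["name"][0] == name:
            return stream
    return None


# Toy stand-ins for the two marker streams found in one XDF file.
streams = [
    {
        "info": {"name": ["gUSBamp-1Markers"], "type": ["Markers"]},
        "time_series": [],  # empty stream, as described in the PR
    },
    {
        "info": {"name": ["UnityMarkerStream"], "type": ["Markers"]},
        "time_series": [["trial_start"]],  # hypothetical marker payload
    },
]

# A type-based match returns whichever marker stream comes first.
by_type = next(s for s in streams if s["info"]["type"][0] == "Markers")
# A name-based match is unambiguous.
by_name = pick_stream_by_name(streams, "UnityMarkerStream")
```

In this toy ordering the type-based match lands on the empty gUSBamp-1Markers stream, while the name-based match returns the Unity markers.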

Cross-checks performed

  • 31/31 metadata fields verified against (a) preprint PDF, (b) Participant_Demographic_Info.csv, (c) live Zenodo API.
  • Loaded subjects 1, 2, 10, 20, 30, 40, 47 end-to-end. Trial counts 33–87 per session, all 4 SSVEP classes represented in each clean run.
  • SSVEP paradigm round-trip on 5 clean subjects: X.shape=(428, 16, 1281), balanced classes {6.25: 119, 10: 116, 11.11: 95, 14.28: 98}.
  • Post 7–45 Hz bandpass channel std ~15–34 µV (sane for pediatric EEG).
  • Atomic _extract_subject verified idempotent; no temp leftovers after concurrent extraction races.
  • P001 personal session has 15% trial-vs-CSV drift; the safeguard correctly drops labels with a clear log.error rather than silently shifting them.
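The reported round-trip shape can be checked by hand from the numbers above. The snippet below is only an arithmetic check; it assumes the epoch window includes both endpoints (as MNE epochs do), which is where the extra sample comes from.

```python
SFREQ = 256  # Hz, sampling rate from the PR description
TRIAL_SECONDS = 5  # trial length from the PR description
N_CHANNELS = 16

# Both endpoints of the 5 s window are kept, hence the "+ 1".
n_samples = SFREQ * TRIAL_SECONDS + 1

# Per-class trial counts reported for the 5 clean subjects.
class_counts = {"6.25": 119, "10": 116, "11.11": 95, "14.28": 98}
n_trials = sum(class_counts.values())

shape = (n_trials, N_CHANNELS, n_samples)
```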

Style note

The class follows MOABB's existing BaseDataset shape but its module-level helpers (_load_xdf_streams, _read_unity_markers, _build_raw, _load_game_run, _load_personalization_run, _match_freq, _extract_subject) are deliberately flat / procedural and use the variable names (marker_ts, markers, eeg_stream, …) from the upstream Schrag / Comaduran reference notebooks (epoching-example.ipynb in the Zenodo deposit), so the original authors can read it top-to-bottom.

Test plan

  • Lint clean (ruff check + ruff format)
  • Module imports without side effects
  • Schrag2026Pediatric() instantiates with 47 subjects, 4 events
  • dataset_search(paradigm="ssvep", events=["10"]) finds it
  • SSVEP paradigm round-trip on multiple subjects
  • include_personalization=True loads T1 (40 trials per subject)
  • sessions=["personal"] and sessions=["0standard"] filter correctly
  • Atomic extraction is idempotent and race-safe
  • CI: pytest, doc build (will be triggered by this PR)
  • Authors confirm CC-BY-ND-4.0 vs CC-BY-4.0 license discrepancy

References

Pediatric SSVEP-BCI dataset from Schrag et al. (2026): 47 children
(ages 5-18, 40.4% female) recorded with g.tec g.GAMMAsys + g.USBamp
at 256 Hz on 16 scalp channels. Two-stage protocol:
- Stimulus personalization (12 stimuli, 4 contrasts x 3 sizes at 10 Hz)
- Online 4-target SSVEP game (6.25 / 10 / 11.11 / 14.28 Hz),
  played twice per subject (personal vs standard stimulus, two maps).

By default the loader exposes the SSVEP game runs as two sessions
("0standard", "1personal") with 5 s trials at four target frequencies;
include_personalization=True opens a third "2personalization" session
(all trials labelled "10" -- the shared 10 Hz flicker).

Trial labels for the game come from the live fbCCA classifier output
(Selected SPO column in the per-game movement CSV); this is documented
in the class docstring as not-quite-ground-truth. Trial / CSV count
drift is min-truncated when small (<= 10 percent), otherwise the run's
labels are dropped to avoid silent shifts.

Data hosted as a single 1.2 GB Zenodo zip (10.5281/zenodo.19440997,
CC-BY-ND-4.0); per-subject extraction is staged via tempfile +
os.replace for race-safe concurrent runs.
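The staged extraction pattern can be sketched with the standard library alone. This is not the PR's _extract_subject: it swaps safe_extract_zip for zipfile.extractall, the function name `extract_subject` and its parameters are hypothetical, and it assumes POSIX rename semantics (os.replace onto an existing non-empty directory fails, which is treated as "another worker won the race").

```python
import os
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_subject(zip_path, dest_dir, prefix):
    """Extract archive members under `prefix` into dest_dir atomically."""
    dest_dir = Path(dest_dir)
    if dest_dir.is_dir() and any(dest_dir.iterdir()):
        return dest_dir  # already extracted: idempotent no-op

    dest_dir.parent.mkdir(parents=True, exist_ok=True)
    # Stage next to the destination so the final rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(dir=dest_dir.parent, prefix=".staging-"))
    try:
        with zipfile.ZipFile(zip_path) as zf:
            members = [m for m in zf.namelist() if m.startswith(prefix)]
            zf.extractall(staging, members=members)
        try:
            os.replace(staging, dest_dir)  # publish in a single rename
        except OSError:
            # Another worker published dest_dir first; keep their copy.
            if not dest_dir.is_dir():
                raise
    finally:
        # Removes the staging dir if the rename did not consume it.
        shutil.rmtree(staging, ignore_errors=True)
    return dest_dir
```

Readers never observe a half-extracted subject directory: it is either absent or complete, and a worker that loses the race discards its own staging copy.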

Preprint DOI: 10.21203/rs.3.rs-9347306/v1

- Add moabb/datasets/schrag2026.py
- Register in moabb/datasets/__init__.py and summary_ssvep.csv
- Add to docs/source/api.rst SSVEP autosummary and whats_new.rst

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1446b3ff31


Comment thread moabb/datasets/schrag2026.py Outdated
Comment on lines +553 to +557
if zip_path.suffix != ".zip":
    target = zip_path.with_suffix(".zip")
    if not target.exists():
        zip_path.rename(target)
    zip_path = target

P1: Preserve the downloader cache filename

Because the Zenodo API URL ends in /content, data_dl caches this archive at a path named content; renaming it to content.zip removes the exact file that data_dl checks on the next call. Since data_path() calls data_dl() before checking whether the subject was already extracted, any later access to an already-extracted subject will still re-download the full ~1.2 GB archive every time. Keep the cached path intact or download/cache under the final zip filename instead.
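One way to keep the cached path intact is to open the archive where the downloader left it, since Python's zipfile inspects file contents rather than the filename extension. The sketch below is an assumption about how that could look, not the fix adopted in the PR; `open_cached_archive` is a hypothetical helper and it does not call data_dl.

```python
import zipfile
from pathlib import Path


def open_cached_archive(cached_path):
    """Open a downloaded archive in place, whatever its filename is."""
    cached_path = Path(cached_path)
    # is_zipfile checks the file's magic number, not its suffix.
    if not zipfile.is_zipfile(cached_path):
        raise ValueError(f"{cached_path} is not a valid zip archive")
    return zipfile.ZipFile(cached_path)
```

Because the file named content is never renamed, a later cache check against that same path would still find it.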


Audit of the dataset class showed six private helpers each used by
exactly one caller, which obscured the linear flow when reading
top-down. Inline the small ones; keep helpers used by both loaders.

- Inline _normalize_spo (now: ``_match_freq`` next to its caller)
- Inline _personalization_label as a 2-line ``rsplit`` in the loader
- Inline _movement_csv_for_eeg as a path expression at the call site
- Inline _wanted_session_keys, _find_game_files, _safe_pair_count
- Move _load_xdf_streams, _read_unity_markers, _build_raw, _load_*_run,
  _extract_subject to module level so the file reads top-to-bottom
- Rename ``marker_text`` -> ``markers`` and ``start_idx`` -> ``trial_starts``
  to match the variable names used in the upstream Schrag/Comaduran
  reference notebooks (``epoching-example.ipynb`` in the Zenodo
  deposit)

Behavior unchanged. All previously-verified properties hold:
- demographics still match the Zenodo CSV byte-for-byte
- METADATA fields preserved (DOI, license, freqs, n_classes, n_subjects)
- 10 percent drift safeguard still drops shifted-label runs
- include_personalization=True still yields 40 T1 trials
- session filtering ("personal" and "0standard" forms) still works
- _extract_subject is still atomic (tempfile + os.replace)
- SSVEP paradigm round-trip identical (428 trials, balanced 4 classes)