Querying the Dataset

Discover what is in the dataset without downloading anything.

Every cell in this example either queries the S3 bucket directly (laion_fmri.discovery) or reads bundled metadata that ships with the package (laion_fmri.splits). No subject data is fetched. Where a query needs locally-downloaded files, the corresponding download(...) and Subject-API calls are shown commented out, so you can copy them without this script triggering a download.

Pick the subject you want to look at on the line below:

SUBJECT = "sub-01"

Initialize a data directory

Discovery and split listings don’t need data on disk, but dataset_initialize is still required so that any subsequent (commented-out) download(...) would have a destination.

import os

from laion_fmri.config import dataset_initialize

data_dir = os.path.join(os.getcwd(), "laion_fmri_quickstart")
os.makedirs(data_dir, exist_ok=True)
dataset_initialize(data_dir)

from laion_fmri.discovery import (
    describe,
    get_rois,
    get_subjects,
    inspect_bucket,
)

Top-level summary

describe() prints a one-screen overview: bucket name, subject count, and the first subject’s ROI list. Run it first to confirm the bucket is reachable.

describe()
LAION-fMRI Dataset
  Bucket:    s3://laion-fmri
  Subjects:  5 (sub-01, sub-03, sub-05, sub-06, sub-07)
  ROIs:      EBA, FBA, FFA1, FFA2, IPCS, IPS0, LO1, LO2, MPA, MST, MT, OFA, OPA, PPA, SPCS, TO1, TO2, V1d, V1v, V2d, V2v, V3A, V3B, V3d, V3v, VO1, VO2, VWFA1, VWFA2, hV4, laionEVC, laiondorsal, laiongeneral, laionlateral, laionventral, lobjects, mfswords, pSTSfaces, pSTSwords, vobjects

Subjects in the bucket

get_subjects lists every subject the bucket exposes, including those whose data is only partially uploaded, so the count matches the dataset's published size rather than just the subjects with complete data.

print(f"All subjects: {get_subjects()}")
print(f"Querying subject: {SUBJECT}")
All subjects: ['sub-01', 'sub-03', 'sub-05', 'sub-06', 'sub-07']
Querying subject: sub-01

ROI queries: specific / category / all

ROIs ship in eight categories on the bucket. Use the category= filter when you want to scope a query to one functional family (e.g. just the face-area ROIs); call get_rois without a filter when you want the full inventory.

ROI_CATEGORIES = (
    "body", "character", "face", "laion",
    "motion", "object", "place", "retinotopy",
)

print(f"All ROIs ({len(get_rois(SUBJECT))}):")
print(get_rois(SUBJECT))
print()
for cat in ROI_CATEGORIES:
    rois = get_rois(SUBJECT, category=cat)
    print(f"{cat}: {rois}")
All ROIs (40):
['EBA', 'FBA', 'FFA1', 'FFA2', 'IPCS', 'IPS0', 'LO1', 'LO2', 'MPA', 'MST', 'MT', 'OFA', 'OPA', 'PPA', 'SPCS', 'TO1', 'TO2', 'V1d', 'V1v', 'V2d', 'V2v', 'V3A', 'V3B', 'V3d', 'V3v', 'VO1', 'VO2', 'VWFA1', 'VWFA2', 'hV4', 'laionEVC', 'laiondorsal', 'laiongeneral', 'laionlateral', 'laionventral', 'lobjects', 'mfswords', 'pSTSfaces', 'pSTSwords', 'vobjects']

body: ['EBA', 'FBA']
character: ['VWFA1', 'VWFA2', 'mfswords', 'pSTSwords']
face: ['FFA1', 'FFA2', 'OFA', 'pSTSfaces']
laion: ['laionEVC', 'laiondorsal', 'laiongeneral', 'laionlateral', 'laionventral']
motion: ['MST', 'MT']
object: ['lobjects', 'vobjects']
place: ['MPA', 'OPA', 'PPA']
retinotopy: ['IPCS', 'IPS0', 'LO1', 'LO2', 'SPCS', 'TO1', 'TO2', 'V1d', 'V1v', 'V2d', 'V2v', 'V3A', 'V3B', 'V3d', 'V3v', 'VO1', 'VO2', 'hV4']
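The specific / category / all grammar that get_rois and (later) get_betas accept can be mimicked locally. A minimal sketch built from the category listings printed above; the resolve_roi helper is hypothetical and not part of laion_fmri:

```python
# Category -> ROI mapping, copied from the listing above (two categories
# only, for brevity; the real inventory has eight).
ROI_BY_CATEGORY = {
    "body": ["EBA", "FBA"],
    "face": ["FFA1", "FFA2", "OFA", "pSTSfaces"],
}

def resolve_roi(spec):
    """Hypothetical resolver: a specific ROI name, a category, or 'all'."""
    if spec == "all":
        return sorted(r for rois in ROI_BY_CATEGORY.values() for r in rois)
    if spec in ROI_BY_CATEGORY:           # category name
        return list(ROI_BY_CATEGORY[spec])
    return [spec]                         # assume a specific ROI name

print(resolve_roi("FFA1"))      # ['FFA1']
print(resolve_roi("face"))      # ['FFA1', 'FFA2', 'OFA', 'pSTSfaces']
print(len(resolve_roi("all")))  # 6
```

The real package resolves categories per subject (ROI inventories differ across subjects), so treat this as a mental model rather than a drop-in replacement.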

Bucket diagnostic listing

inspect_bucket prints the immediate top-level prefixes plus a count of subject directories under each derivative tree, which is useful when discovery returns unexpected results.

inspect_bucket()
Bucket: s3://laion-fmri
Top-level prefixes (1):
  derivatives/
derivatives/glmsingle-tedana/: 5 entries, 5 sub-* entries
derivatives/rois/: 5 entries, 5 sub-* entries

Bundled train/test splits (no download required)

laion_fmri.splits ships predefined train/test partitions of the stimulus set so callers can compare against the published baselines without re-running any clustering or sampling.

from laion_fmri.splits import (
    get_train_test_ids,
    list_ood_types,
    list_pools,
    list_splits,
    load_split,
)

print(f"Pools:     {list_pools()}")
print(f"Splits:    {list_splits()}")
print(f"OOD types: {list_ood_types()}")
Pools:     ['shared', 'sub-01', 'sub-03', 'sub-05', 'sub-06', 'sub-07']
Splits:    ['random_0', 'random_1', 'random_2', 'random_3', 'random_4', 'cluster_k5_0', 'cluster_k5_1', 'cluster_k5_2', 'cluster_k5_3', 'cluster_k5_4', 'tau', 'ood']
OOD types: ['cropped', 'gabor', 'gaudy', 'illusion-classic', 'illusion-natural', 'relations', 'selfmade', 'shape', 'unusual']
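Judging by the names above and the Family field printed further down, a split's family appears to be its name with any trailing numeric index stripped. A hedged sketch of that naming convention; split_family is a local helper, not part of the package:

```python
import re

def split_family(name):
    """Local guess: drop a trailing _<digits> index (e.g. random_0 -> random)."""
    return re.sub(r"_\d+$", "", name)

for name in ["random_0", "cluster_k5_3", "tau", "ood"]:
    print(f"{name:>13} -> {split_family(name)}")
```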

Inspect one split

load_split(name, pool=...) returns a Split describing the split’s sizes and family. get_train_test_ids is the convenience wrapper that gives you the actual ID lists in one call.

split = load_split("random_0", pool="shared")
print(f"Split:    {split.name}")
print(f"Pool:     {split.pool}")
print(f"Family:   {split.split_family}")
print(f"n_train:  {split.n_train}")
print(f"n_test:   {split.n_test}")

train_ids, test_ids = get_train_test_ids("random_0", pool="shared")
print(f"Loaded:   {len(train_ids)} train / {len(test_ids)} test ids")
Split:    random_0
Pool:     shared
Family:   random
n_train:  897
n_test:   224
Loaded:   897 train / 224 test ids
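Whichever split you load, verifying that the train and test IDs are disjoint and have the expected sizes is cheap insurance before fitting anything. A self-contained sketch with toy IDs standing in for the 897/224 lists loaded above:

```python
def check_split(train_ids, test_ids):
    """Return (n_train, n_test) after verifying the partition is disjoint."""
    train, test = set(train_ids), set(test_ids)
    overlap = train & test
    if overlap:
        raise ValueError(f"{len(overlap)} ids appear in both train and test")
    return len(train), len(test)

# Toy stand-ins for real stimulus ids.
toy_train = [f"stim-{i:04d}" for i in range(6)]
toy_test  = [f"stim-{i:04d}" for i in range(6, 8)]
print(check_split(toy_train, toy_test))  # (6, 2)
```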

OOD splits with a type filter

The ood split partitions held-out stimuli by category; the ood_types= argument restricts which categories are kept in the test set.

_, test_shape = get_train_test_ids(
    "ood", pool="shared", ood_types=["shape"],
)
print(f"OOD shape only:  test ids = {len(test_shape)}")
OOD shape only:  test ids = 82
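One plausible reading of the ood_types= filter is a simple membership test on a per-stimulus category label. A toy sketch of that idea; the id-to-type mapping below is invented purely for illustration:

```python
# Invented stimulus-id -> OOD-type mapping, for illustration only.
OOD_TYPE = {
    "stim-a": "shape",
    "stim-b": "gabor",
    "stim-c": "shape",
    "stim-d": "gaudy",
}

def filter_ood(test_ids, ood_types):
    """Keep only test ids whose OOD category is in the requested set."""
    allowed = set(ood_types)
    return [sid for sid in test_ids if OOD_TYPE.get(sid) in allowed]

print(filter_ood(list(OOD_TYPE), ["shape"]))           # ['stim-a', 'stim-c']
print(filter_ood(list(OOD_TYPE), ["shape", "gaudy"]))  # ['stim-a', 'stim-c', 'stim-d']
```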

Per-subject queries that need local data

Discovery covers what is available; once you commit to working with a specific subject, the per-subject API on Subject is what you reach for. Those methods read on-disk files, so they presuppose a download. The block below is commented out to keep this example offline; copy the lines you need into your own script after running download(...) for the subject and session you care about.

# from laion_fmri.download import download
# # one session for one subject (~few hundred MB):
# download(subject="sub-01", ses="ses-01")

# from laion_fmri.subject import load_subject
# sub = load_subject("sub-01")

# Sessions present on disk
# print(sub.get_sessions())                    # ['ses-01', ...]

# Trial info: runs, repetitions, stimulus labels
# trials = sub.get_trial_info(session="ses-01")
# # columns: session, run, beta_index, label
# print(trials.columns.tolist())
# print(trials["run"].unique())                # runs in this session
# print(len(trials))                           # trial count

# Single-trial betas with the multi-level ROI grammar
# betas_one  = sub.get_betas(session="ses-01", roi="FFA1")
# betas_face = sub.get_betas(session="ses-01", roi="face")
# betas_all  = sub.get_betas(session="ses-01", roi="all")

# Multi-format ROI loading
# roi = sub.get_roi_data("FFA1", format="all", hemi="all")
# # roi["FFA1"] is a nested dict:
# # {
# #   "volume": <1-D bool>,
# #   "gii": {"hemi-L": {"func.gii": ..., "label": ...},
# #           "hemi-R": {...}},
# # }
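The nested structure sketched in the comments above can be navigated like any Python dict. A toy illustration with stand-in values; the real arrays come from get_roi_data after a download:

```python
# Stand-in for the structure get_roi_data returns for one ROI.
roi = {
    "FFA1": {
        "volume": [True, False, True],  # stand-in for the 1-D bool mask
        "gii": {
            "hemi-L": {"func.gii": "left-surface-data", "label": "left-labels"},
            "hemi-R": {"func.gii": "right-surface-data", "label": "right-labels"},
        },
    },
}

mask = roi["FFA1"]["volume"]
left = roi["FFA1"]["gii"]["hemi-L"]["func.gii"]
print(f"{sum(mask)} voxels in the toy mask; left surface -> {left!r}")
```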

Cross-subject discovery

Loop over get_subjects() to ask the same questions of every subject in the bucket. ROI counts can differ across subjects, since some ROIs don't exist for everyone.

for sub_id in get_subjects():
    n_face = len(get_rois(sub_id, category="face"))
    n_total = len(get_rois(sub_id))
    print(f"  {sub_id}: {n_total:>3} ROIs total, {n_face} face")
sub-01:  40 ROIs total, 4 face
sub-03:  43 ROIs total, 6 face
sub-05:  41 ROIs total, 6 face
sub-06:  41 ROIs total, 5 face
sub-07:  41 ROIs total, 6 face

Stimulus metadata (forward-compat)

The stimuli/ prefix is reserved for the stimulus images and their metadata table; it isn’t populated yet. Once it lands, the call below would print the catalogue (commented out for the same offline-by-default reason as the Subject queries above):

# download(subject="sub-01", include_stimuli=True)
# sub = load_subject("sub-01")
# if sub.has_stimuli():
#     stim = sub.get_stimulus_metadata()
#     print(stim.head())
#     print(f"Total stimuli: {len(stim)}")
#     print(f"Shared:        {stim['shared'].sum()}")
#     print(f"Categories:    {stim['category'].value_counts()}")

Total running time of the script: (0 minutes 42.040 seconds)

Gallery generated by Sphinx-Gallery