Querying the Dataset
Discover what is in the dataset without downloading anything.
Every cell in this example either queries the S3 bucket directly
(laion_fmri.discovery) or reads bundled metadata that ships with
the package (laion_fmri.splits). No subject data is fetched.
Where a query needs locally-downloaded files, the corresponding
download(...) and Subject-API calls are shown commented out,
so you can copy them without this script triggering a download.
Pick the subject you want to look at on the line below:
SUBJECT = "sub-01"
Initialize a data directory
Discovery and split listings don’t need data on disk, but
dataset_initialize is still required so that any subsequent
(commented-out) download(...) would have a destination.
import os
from laion_fmri.config import dataset_initialize
data_dir = os.path.join(os.getcwd(), "laion_fmri_quickstart")
os.makedirs(data_dir, exist_ok=True)
dataset_initialize(data_dir)
from laion_fmri.discovery import (
    describe,
    get_rois,
    get_subjects,
    inspect_bucket,
)
Top-level summary
describe() prints a one-screen overview: bucket name, subject
count, and the first subject’s ROI list. Run it first to confirm
the bucket is reachable.
describe()
LAION-fMRI Dataset
Bucket: s3://laion-fmri
Subjects: 5 (sub-01, sub-03, sub-05, sub-06, sub-07)
ROIs: EBA, FBA, FFA1, FFA2, IPCS, IPS0, LO1, LO2, MPA, MST, MT, OFA, OPA, PPA, SPCS, TO1, TO2, V1d, V1v, V2d, V2v, V3A, V3B, V3d, V3v, VO1, VO2, VWFA1, VWFA2, hV4, laionEVC, laiondorsal, laiongeneral, laionlateral, laionventral, lobjects, mfswords, pSTSfaces, pSTSwords, vobjects
Subjects in the bucket
get_subjects lists every subject the bucket exposes,
including ones whose data is only partially uploaded, so the
count matches the dataset’s published size, not just the
subjects with complete data.
print(f"All subjects: {get_subjects()}")
print(f"Querying subject: {SUBJECT}")
All subjects: ['sub-01', 'sub-03', 'sub-05', 'sub-06', 'sub-07']
Querying subject: sub-01
ROI queries: specific / category / all
ROIs ship in eight categories on the bucket. Use the
category= filter when you want to scope a query to one
functional family (e.g. just the face-area ROIs); call
get_rois without a filter when you want the full inventory.
ROI_CATEGORIES = (
    "body", "character", "face", "laion",
    "motion", "object", "place", "retinotopy",
)
print(f"All ROIs ({len(get_rois(SUBJECT))}):")
print(get_rois(SUBJECT))
print()
for cat in ROI_CATEGORIES:
    rois = get_rois(SUBJECT, category=cat)
    print(f"{cat}: {rois}")
All ROIs (40):
['EBA', 'FBA', 'FFA1', 'FFA2', 'IPCS', 'IPS0', 'LO1', 'LO2', 'MPA', 'MST', 'MT', 'OFA', 'OPA', 'PPA', 'SPCS', 'TO1', 'TO2', 'V1d', 'V1v', 'V2d', 'V2v', 'V3A', 'V3B', 'V3d', 'V3v', 'VO1', 'VO2', 'VWFA1', 'VWFA2', 'hV4', 'laionEVC', 'laiondorsal', 'laiongeneral', 'laionlateral', 'laionventral', 'lobjects', 'mfswords', 'pSTSfaces', 'pSTSwords', 'vobjects']
body: ['EBA', 'FBA']
character: ['VWFA1', 'VWFA2', 'mfswords', 'pSTSwords']
face: ['FFA1', 'FFA2', 'OFA', 'pSTSfaces']
laion: ['laionEVC', 'laiondorsal', 'laiongeneral', 'laionlateral', 'laionventral']
motion: ['MST', 'MT']
object: ['lobjects', 'vobjects']
place: ['MPA', 'OPA', 'PPA']
retinotopy: ['IPCS', 'IPS0', 'LO1', 'LO2', 'SPCS', 'TO1', 'TO2', 'V1d', 'V1v', 'V2d', 'V2v', 'V3A', 'V3B', 'V3d', 'V3v', 'VO1', 'VO2', 'hV4']
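The per-category listings above can be inverted into a ROI-to-category lookup, which also makes it easy to confirm that the eight categories account for all 40 ROIs exactly once. The sketch below hard-codes the sub-01 lists printed above instead of calling get_rois, so it runs offline. ROIS_BY_CATEGORY and roi_to_category are illustrative names, not part of laion_fmri.

```python
# ROI lists copied from the sub-01 output above (not fetched live).
ROIS_BY_CATEGORY = {
    "body": ["EBA", "FBA"],
    "character": ["VWFA1", "VWFA2", "mfswords", "pSTSwords"],
    "face": ["FFA1", "FFA2", "OFA", "pSTSfaces"],
    "laion": ["laionEVC", "laiondorsal", "laiongeneral",
              "laionlateral", "laionventral"],
    "motion": ["MST", "MT"],
    "object": ["lobjects", "vobjects"],
    "place": ["MPA", "OPA", "PPA"],
    "retinotopy": ["IPCS", "IPS0", "LO1", "LO2", "SPCS", "TO1", "TO2",
                   "V1d", "V1v", "V2d", "V2v", "V3A", "V3B", "V3d",
                   "V3v", "VO1", "VO2", "hV4"],
}

# Invert category -> ROIs into ROI -> category.
roi_to_category = {}
for category, rois in ROIS_BY_CATEGORY.items():
    for roi in rois:
        # A ROI listed under two categories would otherwise be
        # overwritten silently, so fail loudly instead.
        if roi in roi_to_category:
            raise ValueError(f"{roi} appears in more than one category")
        roi_to_category[roi] = category

n_listed = sum(len(rois) for rois in ROIS_BY_CATEGORY.values())
print(f"{n_listed} ROIs across {len(ROIS_BY_CATEGORY)} categories")
print(f"FFA1 belongs to: {roi_to_category['FFA1']}")
```

With a live connection, the same dictionary could be built from get_rois(SUBJECT, category=cat) for each entry in ROI_CATEGORIES, and its size compared against len(get_rois(SUBJECT)).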
Bucket diagnostic listing
inspect_bucket prints the immediate top-level prefixes plus a
count of subject directories under each derivative tree. It is
useful when discovery returns unexpected results.
inspect_bucket()
Bucket: s3://laion-fmri
Top-level prefixes (1):
derivatives/
derivatives/glmsingle-tedana/: 5 entries, 5 sub-* entries
derivatives/rois/: 5 entries, 5 sub-* entries
Bundled train/test splits (no download required)
laion_fmri.splits ships predefined train/test partitions of
the stimulus set so callers can compare against the published
baselines without re-running any clustering or sampling.
from laion_fmri.splits import (
    get_train_test_ids,
    list_ood_types,
    list_pools,
    list_splits,
    load_split,
)
print(f"Pools: {list_pools()}")
print(f"Splits: {list_splits()}")
print(f"OOD types: {list_ood_types()}")
Pools: ['shared', 'sub-01', 'sub-03', 'sub-05', 'sub-06', 'sub-07']
Splits: ['random_0', 'random_1', 'random_2', 'random_3', 'random_4', 'cluster_k5_0', 'cluster_k5_1', 'cluster_k5_2', 'cluster_k5_3', 'cluster_k5_4', 'tau', 'ood']
OOD types: ['cropped', 'gabor', 'gaudy', 'illusion-classic', 'illusion-natural', 'relations', 'selfmade', 'shape', 'unusual']
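The split names above follow a visible pattern: a prefix, optionally followed by a numeric index. A small sketch of grouping them by that prefix is shown below. It works on the names copied from the output above and relies only on string matching. group_by_prefix is a hypothetical helper, and the prefix it derives is a naming heuristic that may not match the split_family attribute reported by load_split for every split.

```python
import re
from collections import defaultdict

# Split names copied from the list_splits() output above.
SPLIT_NAMES = [
    "random_0", "random_1", "random_2", "random_3", "random_4",
    "cluster_k5_0", "cluster_k5_1", "cluster_k5_2", "cluster_k5_3",
    "cluster_k5_4", "tau", "ood",
]


def group_by_prefix(names):
    """Group split names by prefix, dropping a trailing _<integer> index."""
    groups = defaultdict(list)
    for name in names:
        prefix = re.sub(r"_\d+$", "", name)
        groups[prefix].append(name)
    return dict(groups)


groups = group_by_prefix(SPLIT_NAMES)
for prefix, members in groups.items():
    print(f"{prefix}: {len(members)} split(s)")
```

Grouping this way makes it straightforward to iterate over all indexed variants of one prefix, for example to average a metric over the five random_* splits.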
Inspect one split
load_split(name, pool=...) returns a Split describing the
split’s sizes and family. get_train_test_ids is the
convenience wrapper that gives you the actual ID lists in one
call.
split = load_split("random_0", pool="shared")
print(f"Split: {split.name}")
print(f"Pool: {split.pool}")
print(f"Family: {split.split_family}")
print(f"n_train: {split.n_train}")
print(f"n_test: {split.n_test}")
train_ids, test_ids = get_train_test_ids("random_0", pool="shared")
print(f"Loaded: {len(train_ids)} train / {len(test_ids)} test ids")
Split: random_0
Pool: shared
Family: random
n_train: 897
n_test: 224
Loaded: 897 train / 224 test ids
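Before training on a split, it can be worth confirming that the two ID lists do not overlap and checking the test fraction. The sketch below is one way to do that. check_split is a hypothetical helper, and the integer IDs are stand-ins with the sizes reported above, since the real ID type and values come from get_train_test_ids. Whether the bundled splits are guaranteed disjoint is not stated here, so the check is a precaution rather than a documented property.

```python
def check_split(train_ids, test_ids):
    """Return the test fraction after checking the ID lists are disjoint."""
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"{len(overlap)} ids appear in both train and test")
    return len(test_ids) / (len(train_ids) + len(test_ids))


# Stand-in ids with the sizes printed above (897 train, 224 test).
# Real ids would come from get_train_test_ids("random_0", pool="shared").
train_ids = list(range(897))
test_ids = list(range(897, 897 + 224))

fraction = check_split(train_ids, test_ids)
print(f"test fraction: {fraction:.3f}")
```

With these sizes the test fraction is 224 / 1121, which is roughly a one-in-five hold-out.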
OOD splits with a type filter
The ood split partitions held-out stimuli by category; the
ood_types= argument restricts which categories are kept in the
test set.
_, test_shape = get_train_test_ids(
    "ood", pool="shared", ood_types=["shape"],
)
print(f"OOD shape only: test ids = {len(test_shape)}")
OOD shape only: test ids = 82
Per-subject queries that need local data
Discovery covers what is available. Once you commit to working
with a specific subject, use the per-subject API on Subject.
Its methods read on-disk files, so they presuppose a download.
The block below is shown but not executed, to keep this example
offline; copy the lines you need into your own script after
running download(...) for the subject and session you care about.
from laion_fmri.download import download
# one session for one subject (~few hundred MB):
download(subject="sub-01", ses="ses-01")
from laion_fmri.subject import load_subject
sub = load_subject("sub-01")
# Sessions present on disk
print(sub.get_sessions()) # ['ses-01', ...]
# Trial info: runs, repetitions, stimulus labels
trials = sub.get_trial_info(session="ses-01")
# columns: session, run, beta_index, label
print(trials.columns.tolist())
print(trials["run"].unique()) # runs in this session
print(len(trials)) # trial count
# Single-trial betas with the multi-level ROI grammar
betas_one = sub.get_betas(session="ses-01", roi="FFA1")
betas_face = sub.get_betas(session="ses-01", roi="face")
betas_all = sub.get_betas(session="ses-01", roi="all")
# Multi-format ROI loading
roi = sub.get_roi_data("FFA1", format="all", hemi="all")
# roi["FFA1"] is a nested dict:
# {
#     "volume": <1-D bool>,
#     "gii": {"hemi-L": {"func.gii": ..., "label": ...},
#             "hemi-R": {...}},
# }
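When a call returns a nested dict like the one sketched in the comment above, listing its leaf paths is a quick way to see what was loaded. The sketch below walks a stand-in dictionary. leaf_paths and mock_roi are hypothetical names, the leaf values are placeholder strings rather than real arrays, and the hemi-R entry is assumed to mirror hemi-L because its contents are elided above.

```python
def leaf_paths(node, prefix=()):
    """Yield (path, value) for every non-dict leaf in a nested dict."""
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from leaf_paths(value, path)
        else:
            yield path, value


# Stand-in with the layout from the comment above. Real leaves would be
# masks and surface data; hemi-R mirroring hemi-L is an assumption.
mock_roi = {
    "FFA1": {
        "volume": "<1-D bool mask>",
        "gii": {
            "hemi-L": {"func.gii": "<left func>", "label": "<left label>"},
            "hemi-R": {"func.gii": "<right func>", "label": "<right label>"},
        },
    },
}

paths = ["/".join(path) for path, _ in leaf_paths(mock_roi)]
for p in paths:
    print(p)
```

The same walker can be pointed at the real return value of sub.get_roi_data(...) once data is on disk, without needing to know the nesting depth in advance.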
Cross-subject discovery
Loop get_subjects() to ask the same questions of every subject
in the bucket. ROI counts can differ across subjects (some ROIs
don’t exist for everyone).
sub-01: 40 ROIs total, 4 face
sub-03: 43 ROIs total, 6 face
sub-05: 41 ROIs total, 6 face
sub-06: 41 ROIs total, 5 face
sub-07: 41 ROIs total, 6 face
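A listing in the format above can be produced by a short loop over get_subjects(). The sketch below is one possible version, assuming get_rois accepts a subject and an optional category= argument as in the earlier cells. summarize_rois and fake_get_rois are hypothetical names. The lookup function is passed in as a parameter so the sketch can run offline against a stand-in, and the stand-in's ROI lists are invented for illustration only.

```python
def summarize_rois(subjects, rois_for, category="face"):
    """Build one summary line per subject from a get_rois-style callable."""
    lines = []
    for subject in subjects:
        n_total = len(rois_for(subject))
        n_category = len(rois_for(subject, category=category))
        lines.append(
            f"{subject}: {n_total} ROIs total, {n_category} {category}"
        )
    return lines


# Live use (needs the package installed and a reachable bucket):
#   from laion_fmri.discovery import get_rois, get_subjects
#   print("\n".join(summarize_rois(get_subjects(), get_rois)))


# Offline stand-in so the sketch runs without network access.
def fake_get_rois(subject, category=None):
    rois = {"face": ["FFA1", "OFA"], "place": ["PPA"]}
    if category is None:
        return [roi for group in rois.values() for roi in group]
    return rois.get(category, [])


print("\n".join(summarize_rois(["sub-01"], fake_get_rois)))
```

Passing the lookup in as an argument also makes it easy to swap the category, for example summarize_rois(get_subjects(), get_rois, category="place").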
Stimulus metadata (forward-compat)
The stimuli/ prefix is reserved for the stimulus images and
their metadata table; it isn’t populated yet. Once it lands, the
call below would print the catalogue (commented out for the same
offline-by-default reason as the Subject queries above):
# download(subject="sub-01", include_stimuli=True)
# sub = load_subject("sub-01")
# if sub.has_stimuli():
#     stim = sub.get_stimulus_metadata()
#     print(stim.head())
#     print(f"Total stimuli: {len(stim)}")
#     print(f"Shared: {stim['shared'].sum()}")
#     print(f"Categories: {stim['category'].value_counts()}")