Load¶
from laion_fmri.subject import load_subject
sub = load_subject("sub-03")
Each Subject accessor maps to exactly one file on disk and
returns its contents as raw arrays; combining sessions,
averaging, or rebinning is the caller’s responsibility.
Two brain masks are available:
source="anatomical" (default) – the anatomically-derived brain mask shipped under derivatives/anatomical/sub-XX/ses-PrismaAnat/anat/ ..._res-1pt8_desc-brain_mask.nii.gz. Wider than the rsquare-derived mask (it includes voxels with no GLMsingle signal too). Requires download(include_anatomical=True).
source="rsquare" – derived on the fly from the subject-level mean-R^2 map (..._stat-rsquare_desc-R2mean_statmap.nii.gz). Voxels with any non-zero GLMsingle fit are considered “in brain”. No extra download required.
Both masks are consistent across sessions for a given subject,
so betas stacked along the trial axis stay aligned on the
voxel axis (within one source).
sub.get_brain_mask() # anat-derived, res-1pt8
sub.get_brain_mask(source="rsquare") # rsquare-derived
sub.get_brain_mask(res=None) # full-resolution anat mask
sub.get_n_voxels(source="rsquare") # n voxels under rsquare
res defaults to "1pt8" – the functional grid – so the
returned mask aligns with the voxel axis of get_betas and
get_noise_ceiling and with the rsquare-derived mask. Pass
res=None to read the full-resolution anatomical mask
instead; the returned 1-D array is much larger and is meant
for callers working with full-resolution data outside the
loader cascade. res is ignored for source="rsquare"
(the rsquare-derived mask is published at one resolution only).
Several voxel-axis accessors take a matching mask_source
kwarg so the choice flows through downstream:
get_betas(..., mask_source="rsquare"),
get_noise_ceiling(..., mask_source="rsquare"),
to_nifti(..., mask_source="rsquare"),
get_voxel_coordinates(mask_source="rsquare"). These all
pin the mask at res="1pt8" internally, since their data is
on the functional grid.
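As a rough illustration of how the two mask sources relate, the sketch below builds both masks from synthetic NumPy arrays rather than real files. The grid size, the toy R^2 values, and the helper name rsquare_mask are assumptions made for illustration; only the rule itself (a voxel is in the rsquare mask when its mean-R^2 value is non-zero) comes from the description above.

```python
import numpy as np


def rsquare_mask(r2_volume):
    """Return a flat boolean mask that is True wherever R^2 is non-zero."""
    return (r2_volume != 0).ravel()


# Toy 4x4x4 "functional grid" (hypothetical size; real grids are far larger).
rng = np.random.default_rng(0)

# Pretend anatomical brain mask: 2 * 3 * 3 = 18 voxels.
anat_mask = np.zeros((4, 4, 4), dtype=bool)
anat_mask[1:3, 1:4, 1:4] = True

# Pretend mean-R^2 map: non-zero in only 2 * 2 * 2 = 8 of those voxels.
r2 = np.zeros((4, 4, 4), dtype=np.float32)
r2[1:3, 1:3, 1:3] = rng.uniform(0.01, 0.5, size=(2, 2, 2))

anat = anat_mask.ravel()
rsq = rsquare_mask(r2)

n_anat = int(anat.sum())
n_rsq = int(rsq.sum())
# Every rsquare voxel also lies inside the (wider) anatomical mask.
rsq_inside_anat = bool(np.all(anat[rsq]))
```

Because the two masks select different numbers of voxels, arrays produced under one source cannot be indexed with a mask from the other, which is why the mask_source kwarg has to match across calls.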
Core accessors¶
sub.get_sessions() # ['ses-01', ...]
sub.get_brain_mask() # bool, (n_total_voxels,)
sub.get_n_voxels() # number of brain-mask voxels
sub.get_betas(session="ses-01") # float32, (n_trials, n_voxels)
sub.get_betas(session=["ses-01", "ses-02"]) # dict[ses, ndarray]
sub.get_noise_ceiling(session="ses-01") # float32, (n_voxels,)
sub.get_noise_ceiling(desc="Noiseceiling12rep") # subject-level variant
sub.get_trial_info(session="ses-01") # pandas DataFrame
Filters on get_betas¶
roi="..." or list – ROI mask(s); see “ROI queries” below for the full grammar.
mask=ndarray[bool] – custom voxel mask.
nc_threshold=0.2 – keep voxels whose per-session noise ceiling exceeds the threshold.
stimuli="shared" / "unique" – restrict to trials whose stimulus is in the shared/unique subset.
streaming=False (default) materializes the full 4-D NIfTI in RAM and then masks it. One-shot decompression of the .nii.gz, fastest when memory is plentiful, but peak RAM is the full file plus the masked output – roughly 12 GB for a real session.
streaming=True reads the file volume-by-volume and applies the combined brain + ROI + NC mask inline. Peak RAM stays at one volume (~10-50 MB) plus the masked output. Use this on memory-constrained machines like Colab. Works on both .nii and .nii.gz; the compressed path uses a custom gzip pipeline that streams without re-decompressing on each slice.
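To make the difference between the two streaming modes concrete, here is a minimal NumPy sketch on synthetic data. It is not the loader's implementation (which reads NIfTI files and, per the text above, uses its own gzip pipeline); the function names, array sizes, and random masks are all hypothetical. It only shows that masking one volume at a time yields the same (n_trials, n_voxels) result as masking the fully materialized 4-D array.

```python
import numpy as np


def masked_betas_one_shot(data_4d, mask_3d):
    """Mask a fully materialized (x, y, z, n_trials) array in one step."""
    # Boolean indexing over the first three axes gives (n_voxels, n_trials).
    return data_4d[mask_3d].T


def masked_betas_streaming(volumes, mask_3d):
    """Mask one 3-D volume at a time; only the masked rows are retained."""
    rows = [vol[mask_3d] for vol in volumes]
    return np.stack(rows, axis=0)


rng = np.random.default_rng(0)
shape = (6, 5, 4)
n_trials = 7
data = rng.standard_normal(shape + (n_trials,)).astype(np.float32)

# Combined brain + ROI + noise-ceiling mask (all synthetic here).
brain = rng.random(shape) > 0.2
roi = rng.random(shape) > 0.5
noise_ceiling = rng.random(shape)
combined = brain & roi & (noise_ceiling > 0.2)

one_shot = masked_betas_one_shot(data, combined)

# A generator stands in for reading volumes from disk one by one.
volume_iter = (data[..., t] for t in range(n_trials))
streamed = masked_betas_streaming(volume_iter, combined)
```

In the streaming variant the only large object alive at any time is a single volume plus the growing list of masked rows, which is the memory profile the streaming=True description refers to.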
ROI queries¶
ROI inputs accept three forms (or a list mixing them):
"FFA1" – a specific ROI name.
"face" – every ROI in that category (the bucket groups ROIs into body, character, face, laion, motion, object, place, retinotopy).
"all" – every ROI for the subject.
Categories and ROI names are disjoint, so a single string disambiguates by lookup. Pass a list to combine several at once – overlapping voxels appear only once in the result.
sub.get_available_rois() # flat list of every ROI
sub.get_available_categories() # 8 categories
sub.get_available_rois(category="face") # face-area ROIs only
sub.get_roi_mask("FFA1") # 1-D bool mask
sub.get_roi_mask("face") # union of all face ROIs
sub.get_roi_mask("all") # union of every ROI on disk
sub.get_roi_masks(["FFA1", "face"]) # dict keyed by your inputs
ROI queries on get_betas follow the same grammar:
sub.get_betas(session="ses-01", roi="FFA1")
sub.get_betas(session="ses-01", roi="face")
sub.get_betas(session="ses-01", roi=["face", "place"])
Note that get_roi_mask / get_betas(roi=...) are
volume-only: they operate on the space-T1w_res-1pt8
NIfTI mask. Surface variants are loaded via get_roi_data.
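The query grammar above can be pictured as a name lookup followed by a boolean union. The sketch below is an assumption about the semantics, not the package's code: the ROI names face_b and place_a, the category table, and the resolve helper are hypothetical, and the masks are tiny synthetic arrays.

```python
import numpy as np

n_voxels = 10


def make_mask(indices):
    """Build a 1-D boolean mask with True at the given voxel indices."""
    mask = np.zeros(n_voxels, dtype=bool)
    mask[indices] = True
    return mask


# Hypothetical ROI table; FFA1 and face_b overlap at voxel 4.
rois = {
    "FFA1": make_mask([2, 3, 4]),
    "face_b": make_mask([4, 5]),
    "place_a": make_mask([7, 8]),
}
categories = {"face": ["FFA1", "face_b"], "place": ["place_a"]}


def resolve(query):
    """Expand ROI names, category names, or "all" into one union mask."""
    queries = [query] if isinstance(query, str) else list(query)
    names = []
    for q in queries:
        if q == "all":
            names.extend(rois)
        elif q in categories:
            names.extend(categories[q])
        else:
            names.append(q)  # unknown names raise KeyError below
    return np.logical_or.reduce([rois[name] for name in names])


face = resolve("face")
mixed = resolve(["FFA1", "face"])
face_and_place = resolve(["face", "place"])
everything = resolve("all")
```

Because the result is a boolean union, a voxel shared by several ROIs contributes one column to the masked betas, not one per ROI.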
Multi-format ROI loading¶
Each ROI ships in three file types: a volumetric .nii.gz
mask, per-hemisphere .func.gii surface masks, and
per-hemisphere FreeSurfer .label files. get_roi_data
returns a nested dict keyed by ROI; format and hemi axes
prune the tree:
sub.get_roi_data("FFA1") # full nested dict (default = all)
# {
# "FFA1": {
# "volume": <1-D bool ndarray>,
# "gii": {
# "hemi-L": {"func.gii": <bool>, "label": <int idx>},
# "hemi-R": {...},
# },
# },
# }
sub.get_roi_data("FFA1", format="volume") # vol only
sub.get_roi_data("FFA1", format="gii", hemi="L") # left surface
sub.get_roi_data("FFA1", format="func.gii") # surface masks only
sub.get_roi_data("face", format="volume") # one entry per face ROI
format accepts "all", "volume" / "nii.gz"
(synonyms), "gii" (both surface types), "func.gii",
or "label". hemi accepts "L", "R", or
"all" (default).
Multi-session results¶
Pass a list to any session-keyed accessor and you get a
dict keyed by session ID, never a stacked array. Trial
counts can differ per session, so a regular ndarray would be
unsafe – you stack yourself only when you know shapes match.
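A small sketch of what this means in practice, using synthetic arrays in place of real get_betas output (the session names, trial counts, and voxel count are made up): stacking along a new axis fails when trial counts differ, while concatenating along the trial axis works as long as the voxel axis matches.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels = 5

# Stand-in for the dict returned by a multi-session call.
out = {
    "ses-01": rng.standard_normal((3, n_voxels)).astype(np.float32),
    "ses-02": rng.standard_normal((4, n_voxels)).astype(np.float32),
}

# np.stack needs identical shapes, so unequal trial counts raise ValueError.
try:
    np.stack(list(out.values()))
    stack_failed = False
except ValueError:
    stack_failed = True

# Concatenating along the trial axis only needs a shared voxel axis.
betas = np.concatenate(list(out.values()), axis=0)

# Keep a per-row session label so trial metadata can be re-aligned later.
session_labels = np.concatenate(
    [np.full(len(arr), ses) for ses, arr in out.items()]
)
```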
Multi-subject access¶
from laion_fmri.group import load_subjects
group = load_subjects(["sub-03", "sub-05"])
group.get_shared_betas(session="ses-01") # dict[sub, ndarray]
for sub_id, sub in group:
    ...
Brain-space mapping¶
sub.to_nifti(per_voxel_array, "/tmp/out.nii.gz")
sub.get_voxel_coordinates() # (n_voxels, 3)
For projecting subject-T1w-space values onto fsaverage or MNI templates, see Template space.
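Conceptually, mapping a per-voxel array back into brain space means scattering its values into a 3-D grid at the mask's True positions. The following is a sketch under stated assumptions: the helper names are hypothetical, the flat voxel order is assumed to follow NumPy's C-order traversal of the mask, and writing the actual NIfTI file (affine, header) is left out entirely.

```python
import numpy as np


def values_to_volume(values, mask_3d, fill=0.0):
    """Scatter a (n_voxels,) array into a 3-D volume at the mask positions."""
    volume = np.full(mask_3d.shape, fill, dtype=np.float32)
    volume[mask_3d] = values
    return volume


def mask_coordinates(mask_3d):
    """Return (n_voxels, 3) integer grid indices, in C order."""
    return np.argwhere(mask_3d)


# Tiny synthetic mask with four in-brain voxels.
mask = np.zeros((3, 3, 3), dtype=bool)
mask[1, 1, :] = True
mask[2, 0, 0] = True

values = np.arange(1, 5, dtype=np.float32)  # one value per masked voxel
volume = values_to_volume(values, mask)
coords = mask_coordinates(mask)

# Reading the volume back at the coordinates recovers the input order.
recovered = volume[tuple(coords.T)]
```

The round trip only holds when the same mask (and therefore the same mask_source) is used for both the data and the mapping.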
Anatomical derivatives¶
Per-subject T1w / T2w volumes and a dedicated brain mask are
published under derivatives/anatomical/sub-XX/ses-PrismaAnat/
anat/. Each modality ships at full resolution and at
res-1pt8 (the functional grid). Pull them with
download(include_anatomical=True) and access via:
sub.has_anatomical() # bool: are the files local?
sub.get_anatomical_dir() # Path to the subject's anat tree
sub.get_t1w() # full-resolution T1w path
sub.get_t1w(res="1pt8") # functional-grid T1w path
sub.get_t2w() # full-resolution T2w path
sub.get_anatomical_brain_mask(res="1pt8") # anat brain mask path
Each accessor returns a pathlib.Path; load with
nibabel.load(...) when you need the voxel data. The
res-1pt8 mask is the same file source="anatomical"
reads in get_brain_mask() and the voxel-axis accessors,
so the voxel axis stays aligned.
PyTorch integration¶
ds = sub.to_torch_dataset(session="ses-01", roi="visual")
item = ds[0] # dict with betas, image, ...
Memory & shape considerations¶
Every accessor returns a fresh ndarray; nothing is cached. You control how much data you pull into RAM. A few rules of thumb:
One whole-brain session of betas is n_trials × n_voxels float32. With ~1000 trials and ~270k brain-mask voxels, that’s roughly 1 GB per call. Doable for one session on a laptop; multiplying by 30+ sessions per subject quickly reaches many tens of GB. Always pass an roi= filter when you can – it cuts memory by 1-2 orders of magnitude.
ROI filters cut memory dramatically. roi="visual" typically reduces voxel count by an order of magnitude; combining with nc_threshold reduces it further.
Avoid loading whole sessions if you only need a slice. Build a mask= array yourself, or chain roi + nc_threshold, before calling get_betas.
Multi-session results are dicts, not stacked arrays. Trial counts vary per session, so passing a list to get_betas returns a dict[ses, ndarray]. All sessions share the same brain mask within a subject, so the voxel axis matches and np.concatenate(list(out.values()), axis=0) is the right stack – you just have to align trial-level metadata yourself when you do.
PyTorch users: to_torch_dataset(...) exposes the same
accessors lazily per __getitem__ call, so total RAM stays
proportional to batch size rather than the dataset.
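The rules of thumb above follow from simple arithmetic, sketched here with a hypothetical helper. The trial and voxel counts are the approximate figures quoted in the text, and the 10x ROI reduction is an illustrative assumption.

```python
def betas_bytes(n_trials, n_voxels, bytes_per_value=4):
    """Size in bytes of an (n_trials, n_voxels) float32 array."""
    return n_trials * n_voxels * bytes_per_value


# One whole-brain session: ~1000 trials x ~270k voxels.
whole_brain = betas_bytes(1000, 270_000)
whole_brain_gb = whole_brain / 1e9

# Thirty such sessions held in memory at once.
thirty_sessions_gb = 30 * whole_brain / 1e9

# An ROI keeping one tenth of the voxels (illustrative reduction).
roi_only = betas_bytes(1000, 27_000)
roi_only_mb = roi_only / 1e6
```

With these numbers a single session is about 1.08 GB, thirty sessions about 32 GB, and the ROI-restricted session about 108 MB.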
Per-trial stimulus access¶
For most analyses you don’t want to talk to the stimulus set directly
– you want, for the trials this subject saw, the images,
embeddings, captions, or segmentation masks aligned to the betas. The
Subject exposes those four modalities as namespaces, each keyed by
global trial index (a row of Subject.metadata):
sub = load_subject("sub-03")
sub.metadata # ── the trial table ──
# One row per single-trial beta, concatenated across sessions.
# Columns: session, session_trial, image_name, stim_idx,
# unique_or_shared, dataset, (+ events.tsv extras).
# The row index is the "trial index" used everywhere below.
sub.images.get(42) # PIL.Image for trial 42
sub.images[42] # raw JPEG bytes
sub.images.array(session="ses-01") # (n, 1000, 1000, 3) uint8 stack
sub.images.all() # iterator, length n_trials
sub.embeddings.models # ['CLIP', 'DINOv2', ...]
sub.embeddings.get("CLIP", 42) # (D,) features for trial 42
sub.embeddings.all("CLIP") # (n_trials, D) — ready to regress
sub.embeddings.all("CLIP", session="ses-01")
sub.segmentations.nouns(42) # ['hand', 'piano', ...]
sub.segmentations.has_image(42) # False if unique-image trial
sub.segmentations.get(42, "hand") # (1000, 1000) uint8 mask
sub.captions.list(42) # all captions for trial 42's image
The bulk methods (.all(...) / .array(...)) preserve trial order,
so the rows of sub.embeddings.all("CLIP") line up one-to-one with
the rows of sub.get_betas(session=None) concatenated across
sessions. That’s the regression workflow in one expression.
Single concrete example — fit CLIP features to betas:
import numpy as np
from laion_fmri.subject import load_subject
sub = load_subject("sub-03")
# Per-session z-score, then concatenate, for fair scale
# (a voxel with zero variance in a session would yield NaN here):
beta_chunks = []
for ses in sub.get_sessions():
    b = sub.get_betas(session=ses, roi="visual")
    beta_chunks.append((b - b.mean(0)) / b.std(0))
y = np.concatenate(beta_chunks, axis=0) # (n_trials, n_voxels)
# Trial-aligned CLIP features (same row order as y):
X = sub.embeddings.all("CLIP") # (n_trials, 1024)
# Standard regression from here...
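The example above stops at "standard regression" without naming a method. One common choice is ridge regression; the sketch below fits it in closed form on synthetic X and y so it runs without any data on disk. The choice of ridge, the fit_ridge helper, the alpha value, and the array sizes are assumptions, not something the package prescribes.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge: W = (X^T X + alpha * I)^-1 X^T y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ y)


# Synthetic stand-ins: X plays the role of embeddings, y of z-scored betas.
rng = np.random.default_rng(0)
n_trials, n_features, n_voxels = 200, 8, 3
X = rng.standard_normal((n_trials, n_features))
W_true = rng.standard_normal((n_features, n_voxels))
y = X @ W_true + 0.01 * rng.standard_normal((n_trials, n_voxels))

W_hat = fit_ridge(X, y, alpha=1e-3)
max_err = float(np.abs(W_hat - W_true).max())

# Predictions for new trials are a single matrix product.
y_pred = X @ W_hat
```

With real embeddings the feature count (1024 for CLIP here) is comparable to the trial count of a single session, so the regularization strength would normally be chosen by cross-validation rather than fixed.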
Another typical pattern — pull masks only for shared-image trials, since masks ship only for the shared set:
trials = sub.metadata
shared_trials = trials.index[trials["unique_or_shared"] == "shared"]
for trial in shared_trials:
    nouns = sub.segmentations.nouns(trial)
    if "face" in nouns:
        face_mask = sub.segmentations.get(trial, "face")
        ... # do something with this trial
The sub.segmentations API is safe to call on every trial:
has_image(trial) returns False and nouns(trial) returns
[] for unique-image trials, so loops across the full trial table
need no special-casing.
PyTorch users get the same per-trial access lazily via
to_torch_dataset().
Stimuli: images, embeddings, segmentations (dataset-wide)¶
The same modalities are also reachable through a dataset-wide hub,
load_stimuli(), keyed by image_name rather than trial index.
Use this when you want the full stimulus set independent of any
subject’s trial ordering – e.g. computing similarity matrices on all
1,492 shared images, or pulling embeddings for arbitrary names.
import laion_fmri
stim = laion_fmri.load_stimuli()
stim.metadata.head() # pandas DataFrame, 25052 rows
# ── Stimulus images ─────────────────────────────────────────
img = stim.images.get("shared_12rep_LAION_cluster_1003_i0.jpg") # PIL
jpeg = stim.images["shared_12rep_LAION_cluster_1003_i0.jpg"] # raw bytes
stim.images.names()[:3]
for name, raw in stim.images:
    ...
# ── Embeddings (CLIP, DINOv2, PEcore, SigLIP2) ──────────────
stim.embeddings.models # ['CLIP', 'DINOv2', 'PEcore', 'SigLIP2']
stim.embeddings["CLIP"].shape # (25052, 1024) float16
stim.embeddings.get(
    "CLIP", "shared_12rep_LAION_cluster_1003_i0.jpg",
) # (1024,) vector
# ── Object-segmentation masks (shared images only) ──────────
stim.segmentations.nouns(
    "shared_12rep_LAION_cluster_1003_i0.jpg",
) # ['fingers', 'hand', ...]
stim.segmentations.get(
    "shared_12rep_LAION_cluster_1003_i0.jpg", "fingers",
) # (1000, 1000) uint8
# ── Captions (human for all images; AI for shared non-OOD) ────────
stim.captions.human(
    "shared_12rep_LAION_cluster_1003_i0.jpg",
) # list[str], length 5 for shared
stim.captions.ai(
    "shared_12rep_LAION_cluster_1003_i0.jpg",
) # str or None
stim.captions.get(
    "shared_12rep_LAION_cluster_1003_i0.jpg",
) # pandas DataFrame
Files are opened lazily: touching stim.embeddings does not open
the embedding HDF5 until you actually look up a vector. The
modality-specific HDF5 files are independent downloads:
# Stimulus images (gated; Data Use Agreement on first call):
laion_fmri.download_stimuli()
# Embeddings (public, CC0):
laion_fmri.download_embeddings()
laion_fmri.download_embeddings(models=["CLIP", "DINOv2"])
# Object segmentations (public, CC0, ~68 MB):
laion_fmri.download_segmentations()
# Captions (public, CC0):
laion_fmri.download_captions()
CLI equivalents: laion-fmri download-stimuli,
laion-fmri download-embeddings, laion-fmri download-segmentations,
laion-fmri download-captions.
Note
Segmentations cover the shared stimulus set only (1,492 images
viewed by every subject); subject-unique images carry no masks.
The listing methods (nouns, for_image, has_image)
return empty results – not errors – for uncovered images, so
loops across all trials need no special-casing.
Captions cover every stimulus image with human captions: shared
images have five and unique images have three. AI captions are
provided for shared non-OOD images only, so
stim.captions.ai(name) and sub.captions.ai(trial) return
None for unique-image and OOD trials.
See Stimulus Set for the per-model embedding details (feature dimensions, normalisation, exact model identifiers) and a deeper tour of the segmentation file layout.
Common workflow: per-session z-scoring + train/test split¶
A frequent recipe is to load every session for one subject, z-score betas within each session, then split shared vs. unique stimuli into test vs. train. This composes from the existing accessors:
import numpy as np
from laion_fmri.subject import load_subject
sub = load_subject("sub-03")
train_chunks, test_chunks = [], []
for ses in sub.get_sessions():
    betas = sub.get_betas(session=ses) # (n_trials, n_voxels)
    z = (betas - betas.mean(0)) / betas.std(0) # within-session z-score
    trials = sub.get_trial_info(session=ses)
    is_shared = trials["label"].str.startswith("shared_").to_numpy()
    train_chunks.append(z[~is_shared])
    test_chunks.append(z[is_shared])
X_train = np.concatenate(train_chunks, axis=0)
X_test = np.concatenate(test_chunks, axis=0)
The stimuli="shared" / "unique" filter on
get_betas does the same trial selection in one call if you
prefer it without the manual label parse:
train = sub.get_betas(session=ses, stimuli="unique")
test = sub.get_betas(session=ses, stimuli="shared")
For ROI-restricted variants, add roi="visual" (or any
mask= / nc_threshold= filter) to the same calls –
voxel selection composes naturally and applies before the
z-score.
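One practical wrinkle with the within-session z-score is that a voxel whose betas are constant in a session has zero standard deviation, and dividing by it produces NaN. The helper below is a hedged sketch of one way to guard against that; the name zscore_trials, the eps tolerance, and the choice to map constant voxels to zero are assumptions, not package behavior.

```python
import numpy as np


def zscore_trials(betas, eps=1e-6):
    """Z-score each voxel across trials; constant voxels become all zeros."""
    mean = betas.mean(axis=0, keepdims=True)
    std = betas.std(axis=0, keepdims=True)
    # Replace (near-)zero standard deviations by 1 to avoid division by zero.
    safe_std = np.where(std < eps, 1.0, std)
    return (betas - mean) / safe_std


rng = np.random.default_rng(0)
betas = rng.standard_normal((50, 4)).astype(np.float32)
betas[:, 2] = 3.0  # a voxel with no variance in this session

z = zscore_trials(betas)
all_finite = bool(np.isfinite(z).all())
```

Applying the voxel filters before the z-score, as described above, also removes many such degenerate voxels up front, especially with an nc_threshold.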
Bundled train / test splits (re:vision Method 1 / 2 / 3)¶
The laion_fmri.splits subpackage layers on top of the
accessors above without changing them. Splits are label matches
against the label column of the trial table, so the same
get_betas() and
get_trial_info() calls feed
both train and test slices:
import numpy as np
import pandas as pd
from laion_fmri.subject import load_subject
from laion_fmri.splits import get_split_masks
sub = load_subject("sub-03")
sessions = sub.get_sessions()
# Multi-session calls return dicts; concatenate along the trial axis.
betas = np.concatenate(list(
    sub.get_betas(session=sessions, roi="visual").values()
), axis=0)
trials = pd.concat(list(
    sub.get_trial_info(session=sessions).values()
), ignore_index=True)
train, test = get_split_masks(trials, "tau", pool="shared")
X_train, X_test = betas[train], betas[test]
See Train / Test Splits for the full split catalogue
(random_*, cluster_k5_*, tau, ood) and per-method
worked examples.
Errors you may encounter¶
The package raises a small, named exception hierarchy from
laion_fmri._errors:
DataDirNotSetError
Raised by load_subject (and any accessor) when no data directory has been configured. Run laion_fmri.config.dataset_initialize(...) first.
DataNotDownloadedError (subclass of FileNotFoundError)
Raised when a subject’s directory exists but the requested file is missing on disk. Re-run download(...) with the right ses / desc / stat filters.
StimuliNotDownloadedError (subclass of FileNotFoundError)
Raised by stimulus-side accessors (Subject.metadata, Subject.images / embeddings / segmentations, Subject.to_torch_dataset, load_stimuli) when the stimuli directory has not been mirrored yet. Re-run download(..., include_stimuli=True).
SubjectNotFoundError (subclass of ValueError)
Raised by resolve_subject_id for malformed IDs (empty, bare "sub-", non-string).
LicenseNotAcceptedError (subclass of RuntimeError)
Raised by download / accept_licenses when the dataset license is declined.
Plain ValueError covers narrower mistakes – for example,
asking get_betas for both roi and mask at once,
passing an unknown ROI name, or specifying neither session
nor desc to get_noise_ceiling.
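Because several of these exceptions subclass built-ins, the order of except clauses matters: the specific classes have to come before FileNotFoundError or ValueError. The sketch below uses locally defined stand-in classes that only mirror the documented hierarchy (the real ones live in laion_fmri._errors and are not imported here); load_or_hint and the fake loaders are hypothetical.

```python
# Stand-ins mirroring the documented hierarchy, for illustration only.
class DataNotDownloadedError(FileNotFoundError):
    pass


class StimuliNotDownloadedError(FileNotFoundError):
    pass


def load_or_hint(loader):
    """Call loader(); return (value, None) or (None, hint) on known errors."""
    try:
        return loader(), None
    except StimuliNotDownloadedError:
        return None, "re-run download(..., include_stimuli=True)"
    except DataNotDownloadedError:
        return None, "re-run download(...) with matching filters"
    except FileNotFoundError:
        return None, "file missing for an unrelated reason"
    except ValueError as exc:
        return None, f"check arguments: {exc}"


def missing_betas():
    raise DataNotDownloadedError("betas file not on disk")


def missing_stimuli():
    raise StimuliNotDownloadedError("stimuli not mirrored")


def bad_arguments():
    raise ValueError("roi and mask are mutually exclusive")


ok_result = load_or_hint(lambda: 42)
betas_result = load_or_hint(missing_betas)
stimuli_result = load_or_hint(missing_stimuli)
args_result = load_or_hint(bad_arguments)
```

A bare except FileNotFoundError placed first would swallow both download-related errors and lose the distinction between missing fMRI data and missing stimuli.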
See Loading Data for a full tour.