Train / Test Splits¶
LAION-fMRI ships with predefined train/test splits so that encoding-model and representation-similarity replications can report comparable generalization scores. The splits implement all three methods of the re:vision generalization framework:
- **Method 1 — Independent within-distribution.** An 80/20 split that covers the image distribution as fully as possible while penalising train/test pairs that are too close in feature space. Use the `tau` split.
- **Method 2 — Out-of-distribution clusters.** Five-fold CLIP-feature cluster holdout. Use the `cluster_k5_0` … `cluster_k5_4` splits and average results across folds.
- **Method 3 — Out-of-distribution images.** Train on the pool's regular images, test on the 371 held-out OOD stimuli. Use the `ood` split. Optionally restrict the test set to specific OOD categories with the `ood_types=` keyword on `get_train_test_ids()` / `get_split_masks()`.

Random baselines (`random_0` … `random_4`) are also bundled — comparing them against `tau` quantifies how much of a model's score depends on train/test similarity.
*Figure: Diagnostics for the bundled `tau` split, labelled here as the balanced (MMD-matched) split, in two pools: shared LAION images (left) and the full sub-01 pool (right). Columns show CLIP, DreamSim, and DINOv2 feature spaces. The top row shows t-SNE coverage of the balanced test images (green) against the train pool (grey). The middle row compares 1-nearest-neighbour distance distributions for the balanced test split and random 80/20 baselines, with dashed lines marking medians. The bottom row plots each candidate split by median 1-NN distance and train/test MMD² (log scale): grey points are random 80/20 baselines, purple points are `cluster_k5_*` folds, the dotted line is the random-pool MMD² reference, and the green star marks the selected balanced split.*
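The train/test MMD² used in these diagnostics can be sketched with a standard RBF-kernel estimate (an illustrative implementation on synthetic data, not necessarily the one used to build the splits):

```python
import numpy as np

def mmd2(x, y, gamma=1.0):
    """Biased (V-statistic) MMD^2 estimate with an RBF kernel."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
train = rng.normal(size=(200, 4))
matched = rng.normal(size=(50, 4))        # same population as train
shifted = rng.normal(size=(50, 4)) + 1.5  # population-level mismatch

# A population-matched test set has a much lower MMD^2 against the
# train set than a shifted one.
print(mmd2(train, matched), mmd2(train, shifted))
```

A low MMD² says the test set matches the train set at the population level, which is exactly the constraint the balanced split satisfies while still pushing individual test images away from their nearest training neighbours.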
Pools¶
Every split is bundled for six pools. Pick the one whose stimulus subset matches the original study — see `list_pools()`.

- `"shared"` — the 1,121 cross-subject shared images (non-OOD subset of the shared block). Use this when the original study used only stimuli that every subject saw (e.g. NSD- or Conwell-style benchmarks).
- `"sub-01"`, `"sub-03"`, `"sub-05"`, `"sub-06"`, `"sub-07"` — the per-subject pools (1,121 shared + 4,712 subject-unique images = 5,833 images each). Use a subject's pool when the original analysis was per-subject and used that subject's full stimulus set.
Split names¶
Twelve names exist in every pool — see `list_splits()`.

| name | re:vision method | notes |
|---|---|---|
| `tau` | Method 1 | The MMD-matched 80/20 nearest-neighbour-distance split. Train and test match at the population level (low MMD), but each test image is kept maximally far from its nearest training neighbour in feature space. Single fixed split per pool. |
| `cluster_k5_0` … `cluster_k5_4` | Method 2 | Each split holds out one of five CLIP-feature k-means clusters as the test fold. Test sizes vary with cluster population; train + test always sum to the pool. |
| `ood` | Method 3 | Train = the pool's regular images, test = the 371 held-out OOD shared images. The test set is identical across pools; train varies. See Method 3 — OOD images below for the 9 OOD categories and how to filter to a subset of them. |
| `random_0` … `random_4` | baseline | Five seeded uniform-random 80/20 partitions. Use them as a baseline for any generalization metric. |
Split sizes:

- `random_*` and `tau` are fixed at 80/20 of the pool — that's 897 / 224 for the shared pool and 4,666 / 1,167 per subject.
- `cluster_k5_*` test sizes vary with cluster population; train + test always equals the pool size.
- `ood` test = 371 for every pool. Train = the pool itself (1,121 for `shared`, 5,833 for the subject pools).
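The quoted 80/20 sizes follow from rounding the 20% test fraction to the nearest image. A plain-arithmetic check (the rounding rule is inferred from the numbers above, not separately documented):

```python
# 80/20 sizes reproduced by rounding 20% of the pool to the nearest
# image (pure arithmetic; no data files needed).
sizes = {}
for pool_size in (1121, 5833):
    n_test = round(0.2 * pool_size)
    sizes[pool_size] = (pool_size - n_test, n_test)

print(sizes)  # {1121: (897, 224), 5833: (4666, 1167)}
```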
Loading splits¶
The simplest path is `get_train_test_ids()`, which returns the train and test image-id lists. Match those against the `label` column of the `get_trial_info()` output, then apply the mask to your betas:
```python
import numpy as np
import pandas as pd
from laion_fmri.subject import load_subject
from laion_fmri.splits import get_train_test_ids

sub = load_subject("sub-01")

# Load betas and trial info as usual (see laion_fmri_package/load).
sessions = sub.get_sessions()
betas_per_ses = sub.get_betas(session=sessions, roi="hlvis")
trials_per_ses = sub.get_trial_info(session=sessions)

# Concatenate across sessions (standard idiom).
betas = np.concatenate(list(betas_per_ses.values()), axis=0)
trials = pd.concat(list(trials_per_ses.values()), ignore_index=True)

# Method 1: within-distribution generalization on the shared pool.
train_ids, test_ids = get_train_test_ids("tau", pool="shared")
train_mask = trials["label"].isin(train_ids).to_numpy()
test_mask = trials["label"].isin(test_ids).to_numpy()
X_train, X_test = betas[train_mask], betas[test_mask]
```
Matching is by string equality: `train_ids` / `test_ids` contain the same strings as the `label` column entries (e.g. `"shared_12rep_LAION_cluster_1003_i0.jpg"`), so no normalisation is needed.
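Two cheap sanity checks are worth running after building masks this way; a toy sketch with made-up labels (real entries are full filenames as above):

```python
# One row per trial; repeated labels are normal, since images are
# shown on multiple trials.
labels = ["a.jpg", "b.jpg", "a.jpg", "c.jpg", "b.jpg"]
train_ids, test_ids = {"a.jpg", "b.jpg"}, {"c.jpg"}

train_mask = [lab in train_ids for lab in labels]
test_mask = [lab in test_ids for lab in labels]

# No trial may land on both sides of the split.
assert not any(tr and te for tr, te in zip(train_mask, test_mask))
# For splits that partition the whole pool (tau, random_*, cluster_k5_*),
# every trial should be covered by exactly one side.
assert all(tr != te for tr, te in zip(train_mask, test_mask))
```

The coverage check does not apply to `ood`, whose test images live outside the regular pool.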
For convenience, `get_split_masks()` collapses those last three lines:

```python
from laion_fmri.splits import get_split_masks

train_mask, test_mask = get_split_masks(trials, "tau", pool="shared")
X_train, X_test = betas[train_mask], betas[test_mask]
```
Method 2 — five-fold cluster average¶
```python
import numpy as np
from laion_fmri.splits import get_split_masks

# fit_encoding() / score() stand in for your own fitting and evaluation
# routines; trials, betas, features are loaded as above.
scores = []
for k in range(5):
    train_mask, test_mask = get_split_masks(
        trials, f"cluster_k5_{k}", pool="shared",
    )
    fit_encoding(features[train_mask], betas[train_mask])
    scores.append(score(features[test_mask], betas[test_mask]))
m2 = float(np.mean(scores))
```
Method 3 — OOD images¶
The `ood` split holds out 371 OOD stimuli as a test set, with the train side being the pool's regular images. Every subject saw all 371 OOD images, so `test_ids` is the same set regardless of pool; what changes is what counts as "regular" training data.
```python
from laion_fmri.splits import get_split_masks

# Method 3: train on shared, test on OOD-shared.
# fit_encoding() / score() stand in for your own routines, as above.
train_mask, test_mask = get_split_masks(trials, "ood", pool="shared")
fit_encoding(features[train_mask], betas[train_mask])
m3 = score(features[test_mask], betas[test_mask])
```
OOD categories¶
The 371 OOD images span 9 categories, discoverable at runtime via `list_ood_types()`:
| type | n | what's in it |
|---|---|---|
| … | 82 | abstract shape, colour and digit stimuli |
| … | 72 | unusual spatial relations between objects |
| … | 64 | natural scenes with unusual content |
| … | 60 | zoomed-in / cropped object close-ups |
| … | 30 | custom-shot stimuli |
| … | 24 | highly saturated / patterned scenes |
| … | 21 | textbook visual illusions |
| … | 10 | Gabor patches |
| … | 8 | natural-image illusions |
Pass `ood_types=` to `get_train_test_ids()` or `get_split_masks()` to restrict the test set to specific categories — useful when only some OOD types make sense for the original study (e.g. an object-recognition finding probably shouldn't be evaluated on Gabor patches):
```python
from laion_fmri.splits import get_split_masks, list_ood_types

# Just shape / unusual / cropped (positively biased toward objects).
train_mask, test_mask = get_split_masks(
    trials, "ood", pool="shared",
    ood_types=["shape", "unusual", "cropped"],
)

# Or one type at a time, to compare per-category generalization
# (assumes the encoding model was already fit on the train side):
per_type_scores = {}
for t in list_ood_types():
    _, m = get_split_masks(trials, "ood", pool="shared", ood_types=[t])
    per_type_scores[t] = score(features[m], betas[m])
```
Inspection¶
For the full `Split` object — including splitter / params / variant metadata — use `load_split()`:
```python
>>> from laion_fmri.splits import list_pools, list_splits, load_split
>>> list_pools()
['shared', 'sub-01', 'sub-03', 'sub-05', 'sub-06', 'sub-07']
>>> list_splits()
['cluster_k5_0', ..., 'random_0', ..., 'ood', 'tau']
>>> sp = load_split("tau", pool="shared")
>>> sp.n_train, sp.n_test, sp.split_family
(897, 224, 'tau')
>>> sp.params
{'method': 'min_nn_filter + stochastic_mmd_swap', ...}
```
Split file schema¶
The bundled JSONs live under `laion_fmri/splits/data/{pool}/` and are accessed transparently by the loaders above. Schema, for reference:
```json
{
  "name": "random_0",
  "pool": "sub-01 full pool (unique + LAION non-OOD shared)",
  "splitter": "min_nn_stochastic",
  "params": { "method": "uniform_random", "seed": 0 },
  "n_train": 4666,
  "n_test": 1167,
  "variants": [
    {
      "variant_id": 0,
      "train_ids": ["shared_12rep_LAION_cluster_1003_i0.jpg", "..."],
      "test_ids": ["unique_LAION_initial_cluster_2817_i38_p01.jpg", "..."]
    }
  ]
}
```
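Since the format is plain JSON, a split file can also be inspected with the standard library; a sketch using an inline stand-in (with shortened ids) for one of the bundled files:

```python
import json

# Inline stand-in for a file under laion_fmri/splits/data/{pool}/
# following the schema above; real ids are full filenames.
raw = """
{
  "name": "random_0",
  "splitter": "min_nn_stochastic",
  "params": {"method": "uniform_random", "seed": 0},
  "n_train": 2,
  "n_test": 1,
  "variants": [
    {"variant_id": 0,
     "train_ids": ["a.jpg", "b.jpg"],
     "test_ids": ["c.jpg"]}
  ]
}
"""
split = json.loads(raw)
variant = split["variants"][0]

# The id lists should always agree with the declared counts.
assert len(variant["train_ids"]) == split["n_train"]
assert len(variant["test_ids"]) == split["n_test"]
```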
See also¶
Stimulus Data — how stimulus filenames map to dataset labels.
GLMsingle Beta Estimates — per-trial beta estimates the splits slice into.