{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# Querying the Dataset\n\nDiscover what is in the dataset without downloading anything.\n\nEvery cell in this example either queries the S3 bucket directly\n(``laion_fmri.discovery``) or reads bundled metadata that ships with\nthe package (``laion_fmri.splits``). No subject data is fetched.\nWhere a query needs locally downloaded files, the corresponding\n``download(...)`` and Subject-API calls are shown but **not\nexecuted**, so you can copy them without this notebook triggering a\ndownload.\n\nPick the subject you want to look at on the line below:\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "SUBJECT = \"sub-01\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Initialize a data directory\n\nDiscovery and split listings don't need data on disk, but\n``dataset_initialize`` is still required so that any subsequent\n(commented-out) ``download(...)`` call has a destination.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import os\n\nfrom laion_fmri.config import dataset_initialize\nfrom laion_fmri.discovery import (\n    describe,\n    get_rois,\n    get_subjects,\n    inspect_bucket,\n)\n\ndata_dir = os.path.join(os.getcwd(), \"laion_fmri_quickstart\")\nos.makedirs(data_dir, exist_ok=True)\ndataset_initialize(data_dir)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Top-level summary\n\n``describe()`` prints a one-screen overview: bucket name, subject\ncount, and the first subject's ROI list. Run it first to confirm\nthe bucket is reachable.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "describe()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Subjects in the bucket\n\n``get_subjects`` lists every subject the bucket exposes,\nincluding ones whose data is only partially uploaded -- so the\ncount matches the dataset's published size rather than just the\nnumber of subjects with complete data.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(f\"All subjects: {get_subjects()}\")\nprint(f\"Querying subject: {SUBJECT}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## ROI queries: specific / category / all\n\nROIs ship in eight categories on the bucket. Use the\n``category=`` filter when you want to scope a query to one\nfunctional family (e.g. just the face-area ROIs); call\n``get_rois`` without a filter when you want the full inventory.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "ROI_CATEGORIES = (\n    \"body\", \"character\", \"face\", \"laion\",\n    \"motion\", \"object\", \"place\", \"retinotopy\",\n)\n\nprint(f\"All ROIs ({len(get_rois(SUBJECT))}):\")\nprint(get_rois(SUBJECT))\nprint()\nfor cat in ROI_CATEGORIES:\n    rois = get_rois(SUBJECT, category=cat)\n    print(f\"{cat}: {rois}\")"
      ]
    },
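    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "As a sanity check, you can ask whether the eight category lists\ntile the full inventory. The helper below is a sketch: it assumes\nthe categories are meant to be disjoint and to cover every ROI,\nwhich the dataset does not guarantee -- adapt the checks if an ROI\nmay legitimately appear in two categories.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from itertools import chain\n\n\ndef rois_tile_inventory(all_rois, rois_by_category):\n    \"\"\"True if the category lists exactly cover ``all_rois`` with no overlap.\"\"\"\n    combined = list(chain.from_iterable(rois_by_category.values()))\n    return (\n        len(combined) == len(set(combined))  # no ROI listed in two categories\n        and set(combined) == set(all_rois)   # nothing missing, nothing extra\n    )\n\n\n# e.g.:\n# rois_tile_inventory(\n#     get_rois(SUBJECT),\n#     {cat: get_rois(SUBJECT, category=cat) for cat in ROI_CATEGORIES},\n# )"
      ]
    },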
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Bucket diagnostic listing\n\n``inspect_bucket`` prints the immediate top-level prefixes plus a\ncount of subject directories under each derivative tree -- useful\nwhen discovery returns surprises.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "inspect_bucket()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Bundled train/test splits (no download required)\n\n``laion_fmri.splits`` ships predefined train/test partitions of\nthe stimulus set so callers can compare against the published\nbaselines without re-running any clustering or sampling.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from laion_fmri.splits import (\n    get_train_test_ids,\n    list_ood_types,\n    list_pools,\n    list_splits,\n    load_split,\n)\n\nprint(f\"Pools:     {list_pools()}\")\nprint(f\"Splits:    {list_splits()}\")\nprint(f\"OOD types: {list_ood_types()}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Inspect one split\n\n``load_split(name, pool=...)`` returns a ``Split`` describing the\nsplit's sizes and family. ``get_train_test_ids`` is the\nconvenience wrapper that gives you the actual ID lists in one\ncall.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "split = load_split(\"random_0\", pool=\"shared\")\nprint(f\"Split:    {split.name}\")\nprint(f\"Pool:     {split.pool}\")\nprint(f\"Family:   {split.split_family}\")\nprint(f\"n_train:  {split.n_train}\")\nprint(f\"n_test:   {split.n_test}\")\n\ntrain_ids, test_ids = get_train_test_ids(\"random_0\", pool=\"shared\")\nprint(f\"Loaded:   {len(train_ids)} train / {len(test_ids)} test ids\")"
      ]
    },
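    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "A quick consistency check: the ID lists returned by\n``get_train_test_ids`` should be disjoint and match the sizes the\n``Split`` object declares. The helper below is a sketch that relies\nonly on the ``n_train``/``n_test`` attributes used above, shown here\non a toy stand-in so it runs without the real split.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from types import SimpleNamespace\n\n\ndef split_is_consistent(split, train_ids, test_ids):\n    \"\"\"True if the ID lists are disjoint and match the split's declared sizes.\"\"\"\n    return (\n        len(train_ids) == split.n_train\n        and len(test_ids) == split.n_test\n        and not set(train_ids) & set(test_ids)  # no stimulus in both halves\n    )\n\n\ntoy = SimpleNamespace(n_train=2, n_test=1)\nprint(split_is_consistent(toy, [\"a\", \"b\"], [\"c\"]))  # True\n\n# real usage: split_is_consistent(split, train_ids, test_ids)"
      ]
    },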
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## OOD splits with a type filter\n\nThe ``ood`` split partitions held-out stimuli by category; the\n``ood_types=`` argument restricts which categories are kept in the\ntest set.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "_, test_shape = get_train_test_ids(\n    \"ood\", pool=\"shared\", ood_types=[\"shape\"],\n)\nprint(f\"OOD shape only:  test ids = {len(test_shape)}\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Per-subject queries that need local data\n\nDiscovery covers what is *available*; once you commit to working\nwith a specific subject, reach for the per-subject API on the\n``Subject`` class in ``laion_fmri.subject``. Those methods read\non-disk files, so they presuppose a download. The block below is\nshown but **not executed**, to keep this example offline -- copy\nthe lines you need into your own script after running\n``download(...)`` for the subject and session you care about.\n\n```python\nfrom laion_fmri.download import download\n# one session for one subject (~few hundred MB):\ndownload(subject=\"sub-01\", ses=\"ses-01\")\n\nfrom laion_fmri.subject import load_subject\nsub = load_subject(\"sub-01\")\n\n# Sessions present on disk\nprint(sub.get_sessions())                      # ['ses-01', ...]\n\n# Trial info: runs, repetitions, stimulus labels\ntrials = sub.get_trial_info(session=\"ses-01\")\n# columns: session, run, beta_index, label\nprint(trials.columns.tolist())\nprint(trials[\"run\"].unique())                  # runs in this session\nprint(len(trials))                             # trial count\n\n# Single-trial betas with the multi-level ROI grammar\nbetas_one  = sub.get_betas(session=\"ses-01\", roi=\"FFA1\")\nbetas_face = sub.get_betas(session=\"ses-01\", roi=\"face\")\nbetas_all  = sub.get_betas(session=\"ses-01\", roi=\"all\")\n\n# Multi-format ROI loading\nroi = sub.get_roi_data(\"FFA1\", format=\"all\", hemi=\"all\")\n# roi[\"FFA1\"] is a nested dict:\n# {\n#   \"volume\": <1-D bool>,\n#   \"gii\": {\"hemi-L\": {\"func.gii\": ..., \"label\": ...},\n#           \"hemi-R\": {...}},\n# }\n```\n"
      ]
    },
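    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "If you do load ROI data, the nested dict sketched in the comment\nabove can be flattened for iteration. The helper below is a sketch\nagainst that structure (a ``volume`` mask plus per-hemisphere\n``gii`` entries); it is pure Python, so it is safe to define and\ntest without any download.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def iter_roi_arrays(roi_entry):\n    \"\"\"Yield (path, value) pairs from the nested ROI dict sketched above.\"\"\"\n    if \"volume\" in roi_entry:\n        yield \"volume\", roi_entry[\"volume\"]\n    for hemi, files in roi_entry.get(\"gii\", {}).items():\n        for fname, value in files.items():\n            yield f\"gii/{hemi}/{fname}\", value\n\n\n# e.g., after the download sketched above:\n# roi = sub.get_roi_data(\"FFA1\", format=\"all\", hemi=\"all\")\n# for path, value in iter_roi_arrays(roi[\"FFA1\"]):\n#     print(path)"
      ]
    },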
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Cross-subject discovery\n\nLoop over ``get_subjects()`` to ask the same questions of every\nsubject in the bucket. ROI counts can differ across subjects (some\nROIs don't exist for everyone).\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "for sub_id in get_subjects():\n    n_face = len(get_rois(sub_id, category=\"face\"))\n    n_total = len(get_rois(sub_id))\n    print(f\"  {sub_id}: {n_total:>3} ROIs total, {n_face} face\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Stimulus metadata (forward-compat)\n\nThe ``stimuli/`` prefix is reserved for the stimulus images and\ntheir metadata table; it isn't populated yet. Once it is, the\ncalls below will print the catalogue (commented out for the same\noffline-by-default reason as the Subject queries above):\n\n```python\n# download(subject=\"sub-01\", include_stimuli=True)\n# sub = load_subject(\"sub-01\")\n# if sub.has_stimuli():\n#     stim = sub.get_stimulus_metadata()\n#     print(stim.head())\n#     print(f\"Total stimuli: {len(stim)}\")\n#     print(f\"Shared:        {stim['shared'].sum()}\")\n#     print(f\"Categories:    {stim['category'].value_counts()}\")\n```\n"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.13"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}