Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors

Authors: Dr. Paul Scotti (Princeton Neuroscience Institute, MedARC), Dr. Tanishq Mathew Abraham (Stability.AI, MedARC)

Example images reconstructed from human brain activity.

Introduction

MedARC, in collaboration with researchers at Princeton Neuroscience Institute, Ecole Normale Supérieure, PSL University, University of Toronto, and the Hebrew University of Jerusalem, along with EleutherAI and Stability AI, is proud to release its collaborative paper on MindEye. 

MindEye is a state-of-the-art approach for reconstructing and retrieving images from fMRI brain activity. Functional magnetic resonance imaging (fMRI) measures brain activity by detecting changes in oxygenated blood flow. It is used to study which parts of the brain handle different functions and to help evaluate treatments for brain disorders. MindEye was trained and evaluated on the Natural Scenes Dataset [1], an offline fMRI dataset collected from human participants who each agreed to spend up to 40 hours inside the MRI machine viewing a series of static images, each shown for a few seconds. 

This is the first preprint put out by MedARC since its public launch and is currently undergoing peer review. MedARC is a Discord-based research community supported by Stability AI that is building foundation generative AI models for medicine using a decentralized, collaborative, and open research approach. 

Method & Results

MindEye achieves state-of-the-art performance across both image retrieval and reconstruction. That is, given a sample of fMRI activity from a participant viewing an image, MindEye can either identify which image out of a pool of possible image candidates was the original seen image (retrieval), or it can recreate the image that was seen (reconstruction). 

To achieve the goals of retrieval and reconstruction with a single model trained end-to-end, we adopt a novel approach of using two parallel submodules that are specialized for retrieval (using contrastive learning) and reconstruction (using a diffusion prior). 

Each unique image in the dataset was viewed three times, for three seconds at a time. Corresponding fMRI activity (flattened spatial patterns across 1.8mm cubes of cortical tissue called “voxels”) was collected for each image presentation. fMRI activity across the three same-image viewings was averaged together and input to MindEye to retrieve and reconstruct the original image. 
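
As a minimal illustration of this preprocessing step (the array names and sizes below are hypothetical, assuming each trial has already been flattened into a voxel vector), the averaging amounts to a simple mean across repetitions:

```python
import numpy as np

# Hypothetical preprocessed data: one flattened voxel vector per trial.
# trials has shape (n_images, n_repetitions, n_voxels); three repetitions per image.
rng = np.random.default_rng(0)
n_images, n_reps, n_voxels = 10, 3, 15724   # sizes are illustrative only
trials = rng.standard_normal((n_images, n_reps, n_voxels)).astype(np.float32)

# Average the three same-image viewings to get one input vector per image.
voxels_per_image = trials.mean(axis=1)      # shape: (n_images, n_voxels)
print(voxels_per_image.shape)               # (10, 15724)
```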

MindEye overall schematic depicts the retrieval and reconstruction submodules alongside an independent low-level perceptual pipeline meant to enhance reconstruction fidelity.

Retrieval

For retrieval, MindEye finds the exact (top-1) match in a pool of test samples with >90% accuracy for both image and brain retrieval, outperforming previous work which showed <50% retrieval accuracy. This suggests that MindEye brain embeddings retain fine-grained image-specific signals.

MindEye image retrieval. Given a pool of candidate images, the nearest neighbor search in CLIP space enables searching for the original image based on brain activity.

We accomplish this feat through contrastive learning. First, fMRI activity from regions of the brain receptive to visual information is flattened and fed through a dense 940M-parameter multilayer perceptron (MLP). This outputs brain embeddings with the same dimensionality as the outputs of the last hidden layer of CLIP ViT-L/14 [2] (although any multimodal latent space could be used). These brain embeddings are fed through a lightweight MLP projector, and we then use a novel bidirectional implementation of mixup contrastive data augmentation with a CLIP-style loss to train the model to map brain embeddings into the same space as pre-trained, frozen CLIP image embeddings.
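
The sketch below illustrates the general shape of this pipeline in PyTorch. It is a simplification under our own assumptions: the layer widths, token counts, and plain symmetric CLIP-style loss are illustrative, and the bidirectional mixup (MixCo) augmentation described above is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BrainBackbone(nn.Module):
    """Maps flattened voxels to the shape of CLIP ViT-L/14's last hidden layer
    (257 tokens x 768 dims). Layer widths are illustrative, not the exact
    940M-parameter MindEye architecture."""
    def __init__(self, n_voxels, hidden=4096, n_tokens=257, clip_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_voxels, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, n_tokens * clip_dim),
        )
        self.n_tokens, self.clip_dim = n_tokens, clip_dim

    def forward(self, voxels):
        return self.mlp(voxels).view(-1, self.n_tokens, self.clip_dim)

class Projector(nn.Module):
    """Lightweight MLP projector used for the contrastive (retrieval) objective."""
    def __init__(self, clip_dim=768):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(clip_dim, clip_dim), nn.GELU(),
                                 nn.Linear(clip_dim, clip_dim))

    def forward(self, x):
        return self.net(x)

def clip_contrastive_loss(brain_emb, image_emb, temperature=0.05):
    """Symmetric InfoNCE-style loss between brain embeddings and frozen CLIP
    image embeddings; both are flattened and L2-normalized before the
    similarity matrix is computed."""
    b = F.normalize(brain_emb.flatten(1), dim=-1)
    i = F.normalize(image_emb.flatten(1), dim=-1)
    logits = b @ i.T / temperature
    targets = torch.arange(len(b), device=b.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```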

For inference, we simply compute the CLIP image embedding for every image in the candidate pool and select the one with the highest cosine similarity to the brain embedding output by MindEye. We found that this approach worked even when scaling up to the billions of image candidates contained in LAION-5B.
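
A minimal sketch of this nearest-neighbor search is below (tensor shapes and names are illustrative; at LAION-5B scale an approximate nearest-neighbor index would replace the dense similarity matrix):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve_top_k(brain_emb, candidate_image_embs, k=5):
    """Return indices of the k candidate images whose CLIP embeddings are most
    cosine-similar to a single brain embedding. Inputs are assumed to already
    be flattened to one vector per item."""
    b = F.normalize(brain_emb.flatten().unsqueeze(0), dim=-1)   # (1, d)
    c = F.normalize(candidate_image_embs.flatten(1), dim=-1)    # (n, d)
    sims = (b @ c.T).squeeze(0)                                 # (n,)
    return sims.topk(k).indices

# Toy usage with random tensors standing in for real embeddings:
brain_emb = torch.randn(768)
candidates = torch.randn(10_000, 768)
print(retrieve_top_k(brain_emb, candidates, k=5))
```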

Reconstruction

Side-by-side comparison of reconstructions from fMRI-to-Image NSD papers.

For reconstructions, we take the outputs from the dense MLP backbone mentioned above and feed them through a diffusion prior trained from scratch to better align the brain embeddings with CLIP image space. This is the same approach used by DALL-E 2, which aligns CLIP text embeddings to CLIP image space before feeding the aligned embeddings through another diffusion model to output images. As visualized by UMAP dimensionality reduction, the outputs of the MLP backbone are clearly disjoint from the CLIP image embeddings (left subplot below), but they become well-aligned after passing through the diffusion prior (right subplot).

UMAP plots depict CLIP image latents (blue), MindEye MLP backbone latents (orange), MindEye MLP projector latents (green), and MindEye diffusion prior latents (red).
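
The sketch below shows, in highly simplified form, how a diffusion prior of this kind can be trained: noise the target CLIP image embedding at a random timestep, then regress the clean embedding conditioned on the brain embedding. The toy network, linear noise schedule, and single-vector embeddings are our own simplifications, not the MindEye implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyPrior(nn.Module):
    """Toy stand-in for a diffusion prior: predicts the clean CLIP image
    embedding from a noised embedding, the timestep, and the brain embedding."""
    def __init__(self, dim=768, hidden=2048, n_steps=100):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, dim)
        self.net = nn.Sequential(
            nn.Linear(dim * 3, hidden), nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, noisy_clip, t, brain_emb):
        x = torch.cat([noisy_clip, self.t_embed(t), brain_emb], dim=-1)
        return self.net(x)

def prior_training_step(prior, brain_emb, clip_img_emb, n_steps=100):
    """One training step: noise the target CLIP image embedding at a random
    timestep and regress the clean embedding with a simple MSE objective."""
    t = torch.randint(0, n_steps, (brain_emb.shape[0],))
    alpha = 1.0 - t.float() / n_steps                  # toy linear schedule
    noise = torch.randn_like(clip_img_emb)
    noisy = alpha[:, None] * clip_img_emb + (1 - alpha[:, None]) * noise
    pred = prior(noisy, t, brain_emb)
    return F.mse_loss(pred, clip_img_emb)
```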

This alignment allows us to substitute MindEye brain latents for CLIP image latents. We can simply take any pre-trained generative model that accepts CLIP image latents as input and feed it a brain latent instead (no fine-tuning required!). This flexibility suggests that MindEye reconstructions will continue to improve as newer, more powerful image generation models are released.
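
Conceptually, the plug-and-play substitution looks like the sketch below, where `backbone`, `sample_from_prior`, and `generator` are hypothetical stand-ins for the trained MLP backbone, the diffusion prior's sampling loop, and any frozen CLIP-image-conditioned generative model:

```python
import torch

@torch.no_grad()
def reconstruct(voxels, backbone, sample_from_prior, generator):
    """Hypothetical end-to-end sketch: fMRI voxels -> MLP backbone -> diffusion
    prior sampling -> any frozen generator that accepts CLIP image latents."""
    brain_latent = backbone(voxels)            # backbone output, roughly in CLIP space
    aligned = sample_from_prior(brain_latent)  # prior sampling loop, abstracted away
    return generator(aligned)                  # frozen CLIP-image-conditioned generator
```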

Conclusion

Privacy Concerns & Societal Benefits

The ability to reconstruct perception from brain activity offers many societal benefits, as well as certain risks, including privacy concerns, as noted below: 

Benefits

  • Clinical applications could offer new diagnostic and assessment methods, as reconstructions are expected to be systematically distorted due to mental state or neurological conditions.

  • Current models trained on perception can potentially generalize to mental imagery, as similar patterns of brain activity are observed across perception and mental imagery of the same stimuli [3, 4].

  • Fine-grained visual communication facilitated by MindEye could potentially enhance communication with patients in a pseudocoma state beyond simple classification.

  • If adapted to real-time fMRI analysis [5] or non-fMRI neuroimaging modalities, MindEye could improve the performance of brain-computer interfaces.


Risks and Limitations

  • Each participant in the dataset spent up to 40 hours in the MRI machine to gather sufficient training data.

  • Models were trained separately for every participant and are not generalizable across people.

  • Image limitations: MindEye is limited to the kinds of natural scenes used for training the model. For other image distributions, additional data collection and specialized generative models would be needed.

  • Data protection: Ensuring the protection of sensitive brain data and transparency from data-collecting companies is essential. MindEye used the Natural Scenes Dataset where data collection was approved by the University of Minnesota institutional review board and where participants provided informed written consent to share their data.

  • Data contamination: Non-invasive neuroimaging methods like fMRI not only require participant compliance but also full concentration on following instructions during the lengthy scan process. Data become noisy or unusable if participants move their heads or fail to pay attention to the task.  

Such adaptation to real-time fMRI, in combination with training foundation neuroimaging models, is actively being developed at MedARC, and we invite interested readers to explore our ongoing projects and join us on Discord as volunteer contributors.

Open Research

MindEye was developed using a 100% transparent volunteer-driven open research approach. The source code was accessible via a public GitHub repository throughout the lifespan of the project. Research discussions were held via public Discord channels, and weekly video conference calls were recorded and shared publicly.

We want to establish an internationally diverse, volunteer-driven research team whose members bring a wide array of backgrounds and expertise. Fully transparent open-research initiatives such as this one, and others like EleutherAI, LAION, OpenBioML, and ML Collective, could redefine the traditional framework of scientific research, democratizing entry into machine learning and medical research by harnessing crowd-sourced collective intelligence and community collaboration.

Authors

MindEye was developed by co-first authors Dr. Paul Scotti (Princeton Neuroscience Institute & MedARC) and Atmadeep Banerjee (MedARC), with the support of joint senior authors Dr. Tanishq Abraham (MedARC CEO, Stability.AI) and Dr. Kenneth Norman (Princeton Neuroscience Institute). MindEye contributors also include Jimmie Goode (core contributor), Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan Dempster, Nathalie Verlinde, Elad Yundler, and David Weisberg. 

For more information, visit our project website or email Dr. Paul Scotti at scottibrain@gmail.com.

References

  1. Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, January 2022. ISSN 1097-6256, 1546-1726. doi: 10.1038/s41593-021-00962-x. URL https://www.nature.com/articles/s41593-021-00962-x.

  2. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In International Conference on Machine Learning (ICML), 2021. Model card: https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/model-card.md.

  3. Mark Stokes, Russell Thompson, Rhodri Cusack, and John Duncan. Top-Down Activation of Shape-Specific Population Codes in Visual Cortex during Mental Imagery. Journal of Neuroscience, 29(5):1565–1572, February 2009. ISSN 0270-6474, 1529-2401. doi: 10.1523/JNEUROSCI.4657-08.2009. URL https://www.jneurosci.org/content/29/5/1565.

  4. Leila Reddy, Naotsugu Tsuchiya, and Thomas Serre. Reading the mind’s eye: Decoding category information during mental imagery. NeuroImage, 50(2):818–825, April 2010. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2009.11.084. URL https://www.sciencedirect.com/science/article/pii/S1053811909012701.

  5. Grant Wallace, Stephen Polcyn, Paula P. Brooks, Anne C. Mennen, Ke Zhao, Paul S. Scotti, Sebastian Michelmann, Kai Li, Nicholas B. Turk-Browne, Jonathan D. Cohen, and Kenneth A. Norman. RTCloud: A cloud-based software framework to simplify and standardize real-time fMRI. NeuroImage, 257:119295, August 2022. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2022.119295. URL https://linkinghub.elsevier.com/retrieve/pii/S1053811922004141.
