MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

Key Takeaways

  • MedARC, in collaboration with Stability AI and various research institutions, has developed MindEye2, an advanced fMRI-to-image reconstruction model that can generate high-quality images from brain activity. 

  • MindEye2 employs a novel approach: pre-training on data from multiple subjects, followed by fine-tuning on a new participant using only 1 hour of that participant's training data.

  • The research highlights the potential of MindEye2 in clinical applications and brain-computer interfaces by enabling high-fidelity image reconstructions from minimal fMRI data, paving the way for new diagnostic and communication tools for patients with neurological conditions. Read our full research paper here.

MindEye2 reconstructs images seen in the MRI machine by mapping fMRI brain activity to CLIP space and leveraging shared-subject modeling to improve out-of-subject generalization.

Continuing our research at the intersection of neuroscience and machine learning, MedARC today releases a state-of-the-art fMRI-to-image reconstruction model that works well with just 1 hour of training data. This work was done in the MedARC community in collaboration with researchers at the Princeton Neuroscience Institute, University of Minnesota, University of Sydney, and University of Waterloo.

Creating generalizable brain models is challenging due to the diversity in human brain sizes, shapes, and the way they organize visual information. However, MindEye2 represents a significant advancement over previous methods [1,2,3,4] for reconstructing viewed images from fMRI brain activity. Unlike past approaches, MindEye2 models were trained on brain data from several individuals at once and can be fine-tuned to adapt to new participants with minimal training data.

The model was trained and evaluated on the Natural Scenes Dataset [5], an fMRI dataset in which 8 human participants each spent up to 40 hours viewing naturalistic images inside the MRI machine. Each unique image was viewed three times, for three seconds at a time. The corresponding fMRI activity (flattened spatial patterns across 1.8mm cubes of cortical tissue called “voxels”) was collected for each image presentation, averaged across presentations, and input to MindEye2 to retrieve and reconstruct images. Participants viewed completely different images inside the scanner, prohibiting the use of previous shared-subject alignment approaches that require shared seen images [6].
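As a concrete illustration of this preprocessing step, the sketch below (with hypothetical variable names and toy shapes, not the actual NSD pipeline) averages the repeated presentations of each image into a single voxel pattern:

```python
import numpy as np

# Illustrative sketch: average the up-to-three repeated presentations of each
# image before the voxel pattern is passed to the model.
def average_repetitions(betas, image_ids):
    """betas: (n_trials, n_voxels) single-trial fMRI responses.
    image_ids: (n_trials,) id of the image shown on each trial."""
    unique_ids = np.unique(image_ids)
    averaged = np.stack([betas[image_ids == uid].mean(axis=0) for uid in unique_ids])
    return unique_ids, averaged   # one averaged voxel pattern per unique image

# Toy usage: 6 trials (3 images x 2 repetitions), 50 voxels each.
betas = np.random.randn(6, 50)
image_ids = np.array([0, 1, 2, 0, 1, 2])
ids, avg = average_repetitions(betas, image_ids)
print(avg.shape)   # (3, 50)
```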

To overcome these barriers, we developed a novel functional alignment procedure to linearly map brain data to a shared-subject latent space, followed by a shared non-linear mapping to CLIP image space. We then map from CLIP space to pixel space by fine-tuning Stable Diffusion XL [7] to accept CLIP latents as inputs instead of text. 
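The sketch below illustrates this two-stage design in PyTorch: a subject-specific linear adapter into a shared latent space, followed by a shared non-linear backbone toward CLIP image space. The class name, layer sizes, and output dimensions are illustrative assumptions, not the exact MindEye2 architecture.

```python
import torch
import torch.nn as nn

class SharedSubjectEncoder(nn.Module):
    """Minimal sketch of shared-subject alignment. Each subject gets its own
    linear adapter into a shared latent space; everything after that is shared
    across subjects. Sizes are kept small to stay lightweight (the real model
    predicts 256 x 1664 OpenCLIP token embeddings)."""

    def __init__(self, voxel_counts, shared_dim=2048, clip_tokens=16, clip_dim=256):
        super().__init__()
        # Subject-specific linear maps (trained with ridge-style weight decay).
        self.adapters = nn.ModuleDict({
            subj: nn.Linear(n_voxels, shared_dim)
            for subj, n_voxels in voxel_counts.items()
        })
        # Subject-agnostic MLP backbone mapping the shared latent toward CLIP space.
        self.backbone = nn.Sequential(
            nn.Linear(shared_dim, shared_dim), nn.GELU(),
            nn.Linear(shared_dim, clip_tokens * clip_dim),
        )
        self.clip_tokens, self.clip_dim = clip_tokens, clip_dim

    def forward(self, voxels, subject):
        shared = self.adapters[subject](voxels)   # per-subject -> shared space
        tokens = self.backbone(shared)            # shared non-linear mapping
        return tokens.view(-1, self.clip_tokens, self.clip_dim)

# Usage with made-up voxel counts for two subjects:
model = SharedSubjectEncoder({"subj01": 15000, "subj02": 17000})
fake_voxels = torch.randn(4, 15000)               # batch of 4 fMRI samples
print(model(fake_voxels, "subj01").shape)         # torch.Size([4, 16, 256])
```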

MindEye2 overall schematic. The model is first pre-trained on 7 subjects, each contributing up to 40 hours of scanning data, and then fine-tuned on a held-out target subject who contributes only 1 hour of training data.

MindEye2 achieves state-of-the-art image retrieval, image reconstruction, and caption generation metrics across multiple subjects, and enables high-quality image generation with just 2.5% of the previously required data (i.e., 1 hour of training data instead of 40 hours). Given a sample of fMRI activity from a participant viewing an image, MindEye2 can identify which image out of a pool of candidates was the original seen image (retrieval), or it can recreate the image that was seen (reconstruction) along with its text caption. These results can be generated by fine-tuning the pre-trained model with just an hour of fMRI data from a new subject.
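Retrieval can be pictured as a nearest-neighbor search in CLIP space. The sketch below (a hypothetical function using cosine similarity over flattened embeddings) conveys the idea rather than the exact evaluation protocol used in the paper:

```python
import torch
import torch.nn.functional as F

def retrieve_image(pred_embedding, candidate_embeddings):
    """Return the index of the candidate whose CLIP embedding is most
    similar (cosine similarity) to the brain-predicted embedding."""
    pred = F.normalize(pred_embedding.flatten(), dim=0)
    cands = F.normalize(candidate_embeddings.flatten(start_dim=1), dim=1)
    return int((cands @ pred).argmax())

# Toy example: 300 candidate images with 16 x 256 token embeddings each.
candidates = torch.randn(300, 16, 256)
predicted = candidates[42] + 0.1 * torch.randn(16, 256)   # noisy "brain" prediction
print(retrieve_image(predicted, candidates))               # -> 42 (most likely)
```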

Reconstructions from different model approaches using 1 hour of scanning data from the Natural Scenes Dataset.

The qualitative comparison above shows that MindEye2 outperforms other methods in the 1-hour training data setting, and that pre-training on other subjects’ data is specifically what enables this performance. See our preprint for quantitative comparisons.

MindEye2 is first trained using data from 7 subjects in the Natural Scenes Dataset and then fine-tuned on a held-out target subject who may have scarce training data. The fMRI activity is initially mapped to a shared-subject latent space via subject-specific ridge regression. The rest of the model is subject-agnostic, consisting of an MLP backbone and a diffusion prior that output predicted CLIP [8,9] embeddings, which are then reconstructed into images using our Stable Diffusion XL unCLIP model. The diffusion prior plays an important role in connecting the fMRI-derived latents produced by the MLP backbone with the CLIP image embedding space, effectively bridging the gap between the two modalities. Rather than training the alignment to a common subject space separately, the entire system is trained end-to-end, incorporating brain data from all participants in each batch during the pre-training phase.
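A toy, self-contained version of this pre-training loop is sketched below. The simple linear modules, MSE objective, and random tensors stand in for the real backbone, diffusion prior, and contrastive/prior losses; the point is only how each optimization step mixes mini-batches from every subject, so the shared layers are trained on all brains at once while the per-subject adapters stay subject-specific:

```python
import torch
import torch.nn as nn

# Placeholder sizes and data; not the paper's exact objectives or dimensions.
voxel_counts = {"subj01": 1000, "subj02": 1200, "subj03": 900}
shared_dim, clip_dim = 256, 64

adapters = nn.ModuleDict({s: nn.Linear(n, shared_dim) for s, n in voxel_counts.items()})
backbone = nn.Sequential(nn.Linear(shared_dim, shared_dim), nn.GELU(),
                         nn.Linear(shared_dim, clip_dim))
params = list(adapters.parameters()) + list(backbone.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-4)
mse = nn.MSELoss()

for step in range(10):
    optimizer.zero_grad()
    loss = 0.0
    for subject, n_voxels in voxel_counts.items():    # one mini-batch per subject
        voxels = torch.randn(8, n_voxels)             # toy fMRI batch
        clip_target = torch.randn(8, clip_dim)        # toy CLIP image embeddings
        pred = backbone(adapters[subject](voxels))    # subject adapter + shared backbone
        loss = loss + mse(pred, clip_target)
    loss.backward()                                   # end-to-end update over all subjects
    optimizer.step()
```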

UnCLIP (or “image variations”) models have previously been used for the creative application of returning variations of a given reference image [10, 11, 12]. In contrast, our goal was to train a model that returns images as close as possible to the reference image across both low-level structure and high-level semantics. For this use case, we observed that existing unCLIP models do not accurately reconstruct images from their ground truth CLIP image embeddings (see the accompanying figure). We therefore fine-tuned our own unCLIP model from Stable Diffusion XL (using the 256 x 1664 dim. image embeddings from OpenCLIP ViT-bigG/14) to support this goal, leading to much higher-fidelity reconstructions from ground truth CLIP embeddings. For MindEye2, we can then use OpenCLIP embeddings predicted from the brain, instead of the ground truth embeddings, to reconstruct images. This change raises the performance ceiling for fMRI-to-image reconstruction, as it is no longer limited by the performance of a pre-trained frozen unCLIP model.
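For reference, the snippet below shows how image embeddings can be extracted from the same OpenCLIP ViT-bigG/14 backbone using the open_clip library. As noted in the comments, MindEye2 conditions on the richer 256 x 1664 token-level hidden states rather than the pooled output returned here; the snippet is only meant to illustrate the embedding pipeline.

```python
import torch
import open_clip
from PIL import Image

# Load OpenCLIP ViT-bigG/14, the CLIP backbone whose image embeddings our
# unCLIP model is conditioned on. Note: MindEye2 uses the 256 x 1664
# token-level hidden states, which require reading intermediate activations;
# encode_image() below returns only the pooled 1280-dim embedding.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)
model.eval()

image = preprocess(Image.new("RGB", (256, 256))).unsqueeze(0)  # dummy image
with torch.no_grad():
    pooled = model.encode_image(image)
print(pooled.shape)  # torch.Size([1, 1280])
```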

Generating images from their respective ground truth CLIP image embeddings. Our fine-tuned Stable Diffusion XL unCLIP model (middle) outperforms the previously used Versatile Diffusion (right) in retaining both low-level structure and high-level semantics.

Normalized reconstruction metrics for MindEye2 with (connected) or without (dotted) pretraining on other subjects, using varying amounts of training/fine-tuning data.

Comparing MindEye2 reconstruction metrics across varying amounts of training/fine-tuning data, we see steady improvement for both the pre-trained and non-pre-trained models as more data is used to train the model. These results show that the 1-hour setting offers a good balance between scan duration and reconstruction performance, with notable gains from first pre-training on other subjects’ data.

MindEye2 generates image reconstructions from fMRI data, achieving state-of-the-art results with less training data than previous methods. Our work shows the potential to apply deep learning models trained on large-scale neuroimaging datasets to new subjects with minimal data, and demonstrates that it is now practical for patients to undergo a single MRI scanning session and produce enough data to perform high-quality reconstructions of their visual perception. Such reconstructions from brain activity are expected to be systematically distorted by factors such as mental state and neurological condition. This could enable novel clinical diagnosis and assessment approaches, including improved communication for locked-in (pseudocoma) patients [13] and brain-computer interfaces, if adapted to real-time analysis [14] or to non-fMRI neuroimaging modalities.

Limitations

fMRI is extremely sensitive to movement and requires subjects to comply with the task: decoding is easily resisted by slightly moving one's head or thinking about unrelated information [15]. MindEye2 has also only been shown to work on natural scenes such as those in COCO; additional data and/or specialized generative models would likely be required for other image distributions.

Open Research

MindEye was developed using a 100% transparent, volunteer-driven open research approach. The source code was accessible via a public GitHub repository throughout the lifespan of the project, research discussions were held in public Discord channels, and weekly video conference calls were recorded and shared publicly. We aim to build an internationally diverse, volunteer-driven research team whose members bring a wide range of backgrounds and expertise. Fully transparent open-research initiatives such as this one, and others like EleutherAI, LAION, OpenBioML, and ML Collective, could redefine the traditional framework of scientific research, democratizing entry into machine learning and medical research by harnessing crowd-sourced collective intelligence and community collaboration. Join our efforts at https://medarc.ai/fmri.

Authors

MindEye2 was developed by project lead Paul Scotti (Stability AI), core contributor Mihir Tripathy (MedARC), core contributor Cesar Kadir Torrico Villanueva (MedARC), core contributor Reese Kneeland (University of Minnesota), Tong Chen (University of Sydney), Ashutosh Narang (MedARC), Charan Santhirasegaran (MedARC), Jonathan Xu (University of Waterloo), Thomas Naselaris (University of Minnesota), Kenneth A. Norman (Princeton University), and Tanishq Mathew Abraham (Stability AI). 


For more information, visit our project website.

References

  1. Paul Scotti, Atmadeep Banerjee, Jimmie Goode, Stepan Shabalin, Alex Nguyen, Ethan Cohen, Aidan Dempster, Nathalie Verlinde, Elad Yundler, David Weisberg, Kenneth Norman, and Tanishq Abraham. Reconstructing the Mind’s Eye: fMRI-to-Image with Contrastive Learning and Diffusion Priors. Advances in Neural Information Processing Systems, 36:24705–24728, December 2023.

  2. Yu Takagi and Shinji Nishimoto. High-resolution image reconstruction with latent diffusion models from human brain activity. bioRxiv preprint, November 2022. URL http://biorxiv.org/lookup/doi/10.1101/2022.11.18.517004.

  3. Furkan Ozcelik and Rufin VanRullen. Brain-Diffuser: Natural scene reconstruction from fMRI signals using generative latent diffusion, March 2023. URL http://arxiv.org/abs/2303.05334. arXiv:2303.05334 [cs, q-bio].

  4. Reese Kneeland, Jordyn Ojeda, Ghislain St-Yves, and Thomas Naselaris. Brain-optimized inference improves reconstructions of fMRI brain activity, December 2023. URL http://arxiv.org/abs/2312.07705. arXiv:2312.07705 [cs, q-bio].

  5. Emily J. Allen, Ghislain St-Yves, Yihan Wu, Jesse L. Breedlove, Jacob S. Prince, Logan T. Dowdle, Matthias Nau, Brad Caron, Franco Pestilli, Ian Charest, J. Benjamin Hutchinson, Thomas Naselaris, and Kendrick Kay. A massive 7T fMRI dataset to bridge cognitive neuroscience and artificial intelligence. Nature Neuroscience, 25(1):116–126, January 2022. ISSN 1097-6256, 1546-1726. doi: 10.1038/s41593-021-00962-x.

  6. Po-Hsuan (Cameron) Chen, Janice Chen, Yaara Yeshurun, Uri Hasson, James Haxby, and Peter J Ramadge. A Reduced-Dimension fMRI Shared Response Model. In Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015.

  7. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis, July 2023. URL http://arxiv.org/abs/2307.01952. arXiv:2307.01952 [cs].

  8. Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models, October 2022.

  9. Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/zenodo.5143773.

  10. Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, and Humphrey Shi. Versatile Diffusion: Text, Images and Variations All in One Diffusion Model, March 2023. URL http://arxiv.org/abs/2211.08332. arXiv:2211.08332 [cs].

  11. Hu Ye, Jun Zhang, Sibo Liu, Xiao Han, and Wei Yang. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models, August 2023. URL http://arxiv.org/abs/2308.06721. arXiv:2308.06721 [cs].

  12. Justin Pinkney. Lambda Diffusers. GitHub repository, 2022. URL https://github.com/LambdaLabsML/lambda-diffusers.

  13. Martin M. Monti, Audrey Vanhaudenhuyse, Martin R. Coleman, Melanie Boly, John D. Pickard, Luaba Tshibanda, Adrian M. Owen, and Steven Laureys. Willful Modulation of Brain Activity in Disorders of Consciousness. New England Journal of Medicine, 362(7):579–589, February 2010. ISSN 0028-4793. doi: 10.1056/NEJMoa0905370. URL https://doi.org/10.1056/NEJMoa0905370.

  14. Grant Wallace, Stephen Polcyn, Paula P. Brooks, Anne C. Mennen, Ke Zhao, Paul S. Scotti, Sebastian Michelmann, Kai Li, Nicholas B. Turk-Browne, Jonathan D. Cohen, and Kenneth A. Norman. RT-Cloud: A cloud-based software framework to simplify and standardize real-time fMRI. NeuroImage, 257:119295, August 2022. ISSN 1053-8119. doi: 10.1016/j.neuroimage.2022.119295. URL https://linkinghub.elsevier.com/retrieve/pii/S1053811922004141.

  15. Jerry Tang, Amanda LeBel, Shailee Jain, and Alexander G. Huth. Semantic reconstruction of continuous language from non-invasive brain recordings. Nature Neuroscience, pages 1–9, May 2023. ISSN 1546-1726. doi: 10.1038/s41593-023-01304-9. URL https://www.nature.com/articles/s41593-023-01304-9.
