Dorrestein is a Professor at the University of California, San Diego. He is the Director of the Collaborative Mass Spectrometry Innovation Center and a Co-Director of the Institute for Metabolomics Medicine in the Skaggs School of Pharmacy & Pharmaceutical Sciences, and the Departments of Pharmacology and Pediatrics. Born in Utrecht, the Netherlands, Dorrestein did his undergraduate work in metalloorganic chemistry at Northern Arizona University under the guidance of Prof. MacDonald and his graduate work in mechanistic enzymology at Cornell University under the guidance of Prof. Begley. He performed his postdoctoral work in top-down proteomics as an NIH NRSA fellow in the lab of Prof. Kelleher. Since his arrival at UCSD in 2006, Prof. Dorrestein has been pioneering the development of mass spectrometry methods to study the chemical ecological crosstalk between microorganisms, including host interactions, for agricultural, diagnostic and therapeutic applications. In his spare time Dorrestein likes to spend time with Kathleen, his better half (wife), and his awesome daughter Tatiana. He also likes to garden, rock climb, hike, kayak and mountain bike.
1. When and why did you start using metabolomics in your investigations?
As a graduate student, I was studying the mechanistic enzymology of how bacteria make vitamins. Initially, I synthesized small molecules as mechanistic probes to understand how the enzymes work, but then pivoted to the analysis of the proteins by mass spectrometry. A large part of this work involved genome mining, where we predicted the molecules made by gene clusters on bacterial genomes, or the roles and mechanisms by which small molecules were biosynthesized. This led to the discovery of a new biosynthetic pathway to make methionine in actinomycetes, a new strategy for bacteria to make vitamin B1, and many other examples. We used top-down mass spectrometry to study the intact proteins involved in the biosynthesis and the modifications they undergo along the way. As a postdoc and as an assistant professor I expanded this work to study how small molecules are made by non-ribosomal peptide synthetases and polyketide synthases using top-down and middle-down proteomics methodologies, as many of our therapeutic agents (penicillin, vancomycin and lovastatin, for example) are made via those biosynthetic paradigms.
But I was always very curious about the roles microbial small molecules play in their respective ecosystems (e.g. human, ocean, agriculture, environments, etc.). I had questions such as "How do microbes use molecules to communicate with each other?" or "How do microbes communicate with the host?" or "How does the chemical milieu shape microbial communities?". In asking these questions and trying to answer them, it quickly became evident that there were insufficient tools that could be widely employed. I realized this in my second year as an assistant professor and switched the entire lab, which was dedicated to top-down and middle-down proteomics, to the study of small molecules within a two-month period. I was strongly advised by senior co-workers not to make such a dramatic switch as it would hurt my career. But I decided that this is what I wanted to do scientifically and therefore proceeded.
In this transition, we developed strategies such as microbial imaging mass spectrometry that allowed the investigation of the metabolic exchange of two or more microbes grown in Petri dishes. This was so illuminating. However, while we got images that were easy to interpret in terms of metabolic exchange or communication, they were images of m/z signals. Translating the m/z signals to molecules turned out to be the most time-consuming and difficult part. In addition, it was often not just one molecule but panels of molecules that could be observed, each with different functions. And they were often present only in small quantities or were too unstable to isolate. One key limitation is that most microbial molecules are not covered in MS/MS-based reference libraries. This is because reference libraries are largely obtained from commercially available molecules. Thus new methods needed to be developed for the annotation and structural classification of the microbial molecules to allow the interpretation of the images. We therefore first leveraged our strength in genome mining and developed strategies to link mass spec signals to microbial genomic information, such as peptidogenomics and glycogenomics, leading to tools such as NRPquest, RippQuest, Pep2Path and several others. Such work is still done in my lab through the many wonderful collaborations the lab still has in this area. However, these strategies do not work for many classes of molecules.
This led to another major shift in the lab, the shift from microbial imaging to the analysis of global data. This started with the introduction of molecular networking. From genome mining, we realized the importance of being able to compare two or more sequences. From this one could infer modifications to the sequence. Around that time people had just started to create networks of genes and infer function from them. By analogy, if we compare two MS/MS spectra and the parent masses are not the same but the overall MS/MS spectra are (while taking into account the ions that differ by the same mass as the parent masses do), then we could infer that the structures were related. I gave my co-worker and collaborator Dr. Bandeira a DDA-acquired data set of Bacillus subtilis and asked if he could give me a table with three columns: the first two columns holding the MS/MS spectral IDs of each pair of spectra compared, and the third column a measure of how similar the two spectra were. Then I would look at it. My jaw dropped when I inspected the first molecular network. We had a very difficult time publishing the first paper on molecular networking, with eight rejections. Only about 1 in 5 people understood the implications of molecular networking right away; all the others did not until they tried it with their own data. I got the paper published because I called an editor, the late Jerrold Meinwald, and told him this was the most impactful paper I had created in my scientific career to date, and he was willing to listen and get it reviewed. I realized not only that molecular networking is a key strategy to tackle the annotation problem but also that it is the key ingredient to start building the Google search engine equivalent for the world's MS/MS data. However, it meant that we also needed to get all that data and create an infrastructure that would capture the world's mass spectrometry data and the knowledge associated with it, and make this freely accessible to the world. We are getting closer and closer to that goal.
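The comparison described above, where fragment ions are matched either directly or after allowing for the difference in parent mass, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the scoring code used in GNPS: the function names, the 0.02 m/z tolerance, the 0.7 score cutoff, the greedy peak matching and the toy spectra are all hypothetical choices made for this example.

```python
from itertools import combinations


def modified_cosine(spec_a, spec_b, tol=0.02, use_shift=True):
    """Similarity of two MS/MS spectra given as (precursor_mz, [(mz, intensity), ...]).

    Peaks may pair up when their m/z values agree directly or, if use_shift
    is True, when they differ by the same amount as the precursor masses.
    Greedy matching is a simplification chosen for this sketch.
    """
    prec_a, peaks_a = spec_a
    prec_b, peaks_b = spec_b
    shift = prec_a - prec_b
    norm_a = sum(i * i for _, i in peaks_a) ** 0.5
    norm_b = sum(i * i for _, i in peaks_b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0

    # Collect every peak pair that is allowed to match.
    candidates = []
    for ia, (mz_a, int_a) in enumerate(peaks_a):
        for ib, (mz_b, int_b) in enumerate(peaks_b):
            diff = mz_a - mz_b
            direct = abs(diff) <= tol
            shifted = use_shift and abs(diff - shift) <= tol
            if direct or shifted:
                candidates.append((int_a * int_b, ia, ib))

    # Take the strongest pairs first; each peak is used at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    score = 0.0
    for product, ia, ib in candidates:
        if ia in used_a or ib in used_b:
            continue
        used_a.add(ia)
        used_b.add(ib)
        score += product
    return score / (norm_a * norm_b)


def network_edges(spectra, min_score=0.7):
    """Build the three-column table: spectrum ID, spectrum ID, similarity."""
    edges = []
    for (id_a, a), (id_b, b) in combinations(sorted(spectra.items()), 2):
        score = modified_cosine(a, b)
        if score >= min_score:
            edges.append((id_a, id_b, round(score, 3)))
    return edges


# Toy spectra (invented): B looks like A with a +14 modification on part of the molecule.
spectra = {
    "A": (500.0, [(100.0, 10.0), (200.0, 20.0), (300.0, 30.0)]),
    "B": (514.0, [(100.0, 10.0), (214.0, 20.0), (314.0, 30.0)]),
    "C": (350.0, [(77.0, 5.0), (155.0, 50.0)]),
}
edges = network_edges(spectra)
```

With these toy spectra, A and B share only one fragment directly, so a plain cosine score is low, while allowing the parent mass difference lets all three fragments pair up and the two spectra become connected in the network. Spectrum C shares nothing with either and stays unconnected.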
This set off the movement to build the Global Natural Products Social Molecular Networking (GNPS) analysis ecosystem, which now has more than 300,000 accessions a month and has processed 3.2 million tasks through the submission of 200,000 jobs. The focus of the Bandeira lab had been to develop a data storage system for proteomics data called MassIVE, which was instrumental in making GNPS possible. A student of his in computer science, and now a postdoc in my lab, Mingxun Wang, led the development of the GNPS infrastructure, which was originally only a small side project of his as a student. But the worldwide community saw the value of GNPS and it spawned so many wonderful collaborations. A key underappreciated aspect of GNPS is that it is provided for free to the community, and this is made possible by hosting and maintaining 70 ft of server and compute hardware, a scientific engineering feat in its own right that is not documented in any publication. The more the community uses the GNPS ecosystem and the more collaborations it initiates, the more useful the analysis ecosystem becomes, and we are seeing an explosion of new tools, many of which are built in the lab and in collaborations. For the first time, all publicly available reference spectra could be searched in one go, and one could perform molecular networking to propagate annotations. We introduced feature-based molecular networking and made it work together with many feature-finding tools such as MZmine, OpenMS, XCMS and MS-DIAL, with the mzTab-M format, and with vendor feature-finding software. It also became possible to search public data sets without having to download the data and convert it to a format that the user could analyze, as this was already done. We have introduced MS/MS searches called MASST across all public data so one can find all data files that contain the same MS/MS spectra of interest.
This is useful as it allows one to ask simple questions such as "Is the discovery of this new molecule in an animal translatable to people?" or "Is this molecule found in people host, diet or mycobiome derived?" We have also built ReDU, a metadata-driven strategy that allows searches of all data that contain a specific annotation, or allows co-analysis of data filtered on specific metadata, for example filtering on age and sex for clinical data or longitude/latitude for environmental data. Basically, these developments are early versions of the types of search engines we want to build.
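The idea of selecting public data files by their metadata before co-analysis can be illustrated with a small Python sketch. It is not the ReDU implementation or its API: the record layout, the field names (filename, sex, age_years) and the example records are all hypothetical and only stand in for whatever metadata a repository actually stores.

```python
def select_files(records, sex=None, min_age=None, max_age=None):
    """Return the filenames whose metadata satisfy every given criterion.

    Each record is a dict with hypothetical keys: filename, sex, age_years.
    A criterion left as None is simply not applied.
    """
    selected = []
    for rec in records:
        if sex is not None and rec.get("sex") != sex:
            continue
        age = rec.get("age_years")
        if min_age is not None and (age is None or age < min_age):
            continue
        if max_age is not None and (age is None or age > max_age):
            continue
        selected.append(rec["filename"])
    return selected


# Invented metadata records, for illustration only.
records = [
    {"filename": "sample_01.mzML", "sex": "female", "age_years": 34},
    {"filename": "sample_02.mzML", "sex": "male", "age_years": 61},
    {"filename": "sample_03.mzML", "sex": "female", "age_years": 8},
    {"filename": "sample_04.mzML", "sex": "female", "age_years": None},
]
adult_female_files = select_files(records, sex="female", min_age=18)
```

Records with a missing age are left out whenever an age criterion is applied, which is one possible design choice; a real system might instead report them separately.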
2. What have you been working on recently?
I would like to rephrase this as "What have the lab and worldwide collaborators been working on recently?" as this reflects the way the lab functions. We are working with a worldwide community to develop some key and fundamentally important tools that allow us to understand all the information that is collected in untargeted mass spectrometry experiments. There are now 50 different tools available in GNPS. Recent tools such as Passatuto (FDR estimator for spectral matching), Sirius and Zodiac (MF predictors), MS2LDA (substructure classification), Spec2Vec (a method that allows molecular networking when two or more modifications are present), CANOPUS (structural classification of MS/MS), Dereplicator, CSI:FingerID and NAP (in silico predictions), Qemistree (visualization strategy of MS/MS similarity trees), Ion Identity networking (a molecular networking strategy that also networks based on peak shape analysis to find different ions of the same molecule) and native spray based metal-small molecule discovery (many molecules do not function without their metal counterpart, so this workflow allows one to discover potential metal-small molecule pairs) are some of the key advances that are assisting in the deeper annotation of untargeted metabolomics data.
However, we are also interfacing GNPS with important third-party tools. One example is Cytoscape, developed by our co-worker Ideker: we now have a direct export into Cytoscape to visualize the results. We have linked GNPS to BiG-SCAPE, the Natural Products Atlas and iOMEGA to link gene clusters and microbial molecules. We have direct connections to MoNA via HASHtags so that public spectra can be searched. To allow high-level statistical analysis, multivariate analysis, ratio analysis and machine learning, we export the results into a format compatible with QIIME and QIITA, which were developed by the Knight lab, led by another co-worker at UC San Diego. While designed for the analysis of microbiome data from an ecological standpoint, they are among the most advanced platforms for quantitative data analysis and visualization currently available in the scientific community. At the same time, GNPS is also compatible with more widely used tools such as MetaboAnalyst, one of our favorite analysis programs that we use routinely, and Jupyter notebooks, which allow for improved reproducibility of custom scientific analyses and sharing of the code. Integration with other infrastructures has also been critical to developing new methods such as MMVec, which learns microbe-metabolite relationships.
There are also some additional new developments emerging that will become available through GNPS in the near future. 1) A new reference library of propagated annotations from the entire repository. 2) A new Google Docs equivalent for data analysis, so that several people can analyze data together, share the content, and let others continue where the first person left off. 3) Tagging of molecules for ease of interpretation of the data. 4) Reference-based metabolomics that annotates based not on the structure but on the source. 5) New strategies to understand whether molecules are microbially derived or not. 6) Strategies that allow the understanding of diet to microbial community relationships, which we are particularly excited about and for which we are developing a series of tools. 7) Connections to MetaboLights, the European metabolomics repository, which we are also excited to begin building. Ultimately we want all the results and knowledge gained from the above-mentioned tools, and from others that were not mentioned, to be available from a single search and to be distributed worldwide.
3. What are some challenges when characterizing the metabolic interactions between microbes?
Microbes produce molecules in small quantities, and many of the unique molecules they produce are not commercially available. There are now some 29,000 known microbial molecules, but only a few hundred reference spectra of microbial metabolites are available. This means that even if a structure is known, one still will not recognize the molecule when it is detected. Thus it is critical to deposit reference spectra of any new structure that is determined, so that the worldwide community does not spend time and a significant amount of money solving a known structure.
4. Do you envision your work on identifying small molecules involved in these dynamic interactions being translated into improved human health?
Yes, I strongly believe that many of these molecules are potential candidates to solve medical challenges. Companies license technologies directly from our university, UC San Diego, for this purpose. I am also on the scientific advisory boards of Sirenas, which explores the oceans for new molecules that benefit human health; Cybele Microbiome, which is developing a specific molecule to promote skin health; and Galileo, which discovers human microbiome-derived molecules for human health. In addition, as approved by UC San Diego, I am a co-founder of Ometalabs, which provides mass spectrometry analysis capabilities and infrastructure solutions for instances when companies, agencies or individuals cannot use public infrastructure or do not want their data to be public for IP reasons, and a co-founder of Enveda Biosciences, which is translating plant-derived molecules into medicine.
5. As one of the creators of the Global Natural Products Social Molecular Networking (GNPS), what are the challenges when characterizing metabolites within complex mixtures or samples?
I already mentioned some of the challenges due to the lack of available standards. But some of the other challenges stem from the fact that GNPS is built for and by the community. We want to make the infrastructure fair for everyone and truly democratize metabolomics analysis. This means that a high school student will have the same compute capacity as a seasoned expert with 50 years of experience, and that anyone with access to the internet can do large-scale analysis and have access to the same knowledge as everyone else. We also wanted to lower the barrier to entry by making sure all the data is already converted to a common data format (mzML, mzXML or MGF). The limit then becomes how creative the person doing the analysis is. It has been amazing to see the solutions that the community has been developing using the GNPS infrastructure, oftentimes in very creative and unexpected ways. For example, the first hints of a molecule called colibactin, an E. coli-derived molecule that causes cancer, were discovered using molecular networking and feeding studies, and there are many such examples. But the real challenge is the lack of sharing of data and of the annotations associated with the data. Imagine what the community could do if all the metabolomics data that everyone collects, and the annotations associated with it, were available at your fingertips. This will transform our understanding not only of microbial metabolites but also of their relationships to health, xenobiotics, diet, etc. This capacity will also transform the types of questions we can address and will ask. This transition is happening, and I cannot wait to see how the next generation of metabolomics scientists leverages this information.
6. What are your recommendations for people getting started in microbial metabolomics?
Currently, GNPS houses the most microbial metabolomics reference data, so GNPS is a good starting point. I would also become familiar with QIIME and QIITA. There are online tutorials and workshops available; attend them. Reach out to the authors, not only the PIs but also the researchers who are listed first. They all get excited to help newcomers get started. Also consult other infrastructures such as MetaboLights, Metabolomics Workbench, XCMS Online and NP Atlas; there are lots of gems to be found. In addition, make your data available, and do so early. Your data then becomes searchable in MASST and ReDU, and it also becomes part of living data: you can subscribe to your data, and if any new matches to your data become available from spectral libraries you will get an email, which ensures that you do not miss any new annotations. More importantly, there are many microbial MS/MS spectra annotated in GNPS that are not yet published, and you can then collaborate with the authors who deposited those spectra. In other words, become part of the community. If there is a tool you want to apply but you don't know how to get started, reach out directly to the authors or use the forums. GNPS has its own forum, linked in the GNPS banner, for GNPS questions, as do the QIIME and Cytoscape analysis and data visualization tools, and the Metabolomics Society has a forum as well.