Johannes studied Electrical and Biomedical Engineering at the Technical University of Graz, Austria. He obtained his MSc in Bioinformatics in 2003 and then completed a PhD in Bioinformatics with a focus on cancer research, in particular childhood leukemia. From 2007 to 2015 he worked at the Medical University of Innsbruck, Austria, first as a postdoc and then as a junior group leader for Bioinformatics. His main research areas during this time were transcriptomics and genomics in the field of childhood leukemia. In 2015 he moved to the Institute for Biomedicine of Eurac Research in Bolzano, Italy, and shifted his focus first to genetics and subsequently to metabolomics research. In 2018 he became the head of the Computational Metabolomics Team of the Institute for Biomedicine at Eurac Research. Johannes has long-standing experience in open software development in the fields of transcriptomics, genomics, and metabolomics. Since 2020 he has been more involved in the Bioconductor project and has since become a member of the Community Advisory Board, the Code of Conduct Committee, and the Package Review Working Group. He is an author of, or contributor to, more than 15 Bioconductor packages, most of them providing functionality for the analysis of mass spectrometry and metabolomics data. In his free time, he enjoys designing stickers and logos, mountaineering, and spending time with his family.
1. When and why did you start using metabolomics in your investigations?
I first came into contact with metabolomics data when I joined the Institute for Biomedicine of Eurac Research. I had long-standing experience in the analysis of large-scale data sets (mostly microarray and RNA-seq data) and was thus appointed to help analyze the metabolomics data sets that were generated at the Institute, in particular the untargeted LC-MS data. I started looking for tools to analyze that data and had the impression that the software available at that time was sub-optimal, especially when compared to the software for the processing of transcriptome data. That was when I first contacted Steffen Neumann and Laurent Gatto and discussed with them the possibility of joining forces to update and improve MS-related software in R. In particular, I wanted to avoid the code duplication present in the various software packages and to unify the code base of R/Bioconductor packages for the analysis of mass spectrometry (MS) data (both for metabolomics and proteomics). The rest is history. We have since updated the xcms and MSnbase R packages to also support the analysis of very large data sets and, from there, started to implement, together with an ever-growing number of collaborators and contributors, a large panel of other software packages that together, we believe, provide a comprehensive and flexible infrastructure for MS data handling and analysis.
2. What have you been working on recently?
Recently, we've implemented a set of R packages providing established methods and core functionality for the annotation of untargeted metabolomics data. Rather than being a single application, these packages provide modular functions that can be used to create customized, flexible, and reproducible annotation workflows. In addition, we're currently analyzing the targeted and untargeted metabolomics data sets from our in-house population study.
3. What are the main challenges you see in the data analysis of untargeted metabolomics data from population studies?
It's their magnitude. On the one hand, this data is computationally intensive, but that's something we can easily work on and fix by simply implementing more efficient or less memory-demanding software. The bigger problem for me is that such data tends to be so large that it becomes hard to do proper and comprehensive quality control. And that is obviously essential if we want to evaluate whether the pre-processing (peak detection, alignment, and correspondence) actually worked for all files. Another important point, which also applies to targeted metabolomics data, is that data from population studies will always be less controlled than, for example, data from case-control or clinical studies. Hence, evaluating the influence of potential confounding factors is in my opinion very important, especially for metabolomics data because, as we know, it is more affected by environmental factors than, for example, genetic, transcriptome, or proteome data.
4. As one of the people constantly working on software/package development in R for metabolomics, could you share some recent updates that may be interesting for the community?
This might be partially answered by point 2 above. In addition, what we are aiming for at present is to define an infrastructure that enables access to various reference libraries (such as spectral libraries and compound annotations from, e.g., HMDB or MassBank) in a more standardized way. Ultimately, this should help end users, as they would no longer lose time converting, importing, and reformatting data. My vision would be to distribute such annotation resources in a user-friendly and reproducible way. For genomic, transcriptomic, and proteome annotations this is already possible through Bioconductor's AnnotationHub resource. We are now planning to do the same for metabolite or small compound annotations. In addition, we're working hard to better integrate some of the fantastic tools that are out there, like SIRIUS or MASST, into R, which would make it possible to use them without having to manually export, upload, execute, and re-load the results into R.
5. What tips/advice would you give to ECRs who would like to start working with R in metabolomics?
The power of R is the possibility to create flexible, customized, and reproducible analysis workflows by using and integrating methods from the huge number of packages that are out there. For that, obviously, some understanding of R is needed. For people who don't have experience with R, one of the introductory courses/workshops from Data Carpentry (https://datacarpentry.org/) might be a good starting point. Also, each R package provides (or should provide) documents describing how it can be used, based on some use cases (the so-called package "vignettes"). It's always a good idea to go through these first to get a feeling for how a package can be used and what functionality it provides. In addition, there are a lot of other tutorials and workshops out there, also for the analysis of metabolomics data, that can be used as a starting point to set up your own custom workflows. Most importantly, don't be afraid to get in contact with the package developers if something is unclear. Most will help you out if something is not working.