Author: Tony Hauptmann

News and events in Q3/2024

Current activities:

  • On August 22, 2024, Prof. Stefan Kramer was appointed "KI-Lotse" (roughly translated as "AI guide") for the life sciences by Alexander Schweitzer, the Prime Minister of the State of Rhineland-Palatinate. In his new role, Prof. Kramer will advise partners from AI and the life sciences on collaborations, and the state government on the interface between the two disciplines. Congratulations, and good luck with his plans to further strengthen the interaction between AI and the life sciences!

 

Gutenberg Workshop on AI for Scientific Discovery:

  • From 2 to 4 September, the Gutenberg Workshop on AI for Scientific Discovery will take place at Weingut Wasem with an excellent line-up of invited speakers. The workshop was scientifically organized by Professor Peter Baumann and Professor Stefan Kramer of JGU Mainz. It offers a journey through the world of automated scientific discovery driven by artificial intelligence. Since its inception, AI research has been deeply intertwined with the pursuit of scientific insight, a synergy that has gained remarkable momentum since the late 1970s. Fueled by advances across many domains, AI now stands at the forefront of reshaping the landscape of scientific inquiry. The workshop focuses on cutting-edge topics such as automation and autonomy in science, deep learning and foundation models for science, and the automated discovery of interpretable scientific knowledge. It is organized around the following four themes:
    • Automation and autonomy in science
    • Applications of AI
    • Equation discovery, symbolic regression, and the induction of process models
    • Integration effort
  • The list of speakers includes some of the pioneers of the field:
    • Pat Langley (who started the field of AI for Scientific Discovery and wrote the groundbreaking book about it together with Herbert Simon, the only person to win both a Nobel and a Turing award)
    • Burkhard Rost (the first to successfully apply neural networks, combined with alignments, to protein data, achieving the first breakthrough in secondary structure prediction)
    • Ross D. King (the first person to build a completely autonomous robot scientist)
    • Sašo Džeroski (who established equation discovery as a field and achieved a major breakthrough by employing context-free grammars for that purpose)
  • Additionally, some of our group members and associates will present their current research:
    • Jannis Brugger will show how equation discovery can profit from a supervised learning setting instead of a reinforcement learning one. Furthermore, he will highlight the important question of how to embed tabular data.
    • Mattia Cerrato, together with his seminar students, developed a testbed for AI-driven scientific discovery called Science-Gym. The benchmark fosters physical understanding of the tasks by having agents autonomously perform data collection, experimental design, and equation discovery.
    • Cedric Cerstoff worked on an extension of Monte Carlo tree search that allows the exclusion of already explored subtrees or leaves, resulting in a broader search while using the same computational resources.
    • Marius Köppel will highlight the use of AI in particle physics in the past, discuss current capabilities, and explore future directions.

 

Current Publications and Presentations:

  • Derian Boer will present Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering, which is a joint work with Fabian Koch and Stefan Kramer, at IJCLR'24.
  • Lukas Pensel will present Neural RELAGGS, which is a joint work with Stefan Kramer, also at IJCLR'24.

 

Several of our group members take part in Discovery Science 2024:

  • Kirsten Köbschall will present Soft Hoeffding Tree: A Transparent and Differentiable Model on Data Streams, which is a joint work with Lisa Hartung and Stefan Kramer.
  • Mattia Cerrato will present Science-Gym: A Simple Testbed for AI-driven Scientific Discovery, which is a joint work with Nicholas Schmitt, Lennart Baur, Edward Finkelstein, Selina Jukic, Lars Münzel, Felix Peter Paul, Pascal Pfannes, Benedikt Rohr, Julius Schellenberg, Philipp Wolf, and Stefan Kramer.
  • Jannis Brugger will present Residuals for Equation Discovery, which is a joint work with Viktor Pfanschilling, Mira Mezini and Stefan Kramer.

 

Recent Events:

  • On 1 August, the kickoff meeting for our upcoming project Medical AI combining Natural products and CEllular Imaging (MAINCE) took place. MAINCE will use AI approaches to identify new and urgently needed therapeutics in immunology. Insights into the effects of these therapeutics, obtained through cutting-edge imaging techniques, will be combined with laboratory experiments and AI to accelerate drug development and make it more efficient.
  • From 1 to 3 July, we hosted the Third European Workshop on Algorithmic Fairness (EWAF’24) in Mainz, organized by Mattia Cerrato and Alesia Vallenas Coronel. The workshop provided a unique platform for researchers from academia and industry working on algorithmic fairness in the context of Europe’s legal and societal framework.

Our paper "Discriminative machine learning for maximal representative subsampling" has been accepted at Scientific Reports

Our paper "Discriminative machine learning for maximal representative subsampling", which is a joint work of Tony Hauptmann, Sophie Fellenz, Laksan Nathan, Oliver Tüscher, and Stefan Kramer, has been accepted at Scientific Reports.

Abstract

Biased population samples pose a prevalent problem in the social sciences. Therefore, we present two novel methods that are based on positive-unlabeled learning to mitigate bias. Both methods leverage auxiliary information from a representative data set and train machine learning classifiers to determine the sample weights. The first method, named maximum representative subsampling (MRS), uses a classifier to iteratively remove instances from the biased data set, by assigning them a sample weight of 0, until it aligns with the representative one. The second method is a variant of MRS – Soft-MRS – that iteratively adapts sample weights instead of removing samples completely. To assess the effectiveness of our approach, we induced artificial bias in a public census data set and examined the corrected estimates. We compare the performance of our methods against existing techniques, evaluating the ability of sample weights created with Soft-MRS or MRS to minimize differences and improve downstream classification tasks. Lastly, we demonstrate the applicability of the proposed methods in a real-world study of resilience research, exploring the influence of resilience on voting behavior. Through our work, we address the issue of bias in the social sciences, among other fields, and provide a versatile methodology for bias reduction based on machine learning. Based on our experiments, we recommend using MRS for downstream classification tasks and Soft-MRS for downstream tasks where the relative bias of the dependent variable is relevant.
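The iterative reweighting idea behind MRS can be illustrated with a minimal sketch. This is our own simplification, not the paper's implementation: the function name `mrs_subsample`, the one-dimensional data, and the mean-matching "discriminator" are hypothetical stand-ins for the trained classifier the paper uses.

```python
from statistics import mean

def mrs_subsample(biased, representative, tol=1e-3, max_iter=100):
    """Minimal one-dimensional MRS-style sketch: iteratively assign
    weight 0 to the biased instance whose removal brings the biased
    sample's mean closest to the representative sample's mean,
    stopping once the two means align."""
    rep_mean = mean(representative)
    weights = [1] * len(biased)           # weight 0 == removed
    for _ in range(max_iter):
        active = [i for i, w in enumerate(weights) if w == 1]
        if len(active) < 2:
            break
        if abs(mean(biased[i] for i in active) - rep_mean) < tol:
            break                          # samples aligned: stop
        # drop the instance whose removal most reduces the gap
        drop = min(active,
                   key=lambda j: abs(mean(biased[i] for i in active if i != j)
                                     - rep_mean))
        weights[drop] = 0
    return weights
```

Soft-MRS, in this picture, would gradually shrink the offending weights instead of setting them to 0 outright.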


Our paper "Four-dimensional trapped ion mobility spectrometry lipidomics for high throughput clinical profiling of human blood samples" has been accepted at Nature Communications

Our paper "Four-dimensional trapped ion mobility spectrometry lipidomics for high throughput clinical profiling of human blood samples", which is a joint work of Raissa Lerner, Dhanwin Baker, Claudia Schwitter, Sarah Neuhaus, Tony Hauptmann, Julia M. Post, Stefan Kramer & Laura Bindila, has been accepted at Nature Communications.

Abstract

Lipidomics encompassing automated lipid extraction, a four-dimensional (4D) feature selection strategy for confident lipid annotation as well as reproducible and cross-validated quantification can expedite clinical profiling. Here, we determine 4D descriptors (mass to charge, retention time, collision cross section, and fragmentation spectra) of 200 lipid standards and 493 lipids from reference plasma via trapped ion mobility mass spectrometry to enable the implementation of stringent criteria for lipid annotation. We use 4D lipidomics to confidently annotate 370 lipids in reference plasma samples and 364 lipids in serum samples, and reproducibly quantify 359 lipids using level-3 internal standards. We show the utility of our 4D lipidomics workflow for high-throughput applications by reliable profiling of intra-individual lipidome phenotypes in plasma, serum, whole blood, venous and finger-prick dried blood spots.


Our paper "A fair experimental comparison of neural network architectures for latent representations of multi-omics for drug response prediction" has been accepted at BMC Bioinformatics

Our paper "A fair experimental comparison of neural network architectures for latent representations of multi-omics for drug response prediction", which is a joint work of Tony Hauptmann and Stefan Kramer, has been accepted at BMC Bioinformatics.

 

Abstract:

Background

Recent years have seen a surge of novel neural network architectures for the integration of multi-omics data for prediction. Most of the architectures include either encoders alone or encoders and decoders, i.e., autoencoders of various sorts, to transform multi-omics data into latent representations. One important parameter is the depth of integration: the point at which the latent representations are computed or merged, which can be early, intermediate, or late. The literature on integration methods is growing steadily; however, close to nothing is known about the relative performance of these methods under fair experimental conditions and under consideration of different use cases.

 

Results

We developed a comparison framework that trains and optimizes multi-omics integration methods under equal conditions. We incorporated early integration, PCA, and four recently published deep learning methods: MOLI, Super.FELT, OmiEmbed, and MOMA. Further, we devised a novel method, Omics Stacking, that combines the advantages of intermediate and late integration. Experiments were conducted on a public drug response data set with multiple omics data (somatic point mutations, somatic copy number profiles, and gene expression profiles) obtained from cell lines, patient-derived xenografts, and patient samples. Our experiments confirmed that early integration has the lowest predictive performance. Overall, architectures that integrate triplet loss achieved the best results. Statistical differences can, overall, rarely be observed; however, in terms of the average ranks of methods, Super.FELT consistently performs best in a cross-validation setting and Omics Stacking best in an external test set setting.

 

Conclusions

We recommend that researchers follow fair comparison protocols, as suggested in the paper. When faced with a new data set, Super.FELT is a good option in the cross-validation setting, as is Omics Stacking in the external test set setting. Statistical significances are hardly observable, despite trends in the algorithms’ rankings. Future work on refined transfer learning methods tailored to this domain may improve the situation for external test sets. The source code of all experiments is available on GitHub.


Our paper "Learning to Rank Higgs Boson Candidates" has been accepted at Nature Scientific Reports

Our paper "Learning to Rank Higgs Boson Candidates", which is a joint work of Marius Köppel, Alexander Segner, Martin Wagener, Lukas Pensel, Andreas Karwath, Christian Schmitt, and Stefan Kramer, has been accepted at Nature Scientific Reports.

 

Abstract:

In the extensive search for new physics, the precise measurement of the Higgs boson continues to play an important role. To this end, machine learning techniques have been recently applied to processes like the Higgs production via vector-boson fusion. In this paper, we propose to use algorithms for learning to rank, i.e., to rank events into a sorting order, first signal, then background, instead of algorithms for the classification into two classes, for this task. The fact that training is then performed on pairwise comparisons of signal and background events can effectively increase the amount of training data due to the quadratic number of possible combinations. This makes it robust to unbalanced data set scenarios and can improve the overall performance compared to pointwise models like the state-of-the-art boosted decision tree approach. In this work we compare our pairwise neural network algorithm, which is a combination of a convolutional neural network and the DirectRanker, with convolutional neural networks, multilayer perceptrons or boosted decision trees, which are commonly used algorithms in multiple Higgs production channels. Furthermore, we use so-called transfer learning techniques to improve overall performance on different data types.
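The pairwise training idea in the abstract can be sketched with a toy linear ranker. This is our own perceptron-style illustration, not the DirectRanker/CNN architecture of the paper; the function names and the two-feature events are hypothetical. Note how the training set is built from every signal-background pair, which is where the quadratic growth in training comparisons comes from.

```python
import random

def train_pairwise_ranker(signal, background, lr=0.1, epochs=20, seed=0):
    """Learn a linear score w.x so that signal events outrank background
    events, training on all signal x background pairs with a
    perceptron-style update whenever a pair is ranked incorrectly."""
    rng = random.Random(seed)
    w = [0.0] * len(signal[0])
    pairs = [(s, b) for s in signal for b in background]  # quadratic blow-up
    for _ in range(epochs):
        rng.shuffle(pairs)
        for s, b in pairs:
            margin = sum(wi * (si - bi) for wi, si, bi in zip(w, s, b))
            if margin <= 0:  # signal not ranked above background: update
                w = [wi + lr * (si - bi) for wi, si, bi in zip(w, s, b)]
    return w

def score(w, x):
    """Ranking score of a single event."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

With 2 signal and 2 background events the ranker already trains on 4 comparisons; with n events per class it trains on n² of them, which is the data-amplification effect the abstract describes.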

Our short paper "Ranking Creative Language Characteristics in Small Data Scenarios" has been accepted at ICCC’22

Our short paper "Ranking Creative Language Characteristics in Small Data Scenarios", which is a joint work of Julia Siekiera, Marius Köppel, Edwin Simpson, Kevin Stowe, Iryna Gurevych, and Stefan Kramer, has been accepted at ICCC'22.

 

Abstract:

The ability to rank creative natural language provides an important general tool for downstream language understanding and generation. However, current deep ranking models require substantial amounts of labeled data that are difficult and expensive to obtain for new domains, languages and creative characteristics. A recent neural approach, DirectRanker, reduces the amount of training data needed but has not previously been used to rank creative text. We therefore adapt DirectRanker to provide a new deep model for ranking creative language with small numbers of training instances, and compare it with a Bayesian approach, Gaussian process preference learning (GPPL), which was previously shown to work well with sparse data. Our experiments with short creative language texts show the effectiveness of DirectRanker even with small training datasets. Combining DirectRanker with GPPL outperforms the previous state of the art on humor and metaphor novelty tasks, increasing Spearman's ρ by 25% and 29% on average. Furthermore, we provide a possible application to validate jokes in the process of creativity generation.

Our paper "Deep Unsupervised Identification of Selected Genes and SNPs in Pool-Seq Data from Evolving Populations" has been accepted as a poster presentation at RECOMB-Genetics’22

Our paper "Deep Unsupervised Identification of Selected Genes and SNPs in Pool-Seq Data from Evolving Populations", which is a joint work of Julia Siekiera and Stefan Kramer, has been accepted as a poster presentation at RECOMB-Genetics 2022.

 

Abstract:

The exploration of selected single nucleotide polymorphisms (SNPs) to identify genetic diversity between populations under selection pressure is a fundamental task in population genetics. As underlying sequence reads and their alignment are error-prone and univariate statistical solutions like the Cochran-Mantel-Haenszel test (CMH) only take individual positions of the genome into account, the identification of selected SNPs remains a challenging process. Deep learning models, by contrast, are able to consider large input areas to integrate the decision of individual positions in the context of (hidden) neighboring patterns. We suggest an unsupervised deep learning pipeline to detect selected SNPs or genes between different types of population pairs by the application of both active learning and explainable AI methods. To provide a solution for various experimental designs, the effectiveness of direct genomic population comparison and the integration of drift simulation is investigated. In addition, we demonstrate how the extension of an autoencoder architecture can support the mapping of the genotype into a hidden representation upon which optimized selection detection is possible. The performance of the proposed method configurations is investigated on different simulated sequencing pools of individuals (Pool-Seq) datasets of Drosophila melanogaster and compared to a univariate baseline. The evaluation demonstrates that deep neural networks offer the potential to recognize hidden patterns in the allele frequencies of evolved populations and to enhance the information given by univariate statistics.

Our paper "Deep neural networks to recover unknown physical parameters from oscillating time series" has been accepted at PLOS ONE

Our paper "Deep neural networks to recover unknown physical parameters from oscillating time series" (DOI), which is a joint work of Antoine Garcon, Julian Vexler, Dmitry Budker, and Stefan Kramer, was accepted at PLOS ONE.

 

Abstract:

Deep neural networks are widely used in pattern-recognition tasks for which a human-comprehensible, quantitative description of the data-generating process cannot be obtained. In doing so, neural networks often produce an abstract (entangled and non-interpretable) representation of the data-generating process. This may be one of the reasons why neural networks are not yet used extensively in physics-experiment signal processing: physicists generally require their analyses to yield quantitative information about the system they study. In this article we use a deep neural network to disentangle components of oscillating time series. To this aim, we design and train the neural network on synthetic oscillating time series to perform two tasks: a regression of the signal latent parameters and signal denoising by an autoencoder-like architecture. We show that the regression and denoising performance is similar to that of least-squares curve fits initialized with the true latent parameters, even though the neural network needs no initial guesses at all. We then explore various applications in which we believe our architecture could prove useful for time-series processing when prior knowledge is incomplete. As an example, we employ the neural network as a preprocessing tool to inform the least-squares fits when initial guesses are unknown. Moreover, we show that the regression can be performed on some latent parameters while ignoring the existence of others. Because the autoencoder needs no prior information about the physical model, the remaining unknown latent parameters can still be captured, thus making use of partial prior knowledge while leaving space for data exploration and discoveries.

We congratulate Dr. Atif Raza on his successful Ph.D. thesis defense

On the 13th of April 2021, our group member Atif Raza successfully defended his Ph.D. thesis, titled "Metaheuristics for Pattern Mining in Big Sequence Data".

Interested readers can find the thesis here.

An overview of the thesis is given below:
An ever-growing list of human endeavors in a variety of domains results in the generation of time-series data, i.e., data that are time-resolved and measured in equidistant time intervals. The continued developments in sensor and storage technology and the availability of database systems specifically designed for time-series data have also made it possible to record an exorbitant amount of such data. The vast yet readily available data places ever-increasing demands on data mining methods for fast and efficient knowledge discovery, which establishes the need for exceedingly fast algorithms.
 
The data mining research community has been actively investigating various avenues to develop algorithms for time series classification. Most research has focused on optimizing accuracy or error rate, although runtime performance and broad applicability are as important in practice. The result is a plethora of algorithms that have quadratic or higher computational complexities. Consequently, the algorithms have little to no use for deployment on a large scale.
 
This thesis addresses the complexity issue by introducing several time-series classification methods based on metaheuristics and randomized approaches to improve the state-of-the-art in time-series mining. We introduce three subsequence-based time series classification algorithms and an approximate distance measure for time series data. One subsequence-based time series classifier explicitly employs random sampling for subsequence discovery. The other two subsequence-based classifiers employ discretized time series data coupled with (i) a linear time and space string mining algorithm for extracting frequent patterns and (ii) a novel pattern sampling approach for discovering frequent patterns. The frequent patterns are translated back to subsequences for model induction. Both of these algorithms are up to two orders of magnitude faster than previous state-of-the-art algorithms. An extensive set of experiments establishes the effectiveness and classification accuracy of these methods against established and recently proposed methods.
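The random-sampling approach to subsequence discovery mentioned above can be sketched roughly as follows. This is our own toy illustration, not the thesis's algorithm: the function names are hypothetical, and the class-separation score is a plain mean-distance gap rather than the quality measures a real method would use.

```python
import random

def subseq_dist(series, shapelet):
    """Minimum Euclidean distance between a subsequence and any
    equal-length window of the series."""
    length = len(shapelet)
    best = float("inf")
    for i in range(len(series) - length + 1):
        d = sum((series[i + j] - shapelet[j]) ** 2 for j in range(length)) ** 0.5
        best = min(best, d)
    return best

def sample_shapelet(dataset, labels, length=3, n_samples=25, seed=0):
    """Randomly sample candidate subsequences and keep the one whose
    min-distance profile best separates the two classes (mean gap)."""
    rng = random.Random(seed)
    best_gap, best_shapelet = -1.0, None
    for _ in range(n_samples):
        series = rng.choice(dataset)
        start = rng.randrange(len(series) - length + 1)
        cand = series[start:start + length]
        d0 = [subseq_dist(s, cand) for s, y in zip(dataset, labels) if y == 0]
        d1 = [subseq_dist(s, cand) for s, y in zip(dataset, labels) if y == 1]
        gap = abs(sum(d0) / len(d0) - sum(d1) / len(d1))
        if gap > best_gap:
            best_gap, best_shapelet = gap, cand
    return best_shapelet, best_gap
```

Because candidates are sampled rather than enumerated, the cost is controlled by n_samples instead of growing quadratically with the number and length of the series, which is the speed-up that random sampling buys.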