Research
Statistics meets Linguistics
Currently, Claus Weihs works on statistical methods for the analysis of linguistic data. The methods are summarized in PrInDT, an R package for the optimization of conditional inference trees (ctrees) for classification and regression. For optimization, the model space is searched by means of repeated subsampling for the tree that performs best on the full sample. Restrictions can be imposed so that only trees are accepted that do not contain pre-specified, uninterpretable split results. With the PrInDT package, both the predictive power and the interpretability of ctrees are increased, and the performance of ensembles is compared with that of individual trees.
The package covers the optimization of ctrees for 2-level, multilevel, and multilabel classifications as well as for regression. Subsampling percentages can be varied for the classes in classification and for observations and predictors in regression. Furthermore, the posterior distribution of a specified variable in the terminal nodes of a given tree can be analyzed.
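The search strategy can be illustrated with a minimal, self-contained sketch in Python. This is not the PrInDT package itself (which optimizes full conditional inference trees in R); a one-split stump stands in for a ctree, and the toy data, function names, and the `forbidden` mechanism are illustrative assumptions only:

```python
import random

def fit_stump(X, y):
    """Fit a one-split 'tree' (a stand-in for a full ctree): pick the
    feature/threshold pair with the best accuracy on the subsample."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            pred_l = max(set(left), key=left.count)
            pred_r = max(set(right), key=right.count)
            acc = (left.count(pred_l) + right.count(pred_r)) / len(y)
            if best is None or acc > best[0]:
                best = (acc, j, t, pred_l, pred_r)
    return best  # (subsample accuracy, feature, threshold, left label, right label)

def predict(stump, row):
    _, j, t, pred_l, pred_r = stump
    return pred_l if row[j] <= t else pred_r

def best_tree_by_subsampling(X, y, n_rep=50, frac=0.7, forbidden=()):
    """Repeatedly subsample, fit a tree on each subsample, discard trees
    that split on 'forbidden' (uninterpretable) features, and keep the
    tree with the highest accuracy on the full sample."""
    random.seed(0)  # reproducible subsamples for this illustration
    n = len(X)
    best_tree, best_acc = None, -1.0
    for _ in range(n_rep):
        idx = random.sample(range(n), int(frac * n))
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        if stump is None or stump[1] in forbidden:
            continue  # interpretability restriction: reject this tree
        acc = sum(predict(stump, r) == t for r, t in zip(X, y)) / n
        if acc > best_acc:
            best_tree, best_acc = stump, acc
    return best_tree, best_acc

# Toy two-class example: feature 0 separates the classes perfectly.
X = [[0.1, 5], [0.2, 3], [0.3, 9], [0.8, 4], [0.9, 8], [1.0, 2]]
y = [0, 0, 0, 1, 1, 1]
tree, acc = best_tree_by_subsampling(X, y)
```

The interpretability restriction is enforced by rejection: a tree that splits on a forbidden feature is simply discarded, and the search continues with the next subsample.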
How to cite
Weihs, C., Buschfeld, S. (2023): PrInDT: Prediction and Interpretation in Decision Trees for Classification and Regression, R package version 1.0, URL: https://CRAN.R-project.org/package=PrInDT.
Related publications
Weihs, C., Buschfeld, S. (2021a). "Combining Prediction and Interpretation in Decision Trees (PrInDT) - a Linguistic Example", Online: arXiv:2103.02336
Weihs, C., Buschfeld, S. (2021b). "NesPrInDT: Nested undersampling in PrInDT", Online: arXiv:2103.14931
Weihs, C., Buschfeld, S. (2021c). "Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction, and ranking of predictors in ensembles", Online: arXiv:2108.05129
Classification
is an omnipresent challenge. It is therefore not surprising that classification methods are being developed in many sciences. Recently, however, the number of available methods has exploded, so that it is now a challenge to find the right procedure for an application problem, or to adapt existing procedures optimally to such a problem. Moreover, the literature on interpreting the results of classification procedures seems rather thin compared to the steady stream of newly proposed procedures, even though ease of interpretation is increasingly demanded as a matter of course by the users of these procedures.
Music and Statistics
The aim of this project is to automatically classify vocal interpretations in terms of tonal purity and various sound properties. In this context, the physiological properties of the ear in perceiving sound are also of interest. Automatic transcription, i.e. the conversion of sound into musical notation, is another research focus. First results were obtained in a practical course in the winter semester 1999/2000. Further results can be found as publications in the Technical Reports of the SFB 475, as well as in the Working and Research Reports of the Department of Statistics.
Statistical Methods for Quality Assurance and Optimization
Quality monitoring and optimization are becoming increasingly important in the chemical industry with regard to cost reduction, certification and customer requirements. Statistical methods, especially desirability indices, are essential tools in this context.
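The idea behind a desirability index can be sketched in a few lines of Python. The linear one-sided desirability and the two hypothetical responses (yield and purity) are illustrative assumptions; real applications use the Derringer–Suich family with case-specific bounds, targets, and weights:

```python
import math

def d_larger_is_better(y, low, high):
    """One-sided desirability: 0 below `low` (unacceptable),
    1 above `high` (fully satisfactory), linear ramp in between."""
    if y <= low:
        return 0.0
    if y >= high:
        return 1.0
    return (y - low) / (high - low)

def overall_desirability(ds):
    """Desirability index: geometric mean of the individual desirabilities;
    a single d = 0 (an unacceptable response) forces the index to 0."""
    if any(d == 0.0 for d in ds):
        return 0.0
    return math.exp(sum(math.log(d) for d in ds) / len(ds))

# Hypothetical process with two quality responses: yield (%) and a purity score.
d_yield = d_larger_is_better(85.0, low=70.0, high=95.0)
d_purity = d_larger_is_better(0.9, low=0.5, high=1.0)
D = overall_desirability([d_yield, d_purity])
```

Because the index is a geometric mean, one completely unacceptable response drags the overall index to 0, which matches how competing quality requirements are combined in practice.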
(Statistical) design of experiments
attempts to investigate the relationship between target variables and factors possibly influencing them as completely as possible with as few experiments as possible. The aim is to identify those factors that really have an influence on the target variables and to determine those values of these factors that optimize (maximize / minimize) a target variable. A new area of research is the design of experiments on existing observational data for variable selection. For the EMILeA-stat project, an interactive teaching and learning environment (e-learning), the EMILeA Chemicals AG scenario for statistical experimental design was developed.
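As a small illustration of these ideas, the following Python sketch generates a two-level full factorial design and estimates main effects; the response formula is a made-up example in which factor C has no influence:

```python
from itertools import product

def full_factorial(k):
    """All 2**k runs of a two-level full factorial design, coded -1/+1."""
    return list(product((-1, 1), repeat=k))

def main_effects(design, y):
    """Main effect of factor j: mean response at level +1 minus mean at -1."""
    k = len(design[0])
    effects = []
    for j in range(k):
        hi = [resp for run, resp in zip(design, y) if run[j] == 1]
        lo = [resp for run, resp in zip(design, y) if run[j] == -1]
        effects.append(sum(hi) / len(hi) - sum(lo) / len(lo))
    return effects

# Hypothetical response depending on factors A and B but not on C:
design = full_factorial(3)                      # 8 runs
y = [10 + 3 * a - 2 * b for a, b, c in design]  # C has no influence
effects = main_effects(design, y)
```

With only 2^3 = 8 runs, the estimated effects identify A and B as influential and C as inert, which is exactly the screening task described above.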
Life sciences
- Diagnostic methods are required, for example, in the development, optimization and validation of test systems and automatic analyzers. In this field, new statistical methods and approaches are needed to optimize existing processes and to deal with new problems appropriately. For example, the long-term accuracy of diagnostic tests in routine diagnostics can only be guaranteed by optimal calibration procedures. In the field of genomics, proteomics, peptidomics, etc., new diagnostic methods are being sought, and statistics is significantly involved in study planning, implementation and evaluation for the approval of diagnostic methods. The Department of Statistics and Roche Diagnostik GmbH in Penzberg/Bavaria cooperate in research in these areas.
- Linguistic information in neuronal responses: In cooperation with the Fraunhofer Institute for Digital Media Technology (IDMT) in Ilmenau/Thuringia, information is searched for in the neuronal response at the human auditory nerve. Based on a simulation model of the inner ear, speech input is automatically recognized.
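The calibration task mentioned above can be illustrated with a deliberately simple sketch: fit a straight calibration line to known standards and read unknown concentrations off it by inverse prediction. The numbers and function names are hypothetical; routine diagnostics uses considerably more elaborate (often nonlinear, weighted) calibration models:

```python
def fit_calibration_line(conc, signal):
    """Least-squares line signal = a + b * conc from known standards."""
    n = len(conc)
    mx, my = sum(conc) / n, sum(signal) / n
    b = (sum((x - mx) * (s - my) for x, s in zip(conc, signal))
         / sum((x - mx) ** 2 for x in conc))
    a = my - b * mx
    return a, b

def inverse_predict(a, b, measured_signal):
    """Read an unknown concentration off the calibration line."""
    return (measured_signal - a) / b

# Hypothetical standards following signal = 0.1 + 2 * concentration:
a, b = fit_calibration_line([0, 1, 2, 3, 4], [0.1, 2.1, 4.1, 6.1, 8.1])
unknown = inverse_predict(a, b, 5.1)  # concentration of a new specimen
```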
Explorative data analysis
includes tools for visualizing data and dependencies between different data series. The idea is to let the data speak for themselves. In this way, salient features of the individual data series (groups, outliers) should become apparent, and indications should be found of correlations between different variables that were not expected in advance. Such correlations are then used to investigate the extent to which certain variables can be predicted from others.
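A minimal numeric companion to such visual tools is the Pearson correlation between two data series; the sketch below is a plain-Python illustration, not a substitute for the plots themselves:

```python
import math

def pearson(x, y):
    """Pearson correlation: a first numeric hint of linear dependence
    between two data series; plots should confirm what it suggests."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

r = pearson([1, 2, 3, 4, 5], [3, 5, 7, 9, 11])  # perfectly linear series
```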
Expert systems
are computer systems that attempt to copy a domain expert's approach to solving a problem. Statistical expert systems implement a statistician's knowledge on a computer, with the aim of concretizing the state of knowledge (what is really known, and what is not (yet)?) and sharing that knowledge.
Errors-in-Variables Models
are mathematical/statistical approximations of real relationships between target variables and the factors influencing them, in which measurement errors in the factors are also modeled. Standard models, on the other hand, assume that measurement errors occur only in the target variables, i.e. that the factors are completely 'under control'.
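The practical consequence of ignoring measurement error in a factor, namely attenuation of the estimated effect toward zero, can be demonstrated with a small simulation. The true slope of 2, the unit variances, and the sample size are illustrative assumptions; with equal variances for the true factor and the measurement error, ordinary least squares recovers only about half the true slope:

```python
import random

def ols_slope(x, y):
    """Ordinary least-squares slope of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return (sum((a - mx) * (b - my) for a, b in zip(x, y))
            / sum((a - mx) ** 2 for a in x))

random.seed(1)
n = 20000
latent = [random.gauss(0, 1) for _ in range(n)]      # true factor, variance 1
y = [2.0 * x for x in latent]                        # target, true slope 2
observed = [x + random.gauss(0, 1) for x in latent]  # measurement error, variance 1

slope_true = ols_slope(latent, y)      # close to the true slope 2
slope_noisy = ols_slope(observed, y)   # attenuated toward 0, about 1
```

In exact terms, the OLS slope converges to beta * sigma_x^2 / (sigma_x^2 + sigma_u^2), here 2 * 1/(1 + 1) = 1, which is why errors-in-variables models treat the measurement error explicitly.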
Numerical methods
are calculation rules (algorithms) for solving mathematical/statistical problems for given values of the input variables. The development of such procedures received new impetus from increasing computerization. The aim is to calculate the correct result with the highest possible accuracy in the shortest possible time, taking all special cases into account.
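A classic example of such a calculation rule is Newton's iteration for the square root, shown here as a short Python sketch (the stopping tolerance and starting value are illustrative choices):

```python
def newton_sqrt(a, tol=1e-12):
    """Newton's iteration x_new = (x + a/x) / 2 for the square root of a,
    with a relative stopping criterion on the residual x*x - a."""
    if a == 0:
        return 0.0  # special case: the iteration is not needed
    x = a if a >= 1 else 1.0  # start above the root so the iteration descends
    while abs(x * x - a) > tol * a:
        x = 0.5 * (x + a / x)
    return x
```

Each step roughly doubles the number of correct digits (quadratic convergence), so only a handful of iterations are needed even for tight tolerances.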