Random gas mixtures for efficient gas sensor calibration

Applications like air quality, fire detection and detection of explosives require selective and quantitative measurements in an ever-changing background of interfering gases. One main issue hindering the successful implementation of gas sensors in real-world applications is the lack of appropriate calibration procedures for advanced gas sensor systems. This article presents a calibration scheme for gas sensors based on statistically distributed gas profiles with unique randomized gas mixtures. This enables a more realistic gas sensor calibration including masking effects and other gas interactions which are not considered in classical sequential calibration. The calibration scheme is tested with two different metal oxide semiconductor sensors in temperature-cycled operation using indoor air quality as an example use case. The results are compared to a classical calibration strategy with sequentially increasing gas concentrations. While a model trained with data from the sequential calibration performs poorly on the more realistic mixtures, our randomized calibration achieves significantly better results for the prediction of both sequential and randomized measurements, for example for acetone, benzene and hydrogen. Its statistical nature makes it robust against overfitting and well suited for machine learning algorithms. Our novel method is a promising approach for the successful transfer of gas sensor systems from the laboratory into the field. Due to the generic approach using


Motivation
Despite impressive advances in sensitivity, selectivity and response time of gas sensor systems over the last decades (Marco and Gutierrez-Galvez, 2012; Sharma et al., 2018), there is a striking lack of publications on successful field tests or real-world applications. A search on Google Scholar (from 31 March 2020) returns more than 3.4 million results for "gas sensor + material" and 553 000 results for "gas sensor + "data processing"", but only around 28 000 results for "gas sensor + "field test"". At the same time, field tests are a crucial link to the successful implementation of gas sensors in large-volume consumer applications (Borrego et al., 2016; Castell et al., 2017). Also, in our own experience, field test data are very often hard to interpret due to deviations from the ideal conditions during the original lab calibration, for example in terms of baseline and dynamics. We believe that one main issue hindering successful field tests is the lack of appropriate realistic calibration procedures for modern gas sensor systems. In many works, calibration is only a side note, a vehicle to show the performance of a new material or data processing method. The experimental design often consists of a few fixed concentration levels per gas, and, in many cases, the sensor is exposed to one and only one target gas at a time. The resulting data are relatively easy to evaluate in terms of sensitivity, selectivity and speed of response, but of little use for complex real-world scenarios.
Virtually all applications, for example air quality (Castell et al., 2017; Spinelle et al., 2017), fire detection (Kohl et al., 2001; Fonollosa et al., 2016), detection of explosives (Tomchenko et al., 2005; Yu et al., 2005) and breath analysis (Bajtarevic et al., 2009; Lourenço and Turner, 2014), require selective, quantitative measurements in an ever-changing background of interfering gases. A sensor calibration with single substances (as, for example, in the datasets of Fonollosa et al., 2015a, b; Fonollosa, 2016; Bastuck and Fricke, 2018) does not reveal any masking effects or other gas interactions altering the sensor response. Some publications take this into account by performing calibration with gas mixtures (Sundgren et al., 1991; Wolfrum et al., 2006; Zhang et al., 2013; Fonollosa, 2015; Sauerwald et al., 2018). Most of these, except two (Zhang et al., 2013; Fonollosa, 2015), use between three and five fixed concentration levels for each gas. This quantization of a continuous quantity can, with too few levels, easily lead to overfitting due to systematic errors in the experimental equipment, contamination of validation data through repetitions or misleading model performance measures.
T. Baur et al.: Random gas mixtures for efficient gas sensor calibration
In the past we could show good results in interlaboratory tests, as a first step towards a transferable calibration, with sequential calibration (Spinelle et al., 2017; Bastuck et al., 2018a; Sauerwald et al., 2018). However, there is still a gap between calibrating a sensor for interlaboratory tests and real-world scenarios (Karagulian et al., 2019).
In this paper, we present and test a calibration scheme based on the method of random effects (Oehlert, 2000). It tackles the mentioned issues by drawing random concentrations from predefined distributions of a, theoretically, arbitrary number of gases. The result is a large number of gas exposures for calibration, each a unique mixture of all available gases. The approach is easy to configure and use, can be applied to a wide range of target applications, and is shown to be superior to sequential calibration.

Study design
The calibration method with randomized gas mixtures is shown using the example of indoor air quality (IAQ) but can be applied to any application and target variable. The gases used for this study were chosen to represent different approaches in IAQ assessment. Volatile organic compounds (VOCs) are an important indicator of IAQ, as many of the substances show irritating or even toxic behavior. Generally, a VOC is any organic compound that can be found in the gas phase at room temperature. The European Union defines VOC as any organic compound with an initial boiling point less than or equal to 250 °C measured at standard pressure of 101.3 kPa (Anon, 2004). In analytical chemistry these VOCs are normally divided into three subgroups: very volatile organic compounds, volatile organic compounds and semivolatile organic compounds. Specific sampling and measurement protocols are associated with each group. However, from a health perspective, there is no need to treat these groups separately since both toxic and harmless compounds can be found in each. We will, therefore, subsume all three groups under the term VOC for direct-measuring gas sensor systems. The total sum of VOCs, TVOC (total VOCs), is one target value that can be used for calibration and is, for example, defined by the German Environment Agency (UBA) for IAQ classification (Seifert, 1999; Anon, 2007). A study on behalf of the UBA (Hofmann and Plieninger, 2008) lists the statistical distribution of more than 300 different VOCs in indoor environments. The VOCs can be divided into interfering VOCs and target VOCs with regard to human health: while the former are harmless in usual concentrations, the latter are mostly toxic or carcinogenic. Measuring all of these hundreds of VOCs in varying concentrations is not feasible, so a preselection must be made based on the expected concentrations.
Since our equipment (Helwig et al., 2014; Leidinger et al., 2018) is limited to six gases plus humidity, two representatives each were selected for inorganic background gases, interfering VOCs and target VOCs.
The carrier gas stream consists of zero air with varying humidity plus the background gases carbon monoxide and hydrogen. Carbon monoxide is a ubiquitous gas with highly variable concentrations ranging from the atmospheric background at 150 ppb (Schleyer et al., 2013) up to several ppm (WHO Regional Office for Europe, 2010). The atmospheric background concentration of hydrogen is 500 ppb (Schleyer et al., 2013). We could not find any studies on H2 concentration in indoor air. We assume large fluctuations up to the ppm range since hydrogen is emitted by humans (Levitt, 1969; Tomlin et al., 1991) and can, like CO2, be another indicator for human presence. For interfering VOCs we selected acetone and toluene, two common representatives with high average concentrations (Hofmann and Plieninger, 2008) but negligible health effects. The interfering gases were added to achieve a realistic TVOC concentration in indoor air (Hofmann and Plieninger, 2008). To represent the TVOC concentration with only two gases, they are supplied at 10 to 20 times the typical indoor concentrations. The target VOCs are two carcinogenic gases, formaldehyde and benzene. The concentration range of these target gases is based on the observed statistical distribution in indoor air (Hofmann and Plieninger, 2008) and WHO guidelines (WHO Regional Office for Europe, 2010; Anon, 2016). Since only a limited number of VOCs are present in this configuration, the sum of all measured VOCs is defined as VOC sum to clearly distinguish it from the common TVOC term.
The random mixtures were generated using a Python script (Bastuck, 2019) which iteratively determines the ratios of all components as shown schematically in Fig. 1. To generate a randomized gas mixture, the concentrations of the background components (carbon monoxide, hydrogen and humidity) and VOC sum were varied independently of each other. The concentrations of humidity, carbon monoxide, hydrogen and VOC sum are uniformly distributed over a realistic range (see Table 1). For the generation of the single VOC concentrations, the randomly selected VOC sum concentration is divided in several steps. First, the ratio of interfering (VOC interfering) and target (VOC target) VOCs in VOC sum is randomly selected to be between 0 and 20 % target VOC. Second, VOC interfering and VOC target are again divided randomly into the individual VOCs, both with a ratio between 0 and 100 %. The parameters for the generation are shown in Table 1, and the resulting concentration ranges for the single gases and VOC sum in Table 2.
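The hierarchical draw described above can be illustrated with a minimal sketch. It is not the script of Bastuck (2019); the function name, the dictionary keys and all numeric ranges are hypothetical placeholders standing in for Table 1, and uniform splitting ratios are assumed.

```python
import random

# Hypothetical ranges standing in for Table 1 (illustrative values only).
RANGES = {
    "humidity_pct_rh": (25.0, 75.0),
    "co_ppb": (100.0, 2000.0),
    "h2_ppb": (400.0, 2000.0),
    "voc_sum_ppb": (300.0, 1200.0),
}

def random_mixture(rng):
    """Draw one randomized gas mixture using the two-step VOC split."""
    # Background components and VOC sum are drawn independently and uniformly.
    mix = {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    voc_sum = mix["voc_sum_ppb"]
    # Step 1: between 0 and 20 % of VOC sum is assigned to the target VOCs.
    voc_target = voc_sum * rng.uniform(0.0, 0.20)
    voc_interfering = voc_sum - voc_target
    # Step 2: each group is split between its two gases with a 0-100 % ratio.
    acetone_share = rng.uniform(0.0, 1.0)
    formaldehyde_share = rng.uniform(0.0, 1.0)
    mix["acetone_ppb"] = voc_interfering * acetone_share
    mix["toluene_ppb"] = voc_interfering * (1.0 - acetone_share)
    mix["formaldehyde_ppb"] = voc_target * formaldehyde_share
    mix["benzene_ppb"] = voc_target * (1.0 - formaldehyde_share)
    return mix

rng = random.Random(0)  # fixed seed so the profile is reproducible
profile = [random_mixture(rng) for _ in range(99)]  # one measurement of 99 mixtures
```

By construction, the four single VOC concentrations of every mixture add up to the drawn VOC sum, so each exposure yields one calibration point for every gas and for VOC sum at once.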
Due to an error in the measurement setup, the concentrations of toluene and benzene were swapped: the concentrations planned for benzene were offered as toluene and vice versa. Therefore, the concentration levels of the carcinogenic benzene are rather high in this study compared to its true occurrence, while the concentrations of toluene are unusually low (ppb range). This does not have any impact on the general conclusions drawn from this experiment, but the results for selective quantification of these two VOCs should be interpreted with caution. The concentration distributions of the individual gases can be found in Fig. A1. Each randomized gas mixture was supplied to the sensors for 20 min. Twelve measurements with 99 randomized gas mixtures each were conducted over a period of 5 weeks, resulting in a total of 1188 randomized gas mixtures.
To compare the performance of our novel approach with a conventional sequential calibration strategy (one gas at a time, ascending concentration levels), a gas profile of this kind was also measured. Each gas was supplied at four different concentrations (see Table 3), which were kept constant for 20 min. The background gases (hydrogen and carbon monoxide) were always kept at their atmospheric concentrations (500 and 150 ppb) except during their exposures as target gas. The profile was repeated three times at different relative humidities (25, 50 and 75 %RH), resulting in a total of 72 different gas exposures (six gases at four concentrations and three humidities). The comparison was made only for the gas concentration ranges which were common to both calibration profiles.

Setup
In the overall measurement setup, a total of 11 different sensors were tested, seven of them metal oxide semiconductor gas sensors (MOS) and four gas-sensitive field effect transistors (GasFET). An overview of the results of all systems for a reduced dataset with the last five measurements and a slightly different evaluation method can be found in Bastuck (2019). The results and findings in this paper are shown for two analog sensors from ams, namely AS-MLV and AS-MLV-P2. They were chosen due to our long experience with these two types of sensors (Baur et al., 2015; Schütze et al., 2017; Schultealbert et al., 2018a). In recent interlaboratory tests we have also found that transferring a sensor calibration from one laboratory to another works with these types of sensor. However, we have also seen that missing gas concentrations can lead to misinterpretation in our models. In Sauerwald et al. (2018) we trained interfering gases with only a few gas concentrations, since each additional concentration would have meant a doubling of time. Therefore, we had problems with an extended humidity range, which was not covered by our calibration. In Bastuck et al. (2018a) we had a similar problem with hydrogen. Those previous issues make these sensors good candidates for this study on a more efficient calibration strategy. The sensors were not operated in the operating modes recommended by the respective manufacturers, but with a self-designed temperature-cycled operation (TCO) (Gramm and Schütze, 2003; Baur et al., 2015; Schütze and Sauerwald, 2019). The temperature cycle is chosen to benefit from the highly sensitive differential surface reduction (DSR) method. The total cycle for the presented sensors with a duration of 120 s is shown in Fig. 2. The MOS sensors were operated with electronics with logarithmic conductance measurement and resistance-based temperature control developed in our lab.
The gas mixtures were supplied by our gas mixing apparatus (GMA), which is described in detail in Helwig et al. (2014) and Leidinger et al. (2018). It consists of several mass flow controllers (MFCs) to supply carrier gas (zero air) and add the desired gas concentrations from gas cylinders. A two-stage cleaning process generates the zero air. Hydrocarbons (larger than C3) are removed efficiently in the first step with a carbon filter system. In the second step, humidity is removed with a pressure swing, and smaller hydrocarbons as well as hydrogen and carbon monoxide are removed by catalytic conversion. The test gases from the cylinders are diluted twice to achieve very low and highly variable concentrations while avoiding the impact of different impurities contained in the synthetic air (Helwig et al., 2014). Humidity is supplied from a washing bottle with HPLC-grade water at room temperature (22 °C), which is flushed with zero air at the desired flow rate.
Since several sensors ran in the same experiment and should not affect each other, the total flow of 400 mL/min supplied by the GMA was split into four independent lines. To ensure proper split ratios, flow restrictions (10 cm of 1/16" tubing) were installed in each line, dominating the total flow resistance of each line, given that the rest of the setup is built with 1/8" tubing (<25 cm per line, PTFE and stainless steel). The sensor chambers are made of PTFE and aluminum.

Evaluation methods
The evaluation is performed with the open-source software DAV³E (Bastuck et al., 2018b) and can be divided into five steps: feature extraction, dimensionality reduction, regression, hyperparameter optimization and testing. For feature extraction, the 120 s sensor cycle is divided into 120 equidistant ranges. In each of these ranges, the mean value and slope are computed, giving 240 features per sensor cycle in total. To prevent overfitting during modelling, a dimensionality reduction with principal component analysis (PCA) is carried out. For the next steps of modelling, the first 20 principal components are used as features. The quantification of the desired target value (concentration of a single gas or a partial gas mixture, e.g., VOC sum) is performed with partial least squares regression (PLSR). For hyperparameter optimization and testing we use two different procedures. For evaluations with reduced datasets of the measurement we use the holdout method for testing; for instance, 10 % of the dataset is excluded from training. For hyperparameter optimization, i.e., the determination of the number of PLSR components, a 10-fold cross-validation is applied. For evaluations with the complete dataset, a nested cross-validation, also known as double cross-validation (Stone, 1974), is performed for testing and hyperparameter optimization. We perform an outer 10-fold cross-validation for testing, by randomly dividing the data into 10 parts once. Each part in turn is set aside as the test dataset, while all other parts comprise the training dataset and are used to optimize the hyperparameters of the model. For this optimization, we also perform a 10-fold cross-validation on the training dataset for different numbers of PLSR components. In the inner loop, the training dataset of the outer loop is also randomly divided into 10 parts; nine parts are used for training and one for the hyperparameter validation.
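The feature extraction step can be sketched as follows. This is an illustration rather than the DAV³E implementation: the function name is hypothetical, the slope is taken from a least-squares line fit per range (in signal units per sample), and the sampling of 10 points per second in the usage line is an assumption.

```python
import numpy as np

def cycle_features(cycle, n_ranges=120):
    """Return the mean and slope of each of n_ranges consecutive segments of one cycle."""
    segments = np.array_split(np.asarray(cycle, dtype=float), n_ranges)
    means = np.array([seg.mean() for seg in segments])
    # Slope of a first-order least-squares fit over the sample index of each segment.
    slopes = np.array([
        np.polyfit(np.arange(seg.size), seg, 1)[0] if seg.size > 1 else 0.0
        for seg in segments
    ])
    # 120 means followed by 120 slopes: 240 features per sensor cycle.
    return np.concatenate([means, slopes])

# Assumed example: a 120 s cycle sampled at 10 Hz (1200 points), here a simple ramp.
features = cycle_features(np.arange(1200.0))
```

The resulting 240-element vectors of all cycles would then be stacked into a feature matrix before PCA and PLSR are applied.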
For nested cross-validation we treat all sensor cycles within the same gas exposure as one unit (group-based). Otherwise, very similar cycles could end up in both the training and test dataset of an iteration, effectively "contaminating" the training data and leading to over-optimistic performance estimates. The mean predictive performance for these validation sets is calculated for each number of PLSR components over the inner and outer loop. The best number of PLSR components is decided as the minimal number of PLSR components still giving a good predictive performance.
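A minimal sketch of the group-based splitting is given below. The helper name, the seeds and the layout of 100 exposures with seven cycles each are assumptions for illustration; model fitting is only indicated by comments.

```python
import numpy as np

def group_folds(groups, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) so that all cycles of one gas exposure stay together."""
    groups = np.asarray(groups)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.unique(groups))  # random assignment of exposures to folds
    for fold_groups in np.array_split(shuffled, n_folds):
        test_mask = np.isin(groups, fold_groups)
        yield np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical layout: 100 gas exposures with 7 usable cycles each.
groups = np.repeat(np.arange(100), 7)
for outer_train, outer_test in group_folds(groups, seed=1):
    # Inner loop: split only the outer training data, again by gas exposure.
    for inner_train, inner_val in group_folds(groups[outer_train], seed=2):
        train_idx = outer_train[inner_train]  # fit PCA and PLSR on these cycles
        val_idx = outer_train[inner_val]      # RMSECV for each number of PLSR components
    # Refit on outer_train with the selected component count; RMSEP on outer_test.
```

Because folds are formed from exposure labels rather than from individual cycles, no exposure contributes cycles to both sides of a split.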
Generally, different metrics are used to describe the performance of a regression model. Arguably the most prevalent is the coefficient of determination R², which describes the ratio of the explained to the total variance. Its range from 0 to 100 % is, however, hard to interpret in terms of, for example, accuracy and precision of a model. This interpretation becomes much easier for the root-mean-square error (RMSE) since it has the same unit as the model output. A distinction is made between the RMSE of calibration (RMSEC) for the training, the RMSE of cross-validation (RMSECV) for hyperparameter optimization and the RMSE of prediction (RMSEP) for testing. However, expecting the same precision between two models covering different concentration ranges is unrealistic. An RMSE of 50 ppb would be considered quite poor for formaldehyde (having an exposure limit of 80 ppb) but excellent for hydrogen. Since we choose the concentration ranges for all gases based on realistic data, it seems natural to define a metric "dynamic range" (DNR) as
$$\mathrm{DNR}_t = \frac{c_{\max,t}}{\mathrm{RMSE}_t} \qquad (1)$$
with the maximum concentration $c_{\max,t}$ and the root-mean-square error $\mathrm{RMSE}_t$ for the target $t$. While not transferable to arbitrary applications, the DNR allows comparison of sensor and model performances for different gases and concentration ranges in this case. To find the optimal number of PLSR components, we calculate $\mathrm{RMSECV}_{n,i,j}$ for each number of PLSR components $n \in N$, $N = \{x \in \mathbb{Z} \mid 1 \le x \le 20\}$, for all 10 cross-validation folds $i \in I$, $I = \{x \in \mathbb{Z} \mid 1 \le x \le 10\}$, in all 10 testing folds $j \in J$, $J = \{x \in \mathbb{Z} \mid 1 \le x \le 10\}$. Thereby, the maximum number of PLSR components is limited to the number of predictor variables, in this case the first 20 principal components. $\mathrm{RMSECV}_n$ is the mean value over all folds at the same $n$. We selected the number of PLSR components $n_{\mathrm{sel}}$ with Eq. (2).
$$n_{\mathrm{sel}} = \min\left\{ n \in N \;\middle|\; \mathrm{RMSECV}_n < \mathrm{RMSECV}_{n_{\min}} + \mathrm{SD}_{i \in I,\, j \in J}\left(\mathrm{RMSECV}_{n_{\min},i,j}\right) \right\} \quad \text{with} \quad n_{\min} = \arg\min_{n \in N} \mathrm{RMSECV}_n \qquad (2)$$
This means we take the minimum number of PLSR components for which $\mathrm{RMSECV}_n$ is less than $\mathrm{RMSECV}_{n_{\min}}$ plus the standard deviation of the RMSECV over all folds at the point of the minimum. A visualization of the data evaluation procedure can be found in Appendix B as pseudocode. Figure 4 shows the selection of the best number of PLSR components according to Eq. (2).
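The selection rule of Eq. (2) can be sketched in a few lines. The function name and the array layout are assumptions, and the population standard deviation (NumPy default) is used because the text does not state which estimator was applied.

```python
import numpy as np

def select_n_components(rmsecv):
    """Apply Eq. (2) to rmsecv of shape (n_components, inner_folds, outer_folds).

    Returns (n_sel, n_min) as 1-based numbers of PLSR components.
    """
    rmsecv = np.asarray(rmsecv, dtype=float)
    flat = rmsecv.reshape(rmsecv.shape[0], -1)   # all folds i, j for each n
    mean_per_n = flat.mean(axis=1)               # RMSECV_n
    n_min = int(np.argmin(mean_per_n))           # position of the absolute minimum
    threshold = mean_per_n[n_min] + flat[n_min].std()
    candidates = np.flatnonzero(mean_per_n < threshold)
    # Fall back to n_min if the spread is zero and no value is strictly below the threshold.
    n_sel = int(candidates[0]) if candidates.size else n_min
    return n_sel + 1, n_min + 1

# Synthetic example: mean RMSECV of 100, 80, 62, 61 and 60 ppb with a fold spread of 3 ppb.
means = np.array([100.0, 80.0, 62.0, 61.0, 60.0])
demo = (means[:, None] + np.array([-3.0, 3.0, -3.0, 3.0])).reshape(5, 2, 2)
n_sel, n_min = select_n_components(demo)
print(n_sel, n_min)  # 3 5
```

In the synthetic example the minimum lies at five components, but three components already fall below the boundary of 63 ppb, so the smaller model is chosen.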

Results and discussion
Twelve measurements were performed. Each of the 1188 gas exposures contains 10 sensor cycles. Due to the time constant of the gas exchange, we omitted two sensor cycles at the beginning and one cycle at the end of the gas exposure in the evaluation. Therefore, we have a total of seven useful cycles per gas exposure, amounting to 8316 from the complete measurement campaign. Two and a half measurements (numbers 5, 6 and 7), in sum 245 random gas exposures, had formaldehyde completely missing because the bottle had run empty. Additionally, 74 random gas exposures are missing for the AS-MLV-P2 and 115 for the AS-MLV due to issues with the sensor system. Therefore, we can use 828 (AS-MLV) or 869 (AS-MLV-P2) random gas exposures for formaldehyde models and 1073 (AS-MLV) or 1114 (AS-MLV-P2) for all other models. Figure 3 shows an example of a PLSR model for the AS-MLV-P2 for quantification of carbon monoxide. For better visualization we reduced the dataset: this model was trained with 198 randomized gas exposures (measurements 8 and 9); the hyperparameter optimization was done by 10-fold cross-validation. The dotted lines show the RMSECV of the hyperparameter optimization; the red circles show the predicted carbon monoxide concentration from 99 additional randomized gas exposures (measurement 10). A good agreement of the reduced dataset with an RMSECV of 57.3 ppb and an RMSEP of 73.9 ppb is found. This means the unknown measurement can be predicted with a DNR of 27 in the range of 100 to 2000 ppb carbon monoxide.
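The DNR quoted for this carbon monoxide model can be reproduced directly, taking the DNR as the maximum concentration of the target divided by the RMSE in the same unit. The function name below is illustrative.

```python
def dynamic_range(c_max, rmse):
    """DNR: maximum concentration of the target divided by the RMSE (same unit)."""
    if rmse <= 0:
        raise ValueError("rmse must be positive")
    return c_max / rmse

# Carbon monoxide, AS-MLV-P2: upper range limit 2000 ppb, RMSEP 73.9 ppb.
dnr_co = dynamic_range(2000.0, 73.9)
print(round(dnr_co))  # 27
```

Because the ratio is dimensionless, the same function can be used to compare models for gases with very different concentration ranges.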
For the evaluation of the complete measurement campaign, 10-fold nested cross-validation is used. Figure 4 shows the hyperparameter optimization for the selection of the number of PLSR components according to Eq. (2) as an example for the quantification of carbon monoxide with the AS-MLV-P2. The dark and light grey bars show the RMSECV and the RMSEP, respectively; the error bars indicate the standard deviation of the cross-validation folds. The red bar indicates the absolute minimum of the RMSECV at $n_{\min} = 15$. The dotted red line represents $\mathrm{RMSECV}_{n_{\min}} + \mathrm{SD}_{i \in I,\, j \in J}\left(\mathrm{RMSECV}_{n_{\min},i,j}\right)$ as a boundary for selecting the number of PLSR components. The orange bar indicates the RMSECV for the number of PLSR components $n_{\mathrm{sel}}$ selected according to Eq. (2), i.e., the minimum number with an RMSECV below the defined boundary, in this case $n_{\mathrm{sel}} = 6$. It shows that we can achieve a similarly good result (i.e., a low RMSECV) with a small number of PLSR components compared to the minimum of the RMSECV. Figure 5 shows the R² value for both AS-MLV and AS-MLV-P2 for different models. All models except those for formaldehyde and toluene achieve an R² over 0.86, and even over 0.94 with the exclusion of benzene. This indicates that a satisfying quantification of VOC sum and all gases except formaldehyde and toluene is possible with both sensors. The performance of the models is assessed with the RMSEP in Fig. 6 and the DNR in Fig. 7. Similar RMSEP values are achieved with both sensors for the different models. The regression models of AS-MLV and AS-MLV-P2 show the best performance for carbon monoxide with a DNR of 31. The regression models for acetone and hydrogen also achieve satisfactory results with a DNR between 16 and 18. The DNR for benzene with a value of 13 is relatively low considering the (unrealistically) high concentrations.
The two gases with very low concentrations, toluene and formaldehyde, cannot be selectively quantified in this complex background, indicated by a DNR below 6. VOC sum can be quantified with a DNR of 18-19 independent of the unit (µg/m³ or ppb). This is interesting because the two dominating VOCs, acetone and benzene, represent different chemical classes and have a 30 % difference in molecular weight.
For a comparison between randomized and sequential calibration methods, we compare different combinations of training/validation and testing (Table 4). For compatibility, the randomized dataset with a higher concentration dynamic is reduced to a dataset in which all concentrations are in the range of 0-120 % of the sequential measurement, resulting in 153 gas exposures. The distribution of all gases and VOC sum is shown in Fig. A2. Since the last six gas exposures (75 %RH, 750 and 100 ppb benzene, all formaldehyde concentrations) are missing from the sequential dataset due to a technical error, there are 66 sequential gas mixtures in total. Combination (α) shows the evaluation of the reduced randomized dataset with 153 gas exposures. For the evaluation we used 10-fold nested cross-validation for hyperparameter optimization and testing like the evaluation in Figs. 6 and 7. We split the reduced dataset from the randomized measurement for combinations (β) to (δ) into two datasets. The first dataset contains the first 72 randomized gas exposures for training and hyperparameter optimization, and the second dataset the remaining 81 for testing. This allows us to compare randomized calibration with sequential testing and vice versa. The hyperparameter optimization during the training was always based on 10-fold cross-validation for randomized and sequential training.
Figure 9. Dynamic range (DNR) of AS-MLV-P2 for different training and testing models. All models use 10-fold cross-validation for hyperparameter optimization; the resulting number of PLSR components, determined with Eq. (2), is given in parentheses. A detailed description of (α)-(δ) is given in Table 4.
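The reduction of the randomized dataset to the sequential concentration range can be sketched as a simple mask. The function name, the two-gas example and its concentration values are hypothetical, since Table 3 is not reproduced here; only the 120 % factor and the 72/81 split are taken from the text.

```python
import numpy as np

def within_sequential_range(random_conc, sequential_max, factor=1.2):
    """Boolean mask: True where every gas lies within 0 to factor times the sequential maximum."""
    random_conc = np.asarray(random_conc, dtype=float)
    limits = factor * np.asarray(sequential_max, dtype=float)
    return np.all((random_conc >= 0.0) & (random_conc <= limits), axis=1)

# Hypothetical example: three exposures, two gases (ppb); sequential maxima 1000 and 500 ppb.
random_conc = np.array([[500.0, 300.0], [1500.0, 200.0], [1100.0, 590.0]])
sequential_max = np.array([1000.0, 500.0])
mask = within_sequential_range(random_conc, sequential_max)
reduced = random_conc[mask]

# In the study, the first 72 reduced exposures serve for training and the remaining 81 for testing.
n_train = 72
train, test = reduced[:n_train], reduced[n_train:]
```

An exposure is kept only if all of its gases fall inside the common range, which is why the reduced dataset is much smaller than the full randomized campaign.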
Comparing the results of (α) and the previous evaluation in Figs. 6 and 7 shows the influence of reducing the randomized dataset for better compatibility with the sequential test scenario. The results of (α) and (β) show the influence of the two different evaluations. (γ) explores the prediction ability of a model trained with randomized data for sequential data, and (δ) vice versa. The performances of these four models are compared in Fig. 8 (RMSEP) and Fig. 9 (DNR) for the AS-MLV-P2. The AS-MLV shows similar results, which can be found in Figs. C1 and C2. The RMSEPs of the models with randomized training (α) to (γ) are close together. The only exception is sequential testing, i.e., model (γ), for benzene, producing a significantly larger RMSEP. The reverse case, i.e., model (δ) predicting randomized data after a sequential training, results in considerably larger RMSEPs in practically all cases. Despite the RMSEPs being similar for (α) to (γ) and Fig. 6, the DNR (Fig. 9) reveals the superiority of the results shown in Fig. 7 trained with a larger concentration range. The comparison between the randomized (β, γ) and sequential (δ) training of the reduced dataset only shows similar performance for carbon monoxide. The randomized data are obviously more challenging to predict and, at the same time, provide a better model with a higher DNR for prediction, which is to be expected due to the much larger variability of the background. At the same time, this allows for more efficient training closer to reality, since one data point is obtained for each gas from each gas mixture. Comparing the PLSR models for VOC sum (in ppb) for combinations (β), (γ) and (δ) from Table 4 indicates that classical sequential calibration (see Fig. 10b) is a subset of the randomized calibration presented here (see Fig. 10a). The models trained with randomized mixtures in Fig. 10a and b show a slightly larger RMSECV compared to the sequential training shown in Fig. 10c.
However, only these random models can accurately and precisely predict both the randomized and sequential dataset. The sequentially trained (and validated) model in Fig. 10c achieves a slightly lower RMSECV but fails to predict the more complex randomized dataset. Note that the measurement duration for both datasets is identical.

Conclusion and outlook
In this paper an efficient and effective gas sensor calibration based on randomized gas mixtures is presented. The results are compared with a classical calibration strategy based on individual gas exposures with sequentially increasing concentrations and fixed steps. While a model trained with data from the sequential calibration performs poorly in the more realistic case of complex gas mixtures, the novel randomized calibration achieves very promising results for all tested datasets, making it more effective. Since generating the required data with randomized gas mixtures does not take more time (and could, potentially, take considerably less for more targets) than the classical sequential calibration strategy, it is also more efficient. Our method was developed and tested with the real-world application of indoor air quality monitoring in mind and thus presents an important tool for the successful transfer of chemical sensors from the laboratory to the field. Its statistical nature makes it robust against overfitting and well suited for machine learning algorithms.
Since only single gases were measured sequentially in the study presented here, an investigation of the performance and stability of sequential calibrations with fully sampled combinations should follow for a more complete comparison to the randomized strategy. The aim of these investigations should be to determine the ideal number of randomized mixtures for obtaining a reliable model for predicting the concentration of an individual gas or a gas mixture. To check for generalizability, tests with different mixture compositions, for example by replacing one or two gases, will also be considered.
The six gases investigated in this work are probably not enough to fully characterize the performance of sensors for indoor air quality assessment, especially for a quantification of a single VOC. Therefore, the complexity, for example the number of backgrounds and interfering and target gases, should be rigorously increased in order to get closer to reality. A next step is the development of new gas mixing apparatus allowing a higher number of gases to be measured. By testing different distributions, efficiency and performance could be further improved. In addition, extensive field tests with reference analysis are necessary to demonstrate the advantage of the calibration strategy for real-world applications.
Appendix A: Histogram of the complete and reduced dataset
Figure A1. Concentration histogram of the observations in the complete measurement campaign for all gases and VOC sum.
Figure A2. Concentration histogram of the observations for the reduced dataset (comparison between the randomized and sequential measurement) for all gases and VOC sum.
Data availability. The data presented in this article are stored in an internal system according to the guidelines of the German Research Foundation (DFG). The data for the evaluation can also be found at https://doi.org/10.5281/zenodo.4264224 (Baur et al., 2020).
Author contributions. TB, MB and CS conceptualized the project. TB and CS built the setup for the MOS sensors. MB wrote the software for the randomized calibration. TB and MB performed the measurements and evaluation. TB visualized the results and, together with MB and CS, wrote the original draft of the paper. TS and AS reviewed and edited the paper.