water_benchmark_hub.gecco_waterquality

water_benchmark_hub.gecco_waterquality.gecco_water_quality

This module provides classes for loading the different GECCO water quality data sets.

class water_benchmark_hub.gecco_waterquality.gecco_water_quality.GeccoWaterQuality

Bases: BenchmarkResource

Base class for the GECCO Water Quality 2017–2019 benchmarks.

Note that the scoring/evaluation algorithm is the same for all GECCO water quality benchmarks and is implemented in compute_evaluation_score().

static compute_evaluation_score(y_pred: numpy.ndarray, y: numpy.ndarray) float

Evaluates the performance of a detection method.

Note

All GECCO water quality challenges use the F1-score for evaluation.

Parameters:
  • y_pred (numpy.ndarray) – Predicted event indication over time.

  • y (numpy.ndarray) – Ground truth event indication over time.

Returns:

Evaluation score.

Return type:

float
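As an illustration of the metric named in the note above, the following is a minimal sketch of an F1-score computation for binary event labels. It is not the library's implementation: the function name f1_score_sketch, the use of plain Python sequences instead of Numpy arrays, and the convention of returning 0.0 when no positives occur are all assumptions made for this example.

```python
def f1_score_sketch(y_pred, y_true):
    """F1-score for binary event labels (1 = event, 0 = no event).

    Hypothetical helper for illustration only; it accepts any two
    equally long sequences of 0/1 values.
    """
    pairs = list(zip(y_pred, y_true))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # events correctly flagged
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # false alarms
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # missed events
    denom = 2 * tp + fp + fn
    # Assumed convention: score 0.0 if there are no predicted or true events.
    return 2 * tp / denom if denom else 0.0


y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1]
score = f1_score_sketch(y_pred, y_true)
print(score)
```

In this toy example there are two true positives, one false alarm, and one missed event, so the score is 2·2 / (2·2 + 1 + 1) = 2/3.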

static get_meta_info() dict

Gets the meta information of this resource.

Returns:

Meta info.

Return type:

dict

class water_benchmark_hub.gecco_waterquality.gecco_water_quality.GeccoWaterQuality2017

Bases: GeccoWaterQuality

Class for loading the original GECCO Industrial Challenge 2017 Dataset: A water quality dataset for the “Monitoring of drinking-water quality” competition organized by M. Friese, J. Stork, A. Fischbach, M. Rebolledo, T. Bartz-Beielstein at the Genetic and Evolutionary Computation Conference 2017, Berlin, Germany.

This is a benchmark for anomaly detection algorithms on water quality. The data is provided by the “Thüringer Fernwasserversorgung” (Germany) and constitutes a real-world data set. In this data set, 9 numeric water quality features are given at a sampling rate of 1 min over approx. 3 months. The goal is to predict the presence of an anomaly – i.e., a binary classification task.

More information can be found at https://zenodo.org/records/3884465 and http://www.spotseven.de/gecco-challenge/gecco-challenge-2017/

static get_meta_info() dict

Gets the meta information of this resource.

Returns:

Meta info.

Return type:

dict

static load_data(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) pandas.DataFrame | tuple[numpy.ndarray, numpy.ndarray]

Loads the original GECCO Industrial Challenge 2017 Dataset.

Note

This is NOT a simulated scenario; therefore, only the final data set is provided.

Parameters:
  • download_dir (str, optional) –

    Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

    The default is None.

  • return_X_y (bool, optional) –

    If True, the data is returned together with the labels as two Numpy arrays, otherwise the data is returned as a Pandas data frame.

    The default is True.

  • verbose (bool, optional) –

    If True, a progress bar is shown while downloading files.

    The default is True.

Returns:

The benchmark data set as either a Pandas data frame or as a pair of (X, y) Numpy arrays.

Return type:

pandas.DataFrame or tuple[numpy.ndarray, numpy.ndarray]
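A typical workflow combines load_data() with the inherited compute_evaluation_score(). The sketch below shows one way to wire the two together. The helper name evaluate_detector and its predict argument are hypothetical and not part of the package; the sketch only assumes that the resource class exposes the two static methods with the signatures documented here.

```python
def evaluate_detector(resource, predict, download_dir=None):
    """Load a benchmark as (X, y), run a detector, and score its output.

    `resource` is any class offering the documented static methods
    load_data() and compute_evaluation_score(); `predict` is a
    user-supplied callable mapping X to event indications (hypothetical).
    """
    # return_X_y=True yields the documented (X, y) pair of arrays.
    X, y = resource.load_data(download_dir=download_dir,
                              return_X_y=True, verbose=False)
    y_pred = predict(X)
    # The scoring method is shared by all GECCO water quality benchmarks.
    return resource.compute_evaluation_score(y_pred, y)
```

With the package installed, passing GeccoWaterQuality2017 as resource should work under these assumptions; note that the first call may download the data files to download_dir or the temp folder.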

class water_benchmark_hub.gecco_waterquality.gecco_water_quality.GeccoWaterQuality2018

Bases: GeccoWaterQuality

Class for loading the GECCO Industrial Challenge 2018 Dataset: A water quality dataset for the “Internet of Things: Online Anomaly Detection for Drinking Water Quality” competition organized by F. Rehbach, M. Rebolledo, S. Moritz, S. Chandrasekaran, T. Bartz-Beielstein at the Genetic and Evolutionary Computation Conference 2018, Kyoto, Japan.

This is a benchmark (based on GeccoWaterQuality2017) for anomaly detection algorithms on water quality. The data is provided by the “Thüringer Fernwasserversorgung” (Germany) and constitutes a real-world data set. In this data set, 9 numeric water quality features are given at a sampling rate of 1 min over approx. 3 months. The goal is to predict the presence of an anomaly – i.e., a binary classification task.

More information can be found at https://zenodo.org/records/3884398 and http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2018/

static get_meta_info() dict

Gets the meta information of this resource.

Returns:

Meta info.

Return type:

dict

static load_data(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) pandas.DataFrame | tuple[numpy.ndarray, numpy.ndarray]

Loads the GECCO Industrial Challenge 2018 Dataset.

Note

This is NOT a simulated scenario; therefore, only the final data set is provided.

Parameters:
  • download_dir (str, optional) –

    Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

    The default is None.

  • return_X_y (bool, optional) –

    If True, the data is returned together with the labels as two Numpy arrays, otherwise the data is returned as a Pandas data frame.

    The default is True.

  • verbose (bool, optional) –

    If True, a progress bar is shown while downloading files.

    The default is True.

Returns:

The benchmark data set as either a Pandas data frame or as a pair of (X, y) Numpy arrays.

Return type:

pandas.DataFrame or tuple[numpy.ndarray, numpy.ndarray]

class water_benchmark_hub.gecco_waterquality.gecco_water_quality.GeccoWaterQuality2019

Bases: GeccoWaterQuality

Class for loading the GECCO Industrial Challenge 2019 Dataset: A water quality dataset for the “Internet of Things: Online Event Detection for Drinking Water Quality Control” competition organized by F. Rehbach, S. Moritz, T. Bartz-Beielstein at the Genetic and Evolutionary Computation Conference 2019, Prague, Czech Republic.

This is a benchmark (based on GeccoWaterQuality2018) for anomaly detection algorithms on water quality. The data is provided by the “Thüringer Fernwasserversorgung” (Germany) and constitutes a real-world data set. In this data set, 6 numeric water quality features are given at a sampling rate of 1 min over approx. 3 months. The goal is to predict the presence of an anomaly – i.e., a binary classification task. The data set itself comes in three splits: a training set, a validation set, and a test set.

More information can be found at https://zenodo.org/records/4304080 and https://www.th-koeln.de/informatik-und-ingenieurwissenschaften/gecco-challenge-2019_63244.php

static get_meta_info() dict

Gets the meta information of this resource.

Returns:

Meta info.

Return type:

dict

static load_data(download_dir: str | None = None, return_X_y: bool = True, verbose: bool = True) dict

Loads the GECCO Industrial Challenge 2019 Dataset.

Note

This is NOT a simulated scenario; therefore, only the final data set is provided.

Parameters:
  • download_dir (str, optional) –

    Path to the data files – if None, the temp folder will be used. If the path does not exist, the data files will be downloaded to the given path.

    The default is None.

  • return_X_y (bool, optional) –

    If True, the data is returned together with the labels as two Numpy arrays, otherwise the data is returned as a Pandas data frame.

    The default is True.

  • verbose (bool, optional) –

    If True, a progress bar is shown while downloading files.

    The default is True.

Returns:

The data set as a dictionary with entries “train”, “validation”, and “test” containing the respective data.

Return type:

dict
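Because the 2019 loader returns a dictionary rather than a single data set, calling code usually needs to walk the three splits. The following is a minimal sketch of that step. The helper name ordered_splits is hypothetical, and the sketch deliberately makes no assumption about what each entry contains (presumably an (X, y) pair or a data frame depending on return_X_y, but this is not stated above); it relies only on the three documented keys.

```python
EXPECTED_SPLITS = ("train", "validation", "test")  # keys documented above


def ordered_splits(data):
    """Return (name, split) pairs in train/validation/test order.

    Hypothetical helper: fails early if one of the documented
    entries is missing from the dictionary.
    """
    missing = [name for name in EXPECTED_SPLITS if name not in data]
    if missing:
        raise KeyError(f"data set is missing splits: {missing}")
    return [(name, data[name]) for name in EXPECTED_SPLITS]


# Stand-in dictionary for illustration; real entries would come from
# GeccoWaterQuality2019.load_data().
example = {"test": "test-data", "train": "train-data",
           "validation": "validation-data"}
for name, split in ordered_splits(example):
    print(name, split)
```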