Introduction

For the Optimal Decision Diagrams for Classification paper we chose as benchmark the 54 data sets from the UCI Machine Learning Repository 1 used by Bertsimas and Dunn (2017) 2 for their seminal work on optimal decision trees.

Finding all data sets in a standard format proved to be a challenge. We did find all data sets in the ARFF format, though with differing categorical encodings, data cleaning strategies and other details.

For this reason, we created a data pipeline to preprocess all ARFF files into CSV with a standard format of zero-based indexing, one-hot encoding and no rows with missing values, following the original benchmark from Bertsimas and Dunn (2017) 2.

The pipeline expects ARFF files in the datasets/raw directory and a set of corresponding transformations in datasets/transformations.py. It outputs processed CSV files in the datasets/processed directory.

Configuration

Possible transformations are implemented in the operations module. For a description of each operation, please refer to the module’s documentation.

To process a data set, an ARFF file must be placed in the datasets/raw directory. Then, a corresponding configuration must be created in the dataset_transformations dictionary of the datasets/transformations.py file, using the same name as the ARFF file.

The configuration has the following structure. Note that any transformation may be omitted if unnecessary.

dataset_transformations = {
  ...
  'dataset-name': {
    'replace': { 'value-to-be-replaced': 'new-value', 'other-value-to-be-replaced': 'other-new-value' },
    'zero-index': range(5),
    'one-hot-encode': [10,11],
    'drop-columns': [0,1],
    'drop-rows': [87,166,192,266,287,302],
    'drop-rows-with-values': ['?']
  },
  ...
}

Basic usage

Process all datasets in the datasets/raw directory.

$ python3 datasets/pipeline.py all

Process a single dataset.

$ python3 datasets/pipeline.py acute-inflammations-nephritis

Some options are available for debugging, printing and overwriting previously processed data sets. Please refer to the pipeline module documentation for more information.

1: Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
2(1,2): Bertsimas, Dimitris & Dunn, Jack. (2017). Optimal classification trees. Machine Learning. 106. 10.1007/s10994-017-5633-9.