csl.datasets

Datasets for the csl module

CIFAR-10

class csl.datasets.CIFAR10(root, train=True, subset=None, transform=None, target_transform=None)[source]

Bases: object

CIFAR-10 dataset

You can download the dataset in PyTorch tensor format from https://www.ocf.berkeley.edu/~chamon/data/cifar-10.zip

Warning: For performance purposes, this class loads the full CIFAR-10 dataset into RAM. Even though it is less than 1 GB, you’ve been warned.

Variables
  • train (bool) – True if training set or False otherwise.

  • data (torch.tensor) – CIFAR-10 images.

  • transform (callable) – Function applied to the data points before returning them.

  • target (torch.tensor) – CIFAR-10 labels.

  • target_transform (callable) – Function applied to the labels before returning them.

__len__()[source]

Return size of dataset.

__getitem__()

Return tuple (torch.tensor, torch.tensor) of images ([N] x [C = 3] x [H = 32] x [W = 32]) and label (N x 1).

MEAN = [0.4914, 0.4822, 0.4465]

Average channel value over training set (list [float])

SD = [0.2023, 0.1994, 0.201]

Standard deviation of channel value over training set (list [float])
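
The MEAN and SD attributes are typically used to standardize each channel before training. The following is a minimal pure-Python sketch of that arithmetic for a single pixel. It assumes pixel values are scaled to [0, 1], which the magnitude of MEAN suggests but this page does not state; normalize_pixel is a hypothetical helper, not part of csl.

```python
# Channel statistics copied from the CIFAR10 class attributes.
MEAN = [0.4914, 0.4822, 0.4465]
SD = [0.2023, 0.1994, 0.201]


def normalize_pixel(rgb):
    """Standardize one RGB pixel channel by channel (hypothetical helper)."""
    return [(value - mean) / sd for value, mean, sd in zip(rgb, MEAN, SD)]


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(MEAN)
# A pixel one standard deviation above the mean maps to about one.
shifted = normalize_pixel([m + s for m, s in zip(MEAN, SD)])
```

The same subtraction and division would apply elementwise to a full image tensor, one (mean, sd) pair per channel.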

__init__(root, train=True, subset=None, transform=None, target_transform=None)[source]

CIFAR-10 dataset constructor

Parameters
  • root (str) – Data folder.

  • train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).

  • subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).

  • transform (callable, optional) – Transformation to apply to the data points. The default is None.

  • target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
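
To illustrate how subset, transform, and target_transform might interact with __len__ and indexing, here is a small list-backed stand-in. ListDataset is hypothetical: the real classes hold torch tensors, and the order in which csl applies subsetting and transforms is an assumption here.

```python
class ListDataset:
    """Hypothetical stand-in that mimics the documented constructor arguments."""

    def __init__(self, data, target, subset=None,
                 transform=None, target_transform=None):
        if subset is not None:
            # Keep only the requested indices, in the order given.
            data = [data[i] for i in subset]
            target = [target[i] for i in subset]
        self.data = data
        self.target = target
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x, y = self.data[index], self.target[index]
        # Transforms are applied lazily, when an item is requested.
        if self.transform is not None:
            x = self.transform(x)
        if self.target_transform is not None:
            y = self.target_transform(y)
        return x, y


classes = ('Plane', 'Car', 'Bird')
dataset = ListDataset(
    data=[0.1, 0.2, 0.3, 0.4],
    target=[0, 1, 2, 1],
    subset=[0, 3],                           # keep the first and last samples
    transform=lambda x: 2 * x,               # applied to the data point
    target_transform=lambda y: classes[y],   # applied to the label
)
```

With this stand-in, len(dataset) reflects the subset rather than the full data, and dataset[i] returns the transformed pair.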

classes = ('Plane', 'Car', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck')

CIFAR-10 labels (list [str])

Fashion MNIST

class csl.datasets.FMNIST(root, train=True, subset=None, transform=None, target_transform=None)[source]

Bases: object

Fashion MNIST dataset

You can download the dataset in PyTorch tensor format from https://www.ocf.berkeley.edu/~chamon/data/fmnist.zip

Warning: For performance purposes, this class loads the full FMNIST dataset into RAM. Even though it is less than 1 GB, you’ve been warned.

Variables
  • train (bool) – True if training set or False otherwise.

  • data (torch.tensor) – FMNIST images.

  • transform (callable) – Function applied to the data points before returning them.

  • target (torch.tensor) – FMNIST labels.

  • target_transform (callable) – Function applied to the labels before returning them.

__len__()[source]

Returns size of dataset.

__getitem__()

Return tuple (torch.tensor, torch.tensor) of images ([N] x [C = 1] x [H = 28] x [W = 28]) and label (N x 1).

MEAN = 0.1307

Average channel value over training set (float)

SD = 0.3081

Standard deviation of channel value over training set (float)

__init__(root, train=True, subset=None, transform=None, target_transform=None)[source]

Fashion MNIST dataset constructor

Parameters
  • root (str) – Data folder.

  • train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).

  • subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).

  • transform (callable, optional) – Transformation to apply to the data points. The default is None.

  • target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

classes = ('T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')

FMNIST labels (list [str])

UTK Face

class csl.datasets.UTK(root, train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)[source]

Bases: object

UTKFace dataset

Download the dataset from https://susanqq.github.io/UTKFace/ and indicate the path to the UTKFace folder.

Variables
  • classes (list [str]) – Class labels.

  • train (bool) – True if training set or False otherwise.

  • current_batch (dict) – Memoized dataset to speed up consecutive requests for the same data.

  • data (pandas.DataFrame) – Data frame containing the targets and the path to each image. Unlike CIFAR-10 or FMNIST, UTKFace is never fully loaded into memory.

  • transform (callable) – Function applied to the data points before returning them.

  • target_transform (callable) – Function applied to the labels before returning them.

__len__()[source]

Return size of dataset.

__getitem__()

Return tuple (torch.tensor, pandas.DataFrame) of image ([N] x [C = 3] x [H = 200] x [W = 200]) and label (N x 3).

MEAN = [0.597, 0.4569, 0.3911]

Average channel value over training set (list [float])

SD = [0.258, 0.2307, 0.2265]

Standard deviation of channel value over training set (list [float])

__init__(root, train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)[source]

UTKFace dataset constructor

Parameters
  • root (str) – Data folder.

  • train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).

  • split (float, optional) – Fraction of the dataset to keep for training. The dataset is split randomly between training and testing, but the split is deterministic, i.e., the training and test sets returned are always the same. The default is 0.7.

  • preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.).

  • subset (array, list, or tensor, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).

  • transform (callable, optional) – Transformation to apply to the data points. The default is None.

  • target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
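
A split that is random yet always returns the same sets can be obtained by shuffling indices with a fixed seed. The sketch below shows one way to do this; deterministic_split and its seed parameter are hypothetical, and the mechanism csl actually uses is not documented on this page.

```python
import random


def deterministic_split(n_samples, split=0.7, seed=0):
    """Return (train_indices, test_indices) that are identical on every call.

    `seed` is a hypothetical parameter: fixing it is one way to make a
    random split reproducible.
    """
    indices = list(range(n_samples))
    # A private generator leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_train = round(split * n_samples)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = deterministic_split(10, split=0.7)
```

Calling the function again with the same arguments yields the same two index lists, so train=True and train=False never overlap across runs.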

UCI’s Adult

class csl.datasets.Adult(root, target_name='income', train=True, preprocess=None, subset=None, transform=None, target_transform=None)[source]

Bases: object

UCI’s adult dataset

You can download adult.data and adult.test from http://archive.ics.uci.edu/ml/datasets/Adult

Variables
  • classes (list [str]) – Class labels.

  • train (bool) – True if training set or False otherwise.

  • data (torch.tensor) – Adult data points features.

  • transform (callable) – Function applied to the data points before returning them.

  • target (torch.tensor) – Adult data points labels.

  • target_transform (callable) – Function applied to the labels before returning them.

__len__()[source]

Returns size of dataset.

__getitem__()

Return tuple (torch.tensor, torch.tensor) of features (N x F) and label (N x 1). The number of features F depends on preprocessing (see preprocess).

__init__(root, target_name='income', train=True, preprocess=None, subset=None, transform=None, target_transform=None)[source]

UCI’s adult dataset constructor

Parameters
  • root (str) – Data folder.

  • target_name (str, optional) – Name of target variable. The default is 'income'.

  • train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).

  • preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.).

  • subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).

  • transform (callable, optional) – Transformation to apply to the data points. The default is None.

  • target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

categorical = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']

List of categorical variable names (list [str]).

variables = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

List of variables in UCI’s Adult dataset (list [str]).

ProPublica’s COMPAS

class csl.datasets.COMPAS(root, target_name='two_year_recid', train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)[source]

Bases: object

ProPublica’s COMPAS dataset

You can download compas-scores-two-years.csv from https://github.com/propublica/compas-analysis

Variables
  • classes (list [str]) – Class labels.

  • train (bool) – True if training set or False otherwise.

  • data (torch.tensor) – COMPAS data points features.

  • transform (callable) – Function applied to the data points before returning them.

  • target (torch.tensor) – COMPAS data points labels.

  • target_transform (callable) – Function applied to the labels before returning them.

__len__()[source]

Returns size of dataset.

__getitem__()

Return tuple (torch.tensor, torch.tensor) of features (N x F) and label (N x 1). The number of features F depends on preprocessing (see preprocess).

__init__(root, target_name='two_year_recid', train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)[source]

ProPublica’s COMPAS dataset constructor

Parameters
  • root (str) – Data folder.

  • target_name (str, optional) – Name of target variable. The default is 'two_year_recid'.

  • train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).

  • split (float, optional) – Fraction of the dataset to keep for training. The dataset is split randomly between training and testing, but the split is deterministic, i.e., the training and test sets returned are always the same. The default is 0.7.

  • preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.).

  • subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).

  • transform (callable, optional) – Transformation to apply to the data points. The default is None.

  • target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

categorical = ['sex', 'age_cat', 'race', 'score_text', 'v_score_text', 'c_charge_degree', 'is_recid', 'is_violent_recid', 'two_year_recid']

List of categorical variable names (list [str]).

variables = ['sex', 'age', 'age_cat', 'race', 'decile_score', 'score_text', 'v_decile_score', 'v_score_text', 'juv_misd_count', 'juv_other_count', 'priors_count', 'c_charge_degree', 'is_recid', 'is_violent_recid', 'two_year_recid']

List of variables retained from original ProPublica dataset (list [str]).

csl.datasets.utils

Dataset transformations

class csl.datasets.utils.Binning(var_name, bins)[source]

Bases: object

Bin variable.

Variables
  • var_name (str) – Variable name.

  • bins (list [int]) – Bin edges (each bin includes right edge and first bin includes both edges).
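
The edge convention described for bins (right edge included, first bin closed on both sides) can be made concrete with a short pure-Python sketch. bin_index is a hypothetical helper rather than the csl implementation, and returning None for out-of-range values is an assumption.

```python
from bisect import bisect_left


def bin_index(value, edges):
    """Return the index of the bin containing `value`, or None if out of range.

    Bins are (edges[i], edges[i + 1]], except the first, which is
    [edges[0], edges[1]] and therefore also contains its left edge.
    """
    if value < edges[0] or value > edges[-1]:
        return None  # assumption: values outside the edges get no bin
    if value == edges[0]:
        return 0
    # bisect_left finds the first edge >= value; the bin sits just before it.
    return bisect_left(edges, value) - 1


edges = [0, 18, 65, 100]  # illustrative age brackets, not taken from csl
labels = [bin_index(age, edges) for age in (0, 18, 19, 65, 66, 100, 101)]
```

Here 0 and 18 land in the first bin, 19 and 65 in the second, 66 and 100 in the third, and 101 falls outside every bin.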

class csl.datasets.utils.Drop(var_names)[source]

Bases: object

Remove variables from data frame.

Variables

var_names (list [str]) – Variable names.

class csl.datasets.utils.Dummify(var_names)[source]

Bases: object

Dummy code variables.

Variables

var_names (list [str]) – Variable names.
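
Dummy coding replaces a categorical variable with one 0/1 indicator per level. The sketch below illustrates the idea on a list of dictionaries; dummify is a hypothetical helper, and the name_level column naming and the choice to keep every level (no reference level dropped) are assumptions rather than documented csl behavior.

```python
def dummify(rows, var_name):
    """Replace column `var_name` with one 0/1 indicator column per level."""
    levels = sorted({row[var_name] for row in rows})
    coded = []
    for row in rows:
        # Copy every other column unchanged.
        new_row = {k: v for k, v in row.items() if k != var_name}
        for level in levels:
            new_row[f"{var_name}_{level}"] = int(row[var_name] == level)
        coded.append(new_row)
    return coded


rows = [{"age": 39, "gender": "Male"}, {"age": 50, "gender": "Female"}]
coded = dummify(rows, "gender")
```

Each output row keeps age and gains gender_Female and gender_Male, exactly one of which is 1.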

class csl.datasets.utils.QuantileBinning(var_name, quantile)[source]

Bases: object

Bin variable in quantiles.

Variables
  • var_name (str) – Variable name.

  • quantile (int) – Number of bins.
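
Quantile binning chooses the bin edges from the data so that each bin holds roughly the same number of points. The following sketch uses only the standard library; quantile_bin is a hypothetical helper, and both the interpolation method and the tie rule (a value equal to a cut point goes to the lower bin) are assumptions.

```python
from bisect import bisect_left
from statistics import quantiles


def quantile_bin(values, quantile):
    """Assign each value to one of `quantile` bins of roughly equal count."""
    # The quantile - 1 interior cut points, computed from the data itself.
    cuts = quantiles(values, n=quantile, method="inclusive")
    # A value equal to a cut point falls in the lower bin.
    return [bisect_left(cuts, v) for v in values]


hours = [10, 20, 30, 40, 50, 60, 70, 80]
quartile = quantile_bin(hours, 4)
```

With eight evenly spaced values and four bins, each bin receives two values.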

class csl.datasets.utils.RandomCrop(size, padding)[source]

Bases: object

Pad and randomly crop image.

Variables
  • size (int) – Size of region to crop (in pixels).

  • padding (int) – Size of padding to add before cropping (in pixels).
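
Pad-and-crop augmentation first enlarges the image by padding pixels on every side and then cuts out a size x size window at a random position. The sketch below works on a 2-D nested list standing in for a single image channel. random_crop is a hypothetical helper, and zero padding and the rng argument are assumptions; the csl class operates on tensors and may pad differently.

```python
import random


def random_crop(image, size, padding, rng=random):
    """Zero-pad a 2-D image (list of rows) and crop a random size x size window."""
    width = len(image[0])
    blank = [0] * (width + 2 * padding)
    padded = (
        [blank[:] for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [blank[:] for _ in range(padding)]
    )
    # randint bounds are inclusive, so the window always fits inside `padded`.
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, len(padded[0]) - size)
    return [row[left:left + size] for row in padded[top:top + size]]


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
crop = random_crop(image, size=3, padding=1, rng=random.Random(0))
```

The crop has the same shape as the input but is shifted by up to padding pixels in each direction, with zeros filling the uncovered border.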

class csl.datasets.utils.RandomFlip(p=0.5, axis=3)[source]

Bases: object

Randomly flip image along an axis.

Variables
  • p (float, optional) – Flipping probability. The default is 0.5.

  • axis (int, optional) – Axis along which to flip. The default is 3 (horizontal flip).

class csl.datasets.utils.Recode(var_name, dictionary)[source]

Bases: object

Recode variable.

Variables
  • var_name (str) – Variable name.

  • dictionary (dict) – Dictionary describing recoding patterns, e.g., {'L': ['L1', 'L2']} recodes levels L1 and L2 as L.
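
The recoding dictionary maps each new level to the list of old levels it replaces. A minimal sketch of how such a dictionary could be applied follows; recode is a hypothetical helper, and leaving unlisted levels unchanged is an assumption.

```python
def recode(values, dictionary):
    """Replace each old level with its new level; unlisted levels are kept."""
    # Invert {'new': ['old1', 'old2']} into {'old1': 'new', 'old2': 'new'}.
    lookup = {old: new for new, olds in dictionary.items() for old in olds}
    return [lookup.get(value, value) for value in values]


recoded = recode(['L1', 'L2', 'M'], {'L': ['L1', 'L2']})
```

Levels L1 and L2 both become L, while M passes through untouched.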

class csl.datasets.utils.ToTensor(**kwargs)[source]

Bases: object

Transform input to torch.tensor or cast torch.tensor to dtype and device.

Variables

**kwargs (dict) – Parameters to pass to tensor constructor.