`csl.datasets`

Datasets for the csl module

CIFAR-10

class csl.datasets.CIFAR10(root, train=True, subset=None, transform=None, target_transform=None)

Bases: object

CIFAR-10 dataset

You can download the dataset in PyTorch tensor format from https://www.ocf.berkeley.edu/~chamon/data/cifar-10.zip

Warning: For performance reasons, this class loads the full CIFAR-10 dataset into RAM (less than 1 GB).

Variables

train (bool) – True if training set or False otherwise.
data (torch.tensor) – CIFAR-10 images.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – CIFAR-10 labels.
target_transform (callable) – Function applied to the labels before returning them.

__len__(): Return size of dataset.

__getitem__(): Return tuple (torch.tensor, torch.tensor) of images ([N] x [C = 3] x [H = 32] x [W = 32]) and labels (N x 1).

MEAN = [0.4914, 0.4822, 0.4465]: Average channel value over training set (list [float])

SD = [0.2023, 0.1994, 0.201]: Standard deviation of channel value over training set (list [float])
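Channel statistics like these are typically used to standardize images before training. The following is a minimal sketch in plain Python, assuming an image stored as nested lists indexed [channel][row][column]; `normalize` is a hypothetical helper for illustration and is not part of csl.

```python
# Channel statistics copied from the CIFAR10 class attributes.
MEAN = [0.4914, 0.4822, 0.4465]
SD = [0.2023, 0.1994, 0.201]


def normalize(image, mean=MEAN, sd=SD):
    """Standardize a C x H x W image (nested lists) channel by channel.

    Hypothetical helper for illustration; not part of csl.
    """
    return [
        [[(pixel - m) / s for pixel in row] for row in channel]
        for channel, m, s in zip(image, mean, sd)
    ]


# A 3 x 1 x 2 toy image whose pixels equal the channel means,
# so every standardized pixel is zero.
toy = [[[m, m]] for m in MEAN]
normalized = normalize(toy)
```

With real data, the same per-channel arithmetic would usually be applied to tensors through the `transform` argument.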

__init__(root, train=True, subset=None, transform=None, target_transform=None)

CIFAR-10 dataset constructor

Parameters

root (str) – Data folder.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

classes = ('Plane', 'Car', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck'): CIFAR-10 labels (list [str])
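To illustrate how `subset`, `transform`, and `target_transform` could interact, here is a small stand-in written in plain Python. It is only a sketch of the documented interface: `ToyDataset` is hypothetical, and the assumption that `subset` is applied once at construction time is not confirmed by the csl source.

```python
class ToyDataset:
    """Stand-in for the dataset interface documented here (not csl code)."""

    def __init__(self, data, target, subset=None, transform=None,
                 target_transform=None):
        # Assumption: `subset` selects entries by index at construction time.
        if subset is not None:
            data = [data[i] for i in subset]
            target = [target[i] for i in subset]
        self.data = data
        self.target = target
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        x, y = self.data[index], self.target[index]
        # Transformations are applied lazily, when an item is requested.
        if self.transform is not None:
            x = self.transform(x)
        if self.target_transform is not None:
            y = self.target_transform(y)
        return x, y


classes = ('Plane', 'Car', 'Bird')
dataset = ToyDataset(
    data=[0.1, 0.2, 0.3, 0.4],
    target=[0, 1, 2, 1],
    subset=[1, 3],                          # keep only items 1 and 3
    transform=lambda x: 2 * x,              # applied to data points
    target_transform=lambda y: classes[y],  # applied to labels
)
```

Here `len(dataset)` is 2, and indexing returns the transformed data point together with the class name rather than the integer label.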

Fashion MNIST

class csl.datasets.FMNIST(root, train=True, subset=None, transform=None, target_transform=None)

Bases: object

FASHION MNIST dataset

You can download the dataset in PyTorch tensor format from https://www.ocf.berkeley.edu/~chamon/data/fmnist.zip

Warning: For performance reasons, this class loads the full FMNIST dataset into RAM (less than 1 GB).

Variables

train (bool) – True if training set or False otherwise.
data (torch.tensor) – FMNIST images.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – FMNIST labels.
target_transform (callable) – Function applied to the labels before returning them.

__len__(): Return size of dataset.

__getitem__(): Return tuple (torch.tensor, torch.tensor) of images ([N] x [C = 1] x [H = 28] x [W = 28]) and labels (N x 1).

MEAN = 0.1307: Average channel value over training set (float)

SD = 0.3081: Standard deviation of channel value over training set (float)

__init__(root, train=True, subset=None, transform=None, target_transform=None)

FASHION MNIST dataset constructor

Parameters

root (str) – Data folder.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

classes = ('T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot'): FMNIST labels (list [str])

UTK Face

class csl.datasets.UTK(root, train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)

Bases: object

UTKFace dataset

Download the dataset from https://susanqq.github.io/UTKFace/ and pass the path to the UTKFace folder as root.

Variables

classes (list [str]) – Class labels.
train (bool) – True if training set or False otherwise.
current_batch (dict) – Memoized dataset to speed up consecutive requests for the same data.
data (pandas.DataFrame) – Data frame containing the targets and path to each image. Unlike CIFAR-10 or FMNIST, UTKFace is never fully loaded into memory.
transform (callable) – Function applied to the data points before returning them.
target_transform (callable) – Function applied to the labels before returning them.

__len__(): Return size of dataset.

__getitem__(): Return tuple (torch.tensor, pandas.DataFrame) of images ([N] x [C = 3] x [H = 200] x [W = 200]) and labels (N x 3).

MEAN = [0.597, 0.4569, 0.3911]: Average channel value over training set (list [float])

SD = [0.258, 0.2307, 0.2265]: Standard deviation of channel value over training set (list [float])

__init__(root, train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)

UTKFace dataset constructor

Parameters

root (str) – Data folder.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
split (float, optional) – Fraction of the dataset to keep for training. The dataset is split randomly between training and test sets, but the split is deterministic, i.e., the sets returned are always the same. The default is 0.7.
preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.). The default is None.
subset (array, list, or tensor, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
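A split that is random yet always returns the same sets can be obtained by shuffling indices with a fixed seed. The sketch below shows one way to do this with the standard library; `deterministic_split` and its `seed` argument are hypothetical, and the mechanism csl actually uses is not documented here.

```python
import random


def deterministic_split(n, split=0.7, seed=0):
    """Return (train_indices, test_indices) for a dataset of size n.

    Hypothetical helper: a fixed seed makes the shuffle reproducible.
    """
    indices = list(range(n))
    # A private generator leaves the global random state untouched.
    random.Random(seed).shuffle(indices)
    cut = round(split * n)
    return sorted(indices[:cut]), sorted(indices[cut:])


train_idx, test_idx = deterministic_split(10, split=0.7)
```

Calling the function twice with the same arguments yields identical index sets, so the training and test sets never overlap across runs.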

UCI’s Adult

class csl.datasets.Adult(root, target_name='income', train=True, preprocess=None, subset=None, transform=None, target_transform=None)

Bases: object

UCI’s adult dataset

You can download adult.data and adult.test from http://archive.ics.uci.edu/ml/datasets/Adult

Variables

classes (list [str]) – Class labels.
train (bool) – True if training set or False otherwise.
data (torch.tensor) – Adult data points features.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – Adult data points labels.
target_transform (callable) – Function applied to the labels before returning them.

__len__(): Return size of dataset.

__getitem__(): Return tuple (torch.tensor, torch.tensor) of features (N x F) and labels (N x 1). The number of features F depends on preprocessing (see preprocess).

__init__(root, target_name='income', train=True, preprocess=None, subset=None, transform=None, target_transform=None)

UCI’s adult dataset constructor

Parameters

root (str) – Data folder.
target_name (str, optional) – Name of target variable. The default is 'income'.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.). The default is None.
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

categorical = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']: List of categorical variable names (list [str]).

variables = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']: List of variables in UCI’s Adult dataset (list [str]).

ProPublica’s COMPAS

class csl.datasets.COMPAS(root, target_name='two_year_recid', train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)

Bases: object

ProPublica’s COMPAS dataset

You can download compas-scores-two-years.csv from https://github.com/propublica/compas-analysis

Variables

classes (list [str]) – Class labels.
train (bool) – True if training set or False otherwise.
data (torch.tensor) – COMPAS data points features.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – COMPAS data points labels.
target_transform (callable) – Function applied to the labels before returning them.

__len__(): Return size of dataset.

__getitem__(): Return tuple (torch.tensor, torch.tensor) of features (N x F) and labels (N x 1). The number of features F depends on preprocessing (see preprocess).

__init__(root, target_name='two_year_recid', train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)

ProPublica’s COMPAS dataset constructor

Parameters

root (str) – Data folder.
target_name (str, optional) – Name of target variable. The default is 'two_year_recid'.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
split (float, optional) – Fraction of the dataset to keep for training. The dataset is split randomly between training and test sets, but the split is deterministic, i.e., the sets returned are always the same. The default is 0.7.
preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.). The default is None.
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.

categorical = ['sex', 'age_cat', 'race', 'score_text', 'v_score_text', 'c_charge_degree', 'is_recid', 'is_violent_recid', 'two_year_recid']: List of categorical variable names (list [str]).

variables = ['sex', 'age', 'age_cat', 'race', 'decile_score', 'score_text', 'v_decile_score', 'v_score_text', 'juv_misd_count', 'juv_other_count', 'priors_count', 'c_charge_degree', 'is_recid', 'is_violent_recid', 'two_year_recid']: List of variables retained from original ProPublica dataset (list [str]).

`csl.datasets.utils`

Dataset transformations

class csl.datasets.utils.Binning(var_name, bins)

Bases: object

Bin variable.

Variables

var_name (str) – Variable name.
bins (list [int]) – Bin edges (each bin includes right edge and first bin includes both edges).
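The edge convention described for `bins` (right edges included, first bin closed on both sides) can be made concrete with a short standard-library sketch. `bin_index` is a hypothetical helper, not the csl implementation; the convention is the same one `pandas.cut` uses with `right=True` and `include_lowest=True`.

```python
from bisect import bisect_left


def bin_index(value, bins):
    """Return the index of the bin containing value, or None if out of range.

    Bin i covers (bins[i], bins[i + 1]], except bin 0, which also
    includes its left edge bins[0].
    """
    if value < bins[0] or value > bins[-1]:
        return None
    # bisect_left counts the edges strictly below value; clamping at 0
    # puts value == bins[0] into the first bin.
    return max(bisect_left(bins, value) - 1, 0)


edges = [0, 18, 65, 100]
```

With these edges, 0 and 18 both fall in bin 0, 19 and 65 fall in bin 1, and 100 falls in bin 2.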

class csl.datasets.utils.Drop(var_names)

Bases: object

Remove variables from data frame.

Variables: var_names (list [str]) – Variable names.

class csl.datasets.utils.Dummify(var_names)

Bases: object

Dummy code variables.

Variables: var_names (list [str]) – Variable names.
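Dummy coding replaces a categorical variable by one 0/1 indicator per level. The sketch below works on a list of dict records in plain Python; `dummify`, the `name_level` column naming, and the choice to keep an indicator for every level are assumptions for illustration and may differ from what csl produces.

```python
def dummify(rows, var_names):
    """One-hot encode the named variables in a list of dict records.

    Hypothetical helper for illustration; not part of csl.
    """
    # Collect the levels observed for each variable, in sorted order.
    levels = {name: sorted({row[name] for row in rows}) for name in var_names}
    encoded = []
    for row in rows:
        # Copy the variables that are not being dummy coded.
        new_row = {k: v for k, v in row.items() if k not in var_names}
        for name in var_names:
            for level in levels[name]:
                new_row[f'{name}_{level}'] = int(row[name] == level)
        encoded.append(new_row)
    return encoded


rows = [{'age': 39, 'gender': 'Male'}, {'age': 28, 'gender': 'Female'}]
dummies = dummify(rows, ['gender'])
```

Each record keeps `age` and gains the indicators `gender_Female` and `gender_Male`.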

class csl.datasets.utils.QuantileBinning(var_name, quantile)

Bases: object

Bin variable in quantiles.

Variables

var_name (str) – Variable name.
quantile (int) – Number of bins.
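Quantile binning assigns values to bins that each hold roughly the same number of observations. The following rank-based sketch is one simple way to do that; `quantile_bins` is hypothetical, and its handling of ties (split by sort order) is an assumption that may not match csl.

```python
def quantile_bins(values, quantile):
    """Assign each value to one of `quantile` equal-count bins (0-based).

    Hypothetical helper for illustration; not part of csl.
    """
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Integer arithmetic maps ranks 0..n-1 onto bins 0..quantile-1.
        labels[i] = rank * quantile // len(values)
    return labels


ages = [23, 51, 37, 64, 19, 45, 30, 58]
quartiles = quantile_bins(ages, 4)
```

With eight values and four bins, each bin receives exactly two values.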

class csl.datasets.utils.RandomCrop(size, padding)

Bases: object

Pad and randomly crop image.

Variables

size (int) – Size of region to crop (in pixels).
padding (int) – Size of padding to add before cropping (in pixels).
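Pad-and-crop augmentation can be sketched on a single-channel image stored as nested lists. The helper `random_crop` is hypothetical; padding with zeros and cropping a square window are assumptions, since the page does not state the padding value or crop shape csl uses.

```python
import random


def random_crop(image, size, padding, rng=random):
    """Zero-pad an H x W image (nested lists) and crop a size x size window.

    Hypothetical helper for illustration; not part of csl.
    """
    width = len(image[0]) + 2 * padding
    blank = [0] * width
    # Add `padding` rows of zeros above and below, and zeros on both sides.
    padded = (
        [blank[:] for _ in range(padding)]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [blank[:] for _ in range(padding)]
    )
    # Pick the top-left corner uniformly among all valid positions.
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in padded[top:top + size]]


image = [[1, 2], [3, 4]]
crop = random_crop(image, size=2, padding=1)
```

The crop always has the requested size; with `padding=0` and `size` equal to the image size, the only possible crop is the image itself.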

class csl.datasets.utils.RandomFlip(p=0.5, axis=3)

Bases: object

Randomly flip image along an axis.

Variables

p (float, optional) – Flipping probability. The default is 0.5.
axis (int, optional) – Axis along which to flip. The default is 3 (horizontal flip).
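For an N x C x H x W batch, axis 3 is the width, so flipping along it mirrors each image left to right. The sketch below shows this on nested lists; `random_flip` is hypothetical, and flipping the whole batch with a single random draw is an assumption (csl may decide per image).

```python
import random


def random_flip(batch, p=0.5, rng=random):
    """Flip an N x C x H x W batch (nested lists) along W with probability p.

    Hypothetical helper for illustration; not part of csl.
    """
    if rng.random() < p:
        # Reversing every row mirrors the images horizontally.
        return [
            [[row[::-1] for row in channel] for channel in image]
            for image in batch
        ]
    return batch


batch = [[[[1, 2, 3], [4, 5, 6]]]]     # N = 1, C = 1, H = 2, W = 3
flipped = random_flip(batch, p=1.0)    # p = 1 always flips
unchanged = random_flip(batch, p=0.0)  # p = 0 never flips
```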

class csl.datasets.utils.Recode(var_name, dictionary)

Bases: object

Recode variable.

Variables

var_name (str) – Variable name.
dictionary (dict) – Dictionary describing recoding patterns, e.g., {'L': ['L1', 'L2']} recodes levels L1 and L2 as L.
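The recoding dictionary maps each new level to the list of old levels it absorbs, so applying it amounts to inverting that mapping. A minimal plain-Python sketch follows; `recode` is hypothetical, and leaving unlisted levels unchanged is an assumption about the intended behavior.

```python
def recode(values, dictionary):
    """Replace each old level by its new level.

    Hypothetical helper for illustration; levels not listed in
    `dictionary` are assumed to be kept as they are.
    """
    # Invert {'new': ['old1', 'old2']} into {'old1': 'new', 'old2': 'new'}.
    lookup = {old: new for new, olds in dictionary.items() for old in olds}
    return [lookup.get(v, v) for v in values]


recoded = recode(['L1', 'L2', 'M'], {'L': ['L1', 'L2']})
```

Here L1 and L2 both become L, while M passes through untouched.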

class csl.datasets.utils.ToTensor(**kwargs)

Bases: object

Transform input to torch.tensor or cast torch.tensor to dtype and device.

Variables: **kwargs (dict) – Parameters to pass to tensor constructor.
