csl.datasets
Datasets for the csl module
CIFAR-10
-
class csl.datasets.CIFAR10(root, train=True, subset=None, transform=None, target_transform=None)
Bases: object
CIFAR-10 dataset
You can download the dataset in PyTorch tensor format from https://www.ocf.berkeley.edu/~chamon/data/cifar-10.zip
- Warning: For performance, this class loads the full CIFAR-10 dataset into RAM. The dataset takes less than 1 GB, but make sure that much memory is available.
- Variables
train (bool) – True if training set or False otherwise.
data (torch.tensor) – CIFAR-10 images.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – CIFAR-10 labels.
target_transform (callable) – Function applied to the labels before returning them.
-
__get_item__()
Return tuple (torch.tensor, torch.tensor) of images ([N] x [C = 3] x [H = 32] x [W = 32]) and label (N x 1).
-
MEAN = [0.4914, 0.4822, 0.4465]
Average channel value over training set (list [float])
-
SD = [0.2023, 0.1994, 0.201]
Standard deviation of channel value over training set (list [float])
-
__init__(root, train=True, subset=None, transform=None, target_transform=None)
CIFAR-10 dataset constructor
- Parameters
root (str) – Data folder.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
-
classes = ('Plane', 'Car', 'Bird', 'Cat', 'Deer', 'Dog', 'Frog', 'Horse', 'Ship', 'Truck')
CIFAR-10 labels (list [str])
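Channel statistics such as MEAN and SD are commonly used to standardize images before training. The following is a minimal sketch in plain Python, assuming pixel values are scaled to [0, 1]; the helper name standardize_pixel is hypothetical, and whether CIFAR10 applies this normalization itself is not stated here.

```python
# Channel statistics copied from the MEAN and SD attributes documented above.
MEAN = [0.4914, 0.4822, 0.4465]
SD = [0.2023, 0.1994, 0.201]


def standardize_pixel(rgb, mean=MEAN, sd=SD):
    """Standardize one RGB pixel channel by channel (hypothetical helper).

    Assumes the pixel values are already scaled to [0, 1].
    """
    return [(value - m) / s for value, m, s in zip(rgb, mean, sd)]


# A pixel equal to the channel means maps to zero in every channel.
centered = standardize_pixel(MEAN)
```

In practice the same arithmetic would be applied to whole tensors inside a callable passed as transform.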
Fashion MNIST
-
class csl.datasets.FMNIST(root, train=True, subset=None, transform=None, target_transform=None)
Bases: object
FASHION MNIST dataset
You can download the dataset in PyTorch tensor format from https://www.ocf.berkeley.edu/~chamon/data/fmnist.zip
- Warning: For performance, this class loads the full FMNIST dataset into RAM. The dataset takes less than 1 GB, but make sure that much memory is available.
- Variables
train (bool) – True if training set or False otherwise.
data (torch.tensor) – FMNIST images.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – FMNIST labels.
target_transform (callable) – Function applied to the labels before returning them.
-
__get_item__()
Return tuple (torch.tensor, torch.tensor) of images ([N] x [C = 1] x [H = 28] x [W = 28]) and label (N x 1).
-
MEAN = 0.1307
Average channel value over training set (float)
-
SD = 0.3081
Standard deviation of channel value over training set (float)
-
__init__(root, train=True, subset=None, transform=None, target_transform=None)
FASHION MNIST dataset constructor
- Parameters
root (str) – Data folder.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
-
classes = ('T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')
FMNIST labels (list [str])
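The classes attribute can be used to turn a numeric label into a readable name. The sketch below assumes that labels are integer indices (0 to 9) into classes in the order listed, which this page does not state explicitly; label_name is a hypothetical helper.

```python
# Class names copied from the FMNIST.classes attribute documented above.
classes = ('T-shirt', 'Trouser', 'Pullover', 'Dress', 'Coat',
           'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot')


def label_name(label):
    """Map an integer label to its class name (assumes labels index `classes`)."""
    return classes[int(label)]


first = label_name(0)
last = label_name(9)
```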
UTK Face
-
class csl.datasets.UTK(root, train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)
Bases: object
UTKFace dataset
Download the dataset from https://susanqq.github.io/UTKFace/ and indicate the path to the UTKFace folder
- Variables
classes (list [str]) – Class labels
train (bool) – True if training set or False otherwise.
current_batch (dict) – Memoized dataset to speed-up consecutive requests for the same data.
data (pandas.DataFrame) – Data frame containing the targets and the path to each image. Unlike CIFAR-10 or FMNIST, UTKFace is never fully loaded into memory.
transform (callable) – Function applied to the data points before returning them.
target_transform (callable) – Function applied to the labels before returning them.
-
__get_item__()
Return tuple (torch.tensor, pandas.DataFrame) of image ([N] x [C = 3] x [H = 200] x [W = 200]) and label (N x 3).
-
MEAN = [0.597, 0.4569, 0.3911]
Average channel value over training set (list [float])
-
SD = [0.258, 0.2307, 0.2265]
Standard deviation of channel value over training set (list [float])
-
__init__(root, train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)
UTKFace dataset constructor
- Parameters
root (str) – Data folder.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
split (float, optional) – Fraction of the dataset to keep for training. The dataset is split randomly between training and test sets, but the split is deterministic, i.e., the sets returned are always the same. The default is 0.7.
preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.).
subset (array, list, or tensor, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
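The split parameter describes a split that is random yet reproducible. One common way to get this behavior is to shuffle the indices with a fixed seed, as in the sketch below. This is an illustration only: the function name, the seed argument, and the rounding rule are assumptions, and the mechanism actually used by UTK is not documented here.

```python
import random


def deterministic_split(n_samples, split=0.7, seed=0):
    """Split indices 0..n_samples-1 into train/test lists (hypothetical helper).

    The shuffle uses a private generator with a fixed seed, so repeated
    calls with the same arguments return the same two sets.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(round(split * n_samples))
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = deterministic_split(10)
```

Because the generator is local to the function, the split does not depend on, or disturb, the global random state.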
UCI’s Adult
-
class csl.datasets.Adult(root, target_name='income', train=True, preprocess=None, subset=None, transform=None, target_transform=None)
Bases: object
UCI’s adult dataset
You can download adult.data and adult.test from http://archive.ics.uci.edu/ml/datasets/Adult
- Variables
classes (list [str]) – Class labels.
train (bool) – True if training set or False otherwise.
data (torch.tensor) – Adult data points features.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – Adult data points labels.
target_transform (callable) – Function applied to the labels before returning them.
-
__get_item__()
Return tuple (torch.tensor, torch.tensor) of features (N x F) and label (N x 1). The number of features F depends on preprocessing (see preprocess).
-
__init__(root, target_name='income', train=True, preprocess=None, subset=None, transform=None, target_transform=None)
UCI’s adult dataset constructor
- Parameters
root (str) – Data folder.
target_name (str, optional) – Name of target variable. The default is income.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.).
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
-
categorical = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'native-country', 'income']
List of categorical variable names (list [str]).
-
variables = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num', 'marital-status', 'occupation', 'relationship', 'race', 'gender', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']
List of variables in UCI’s Adult dataset (list [str]).
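The two lists above can be combined to find the variables that are not categorical, for example when deciding which columns to bin and which to dummy code in a preprocess callable. The sketch below only manipulates the documented lists; the name numerical is introduced for illustration and is not an attribute of Adult.

```python
# Lists copied from the Adult.categorical and Adult.variables attributes above.
categorical = ['workclass', 'education', 'marital-status', 'occupation',
               'relationship', 'race', 'gender', 'native-country', 'income']
variables = ['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
             'marital-status', 'occupation', 'relationship', 'race', 'gender',
             'capital-gain', 'capital-loss', 'hours-per-week',
             'native-country', 'income']

# Variables that are not listed as categorical (hypothetical name).
numerical = [name for name in variables if name not in categorical]
```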
ProPublica’s COMPAS
-
class csl.datasets.COMPAS(root, target_name='two_year_recid', train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)
Bases: object
ProPublica’s COMPAS dataset
You can download compas-scores-two-years.csv from https://github.com/propublica/compas-analysis
- Variables
classes (list [str]) – Class labels.
train (bool) – True if training set or False otherwise.
data (torch.tensor) – COMPAS data points features.
transform (callable) – Function applied to the data points before returning them.
target (torch.tensor) – COMPAS data points labels.
target_transform (callable) – Function applied to the labels before returning them.
-
__get_item__()
Return tuple (torch.tensor, torch.tensor) of features (N x F) and label (N x 1). The number of features F depends on preprocessing (see preprocess).
-
__init__(root, target_name='two_year_recid', train=True, split=0.7, preprocess=None, subset=None, transform=None, target_transform=None)
ProPublica’s COMPAS dataset constructor
- Parameters
root (str) – Data folder.
target_name (str, optional) – Name of target variable. The default is two_year_recid.
train (bool, optional) – Returns training set if True and test set if False. The default is True (training set).
split (float, optional) – Fraction of the dataset to keep for training. The dataset is split randomly between training and test sets, but the split is deterministic, i.e., the sets returned are always the same. The default is 0.7.
preprocess (callable, optional) – Transformations to apply before separating labels (e.g., binning, dummifying, etc.).
subset (list, optional) – Subset of indices of the dataset to use. The default is None (use the whole dataset).
transform (callable, optional) – Transformation to apply to the data points. The default is None.
target_transform (callable, optional) – Transformation to apply to the labels. The default is None.
-
categorical = ['sex', 'age_cat', 'race', 'score_text', 'v_score_text', 'c_charge_degree', 'is_recid', 'is_violent_recid', 'two_year_recid']
List of categorical variable names (list [str]).
-
variables = ['sex', 'age', 'age_cat', 'race', 'decile_score', 'score_text', 'v_decile_score', 'v_score_text', 'juv_misd_count', 'juv_other_count', 'priors_count', 'c_charge_degree', 'is_recid', 'is_violent_recid', 'two_year_recid']
List of variables retained from original ProPublica dataset (list [str]).
csl.datasets.utils
Dataset transformations
-
class csl.datasets.utils.Binning(var_name, bins)
Bases: object
Bin variable.
- Variables
var_name (str) – Variable name.
bins (list [int]) – Bin edges. Each bin includes its right edge, and the first bin also includes its left edge.
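The edge convention described for bins (right edges included, first bin closed on both sides) matches what pandas.cut does with right=True and include_lowest=True. The plain-Python sketch below illustrates that convention; bin_index and the example edges are hypothetical, and it is not the implementation used by Binning.

```python
from bisect import bisect_left


def bin_index(value, bins):
    """Return the index of the bin containing `value` (hypothetical helper).

    Bins are (bins[i], bins[i + 1]], except the first, which is
    [bins[0], bins[1]] so that the lowest edge is also included.
    """
    if value < bins[0] or value > bins[-1]:
        raise ValueError("value lies outside the bin edges")
    # bisect_left counts the edges strictly below `value`; subtracting one
    # gives the bin index, and max() places the lowest edge in bin 0.
    return max(bisect_left(bins, value) - 1, 0)


edges = [0, 18, 65, 100]  # example edges chosen for illustration
indices = [bin_index(v, edges) for v in (0, 18, 19, 65, 66, 100)]
```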
-
class csl.datasets.utils.Drop(var_names)
Bases: object
Remove variables from data frame.
- Variables
var_name (list [str]) – Variable names.
-
class csl.datasets.utils.Dummify(var_names)
Bases: object
Dummy code variables.
- Variables
var_names (list [str]) – Variable names.
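Dummy coding replaces a categorical variable with one indicator column per level. The sketch below shows the idea on a plain list of values. It keeps every level (one-hot coding), whereas some dummy-coding schemes drop a reference level; which variant Dummify uses is not stated here, and the function name is hypothetical.

```python
def dummify(values):
    """Return one 0/1 indicator column per distinct level (hypothetical helper)."""
    levels = sorted(set(values))
    return {level: [1 if v == level else 0 for v in values] for level in levels}


columns = dummify(['Private', 'State-gov', 'Private'])
```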
-
class csl.datasets.utils.QuantileBinning(var_name, quantile)
Bases: object
Bin variable in quantiles.
- Variables
var_name (str) – Variable name.
quantile (int) – Number of bins.
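Quantile binning assigns values to bins that each hold roughly the same number of observations. The sketch below does this by rank in plain Python. It is an illustration under simplifying assumptions: ties are separated by rank rather than by quantile edges, and the function name is hypothetical rather than part of csl.

```python
def quantile_bins(values, quantile):
    """Assign each value to one of `quantile` equal-frequency bins by rank."""
    n = len(values)
    # Indices of `values` ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * quantile // n
    return bins


quartiles = quantile_bins([10, 20, 30, 40, 50, 60, 70, 80], 4)
```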
-
class csl.datasets.utils.RandomCrop(size, padding)
Bases: object
Pad and randomly crop image.
- Variables
size (int) – Size of region to crop (in pixels).
padding (int) – Size of padding to add before cropping (in pixels).
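Pad-and-crop augmentation first adds a border around the image and then cuts out a randomly positioned square of the requested size. The sketch below shows this for a single-channel image stored as a list of rows. Zero padding and the helper name are assumptions; the padding mode used by RandomCrop is not documented here.

```python
import random


def pad_and_random_crop(image, size, padding, rng=random):
    """Zero-pad a 2D image and return a random size x size crop (sketch)."""
    height, width = len(image), len(image[0])
    padded_width = width + 2 * padding
    border = [[0] * padded_width for _ in range(padding)]
    padded = (
        [list(row) for row in border]
        + [[0] * padding + list(row) + [0] * padding for row in image]
        + [list(row) for row in border]
    )
    # randint is inclusive on both ends, so every valid offset can be drawn.
    top = rng.randint(0, height + 2 * padding - size)
    left = rng.randint(0, padded_width - size)
    return [row[left:left + size] for row in padded[top:top + size]]


image = [[1] * 4 for _ in range(4)]
crop = pad_and_random_crop(image, size=4, padding=1, rng=random.Random(0))
```

With size equal to the original image size, the crop is a randomly shifted copy of the image with zeros filling the uncovered border.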
-
class csl.datasets.utils.RandomFlip(p=0.5, axis=3)
Bases: object
Randomly flip image along an axis.
- Variables
p (float, optional) – Flipping probability. The default is 0.5.
axis (int, optional) – Axis along which to flip. The default is 3 (horizontal flip).
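For a batch laid out as N x C x H x W, axis 3 is the width axis, so flipping along it mirrors each image left to right. The sketch below shows the same operation on a single-channel image stored as a list of rows; the function name and the rng argument are hypothetical and used only for illustration.

```python
import random


def random_flip(image, p=0.5, rng=random):
    """Mirror a 2D image left to right with probability p (sketch)."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return [list(row) for row in image]


always = random_flip([[1, 2], [3, 4]], p=1.0)
never = random_flip([[1, 2], [3, 4]], p=0.0)
```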