data

The specification of what data the model will load is given as part of your model class (see base_models). This does not specify where the data will be loaded from, but what data: the feature's name (also its directory name), the normalisation used, whether delta and delta-delta features are included, and the file extension. The location of the data is given when running an experiment, using the command-line arguments.

If you need to load a file with a custom function, create a subclass of _DataSource (loading can include preprocessing if needed); a sketch is given at the end of the _DataSource reference below.

The following sections describe the utilities provided for data loading in Morgana. However, these are all used internally by experiment_builder; typically the only things you need to be aware of are the available data sources and feature normalisers.
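For example, a model might declare its data sources as follows. This is a minimal sketch: it assumes the train_data_sources() hook on a morgana.base_models model class, and the feature names are illustrative only.

    from morgana import base_models
    from tts_data_tools import data_sources


    class MyModel(base_models.BaseModel):
        def train_data_sources(self):
            # What to load: directory name, normalisation type, deltas, and
            # file extension. Where the data lives (data_root, data_dir) is
            # given separately by the command-line arguments.
            return {
                'lab': data_sources.NumpyBinarySource('lab', normalisation='minmax'),
                'lf0': data_sources.NumpyBinarySource('lf0', normalisation='mvn',
                                                      use_deltas=True),
            }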

Batching utility

morgana.data.batch(data_generator, batch_size=32, shuffle=True, num_data_threads=0, device='cpu')[source]

Creates a batched data loader for the given dataset and maps each batch to a given device.

Parameters
  • data_generator (torch.utils.data.Dataset or FilesDataset) – Dataset from which to load the batches of data.

  • batch_size (int) – Number of samples to load per batch.

  • shuffle (bool) – Whether to shuffle the data every epoch.

  • num_data_threads (int) – Number of parallel subprocesses to use for data loading.

  • device (str) – Name of the device to place the batched data on.

Returns

An instance with the __iter__ method, allowing for iteration over batches of the dataset.

Return type

torch.utils.data.DataLoader (in a ToDeviceWrapper container)
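A minimal usage sketch; the stand-in dataset here is only for illustration, since in practice this call is made by experiment_builder.

    import torch
    from torch.utils.data import TensorDataset

    from morgana import data

    # Stand-in dataset: 100 random 10-dimensional feature vectors.
    dataset = TensorDataset(torch.randn(100, 10))

    loader = data.batch(dataset, batch_size=16, shuffle=True, device='cpu')
    for features in loader:
        ...  # each batch has already been mapped to the requested device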

FilesDataset

Note

This dataset provides indexed access to the features loaded by the _DataSource instances given. It provides a custom collate_fn for transposing and padding a dictionary of features.

class morgana.data.FilesDataset(data_sources, data_dir, id_list, normalisers, data_root='.')[source]

Bases: torch.utils.data.dataset.Dataset

Combines multiple _DataSource instances, and enables batching of a dictionary of sequence features.

Parameters
  • data_sources (dict[str, _DataSource]) – Specification of the different data to be loaded.

  • data_dir (str) – The directory containing all data for this dataset split.

  • id_list (str) – The name of the file id-list containing base names to load, contained within data_root.

  • normalisers (Normalisers or dict[str, _FeatureNormaliser]) – Normaliser instances used to normalise loaded features (and delta features).

  • data_root (str) – The directory root for this dataset.

file_ids

List of base names loaded from id_list.

Type

list[str]

normalisers

The normaliser instances, set automatically by morgana.experiment_builder.ExperimentBuilder.

Type

Normalisers or dict[str, _FeatureNormaliser]

static collate_fn(batch)[source]

Collates a list of outputs from self.__getitem__ into a batched structure.

Parameters

batch (list[dict[str, object]]) – Each element in the list is a non-nested dictionary containing features loaded by each data source.

Returns

batched_features – Batched version of the list of feature items in batch. Note that it is possible to provide objects such as strings that will not be converted to torch.Tensor; these will not be padded or sent to the correct device, but can still be accessed in the features dictionary.

Return type

dict[str, torch.Tensor]
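Putting the pieces together, a sketch of constructing a FilesDataset and batching it, following the folder structure described in the _DataSource notes below (the id-list name and paths are illustrative, and experiment_builder normally does all of this for you):

    from morgana import data
    from tts_data_tools import data_sources

    sources = {
        'lab': data_sources.NumpyBinarySource('lab', normalisation='minmax'),
        'lf0': data_sources.NumpyBinarySource('lf0', normalisation='mvn',
                                              use_deltas=True),
    }

    normalisers = data.Normalisers(sources, normalisation_dir='train',
                                   data_root='dataset_name')

    dataset = data.FilesDataset(sources, data_dir='train',
                                id_list='train_file_ids.scp',
                                normalisers=normalisers,
                                data_root='dataset_name')

    loader = data.batch(dataset, batch_size=32)
    for features in loader:
        lf0 = features['lf0']  # keys correspond to data source names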

DataSource specification

Note

Data sources are defined in tts_data_tools; they provide a consistent interface for defining what features to load for a model.

Note

Supported feature normalisers are limited to mvn and minmax. To define a new normaliser you should override Normalisers.create_normaliser() (a sketch follows its documentation below).

_DataSource

class tts_data_tools.data_sources._DataSource(name, normalisation=None, use_deltas=False, ext=None)[source]

Bases: object

Abstract data loading class.

Parameters
  • name (str) – Name of directory that will contain this feature.

  • normalisation (None or str) – Type of normalisation to perform. This allows the type of normalisation to be specified, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.

  • use_deltas (bool) – Whether to compute delta features.

  • ext (str, optional) – The file extension of the saved features; if not set, self.name is used.

Notes

The data setup assumes a folder structure such as the following example:

dataset_name (data_root)

    train (data_dir)

        lab (name)
            *.lab
        lab_minmax.json

        lf0 (name)
            *.npy
        lf0_mvn.json
        lf0_deltas_mvn.json
        ...

    valid (data_dir)
        ...

    ...

All data is contained below data_root.

There can be multiple data_dir directories, e.g. one for each data split (train, valid, test).

Each feature should have a directory within data_dir; this will contain all files for that feature.

While normalisation is not handled here, you should ensure there are files present containing the normalisation parameters, e.g. ‘lf0_mvn.json’. Such files should exist for all data sources requiring normalisation, with an additional file for all data sources using delta features.

file_path(self, base_name, data_dir)[source]

Creates file path for a given base name and data directory.
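A sketch of the assumed layout (illustrative only; the actual implementation may differ):

    import os

    def file_path(name, ext, base_name, data_dir):
        # Assumed layout: <data_dir>/<name>/<base_name>.<ext>
        return os.path.join(data_dir, name, '{}.{}'.format(base_name, ext))

    print(file_path('lf0', 'npy', 'utt_001', 'dataset_name/train'))
    # dataset_name/train/lf0/utt_001.npy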

load_file(self, base_name, data_dir)[source]

Loads the contents of a given file. The result must be either a sequence feature with 2 dimensions or a scalar value.

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Return type

int or float or bool or np.ndarray, shape (seq_len, feat_dim)

save_file(self, data, base_name, data_dir)[source]

Saves data to a file using the format defined by the class.

Parameters
  • data (int or float or bool or np.ndarray, shape (seq_len, feat_dim)) – Data to be saved to the file specified.

  • base_name (str) – The name (without extensions) of the file to be saved.

  • data_dir (str) – The directory containing all feature types for this dataset.

__call__(self, base_name, data_dir)[source]

Loads the feature and creates deltas if specified by this data source.

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Loaded feature, and deltas if specified.

Return type

dict[str, (int or float or bool or np.ndarray)]
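As noted in the introduction, custom loading amounts to subclassing _DataSource and overriding load_file (and save_file). A sketch, in which the class name, the feat_dim parameter, and the flat binary format are hypothetical:

    import numpy as np

    from tts_data_tools.data_sources import _DataSource


    class FlatBinarySource(_DataSource):
        """Hypothetical source for flat float32 binary files."""

        def __init__(self, name, feat_dim, normalisation=None,
                     use_deltas=False, ext='bin'):
            super(FlatBinarySource, self).__init__(
                name, normalisation=normalisation, use_deltas=use_deltas,
                ext=ext)
            self.feat_dim = feat_dim

        def load_file(self, base_name, data_dir):
            data = np.fromfile(self.file_path(base_name, data_dir),
                               dtype=np.float32)
            # Preprocessing can happen here, e.g. reshaping to 2 dimensions.
            return data.reshape(-1, self.feat_dim)

        def save_file(self, data, base_name, data_dir):
            data.astype(np.float32).tofile(self.file_path(base_name, data_dir))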

NumpyBinarySource

class tts_data_tools.data_sources.NumpyBinarySource(name, normalisation=None, use_deltas=False, ext='npy')[source]

Bases: tts_data_tools.data_sources._DataSource

Data loading class for features saved as NumPy binary files (np.save); loading is thus performed using np.load.

Parameters
  • name (str) – Name of directory that will contain this feature.

  • normalisation (str) – Type of normalisation to perform. This allows the type of normalisation to be specified, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.

  • use_deltas (bool) – Whether to compute delta features. If normalisation is being used, it will also perform normalisation of deltas.

  • ext (str, optional) – The file extension of the saved features; if not set, name is used.

load_file(self, base_name, data_dir)[source]

Loads the feature using np.load.

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Return type

int or float or bool or np.ndarray, shape (seq_len, feat_dim)

save_file(self, data, base_name, data_dir)[source]

Saves the feature using np.save.

Parameters
  • data (int or float or bool or np.ndarray, shape (seq_len, feat_dim)) – Data to be saved to the file specified.

  • base_name (str) – The name (without extensions) of the file to be saved.

  • data_dir (str) – The directory containing all feature types for this dataset.
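A save/load round trip, assuming the example folder structure from the _DataSource notes (paths are illustrative):

    import os

    import numpy as np
    from tts_data_tools.data_sources import NumpyBinarySource

    source = NumpyBinarySource('lf0', normalisation='mvn', use_deltas=True)

    os.makedirs('dataset_name/train/lf0', exist_ok=True)
    lf0 = np.random.randn(100, 1).astype(np.float32)

    source.save_file(lf0, 'utt_001', 'dataset_name/train')
    reloaded = source.load_file('utt_001', 'dataset_name/train')
    assert np.allclose(reloaded, lf0)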

TextSource

class tts_data_tools.data_sources.TextSource(name, normalisation=None, use_deltas=False, ext='txt')[source]

Bases: tts_data_tools.data_sources._DataSource

Loads data from a text file; the file can contain integers or floats and will have up to 2 dimensions.

Parameters
  • name (str) – Name of directory that will contain this feature.

  • normalisation (str) – Type of normalisation to perform. This allows the type of normalisation to be specified, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.

  • use_deltas (bool) – Whether to compute delta features. If normalisation is being used, it will also perform normalisation of deltas.

  • ext (str, optional) – The file extension of the saved features; if not set, name is used.

load_file(self, base_name, data_dir)[source]

Loads the feature from a text file into a numpy array.

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Return type

int or float or np.ndarray, shape (seq_len, feat_dim)

save_file(self, data, base_name, data_dir)[source]

Saves data as a text file.

Parameters
  • data (int or float or bool or np.ndarray, shape (seq_len, feat_dim)) – Data to be saved to the file specified.

  • base_name (str) – The name (without extensions) of the file to be saved.

  • data_dir (str) – The directory containing all feature types for this dataset.

StringSource

class tts_data_tools.data_sources.StringSource(name, ext='txt')[source]

Bases: tts_data_tools.data_sources._DataSource

Loads data from a text file; the contents are loaded as strings, with each item on a new line.

Parameters
  • name (str) – Name of directory that will contain this feature.

  • ext (str, optional) – The file extension of the saved features; if not set, name is used.

load_file(self, base_name, data_dir)[source]

Loads lines of text.

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Return type

list[str]

save_file(self, data, base_name, data_dir)[source]

Saves text as a text file.

Parameters
  • data (list[str]) – Sequence of strings to be saved.

  • base_name (str) – The name (without extensions) of the file to be saved.

  • data_dir (str) – The directory containing all feature types for this dataset.

ASCIISource

class tts_data_tools.data_sources.ASCIISource(name, ext='txt')[source]

Bases: tts_data_tools.data_sources.StringSource

Loads data from a text file as strings (one item per line) and converts them to ASCII codes.

Parameters
  • name (str) – Name of directory that will contain this feature.

  • ext (str, optional) – The file extension of the saved features; if not set, name is used.

load_file(self, base_name, data_dir)[source]

Loads the lines and converts them to ASCII codes (np.int8); each line is considered a sequence item.

Each line can have a different number of characters; the maximum number of characters determines the size of the second dimension of the array (shorter lines are padded).

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Return type

np.ndarray, shape (seq_len, max_num_characters), dtype (np.int8)

save_file(self, data, base_name, data_dir)[source]

Saves ASCII codes as a text file.

Parameters
  • data (np.ndarray, shape (seq_len, max_num_characters), dtype (np.int8)) – Sequence of strings stored as ASCII codes (and padded with \x00).

  • base_name (str) – The name (without extensions) of the file to be saved.

  • data_dir (str) – The directory containing all feature types for this dataset.
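The conversion presumably works along the following lines; this is a standalone sketch of the described behaviour, not the library code:

    import numpy as np

    def lines_to_ascii(lines):
        # Pad shorter lines with 0 ('\x00') up to the maximum line length.
        max_len = max(len(line) for line in lines)
        codes = np.zeros((len(lines), max_len), dtype=np.int8)
        for i, line in enumerate(lines):
            codes[i, :len(line)] = np.frombuffer(line.encode('ascii'),
                                                 dtype=np.int8)
        return codes

    print(lines_to_ascii(['hello', 'hi']))
    # [[104 101 108 108 111]
    #  [104 105   0   0   0]]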

WavSource

class tts_data_tools.data_sources.WavSource(name, normalisation=None, use_deltas=False, sample_rate=None)[source]

Bases: tts_data_tools.data_sources._DataSource

Loads wavfiles using scipy.io.wavfile.

Parameters
  • name (str) – Name of directory that will contain this feature.

  • normalisation (str) – Type of normalisation to perform. This allows the type of normalisation to be specified, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.

  • use_deltas (bool) – Whether to compute delta features. If normalisation is being used, it will also perform normalisation of deltas.

sample_rate

The sample rate of the wavfiles being loaded; if not given, this will be set in self.load_file.

Type

int

load_file(self, base_name, data_dir)[source]

Loads a wavfile using scipy.io.wavfile.

Parameters
  • base_name (str) – The name (without extensions) of the file to be loaded.

  • data_dir (str) – The directory containing all feature types for this dataset.

Returns

Return type

np.ndarray, shape (num_samples,), dtype (np.int16)

save_file(self, data, base_name, data_dir)[source]

Saves the feature as a wavfile using scipy.io.wavfile.write.

Parameters
  • data (int or float or bool or np.ndarray, shape (seq_len, feat_dim)) – Data to be saved to the file specified.

  • base_name (str) – The name (without extensions) of the file to be saved.

  • data_dir (str) – The directory containing all feature types for this dataset.

Feature normalisers

Normalisers

class morgana.data.Normalisers(data_sources, normalisation_dir, data_root='.', device='cpu')[source]

Bases: dict

A dictionary-like container for normalisers that provides an interface for creating them.

Parameters
  • data_sources (dict[str, _DataSource]) – Specification of the different data to be loaded.

  • normalisation_dir (str) – The directory containing the normalisation parameters (in a JSON file).

  • data_root (str) – The directory root for this dataset.

  • device (str or torch.device) – The name of the device to place the parameters on.

create_normaliser(self, name, data_source)[source]

Creates the normaliser if one was specified for this data source.

Parameters
  • name (str) – Name used to index this data source in the model.

  • data_source (_DataSource) – Specification of how to load this feature.

Returns

Return type

_FeatureNormaliser
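As noted above, supporting a new normalisation type means overriding this method in a subclass. A sketch; the 'standardise' type is hypothetical, and the normalisation_dir/device attributes are assumptions about what the constructor stores:

    from morgana import data

    class MyNormalisers(data.Normalisers):
        def create_normaliser(self, name, data_source):
            # Route a hypothetical 'standardise' type to the existing
            # mean-variance normaliser; otherwise defer to the base class.
            if data_source.normalisation == 'standardise':
                return data.MeanVaraianceNormaliser(
                    name, self.normalisation_dir,
                    use_deltas=data_source.use_deltas, device=self.device)
            return super(MyNormalisers, self).create_normaliser(name, data_source)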

_FeatureNormaliser

class morgana.data._FeatureNormaliser(feature_name, data_dir, use_deltas=False, device='cpu', data_root='.', file_pattern='{}.json')[source]

Bases: object

Abstract feature normaliser class. Exposes the normalise() and denormalise() methods.

Normalisers will work on both NumPy arrays and PyTorch tensors. This is necessary to process NumPy arrays in _DataSource.__call__() and to normalise/denormalise PyTorch tensors in batch within the model.

Parameters
  • feature_name (str) – Name of the feature.

  • data_dir (str) – Directory containing all data for this dataset split.

  • use_deltas (bool) – Whether to load normalisation parameters for delta features.

  • device (str or torch.device) – Name of the device to place the parameters on.

  • data_root (str) – Directory root for this dataset.

  • file_pattern (str) – Format of the JSON file containing the normalisation parameters.

normalise(self, feature, deltas=False)[source]

Normalises the sequence feature.

Parameters
  • feature (np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)) – Sequence feature to be normalised; can be a NumPy array or a PyTorch tensor, and can be batched.

  • deltas (bool) – Whether feature is a delta feature, and should be normalised using the delta parameters.

Returns

Normalised sequence feature.

Return type

np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)

denormalise(self, feature, deltas=False)[source]

De-normalises the sequence feature.

Parameters
  • feature (np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)) – Sequence feature to be de-normalised; can be a NumPy array or a PyTorch tensor, and can be batched.

  • deltas (bool) – Whether feature is a delta feature, and should be de-normalised using the delta parameters.

Returns

De-normalised sequence feature.

Return type

np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)

fetch_params(self, data_type=<class 'numpy.ndarray'>, deltas=False)[source]

Gets the normalisation parameters, taking into account the delta flag and type of data.

static load_params(feature_name, data_dir, device='cpu', file_pattern='{}.json')[source]

Loads the parameters from file and converts them to NumPy arrays and PyTorch tensors.

MeanVaraianceNormaliser

class morgana.data.MeanVaraianceNormaliser(feature_name, data_dir, use_deltas=False, device='cpu', data_root='.')[source]

Bases: morgana.data._FeatureNormaliser

Normalises features such that they have zero mean and unit variance.

Normalisation:

norm_f = (f - mean) / std_dev

Denormalisation:

f = (norm_f * std_dev) + mean

Parameters
  • feature_name (str) – Name of the feature.

  • data_dir (str) – Directory containing all data for this dataset split.

  • use_deltas (bool) – Whether to load normalisation parameters for delta features.

  • device (str or torch.device) – Name of the device to place the parameters on.

  • data_root (str) – Directory root for this dataset.
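A usage sketch; it assumes the mean and standard deviation parameters already exist at dataset_name/train/lf0_mvn.json (per the folder structure in the _DataSource notes):

    import numpy as np

    from morgana import data

    normaliser = data.MeanVaraianceNormaliser('lf0', data_dir='train',
                                              data_root='dataset_name')

    lf0 = np.random.randn(100, 1).astype(np.float32)
    norm_lf0 = normaliser.normalise(lf0)

    # normalise and denormalise are inverses of each other.
    assert np.allclose(normaliser.denormalise(norm_lf0), lf0, atol=1e-5)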

MinMaxNormaliser

class morgana.data.MinMaxNormaliser(feature_name, data_dir, use_deltas=False, device='cpu', data_root='.')[source]

Bases: morgana.data._FeatureNormaliser

Normalises features such that they have a minimum value of 0 and a maximum value of 1.

Normalisation:

norm_f = (f - min) / (max - min)

Denormalisation:

f = norm_f * (max - min) + min

Parameters
  • feature_name (str) – Name of the feature.

  • data_dir (str) – Directory containing all data for this dataset split.

  • use_deltas (bool) – Whether to load normalisation parameters for delta features.

  • device (str or torch.device) – Name of the device to place the parameters on.

  • data_root (str) – Directory root for this dataset.

Wrappers to modify an existing DataLoader instance

_DataLoaderWrapper

class morgana.data._DataLoaderWrapper(data_loader)[source]

Bases: object

Abstract torch.utils.data.DataLoader wrapper. Attribute access falls through to the underlying data loader.

ToDeviceWrapper

class morgana.data.ToDeviceWrapper(data_loader, device)[source]

Bases: morgana.data._DataLoaderWrapper

Wraps the __iter__ method of torch.utils.data.DataLoader, mapping each batch to a given device.
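A usage sketch (morgana.data.batch normally performs this wrapping for you); it assumes the wrapper handles the batch containers produced by the default collate_fn:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    from morgana import data

    loader = DataLoader(TensorDataset(torch.randn(100, 10)), batch_size=16)
    wrapped = data.ToDeviceWrapper(loader, torch.device('cpu'))

    for (features,) in wrapped:
        assert features.device.type == 'cpu'

    # Attribute access falls through to the underlying loader.
    print(wrapped.batch_size)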