data
Specification of what data the model will load is given as part of your model class (see base_models). This does not specify where the data will be loaded from, but what data: the given name, the directory name, the normalisation used, whether delta and delta-delta features are included, and the file extension. The location of the data is given when running an experiment, using the command line arguments.
If you need to load a file with a custom function, create a subclass of _DataSource (loading can include preprocessing if needed).
The following sections describe the data loading utilities provided in Morgana. These are all used internally by experiment_builder; typically, the only things you need to be aware of are the available data sources and feature normalisers.
Batching utility

morgana.data.batch(data_generator, batch_size=32, shuffle=True, num_data_threads=0, device='cpu')

Creates the batched data loader for the given dataset and maps each batch to the given device.
- Parameters
data_generator (torch.utils.data.Dataset or FilesDataset) – Dataset from which to load the batches of data.
batch_size (int) – Number of samples to load per batch.
shuffle (bool) – Whether to shuffle the data every epoch.
num_data_threads (int) – Number of parallel subprocesses to use for data loading.
device (str) – Name of the device to place the parameters on.
- Returns
An instance with the __iter__ method, allowing iteration over batches of the dataset.
- Return type
torch.utils.data.DataLoader (in a ToDeviceWrapper container)
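The behaviour of batch can be sketched in plain Python. This is an illustrative stand-in, not the morgana implementation (which builds a torch.utils.data.DataLoader); only the signature above is from the docs.

```python
import random

def batch(data, batch_size=32, shuffle=True, seed=0):
    """Illustrative stand-in for morgana.data.batch: yields fixed-size
    batches of items, optionally shuffling the index order first."""
    indices = list(range(len(data)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]
```

The last batch may be smaller than batch_size, mirroring a DataLoader with drop_last=False.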
FilesDataset

Note

This dataset provides indexing access to the given _DataSource instances. It provides a custom collate_fn for transposing and padding a dictionary of features.
class morgana.data.FilesDataset(data_sources, data_dir, id_list, normalisers, data_root='.')

Bases: torch.utils.data.dataset.Dataset

Combines multiple _DataSource instances and enables batching of a dictionary of sequence features.

- Parameters
data_sources (dict[str, _DataSource]) – Specification of the different data to be loaded.
data_dir (str) – The directory containing all data for this dataset split.
id_list (str) – The name of the file id-list containing base names to load, contained within data_root.
normalisers (Normalisers or dict[str, _FeatureNormaliser]) – Normaliser instances used to normalise loaded features (and delta features).
data_root (str) – The directory root for this dataset.
normalisers

The normaliser instances, set automatically by morgana.experiment_builder.ExperimentBuilder.

static collate_fn(batch)

Collates a list of outputs from self.__getitem__ into a batched structure.

- Parameters
batch (list[dict[str, object]]) – Each element in the list is a non-nested dictionary containing features loaded by each data source.
- Returns
batched_features – Batched version of the features items in batch. Note, it is possible to provide objects such as strings that will not be converted to torch.Tensor; these will not be padded or sent to the correct device, but can be accessed in the features dictionary.
- Return type
dict[str, torch.Tensor]
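What a collate_fn like this does can be sketched with NumPy: transpose a list of feature dictionaries into a dictionary of batched values, zero-padding sequence features to the longest sequence. The zero-padding scheme is an assumption here; the real implementation produces torch.Tensor batches.

```python
import numpy as np

def collate_fn(batch):
    """Transpose a list of dicts into a dict of batched values, padding
    2-D sequence features (seq_len, feat_dim) to the longest sequence."""
    features = {name: [item[name] for item in batch] for name in batch[0]}
    collated = {}
    for name, values in features.items():
        if isinstance(values[0], np.ndarray):
            max_len = max(v.shape[0] for v in values)
            padded = np.zeros((len(values), max_len, values[0].shape[1]),
                              dtype=values[0].dtype)
            for i, v in enumerate(values):
                padded[i, :v.shape[0]] = v
            collated[name] = padded
        else:
            # Non-array items (e.g. strings) pass through unpadded.
            collated[name] = values
    return collated
```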
DataSource specification
Note
Data sources are defined in tts_data_tools; they provide a consistent interface for defining which features a model should load.
Note
Supported feature normalisers are limited to mvn and minmax. To define a new normaliser you should override Normalisers.create_normaliser().
_DataSource

class tts_data_tools.data_sources._DataSource(name, normalisation=None, use_deltas=False, ext=None)

Bases: object

Abstract data loading class.

- Parameters
name (str) – Name of directory that will contain this feature.
normalisation (None or str) – Type of normalisation to perform. This specifies the type of normalisation, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.
use_deltas (bool) – Whether to compute delta features.
ext (str, optional) – The file extension of the saved features; if not set, self.name is used.
Notes

The data setup assumes a folder structure such as the following example:

    dataset_name (data_root)
        train (data_dir)
            lab (name)
                *.lab
            lab_minmax.json
            lf0 (name)
                *.npy
            lf0_mvn.json
            lf0_deltas_mvn.json
            ...
        valid (data_dir)
            ...
        ...
All data is contained below data_root.

There can be multiple data_dir directories, e.g. one for each data split (train, valid, test).

Each feature should have a directory within data_dir; this will contain all files for this feature.

While normalisation is not handled here, you should ensure there are files present containing the normalisation parameters, e.g. 'lf0_mvn.json'. Such files should exist for all data sources requiring normalisation, with an additional file for all data sources using delta features.
file_path(self, base_name, data_dir)

Creates the file path for a given base name and data directory.

load_file(self, base_name, data_dir)

Loads the contents of a given file. Must either be a sequence feature with 2 dimensions or a scalar value.

save_file(self, data, base_name, data_dir)

Saves data to a file using the format defined by the class.

__call__(self, base_name, data_dir)

Loads the feature and creates deltas if specified by this data source.

__weakref__

list of weak references to the object (if defined)
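The path construction implied by the folder structure above can be sketched as follows. This is a hypothetical re-implementation for illustration, not the tts_data_tools source; it only shows the ext-falls-back-to-name convention and the data_dir/name/base_name.ext layout.

```python
import os

class DataSourceSketch:
    """Sketch of _DataSource path handling: ext falls back to name."""

    def __init__(self, name, ext=None):
        self.name = name
        self.ext = ext if ext is not None else name

    def file_path(self, base_name, data_dir):
        # Feature files live in a directory named after the feature.
        return os.path.join(data_dir, self.name,
                            '{}.{}'.format(base_name, self.ext))
```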
NumpyBinarySource

class tts_data_tools.data_sources.NumpyBinarySource(name, normalisation=None, use_deltas=False, ext='npy')

Bases: tts_data_tools.data_sources._DataSource

Data loading class for features saved with np.ndarray.tofile; loading is thus performed using np.fromfile.

- Parameters
name (str) – Name of directory that will contain this feature.
normalisation (str) – Type of normalisation to perform. This specifies the type of normalisation, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.
use_deltas (bool) – Whether to compute delta features. If normalisation is being used it will also perform normalisation of deltas.
ext (str, optional) – The file extension of the saved features; if not set, name is used.
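Note that np.ndarray.tofile writes raw bytes with no header, so np.fromfile recovers a flat array: the caller must already know the dtype and the feature dimension. A round trip looks like this (the file name is just an example):

```python
import os
import tempfile

import numpy as np

feature = np.arange(6, dtype=np.float32).reshape(3, 2)

path = os.path.join(tempfile.mkdtemp(), 'utt_001.npy')
feature.tofile(path)  # raw bytes: no shape or dtype metadata is stored

flat = np.fromfile(path, dtype=np.float32)  # dtype must be known up front
restored = flat.reshape(-1, 2)              # ...as must the feature dimension
```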
TextSource

class tts_data_tools.data_sources.TextSource(name, normalisation=None, use_deltas=False, ext='txt')

Bases: tts_data_tools.data_sources._DataSource

Loads data from a text file; this can contain integers or floats and will have up to 2 dimensions.

- Parameters
name (str) – Name of directory that will contain this feature.
normalisation (str) – Type of normalisation to perform. This specifies the type of normalisation, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.
use_deltas (bool) – Whether to compute delta features. If normalisation is being used it will also perform normalisation of deltas.
ext (str, optional) – The file extension of the saved features; if not set, name is used.
load_file(self, base_name, data_dir)

Loads the feature from a text file into a numpy array.
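Loading a 2-D text feature can be sketched with np.loadtxt. Whether TextSource uses np.loadtxt internally is an assumption; the sketch only demonstrates the loading contract (one frame per line, whitespace-separated values, up to 2 dimensions).

```python
from io import StringIO

import numpy as np

# One frame per line, whitespace-separated values.
text = '1.0 2.0\n3.0 4.0\n'

# ndmin=2 guarantees a 2-D result even for a single-line file.
feature = np.loadtxt(StringIO(text), ndmin=2)
```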
StringSource

class tts_data_tools.data_sources.StringSource(name, ext='txt')

Bases: tts_data_tools.data_sources._DataSource

Loads data from a text file; this will be loaded as strings, where each item should be on a new line.

- Parameters
name (str) – Name of directory that will contain this feature.
ext (str, optional) – The file extension of the saved features; if not set, name is used.
ASCIISource

class tts_data_tools.data_sources.ASCIISource(name, ext='txt')

Bases: tts_data_tools.data_sources.StringSource

Loads data from a text file; this will be loaded as strings, where each item should be on a new line.

- Parameters
name (str) – Name of directory that will contain this feature.
ext (str, optional) – The file extension of the saved features; if not set, name is used.

load_file(self, base_name, data_dir)

Loads the lines and converts them to ASCII codes (np.int8); each line is considered a sequence item.

Each line can have a different number of characters; the maximum number of characters will be used to determine the shape of the 2nd dimension of the array.
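The line-to-ASCII conversion described above can be sketched as follows. This is a hypothetical re-implementation; padding shorter lines with zeros is an assumption, but the rectangular (n_lines, max_chars) output shape matches the description.

```python
import numpy as np

def lines_to_ascii(lines):
    """Convert each line to a row of ASCII codes (np.int8), padded to the
    longest line so the result is a rectangular 2-D array."""
    max_len = max(len(line) for line in lines)
    codes = np.zeros((len(lines), max_len), dtype=np.int8)
    for i, line in enumerate(lines):
        codes[i, :len(line)] = [ord(c) for c in line]
    return codes
```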
WavSource

class tts_data_tools.data_sources.WavSource(name, normalisation=None, use_deltas=False, sample_rate=None)

Bases: tts_data_tools.data_sources._DataSource

Loads wavfiles using scipy.io.wavfile.

- Parameters
name (str) – Name of directory that will contain this feature.
normalisation (str) – Type of normalisation to perform. This specifies the type of normalisation, but the normaliser itself is not contained within the data source; it must be handled outside of the data source.
use_deltas (bool) – Whether to compute delta features. If normalisation is being used it will also perform normalisation of deltas.
sample_rate

The sample rate of the wavfiles being loaded; if not given, this will be set in self.load_file.
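scipy.io.wavfile.read returns a (sample_rate, samples) pair, with the sample rate taken from the file header, which is how a sample_rate that was not given up front can be filled in at load time. The same contract can be demonstrated with the stdlib wave module so the sketch runs without SciPy:

```python
import os
import struct
import tempfile
import wave

path = os.path.join(tempfile.mkdtemp(), 'utt_001.wav')

# Write a tiny mono 16 kHz, 16-bit wavfile.
with wave.open(path, 'wb') as f:
    f.setnchannels(1)
    f.setsampwidth(2)          # 16-bit samples
    f.setframerate(16000)
    f.writeframes(struct.pack('<4h', 0, 100, -100, 0))

# Read it back: the sample rate comes from the file header.
with wave.open(path, 'rb') as f:
    sample_rate = f.getframerate()
    n_frames = f.getnframes()
    samples = struct.unpack('<{}h'.format(n_frames), f.readframes(n_frames))
```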
Feature normalisers

Normalisers

class morgana.data.Normalisers(data_sources, normalisation_dir, data_root='.', device='cpu')

Bases: dict

A dictionary-like container for normalisers; provides an interface for creating the normalisers.

- Parameters
data_sources (dict[str, _DataSource]) – Specification of the different data to be loaded.
normalisation_dir (str) – The directory containing the normalisation parameter files.
data_root (str) – The directory root for this dataset.
device (str) – Name of the device to place the normalisation parameters on.
create_normaliser(self, name, data_source)

Creates the normaliser, if one was specified for this data source.

- Parameters
name (str) – Name used to index this data source in the model.
data_source (_DataSource) – Specification of how to load this feature.
_FeatureNormaliser

class morgana.data._FeatureNormaliser(feature_name, data_dir, use_deltas=False, device='cpu', data_root='.', file_pattern='{}.json')

Bases: object

Abstract feature normaliser class. Exposes the normalise() and denormalise() methods.

Normalisers will work on both NumPy arrays and PyTorch tensors. This is necessary to process NumPy arrays in _DataSource.__call__() and to normalise/denormalise PyTorch tensors in batch within the model.

- Parameters
feature_name (str) – Name of the feature.
data_dir (str) – Directory containing all data for this dataset split.
use_deltas (bool) – Whether to load normalisation parameters for delta features.
device (str or torch.device) – Name of the device to place the parameters on.
data_root (str) – Directory root for this dataset.
file_pattern (str) – Format of the JSON file containing the normalisation parameters.
normalise(self, feature, deltas=False)

Normalises the sequence feature.

- Parameters
feature (np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)) – Sequence feature to be normalised; can be a NumPy array or a PyTorch tensor, and can be batched.
deltas (bool) – Whether feature is a delta feature and should be normalised using the delta parameters.
- Returns
Normalised sequence feature.
- Return type
np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)
denormalise(self, feature, deltas=False)

De-normalises the sequence feature.

- Parameters
feature (np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)) – Sequence feature to be de-normalised; can be a NumPy array or a PyTorch tensor, and can be batched.
deltas (bool) – Whether feature is a delta feature and should be de-normalised using the delta parameters.

- Returns
De-normalised sequence feature.
- Return type
np.ndarray or torch.Tensor, shape (batch_size, seq_len, feat_dim) or (seq_len, feat_dim)
MeanVaraianceNormaliser

class morgana.data.MeanVaraianceNormaliser(feature_name, data_dir, use_deltas=False, device='cpu', data_root='.')

Bases: morgana.data._FeatureNormaliser

Normalises features such that they have zero mean and unit variance.
- Normalisation:
norm_f = (f - mean) / std_dev
- Denormalisation:
f = (norm_f * std_dev) + mean
- Parameters
feature_name (str) – Name of the feature.
data_dir (str) – Directory containing all data for this dataset split.
use_deltas (bool) – Whether to load normalisation parameters for delta features.
device (str or torch.device) – Name of the device to place the parameters on.
data_root (str) – Directory root for this dataset.
MinMaxNormaliser

class morgana.data.MinMaxNormaliser(feature_name, data_dir, use_deltas=False, device='cpu', data_root='.')

Bases: morgana.data._FeatureNormaliser

Normalises features such that they have a minimum value of 0 and a maximum value of 1.
- Normalisation:
norm_f = (f - min) / (max - min)
- Denormalisation:
f = norm_f * (max - min) + min
- Parameters
feature_name (str) – Name of the feature.
data_dir (str) – Directory containing all data for this dataset split.
use_deltas (bool) – Whether to load normalisation parameters for delta features.
device (str or torch.device) – Name of the device to place the parameters on.
data_root (str) – Directory root for this dataset.
Wrappers to change an existing DataLoader instance

_DataLoaderWrapper

class morgana.data._DataLoaderWrapper(data_loader)

Bases: object

Abstract torch.utils.data.DataLoader wrapper. Allows attribute references to fall through to the underlying data loader.
ToDeviceWrapper

class morgana.data.ToDeviceWrapper(data_loader, device)

Bases: morgana.data._DataLoaderWrapper

Wraps the __iter__ method of torch.utils.data.DataLoader, mapping each batch to a given device.
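The delegation pattern these wrappers use can be sketched as follows. This is a hypothetical re-implementation: attribute lookups fall through via __getattr__, and __iter__ maps every value in each batch dictionary. With PyTorch tensors, to_device would call value.to(self.device); here it is the identity so the sketch runs anywhere.

```python
class DataLoaderWrapperSketch:
    """Delegate unknown attribute lookups to the wrapped loader."""

    def __init__(self, data_loader):
        self.data_loader = data_loader

    def __getattr__(self, name):
        # Only called when normal lookup fails, so this forwards
        # e.g. batch_size without re-implementing it.
        return getattr(self.data_loader, name)

class ToDeviceWrapperSketch(DataLoaderWrapperSketch):
    """Wrap __iter__, mapping every value in each batch dictionary."""

    def __init__(self, data_loader, device):
        super().__init__(data_loader)
        self.device = device

    def to_device(self, value):
        # With PyTorch this would be value.to(self.device) for tensors.
        return value

    def __iter__(self):
        for batch in self.data_loader:
            yield {name: self.to_device(value) for name, value in batch.items()}
```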