Prototypes#

The submodule tsl.datasets.prototypes provides interfaces that help in creating new datasets. All datasets provided by the library are implemented by extending these interfaces.

The most general interface is Dataset, the parent class of every dataset in tsl. The more complete class TabularDataset provides useful functionalities for multivariate time series datasets with data in a tabular format, i.e., with time, node and feature dimensions. Data passed to this dataset should be pandas.DataFrame and/or numpy.ndarray. Missing values are supported, either by setting missing entries to nan or by explicitly setting the mask attribute.

If your data are timestamped, meaning that each observation is associated with a specific date and time, consider using DatetimeDataset, which extends TabularDataset and provides additional functionalities for temporal data (e.g., datetime_encoded(), resample()). This class accepts a DataFrame with an index of type DatetimeIndex and columns of type MultiIndex (with nodes as the first level and channels as the second) for the target.
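As a pandas-only sketch of the expected target layout (node and channel names here are illustrative, and wrapping the frame in DatetimeDataset assumes tsl is installed):

```python
import numpy as np
import pandas as pd

# Target layout expected by DatetimeDataset: rows indexed by a
# DatetimeIndex, columns by a (nodes, channels) MultiIndex.
index = pd.date_range("2023-01-01", periods=24, freq="D")
columns = pd.MultiIndex.from_product(
    [["node0", "node1"], ["speed", "flow"]],  # illustrative names
    names=["nodes", "channels"])
target = pd.DataFrame(np.random.rand(24, 4), index=index, columns=columns)

# With tsl installed, the frame could then be wrapped as:
#   from tsl.datasets.prototypes import DatetimeDataset
#   dataset = DatetimeDataset(target=target)
```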

Dataset

Base class for Datasets in tsl.

TabularDataset

Base Dataset class for tabular data.

DatetimeDataset

Create a tsl dataset from a pandas.DataFrame.

class Dataset(*args, **kwargs)[source]#

Base class for Datasets in tsl.

Parameters:
  • name (str, optional) – Name of the dataset. If None, use name of the class. (default: None)

  • temporal_aggregation (str) – Permutation invariant function (as string) used for aggregation along the temporal dimension. (default: 'sum')

  • spatial_aggregation (str) – Permutation invariant function (as string) used for aggregation along the node dimension. (default: 'sum')

property length: int#

Returns the length – in terms of time steps – of the dataset.

Returns:

Temporal length of the dataset.

Return type:

int

property n_nodes: int#

Returns the number of nodes in the dataset. In the case of a dynamic graph, n_nodes is the total number of nodes present in at least one time step.

Returns:

Total number of nodes in the dataset.

Return type:

int

property n_channels: int#

Returns the number of node-level channels of the main signal in the dataset.

Returns:

Number of channels of the main signal.

Return type:

int

property raw_file_names: Union[str, Sequence[str]]#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names: Union[str, Sequence[str]]#

The name of the files in the self.root_dir folder that must be present in order to skip building.

property raw_files_paths: List[str]#

The absolute filepaths that must be present in order to skip downloading.

property required_files_paths: List[str]#

The absolute filepaths that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build() None[source]#

Builds the dataset from raw data in the self.root_dir folder, if necessary.

load_raw(*args, **kwargs)[source]#

Loads raw dataset without any data preprocessing.

load(*args, **kwargs)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw().

dataframe() Union[DataFrame, List[DataFrame]][source]#

Returns a pandas representation of the dataset in the form of a DataFrame. May be a list of DataFrames if the dataset has a dynamic structure.

numpy(return_idx: bool = False) Union[ndarray, List[ndarray], Tuple[ndarray, Series], Tuple[List[ndarray], Series]][source]#

Returns a numpy representation of the dataset in the form of a ndarray. If return_idx is True, it also returns a Series that can be used as index. May be a list of ndarrays (and Series) if the dataset has a dynamic structure.

save_pickle(filename: str) None[source]#

Save Dataset to disk.

Parameters:

filename (str) – path to filename for storage.

classmethod load_pickle(filename: str) Dataset[source]#

Load instance of Dataset from disk.

Parameters:

filename (str) – path of Dataset.

compute_similarity(method: str, **kwargs) Optional[ndarray][source]#

Implements the options for computing the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\), according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The dense similarity matrix.

Return type:

ndarray

get_similarity(method: Optional[str] = None, save: bool = False, **kwargs) ndarray[source]#

Returns the matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes, with the pairwise similarity scores between nodes.

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • save (bool) – Whether to save the similarity matrix in the dataset’s directory after computation. (default: False)

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The dense similarity matrix.

Return type:

ndarray

Raises:

ValueError – If the similarity method is not valid.

get_connectivity(method: Optional[str] = None, threshold: Optional[float] = None, knn: Optional[int] = None, binary_weights: bool = False, include_self: bool = True, force_symmetric: bool = False, normalize_axis: Optional[int] = None, layout: str = 'edge_index', **kwargs) Union[ndarray, Tuple, coo_matrix, csr_matrix, csc_matrix][source]#

Returns the weighted adjacency matrix \(\mathbf{A} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes. The element \(a_{i,j} \in \mathbf{A}\) is 0 if there is no edge connecting node \(i\) to node \(j\). The return type depends on the specified layout (default: edge_index).

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • threshold (float, optional) – If not None, set to 0 the values below the threshold. (default: None)

  • knn (int, optional) – If not None, keep only \(k=\) knn nearest incoming neighbors. (default: None)

  • binary_weights (bool) – If True, the positive weights of the adjacency matrix are set to 1. (default: False)

  • include_self (bool) – If False, self-loops are never taken into account. (default: True)

  • force_symmetric (bool) – Force adjacency matrix to be symmetric by taking the maximum value between the two directions for each edge. (default: False)

  • normalize_axis (int, optional) – Divide edge weight \(a_{i, j}\) by \(\sum_k a_{i, k}\), if normalize_axis=0 or \(\sum_k a_{k, j}\), if normalize_axis=1. None for no normalization. (default: None)

  • layout (str) –

    Convert matrix to a dense/sparse format. Available options are:

    • dense: keep matrix dense \(\mathbf{A} \in \mathbb{R}^{N \times N}\).

    • edge_index: convert to a (edge_index, edge_weight) tuple, where edge_index has shape \([2, E]\) and edge_weight has shape \([E]\), with \(E\) being the number of edges.

    • coo/csr/csc: convert to specified scipy sparse matrix type.

    (default: edge_index)

  • **kwargs (optional) – Additional optional keyword arguments for similarity computation.

Returns:

The adjacency matrix in the specified layout.
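The threshold, self-loop, and layout options can be mimicked on a toy similarity matrix with plain numpy (a sketch of the semantics, not the library's implementation):

```python
import numpy as np

# Toy symmetric similarity matrix for N = 4 nodes.
S = np.array([[1.0, 0.8, 0.1, 0.0],
              [0.8, 1.0, 0.5, 0.2],
              [0.1, 0.5, 1.0, 0.9],
              [0.0, 0.2, 0.9, 1.0]])

A = S.copy()
A[A < 0.5] = 0.0          # threshold=0.5: zero-out entries below threshold
np.fill_diagonal(A, 0.0)  # include_self=False: remove self-loops

# layout='edge_index': (edge_index, edge_weight), edge_index of shape [2, E].
row, col = np.nonzero(A)
edge_index = np.stack([row, col])
edge_weight = A[row, col]
```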

get_splitter(method: Optional[str] = None, *args, **kwargs) Splitter[source]#

Returns the splitter for a SpatioTemporalDataset. A Splitter provides the splits of the dataset – in terms of indices – for cross validation.

aggregate(node_index: Optional[Iterable[Iterable]] = None)[source]#

Aggregates nodes given an index of cluster assignments (spatial aggregation).

Parameters:

node_index – Sequence of grouped node ids.

get_config() dict[source]#

Returns the keyword arguments (as dict) for instantiating a SpatioTemporalDataset.

class TabularDataset(*args, **kwargs)[source]#

Base Dataset class for tabular data.

Tabular data are assumed to be 3-dimensional arrays where the dimensions represent time, nodes and features, respectively. They can be either DataFrame or ndarray.

Parameters:
  • target (FrameArray) – DataFrame or numpy.ndarray containing the data related to the target signals. The first dimension (or the DataFrame index) is considered as the temporal dimension. The second dimension represents nodes, the last one denotes the number of channels. If the input array is bi-dimensional (or the DataFrame’s columns are not a MultiIndex), the sequence is assumed to be univariate (number of channels = 1). If DataFrame’s columns are a MultiIndex with two levels, we assume nodes are at first level, channels at second.

  • covariates (dict, optional) –

    named mapping of DataFrame or numpy.ndarray representing covariates. Examples of covariates are exogenous signals (in the form of dynamic, multidimensional data) or static attributes (e.g., graph/node metadata). You can specify what each axis refers to by providing a pattern for each item in the mapping. Every item can be:

    • a DataFrame or ndarray: in this case the pattern is inferred from the shape (if possible).

    • a dict with keys ‘value’ and ‘pattern’ holding the covariate object and the corresponding pattern, respectively.

    (default: None)

  • mask (FrameArray, optional) – Boolean mask denoting if values in target are valid (True) or not (False). (default: None)

  • similarity_score (str) – Default method to compute the similarity matrix with compute_similarity. It must be inside dataset’s similarity_options. (default: None)

  • temporal_aggregation (str) – Default temporal aggregation method after resampling. (default: sum)

  • spatial_aggregation (str) – Default spatial aggregation method for aggregate, i.e., how to aggregate multiple nodes together. (default: sum)

  • default_splitting_method (str, optional) – Default splitting method for the dataset, i.e., how to split the dataset into train/val/test. (default: temporal)

  • force_synchronization (bool) – Synchronize all time-varying covariates with target. (default: True)

  • name (str, optional) – Optional name of the dataset. (default: class_name)

  • precision (int or str, optional) – numerical precision for data: 16 (or “half”), 32 (or “full”) or 64 (or “double”). (default: 32)
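The (time, nodes, channels) convention and the boolean mask can be sketched with plain numpy (shapes are illustrative; instantiating TabularDataset assumes tsl is installed):

```python
import numpy as np

steps, nodes, channels = 100, 5, 2

# Target follows the (time, nodes, channels) convention described above.
target = np.random.rand(steps, nodes, channels)
target[0, 0, 0] = np.nan  # a missing entry

# Boolean mask with the same shape: True where the value is valid.
mask = ~np.isnan(target)

# With tsl installed, the arrays could then be wrapped as:
#   dataset = TabularDataset(target=target, mask=mask)
```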

property length: int#

Number of time steps in the dataset.

property n_nodes: int#

Number of nodes in the dataset.

property n_channels: int#

Number of channels in dataset’s target.

property patterns: dict#

Shows the dimensions of the data in the dataset in a more informative way.

The pattern mapping can be useful to get a glimpse of how data are arranged. The convention we use is the following:

  • ‘t’ stands for “number of time steps”

  • ‘n’ stands for “number of nodes”

  • ‘f’ stands for “number of features” (per node)

property exogenous#

Time-varying covariates of the dataset’s target.

property attributes#

Static features related to the dataset.

property n_covariates: int#

Number of covariates in the dataset.

set_target(value: Union[DataFrame, ndarray])[source]#

Set sequence of target channels at self.target.

set_mask(mask: Optional[Union[DataFrame, ndarray]])[source]#

Set mask of target channels, i.e., a bool for each (node, time step, feature) triplet denoting if the corresponding value in target is observed (True) or not (False).

add_covariate(name: str, value: Union[DataFrame, ndarray], pattern: Optional[str] = None)[source]#

Add covariate to the dataset. Examples of covariate are exogenous signals (in the form of dynamic multidimensional data) or static attributes (e.g., graph/node metadata). Parameter pattern specifies what each axis refers to:

  • ‘t’: temporal dimension;

  • ‘n’: node dimension;

  • ‘c’/’f’: channels/features dimension.

For instance, the pattern of a node-level covariate is ‘t n f’, while a pairwise metric between nodes has pattern ‘n n’.

Parameters:
  • name (str) – the name of the object. You can then access the added object as dataset.{name}.

  • value (FrameArray) – the object to be added.

  • pattern (str, optional) –

    the pattern of the object. A pattern specifies what each axis refers to:

    • ’t’: temporal dimension;

    • ’n’: node dimension;

    • ’c’/’f’: channels/features dimension.

    If None, the pattern is inferred from the shape. (default None)
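A hypothetical helper illustrates how a pattern might be inferred from an array's shape given known time and node dimensions (the library's actual inference logic may differ):

```python
import numpy as np

steps, nodes = 48, 6

def infer_pattern(arr, steps, nodes):
    """Guess a pattern string from an array's shape (illustrative only)."""
    dims = []
    for size in arr.shape:
        if size == steps:
            dims.append("t")
        elif size == nodes:
            dims.append("n")
        else:
            dims.append("f")
    return " ".join(dims)

temperature = np.random.rand(steps, nodes, 1)  # node-level time series
distances = np.random.rand(nodes, nodes)       # pairwise node metric

# infer_pattern(temperature, steps, nodes) -> 't n f'
# infer_pattern(distances, steps, nodes)   -> 'n n'
```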

add_exogenous(name: str, value: Union[DataFrame, ndarray], node_level: bool = True)[source]#

Shortcut method to add a time-varying covariate.

aggregate(node_index: Optional[Union[Index, Mapping]] = None, aggr: Optional[str] = None, mask_tolerance: float = 0.0)[source]#

Aggregates nodes given an index of cluster assignments (spatial aggregation).

Parameters:

node_index – Sequence of grouped node ids.

dataframe() DataFrame[source]#

Returns a pandas representation of the dataset in the form of a DataFrame. May be a list of DataFrames if the dataset has a dynamic structure.

numpy(return_idx=False) Union[ndarray, Tuple[ndarray, Index]][source]#

Returns a numpy representation of the dataset in the form of a ndarray. If return_idx is True, it also returns a Series that can be used as index. May be a list of ndarrays (and Series) if the dataset has a dynamic structure.

build() None#

Builds the dataset from raw data in the self.root_dir folder, if necessary.

compute_similarity(method: str, **kwargs) Optional[ndarray]#

Implements the options for computing the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\), according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The dense similarity matrix.

Return type:

ndarray

download() None#

Downloads dataset’s files to the self.root_dir folder.

get_config() dict#

Returns the keywords arguments (as dict) for instantiating a SpatioTemporalDataset.

get_connectivity(method: Optional[str] = None, threshold: Optional[float] = None, knn: Optional[int] = None, binary_weights: bool = False, include_self: bool = True, force_symmetric: bool = False, normalize_axis: Optional[int] = None, layout: str = 'edge_index', **kwargs) Union[ndarray, Tuple, coo_matrix, csr_matrix, csc_matrix]#

Returns the weighted adjacency matrix \(\mathbf{A} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes. The element \(a_{i,j} \in \mathbf{A}\) is 0 if there is no edge connecting node \(i\) to node \(j\). The return type depends on the specified layout (default: edge_index).

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • threshold (float, optional) – If not None, set to 0 the values below the threshold. (default: None)

  • knn (int, optional) – If not None, keep only \(k=\) knn nearest incoming neighbors. (default: None)

  • binary_weights (bool) – If True, the positive weights of the adjacency matrix are set to 1. (default: False)

  • include_self (bool) – If False, self-loops are never taken into account. (default: True)

  • force_symmetric (bool) – Force adjacency matrix to be symmetric by taking the maximum value between the two directions for each edge. (default: False)

  • normalize_axis (int, optional) – Divide edge weight \(a_{i, j}\) by \(\sum_k a_{i, k}\), if normalize_axis=0 or \(\sum_k a_{k, j}\), if normalize_axis=1. None for no normalization. (default: None)

  • layout (str) –

    Convert matrix to a dense/sparse format. Available options are:

    • dense: keep matrix dense \(\mathbf{A} \in \mathbb{R}^{N \times N}\).

    • edge_index: convert to a (edge_index, edge_weight) tuple, where edge_index has shape \([2, E]\) and edge_weight has shape \([E]\), with \(E\) being the number of edges.

    • coo/csr/csc: convert to specified scipy sparse matrix type.

    (default: edge_index)

  • **kwargs (optional) – Additional optional keyword arguments for similarity computation.

Returns:

The adjacency matrix in the specified layout.

get_similarity(method: Optional[str] = None, save: bool = False, **kwargs) ndarray#

Returns the matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes, with the pairwise similarity scores between nodes.

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • save (bool) – Whether to save the similarity matrix in the dataset’s directory after computation. (default: False)

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The dense similarity matrix.

Return type:

ndarray

Raises:

ValueError – If the similarity method is not valid.

get_splitter(method: Optional[str] = None, *args, **kwargs) Splitter#

Returns the splitter for a SpatioTemporalDataset. A Splitter provides the splits of the dataset – in terms of indices – for cross validation.

load(*args, **kwargs)#

Loads the raw dataset and preprocesses the data. Defaults to load_raw().

classmethod load_pickle(filename: str) Dataset#

Load instance of Dataset from disk.

Parameters:

filename (str) – path of Dataset.

load_raw(*args, **kwargs)#

Loads raw dataset without any data preprocessing.

property raw_file_names: Union[str, Sequence[str]]#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property raw_files_paths: List[str]#

The absolute filepaths that must be present in order to skip downloading.

property required_file_names: Union[str, Sequence[str]]#

The name of the files in the self.root_dir folder that must be present in order to skip building.

property required_files_paths: List[str]#

The absolute filepaths that must be present in order to skip building.

save_pickle(filename: str) None#

Save Dataset to disk.

Parameters:

filename (str) – path to filename for storage.

class DatetimeDataset(*args, **kwargs)[source]#

Create a tsl dataset from a pandas.DataFrame.

Parameters:
  • target (pandas.Dataframe) –

    DataFrame containing the data related to the main signals. The index is considered as the temporal dimension. The columns are identified as:

    • nodes: if there is only one level (we assume the number of channels to be 1).

    • (nodes, channels): if there are two levels (i.e., if columns is a MultiIndex). We assume nodes are at first level, channels at second.

  • covariates (dict, optional) –

    named mapping of DataFrame or numpy.ndarray representing covariates. Examples of covariates are exogenous signals (in the form of dynamic, multidimensional data) or static attributes (e.g., graph/node metadata). You can specify what each axis refers to by providing a pattern for each item in the mapping. Every item can be:

    • a DataFrame or ndarray: in this case the pattern is inferred from the shape (if possible).

    • a dict with keys ‘value’ and ‘pattern’ holding the covariate object and the corresponding pattern, respectively.

    (default: None)

  • mask (pandas.Dataframe or numpy.ndarray, optional) – Boolean mask denoting if values in data are valid (True) or not (False). (default: None)

  • freq (str, optional) – Force a sampling rate, resampling if necessary. (default: None)

  • similarity_score (str) – Default method to compute the similarity matrix with compute_similarity. It must be inside dataset’s similarity_options. (default: None)

  • temporal_aggregation (str) – Default temporal aggregation method after resampling. This method is used during instantiation to resample the dataset. It must be inside dataset’s temporal_aggregation_options. (default: sum)

  • spatial_aggregation (str) – Default spatial aggregation method for aggregate, i.e., how to aggregate multiple nodes together. It must be inside dataset’s spatial_aggregation_options. (default: sum)

  • default_splitting_method (str, optional) – Default splitting method for the dataset, i.e., how to split the dataset into train/val/test. (default: temporal)

  • sort_index (bool) – whether to sort the dataset chronologically at initialization. (default: True)

  • force_synchronization (bool) – Synchronize all time-varying covariates with target. (default: True)

  • name (str, optional) – Optional name of the dataset. (default: class_name)

  • precision (int or str, optional) – numerical precision for data: 16 (or “half”), 32 (or “full”) or 64 (or “double”). (default: 32)

add_covariate(name: str, value: Union[DataFrame, ndarray], pattern: Optional[str] = None)#

Add covariate to the dataset. Examples of covariate are exogenous signals (in the form of dynamic multidimensional data) or static attributes (e.g., graph/node metadata). Parameter pattern specifies what each axis refers to:

  • ‘t’: temporal dimension;

  • ‘n’: node dimension;

  • ‘c’/’f’: channels/features dimension.

For instance, the pattern of a node-level covariate is ‘t n f’, while a pairwise metric between nodes has pattern ‘n n’.

Parameters:
  • name (str) – the name of the object. You can then access the added object as dataset.{name}.

  • value (FrameArray) – the object to be added.

  • pattern (str, optional) –

    the pattern of the object. A pattern specifies what each axis refers to:

    • ’t’: temporal dimension;

    • ’n’: node dimension;

    • ’c’/’f’: channels/features dimension.

    If None, the pattern is inferred from the shape. (default None)

add_exogenous(name: str, value: Union[DataFrame, ndarray], node_level: bool = True)#

Shortcut method to add a time-varying covariate.

aggregate(node_index: Optional[Union[Index, Mapping]] = None, aggr: Optional[str] = None, mask_tolerance: float = 0.0)#

Aggregates nodes given an index of cluster assignments (spatial aggregation).

Parameters:

node_index – Sequence of grouped node ids.

property attributes#

Static features related to the dataset.

build() None#

Builds the dataset from raw data in the self.root_dir folder, if necessary.

compute_similarity(method: str, **kwargs) Optional[ndarray]#

Implements the options for computing the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\), according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The dense similarity matrix.

Return type:

ndarray

dataframe() DataFrame#

Returns a pandas representation of the dataset in the form of a DataFrame. May be a list of DataFrames if the dataset has a dynamic structure.

datetime_encoded(units: Union[str, List]) DataFrame#

Transform dataset’s temporal index into covariates using sinusoidal transformations. Each temporal unit is used as the period of the transformation, obtaining two features (\(\sin\) and \(\cos\)) for each unit.

datetime_onehot(units: Union[str, List]) DataFrame#

Transform dataset’s temporal index into one-hot encodings for each specified temporal unit. Internally, this function calls pandas.get_dummies().
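Both encodings can be reproduced with pandas alone (a sketch of the idea using the weekly period; the library methods operate on the dataset's own index):

```python
import numpy as np
import pandas as pd

index = pd.date_range("2023-01-02", periods=14, freq="D")  # two weeks

# Sinusoidal encoding of the weekly period: two features (sin, cos).
phase = 2 * np.pi * index.dayofweek / 7
encoded = pd.DataFrame({"week_sin": np.sin(phase),
                        "week_cos": np.cos(phase)}, index=index)

# One-hot encoding of the weekday, via pandas.get_dummies as in
# datetime_onehot().
onehot = pd.get_dummies(index.day_name())
```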

download() None#

Downloads dataset’s files to the self.root_dir folder.

property exogenous#

Time-varying covariates of the dataset’s target.

get_config() dict#

Returns the keywords arguments (as dict) for instantiating a SpatioTemporalDataset.

get_connectivity(method: Optional[str] = None, threshold: Optional[float] = None, knn: Optional[int] = None, binary_weights: bool = False, include_self: bool = True, force_symmetric: bool = False, normalize_axis: Optional[int] = None, layout: str = 'edge_index', **kwargs) Union[ndarray, Tuple, coo_matrix, csr_matrix, csc_matrix]#

Returns the weighted adjacency matrix \(\mathbf{A} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes. The element \(a_{i,j} \in \mathbf{A}\) is 0 if there is no edge connecting node \(i\) to node \(j\). The return type depends on the specified layout (default: edge_index).

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • threshold (float, optional) – If not None, set to 0 the values below the threshold. (default: None)

  • knn (int, optional) – If not None, keep only \(k=\) knn nearest incoming neighbors. (default: None)

  • binary_weights (bool) – If True, the positive weights of the adjacency matrix are set to 1. (default: False)

  • include_self (bool) – If False, self-loops are never taken into account. (default: True)

  • force_symmetric (bool) – Force adjacency matrix to be symmetric by taking the maximum value between the two directions for each edge. (default: False)

  • normalize_axis (int, optional) – Divide edge weight \(a_{i, j}\) by \(\sum_k a_{i, k}\), if normalize_axis=0 or \(\sum_k a_{k, j}\), if normalize_axis=1. None for no normalization. (default: None)

  • layout (str) –

    Convert matrix to a dense/sparse format. Available options are:

    • dense: keep matrix dense \(\mathbf{A} \in \mathbb{R}^{N \times N}\).

    • edge_index: convert to a (edge_index, edge_weight) tuple, where edge_index has shape \([2, E]\) and edge_weight has shape \([E]\), with \(E\) being the number of edges.

    • coo/csr/csc: convert to specified scipy sparse matrix type.

    (default: edge_index)

  • **kwargs (optional) – Additional optional keyword arguments for similarity computation.

Returns:

The adjacency matrix in the specified layout.

get_similarity(method: Optional[str] = None, save: bool = False, **kwargs) ndarray#

Returns the matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes, with the pairwise similarity scores between nodes.

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • save (bool) – Whether to save the similarity matrix in the dataset’s directory after computation. (default: False)

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The dense similarity matrix.

Return type:

ndarray

Raises:

ValueError – If the similarity method is not valid.

get_splitter(method: Optional[str] = None, *args, **kwargs) Splitter#

Returns the splitter for a SpatioTemporalDataset. A Splitter provides the splits of the dataset – in terms of indices – for cross validation.

holidays_onehot(country, subdiv=None) DataFrame#

Returns a DataFrame indicating whether each timestamp in the dataset falls on a holiday. See https://python-holidays.readthedocs.io/en/latest/.

Parameters:
  • country (str) – country for which holidays have to be checked, e.g., “CH” for Switzerland.

  • subdiv (str, optional) – optional country sub-division (state, region, province, canton), e.g., “TI” for Ticino, Switzerland.

Returns:

DataFrame with one column (“holiday”) as one-hot encoding (1 if the timestamp falls on a holiday, 0 otherwise).

Return type:

pandas.DataFrame

property length: int#

Number of time steps in the dataset.

load(*args, **kwargs)#

Loads the raw dataset and preprocesses the data. Defaults to load_raw().

classmethod load_pickle(filename: str) Dataset#

Load instance of Dataset from disk.

Parameters:

filename (str) – path of Dataset.

load_raw(*args, **kwargs)#

Loads raw dataset without any data preprocessing.

property n_channels: int#

Number of channels in dataset’s target.

property n_covariates: int#

Number of covariates in the dataset.

property n_nodes: int#

Number of nodes in the dataset.

numpy(return_idx=False) Union[ndarray, Tuple[ndarray, Index]]#

Returns a numpy representation of the dataset in the form of a ndarray. If return_idx is True, it also returns a Series that can be used as index. May be a list of ndarrays (and Series) if the dataset has a dynamic structure.

property patterns: dict#

Shows the dimensions of the data in the dataset in a more informative way.

The pattern mapping can be useful to get a glimpse of how data are arranged. The convention we use is the following:

  • ‘t’ stands for “number of time steps”

  • ‘n’ stands for “number of nodes”

  • ‘f’ stands for “number of features” (per node)

property raw_file_names: Union[str, Sequence[str]]#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property raw_files_paths: List[str]#

The absolute filepaths that must be present in order to skip downloading.

property required_file_names: Union[str, Sequence[str]]#

The name of the files in the self.root_dir folder that must be present in order to skip building.

property required_files_paths: List[str]#

The absolute filepaths that must be present in order to skip building.

save_pickle(filename: str) None#

Save Dataset to disk.

Parameters:

filename (str) – path to filename for storage.

set_mask(mask: Optional[Union[DataFrame, ndarray]])#

Set mask of target channels, i.e., a bool for each (node, time step, feature) triplet denoting if the corresponding value in target is observed (True) or not (False).

set_target(value: Union[DataFrame, ndarray])#

Set sequence of target channels at self.target.