Datasets in TSL#

AirQuality

Measurements of pollutant \(PM2.5\) collected by 437 air quality monitoring stations spread across 43 Chinese cities from May 2014 to April 2015.

Elergone

Load profiles of 370 points collected every 15 minutes from 2011 to 2014.

MetrLA

Traffic readings collected from 207 loop detectors on highways in Los Angeles County, aggregated into 5-minute intervals over four months, between March 2012 and June 2012.

PemsBay

The dataset contains 6 months of traffic readings, from 01/01/2017 to 05/31/2017, collected every 5 minutes by 325 traffic sensors in the San Francisco Bay Area.

ElectricityBenchmark

Electricity consumption (in kWh) measured hourly by 321 sensors from 2012 to 2014.

TrafficBenchmark

A collection of hourly road occupancy rates (between 0 and 1) measured over 24 months (2015-2016) by 862 sensors on San Francisco Bay Area freeways.

SolarBenchmark

Solar power production records from the year 2006, sampled every 10 minutes at 137 synthetic PV farms in Alabama.

ExchangeBenchmark

Daily exchange rates of eight countries (Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016.

GaussianNoiseSyntheticDataset

A generator of synthetic datasets from an input model and input graph.

class AirQuality(*args, **kwargs)[source]#

Measurements of pollutant \(PM2.5\) collected by 437 air quality monitoring stations spread across 43 Chinese cities from May 2014 to April 2015.

A smaller version of the dataset is also available as AirQuality(small=True), restricted to the subset of 36 sensors in Beijing.

Data collected inside the Urban Air project.

Dataset size:
  • Time steps: 8760

  • Nodes: 437

  • Channels: 1

  • Sampling rate: 1 hour

  • Missing values: 25.67%

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names: List[str]#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names: List[str]#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download()[source]#

Downloads dataset’s files to the self.root_dir folder.

build()[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(impute_nans=True)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

get_splitter(method: Optional[str] = None, **kwargs)[source]#

Returns the splitter for a SpatioTemporalDataset. A Splitter provides the splits of the dataset – in terms of indices – for cross validation.
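
To illustrate what a splitter computes, here is a minimal sketch of a sequential (non-shuffled) split of time-step indices into train/validation/test folds. The function name and the convention that ratios are fractions of the whole series are illustrative assumptions, not the exact tsl Splitter API.

```python
def temporal_split(n_steps: int, val_len: float = 0.1, test_len: float = 0.2):
    """Split time-step indices [0, n_steps) sequentially into
    train/val/test folds, preserving temporal order (no shuffling)."""
    n_test = int(n_steps * test_len)
    n_val = int(n_steps * val_len)
    n_train = n_steps - n_val - n_test
    train = list(range(0, n_train))
    val = list(range(n_train, n_train + n_val))
    test = list(range(n_train + n_val, n_steps))
    return train, val, test

train, val, test = temporal_split(100)
# train covers steps 0-69, val 70-79, test 80-99
```

Keeping the folds contiguous in time avoids leaking future observations into the training set, which is the usual requirement in forecasting benchmarks.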

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray
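
Distance-based datasets like this one, which expose a dist attribute, commonly derive the similarity matrix from pairwise distances via a Gaussian kernel (as in the DCRNN-style benchmarks). A minimal numpy sketch follows; the bandwidth heuristic (standard deviation of the distances) is a common convention, not necessarily tsl's exact default.

```python
import numpy as np

def gaussian_kernel_similarity(dist, theta=None):
    """Similarity S[i, j] = exp(-dist[i, j]^2 / theta^2).
    theta defaults to the std of the distances (a common heuristic)."""
    if theta is None:
        theta = dist.std()
    return np.exp(-(dist ** 2) / (theta ** 2))

dist = np.array([[0.0, 1.0],
                 [1.0, 0.0]])
S = gaussian_kernel_similarity(dist, theta=1.0)
# diagonal entries are exp(0) = 1; off-diagonal entries decay with distance
```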

class Elergone(*args, **kwargs)[source]#

Load profiles of 370 points collected every 15 minutes from 2011 to 2014.

Raw data at https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014. The load method loads the values in kWh, computes the mask for the zero values and pads the missing steps.

From the original description:

Values in the original dataframe are in kW of each 15 min. To convert values in kWh values must be divided by 4. Each column represent one client. Some clients were created after 2011. In these cases consumption were considered zero. All time labels report to Portuguese hour. However, all days present 96 measures (24*4). Every year in March time change day (which has only 23 hours) the values between 1:00 am and 2:00 am are zero for all points. Every year in October time change day (which has 25 hours) the values between 1:00 am and 2:00 am aggregate the consumption of two hours.
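
The kW-to-kWh conversion described above can be sketched with pandas; the client column names and readings here are made up for illustration.

```python
import pandas as pd

# Quarter-hourly readings in kW for two hypothetical clients.
idx = pd.date_range("2012-01-01 00:00", periods=4, freq="15min")
df_kw = pd.DataFrame({"MT_001": [4.0, 4.0, 8.0, 8.0],
                      "MT_002": [0.0, 2.0, 2.0, 0.0]}, index=idx)

# Energy per 15-minute slot: kW / 4 gives kWh, as in the original description.
df_kwh = df_kw / 4.0

# Aggregate to hourly consumption in kWh.
hourly = df_kwh.resample("1h").sum()
# hourly["MT_001"] for the first hour: (4 + 4 + 8 + 8) / 4 = 6.0 kWh
```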

Dataset size:
  • Time steps: 140256

  • Nodes: 370

  • Channels: 1

  • Sampling rate: 15 minutes

  • Missing values: 20.15%

Parameters:
  • root – Root folder for data download.

  • freq – Resampling frequency.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build()[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw() DataFrame[source]#

Loads raw dataset without any data preprocessing.

load()[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, gamma=10, trainlen=None, **kwargs) Optional[ndarray][source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class MetrLA(*args, **kwargs)[source]#

Traffic readings collected from 207 loop detectors on highways in Los Angeles County, aggregated into 5-minute intervals over four months, between March 2012 and June 2012.

A benchmark dataset for traffic forecasting as described in “Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting”.

Dataset information:
  • Time steps: 34272

  • Nodes: 207

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 8.11%

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.
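
When series like these are fed to a forecasting model, they are sliced into fixed-length input windows and forecasting horizons (the DCRNN benchmark uses 12 steps of each). A plain numpy sketch of that slicing, not tsl's SpatioTemporalDataset implementation:

```python
import numpy as np

def sliding_windows(x, window=12, horizon=12):
    """Slice a (time, nodes) array into (input window, forecasting horizon)
    pairs, one per admissible starting step."""
    inputs, targets = [], []
    for t in range(len(x) - window - horizon + 1):
        inputs.append(x[t:t + window])
        targets.append(x[t + window:t + window + horizon])
    return np.stack(inputs), np.stack(targets)

x = np.random.rand(100, 207)  # e.g. 100 steps from MetrLA's 207 sensors
xs, ys = sliding_windows(x)
# xs.shape == (77, 12, 207), ys.shape == (77, 12, 207)
```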

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build() None[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(impute_zeros=True)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class PemsBay(*args, **kwargs)[source]#

The dataset contains 6 months of traffic readings, from 01/01/2017 to 05/31/2017, collected every 5 minutes by 325 traffic sensors in the San Francisco Bay Area.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). A benchmark dataset for traffic forecasting as described in “Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting”.

Dataset information:
  • Time steps: 52128

  • Nodes: 325

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 0.02%

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build() None[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(mask_zeros: bool = True)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.
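
The mask_zeros option marks zero readings as invalid (in traffic data, a zero speed usually indicates a faulty or offline sensor rather than a real measurement). A sketch of the masking logic, under the assumption that zeros encode missing values:

```python
import numpy as np

def mask_zero_values(values):
    """Return a boolean mask that is True where data is valid (nonzero)."""
    return values != 0

speeds = np.array([[65.0, 0.0],
                   [62.5, 58.0]])
mask = mask_zero_values(speeds)
# mask == [[True, False], [True, True]]; 3 of 4 readings are valid
```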

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class ElectricityBenchmark(*args, **kwargs)[source]#

Electricity consumption (in kWh) measured hourly by 321 sensors from 2012 to 2014.

Imported from https://github.com/laiguokun/multivariate-time-series-data. The original dataset records values in kW for 370 nodes starting from 2011; some of the nodes have missing values before 2012. For the original dataset, refer to Elergone.
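
The reduction from 370 to 321 nodes amounts to dropping series that are incomplete at the start of the period. A toy pandas sketch of that kind of column filtering; the selection criterion here (nonzero first reading) is purely illustrative, not the benchmark's actual rule.

```python
import pandas as pd

idx = pd.date_range("2011-01-01", periods=4, freq="h")
df = pd.DataFrame({"MT_001": [1.0, 2.0, 1.5, 2.5],
                   "MT_002": [0.0, 0.0, 0.0, 1.0]}, index=idx)

# Keep only clients already active at the first time step (illustrative).
complete = df.loc[:, df.iloc[0] != 0]
# complete retains only column "MT_001"
```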

Dataset information:
  • Time steps: 26304

  • Nodes: 321

  • Channels: 1

  • Sampling rate: 1 hour

  • Missing values: 1.09%

similarity_options: Optional[Set] = None#
property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class TrafficBenchmark(*args, **kwargs)[source]#

A collection of hourly road occupancy rates (between 0 and 1) measured over 24 months (2015-2016) by 862 sensors on San Francisco Bay Area freeways.

Imported from https://github.com/laiguokun/multivariate-time-series-data, raw data at California Department of Transportation.

Dataset information:
  • Time steps: 17544

  • Nodes: 862

  • Channels: 1

  • Sampling rate: 1 hour

  • Missing values: 0.90%

similarity_options: Optional[Set] = None#
property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class SolarBenchmark(*args, **kwargs)[source]#

Solar power production records from the year 2006, sampled every 10 minutes at 137 synthetic PV farms in Alabama. The mask marks the 55.10% of the data corresponding to daylight hours with nonzero power production.

Imported from https://github.com/laiguokun/multivariate-time-series-data, raw data at https://www.nrel.gov/grid/solar-power-data.html.

Dataset information:
  • Time steps: 52560

  • Nodes: 137

  • Channels: 1

  • Sampling rate: 10 minutes

  • Missing values: 0.00%

similarity_options: Optional[Set] = None#
property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class ExchangeBenchmark(*args, **kwargs)[source]#

Daily exchange rates of eight countries (Australia, the United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016.

Imported from https://github.com/laiguokun/multivariate-time-series-data.

Dataset information:
  • Time steps: 7588

  • Nodes: 8

  • Channels: 1

  • Sampling rate: 1 day

  • Missing values: 0.00%

similarity_options: Optional[Set] = None#
property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class GaussianNoiseSyntheticDataset(*args, **kwargs)[source]#

A generator of synthetic datasets from an input model and input graph.

The input model must be implemented as a torch model and must return the observation at the next step and (optionally) the hidden state for the next step. Gaussian noise will be added to the output of the model at each step.

Parameters:
  • num_features (int) – Number of features in the generated dataset.

  • num_nodes (int) – Number of nodes in the graph.

  • num_steps (int) – Number of steps to generate.

  • connectivity (SparseTensArray) – Connectivity of the underlying graph.

  • model (nn.Module, optional) – Model used to generate data. If None, it will attempt to create the model from model_class and model_kwargs.

  • model_class (optional) – Class of the model used to generate the data.

  • model_kwargs (optional) – Keyword arguments to initialize the model.

  • sigma_noise (float) – Standard deviation of the noise.

  • name (str, optional) – Name for the generated dataset.

  • seed (int, optional) – Seed for the random number generator.
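
Conceptually, generation is an autoregressive rollout: the model maps the current observation to the next one, and Gaussian noise is added at each step. A numpy sketch with a hypothetical stand-in model (the actual class expects a torch model, which may also return a hidden state):

```python
import numpy as np

def generate(model, x0, num_steps, sigma_noise=0.2, seed=0):
    """Roll a one-step model forward from x0, adding N(0, sigma_noise^2)
    noise to each generated observation."""
    rng = np.random.default_rng(seed)
    x, out = x0, []
    for _ in range(num_steps):
        x = model(x) + rng.normal(0.0, sigma_noise, size=x.shape)
        out.append(x)
    return np.stack(out)

# Hypothetical stand-in model: simple damping of the previous observation.
damping = lambda x: 0.9 * x
data = generate(damping, x0=np.ones((5, 1)), num_steps=100)
# data.shape == (100, 5, 1): 100 steps, 5 nodes, 1 channel
```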

load_raw(*args, **kwargs)[source]#

Loads raw dataset without any data preprocessing.

property mae_optimal_model#

Expected MAE of the optimal model, i.e., \(\mathbb{E}[|X|]\) for the Gaussian noise \(X\).
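
Since the target is the model output plus zero-mean Gaussian noise, no predictor can beat \(\mathbb{E}[|X|] = \sigma\sqrt{2/\pi}\) in MAE, for \(X \sim \mathcal{N}(0, \sigma^2)\). A quick Monte Carlo check of that closed form:

```python
import numpy as np

sigma = 0.5
theoretical = sigma * np.sqrt(2.0 / np.pi)  # E[|X|] for X ~ N(0, sigma^2)

rng = np.random.default_rng(0)
empirical = np.abs(rng.normal(0.0, sigma, size=1_000_000)).mean()
# empirical ≈ theoretical ≈ 0.3989
```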

generate_data(seed=None)[source]#

Generates the synthetic data, using seed (if provided) for the random number generator.

get_connectivity(layout: str = 'edge_index')[source]#

Returns the weighted adjacency matrix \(\mathbf{A} \in \mathbb{R}^{N \times N}\), where \(N=\)self.n_nodes. The element \(a_{i,j} \in \mathbf{A}\) is 0 if there is no edge connecting node \(i\) to node \(j\). The return type depends on the specified layout (default: edge_index).

Parameters:
  • method (str, optional) – Method for the similarity computation. If None, defaults to dataset-specific default method. (default: None)

  • threshold (float, optional) – If not None, set to 0 the values below the threshold. (default: None)

  • knn (int, optional) – If not None, keep only \(k=\) knn nearest incoming neighbors. (default: None)

  • binary_weights (bool) – If True, the positive weights of the adjacency matrix are set to 1. (default: False)

  • include_self (bool) – If False, self-loops are never taken into account. (default: True)

  • force_symmetric (bool) – Force adjacency matrix to be symmetric by taking the maximum value between the two directions for each edge. (default: False)

  • normalize_axis (int, optional) – Divide edge weight \(a_{i, j}\) by \(\sum_k a_{i, k}\), if normalize_axis=0 or \(\sum_k a_{k, j}\), if normalize_axis=1. None for no normalization. (default: None)

  • layout (str) –

    Convert matrix to a dense/sparse format. Available options are:

    • dense: keep matrix dense \(\mathbf{A} \in \mathbb{R}^{N \times N}\).

    • edge_index: convert to (edge_index, edge_weight) tuple, where edge_index has shape \([2, E]\) and edge_weight has shape \([E]\), being \(E\) the number of edges.

    • coo/csr/csc: convert to specified scipy sparse matrix type.

    (default: edge_index)

  • **kwargs (optional) – Additional optional keyword arguments for similarity computation.

Returns:

The adjacency matrix in the specified layout.
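
The sparsification and layout options above can be sketched on a dense matrix with numpy; this is an illustrative re-implementation of the threshold, knn, and edge_index options, not tsl's actual code.

```python
import numpy as np

def to_edge_index(adj, threshold=None, knn=None):
    """Sparsify a dense adjacency matrix and convert it to an
    (edge_index, edge_weight) pair."""
    a = adj.copy()
    if threshold is not None:
        a[a < threshold] = 0.0            # drop edges below the threshold
    if knn is not None:                   # keep k strongest incoming edges per node
        for j in range(a.shape[1]):
            col = a[:, j]
            keep = np.argsort(col)[-knn:]
            mask = np.zeros_like(col, dtype=bool)
            mask[keep] = True
            col[~mask] = 0.0
    rows, cols = np.nonzero(a)
    edge_index = np.stack([rows, cols])   # shape [2, E]
    edge_weight = a[rows, cols]           # shape [E]
    return edge_index, edge_weight

adj = np.array([[0.0, 0.9],
                [0.2, 0.0]])
ei, ew = to_edge_index(adj, threshold=0.5)
# only the edge 0 -> 1 (weight 0.9) survives the threshold
```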