Datasets in tsl#

AirQuality

Measurements of the pollutant \(PM2.5\) collected by 437 air quality monitoring stations spread across 43 Chinese cities from May 2014 to April 2015.

Elergone

Load profiles of 370 points collected every 15 minutes from 2011 to 2014.

MetrLA

Traffic readings collected from 207 loop detectors on highways in Los Angeles County, aggregated in 5-minute intervals over four months between March 2012 and June 2012.

PemsBay

The dataset contains 6 months of traffic readings from 01/01/2017 to 06/30/2017 collected every 5 minutes by 325 traffic sensors in the San Francisco Bay Area.

PeMS03

The dataset contains 3 months of traffic readings from 09/01/2018 to 11/30/2018 collected every 5 minutes by 358 traffic sensors.

PeMS04

The dataset contains 2 months of traffic readings from 01/01/2018 to 02/28/2018 collected every 5 minutes by 307 traffic sensors in the San Francisco Bay Area.

PeMS07

The dataset contains 4 months of traffic readings from 05/01/2017 to 08/31/2017 collected every 5 minutes by 883 traffic sensors.

PeMS08

The dataset contains 2 months of traffic readings from 07/01/2016 to 08/31/2016 collected every 5 minutes by 170 traffic sensors in San Bernardino.

LargeST

LargeST is a large-scale traffic forecasting dataset containing 5 years of traffic readings from 01/01/2017 to 12/31/2021 collected every 5 minutes by 8600 traffic sensors in California.

PvUS

Simulated solar power production from more than 5,000 photovoltaic plants in the US.

ElectricityBenchmark

Electricity consumption (in kWh) measured hourly by 321 sensors from 2012 to 2014.

TrafficBenchmark

A collection of hourly road occupancy rates (between 0 and 1) measured by 862 sensors for 24 months (2015-2016) on San Francisco Bay Area freeways.

SolarBenchmark

Solar power production records for the year 2006, sampled every 10 minutes from 137 synthetic PV farms in the state of Alabama.

ExchangeBenchmark

A collection of the daily exchange rates of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand and Singapore) from 1990 to 2016.

GaussianNoiseSyntheticDataset

A generator of synthetic datasets from an input model and input graph.

GPVARDataset

Generator for synthetic datasets from a graph polynomial VAR filter on triangular community graphs as shown in the paper "AZ-whiteness test: a test for uncorrelated noise on spatio-temporal graphs" (Zambon et al., NeurIPS 22).

GPVARDatasetAZ

GPVARDataset generated with the same configuration used in the paper "AZ-whiteness test: a test for uncorrelated noise on spatio-temporal graphs" (Zambon et al., NeurIPS 22).

class AirQuality(*args, **kwargs)[source]#

Measurements of the pollutant \(PM2.5\) collected by 437 air quality monitoring stations spread across 43 Chinese cities from May 2014 to April 2015.

The dataset also provides a smaller version, AirQuality(small=True), restricted to the 36 sensors in Beijing.

Data were collected as part of the Urban Air project.

Dataset size:
  • Time steps: 8760

  • Nodes: 437

  • Channels: 1

  • Sampling rate: 1 hour

  • Missing values: 25.67%

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names: List[str]#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names: List[str]#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download()[source]#

Downloads dataset’s files to the self.root_dir folder.

build()[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(impute_nans=True)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

get_splitter(method: Optional[str] = None, **kwargs)[source]#

Returns the splitter for a SpatioTemporalDataset. A Splitter provides the splits of the dataset – in terms of indices – for cross validation.
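To make "splits in terms of indices" concrete, the sketch below divides a range of time-step indices into contiguous train, validation and test blocks. The function `temporal_split` and its `val_len` and `test_len` fractions are hypothetical names used only for illustration; the splitter actually returned by get_splitter may behave differently.

```python
def temporal_split(n_steps, val_len=0.1, test_len=0.2):
    """Split indices 0..n_steps-1 into contiguous train/val/test blocks."""
    n_test = int(n_steps * test_len)
    n_val = int(n_steps * val_len)
    n_train = n_steps - n_val - n_test
    indices = list(range(n_steps))
    return {
        "train": indices[:n_train],
        "val": indices[n_train:n_train + n_val],
        "test": indices[n_train + n_val:],
    }

# 8760 hourly steps, the size listed for AirQuality.
splits = temporal_split(8760)
```

Keeping the blocks contiguous and ordered in time avoids evaluating on samples that precede the training data.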

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray
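One common way to turn a pairwise distance matrix such as dist into a similarity matrix is a Gaussian kernel. The sketch below shows that idea on plain Python lists; the function name and the scale parameter `theta` are assumptions for illustration, and this is not necessarily the computation that compute_similarity performs for any given method.

```python
import math

def gaussian_kernel_similarity(dist, theta):
    """Map an N x N distance matrix to similarities exp(-(d / theta)^2)."""
    n = len(dist)
    return [[math.exp(-(dist[i][j] / theta) ** 2) for j in range(n)]
            for i in range(n)]

# Toy symmetric distance matrix for three nodes.
dist = [[0.0, 1.0, 4.0],
        [1.0, 0.0, 2.0],
        [4.0, 2.0, 0.0]]
sim = gaussian_kernel_similarity(dist, theta=2.0)
```

Closer nodes receive similarities near 1 and distant nodes decay towards 0; a threshold can then be applied to obtain a sparse graph.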

class Elergone(*args, **kwargs)[source]#

Load profiles of 370 points collected every 15 minutes from 2011 to 2014.

Raw data at https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014. The load method loads the values in kWh, computes the mask for the zero values and pads the missing steps.

From the original description:

Values in the original dataframe are in kW of each 15 min. To convert values in kWh values must be divided by 4. Each column represent one client. Some clients were created after 2011. In these cases consumption were considered zero. All time labels report to Portuguese hour. However, all days present 96 measures (24*4). Every year in March time change day (which has only 23 hours) the values between 1:00 am and 2:00 am are zero for all points. Every year in October time change day (which has 25 hours) the values between 1:00 am and 2:00 am aggregate the consumption of two hours.
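The conversion described above (kW averaged over 15 minutes to kWh) and the zero-value mask can be sketched as follows. The helper names are hypothetical and the snippet works on a plain list rather than on the dataframe used by the library.

```python
def kw_to_kwh(readings_kw, minutes_per_step=15):
    """Convert average power (kW) over each step into energy (kWh)."""
    hours_per_step = minutes_per_step / 60  # 15 min -> 0.25 h, i.e. divide by 4
    return [kw * hours_per_step for kw in readings_kw]

def zero_mask(values):
    """True where a reading is non-zero, False where it is zero."""
    return [v != 0 for v in values]

load_kw = [4.0, 0.0, 2.0, 8.0]
load_kwh = kw_to_kwh(load_kw)   # [1.0, 0.0, 0.5, 2.0]
mask = zero_mask(load_kwh)      # [True, False, True, True]
```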

Dataset size:
  • Time steps: 140256

  • Nodes: 370

  • Channels: 1

  • Sampling rate: 15 minutes

  • Missing values: 20.15%

Parameters:
  • root – Root folder for data download.

  • freq – Resampling frequency.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build()[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw() DataFrame[source]#

Loads raw dataset without any data preprocessing.

load()[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, gamma=10, trainlen=None, **kwargs) Optional[ndarray][source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class MetrLA(*args, **kwargs)[source]#

Traffic readings collected from 207 loop detectors on highways in Los Angeles County, aggregated in 5-minute intervals over four months between March 2012 and June 2012.

A benchmark dataset for traffic forecasting as described in “Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting”.

Dataset information:
  • Time steps: 34272

  • Nodes: 207

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 8.11%

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build() None[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(impute_zeros=True)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class PemsBay(*args, **kwargs)[source]#

The dataset contains 6 months of traffic readings from 01/01/2017 to 06/30/2017 collected every 5 minutes by 325 traffic sensors in the San Francisco Bay Area.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). A benchmark dataset for traffic forecasting as described in “Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting”.

Dataset information:
  • Time steps: 52128

  • Nodes: 325

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 0.02%

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

build() None[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(mask_zeros: bool = True)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class PeMS03(*args, **kwargs)[source]#

The dataset contains 3 months of traffic readings from 09/01/2018 to 11/30/2018 collected every 5 minutes by 358 traffic sensors.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). A benchmark dataset for traffic forecasting as described in the paper “Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting” (Guo et al., 2021).

Dataset information:
  • Time steps: 26208

  • Nodes: 358

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 0% (already imputed in the dataset)

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.
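The time-step count listed above follows from the date range and the sampling rate. A minimal check, assuming both the first and the last day are fully covered (the helper `expected_steps` is hypothetical):

```python
from datetime import date

def expected_steps(first_day, last_day, minutes_per_step):
    """Number of samples between two dates, both days included."""
    n_days = (last_day - first_day).days + 1
    return n_days * (24 * 60 // minutes_per_step)

# 09/01/2018 to 11/30/2018 is 91 days; at 5 minutes that is 91 * 288 steps.
pems03_steps = expected_steps(date(2018, 9, 1), date(2018, 11, 30), 5)  # 26208
```

The same arithmetic can be applied to the other datasets on this page to sanity-check a loaded array's first dimension.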

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

class PeMS04(*args, **kwargs)[source]#

The dataset contains 2 months of traffic readings from 01/01/2018 to 02/28/2018 collected every 5 minutes by 307 traffic sensors in the San Francisco Bay Area.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). A benchmark dataset for traffic forecasting as described in the paper “Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting” (Guo et al., 2021).

The target variable is the total flow (number of detected vehicles).

Dataset information:
  • Time steps: 16992

  • Nodes: 307

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 0% (already imputed in the dataset)

Covariates:
  • occupancy: \(T \times N \times 1\) Time series of lane occupancy.

  • speed: \(T \times N \times 1\) Time series of the average speed of the detected vehicles.

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

class PeMS07(*args, **kwargs)[source]#

The dataset contains 4 months of traffic readings from 05/01/2017 to 08/31/2017 collected every 5 minutes by 883 traffic sensors.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). A benchmark dataset for traffic forecasting as described in the paper “Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting” (Guo et al., 2021).

Dataset information:
  • Time steps: 28224

  • Nodes: 883

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 0% (already imputed in the dataset)

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

class PeMS08(*args, **kwargs)[source]#

The dataset contains 2 months of traffic readings from 07/01/2016 to 08/31/2016 collected every 5 minutes by 170 traffic sensors in San Bernardino.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). A benchmark dataset for traffic forecasting as described in the paper “Learning Dynamics and Heterogeneity of Spatial-Temporal Graph Data for Traffic Forecasting” (Guo et al., 2021).

Dataset information:
  • Time steps: 17856

  • Nodes: 170

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 0% (already imputed in the dataset)

Covariates:
  • occupancy: \(T \times N \times 1\) Time series of lane occupancy.

  • speed: \(T \times N \times 1\) Time series of the average speed of the detected vehicles.

Static attributes:
  • dist: \(N \times N\) matrix of node pairwise distances.

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

class LargeST(*args, **kwargs)[source]#

LargeST is a large-scale traffic forecasting dataset containing 5 years of traffic readings from 01/01/2017 to 12/31/2021 collected every 5 minutes by 8600 traffic sensors in California.

Given the large number of sensors in the dataset, there are 3 subsets of sensors that can be selected:

  • GLA (Greater Los Angeles)
    • Nodes: 3834

    • Edges: 98703

    • District: 7, 8, 12

  • GBA (Greater Bay Area)
    • Nodes: 2352

    • Edges: 61246

    • District: 4

  • SD (San Diego)
    • Nodes: 716

    • Edges: 17319

    • District: 11

By default, the full dataset CA is loaded, covering the whole of California.

The measurements are provided by California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). Introduced in the paper “LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting” (Liu et al., 2023), where only readings from 2019 are considered, aggregated into 15-minute intervals.
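Going from 5-minute readings to 15-minute intervals amounts to combining every three consecutive samples. The sketch below averages each block and ignores missing entries; the use of the mean, the `None` placeholder for missing values, and the function name are assumptions, since the text above only states that readings are aggregated.

```python
def downsample_mean(values, factor=3):
    """Average consecutive blocks of `factor` samples, skipping None."""
    out = []
    for start in range(0, len(values) - factor + 1, factor):
        block = [v for v in values[start:start + factor] if v is not None]
        out.append(sum(block) / len(block) if block else None)
    return out

five_min = [60.0, 63.0, 66.0, 30.0, None, 34.0]
fifteen_min = downsample_mean(five_min)  # [63.0, 32.0]
```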

Dataset information:
  • Time steps: 525888

  • Nodes: 8600

  • Edges: 201363

  • Channels: 1

  • Sampling rate: 5 minutes

  • Missing values: 1.51%

Static attributes:
  • metadata: storing for each node:
    • lat: latitude of the sensor;

    • lon: longitude of the sensor;

    • district: California’s district where sensor is located (one of 3, 4, 5, 6, 7, 8, 10, 11, 12);

    • county: California’s county where sensor is located;

    • fwy_id: id of highway where a sensor is located;

    • n_lanes: the number of lanes at the sensor location (max 8);

    • direction: direction of the highway measured by the sensor (one of N, S, E, W).

  • adj: weighted adjacency matrix \(\mathbf{A} \in \mathbb{R}^{N \times N}\) built using road distances.

Parameters:
  • root (str, optional) – The root directory where data will be downloaded and stored. If None, then defaults to storage folder inside tsl’s root directory. (default: None)

  • subset (str) – The subset to be loaded. Must be one of "CA", "GLA", "GBA", "SD". (default: "CA")

  • year (int or list) – The year(s) to be loaded. Must be a year, or a list of years, between 2017 and 2021. Note that raw data are divided by year and only the requested years are downloaded. (default: 2019)

  • imputation_mode (str, optional) – How to impute missing values. If "nearest", then use nearest observation; if "zero", fill missing values with 0; if None, do not impute (leave nan). (default: "zero")

  • freq (str) – The sampling rate used for resampling (e.g., "15T" for resampling to 15-minute intervals). (default: "15T")

  • precision (int or str) – The float precision of the dataset. (default: 32)
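The three imputation_mode options described above can be illustrated on a one-dimensional series. This is a sketch under stated assumptions: missing values are represented by `None` instead of nan, ties in "nearest" go to the earlier observation, and the function `impute` is hypothetical rather than part of the library.

```python
def impute(values, mode="zero"):
    """Fill None entries: 'zero' -> 0.0, 'nearest' -> closest observation, None -> unchanged."""
    if mode is None:
        return list(values)
    if mode == "zero":
        return [0.0 if v is None else v for v in values]
    if mode == "nearest":
        observed = [i for i, v in enumerate(values) if v is not None]
        if not observed:
            return list(values)
        # For each position, copy the value of the closest observed index.
        return [values[min(observed, key=lambda j: abs(j - i))]
                for i in range(len(values))]
    raise ValueError(f"unknown imputation mode: {mode!r}")

series = [10.0, None, None, 40.0, None]
zero_filled = impute(series, "zero")        # [10.0, 0.0, 0.0, 40.0, 0.0]
nearest_filled = impute(series, "nearest")  # [10.0, 10.0, 40.0, 40.0, 40.0]
```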

property raw_file_names: Dict[str, str]#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load()[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class PvUS(*args, **kwargs)[source]#

Simulated solar power production from more than 5,000 photovoltaic plants in the US.

Data are provided by National Renewable Energy Laboratory (NREL)’s Solar Power Data for Integration Studies. Original raw data consist of 1 year (2006) of 5-minute solar power (in MW) for approximately 5,000 synthetic PV plants in the United States.

Preprocessed data are resampled to 10-minute intervals by averaging. The entire dataset contains 5016 plants, divided into two macro zones (east and west). The “east” zone contains 4084 plants and the “west” zone 1082 plants. Some states appear in both zones, with plants at the same geographical position. When loading the entire dataset, duplicated plants in the “east” zone are dropped.
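The de-duplication described above can be sketched as a merge keyed on geographical position. The record layout (`id`, `lat`, `lon`) and the function `merge_zones` are hypothetical; only the rule "drop the east copy of a duplicated plant" and the plant counts come from the description.

```python
def merge_zones(east, west):
    """Merge plant lists keyed by (lat, lon), dropping east duplicates."""
    west_positions = {(p["lat"], p["lon"]) for p in west}
    east_unique = [p for p in east
                   if (p["lat"], p["lon"]) not in west_positions]
    return east_unique + west

east = [{"id": "e1", "lat": 33.0, "lon": -86.0},
        {"id": "e2", "lat": 35.0, "lon": -101.0}]
west = [{"id": "w1", "lat": 35.0, "lon": -101.0},
        {"id": "w2", "lat": 37.0, "lon": -120.0}]
merged = merge_zones(east, west)  # e2 shares its position with w1 and is dropped

# The counts quoted above imply the number of duplicated plants.
n_duplicates = 4084 + 1082 - 5016  # 150
```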

Dataset size:
  • Time steps: 52560

  • Nodes:

    • Full graph: 5016

    • East only: 4084

    • West only: 1082

  • Channels: 1

  • Sampling rate: 10 minutes

  • Missing values: 0.00%

Parameters:
  • zones (Union[str, List], optional) – The US zones to include in the dataset. Can be "east", "west", or a list of both. If None, then the full dataset is loaded. (default: None)

  • mask_zeros (bool, optional) – If True, then zero values (corresponding to night hours) are masked out. (default: False)

  • root (str, optional) – The root directory for the data. (default: None)

  • freq (str, optional) – The data sampling rate for resampling. (default: None)

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

download() None[source]#

Downloads dataset’s files to the self.root_dir folder.

load_raw()[source]#

Loads raw dataset without any data preprocessing.

load(mask_zeros)[source]#

Loads the raw dataset and preprocesses the data. Defaults to load_raw.

compute_similarity(method: str, theta: float = 150, **kwargs)[source]#

Implements the options for the similarity matrix \(\mathbf{S} \in \mathbb{R}^{N \times N}\) computation, according to method.

Parameters:
  • method (str) – Method for the similarity computation.

  • **kwargs (optional) – Additional optional keyword arguments.

Returns:

The similarity dense matrix.

Return type:

ndarray

class ElectricityBenchmark(*args, **kwargs)[source]#

Electricity consumption (in kWh) measured hourly by 321 sensors from 2012 to 2014.

Imported from https://github.com/laiguokun/multivariate-time-series-data. The original dataset records values in kW for 370 nodes starting from 2011, some of which have missing values before 2012. For the original dataset refer to Elergone.

Dataset information:
  • Time steps: 26304

  • Nodes: 321

  • Channels: 1

  • Sampling rate: 1 hour

  • Missing values: 1.09%

similarity_options: Optional[Set] = None#

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class TrafficBenchmark(*args, **kwargs)[source]#

A collection of hourly road occupancy rates (between 0 and 1) measured by 862 sensors for 24 months (2015-2016) on San Francisco Bay Area freeways.

Imported from https://github.com/laiguokun/multivariate-time-series-data, raw data at California Department of Transportation.

Dataset information:
  • Time steps: 17544

  • Nodes: 862

  • Channels: 1

  • Sampling rate: 1 hour

  • Missing values: 0.90%

similarity_options: Optional[Set] = None#

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class SolarBenchmark(*args, **kwargs)[source]#

Solar power production records for the year 2006, sampled every 10 minutes from 137 synthetic PV farms in the state of Alabama. The mask marks the 55.10% of the data that corresponds to daytime hours with nonzero power production.
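A mask of this kind can be derived directly from the values: entries with nonzero production are kept, the rest are masked out. The sketch below uses a toy series and a hypothetical helper name; it illustrates the idea rather than the library's own preprocessing.

```python
def production_mask(values):
    """True where power production is non-zero."""
    return [v > 0 for v in values]

# Toy day: zero at night, positive around midday.
day = [0.0, 0.0, 1.5, 3.2, 2.1, 0.0, 0.0, 0.0]
mask = production_mask(day)
valid_fraction = sum(mask) / len(mask)  # 3 / 8 = 0.375
```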

Imported from https://github.com/laiguokun/multivariate-time-series-data, raw data at https://www.nrel.gov/grid/solar-power-data.html.

Dataset information:
  • Time steps: 52560

  • Nodes: 137

  • Channels: 1

  • Sampling rate: 10 minutes

  • Missing values: 0.00%

similarity_options: Optional[Set] = None#

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class ExchangeBenchmark(*args, **kwargs)[source]#

A collection of the daily exchange rates of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand and Singapore) from 1990 to 2016.

Imported from https://github.com/laiguokun/multivariate-time-series-data.

Dataset information:
  • Time steps: 7588

  • Nodes: 8

  • Channels: 1

  • Sampling rate: 1 day

  • Missing values: 0.00%

similarity_options: Optional[Set] = None#

property raw_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip downloading.

class GaussianNoiseSyntheticDataset(*args, **kwargs)[source]#

A generator of synthetic datasets from an input model and input graph.

The input model must be implemented as a torch.nn.Module and must return the observation at the next step and (optionally) the hidden state for the next step. Gaussian noise will be added to the output of the model at each step.

Parameters:
  • num_features (int) – Number of features in the generated dataset.

  • num_nodes (int) – Number of nodes in the graph.

  • num_steps (int) – Number of steps to generate.

  • connectivity (SparseTensArray) – Connectivity of the underlying graph.

  • model (torch.nn.Module) – Model used to generate data. If None, the model is created from model_class and model_kwargs.

  • model_class (type, optional) – Class of the model used to generate the data. (default: None)

  • model_kwargs (dict, optional) – Keyword arguments needed to initialize the model. (default: None)

  • sigma_noise (float) – Standard deviation of the noise. (default: 0.2)

  • name (str, optional) – Name for the generated dataset. (default: None)

  • seed (int, optional) – Seed for the random number generator. (default: None)
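The generation procedure described above (a model producing the next observation and optionally a hidden state, with Gaussian noise added at each step) can be sketched without any deep learning dependency. Here a plain Python callable stands in for the torch.nn.Module, the graph connectivity is omitted, and feeding the noisy observation back into the model is an assumption; `generate` and `decay_model` are hypothetical names.

```python
import random

def generate(model, num_nodes, num_steps, sigma_noise=0.2, seed=None):
    """Roll out `model` autoregressively, adding Gaussian noise at each step."""
    rng = random.Random(seed)
    x = [0.0] * num_nodes   # initial observation, one value per node
    h = None                # optional hidden state
    series = []
    for _ in range(num_steps):
        out = model(x, h)
        # The model may return either x_next or the pair (x_next, h_next).
        x_next, h = out if isinstance(out, tuple) else (out, None)
        x = [v + rng.gauss(0.0, sigma_noise) for v in x_next]
        series.append(x)
    return series

def decay_model(x, h):
    """Toy stateless model: each node relaxes towards 1.0."""
    return [0.5 * v + 0.5 for v in x]

data = generate(decay_model, num_nodes=3, num_steps=100, seed=0)
```

Fixing `seed` makes the generated sequence reproducible, which mirrors the role of the seed parameter listed above.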

load_raw(*args, **kwargs)[source]#

Loads raw dataset without any data preprocessing.

property mae_optimal_model#

\(\mathbb{E}[|\mathbf{X}|]\) of a Gaussian \(\mathbf{X} \sim \mathcal{N}(0, \sigma^2)\), computed as \(\varepsilon = \sqrt{\frac{2}{\pi}}\sigma\).
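The closed form above can be checked numerically. The sketch below evaluates \(\sqrt{2/\pi}\,\sigma\) for the default sigma_noise of 0.2 and compares it with a Monte Carlo estimate; the function name `mae_optimal` and the sample size are illustrative choices.

```python
import math
import random

def mae_optimal(sigma):
    """Expected absolute value of N(0, sigma^2): sqrt(2 / pi) * sigma."""
    return math.sqrt(2.0 / math.pi) * sigma

sigma = 0.2
analytic = mae_optimal(sigma)  # about 0.1596

# Monte Carlo estimate of E[|X|] for comparison.
rng = random.Random(0)
n = 200_000
empirical = sum(abs(rng.gauss(0.0, sigma)) for _ in range(n)) / n
```

This value is the lowest mean absolute error any predictor can reach on data generated with that noise level, which is why it serves as a reference for model evaluation.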

class GPVARDataset(*args, **kwargs)[source]#

Generator for synthetic datasets from a graph polynomial VAR filter on triangular community graphs as shown in the paper “AZ-whiteness test: a test for uncorrelated noise on spatio-temporal graphs” (Zambon et al., NeurIPS 22).

Parameters:
  • num_communities (int) – Number of communities (triangles) in the graph.

  • num_steps (int) – Length of the generated sequence.

  • filter_params (iterable) – Parameters of the graph polynomial filter used to generate the dataset.

  • sigma_noise (float) – Standard deviation of the noise.

  • norm (str) – The normalization used for edges and edge weights. The available options are: 'gcn', 'asym' and 'none'. (default: 'none')

  • name (str, optional) – Name of the dataset.
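A graph polynomial VAR filter combines past observations through powers of the adjacency matrix. The sketch below implements one step of \(\mathbf{x}_t = \tanh\big(\sum_{l,q} \theta_{l,q} \mathbf{A}^l \mathbf{x}_{t-q}\big) + \eta_t\) on a single triangle. The tanh nonlinearity, the layout of `filter_params` (rows index the graph order \(l\), columns the lag \(q\)), and the use of one unnormalized triangle are assumptions made for illustration, not a statement of how GPVARDataset is implemented.

```python
import math
import random

def matvec(a, x):
    """Dense matrix-vector product on Python lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]

def gpvar_step(adj, history, filter_params, sigma_noise, rng):
    """One GPVAR step; history[-q] is x_{t-q}, filter_params[l][q-1] weights A^l x_{t-q}."""
    acc = [0.0] * len(adj)
    n_lags = len(filter_params[0])
    for q in range(1, n_lags + 1):
        h = history[-q]                  # A^0 x_{t-q}
        for l, row in enumerate(filter_params):
            if l > 0:
                h = matvec(adj, h)       # A^l x_{t-q}
            acc = [a + row[q - 1] * v for a, v in zip(acc, h)]
    return [math.tanh(a) + rng.gauss(0.0, sigma_noise) for a in acc]

# A single community: three fully connected nodes.
adj = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
theta = [[0.5, -0.2],   # order-0 terms for lags 1 and 2
         [0.1, 0.05]]   # order-1 terms for lags 1 and 2
rng = random.Random(0)
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(2)]
for _ in range(50):
    xs.append(gpvar_step(adj, xs, theta, sigma_noise=0.4, rng=rng))
```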

class GPVARDatasetAZ(*args, **kwargs)[source]#

GPVARDataset generated with the same configuration used in the paper “AZ-whiteness test: a test for uncorrelated noise on spatio-temporal graphs” (Zambon et al., NeurIPS 22).

Parameters:
  • root (str, optional) – Path to the directory to use for data storage. (default: None)

property required_file_names#

The name of the files in the self.root_dir folder that must be present in order to skip building.

build() None[source]#

Builds the dataset from raw data into the self.root_dir folder, if needed.

load_raw(*args, **kwargs)[source]#

Loads raw dataset without any data preprocessing.