Organizing data

DataModule

  • SpatioTemporalDataModule – Base LightningDataModule for SpatioTemporalDataset.

class SpatioTemporalDataModule(dataset: SpatioTemporalDataset, scalers: Optional[Mapping] = None, mask_scaling: bool = True, splitter: Optional[Splitter] = None, batch_size: int = 32, workers: int = 0, pin_memory: bool = False)

Base LightningDataModule for SpatioTemporalDataset.

Parameters:
  • dataset (SpatioTemporalDataset) – The complete dataset.

  • scalers (dict, optional) – Mapping from names to Scaler objects used to rescale the data after splitting. Each scaler receives as input the dataset attribute whose name matches its key (e.g. the scaler stored under 'target' is applied to dataset.target). If None, no scaling is performed. (default None)

  • mask_scaling (bool) – If True, then compute statistics for dataset.target scaler (if any) by considering only valid values (according to dataset.mask). (default True)

  • splitter (Splitter, optional) – The Splitter to be used for splitting dataset into train/validation/test sets. (default None)

  • batch_size (int) – Size of the mini-batches for the dataloaders. (default 32)

  • workers (int) – Number of workers to use in the dataloaders. (default 0)

  • pin_memory (bool) – If True, then use pinned (page-locked) host memory in train_dataloader(), which speeds up host-to-GPU transfers. (default False)
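The effect of mask_scaling can be pictured with a small plain-Python sketch. It is not the library's implementation: the helper name masked_standard_stats and the toy data are hypothetical, and it only illustrates how statistics computed over valid values differ from statistics computed over every entry.

```python
from statistics import fmean, pstdev


def masked_standard_stats(values, mask):
    """Return (mean, std) computed only over entries whose mask is truthy."""
    valid = [v for v, m in zip(values, mask) if m]
    if not valid:
        raise ValueError("mask selects no valid values")
    return fmean(valid), pstdev(valid)


# Hypothetical target series where 0.0 is a placeholder for missing readings.
target = [10.0, 12.0, 0.0, 14.0, 0.0]
mask = [True, True, False, True, False]

# Statistics restricted to valid values (the mask_scaling=True idea).
mean_masked, std_masked = masked_standard_stats(target, mask)
# Statistics over every entry, placeholders included.
mean_all, std_all = masked_standard_stats(target, [True] * len(target))
```

In this toy series the placeholders pull the unmasked mean well below the mean of the valid readings, which is the distortion that masking is meant to avoid.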

setup(stage: Optional[Literal['fit', 'validate', 'test', 'predict']] = None)

Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.

Parameters:
  • stage (str, optional) – Either 'fit', 'validate', 'test', or 'predict'. (default None)

Example:

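import pytorch_lightning as pl
from torch import nn


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.l1 = None

    def prepare_data(self):
        # download_data() and tokenize() stand for user-defined helpers
        download_data()
        tokenize()

        # don't do this: prepare_data() runs on a single process,
        # so state assigned here is missing on the other processes
        self.something = ...

    def setup(self, stage):
        # load_data() stands for a user-defined helper
        data = load_data(...)
        self.l1 = nn.Linear(28, data.num_classes)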

Splitters

  • Splitter – Base class for splitter module.

  • CustomSplitter – Create a Splitter using custom validation and test sets splitting functions.

  • TemporalSplitter – Split the data sequentially with specified lengths.

  • AtTimeStepSplitter – Split the data at given time steps (only for SpatioTemporalDataset with DatetimeIndex index).

class Splitter(*args, **kwargs)

Base class for splitter module.

class CustomSplitter(*args, **kwargs)

Create a Splitter using custom validation and test sets splitting functions.

class TemporalSplitter(*args, **kwargs)

Split the data sequentially with specified lengths.

Parameters:
  • val_len (int or float) – Length of the validation set.

  • test_len (int or float) – Length of the test set.

  • offset (str) –

    How to size the offset separating the splits so that samples do not leak across sets.

    • 'window': separate splits by dataset.samples_offset positions, so their lookback windows just touch. This avoids leakage (no target step shared across splits) as long as the horizon is short enough relative to the window.

    • 'sample': separate splits by ceil(sample_span / stride) positions, so that adjacent splits share no time step in any role, for any window/horizon/delay/stride.

    (default: 'window')
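A sequential split with an offset can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, lengths are given as integer sample counts, the test set is assumed to take the last positions, and "separated by offset positions" is read as a difference of offset between the last position of one split and the first position of the next.

```python
import math


def sample_offset(sample_span, stride):
    """Offset for the 'sample' policy: ceil(sample_span / stride)."""
    return math.ceil(sample_span / stride)


def sequential_split(n_samples, val_len, test_len, offset):
    """Split positions 0..n_samples-1 into train/val/test, in that order.

    Assumption of this sketch: the first position of a split and the last
    position of the previous one differ by exactly `offset`.
    """
    test_start = n_samples - test_len
    val_last = test_start - offset      # last validation position
    val_start = val_last - val_len + 1
    train_last = val_start - offset     # last training position
    if train_last < 0:
        raise ValueError("not enough samples for the requested split")
    return (range(0, train_last + 1),
            range(val_start, val_last + 1),
            range(test_start, n_samples))


# Hypothetical numbers: 100 samples, each spanning 36 time steps, stride 12.
gap = sample_offset(sample_span=36, stride=12)
train, val, test = sequential_split(100, val_len=10, test_len=20, offset=gap)
```

With these numbers the positions between consecutive splits are left unused, which is the price paid for keeping the sets apart.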

class AtTimeStepSplitter(*args, **kwargs)

Split the data at given time steps (only for SpatioTemporalDataset with DatetimeIndex index).

Each split is defined by a (first_ts, last_ts) timestamp range, following the chronological order train -> val -> test. A split is active when at least one of its bounds is given (training is always active); the remaining bounds are then inferred:

  • A missing inner boundary is placed min_offset positions away from the adjacent split (e.g. if last_val_ts is omitted, validation ends just before the test range; if first_test_ts is omitted, the test range starts just after validation).

  • A missing outer boundary falls back to the edge of the series: training defaults to start at the beginning, and the latest split extends to the end.

  • A held-out (validation or test) split with no bounds at all is left empty.
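The inference rules above can be sketched with integer positions standing in for timestamps. The function resolve_splits and its arguments are hypothetical and simplified: k plays the role of min_offset already converted to positions, and an inner boundary is only inferred when the facing bound of the neighbouring split is known (the sketch raises otherwise, which may differ from the library's behaviour).

```python
def resolve_splits(n, k, train=(None, None), val=(None, None),
                   test=(None, None)):
    """Resolve inclusive (first, last) positions for each split.

    n is the number of positions in the series, k the required separation.
    """
    bounds = {"train": list(train), "val": list(val), "test": list(test)}
    # Training is always active; held-out splits need at least one bound.
    active = ["train"] + [s for s in ("val", "test")
                          if any(b is not None for b in bounds[s])]
    # Outer boundaries fall back to the edges of the series.
    if bounds["train"][0] is None:
        bounds["train"][0] = 0
    if bounds[active[-1]][1] is None:
        bounds[active[-1]][1] = n - 1
    # Inner boundaries sit k positions away from the adjacent split.
    for earlier, later in zip(active, active[1:]):
        last, first = bounds[earlier][1], bounds[later][0]
        if last is None and first is None:
            raise ValueError(f"cannot infer the {earlier}/{later} boundary")
        if last is None:
            bounds[earlier][1] = first - k
        elif first is None:
            bounds[later][0] = last + k
    # Inactive held-out splits are reported as None (left empty).
    return {s: tuple(bounds[s]) if s in active else None
            for s in ("train", "val", "test")}


# Only the start of validation and the start of test are given.
splits = resolve_splits(n=100, k=3, val=(60, None), test=(80, None))
# No validation bounds at all: validation is left empty.
no_val = resolve_splits(n=100, k=3, test=(80, None))
```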

Splits are kept separated so they do not leak across each other. The separation is controlled by min_offset, using the same vocabulary as TemporalSplitter:

  • 'sample': separate the closest splits by at least ceil(sample_span / stride) positions, so that they share no time step in any role (input window or prediction horizon), for any window/horizon/delay/stride.

  • 'window': separate the closest splits by at least samples_offset positions, so their lookback windows just touch. This avoids leakage (no target step shared across splits) as long as the horizon is short enough relative to the window; otherwise an error is raised.

(default: 'sample')
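The 'sample' bound can be checked by brute force. The sketch below assumes a layout that the text does not spell out: the sample at position p covers the sample_span consecutive time steps starting at p * stride. Under that assumption, the smallest position difference at which two samples share no time step coincides with ceil(sample_span / stride). The helper names are hypothetical.

```python
import math


def time_steps(position, stride, sample_span):
    """Time steps touched by the sample at `position` (assumed layout)."""
    start = position * stride
    return set(range(start, start + sample_span))


def min_safe_gap(stride, sample_span, max_gap=1000):
    """Smallest position difference at which two samples are disjoint."""
    base = time_steps(0, stride, sample_span)
    for gap in range(1, max_gap + 1):
        if base.isdisjoint(time_steps(gap, stride, sample_span)):
            return gap
    raise ValueError("no disjoint gap found within max_gap")


# Compare the brute-force gap with the closed-form bound for a few settings.
settings = [(1, 36), (5, 36), (12, 36), (7, 3)]
checks = {(stride, span): (min_safe_gap(stride, span),
                           math.ceil(span / stride))
          for stride, span in settings}
```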

After resolving the ranges, the splits are checked and a ValueError is raised if any two are closer than min_offset (e.g. when explicit boundaries would make the splits overlap).
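A final check of this kind could look like the sketch below. It is only an illustration: check_separation is a hypothetical helper, positions are plain integers, and the distance between two splits is taken as the difference between the first position of the later split and the last position of the earlier one.

```python
def check_separation(ranges, min_gap):
    """Raise ValueError when two consecutive splits are closer than min_gap.

    ranges: chronologically ordered (name, first, last) tuples with
    inclusive integer positions; empty splits are simply left out.
    """
    for (name_a, _, last_a), (name_b, first_b, _) in zip(ranges, ranges[1:]):
        if first_b - last_a < min_gap:
            raise ValueError(
                f"{name_a} and {name_b} are {first_b - last_a} positions "
                f"apart, fewer than the required {min_gap}")


# Well-separated ranges pass silently.
ok = [("train", 0, 57), ("val", 60, 77), ("test", 80, 99)]
check_separation(ok, min_gap=3)

# Explicit boundaries that make training and validation overlap.
overlapping = [("train", 0, 65), ("val", 60, 77)]
try:
    check_separation(overlapping, min_gap=3)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```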

Parameters:
  • first_val_ts (optional) – Start of the validation range.

  • last_val_ts (optional) – End of the validation range.

  • first_test_ts (optional) – Start of the test range.

  • last_test_ts (optional) – End of the test range.

  • first_train_ts (optional) – Start of the training range. Defaults to the beginning of the series.

  • last_train_ts (optional) – End of the training range. Defaults to the latest position that keeps training separated from the held-out splits by min_offset.

  • min_offset (str) – Minimum separation between the closest splits, either 'sample' or 'window'. (default: 'sample')