Organizing data
DataModule
SpatioTemporalDataModule | Base LightningDataModule for SpatioTemporalDataset.
- class SpatioTemporalDataModule(dataset: SpatioTemporalDataset, scalers: Optional[Mapping] = None, mask_scaling: bool = True, splitter: Optional[Splitter] = None, batch_size: int = 32, workers: int = 0, pin_memory: bool = False)
Base LightningDataModule for SpatioTemporalDataset.
- Parameters:
dataset (SpatioTemporalDataset) – The complete dataset.
scalers (dict, optional) – Mapping from names to Scaler objects used to rescale the data after splitting. Each scaler receives as input the dataset attribute named by its key. If None, no scaling is performed. (default None)
mask_scaling (bool) – If True, compute the statistics of the dataset.target scaler (if any) from valid values only (according to dataset.mask). (default True)
splitter (Splitter, optional) – The Splitter used to split dataset into train/validation/test sets. (default None)
batch_size (int) – Size of the mini-batches for the dataloaders. (default 32)
workers (int) – Number of workers to use in the dataloaders. (default 0)
pin_memory (bool) – If True, enable pinned memory for train_dataloader(). (default False)
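The effect of mask_scaling can be illustrated with a minimal sketch in plain Python. The helper masked_mean_std below is hypothetical and not part of the library; it assumes only that the scaler's statistics are a mean and a standard deviation, and that the mask marks valid entries with True.

```python
import math


def masked_mean_std(values, mask):
    """Mean and standard deviation computed over valid entries only."""
    valid = [v for v, m in zip(values, mask) if m]
    if not valid:
        raise ValueError("mask excludes every value")
    mean = sum(valid) / len(valid)
    var = sum((v - mean) ** 2 for v in valid) / len(valid)
    return mean, math.sqrt(var)


# 0.0 is a placeholder for a missing reading, flagged as invalid in the mask.
values = [1.0, 2.0, 0.0, 3.0]
mask = [True, True, False, True]

mean, std = masked_mean_std(values, mask)
```

The placeholder at the masked position is ignored, so the mean is 2.0 rather than the 1.5 that all four values would give.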
- setup(stage: Optional[Literal['fit', 'validate', 'test', 'predict']] = None)
Called at the beginning of fit (train + validate), validate, test, or predict. This is a good hook when you need to build models dynamically or adjust something about them. This hook is called on every process when using DDP.
- Parameters:
stage – either 'fit', 'validate', 'test', or 'predict'
Example:
    class LitModel(...):
        def __init__(self):
            self.l1 = None

        def prepare_data(self):
            download_data()
            tokenize()

            # don't do this
            self.something = else

        def setup(self, stage):
            data = load_data(...)
            self.l1 = nn.Linear(28, data.num_classes)
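Splitters
Splitter | Base class for splitter module.
CustomSplitter | Create a Splitter using custom validation and test sets splitting functions.
TemporalSplitter | Split the data sequentially with specified lengths.
AtTimeStepSplitter | Split the data at given time steps (only for SpatioTemporalDataset with DatetimeIndex index).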
- class CustomSplitter(*args, **kwargs)
Create a Splitter using custom validation and test sets splitting functions.
- class TemporalSplitter(*args, **kwargs)
Split the data sequentially with specified lengths.
- Parameters:
offset (str) – How to size the offset separating the splits so that samples do not leak across sets.
  - 'window': separate splits by dataset.samples_offset positions, so their lookback windows just touch. This avoids leakage (no target step shared across splits) as long as the horizon is short enough relative to the window.
  - 'sample': separate splits by ceil(sample_span / stride) positions, so that adjacent splits share no time step in any role, for any window/horizon/delay/stride.
(default: 'window')
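The two offset modes can be compared with a small sketch in plain Python. None of the functions below belong to the library, and the sketch makes three assumptions: samples_offset equals ceil(window / stride), sample_span equals window + delay + horizon, and sample i starts at time step i * stride. It illustrates the idea and does not reproduce the actual implementation.

```python
import math


def split_offset(window, horizon, delay, stride, mode="window"):
    """Positions between the closest samples of two adjacent splits."""
    if mode == "window":
        # Assumption: stands in for dataset.samples_offset.
        return math.ceil(window / stride)
    if mode == "sample":
        # Assumption: a sample spans its window, the delay and its horizon.
        sample_span = window + delay + horizon
        return math.ceil(sample_span / stride)
    raise ValueError(f"unknown offset mode: {mode!r}")


def sequential_split(n_samples, val_len, test_len, offset):
    """Return (train, val, test) ranges of sample positions.

    The first sample of each split lies `offset` positions after the
    last sample of the previous split.
    """
    test_first = n_samples - test_len
    val_last = test_first - offset
    val_first = val_last - val_len + 1
    train_last = val_first - offset
    if train_last < 0:
        raise ValueError("not enough samples for the requested split")
    return (range(0, train_last + 1),
            range(val_first, val_last + 1),
            range(test_first, n_samples))


def time_steps(position, window, horizon, delay, stride):
    """Time steps touched by one sample, from input window to horizon."""
    start = position * stride
    return set(range(start, start + window + delay + horizon))


offset = split_offset(window=12, horizon=3, delay=0, stride=1, mode="sample")
train, val, test = sequential_split(n_samples=100, val_len=10, test_len=20,
                                    offset=offset)
```

Under these assumptions the last training sample and the first validation sample touch disjoint sets of time steps, which is the guarantee described for 'sample'. With mode="window" the offset is smaller, so the training targets may fall inside the validation lookback window while the targets themselves stay distinct.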
- class AtTimeStepSplitter(*args, **kwargs)
Split the data at given time steps (only for SpatioTemporalDataset with DatetimeIndex index).
Each split is defined by a (first_ts, last_ts) timestamp range, following the chronological order train -> val -> test. A split is active when at least one of its bounds is given (training is always active); the remaining bounds are then inferred:
  - A missing inner boundary is placed min_offset positions away from the adjacent split (e.g. an open-ended last_val_ts ends just before the test range, and a missing first_test_ts starts just after validation).
  - A missing outer boundary falls back to the edge of the series: training defaults to start at the beginning, and the latest split extends to the end.
  - A held-out (validation or test) split with no bounds at all is left empty.
Splits are kept separated so they do not leak across each other. The separation is controlled by min_offset, using the same vocabulary as TemporalSplitter:
  - 'sample': separate the closest splits by at least ceil(sample_span / stride) positions, so that they share no time step in any role (input window or prediction horizon), for any window/horizon/delay/stride.
  - 'window': separate the closest splits by at least samples_offset positions, so their lookback windows just touch. This avoids leakage (no target step shared across splits) as long as the horizon is short enough relative to the window, and raises otherwise.
(default: 'sample')
After resolving the ranges, the splits are checked and a ValueError is raised if any two are closer than min_offset (e.g. when explicit boundaries would make the splits overlap).
- Parameters:
first_val_ts (optional) – Start of the validation range.
last_val_ts (optional) – End of the validation range.
first_test_ts (optional) – Start of the test range.
last_test_ts (optional) – End of the test range.
first_train_ts (optional) – Start of the training range. Defaults to the beginning of the series.
last_train_ts (optional) – End of the training range. Defaults to the last position keeping training separated from the held-out splits.
min_offset (str) – Minimum separation between the closest splits, either 'sample' or 'window'. (default: 'sample')
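The boundary inference and the final separation check can be sketched in plain Python. The function resolve_splits is hypothetical and is not the library's API. As a simplifying assumption it takes min_offset directly as a number of positions instead of 'sample' or 'window', and it covers only the case where first_val_ts and first_test_ts are given.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def resolve_splits(index, first_val_ts, first_test_ts, min_offset,
                   last_train_ts=None):
    """Return inclusive (first, last) positions for train, val and test.

    `index` is a sorted list of timestamps, one per sample; `min_offset`
    is a number of positions (assumption made for this sketch).
    """
    val_first = bisect_left(index, first_val_ts)
    test_first = bisect_left(index, first_test_ts)
    if last_train_ts is None:
        # Missing inner boundary: stop min_offset positions before validation.
        train_last = val_first - min_offset
    else:
        train_last = bisect_right(index, last_train_ts) - 1
    # Open-ended validation ends min_offset positions before the test range.
    val_last = test_first - min_offset
    splits = [(0, train_last), (val_first, val_last),
              (test_first, len(index) - 1)]
    # Final check: adjacent splits must be at least min_offset positions apart.
    for (_, prev_last), (next_first, _) in zip(splits, splits[1:]):
        if next_first - prev_last < min_offset:
            raise ValueError("splits are closer than min_offset")
    return splits


start = datetime(2024, 1, 1)
index = [start + timedelta(hours=h) for h in range(100)]
splits = resolve_splits(index,
                        first_val_ts=start + timedelta(hours=60),
                        first_test_ts=start + timedelta(hours=80),
                        min_offset=15)
```

Passing an explicit last_train_ts that falls fewer than min_offset positions before first_val_ts makes the sketch raise a ValueError, mirroring the check described above.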