Generate spatio-temporal cross-validation index with anticlust
Source: R/base_learner.R
generate_cv_index_spt.Rd
This function generates a spatio-temporal cross-validation index
based on the anticlust package. It first computes a spatial
clustering index, using anticlust::balanced_clustering() by default.
If cv_pairs is provided, it then generates rank-based pairs from the
proximity between cluster centroids. cv_pairs can be NULL, in which
case only the spatial clustering index is generated. Two conditions
apply: nrow(data) %% ngroup_init must be 0, and cv_pairs must be
greater than ngroup_init and less than the number of 2-combinations
of ngroup_init, i.e. choose(ngroup_init, 2). Each training set will
have 50% overlap with adjacent training sets. "Pairs (combinations)"
are selected based on the rank of the centroids of the ngroup_init
initial clusters, and users can choose between two options through
the pairing argument.
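The constraints stated above can be checked before calling the function. The following is a minimal sketch in base R; `check_cv_args()` is a hypothetical helper written for illustration, not part of the package, and it assumes the documented inequalities are strict.

```r
# Hypothetical helper (not part of the package): validates the documented
# constraints on ngroup_init and cv_pairs.
check_cv_args <- function(n_rows, ngroup_init, cv_pairs = NULL) {
  # Balanced clustering needs equally sized groups.
  if (n_rows %% ngroup_init != 0) {
    stop("nrow(data) must be divisible by ngroup_init.")
  }
  if (!is.null(cv_pairs)) {
    max_pairs <- choose(ngroup_init, 2)
    # Assumption: both inequalities are strict, as the description words them.
    if (cv_pairs <= ngroup_init) {
      stop("cv_pairs must be greater than ngroup_init.")
    }
    if (cv_pairs >= max_pairs) {
      stop("cv_pairs must be less than choose(ngroup_init, 2).")
    }
  }
  invisible(TRUE)
}

# 100 %% 5 == 0 and 5 < 6 < choose(5, 2) = 10, so this passes silently.
check_cv_args(n_rows = 100, ngroup_init = 5L, cv_pairs = 6L)
```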
Arguments
- data
data.table with X, Y, and time information.
- target_cols
character(3). Names of columns for X, Y, and time. Default is c("lon", "lat", "time"). Order insensitive.
- preprocessing
character(1). Preprocessing method for the fields defined in target_cols. This serves to homogenize the scale of the data. Default is "none".
"none": no preprocessing.
"normalize": normalize the data.
"standardize": standardize the data.
- ngroup_init
integer(1). Initial number of splits for pairing groups. Default is 5L.
- cv_pairs
integer(1). Number of pairs for cross-validation. This value is used to generate rank-based pairs based on target_cols values.
- pairing
character(1). Pair selection method.
"1": find the nearest cluster for each cluster first; the remaining pairs are then selected by rank.
"2": rank the pairwise distances directly.
- ...
Additional arguments to be passed.
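The two rescaling options for preprocessing can be illustrated in base R. This is a sketch under stated assumptions: it takes "normalize" to mean min-max scaling to [0, 1] and "standardize" to mean z-scoring, which the documentation does not spell out, and `rescale_cols()` is a hypothetical helper rather than the package's own code.

```r
# Hypothetical helper: rescales the columns named in `cols`.
# Assumption: "normalize" = min-max scaling, "standardize" = z-score.
rescale_cols <- function(df, cols,
                         method = c("none", "normalize", "standardize")) {
  method <- match.arg(method)
  if (method == "none") {
    return(df)
  }
  out <- df
  for (col in cols) {
    x <- as.numeric(df[[col]])  # Date columns become day counts
    # Note: a constant column would produce NaN in either branch.
    out[[col]] <- if (method == "normalize") {
      (x - min(x)) / (max(x) - min(x))
    } else {
      (x - mean(x)) / sd(x)
    }
  }
  out
}

set.seed(1)
toy <- data.frame(
  lon = runif(10, -100, -80),
  lat = runif(10, 30, 40),
  time = as.Date("2021-01-01") + 0:9
)
scaled <- rescale_cols(toy, c("lon", "lat", "time"), method = "normalize")
range(scaled$lon)   # every rescaled column now spans 0 to 1
range(scaled$time)
```

Putting coordinates and time on a common scale keeps any single column from dominating the distances used for clustering.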
Value
List of numeric vectors with balanced cluster numbers and reference lists of assessment set pair numbers in attributes.
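One way to consume the returned value is to turn the cluster index and its "ref_list" attribute into row indices per fold. The sketch below builds a mock object with the same shape as the printed example output; it assumes each pair names the two clusters held out for assessment, which is an interpretation of the description rather than something the documentation states outright.

```r
# Mock object shaped like the function's output: a cluster index per row
# plus a "ref_list" attribute holding one pair of cluster numbers per fold.
cv_index <- structure(
  rep(1:5, times = 4),
  ref_list = list(c(4L, 2L), c(4L, 3L), c(5L, 1L))
)

# Assumption: rows in the paired clusters form the assessment set and
# all other rows form the analysis (training) set.
folds <- lapply(attr(cv_index, "ref_list"), function(pair) {
  in_assessment <- cv_index %in% pair
  list(
    analysis = which(!in_assessment),
    assessment = which(in_assessment)
  )
})

# Each fold holds out 2 of the 5 equally sized clusters (8 of 20 rows).
vapply(folds, function(f) length(f$assessment), integer(1))
```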
Details
Mode "1" assigns at least one pair to each initial cluster: ngroup_init pairs are assigned first, one per initial cluster, and the remaining pairs are then ranked to complete the cv_pairs sets.
Mode "2" ranks the pairwise distances directly, which may leave some overly large initial clusters out of the pairing.
Mode "2" is faster than mode "1", so users are advised to use mode "2" when they are sure that the initial clusters are spatially uniformly distributed.
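The difference between the two modes can be sketched with a handful of centroids in base R. `rank_pairs()` is a hypothetical illustration, not the package's implementation; in particular, it removes duplicate nearest-neighbour pairs before filling up to cv_pairs, which is a simplifying assumption.

```r
# Hypothetical illustration of the two pairing modes.
rank_pairs <- function(centroids, cv_pairs, pairing = c("1", "2")) {
  pairing <- match.arg(pairing)
  n <- nrow(centroids)
  stopifnot(cv_pairs > n, cv_pairs < choose(n, 2))
  d <- unname(as.matrix(dist(centroids)))
  # All unordered cluster pairs, sorted by centroid distance (closest first).
  all_pairs <- t(combn(n, 2))
  ranked <- all_pairs[order(d[all_pairs]), , drop = FALSE]
  if (pairing == "2") {
    return(ranked[seq_len(cv_pairs), , drop = FALSE])
  }
  # Mode "1": every cluster first gets a pair with its nearest neighbour.
  diag(d) <- Inf
  nearest <- cbind(seq_len(n), apply(d, 1, which.min))
  nearest <- unique(t(apply(nearest, 1, sort)))  # simplification: dedupe
  # Fill the remaining slots with the closest pairs not yet used.
  key <- function(m) paste(m[, 1], m[, 2])
  rest <- ranked[!key(ranked) %in% key(nearest), , drop = FALSE]
  n_fill <- cv_pairs - nrow(nearest)
  rbind(nearest, rest[seq_len(n_fill), , drop = FALSE])
}

# Four tightly packed centroids and one remote centroid (cluster 5).
centroids <- cbind(x = c(0, 1, 0, 1, 10), y = c(0, 0, 1, 1, 10))
pairs_mode1 <- rank_pairs(centroids, cv_pairs = 6L, pairing = "1")
pairs_mode2 <- rank_pairs(centroids, cv_pairs = 6L, pairing = "2")

5 %in% pairs_mode1  # TRUE: the remote cluster is guaranteed a pair
5 %in% pairs_mode2  # FALSE: the six closest pairs all lie among clusters 1-4
```

In this toy layout, direct ranking never reaches the remote cluster, which is the situation the advice about spatially uniform clusters guards against.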
Note
nrow(data) %% ngroup_init should be 0. This is a required
condition for anticlust::balanced_clustering().
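To pick an ngroup_init that satisfies this condition, the divisors of the row count can be listed first. This is a small base R sketch; `valid_ngroup_init()` and its `max_groups` cap are hypothetical names introduced for illustration.

```r
# Hypothetical helper: group counts (from 2 up to max_groups) that divide
# the number of rows evenly.
valid_ngroup_init <- function(n_rows, max_groups = 20L) {
  candidates <- seq.int(2L, min(max_groups, n_rows))
  candidates[n_rows %% candidates == 0L]
}

valid_ngroup_init(100)  # 2 4 5 10 20
valid_ngroup_init(97)   # integer(0): 97 is prime, so no value up to 20 works
```

When no candidate exists, the row count itself has to change, for example by dropping or adding rows, before the function can be used.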
Examples
library(data.table)
# 100 random points, each assigned one of five consecutive days
data <- data.table(
  lon = runif(100),
  lat = runif(100),
  time = rep(
    seq.Date(from = as.Date("2021-01-01"), to = as.Date("2021-01-05"), by = "day"),
    20
  )
)
# 5 initial clusters (100 %% 5 == 0) and 6 pairs (5 < 6 < choose(5, 2))
rset_spt <- generate_cv_index_spt(
  data,
  preprocessing = "normalize",
  ngroup_init = 5L, cv_pairs = 6L
)
rset_spt
#> [1] 1 2 3 1 4 4 5 2 3 5 1 4 1 2 3 1 2 3 5 4 4 5 2 3 5 1 4 5 2 3 1 2 3 5 4 4 5
#> [38] 2 3 1 1 4 5 2 3 1 2 3 5 4 4 1 2 3 5 1 4 1 2 3 5 2 3 1 4 4 5 2 3 5 1 4 1 2
#> [75] 3 1 2 3 5 4 4 1 2 3 5 1 4 1 2 3 5 2 3 5 4 4 5 2 3 5
#> attr(,"ref_list")
#> attr(,"ref_list")[[1]]
#> [1] 4 2
#>
#> attr(,"ref_list")[[2]]
#> [1] 4 3
#>
#> attr(,"ref_list")[[3]]
#> [1] 5 1
#>
#> attr(,"ref_list")[[4]]
#> [1] 3 1
#>
#> attr(,"ref_list")[[5]]
#> [1] 4 2
#>
#> attr(,"ref_list")[[6]]
#> [1] 4 3
#>