This function generates a spatio-temporal cross-validation index based on the anticlust package. It first computes a spatial clustering index, by default with anticlust::balanced_clustering(). If cv_pairs is provided, it then generates rank-based pairs of clusters from the proximity between cluster centroids; if cv_pairs is NULL, only the spatial clustering index is generated. Three conditions apply: ngroup_init should be lower than cv_pairs, nrow(data) %% ngroup_init should be 0, and cv_pairs should be less than the number of 2-combinations of ngroup_init. Each training set will have 50% overlap with adjacent training sets. Pairs (combinations) are selected by ranking the centroids of the ngroup_init initial clusters, using one of the two methods available through pairing.

Usage

generate_cv_index_spt(
  data,
  target_cols = c("lon", "lat", "time"),
  preprocessing = c("none", "normalize", "standardize"),
  ngroup_init = 5L,
  cv_pairs = NULL,
  pairing = c("1", "2"),
  ...
)

Arguments

data

data.table with X, Y, and time information.

target_cols

character(3). Names of columns for X, Y, and time. Default is c("lon", "lat", "time"). Order insensitive.

preprocessing

character(1). Preprocessing method for the fields defined in target_cols. This serves to homogenize the scale of the data. Default is "none".

  • "none": no preprocessing.

  • "normalize": normalize the data.

  • "standardize": standardize the data.
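The two rescaling options can be illustrated with a minimal sketch. It assumes that "normalize" means min-max rescaling to [0, 1] and "standardize" means centering and scaling to unit standard deviation; the package may define them differently. The helpers rescale_minmax() and rescale_zscore() are hypothetical names used only for illustration.

```r
# Hypothetical helpers; the package's own implementation may differ.
rescale_minmax <- function(x) {
  x <- as.numeric(x)  # Dates become days since 1970-01-01
  (x - min(x)) / (max(x) - min(x))
}

rescale_zscore <- function(x) {
  x <- as.numeric(x)
  (x - mean(x)) / sd(x)
}

lon <- c(-120, -100, -80)
time <- as.Date("2021-01-01") + c(0, 2, 4)

rescale_minmax(lon)   # 0.0 0.5 1.0
rescale_minmax(time)  # 0.0 0.5 1.0
rescale_zscore(lon)   # -1 0 1
```

Either way, coordinates and time end up on comparable scales, so no single column dominates the distances used for clustering.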

ngroup_init

integer(1). Initial number of splits for pairing groups. Default is 5L.

cv_pairs

integer(1). Number of pairs for cross-validation. This value is used to generate rank-based pairs from the target_cols values. Default is NULL, in which case only the spatial clustering index is generated.

pairing

character(1). Pair selection method.

  • "1": find the nearest neighbor of each cluster first, then select the remaining pairs by rank.

  • "2": rank the pairwise distances directly.

...

Additional arguments to be passed.

Value

Numeric vector of balanced cluster numbers, one per row of data. When cv_pairs is given, the reference list of assessment set pair numbers is stored in the "ref_list" attribute.
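As a rough illustration of how the returned index might be consumed, the sketch below builds a small mock vector with a "ref_list" attribute instead of calling the function. It assumes that each element of ref_list names two clusters whose rows together form one assessment set, with all other rows used for training; this reading is an assumption, not a documented guarantee.

```r
# Mock return value: 10 observations in 5 balanced clusters,
# plus a hand-made "ref_list" attribute holding two cluster pairs.
idx <- c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5)
attr(idx, "ref_list") <- list(c(4, 2), c(5, 1))

ref_list <- attr(idx, "ref_list")

# Assumption: rows in either cluster of a pair form the assessment set,
# and all remaining rows form the training set.
folds <- lapply(ref_list, function(pair) {
  in_assessment <- idx %in% pair
  list(
    assessment = which(in_assessment),
    training = which(!in_assessment)
  )
})

folds[[1]]$assessment  # rows in clusters 4 and 2: 2 4 7 9
```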

Details

  • Mode "1" assigns at least one pair to each initial cluster: ngroup_init pairs are assigned first, one per initial cluster, then the remaining candidate pairs are ranked to complete the cv_pairs sets.

  • Mode "2" ranks the pairwise distances directly, so some overly large initial clusters may be left out of the pairing.

Mode "2" is faster than mode "1", so it is recommended when users are sure that the initial clusters are spatially uniformly distributed.
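The difference between the two modes can be sketched in base R on toy centroids. This is not the package's implementation: the centroids are invented, and the sketch drops duplicate nearest-neighbor pairs before filling the remaining slots, which the package may not do.

```r
# Toy centroids for five initial clusters; cluster 5 is far from the rest.
centroids <- rbind(c(0, 0), c(1, 0), c(0, 2), c(1.5, 2.5), c(10, 10))
cv_pairs <- 6L

# Pairwise centroid distances and all 2-combinations, nearest first.
d <- as.matrix(dist(centroids))
pairs <- t(combn(nrow(centroids), 2))
ranked <- pairs[order(d[pairs]), , drop = FALSE]

# Label each pair as "i-j" so pairs can be compared across matrices.
key <- function(p) paste(p[, 1], p[, 2], sep = "-")

# Mode "2": take the cv_pairs closest pairs directly.
mode2 <- ranked[seq_len(cv_pairs), , drop = FALSE]

# Mode "1": pair every cluster with its nearest neighbor first,
# then fill the remaining slots from the ranked list.
diag(d) <- Inf
nn <- cbind(seq_len(nrow(d)), apply(d, 1, which.min))
nn <- unique(cbind(pmin(nn[, 1], nn[, 2]), pmax(nn[, 1], nn[, 2])))
rest <- ranked[!key(ranked) %in% key(nn), , drop = FALSE]
mode1 <- rbind(nn, rest)[seq_len(cv_pairs), , drop = FALSE]

5 %in% mode2  # FALSE: the distant cluster never appears in a pair
5 %in% mode1  # TRUE: every cluster is paired at least once
```

With these centroids, direct ranking fills all six slots with pairs among clusters 1 to 4, while the nearest-neighbor step guarantees that cluster 5 is paired.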

Note

nrow(data) %% ngroup_init should be 0. This condition is required by anticlust::balanced_clustering().
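The conditions on ngroup_init and cv_pairs can be checked before calling the function. The helper below, check_cv_args(), is a hypothetical name and not part of the package; it only restates the documented conditions as assertions.

```r
# Hypothetical helper, not part of the package.
check_cv_args <- function(n_rows, ngroup_init, cv_pairs = NULL) {
  # Balanced clusters need the row count to divide evenly.
  stopifnot(n_rows %% ngroup_init == 0)
  if (!is.null(cv_pairs)) {
    stopifnot(
      cv_pairs > ngroup_init,            # ngroup_init lower than cv_pairs
      cv_pairs < choose(ngroup_init, 2)  # fewer than all 2-combinations
    )
  }
  invisible(TRUE)
}

check_cv_args(100, ngroup_init = 5L, cv_pairs = 6L)  # passes silently
choose(5L, 2)  # 10, so cv_pairs can range from 6 to 9 here
```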

Author

Insang Song

Examples

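library(data.table)

# 100 random points observed over five consecutive days (20 points per day)
data <- data.table(
  lon = runif(100),
  lat = runif(100),
  time = rep(
    seq.Date(
      from = as.Date("2021-01-01"), to = as.Date("2021-01-05"), by = "day"
    ),
    20
  )
)

# nrow(data) = 100 is divisible by ngroup_init = 5, and
# cv_pairs = 6 lies between ngroup_init and choose(5, 2) = 10
rset_spt <-
  generate_cv_index_spt(
    data, preprocessing = "normalize",
    ngroup_init = 5L, cv_pairs = 6L
  )
rset_spt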
#>   [1] 1 2 3 1 4 4 5 2 3 5 1 4 1 2 3 1 2 3 5 4 4 5 2 3 5 1 4 5 2 3 1 2 3 5 4 4 5
#>  [38] 2 3 1 1 4 5 2 3 1 2 3 5 4 4 1 2 3 5 1 4 1 2 3 5 2 3 1 4 4 5 2 3 5 1 4 1 2
#>  [75] 3 1 2 3 5 4 4 1 2 3 5 1 4 1 2 3 5 2 3 5 4 4 5 2 3 5
#> attr(,"ref_list")
#> attr(,"ref_list")[[1]]
#> [1] 4 2
#> 
#> attr(,"ref_list")[[2]]
#> [1] 4 3
#> 
#> attr(,"ref_list")[[3]]
#> [1] 5 1
#> 
#> attr(,"ref_list")[[4]]
#> [1] 3 1
#> 
#> attr(,"ref_list")[[5]]
#> [1] 4 2
#> 
#> attr(,"ref_list")[[6]]
#> [1] 4 3
#>