Title: | Space-Time Anomaly Detection using Scan Statistics |
---|---|
Description: | Detection of anomalous space-time clusters using the scan statistics methodology. Focuses on prospective surveillance of data streams, scanning for clusters with ongoing anomalies. Hypothesis testing is made possible by Monte Carlo simulation. Allévius (2018) <doi:10.21105/joss.00515>. |
Authors: | Benjamin Allévius [aut], Paul Romer Present [ctb, cre] |
Maintainer: | Paul Romer Present <[email protected]> |
License: | GPL (>= 3) |
Version: | 1.1.1 |
Built: | 2025-02-22 05:41:52 UTC |
Source: | https://github.com/promerpr/scanstatistics |
Get the k nearest neighbors for each location, including the location itself.
This function calls dist
, so the options for the
distance measure used is the same as for that one. Distances are calculated
between rows.
coords_to_knn(x, k = min(10, nrow(x)), method = "euclidean", p = 2)
coords_to_knn(x, k = min(10, nrow(x)), method = "euclidean", p = 2)
x |
a numeric matrix, data frame or |
k |
The number of nearest neighbors, counting the location itself. |
method |
the distance measure to be used. This must be one of
|
p |
The power of the Minkowski distance. |
An integer matrix of the nearest neighbors for each location.
Each row corresponds to a location, with the first element of each row
being the location itself. Locations are encoded as integers.
x <- matrix(c(0, 0, 1, 0, 2, 1, 0, 4, 1, 3), ncol = 2, byrow = TRUE) plot(x) coords_to_knn(x)
x <- matrix(c(0, 0, 1, 0, 2, 1, 0, 4, 1, 3), ncol = 2, byrow = TRUE) plot(x) coords_to_knn(x)
Convert a long data frame to a wide matrix, with time along the row dimension and locations along the column dimension. Values in the matrix could be e.g. the observed counts or the population.
df_to_matrix(df, time_col = 1, location_col = 2, value_col = 3)
df_to_matrix(df, time_col = 1, location_col = 2, value_col = 3)
df |
A data frame with at least 3 columns. |
time_col |
Integer or string that specifies the time column. |
location_col |
Integer or string that specifies the location column. |
value_col |
Integer or string that specifies the value column. |
A matrix with time on rows and locations on columns.
nearest neighbors.Given a distance matrix, calculate the nearest neighbors of each
location, including the location itself. The matrix should contain only zeros
on the diagonal, and all other elements should be positive.
dist_to_knn(x, k = min(10, nrow(x)))
dist_to_knn(x, k = min(10, nrow(x)))
x |
A (square) distance matrix. Elements should be non-negative and the diagonal zeros, but this is not checked. |
k |
The number of nearest neighbors, counting the location itself. |
A matrix of integers, row containing the
nearest
neighbors of location
, including itself.
x <- matrix(c(0, 0, 1, 0, 2, 1, 0, 4, 1, 3), ncol = 2, byrow = TRUE) d <- dist(x, diag = TRUE, upper = TRUE) dist_to_knn(d, k = 3)
x <- matrix(c(0, 0, 1, 0, 2, 1, 0, 4, 1, 3), ncol = 2, byrow = TRUE) d <- dist(x, diag = TRUE, upper = TRUE) dist_to_knn(d, k = 3)
Given a matrix of nearest neighbors and an adjacency matrix
for the locations involved, produces the set of flexibly shaped zones
as a list of integer vectors. The locations in these zones are all connected,
in the sense that any location in the zone can be reached from another by
traveling through adjacent locations within the zone.
flexible_zones(k_nearest, adjacency_matrix)
flexible_zones(k_nearest, adjacency_matrix)
k_nearest |
An integer matrix of the |
adjacency_matrix |
A boolean matrix, with element |
A list of integer vectors.
Tango, T. & Takahashi, K. (2005), A flexibly shaped spatial scan statistic for detecting clusters, International Journal of Health Geographics 4(1).
A <- matrix(c(0,1,0,0,0,0, 1,0,1,0,0,0, 0,1,0,0,0,0, 0,0,0,0,1,0, 0,0,0,1,0,0, 0,0,0,0,0,0), nrow = 6, byrow = TRUE) == 1 nn <- matrix(as.integer(c(1,2,3,4,5,6, 2,1,3,4,5,6, 3,2,1,4,5,6, 4,5,1,6,3,2, 5,4,6,1,3,2, 6,5,4,1,3,2)), nrow = 6, byrow = TRUE) flexible_zones(nn, A)
A <- matrix(c(0,1,0,0,0,0, 1,0,1,0,0,0, 0,1,0,0,0,0, 0,0,0,0,1,0, 0,0,0,1,0,0, 0,0,0,0,0,0), nrow = 6, byrow = TRUE) == 1 nn <- matrix(as.integer(c(1,2,3,4,5,6, 2,1,3,4,5,6, 3,2,1,4,5,6, 4,5,1,6,3,2, 5,4,6,1,3,2, 6,5,4,1,3,2)), nrow = 6, byrow = TRUE) flexible_zones(nn, A)
Extract zone number from the set of all zones.
get_zone(n, zones)
get_zone(n, zones)
n |
An integer; the number of the zone you wish to retrieve. |
zones |
A list of integer vectors, representing the set of all zones. |
An integer vector.
zones <- list(1L, 2L, 3L, 1:2, c(1L, 3L), c(2L, 3L)) get_zone(4, zones)
zones <- list(1L, 2L, 3L, 1:2, c(1L, 3L), c(2L, 3L)) get_zone(4, zones)
-value for a scan statistic.Given an observed scan statistic and a vector of replicate
scan statistics
,
, fit a Gumbel
distribution to the replicates and calculate a
-value for the observed
statistic based on the fitted distribution.
The function is vectorized, so multiple -values can be calculated if
several scan statistics (e.g. statistics from secondary clusters) are
supplied.
gumbel_pvalue(observed, replicates, method = "ML", ...)
gumbel_pvalue(observed, replicates, method = "ML", ...)
observed |
A scalar containing the observed value of the scan statistic, or a vector of observed values from secondary clusters. |
replicates |
A vector of Monte Carlo replicates of the scan statistic. |
method |
Either "ML", for maximum likelihood, or "MoM", for method of moments. |
... |
Additional arguments passed to |
The -value or
-values corresponding to the observed
scan statistic(s).
nearest neighbors for all locations.Returns the set of increasing nearest neighbor sets for all locations, as
a list of integer vectors. That is, for each location the list returned
contains one vector containing the location itself, another containing the
location and its nearest neighbor, and so on, up to the vector containing the
location and its nearest neighbors.
knn_zones(k_nearest)
knn_zones(k_nearest)
k_nearest |
An integer matrix of with |
A list of integer vectors.
nn <- matrix(c(1L, 2L, 4L, 3L, 5L, 2L, 1L, 3L, 4L, 5L, 3L, 2L, 4L, 1L, 5L, 4L, 1L, 2L, 3L, 5L, 5L, 3L, 4L, 2L, 1L), ncol = 5, byrow = TRUE) knn_zones(nn[, 1:3])
nn <- matrix(c(1L, 2L, 4L, 3L, 5L, 2L, 1L, 3L, 4L, 5L, 3L, 2L, 4L, 1L, 5L, 4L, 1L, 2L, 3L, 5L, 5L, 3L, 4L, 2L, 1L), ncol = 5, byrow = TRUE) knn_zones(nn[, 1:3])
-value for a scan statistic.Given an observed scan statistic and a vector of replicate
scan statistics
,
, calculate the Monte
Carlo
-value as
The function is vectorized, so multiple -values can be calculated if
several scan statistics (e.g. statistics from secondary clusters) are
supplied.
mc_pvalue(observed, replicates)
mc_pvalue(observed, replicates)
observed |
A scalar containing the observed value of the scan statistic, or a vector of observed values from secondary clusters. |
replicates |
A vector of Monte Carlo replicates of the scan statistic. |
The -value or
-values corresponding to the observed
scan statistic(s).
A dataset containing the longitude and latitude of the county seats of New Mexico, except for Cibola county.
NM_geo
NM_geo
A data frame with 33 rows and 7 variables:
Factor; the counties of New Mexico (no spaces).
Character; the name of the county seat, i.e. the administrative center or seat of government.
Numeric; the area in square kilometers of each county.
Numeric; the longitude of the county seat.
Numeric; the latitude of the county seat.
Numeric; the longitude of the geographical center of the county.
Numeric; the latitude of the geographical center of the county.
https://en.wikipedia.org/wiki/List_of_counties_in_New_Mexico
Map data for New Mexico. Was created using ggplot2::map_data
.
NM_map
NM_map
A data frame with 867 rows and 7 variables:
Numeric; longitude of county polygon corner.
Numeric; latitude of county polygon corner.
Numeric; grouping by county.
Numeric; order of the polygon corners.
Character; region is "new mexico" for all rows.
Character; the county name (with spaces).
Factor; the county name (no spaces).
A dataset containing the population count and number of brain cancer cases in the counties of New Mexico during the years 1973–1991. The population numbers are interpolations from the censuses conducted in 1973, 1982, and 1991. Interpolations were done using a quadratic function of time. Thus the year-to-year changes are overly smooth but match the census numbers in the three years mentioned.
NM_popcas
NM_popcas
A data frame with 608 rows and 4 variables:
Integer; the year the cases were recorded.
Character; the name of the county (no spaces).
Integer; the population in that county and year.
Integer; the number of brain cancer cases in that county and year.
Calculate the "Bayesian Spatial Scan Statistic" by Neill et al. (2006), adapted to a spatio-temporal setting. The scan statistic assumes that, given the relative risk, the data follows a Poisson distribution. The relative risk is in turn assigned a Gamma distribution prior, yielding a negative binomial marginal distribution for the counts under the null hypothesis. Under the alternative hypothesis, the
scan_bayes_negbin( counts, zones, baselines = NULL, population = NULL, outbreak_prob = 0.05, alpha_null = 1, beta_null = 1, alpha_alt = alpha_null, beta_alt = beta_null, inc_values = seq(1, 3, by = 0.1), inc_probs = 1 )
scan_bayes_negbin( counts, zones, baselines = NULL, population = NULL, outbreak_prob = 0.05, alpha_null = 1, beta_null = 1, alpha_alt = alpha_null, beta_alt = beta_null, inc_values = seq(1, 3, by = 0.1), inc_probs = 1 )
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
population |
Optional. A matrix or vector of populations for each
location. Not needed if |
outbreak_prob |
A scalar; the probability of an outbreak (at any time, any place). Defaults to 0.05. |
alpha_null |
A scalar; the shape parameter for the gamma distribution under the null hypothesis of no anomaly. Defaults to 1. |
beta_null |
A scalar; the scale parameter for the gamma distribution under the null hypothesis of no anomaly. Defaults to 1. |
alpha_alt |
A scalar; the shape parameter for the gamma distribution
under the alternative hypothesis of an anomaly. Defaults to the same value
as |
beta_alt |
A scalar; the scale parameter for the gamma distribution
under the alternative hypothesis of an anomaly. Defaults to the same value
as |
inc_values |
A vector of possible values for the increase in the mean (and variance) of an anomalous count. Defaults to evenly spaced values between 1 and 3, with a difference of 0.1 between consecutive values. |
inc_probs |
A vector of the prior probabilities of each value in
|
A list which, in addition to the information about the type of scan
statistic, has the following components: priors
(list),
posteriors
(list), MLC
(list) and marginal_data_prob
(scalar). The list MLC
has elements
The number of the spatial zone of the most likely cluster (MLC).
The most likely event duration.
The posterior log probability that an event is ongoing in the MLC.
The logarithm of the Bayes factor for the MLC.
The posterior probability that an event is ongoing in the MLC.
The locations involved in the MLC.
The list priors
has elements
The prior probability of no anomaly.
The prior probability of an anomaly.
A vectorof prior probabilities of each value in the
argument inc_values
.
The prior probability of an outbreak in any of the space-time windows.
The list posteriors
has elements
The posterior probability of no anomaly.
The posterior probability of an anomaly.
A data frame with columns inc_values
and
inc_posterior
.
A data frame with columns zone
,
duration
, log_posterior
and
log_bayes_factor
, each row corresponding
to a space-time window.
A matrix with the posterior anomaly probability of each location-time combination.
A vector with the posterior probability of an anomaly at each location.
Neill, D. B., Moore, A. W., Cooper, G. F. (2006). A Bayesian Spatial Scan Statistic. Advances in Neural Information Processing Systems 18.
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]), length(ob_rows), length(ob_cols)) res <- scan_bayes_negbin(counts = counts, zones = zones, baselines = baselines)
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]), length(ob_rows), length(ob_cols)) res <- scan_bayes_negbin(counts = counts, zones = zones, baselines = baselines)
Calculate the expectation-based negative binomial scan statistic devised by Tango et al. (2011).
scan_eb_negbin( counts, zones, baselines = NULL, thetas = 1, type = c("hotspot", "emerging"), n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
scan_eb_negbin( counts, zones, baselines = NULL, thetas = 1, type = c("hotspot", "emerging"), n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
thetas |
Optional. A matrix of the same dimensions as |
type |
A string, either "hotspot" or "emerging". If "hotspot", the relative risk is assumed to be fixed over time. If "emerging", the relative risk is assumed to increase with the duration of the outbreak. |
n_mcsim |
A non-negative integer; the number of replicate scan
statistics to generate in order to calculate a |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
A list which, in addition to the information about the type of scan statistic, has the following components:
A list containing the number of the zone of the most likely
cluster (MLC), the locations in that zone, the duration of the
MLC, and the calculated score. In order, the
elements of this list are named zone_number, locations,
duration, score
.
A data frame containing, for each combination of zone
and duration investigated, the zone number, duration, and score.
The table is sorted by score with the top-scoring location on top.
If max_only = TRUE
, only contains a single row
corresponding to the MLC.
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
The Monte Carlo -value.
A -value obtained by fitting a Gumbel
distribution to the replicate scan statistics.
The number of zones scanned.
The number of locations.
The maximum duration considered.
The number of Monte Carlo replicates made.
Tango, T., Takahashi, K. & Kohriyama, K. (2011), A space-time scan statistic for detecting emerging outbreaks, Biometrics 67(1), 106–115.
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) thetas <- matrix(runif(n_total, 0.05, 3), max_duration, n_locs) counts <- matrix(rnbinom(n_total, mu = baselines, size = thetas), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rnbinom(ob_dur * length(ob_cols), mu = 2 * baselines[ob_rows, ob_cols], size = thetas[ob_rows, ob_cols]), length(ob_rows), length(ob_cols)) res <- scan_eb_negbin(counts = counts, zones = zones, baselines = baselines, thetas = thetas, type = "hotspot", n_mcsim = 99, max_only = FALSE)
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) thetas <- matrix(runif(n_total, 0.05, 3), max_duration, n_locs) counts <- matrix(rnbinom(n_total, mu = baselines, size = thetas), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rnbinom(ob_dur * length(ob_cols), mu = 2 * baselines[ob_rows, ob_cols], size = thetas[ob_rows, ob_cols]), length(ob_rows), length(ob_cols)) res <- scan_eb_negbin(counts = counts, zones = zones, baselines = baselines, thetas = thetas, type = "hotspot", n_mcsim = 99, max_only = FALSE)
Calculate the expectation-based Poisson scan statistic devised by Neill et al. (2005).
scan_eb_poisson( counts, zones, baselines = NULL, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
scan_eb_poisson( counts, zones, baselines = NULL, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
population |
Optional. A matrix or vector of populations for each
location. Not needed if |
n_mcsim |
A non-negative integer; the number of replicate scan
statistics to generate in order to calculate a |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
A list which, in addition to the information about the type of scan statistic, has the following components:
A list containing the number of the zone of the most likely
cluster (MLC), the locations in that zone, the duration of the
MLC, the calculated score, and the relative risk. In order, the
elements of this list are named zone_number, locations,
duration, score, relative_risk
.
A data frame containing, for each combination of zone
and duration investigated, the zone number, duration, score,
relative risk. The table is sorted by score with the top-scoring
location on top. If max_only = TRUE
, only contains a single
row corresponding to the MLC.
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
The Monte Carlo -value.
A -value obtained by fitting a Gumbel
distribution to the replicate scan statistics.
The number of zones scanned.
The number of locations.
The maximum duration considered.
The number of Monte Carlo replicates made.
Neill, D. B., Moore, A. W., Sabhnani, M. and Daniel, K. (2005). Detection of emerging space-time clusters. Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining - KDD ’05, 218.
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]), length(ob_rows), length(ob_cols)) res <- scan_eb_poisson(counts = counts, zones = zones, baselines = baselines, n_mcsim = 99, max_only = FALSE)
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(baselines)), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols]), length(ob_rows), length(ob_cols)) res <- scan_eb_poisson(counts = counts, zones = zones, baselines = baselines, n_mcsim = 99, max_only = FALSE)
Calculates the expectation-based scan statistic. See details below.
scan_eb_zip( counts, zones, baselines = NULL, probs = NULL, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE, rel_tol = 0.001 )
scan_eb_zip( counts, zones, baselines = NULL, probs = NULL, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE, rel_tol = 0.001 )
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
baselines |
Optional. A matrix of the same dimensions as |
probs |
Optional. A matrix of the same dimensions as |
population |
Optional. A matrix or vector of populations for each
location. Only needed if |
n_mcsim |
A non-negative integer; the number of replicate scan
statistics to generate in order to calculate a |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
rel_tol |
A positive scalar. If the relative change in the incomplete information likelihood is less than this value, then the EM algorithm is deemed to have converged. |
For the expectation-based zero-inflated Poisson scan statistic
(Allévius & Höhle 2017), the null hypothesis of no anomaly holds that
the count observed at each location and duration
(the
number of time periods before present) has a zero-inflated Poisson
distribution with expected value parameter
and structural
zero probability
:
This holds for all locations and all durations
, with
being the maximum duration considered.
Under the alternative hypothesis, there is a space-time window
consisting of a spatial zone
and a time
window
such that the counts in that
window have their Poisson expected value parameters inflated by a factor
compared to the null hypothesis:
For locations and durations outside of this window, counts are assumed to
be distributed as under the null hypothesis. The sets considered
are those specified in the argument
zones
, while the maximum
duration is taken as the maximum value in the column
duration
of the input table
.
For each space-time window considered, (the log of) a likelihood
ratio is computed using the distributions under the alternative and null
hypotheses, and the expectation-based Poisson scan statistic is calculated
as the maximum of these quantities over all space-time windows. The
expectation-maximization (EM) algorithm is used to obtain maximum
likelihood estimates.
A list which, in addition to the information about the type of scan statistic, has the following components:
A list containing the number of the zone of the most likely
cluster (MLC), the locations in that zone, the duration of the
MLC, the calculated score, the relative risk, and the number of
iterations until convergence for the EM algorithm. In order, the
elements of this list are named zone_number, locations,
duration, score, relative_risk, n_iter
.
A data frame containing, for each combination of zone
and duration investigated, the zone number, duration, score,
relative risk, number of EM iterations. The table is sorted by
score with the top-scoring location on top. If
max_only = TRUE
, only contains a single row corresponding
to the MLC.
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
The Monte Carlo -value.
A -value obtained by fitting a Gumbel
distribution to the replicate scan statistics.
The number of zones scanned.
The number of locations.
The maximum duration considered.
The number of Monte Carlo replicates made.
Allévius, B. and Höhle, M, An expectation-based space-time scan statistic for ZIP-distributed data, (Technical report), Link to PDF.
if (require("gamlss.dist")) { set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) probs <- matrix(runif(n_total) / 4, max_duration, n_locs) counts <- matrix(gamlss.dist::rZIP(n_total, baselines, probs), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- gamlss.dist::rZIP( ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols], probs[ob_rows, ob_cols]) res <- scan_eb_zip(counts = counts, zones = zones, baselines = baselines, probs = probs, n_mcsim = 9, max_only = FALSE, rel_tol = 1e-3) }
if (require("gamlss.dist")) { set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data baselines <- matrix(rexp(n_total, 1/5), max_duration, n_locs) probs <- matrix(runif(n_total) / 4, max_duration, n_locs) counts <- matrix(gamlss.dist::rZIP(n_total, baselines, probs), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- gamlss.dist::rZIP( ob_dur * length(ob_cols), 2 * baselines[ob_rows, ob_cols], probs[ob_rows, ob_cols]) res <- scan_eb_zip(counts = counts, zones = zones, baselines = baselines, probs = probs, n_mcsim = 9, max_only = FALSE, rel_tol = 1e-3) }
Calculate the population-based Poisson scan statistic devised by Kulldorff (1997, 2001).
scan_pb_poisson( counts, zones, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
scan_pb_poisson( counts, zones, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
population |
Optional. A matrix or vector of populations for each
location and time point. Only needed if |
n_mcsim |
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a P-value. |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
A list which, in addition to the information about the type of scan statistic, has the following components:
A list containing the number of the zone of the most likely
cluster (MLC), the locations in that zone, the duration of the
MLC, the calculated score, and the relative risk inside and
outside the cluster. In order, the elements of this list are named
zone_number, locations, duration, score, relrisk_in,
relrisk_out
.
A data frame containing, for each combination of zone
and duration investigated, the zone number, duration, score,
relative risks. The table is sorted by score with the top-scoring
location on top. If max_only = TRUE
, only contains a single
row corresponding to the MLC.
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
The Monte Carlo -value.
A -value obtained by fitting a Gumbel
distribution to the replicate scan statistics.
The number of zones scanned.
The number of locations.
The maximum duration considered.
The number of Monte Carlo replicates made.
Kulldorff, M. (1997). A spatial scan statistic. Communications in Statistics - Theory and Methods, 26, 1481–1496.
Kulldorff, M. (2001). Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society, Series A (Statistics in Society), 164, 61–72.
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data population <- matrix(rnorm(n_total, 100, 10), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(population) / 20), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * population[ob_rows, ob_cols] / 20), length(ob_rows), length(ob_cols)) res <- scan_pb_poisson(counts = counts, zones = zones, population = population, n_mcsim = 99, max_only = FALSE)
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data population <- matrix(rnorm(n_total, 100, 10), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(population) / 20), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * population[ob_rows, ob_cols] / 20), length(ob_rows), length(ob_cols)) res <- scan_pb_poisson(counts = counts, zones = zones, population = population, n_mcsim = 99, max_only = FALSE)
Calculate the space-time permutation scan statistic devised by Kulldorff (2005).
scan_permutation( counts, zones, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
scan_permutation( counts, zones, population = NULL, n_mcsim = 0, gumbel = FALSE, max_only = FALSE )
counts |
Either:
|
zones |
A list of integer vectors. Each vector corresponds to a single zone; its elements are the numbers of the locations in that zone. |
population |
Optional. A matrix or vector of populations for each
location and time point. Only needed if |
n_mcsim |
A non-negative integer; the number of replicate scan statistics to generate in order to calculate a P-value. |
gumbel |
Logical: should a Gumbel P-value be calculated? Default is
|
max_only |
Boolean. If |
A list which, in addition to the information about the type of scan statistic, has the following components:
A list containing the number of the zone of the most likely
cluster (MLC), the locations in that zone, the duration of the
MLC, the calculated score, and the relative risk inside and
outside the cluster. In order, the elements of this list are named
zone_number, locations, duration, score, relrisk_in,
relrisk_out
.
A data frame containing, for each combination of zone
and duration investigated, the zone number, duration, score,
relative risks. The table is sorted by score with the top-scoring
location on top. If max_only = TRUE
, only contains a single
row corresponding to the MLC.
A data frame of the Monte Carlo replicates of the scan statistic (if any), and the corresponding zones and durations.
The Monte Carlo -value.
A -value obtained by fitting a Gumbel
distribution to the replicate scan statistics.
The number of zones scanned.
The number of locations.
The maximum duration considered.
The number of Monte Carlo replicates made.
Kulldorff, M., Heffernan, R., Hartman, J., Assunção, R. M., Mostashari, F. (2005). A space-time permutation scan statistic for disease outbreak detection. PLoS Medicine, 2(3), 0216-0224.
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data population <- matrix(rnorm(n_total, 100, 10), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(population) / 20), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * population[ob_rows, ob_cols] / 20), length(ob_rows), length(ob_cols)) res <- scan_permutation(counts = counts, zones = zones, population = population, n_mcsim = 99, max_only = FALSE)
set.seed(1) # Create location coordinates, calculate nearest neighbors, and create zones n_locs <- 50 max_duration <- 5 n_total <- n_locs * max_duration geo <- matrix(rnorm(n_locs * 2), n_locs, 2) knn_mat <- coords_to_knn(geo, 15) zones <- knn_zones(knn_mat) # Simulate data population <- matrix(rnorm(n_total, 100, 10), max_duration, n_locs) counts <- matrix(rpois(n_total, as.vector(population) / 20), max_duration, n_locs) # Inject outbreak/event/anomaly ob_dur <- 3 ob_cols <- zones[[10]] ob_rows <- max_duration + 1 - seq_len(ob_dur) counts[ob_rows, ob_cols] <- matrix( rpois(ob_dur * length(ob_cols), 2 * population[ob_rows, ob_cols] / 20), length(ob_rows), length(ob_cols)) res <- scan_permutation(counts = counts, zones = zones, population = population, n_mcsim = 99, max_only = FALSE)
The scanstatistics package provides two categories of important functions: data preparation functions, and the scan statistics themselves.
These functions prepare your data for use. In particular, it helps you define the zones which will be considered by the scan statistics.
These are the functions used for space-time anomaly detection. Scan statistic
functions for univariate space-time data have a name that begins with
scan_
and functions for multivariate space-time data have a name that
begins with mscan_
.
For each location, compute the average of the statistic calculated for each space-time window that the location is included in, i.e. average the statistic over both zones and the maximum duration.
score_locations(x, zones)
score_locations(x, zones)
x |
An object of class |
zones |
A list of integer vectors. |
A data.table
with the following columns:
The locations (as integers).
For each location, the sum of all window statistics that the location appears in.
The number of spatial zones that the location appears in.
The total score divided by the number of zones and the maximum duration.
The score divided by the maximum score.
# Simple example set.seed(1) table <- data.frame(zone = 1:5, duration = 1, score = 5:1) zones <- list(1:2, 1:3, 2:5, 4:5, c(1, 5)) x <- list(observed = table, n_locations = 5, max_duration = 1, n_zones = 5) score_locations(x, zones)
# Simple example set.seed(1) table <- data.frame(zone = 1:5, duration = 1, score = 5:1) zones <- list(1:2, 1:3, 2:5, 4:5, c(1, 5)) x <- list(observed = table, n_locations = 5, max_duration = 1, n_zones = 5) score_locations(x, zones)
Get the top space-time clusters according to the statistic calculated
for each cluster (the maximum being the scan statistic). The default is to
return the spatially non-overlapping clusters, i.e. those that do not have
any locations in common.
top_clusters( x, zones, k = 5, overlapping = FALSE, gumbel = FALSE, alpha = NULL, ... )
top_clusters( x, zones, k = 5, overlapping = FALSE, gumbel = FALSE, alpha = NULL, ... )
x |
An object of class scanstatistics. |
zones |
A list of integer vectors. |
k |
An integer, the number of clusters to return. |
overlapping |
Logical; should the top clusters be allowed to overlap in
the spatial dimension? The default is |
gumbel |
Logical; should a Gumbel P-value be calculated? The default is
|
alpha |
A significance level, which if not |
... |
Parameters passed to |
A data frame with at most rows, with columns
zone, duration, score
and possibly MC_pvalue, Gumbel_pvalue
and critical_value
.
set.seed(1) counts <- matrix(rpois(15, 3), 3, 5) zones <- list(1:2, 1:3, 2:5, c(1, 3), 4:5, c(1, 5)) scanres <- scan_permutation(counts, zones, n_mcsim = 5) top_clusters(scanres, zones, k = 4, overlapping = FALSE)
set.seed(1) counts <- matrix(rpois(15, 3), 3, 5) zones <- list(1:2, 1:3, 2:5, c(1, 3), 4:5, c(1, 5)) scanres <- scan_permutation(counts, zones, n_mcsim = 5) top_clusters(scanres, zones, k = 4, overlapping = FALSE)