Sampling with Close Pairs

Introduction

The sidclosepairs package is designed to produce sampling designs for surveys. The emphasis is on getting a set of samples that are spatially spread across the survey area - a spatially inhibitory design - with a user-defined number of close pair samples to look at small-scale structure in the data.

Usage

Install, and attach in the usual way.

library(sidclosepairs)

Sampling Method

The package supports four different methods for generating sampling designs. Each work on an initial sequential rejection sampling process where n points are generated from a spatial distribution, and each point is accepted if it is greater than the given threshold distance from the previous accepted points.

The methods for the initial sample points are:

  • Sample from a uniform distribution over a polygon
  • Sample from a distribution defined by a raster grid over a polygon
  • Sample from a set of points in a polygon, such as a full set of census locations
  • Sample n points uniformly split across N polygons, maintaining the inhibitory distance across all polygons, so that a point in polygon 1 will be greater than the distance parameter from points in all polygons.

When the initial inhibitory sample points have been generated a random subset are selected and each gets a close pair point, chose uniformly randomly within a given radius. Note that the close pair points have no bearing on the non-uniformity of the initial sample point selection, and are not subject themselves to the inhibitory process. This means that it is possible for two points in the output to be closer than the inhibitory distance if close-pair points are generated from their parent points in some circumstances.

Data Structures and Coordinates

This package uses sf class data for points and polygons, and terra rasters for gridded values. The return value from the main function is also an sf class data frame of points. The code returns points in the same coordinate system as the input polygon.

Distances are computed using the coordinate system of the data, so unprojected (lat-long) coordinates will typically use degrees as a distance measure, even though this does not produce true circles (except on the equator). The code will warn you about this. Since sampling methodology is rarely implemented on a global scale, you should probably transform lat-long coordinates to a local cartesian coordinate system (such as the appropriate UTM zone) before use.

Examples

Initial Data

First we’ll create a polygon to represent the study area. This would normally come from a spatial data source such as a GeoPackage or Shapefile, or be created by drawing a simple boundary on screen.

library(sf)

study_area = st_sfc(st_polygon(list(
    cbind(c(-2.79479557075698, -2.80125014547061, -2.80231050277473, 
            -2.80235124747829, -2.80233386828029, -2.80221829171468, -2.80164935839459, 
            -2.80114175901799, -2.78462082038709, -2.78008674567188, -2.77832720481475, 
            -2.7779840993325, -2.77540503667928, -2.77598222899305, -2.79479557075698), 
          c(54.023222117637, 54.0268647946785, 54.0286678871387, 54.0289066469941, 
            54.0308376726309, 54.0318977236945, 54.0319945415705, 54.0320201272935, 
            54.032365192172, 54.0319603024302, 54.0308731143842, 54.0304837495073, 
            54.0239123648346, 54.0232677630067, 54.023222117637)))),
    crs=4326
    )

plot(study_area, axes=TRUE, las=2)

This is in the common WGS84 lat-long coordinate system used as default in most global mapping and GPS applications, so we convert it to a local coordinate system for further work. In this case we use the EPSG:27700 system as this is the standard reference for data in Great Britain.

study_xy <- st_transform(study_area, 27700)

Uniform Inhibitory Proposal

The simplest sampling scheme is where the initial inhibitory points are sampled uniformly from the polygon and rejected if too close to previously sampled points. To do this we call icpSample with the default proposal argument (ie leave it out). In this example we want to generate 50 (n) survey points in total, with 10 (k) close pairs. The initial 40 points must be at least 100 metres from each other (delta), and the close pairs are somewhere within a 20 metre radius of their inhibitory point (zeta).

p1 = icpSample(n=50, k=10, delta=100, zeta=20, poly=study_xy)
plot(study_xy)
plot(p1, add=TRUE)

From this plot you should be able to pick out the ten close pairs.

Density-Proportional Inhibitory Proposal

This method generates the initial inhibitory process points from a non-uniform density defined by a raster grid over the polygon. This grid must have the same coordinate system as the polygon.

The close pairs are generated in a uniform disc about each inhibitory point, with no consideration of the density grid. However they will be generated inside the polygon by rejection sampling.

library(terra)
#> terra 1.7.78
box = st_bbox(study_xy)
density = rast(matrix(1:(100*100),100,100), extent=box[c(1,3,2,4)],crs="epsg:27700")
density = mask(density, vect(study_xy))
density
#> class       : SpatRaster 
#> dimensions  : 100, 100, 1  (nrow, ncol, nlyr)
#> resolution  : 17.59, 10.18429  (x, y)
#> extent      : 347539.3, 349298.3, 458864.2, 459882.6  (xmin, xmax, ymin, ymax)
#> coord. ref. : OSGB36 / British National Grid (EPSG:27700) 
#> source(s)   : memory
#> name        : lyr.1 
#> min value   :     4 
#> max value   :  9997
pts_density = icpSample(50, 10, 100, 20, poly=study_xy, rdensity=density, proposal="density")
plot(density)
plot(study_xy, add=TRUE)
plot(pts_density, add=TRUE)

Census Sub-sampling Inhibitory Proposal

For this method you first need a set of points that would typically be a complete census of locations in the area under investigation. The process then draws a point from this set sequentially, rejecting the point if it is nearer than the threshold to any previously selected points. Then the close pairs are added, independent of the census points.

For an example, we’ll generate a cluster of a number of points inside the bounding box of the study area polygon, and then clip it to the polygon so we only get points inside our study area.

box = st_bbox(study_xy)
ncensus = 800
xy = data.frame(rnorm(ncensus, (box["xmax"]+box["xmin"])/2, (box["xmax"]-box["xmin"])/6),
           rnorm(ncensus, (box["ymax"]+box["ymin"])/2, (box["ymax"]-box["ymin"])/6))
pxy = st_as_sf(xy, coords=1:2, crs=27700)
pxy = st_intersection(pxy, study_xy)
plot(study_xy)
plot(pxy, add=TRUE)

To do inhibitory sampling of 50 points with 20 close pairs we specify the censusxy parameter and choose the “census” proposal method.

pts_census = icpSample(50, 20, 100, 10, censusxy=pxy, poly=study_xy, proposal="census")
plot(study_xy)
plot(pts_census, add=TRUE)

Note that the close pair points are not generated with regards to the census locations, but are uniformly sampled from a disc centred on the selected inhibitory points. Again, rejection sampling is used to generate a point guaranteed to be inside the polygon.

Also notice that even though the census locations are highly non-uniform, the output sampling points look evenly spread because of the inhibitory nature of the initial point selection.

Sampling from Multiple Polygons

To illustrate this process, first we create a spatial data frame with three fairly simple polygons. Normally these would be read from a spatial data source but for simplicity here they will be created from the WKT text format with a bit of processing.

wkt = "MULTIPOLYGON (
     ((348868.0 459145.5, 349149.6 459250.2, 349203.7 458842.1, 348918.5 458936.0, 348868.0 459145.5)),
     ((348286.7 459290.0, 348405.8 459318.8, 348586.3 459300.8, 348795.8 459210.5, 348723.5 458824.1,
       348290.3 458853.0, 348286.7 459290.0)),
     ((347712.6 459477.8, 347792.0 459730.6, 348044.8 459932.8, 348351.7 459893.1, 348261.4 459730.6,
       348243.3 459571.6, 348283.1 459376.6, 347712.6 459477.8))
)"

wkt = gsub("\n","",wkt) # remove line breaks

three = st_cast(st_as_sf(data.frame(geometry=wkt),wkt="geometry", crs=27700),"POLYGON")

Now we want to sample 30 total points, with 6 sets of close pairs. This means 24 initial inhibitory points, to be split across the 3 polygons, and then the 6 close pair points are randomly assigned.

pts_multi = icpSampleM(30, 6, 75, 20, three)
plot(three)
plot(pts_multi, add=TRUE)

If the requested number of inhibitory points does not divide by the number of polygons, then a warning is shown alerting the user to the uneven split of points.

pts_split = icpSampleM(31, 6, 75, 20, three)
#> Warning in inhibSampleM(n - k, delta, poly): Irregular division of 25 points
#> into 3 polygons

Inhibitory Points Only

You can specify zero close pairs if you only want to get a set of inhibitory points out. You can then test that the minimum point distance is over the threshold specified by delta.

p1 = icpSample(n=50, k=0, delta=100, zeta=20, poly=study_xy)
plot(study_xy)
plot(p1, add=TRUE)

d = st_distance(p1)
range(d[lower.tri(d)])
#> Units: [m]
#> [1]  100.0612 1717.4205

Adding Close Pairs

If you have a set of points and want to add a number of close pairs, then use the add_close_pairs function. This returns the input points with a number (k) of additional close pair points randomly sampled in a disc of radius zeta within the given polygon. In this example three new points are added to the inhibitory points created in the previous section. The close pair points will always be the last ones in the returned data, so it is possible to subset those out for example to colour them differently.

p1close = add_close_pairs(p1, 3, poly = study_xy, zeta=20)
plot(study_xy)
plot(p1close[1:50,], add=TRUE)
plot(p1close[51:53,], add=TRUE, col="red")

Warnings

Time

Sequential sampling of inhibitory points can result in very long, or even infinite, times to complete. Some requested sampling schemes may even be impossible, for example requesting 100 points in a 100m square that are 30m apart - its just impossible to squeeze 100 points in with that minimum separation. The code does not check for possibility or have any timeout for long-running possible simulations.

If you want to run this code in an automated manner and make sure it completes, then you might want to wrap the call in R.utils::withTimout. Otherwise you need to be there to hit Control-C or ESC or click the button that interrupts R once you think it might have got stuck.

# This should easily complete in 100 seconds (takes ~1s really)
pts_split = try(
                R.utils::withTimeout(
                  {icpSampleM(30, 6, 75, 10, three)},
                   timeout=100,
                   onTimeout="warning")
                )
# This might *never* complete! Will timeout after 1s
pts_timeout = try(
                R.utils::withTimeout(
                  {icpSampleM(30, 6, 75, 10, three)},
                   timeout=1,
                   onTimeout="warning")
                )

Coordinates

As mentioned previously, the code works using the coordinate system of the polygon given. Hence all distances and thresholds must be given in the units of the coordinate system. For unprojected lat-long coordinates, this means decimal degrees. A warning is given if this is used, since it results in distortion of ground distances due to the ellipsoid of the earth.

The study_area object created earlier, and here is an attempt to create points from that polygon object. Note how small the distance values are, since they now have to be in decimal degrees, and not metres.

pts_ll = icpSample(n=10, k=1, delta=0.001, zeta=0.00025, poly=study_area)
#> Warning in icpSample(n = 10, k = 1, delta = 0.001, zeta = 0.00025, poly =
#> study_area): Polygon is in lat-long coordinates so distances must be in
#> degrees. Consider transforming to a local projected coordinate system
pts_ll
#> Simple feature collection with 10 features and 0 fields
#> Geometry type: POINT
#> Dimension:     XY
#> Bounding box:  xmin: -2.796445 ymin: 54.02484 xmax: -2.778025 ymax: 54.03109
#> Geodetic CRS:  WGS 84
#>                      geometry
#> 1  POINT (-2.778025 54.02838)
#> 2  POINT (-2.794239 54.02745)
#> 3  POINT (-2.796445 54.03018)
#> 4   POINT (-2.794823 54.0289)
#> 5  POINT (-2.791562 54.03109)
#> 6  POINT (-2.784327 54.02904)
#> 7  POINT (-2.796178 54.02566)
#> 8  POINT (-2.790323 54.02484)
#> 9  POINT (-2.790983 54.02992)
#> 10  POINT (-2.784498 54.0292)

Credits

Most of this code was originally written by Melodie Sammarro as part of her PhD supervised by Chris Jewell and Barry Rowlingson. The code was then cleaned up and made more consistent by Barry Rowlingson.