
While there are many similar topics (here and here, for example), I have a slightly different end goal than other questions I've seen on SO. For reference, I am using R 3.1.0.

I have two matrices. Each contains coordinates for points. The first (A) contains 2,107,377 points, and the second (B) contains 26,577 points.

I want to find the point in B that each point in A is closest to. That is, I want to calculate the distance between point 1 in A and each point in B (26,577 distances), and store the minimum. I want to do this for every point in A (2,107,377 minima). The goal is to group the points in A together based on the point in B each is closest to. Thus, some points in B will not be assigned at all, while others will be assigned to many points in A.
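To make the goal concrete, here is a tiny base-R sketch of the desired output, using plain Euclidean distance for simplicity (the real data would need great-circle distances via `spDistsN1`):

```r
# Three points in A, two points in B
A <- rbind(c(0, 0), c(5, 5), c(9, 9))
B <- rbind(c(1, 1), c(8, 8))

# Distance matrix between A (rows) and B (columns)
d <- as.matrix(dist(rbind(A, B)))[seq_len(nrow(A)), nrow(A) + seq_len(nrow(B))]

# For each point in A, the index of its nearest point in B
nearest <- apply(d, 1, which.min)
unname(nearest)  # 1 2 2
```

With the real data, `d` would be far too large to materialize in one piece, which is exactly the problem below.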

I have tried:

test = which.min(sapply(1:nrow(B), function(i) 
            spDistsN1(matrix(A, ncol = 2), matrix(B[i, ], ncol = 2), 
                      longlat = TRUE)))

but ran into a memory allocation problem (could not allocate a >16 Mb vector). That is no surprise: the full distance matrix would hold 2,107,377 × 26,577 ≈ 56 billion entries, roughly 450 GB as doubles.

I'm running a for loop now:

minimum <- integer(nrow(A))
for (i in 1:nrow(A)) {
    minimum[i] <- which.min(spDistsN1(matrix(B, ncol = 2), matrix(A[i, ], ncol = 2), 
                                      longlat = TRUE))  
}

But this, I expect, will produce the same result, just much more slowly.

I figured before I tried a totally different approach (perhaps learning the raster package), I would see if anyone has any ideas.

Nigel Stackhouse

1 Answer


Try breaking down your data into smaller chunks so you don't overload your memory. A reproducible example would be helpful, but I think this gets the job done:

library(sp)
# X1 is a small example and X2 is a large example
X1 <- cbind(pointX = 1:109, pointY = 1:109)
Y1 <- cbind(x = 11:20, y = 11:20)

X2 <- cbind(pointX  = 2e4 + sample(2e6), pointY  = 2e4 + sample(2e6))
Y2 <- cbind(x = sample(2e4), y = sample(2e4))

nearWrapper <- function(X, Y, nBatches = 10) {
    n <- nrow(X)
    # assign each row of X to one of nBatches roughly equal chunks
    batchIds <- split(seq_len(n), cut(seq_len(n), nBatches, labels = FALSE))
    out <- integer(n)
    for (batch in batchIds) {
        # build the distance matrix for this chunk only, then take row-wise minima
        out[batch] <- apply(spDists(X[batch, , drop = FALSE], Y), 1, which.min)
    }
    out
}
smallOut <- nearWrapper(X1, Y1)
largeOut <- nearWrapper(X2, Y2)

If this is still too slow with your data, you might also check out parallel computing (using a foreach loop in place of the for loop in your case).
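Since the batches are independent, the chunked approach parallelizes naturally. A minimal sketch using base R's parallel package (Euclidean distance here so the example is self-contained; swap the distance step for `sp::spDists(..., longlat = TRUE)` on geographic coordinates; `nearestParallel` is a hypothetical name):

```r
library(parallel)

nearestParallel <- function(A, B, nBatches = 4, cores = 2) {
  # split the rows of A into nBatches contiguous chunks
  grp <- cut(seq_len(nrow(A)), nBatches, labels = FALSE)
  batches <- split(seq_len(nrow(A)), grp)
  # one squared-distance matrix per chunk, never the full 2.1M x 26.6K matrix
  # NOTE: mclapply forks; on Windows use mc.cores = 1 or parLapply instead
  res <- mclapply(batches, function(idx) {
    d <- outer(rowSums(A[idx, , drop = FALSE]^2), rowSums(B^2), "+") -
         2 * A[idx, , drop = FALSE] %*% t(B)
    max.col(-d)  # index of the nearest B point for each row of the chunk
  }, mc.cores = cores)
  unlist(res, use.names = FALSE)
}

set.seed(1)
A <- cbind(runif(100), runif(100))
B <- cbind(runif(10), runif(10))
idx <- nearestParallel(A, B)
```

`max.col(-d)` is a vectorized row-wise `which.min` on the squared distances, which avoids an inner `apply` over each chunk.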

Richard Erickson