What I would like to do
I have a data frame with several grouping factors and some other data. I would like to group the rows according to those factors and flag or extract all rows which belong to groups with more than one member.
The problem/question
I was able to come up with a solution (see example below) but the solution is not practical due to an inefficiency of interaction(). Even though drop = TRUE the running time of interaction() increases dramatically when the number of levels increases. Ultimately I would like to process 10 - 20 factors with up to 50'000 levels on a data.frame with a few hundred thousand rows.
Questions: 1) What is the most efficient approach to this problem? ("Efficient" measured in this order by execution time, memory requirement and readability of code)
Question 2) What is wrong with interaction()?
The example
# number of rows
nobs <- 100000
# number of levels
nlvl <- 5000
#create two factors with a decent number of levels
fc1 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
fc2 <- factor(sample.int(nlvl, size = nobs, replace = TRUE))
#package in a data.frame together with some arbitrary data
wdf <- data.frame(fc1, fc2, vals = sample.int(2, size = nobs, replace = TRUE))
#just for information: number of unique combinations of factors, i.e. groups
ngroups <- nrow(unique(wdf[,1:2]))
print(ngroups)
#granular grouping, tt has nobs elements and ngroups levels
tt <- interaction(wdf[,1:2], drop = TRUE)
#grpidx contains for each row the corresponding group (i.e. level of tt)
#observe that length(grpidx) == nobs and max(grpidx) == ngroups
grpidx <- match(tt, levels(tt))
#split into list of groups (containing row indices)
grplst <- split(seq_along(grpidx), grpidx)
#flag groups with more than one member
flg_dup <- vapply(grplst, FUN = function(x)length(x)>1, FUN.VALUE = TRUE)
#collect all row indices of groups with more than one member
dupidx <- unlist(grplst[flg_dup])
#select the corresponding rows
nonunqdf <- cbind(grpidx[dupidx], wdf[dupidx,])
Timing of the line tt <- interaction(wdf[,1:2], drop = TRUE)
- nlvl == 500: 82 milliseconds
- nlvl == 5000: 28 seconds
- nlvl == 10000: 233 seconds