**General comments**
- The `ave` approach is the only one here that preserves the data's initial row ordering.
- The `by` approach should be very slow. I suspect that data.table and dplyr are not much faster than `ave` and `tapply` (yet) at selecting groups. Benchmarks to prove me wrong welcome!
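The question's `df` is not reproduced here, so the snippets below assume something like this hypothetical reconstruction: the A0001 and A0020 rows match the output shown further down, while the singleton A0011 row is invented so the filters have something to drop.

```r
# Hypothetical stand-in for the question's df. The A0001/A0020 rows are
# taken from the output below; the single A0011 row is invented so that
# group-size filtering actually removes something.
df <- data.frame(
  ID   = c("A0001", "A0001", "A0011", "A0020", "A0020"),
  Cat1 = c(358, 279, 400, 367, 339),
  Cat2 = c(11.2500, 14.6875, 10.0000, 8.8750, 9.6250),
  Cat3 = c(37428, 38605, 36000, 37797, 39324),
  Cat4 = c(0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)
```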
**base R**

(Thanks to @thelatemail for both of the first two approaches.)
1) Each row is assigned the length of its `df$ID` group, and we filter based on the vector of lengths.
df[ ave(1:nrow(df), df$ID, FUN=length) > 1 , ]
2) Alternately, we split row names or numbers by `df$ID`, selecting which groups' rows to keep. `tapply` returns a list of groups of rows, so we must `unlist` them into a single vector of rows.
df[ unlist(tapply(1:nrow(df), df$ID, function(x) if (length(x) > 1) x)) , ]
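To see what these two expressions compute internally, here is a sketch on a hypothetical ID vector (invented for illustration):

```r
id <- c("A0001", "A0001", "A0011", "A0020", "A0020")  # hypothetical IDs

# ave() returns one group size per row, in the original row order:
ave(seq_along(id), id, FUN = length)
# -> 2 2 1 2 2

# tapply() returns one element per group; size-1 groups give NULL,
# which unlist() silently drops:
unlist(tapply(seq_along(id), id, function(x) if (length(x) > 1) x))
# -> the kept row numbers 1, 2, 4, 5 (with group-derived names)
```

So the `ave` version filters with a logical vector aligned to the rows, while the `tapply` version collects the row numbers to keep.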
What follows is a worse approach, but better parallels what you see with data.table and dplyr:
3) The data is split by `df$ID`, keeping each subset of data, `SD`, if it has more than one row. `by` returns a list, so we must `rbind` the kept subsets back together.
do.call(rbind, c(list(make.row.names = FALSE),
        by(df, df$ID, FUN = function(SD) if (nrow(SD) > 1) SD)))
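This is also a way to see the reordering mentioned in the general comments: on hypothetical data with interleaved IDs, `by()` returns the groups in sorted group order, not input order.

```r
# Hypothetical data with interleaved IDs, to show the reordering:
dat <- data.frame(ID = c("B", "A", "B", "A"), x = 1:4)

res <- do.call(rbind, c(list(make.row.names = FALSE),
               by(dat, dat$ID, FUN = function(SD) if (nrow(SD) > 1) SD)))
res$x
# group "A" (original rows 2 and 4) now comes before group "B" (rows 1 and 3)
```

The `ave`-based filter on the same data would return the rows in their original order 1, 2, 3, 4.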
**data.table**

`.N` corresponds to `nrow` within a `by=ID` group, and `.SD` is the subset of data for that group.
library(data.table)
setDT(df)[, if (.N>1) .SD, by=ID]
# ID Cat1 Cat2 Cat3 Cat4
# 1: A0001 358 11.2500 37428 0
# 2: A0001 279 14.6875 38605 0
# 3: A0020 367 8.8750 37797 0
# 4: A0020 339 9.6250 39324 0
**dplyr**

`n()` corresponds to `nrow` within a `group_by(ID)` group.
library(dplyr)
df %>% group_by(ID) %>% filter( n() > 1 )
# Source: local data frame [4 x 5]
# Groups: ID
#
# ID Cat1 Cat2 Cat3 Cat4
# 1 A0001 358 11.2500 37428 0
# 2 A0001 279 14.6875 38605 0
# 3 A0020 367 8.8750 37797 0
# 4 A0020 339 9.6250 39324 0
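One thing to watch: `filter()` keeps the `group_by(ID)` grouping on its result, which can affect later verbs like `mutate` or `summarise`; add `ungroup()` when you want a plain result back. A minimal sketch with hypothetical data:

```r
library(dplyr)

# Hypothetical two-column data; only the A0001 group has more than one row.
dat <- data.frame(ID = c("A0001", "A0001", "A0011"), Cat1 = c(358, 279, 400))

res <- dat %>% group_by(ID) %>% filter(n() > 1) %>% ungroup()
res
# only the two A0001 rows remain, with the grouping removed
```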