I have recently been bitten by dplyr::filter silently removing a large number of rows containing NAs from my tibble. I have mostly worked on complete data sets but am now venturing into messier data where I want to make comparisons, so I want a function with the same capabilities as filter but without removing NAs. There are existing answers, such as Why does dplyr's filter drop NA values from a factor variable? and How to filter data without losing NA rows using dplyr, but their workarounds are cumbersome when dealing with lots of missing values and many comparisons. Below is an example of some ways to get around it.
This is sample data, with NAs in both columns A and B:
library(dplyr)

df = tibble(A = rep(c(1, 2, 3, NA, NA), 10000),
            B = rep(c(NA, 1, 2, 3, 4), 10000))
This is intuitively what I want to do: return rows where A does not equal B. However, it drops all the NAs (as expected):
df %>% filter(A != B)
1st solution: One way around this is to use %in% from base R, but the comparison has to be done row by row (and then ungrouped), which slows the process down considerably. It does give the right result, keeping rows with an NA in either A or B:
df %>% rowwise() %>% filter(!A %in% B) %>% ungroup()
2nd solution: The other option that has previously been suggested is to use | with is.na() to also keep rows where A or B is NA:
df %>% filter(A != B | is.na(A) | is.na(B))
Now if you are doing multiple filters and comparisons, this becomes tiresome and you are likely to slip up somewhere. So is it possible to create a function that has the is.na() handling built in? Maybe something like this:
filter_keepna = function(data, expression){
  data %>% filter(expression | is.na(column1) | is.na(column2))
}
I do not have enough knowledge of tidy evaluation to get something like this to work, but judging from all the comments across various platforms, it is something many people need.
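One possible sketch, assuming dplyr's embracing operator {{ }} for passing the condition through (the function name filter_keepna and this approach are my own suggestion, not an established dplyr feature): a comparison like A != B evaluates to NA whenever either side is NA, so applying is.na() to the condition itself keeps every missing-value row without having to name the columns involved.

library(dplyr)

# Hypothetical helper: keep rows where the condition is TRUE *or* NA.
# {{ condition }} splices the unevaluated expression into filter(), so it
# can refer to columns of `data` just like a normal filter() call.
filter_keepna = function(data, condition) {
  data %>% filter({{ condition }} | is.na({{ condition }}))
}

df = tibble(A = rep(c(1, 2, 3, NA, NA), 10000),
            B = rep(c(NA, 1, 2, 3, 4), 10000))

df %>% filter_keepna(A != B)

Because the NA handling is derived from the condition rather than hard-coded column names, the same function should work for any comparison, e.g. filter_keepna(df, A > 2).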