1

I have only just recently been bitten by dplyr::filter removing a large number of NAs from my tibble when filtering. I have mostly worked on complete data sets but am now venturing into messier data where I want to make comparisons. Therefore I want to create a function with the same capabilities as filter but without removing NAs. Here are some suggestions: Why does dplyr's filter drop NA values from a factor variable? or How to filter data without losing NA rows using dplyr however they are cumbersome solutions when dealing with lots of missing data values and many comparisons. Below is an example of some ways to get around it.

This is sample data, with missing NAs in both columns A and B

df = tibble(A = rep(c(1,2,3,NA,NA),10000),
            B = rep(c(NA,1,2,3,4),10000))

This is intuitively what I want to do. Return values where A does not equal B, however it drops all the NAs (as expected).

df %>% filter(A != B)

1st solution: A solution to fix this problem is to use the %in% from base R, but you need to do this row by row and then ungroup, so it slows the process down. But gives the right result by keeping NAs when they appear in either A or B.

df %>% rowwise() %>%  filter(!A %in% B) %>% ungroup()

2nd solution: The other option that has previously been suggested is using | to return A and B if they are NA.

df %>% filter(A != B|is.na(A)|is.na(B))

Now if you are doing multiple filtering and comparisons, this becomes tiresome and you are likely to stuff up somewhere! Therefore is it possible to create a function that automatically has is.na() keep inbuilt. Maybe something like this.

    filter_keepna = function(data, expression){
data %>% filter(expression|is.na(column1)|is.na(column2)
}

I do not have enough knowlege to get something like this to work. But I am assuming from all the comments across various platforms that it is something that is required.

JFG123
  • 577
  • 5
  • 13
  • More suggestions here: https://stackoverflow.com/questions/46378437/how-to-filter-data-without-losing-na-rows-using-dplyr – jtr13 May 03 '21 at 14:03

2 Answers2

1

In your function you can use the functions for tidy evaluation from rlang package. The enquo(), f_lhs() and quo_get_expr() functions can help to extract variables from the expression. Also you need the bang bang operator (!!) to interpret the quosures. On your example, is:

filter_keepna <- function(data, expre){
  expre <- enquo(expre) #Quotation 

  data %>% 
    filter(!!expre | #!! is a tidy evaluator

             # get quoted left variable from expre
             is.na(!!f_lhs(quo_get_expr(expre))) |

             # get quoted right variable from expre
             is.na(!!f_rhs(quo_get_expr(expre))))
}

Using the filter_keepna() function in your example data:

df = tibble(A = rep(c(1,2,3,NA,NA),10000),
            B = rep(c(NA,1,2,3,4),10000))

filter_keepna(df, A != B)
# A tibble: 40,000 x 2
#        A     B
#     <dbl> <dbl>
#  1     2     1
#  2     3     2
#  3    NA     3
#  4    NA     4
#  5     2     1
#  6     3     2
#  7    NA     3
#  8    NA     4
#  9     2     1
# 10     3     2
# # ... with 39,990 more rows

Detailed information in the quotation reference and Quosure getters reference of rlang package. .........

JFG123
  • 577
  • 5
  • 13
gavg712
  • 300
  • 1
  • 10
0

Try coalesce

df %>% filter(coalesce(A != B, TRUE))
Hugh
  • 15,521
  • 12
  • 57
  • 100
  • Not suitable for filtering, because `TRUE` includes the complementary of condition `A != B`. You can test with `all_equal(df %>% filter(coalesce(A != B, TRUE)), df)` – gavg712 Feb 13 '19 at 15:36
  • 1
    This seems to be what OP wants though. `all_equal(df %>% filter(coalesce(A != B, TRUE)), df)` is not always true; it just happens to be true for this particular case. – Hugh Feb 14 '19 at 03:14
  • I like the flexibility of using coalesce, it means that I can combine multiple filtering options within the one filter() call, thus giving added flexibility. @gavg712 is eexactly the solution for my original question, this coalesce is just more conveinient. – JFG123 Feb 15 '19 at 04:51