2

I have a df data.frame that consists of 8 years of daily values.

date <- rep(as.Date(seq(as.Date("2001-05-01"),
                    as.Date("2008-04-30"), by= 1), format="%Y-%m-%d"), 3)

site <- c(rep("Site_1", 2557), rep("Site_2", 2557), rep("Site_3", 2557))

value <- c(as.numeric(sample(90:271, 2557, replace=T)),
           as.numeric(sample(125:340, 2557, replace=T)),
           as.numeric(sample(70:173, 2557, replace=T)))

df <- data.frame(date, site, value)

In this case, each year starts in May and ends in April.

I want to get the mean and sd for value for each year at the 3 sites.

I did the following

df1 <- df %>%
  dplyr::mutate(year = ifelse(date < "2002-05-01", "2001-2002",
                              ifelse(date < "2003-05-01", "2002-2003",
                                     ifelse(date < "2004-05-01", "2003-2004",
                                            ifelse(date < "2005-05-01", "2004-2005",
                                                   ifelse(date < "2006-05-01", "2005-2006",
                                                          ifelse(date < "2007-05-01", "2006-2007",
                                                                 ifelse(date < "2008-05-01", "2007-2008", NA )))))))) %>%
  dplyr::select(site, year, value) %>%
  dplyr::group_by(site, year) %>%
  dplyr::summarise_each(funs(
    mean(.),
    sd(.)
  ))

It gave me what I wanted. However, it is time taking if I have data for 30-50 years. Also, if each new data.frame has a different start month, I need to modify ifelse() each time to assign the year ID to be able to group by year and do different calculations.

Is there any straightforward way to assign yearID if the start month is any month other than January?

shiny
  • 3,380
  • 9
  • 42
  • 79

2 Answers2

6

What about

library(dplyr)
df %>% 
  group_by(year=cut(date, seq(as.Date("2001-05-01"), as.Date("2008-05-01"), "1 year"), include.lowest = TRUE), site) %>%
  summarise(sd = sd(value), mean = mean(value)) 
# Source: local data frame [21 x 4]
# Groups: year [?]
# 
#          year   site       sd     mean
#        (fctr) (fctr)    (dbl)    (dbl)
# 1  2001-05-01 Site_1 51.82622 182.5890
# 2  2001-05-01 Site_2 63.33385 241.1260
# 3  2001-05-01 Site_3 30.04042 120.1233
# 4  2002-05-01 Site_1 51.66325 182.6658
# 5  2002-05-01 Site_2 62.87470 236.4192
# 6  2002-05-01 Site_3 28.54769 122.2329
# 7  2003-05-01 Site_1 50.97588 179.0874
# 8  2003-05-01 Site_2 63.48810 227.1230
# 9  2003-05-01 Site_3 30.87933 120.4918
# 10 2004-05-01 Site_1 53.19898 176.5589
# ..        ...    ...      ...      ...
lukeA
  • 53,097
  • 5
  • 97
  • 100
6

Using package lubridate you could first add the column year like so:

library(lubridate) 
df$year <- ifelse(month(ymd(df$date)) < 5, 
                  paste(year(ymd(df$date))-1, year(ymd(df$date)), sep="-"),
                  paste(year(ymd(df$date)), year(ymd(df$date))+1, sep="-"))



df %>% dplyr::select(site, year, value) %>%
    dplyr::group_by(site, year) %>%
    dplyr::summarise_each(funs(
      mean(.),
      sd(.)
    ))

Source: local data frame [6 x 4]
Groups: site [1]

    site      year     mean       sd
  (fctr)     (chr)    (dbl)    (dbl)
1 Site_1 2001-2002 178.2055 54.58277
2 Site_1 2002-2003 176.9342 49.64435
3 Site_1 2003-2004 177.4153 52.20447
4 Site_1 2004-2005 179.5370 52.77848
5 Site_1 2005-2006 180.3671 51.41292
6 Site_1 2006-2007 179.3616 53.02291
erc
  • 10,113
  • 11
  • 57
  • 88