
This is using Python with Spark 1.6.1 and DataFrames.

I have timestamps in UTC that I want to convert to local time, but a given row could be in any of several timezones. I have an 'offset' value (or, alternately, the local timezone abbreviation). I can adjust all the timestamps to a single zone or with a single offset easily enough, but I can't figure out how to make the adjustment dependent on the 'offset' or 'tz' column.

There appear to be two main ways of adjusting a timestamp: using the 'INTERVAL' method, or using pyspark.sql.functions.from_utc_timestamp.

Here's an example:

data = [
  ("2015-01-01 23:59:59", "2015-01-02 00:01:02", 1, 300,"MST"),  
  ("2015-01-02 23:00:00", "2015-01-02 23:59:59", 2, 60, "EST"),  
  ("2015-01-02 22:59:58", "2015-01-02 23:59:59", 3, 120,"EST"),  
  ("2015-03-02 15:59:58", "2015-01-02 23:59:59", 4, 120,"PST"),  
  ("2015-03-16 15:15:58", "2015-01-02 23:59:59", 5, 120,"PST"),  
  ("2015-10-02 18:59:58", "2015-01-02 23:59:59", 4, 120,"PST"),  
  ("2015-11-16 18:58:58", "2015-01-02 23:59:59", 5, 120,"PST"),  
  ("2015-03-02 15:59:58", "2015-01-02 23:59:59", 4, 120,"MST"),  
  ("2015-03-16 15:15:58", "2015-01-02 23:59:59", 5, 120,"MST"),  
  ("2015-10-02 18:59:58", "2015-01-02 23:59:59", 4, 120,"MST"),  
  ("2015-11-16 18:58:58", "2015-01-02 23:59:59", 5, 120,"MST"),
  ...
]

(I realize the offset and tz columns aren't consistent - this isn't real data)

df = sqlCtx.createDataFrame(data, ["start_time", "end_time", "id","offset","tz"])  
from pyspark.sql import functions as F

These two options both do what is expected:

df.withColumn('testthis', F.from_utc_timestamp(df.start_time, "PST")).show()
df.withColumn('testThat', df.start_time.cast("timestamp") - F.expr("INTERVAL 50 MINUTES")).show()

But if I try to replace the "PST" string with df.tz, or the " 50 " string with df.offset.cast('string'), I get a type error:

TypeError: 'Column' object is not callable
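
For concreteness, this is roughly the shape of the failing attempts (a sketch, not my exact code):

# Passing the per-row 'tz' column instead of a literal string:
df.withColumn('testthis', F.from_utc_timestamp(df.start_time, df.tz)).show()

# Building the INTERVAL from the per-row 'offset' column:
df.withColumn('testThat',
              df.start_time.cast("timestamp")
              - F.expr("INTERVAL " + df.offset.cast('string') + " MINUTES")).show()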

I've tried variations on this, but to no avail.


3 Answers

You can use the SQL interface to get what you want:

> df.selectExpr("from_utc_timestamp(start_time, tz) as testthis").show()
+--------------------+
|            testthis|
+--------------------+
|2015-01-01 16:59:...|
|2015-01-02 18:00:...|
|2015-01-02 17:59:...|
|2015-03-02 07:59:...|
|2015-03-16 08:15:...|
|2015-10-02 11:59:...|
|2015-11-16 10:58:...|
|2015-03-02 08:59:...|
|2015-03-16 08:15:...|
|2015-10-02 11:59:...|
|2015-11-16 11:58:...|
+--------------------+
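
The same SQL-expression trick also works if you'd rather use the numeric offset column instead of the timezone abbreviation. For example, assuming offset is a number of minutes, something like this should do it:

# A sketch using the 'offset' column (assumed to be in minutes):
# shift the epoch seconds by offset * 60, then convert back to a timestamp string.
df.selectExpr(
    "from_unixtime(unix_timestamp(start_time) + offset * 60) as local_time"
).show()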

If you want to keep the same DataFrame and just add a new column with the converted timestamp, you can use expr together with withColumn:

from pyspark.sql.functions import expr

df = df.withColumn('localTimestamp',
                   expr("from_utc_timestamp(utcTimestamp, timezone)"))

Here utcTimestamp and timezone are columns in your DataFrame. This adds a new column, localTimestamp, containing the converted time.
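
Applied to the question's schema (the start_time and tz columns), that would look something like:

df = df.withColumn('localTimestamp', expr("from_utc_timestamp(start_time, tz)"))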


If the data type of your "tz" column is a string, then you can do:

from pyspark.sql.functions import from_utc_timestamp

df.select(from_utc_timestamp(df.start_time, df.tz).alias('start_time')).show()