0

I am trying to handle data coming from software that has as a terrible format for a time duration: [days-]hours:minutes:seconds[.microseconds]. In case someone else has already traveled this path, I'm fighting with the Elapsed and field from SLURM's sacct output. (I'm also fighting with ReqMem but that's a problem for another day)

For example, one row might read 02:42:05 meaning 2 hours, 42 minutes, 5 seconds. Another row might read 6-02:42:05 which means the same, plus 6 days. Finally, on occasion, the seconds value has a microseconds value following it delimited by a decimal point, for example 6-02:42:05.745 meaning the same as the prior, plus 745 microseconds. Note that both the day and microsecond fields (and their delimiters) are optional and thus inconsistently present, and I can't just treat seconds as a float.

I need to replace this value with an integer number of seconds, or something suitably equivalent.

I have managed to muddle my way to a solution utilizing apply() and a python function, but I'm aware that this essentially breaks most of the benefit of using Polars? It's faster than the original pandas implementation it seems, but I would love to get as much as I can out of this work. The DataFrame geometry on this dataset is something like 109 columns and over 1 million rows, before filtering.

Here's my working but terrible code:

# this is going to be called hundreds of thousands of times, so yea, compiling it is probably helpful
elapsed_time_re = re.compile(r'([0-9]+-)?([0-9]{2}):([0-9]{2}):([0-9.]{2})(\.[0-9]+)?')

def get_elapsed_seconds(data): match = elapsed_time_re.match(data) if match is None: return data groups = match.groups('0') days = int(groups[0].strip('-')) hours = int(groups[1]) minutes = int(groups[2]) seconds = int(groups[3]) microseconds = int(groups[4].strip('.')) if microseconds > 0: seconds = seconds + round(microseconds / 1e6) return seconds + (minutes * 60) + (hours * 3600) + (days * 86400)

df is a polars.LazyFrame

df = df.with_columns(pl.col('Elapsed').apply(get_elapsed_seconds))

I have a thought on how to proceed, but I can't just find my way there:

  1. using expression conditionals, concatenate the string literal '0-' to the front of the existing value if it doesn't contain a '-' already. Problem: I can only find how to concatenate dataframes or series data, no matter how I phrase the search, and not how to concatenate a literal string to a column (of dtype str) existing value
  2. parse this new string with strptime(). Problem 1: chrono::format::strftime has no format specifier for microseconds (only nanoseconds), but this part of the timestamp is not useful to me and could be dropped - but how? Problem 2: that'll give me a Datetime, but I don't know how to go from that to a Duration. I think if I create a second Datetime object from 0000-00-00 00:00:00 or similar and perform an addition between the two, I'd get a Duration object of the correct time?

For some context: I'm just getting started with Polars. I have almost no prior experience with writing Pandas, and can read it only with constantly looking things up. As such, examples/explanations using Pandas (or comparisons to) won't save me.

I am aware that you can perform some amount of logic with Polars expressions, but it remains opaque to me. One roadblock is that the lambda syntax most examples seem to include is very difficult for me to parse, and even once past that I'm not understanding how one would branch within such expressions.

draeath
  • 123
  • 1
  • 7

1 Answers1

2

I am not too familiar with polars myself, but have you tried using the str.extract method? You can use this to extract the days/hours/etc. from the input on the whole column without using apply. I tested it using the three examples you give and it gives the expected result, and based on 100.000 rows it is roughly four times faster than using apply, with the relative speed likely improving more as you are increasing the number of rows.

import polars as pl

df = pl.DataFrame({ "elapsed": ["02:42:05", "6-02:42:05", "6-02:42:05.745"] })

( df # extract days/hours/minutes/seconds/microseconds and cast to ints .with_columns([ pl.col("elapsed").str.extract(r"(\d+)-", 1).cast(pl.UInt32).alias("days"), pl.col("elapsed").str.extract(r"(\d{2}):\d{2}:\d{2}", 1).cast(pl.UInt32).alias("hours"), pl.col("elapsed").str.extract(r"\d{2}:(\d{2}):\d{2}", 1).cast(pl.UInt32).alias("minutes"), pl.col("elapsed").str.extract(r"\d{2}:\d{2}:(\d{2})", 1).cast(pl.UInt32).alias("seconds"), pl.col("elapsed").str.extract(r".(\d+)", 1).cast(pl.UInt32).alias("microseconds"), ]) # calculate the number of seconds elapsed .with_columns([ ( pl.col("seconds").fill_null(0) + (pl.col("microseconds").fill_null(0) / 1e6) + (pl.col("minutes").fill_null(0) * 60) + (pl.col("hours").fill_null(0) * 3600) + (pl.col("days").fill_null(0) * 86400) ).alias("result") ]) )

Which gives the following dataframe:

elapsed days hours minutes seconds microseconds result
02:42:05 0 2 42 5 0 9725
6-02:42:05 6 2 42 5 0 528125
6-02:42:05.745 6 2 42 5 745 528125
Oxbowerce
  • 8,522
  • 2
  • 10
  • 26