
There is a data frame with IoT data. The columns are device id (unique id), connected at (timestamp) and disconnected at (timestamp). I have to get the active users based on these timings. First, we need to extract the dates from the connection and disconnection timestamps and create a date column. Then we have to create another column, "active users", which shows the number of devices that were active on each date. A rough sketch of this idea follows the example below.

Example:

date          active users
20-08-2021     1   (determined from the connection and disconnection timestamps in the data frame)
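
For illustration, a minimal sketch of the same idea (expand each connection interval into per-day rows, then count distinct devices per day) using pandas; the column names and sample values below are assumptions, not real data:

import pandas as pd

# Hypothetical input frame with the columns described above
df = pd.DataFrame({
    'device_id': [1, 2],
    'connected_at': pd.to_datetime(['2021-08-20 09:00', '2021-08-21 10:00']),
    'disconnected_at': pd.to_datetime(['2021-08-20 17:00', '2021-08-22 08:00']),
})

# Expand each connection interval into one row per calendar date it covers
rows = [(r.device_id, d.date())
        for r in df.itertuples(index=False)
        for d in pd.date_range(r.connected_at.normalize(), r.disconnected_at.normalize())]
per_day = pd.DataFrame(rows, columns=['device_id', 'date'])

# Active users per date = number of distinct devices seen on that date
active_users = per_day.groupby('date')['device_id'].nunique().rename('active users')
print(active_users)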

1 Answer


I've created example data and a script for you. Providing example data and your own attempt is normally your job; people on Stack Overflow usually don't do this for you, and such a question could have been closed as lacking details or a script to debug. But since I'm in the mood...

Make something along these lines...

Example data:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [(101, '2020-01-01 13:01:01', '2020-01-01 14:01:01'),
     (101, '2020-01-01 15:01:01', '2020-01-01 16:01:01'),  # same user 2nd time
     (101, '2020-01-01 17:01:01', '2020-01-02 11:01:01'),  # same user 3rd time, but overnight
     (102, '2020-01-02 05:01:01', '2020-01-02 07:01:01'),
     (103, '2020-01-02 05:01:01', '2020-01-02 07:01:01'),
     (104,                  None, '2020-01-02 07:01:01'),  # No connected_at time
     (105, '2020-01-04 10:01:01',                  None)], # No disconnected_at time
    ['device_id', 'connected_at', 'disconnected_at'])
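
As a quick sanity check (not part of the script below), printing the inferred schema should show something like the following, with the timestamps still stored as plain strings:

df.printSchema()
# root
#  |-- device_id: long (nullable = true)
#  |-- connected_at: string (nullable = true)
#  |-- disconnected_at: string (nullable = true)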

Script:

connection_timeout_days = -1 # If negative, not applied
logged_days_if_no_start_date = 3


# Change timestamps to dates
df = df.withColumn('connected_at', F.to_date('connected_at')) \
       .withColumn('disconnected_at', F.to_date('disconnected_at'))

# If no connected_at, fall back to disconnected_at minus the default number of logged days
df = df.withColumn('connected_at', F.coalesce('connected_at',
                                   F.date_sub('disconnected_at', logged_days_if_no_start_date)))
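# (Illustration with the example data: device 104 has no connected_at, so with
#  logged_days_if_no_start_date = 3 its start becomes date_sub('2020-01-02', 3)
#  = 2019-12-30, which is why the result below begins at 2019-12-30.)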
# If no disconnected_at, fall back to today's date (capped by the timeout if one is set)
if connection_timeout_days < 0:
    df = df.withColumn('disconnected_at', F.coalesce('disconnected_at', F.current_date()))
else:
    df = df.withColumn(
        'disconnected_at',
        F.coalesce('disconnected_at',
                   F.least(F.current_date(), F.date_add('connected_at', connection_timeout_days))))
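# (Illustration: device 105 has no disconnected_at; with the default
#  connection_timeout_days = -1 its end date becomes current_date(), so it is
#  counted as active on every day from 2020-01-04 up to today, hence the
#  trailing rows of 1 in the result below.)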

# Create a separate df with dates for the whole period
min_date = df.agg(F.min('connected_at')).head()[0]
max_date = df.agg(F.max('disconnected_at')).head()[0]
df_dates = spark.range(1).select(
    F.explode(F.sequence(F.lit(min_date), F.lit(max_date))).alias('date')
)
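# (df_dates spans every calendar day between the earliest start and the latest
#  end, so days with no active device, e.g. 2020-01-03, still appear with a
#  count of 0 after the left join below.)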

# Transform original df - count distinct users per dates
df = (df
    .select('device_id',
            F.explode(F.sequence('connected_at', 'disconnected_at')).alias('date'))
    .groupBy('date')
    .agg(F.countDistinct('device_id').alias('device_distinct_count'))
)
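# (explode + sequence produces one row per device per active day, and
#  countDistinct makes sure device 101's three sessions on 2020-01-01 are
#  counted only once for that date.)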

# Join both dfs
df = df_dates.join(df, 'date', 'left') \
             .fillna(0).sort('date')

Result:

df.show()
# +----------+---------------------+
# |      date|device_distinct_count|
# +----------+---------------------+
# |2019-12-30|                    1|
# |2019-12-31|                    1|
# |2020-01-01|                    2|
# |2020-01-02|                    4|
# |2020-01-03|                    0|
# |2020-01-04|                    1|
# |2020-01-05|                    1|
# |2020-01-06|                    1|
# |2020-01-07|                    1|
# |2020-01-08|                    1|
# |2020-01-09|                    1|
# |2020-01-10|                    1|
# |2020-01-11|                    1|
# |2020-01-12|                    1|
# |2020-01-13|                    1|
# |2020-01-14|                    1|
# |2020-01-15|                    1|
# |2020-01-16|                    1|
# |2020-01-17|                    1|
# |2020-01-18|                    1|
# +----------+---------------------+
# only showing top 20 rows
  • Can we do the same on a week, month and year basis? – Emvundiley Jul 07 '22 at 09:36
  • Please create a separate question. Using this answer as a guideline, please create a [minimal reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). Please read that post; it will help a lot in the future. – ZygD Jul 07 '22 at 10:07
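
Regarding the week/month/year question in the comments above, one possible adaptation is sketched below. It assumes a hypothetical df_exploded holding the per-day rows built with explode(sequence(...)) in the answer, before the groupBy; it is a sketch only, not a tested solution.

from pyspark.sql import functions as F

# Truncate each active day to the start of its week, then count distinct
# devices per week; swap 'week' for 'month' or 'year' as needed.
df_weekly = (df_exploded
    .withColumn('week', F.date_trunc('week', 'date').cast('date'))
    .groupBy('week')
    .agg(F.countDistinct('device_id').alias('device_distinct_count'))
    .sort('week'))
df_weekly.show()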