Efficient Way to Calculate Timebased Followership

Question

Problem

A time-based followership is defined as follows: a person changes to a new job, and the new company already has an employee with whom that person used to work. In this case, the existing employee gets 1 follow.

Challenge

I wrote some simulation code, but its complexity is quadratic: it already takes a few minutes for about 10k people, and it is very hard to distribute and scale to a large population (an estimated 4 hours for 100k people).

What I Tried

Here is my implementation:

loop through every year
for each year, find who changed jobs
for each job change, find the new coworkers
for each new coworker, check whether the two used to work together
if so, add a point to the coworker as the "leader"

Input: the individual employment history (person, year, company), similar to LinkedIn.

year | person | company
2000 | p1     | c1
2000 | p2     | c2  
2000 | p3     | c2     p2, p3 coworking at c2
2010 | p1     | c3
2010 | p2     | c10 
2011 | p1     | c3 
2011 | p2     | c3     p2 followed p3 at c3  
...                    because p2 changed job from c10 (2010) to c3 (2011) 
...                    and p2 and p3 used to work together

Outcome: the final outcome is a list of people and how many followers each has as of the last year, for example:

person | year | follower 
p3     | 2010 | 0 
p2     | 2010 | 0 
p3     | 2011 | 1 <- p2 followed p3 
p2     | 2011 | 0

I added some optimizations for the lookups, for example:

a dictionary keyed by company/year to quickly find coworkers
a set of coworker pairs (once a coworker, always a coworker)
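
For reference, the loop and the two lookup structures described above could be sketched roughly as follows. This is not the asker's actual code, only a minimal illustration under stated assumptions: records are `(year, person, company)` tuples with at most one record per person per year, and an "existing employee" is taken to mean someone whose previous record was already at the same company. The function name `count_followers_baseline` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def count_followers_baseline(records):
    """Count follows per leader from (year, person, company) tuples.

    Assumption: one record per person per year.
    """
    # Dictionary keyed by year, then company, to find coworkers quickly.
    by_year = defaultdict(lambda: defaultdict(set))
    for year, person, company in records:
        by_year[year][company].add(person)

    last_company = {}   # person -> company in their most recent record
    coworked = set()    # frozenset({a, b}): once a coworker, always a coworker
    followers = defaultdict(int)

    for year in sorted(by_year):
        rosters = by_year[year]
        for company, people in rosters.items():
            # A mover has an earlier record at a different company.
            movers = [p for p in people
                      if last_company.get(p, company) != company]
            # Assumption: "existing" means the previous record was here too.
            existing = [p for p in people if last_company.get(p) == company]
            for mover in movers:
                for leader in existing:
                    if frozenset((mover, leader)) in coworked:
                        followers[leader] += 1

        # Update the lookups only after scoring this year.
        for company, people in rosters.items():
            for a, b in combinations(people, 2):
                coworked.add(frozenset((a, b)))  # quadratic in company size
            for p in people:
                last_company[p] = company

    return dict(followers)
```

With the sample rows above plus two assumed rows that place p3 at c3 in 2010 and 2011 (the original table elides them), this returns `{"p3": 1}`. The pair-building step is where the quadratic cost shows up: a company with k people in a year contributes k(k-1)/2 pairs.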

The code is implemented in Python with a lot of for loops. Bit manipulation and vectorization could help, but I don't think they would change the quadratic nature of the problem; the bottleneck would still exist.

I was thinking about a relational database (Postgres) or a graph database (Neo4j), but I cannot think of any immediate benefit.

Is there an algorithm that can handle this efficiently? My benchmarks suggest the current approach is O(n^2), so I am looking for something that is O(n) or faster.

score 1 · Accepted Answer · answered Dec 19 '22 at 01:38

Here is a pragmatic approach that is quadratic in principle but might be good enough in practice, depending on the patterns of employee movement from company to company.

For each employee, enumerate all pairs of two companies that this employee has worked for. Build up a dict that maps from a pair of companies to the list of employees who have worked for both companies.

Next, for each pair of employees who have worked at the same two companies, check whether they were at those two companies at overlapping times, and if so, increment the follower count of whichever one was already at the company that the other moved into.

My hope for efficiency is based on the following assumptions: (1) each person is only employed by a few companies during their career, (2) there aren't too many pairs of employees who have worked at the same pair of companies. If these assumptions are sufficiently wrong, this algorithm might be very inefficient.
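
The two steps above could be sketched in Python roughly as follows. This is one possible reading of the approach, not a definitive implementation. Assumptions: records are `(year, person, company)` tuples with at most one record per person per year; a "follow" needs the two people to share at least one year at a different company before the move; and the leader's previous record must already be at the new company. The function name `count_followers_by_company_pair` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def count_followers_by_company_pair(records):
    """Count follows per leader by bucketing people on shared company pairs."""
    years_at = defaultdict(lambda: defaultdict(set))  # person -> company -> years
    for year, person, company in records:
        years_at[person][company].add(year)

    # Step 1: map each pair of companies to everyone who worked at both.
    shared = defaultdict(list)
    for person, companies in years_at.items():
        for pair in combinations(sorted(companies), 2):
            shared[pair].append(person)

    # Per person: years they moved into a company vs. years they stayed put.
    arrivals = defaultdict(lambda: defaultdict(set))
    stays = defaultdict(lambda: defaultdict(set))
    for person, companies in years_at.items():
        timeline = sorted((y, c) for c, ys in companies.items() for y in ys)
        for (_, prev_c), (y, c) in zip(timeline, timeline[1:]):
            (stays if prev_c == c else arrivals)[person][c].add(y)

    # Step 2: only people sharing a company pair can be follower and leader.
    events = set()  # (follower, leader, company, year); a set avoids double counts
    for (c1, c2), people in shared.items():
        for a, b in combinations(people, 2):
            for old, new in ((c1, c2), (c2, c1)):
                for follower, leader in ((a, b), (b, a)):
                    together = years_at[follower][old] & years_at[leader][old]
                    for t in arrivals[follower][new]:
                        # Leader must already be at the new company in year t,
                        # and the shared time at the old company must precede t.
                        if t in stays[leader][new] and any(y < t for y in together):
                            events.add((follower, leader, new, t))

    followers = defaultdict(int)
    for _, leader, _, _ in events:
        followers[leader] += 1
    return dict(followers)
```

Because candidates come only from shared company pairs, a person who returns to a former employer and rejoins an old coworker there is not counted unless the two also overlapped at some other company. The cost is driven by the size of each bucket in `shared`, which is where assumptions (1) and (2) matter.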

At a technical level, this appears to be related to finding instances of $K_{2,2}$ as a subgraph of a large bipartite graph.
