15

I have a dataset of customers in different cities of California, the time each customer was called, and the status of the call (True if the customer answered and False if not).

I have to find an appropriate calling time for future customers so that the probability of the call being answered is high. What is the best strategy for this problem? Should I treat it as a classification problem in which the hours (0, 1, 2, ..., 23) are the classes? Or should I treat it as a regression task in which time is a continuous variable? How can I make sure that the probability of the call being answered will be high?

Any help would be appreciated. It would also be great if you could point me to similar problems.

Below is a snapshot of the data.

Sean Owen
Hamid Mahdavian

3 Answers

7

You might run into problems if you model this as a regression problem without a suitable transformation. For instance, we know that most calls are probably answered during the daytime and fewer at night and in the early morning. A linear regression on the raw hour would struggle because the relationship is likely curvilinear, not linear. For the same reason, treating this as a classification task with logistic regression on the raw hour would also be problematic.

As suggested by other respondents, reclassifying your data into time periods will help, and I'd suggest you try something like a decision tree or random forest first.

All that said, this might be a case for simple descriptive statistics. If you plot the proportion of answered calls by time of day (split by city or any other demographic), is there a clear best time? If so, why complicate things with a model?
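As a rough illustration of the descriptive approach, here is a minimal sketch that computes the answer rate per hour using only the standard library. The records are made up for illustration, and the function name `answer_rate_by_hour` is hypothetical; the real data would supply the (hour, answered) pairs.

```python
from collections import defaultdict


def answer_rate_by_hour(calls):
    """Return {hour: (answer_rate, n_calls)} from (hour, answered) pairs."""
    answered = defaultdict(int)
    total = defaultdict(int)
    for hour, ok in calls:
        total[hour] += 1
        answered[hour] += bool(ok)
    return {h: (answered[h] / total[h], total[h]) for h in sorted(total)}


# Made-up toy records (hour of call, whether it was answered).
calls = [
    (9, False), (9, True),
    (13, True), (13, True), (13, False),
    (18, True), (18, True), (18, True), (18, False),
    (22, False),
]

rates = answer_rate_by_hour(calls)
# Hour with the highest observed answer rate in this toy sample.
best_hour = max(rates, key=lambda h: rates[h][0])
```

Keeping the call count next to each rate matters: an hour with very few calls can show an extreme rate by chance. To split by city, the same idea works with `(city, hour)` as the grouping key.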

HEITZ
  • 911
  • 4
  • 7
4

You could try the following:

  1. Divide up the day into various parts - early-morning, morning, noon, afternoon, evening, late evening, night, etc.
  2. Assign time boundaries to each part of the day, e.g. noon could be 12 pm to 1 pm.
  3. Create 3 new labels for the "part of the day to call the customer". For each positive case (status of call = True), assign the label of the part of the day in which the call was made (morning/noon/evening). These labels will be in one-hot encoded format, e.g. prefer_morning=0/1, prefer_noon=0/1, prefer_evening=0/1.
  4. Build 3 models, one per label, to predict whether the lead prefers the morning, noon, or evening for a call to be successful.
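The labelling in steps 1 to 3 could be sketched roughly as follows. The time boundaries in `PARTS` are assumptions chosen for illustration (only the 12 pm to 1 pm noon window comes from the list above), and the helper names `part_of_day` and `preference_labels` are hypothetical.

```python
# Hypothetical boundaries: start hour inclusive, end hour exclusive.
PARTS = {
    "morning": (6, 12),
    "noon": (12, 13),
    "evening": (17, 21),
}


def part_of_day(hour):
    """Map an hour (0-23) to a part-of-day name, or None if no window matches."""
    for name, (start, end) in PARTS.items():
        if start <= hour < end:
            return name
    return None


def preference_labels(hour, answered):
    """One-hot labels; all zeros when the call was not answered."""
    part = part_of_day(hour) if answered else None
    return {f"prefer_{name}": int(name == part) for name in PARTS}


labels = preference_labels(9, True)
```

Each of the three `prefer_*` columns would then serve as the target of its own binary classifier in step 4.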

Additionally, I recommend adding features such as occupation, gender, etc., since the features listed in the table (city, etc.) are too coarse and do not give much information to differentiate among customers.

EDITED as per suggestion in comments:

When using the model, each lead would get classified as prefer_morning=yes/no, prefer_noon=yes/no and prefer_evening=yes/no. Based on the time of day, for example in the morning, the call center agent (or software) could pick up and call leads classified in the morning preference set. When it's noon, the call software picks from the noon-preferred list, and so on.

Sandeep S. Sandhu
  • 2,625
  • 17
  • 20
2

I would use a logistic regression. Note that you're going to need samples where the customer did not pick up. Then I would treat the hour as a seasonal dummy regressor (23 hours as dummy variables, letting the remaining one be absorbed into the intercept).
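A minimal sketch of that dummy coding, assuming hour 0 is the omitted reference category (the answer does not say which hour to drop). The function name `hour_dummies` is hypothetical.

```python
def hour_dummies(hour, reference=0):
    """Return 23 indicator values for hours 0-23, skipping the reference hour."""
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    return [int(hour == h) for h in range(24) if h != reference]


row_5am = hour_dummies(5)       # a single 1, in the position for hour 5
row_midnight = hour_dummies(0)  # all zeros: the reference hour
```

These 23 columns, together with any other features, would form the design matrix for the logistic regression. The reference hour's effect sits in the intercept, and each coefficient is read relative to it.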

If you do not treat it as a seasonal dummy regressor, you are going to have to perform some sort of transformation, because the relationship isn't going to be linear.
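The answer does not name a specific transformation. One common option, shown here purely as an assumption, is a sine/cosine encoding that places the hours on a circle so that 23:00 and 00:00 end up close together. The function name `cyclical_hour` is hypothetical.

```python
import math


def cyclical_hour(hour):
    """Encode an hour (0-23) as a point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


def distance(a, b):
    """Euclidean distance between the encodings of two hours."""
    (s1, c1), (s2, c2) = cyclical_hour(a), cyclical_hour(b)
    return math.hypot(s1 - s2, c1 - c2)
```

With this encoding, the distance between hours 23 and 0 equals the distance between hours 0 and 1, which the raw hour value does not capture. A single sine/cosine pair can only represent one smooth daily cycle, so it is less flexible than the 23 dummies.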

Someone previously suggested substituting categories such as mid-afternoon for the hour. I think that is a bad idea: you have hourly detail, and you would be throwing it away. It would have an effect similar to using optimal binning to make the relationship linear, but I still don't think that would work. Try the seasonal dummy regressors.