
I am trying to figure out how the amount of money that a customer wants to withdraw at an ATM can tell us whether the transaction is fraudulent or not. There are other attributes, of course, but for now I would like to hear your views on the withdrawal amount alone.

Data may be of this form:

Let us assume that a customer, for ten consecutive transactions, withdrew the following amounts:

100.33, 384, 458, 77.90, 456, 213.55, 500, 500, 300, 304.

Questions:

  1. How can we use this data to tell if the next transaction done on this account is fraudulent or not?

  2. Are there specific algorithms that can be used for this classification?

What I was thinking:

I was thinking of calculating the average amount of money, say over the last ten transactions, and checking how far the next transaction amount is from that average. Too much deviation would signal an anomaly. But this does not sound like much, does it?
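To make that concrete, here is a minimal sketch of the idea on the ten amounts above (the cutoff of 3 standard deviations is an arbitrary choice):

    import numpy as np

    # The ten withdrawals listed above.
    recent = np.array([100.33, 384, 458, 77.90, 456, 213.55, 500, 500, 300, 304])

    def looks_anomalous(amount, history, z_threshold=3.0):
        """Flag a withdrawal whose z-score against recent history exceeds the threshold."""
        mean, std = history.mean(), history.std(ddof=1)
        if std == 0:
            return amount != mean
        return abs(amount - mean) / std > z_threshold

    print(looks_anomalous(550, recent))   # False: within the usual range
    print(looks_anomalous(5000, recent))  # True: far from the recent average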

CN1002

5 Answers


What's the underlying model of how much someone requests from an ATM? It doesn't seem like it's a simple distribution like a Gaussian, where comparing new amounts to the mean is sensible. Consider a person who always pulls out either $40 or $400. Ideally we want to build a distribution of what normal transactions from a user look like, and notice if new datapoints don't look like they're sampled from that distribution.
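As a rough sketch of that idea, one could fit a kernel density estimate to the user's past amounts and check how plausible a new amount is under it; the $40/$400 toy history and the bandwidth below are just assumptions for illustration:

    import numpy as np
    from sklearn.neighbors import KernelDensity

    # Toy history for the $40-or-$400 user described above.
    past = np.array([40, 40, 400, 40, 400, 40, 40, 400, 40, 400], dtype=float).reshape(-1, 1)

    kde = KernelDensity(kernel="gaussian", bandwidth=25.0).fit(past)

    def log_density(amount):
        """Log-likelihood of a single withdrawal amount under the fitted distribution."""
        return kde.score_samples(np.array([[amount]]))[0]

    # $40 and $400 score much higher than an amount sitting between the two
    # usual values, even though the latter is closer to the user's mean.
    for amount in (40, 400, 220):
        print(amount, round(log_density(amount), 2))

On real data you would refit this per user as new, confirmed-legitimate withdrawals arrive.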

idclark's suggestion, to look at the nearest n datapoints from that user and compute the distance from just them, is a good and fast implementation of that sort of test.

One other possibility is to try to find similar users, and then aggregate data across users. If I only have 10 withdrawals from each user, I'm not going to be able to reject any new withdrawals with confidence, but if I have seven clusters of users, with a thousand withdrawals per cluster, I can notice when a user who was in a particular cluster deviates from the overall cluster distribution. (This also helps you make use of knowledge about which previous transactions were fraudulent.)
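A hedged sketch of that cluster-level view, using made-up per-user summary statistics; the features, the number of clusters and the 4-sigma band are all assumptions to tune on real data:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)

    # Hypothetical data: one row per user, holding the (mean, std) of their past withdrawals.
    user_stats = np.vstack([
        rng.normal([60, 15], 5, size=(300, 2)),     # small-amount users
        rng.normal([300, 80], 20, size=(300, 2)),   # mid-range users
        rng.normal([900, 200], 50, size=(300, 2)),  # large-amount users
    ])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(user_stats)

    def cluster_range(user_mean, user_std, n_sigma=4.0):
        """Crude acceptable amount range taken from the centroid of the user's cluster."""
        centroid = kmeans.cluster_centers_[kmeans.predict([[user_mean, user_std]])[0]]
        return centroid[0] - n_sigma * centroid[1], centroid[0] + n_sigma * centroid[1]

    # A withdrawal far outside this range would be worth flagging for review.
    print(cluster_range(65, 12))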

Matthew Gray

I was thinking of calculating the average amount of money, say over the last ten transactions, and checking how far the next transaction amount is from that average. Too much deviation would signal an anomaly. But this does not sound like much, does it?

That is a typical outlier detection approach, and it would work in most cases. But as the problem statement deals with fraud detection, the technique/algorithm/implementation should be more robust.

You might want to have a look at the Mahalanobis Distance metric for this type of outlier detection.
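For illustration, a small sketch of Mahalanobis-distance scoring; the second feature (hour of day) and the chi-square cutoff are assumptions added so the distance has something multivariate to work with:

    import numpy as np
    from scipy.spatial.distance import mahalanobis
    from scipy.stats import chi2

    # Hypothetical past transactions: (amount, hour of day).
    X = np.array([
        [100.33, 9], [384, 18], [458, 12], [77.90, 10], [456, 19],
        [213.55, 11], [500, 17], [500, 20], [300, 13], [304, 16],
    ])

    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

    def is_outlier(x, alpha=0.01):
        """Flag points whose squared distance exceeds the chi-square quantile."""
        d_squared = mahalanobis(x, mu, cov_inv) ** 2
        return d_squared > chi2.ppf(1 - alpha, df=X.shape[1])

    print(is_outlier([450, 17]))   # False: close to past behaviour
    print(is_outlier([5000, 3]))   # True: huge amount at an unusual hour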

As for algorithms for fraud detection, I would point you to the standards used in the industry (I have no hands-on experience with this myself, but felt these resources would be useful to you).

Check my answer for this question. It contains the popular approaches and algorithms used in the domain of fraud detection. The Genetic Algorithm is the most popular amongst them.

Dawny33

I was thinking of calculating the average amount of money, say over the last ten transactions, and checking how far the next transaction amount is from that average.

This sounds like a good start. I'd look into Local Outlier Probabilities. For a given data point, you could calculate the distance to its n nearest neighbors and figure out whether the point under consideration is an outlier.

A basic overview can be found here. I'd also consider the source, destination, volume, and frequency of transactions as features.
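Local Outlier Probabilities itself is not in scikit-learn, so as a stand-in here is a sketch with the closely related Local Outlier Factor in novelty mode; the neighbour count and the single amount feature are assumptions:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    # The past withdrawals from the question, as a single "amount" feature.
    history = np.array([100.33, 384, 458, 77.90, 456, 213.55, 500, 500, 300, 304]).reshape(-1, 1)

    # novelty=True lets us score new, unseen transactions after fitting on history.
    lof = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(history)

    # predict() returns +1 for inliers and -1 for outliers.
    print(lof.predict([[450], [5000]]))  # 450 looks normal (+1), 5000 does not (-1)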

idclark

This isn't actually answering your question, but it is an idea of how you can improve it. In my opinion, you will not be able to build a classification model with only this data, and even if you do, it will not have high enough accuracy. In your position, I would start looking for more data to use as features.

Here are a few examples:

  1. The ATM code of each withdrawal. People mostly use the same few ATMs in their daily routine. If you know the latitude and longitude of their previous ATMs, you can check whether a new one is far away and, combined with the other features, this will increase your accuracy.
  2. Seconds spent at the ATM for each withdrawal. People tend to follow specific patterns when they withdraw money. If all of their previous withdrawals took a similar amount of time and then you see a much lower or higher time on a data point, that will help increase the accuracy of the model.
  3. Labeled data. In problems like this, it is far better to use supervised algorithms instead of unsupervised ones. Thus, I would look for labeled examples of fraudulent use. This will also let you calculate the actual accuracy of your model.
  4. Time between two consecutive withdrawals. As I said before, people tend to follow patterns. An "anomaly" here, with a withdrawal coming sooner than expected, will also raise your accuracy (a small pandas sketch of deriving this and the ATM-distance feature follows this list).
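A hedged pandas sketch of deriving features 1 and 4; the column names, the made-up rows and the haversine helper are hypothetical, not taken from the question:

    import numpy as np
    import pandas as pd

    # Three made-up transactions for one account.
    tx = pd.DataFrame({
        "timestamp": pd.to_datetime(["2016-05-01 09:10", "2016-05-02 09:05", "2016-05-02 23:40"]),
        "atm_lat":   [51.507, 51.509, 48.857],
        "atm_lon":   [-0.128, -0.126, 2.352],
        "amount":    [100.33, 384.0, 458.0],
    })

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance in kilometres between two points."""
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * np.arcsin(np.sqrt(a))

    # Feature 1: distance from the user's usual (median) ATM location.
    usual_lat, usual_lon = tx["atm_lat"].median(), tx["atm_lon"].median()
    tx["km_from_usual_atm"] = haversine_km(tx["atm_lat"], tx["atm_lon"], usual_lat, usual_lon)

    # Feature 4: hours since the previous withdrawal on the same account.
    tx["hours_since_last"] = tx["timestamp"].diff().dt.total_seconds() / 3600

    print(tx[["km_from_usual_atm", "hours_since_last"]])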

As far as the algorithms are concerned, I am not keen on choosing one just because it is popular. If you have done all the data munging and feature selection, you have done 90% of the job, and the algorithm you choose is 2-3 lines of code (if you are using a language like Python). What I usually do is try all the plausible algorithms and evaluate their accuracy, then either use a combination of them or the one with the highest accuracy.
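As an illustration of that try-several-and-compare approach, here is a sketch that cross-validates a few scikit-learn classifiers on synthetic stand-in data; it scores with ROC AUC rather than raw accuracy, because fraud labels are usually heavily imbalanced:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Stand-in for real labelled transactions (X = features, y = fraud / not fraud).
    X, y = make_classification(n_samples=1000, n_features=8, weights=[0.95], random_state=0)

    models = {
        "logistic regression": LogisticRegression(max_iter=1000),
        "decision tree": DecisionTreeClassifier(random_state=0),
        "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }

    # With heavily imbalanced fraud data, raw accuracy is misleading; ROC AUC is safer.
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{name}: mean ROC AUC = {scores.mean():.3f}")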

Tasos

Firstly you should probably be creating models of classes/segments of users (unsupervised clustering). Otherwise it is difficult to predict what a given user will do. (More on that further below.)

Next, I think "deviation from recent transactions" is also fundamentally flawed. Most likely there are time patterns (time of day, day of the week, working hours, holidays and so on). To understand how to conceptualise time as useful features, see this excellent answer on Machine learning - features engineering from date/time data. Similarly, there are amount patterns, partly having to do with practical reasons (e.g. by withdrawing 38, one can receive 20, 10, 5 and 1 denominations, although this is not possible in some markets, like the USA).
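For instance, a small sketch of turning timestamps into model-friendly features; the cyclic hour/day-of-week encoding is one common choice, not something prescribed by that answer:

    import numpy as np
    import pandas as pd

    # A few made-up transaction timestamps.
    ts = pd.Series(pd.to_datetime(["2016-05-02 09:10", "2016-05-06 23:45", "2016-05-08 14:30"]))

    features = pd.DataFrame({
        # Cyclic encoding so that 23:00 and 01:00 end up close together.
        "hour_sin": np.sin(2 * np.pi * ts.dt.hour / 24),
        "hour_cos": np.cos(2 * np.pi * ts.dt.hour / 24),
        "dow_sin":  np.sin(2 * np.pi * ts.dt.dayofweek / 7),
        "dow_cos":  np.cos(2 * np.pi * ts.dt.dayofweek / 7),
        "is_weekend": (ts.dt.dayofweek >= 5).astype(int),
    })
    print(features)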

Modelling the user is more complicated. You will likely not have enough data on each individual user, but you can build a number of user profile models. (Too few, and the system will make similar predictions for all users, without nuance - e.g. > $400 always detected as fraud. Too many, and there will be sparsity, overfitting, and generally the same problems as having no profile models at all, i.e. one model per actual user - e.g. fraud incorrectly detected every time a given user goes to a new ATM.) This is basically unsupervised clustering. (Search for user profile categorisation, user models, user model clustering.)

Much depends on the data available to you. Perhaps you can be more specific about the scale and scope. In any case, I wish you luck - banks/Visa do this very poorly right now.