
I have a dataset of 76 countries and 6 columns of distinct quantitative variables, where each value is the mean of that variable for the corresponding country:

[table of per-country means for the six variables: patience, risktaking, posrecip, negrecip, altruism, trust]

If I were to take a random sample of the 6 variables - an individual within one of the countries - how would I best go about predicting to which country that individual belongs?

I have a whole separate dataset, with thousands of data points, so I know which country the data belongs to and can thus know for certain whether the algorithm is predicting accurately.

Thus far I have been looking at decision trees and random forest options, but none of the example use cases I've seen translate very well to what I am trying to do. Perhaps I'm looking in the wrong place.

It's not predicting the behavior, as the behavior is already known...it's more about predicting to which country-classification the behavior belongs. Ideas? Comments?

free_road

3 Answers

3

Very interesting question indeed.

I will dare to give you an answer, but I may change it if I come up with better reasoning after looking at the data.

Your problem is really a multi-class classification problem, where the label is the country and the features are what you have as "patience", "risktaking", etc.

You could work with random forests or decision trees, but on the dataset containing all the individuals (not the one summarised by country).

Why on this dataset and not on the summarised one? Because the summarised dataset is summarised (!): it only contains the mean of "risktaking" (for example) per country and does not take into account the internal variance of risk-taking within each country.

It is important to take this into account, because the variance of risk-taking within a country may be so large that the variable becomes meaningless for assigning a nationality to a person, and this is invisible in the summarised dataset.

If it is mandatory that you run the model on the summarised dataset first, then the solution is this:

  • You should grow a tree whose node_size parameter is 1, because your summarised dataset has "one individual (country) per label"; this is an extreme case where you need a leaf node for every country.
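A minimal sketch of fitting a random forest on the individual-level data (synthetic stand-in data here, since the real dataset isn't shown; column and country names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in for the individual-level dataset: 3 countries, 6 behavioural variables,
# each country with its own mean but real within-country variance
cols = ['patience', 'risktaking', 'posrecip', 'negrecip', 'altruism', 'trust']
frames = []
for i, country in enumerate(['A', 'B', 'C']):
    f = pd.DataFrame(rng.normal(loc=i, scale=1.0, size=(200, 6)), columns=cols)
    f['country'] = country
    frames.append(f)
df = pd.concat(frames, ignore_index=True)

X_train, X_test, y_train, y_test = train_test_split(
    df[cols], df['country'], stratify=df['country'], random_state=0)

# Fit on individuals, not country means, so the within-country variance is learned
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

With a held-out test set like this you also get the accuracy check the question asks for, since the true country of each individual is known.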
2

Looking at the description, I would go for a 'distance based' approach. First, I would produce prototypes of the six variables per country (the summarised table may be an example, but there could be more than one prototype per country). Then, to assign a datum to a given country, I would follow a k-nearest-neighbours or similar approach. Nevertheless, producing the prototypes is not necessary to use k-nearest neighbours; it only helps to characterise each country. To make a classification based on prototypes, a supervised Self-Organising Map (SOM) may be a good approach, but a Radial Basis Function network (RBFN) may work as well. Probabilistic Neural Networks (PNNs) would be an option too, although they do not provide prototypes.
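A minimal sketch of the two variants above, using scikit-learn's NearestCentroid (one prototype per country, essentially classifying by the summarised means) next to a plain KNeighborsClassifier on the raw individuals; the data here is synthetic:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestCentroid

rng = np.random.default_rng(1)

# Synthetic individuals from 3 "countries" with different mean behaviour
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(150, 6)) for m in (0.0, 2.0, 4.0)])
y = np.repeat(['A', 'B', 'C'], 150)

# Prototype-based: one centroid per country, distances measured to the centroids
proto = NearestCentroid().fit(X, y)

# k-nearest neighbours: no prototypes needed, votes among the raw individuals
knn = KNeighborsClassifier(n_neighbors=15).fit(X, y)

sample = [[2.1, 1.8, 2.3, 2.0, 1.9, 2.2]]   # close to country 'B''s mean
print(proto.predict(sample), knn.predict(sample))
```

NearestCentroid is the simplest prototype classifier; a SOM or RBFN would learn several prototypes per country instead of a single centroid.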

1

The sklearn.svm.SVC model is a good choice to go with, though obviously there is more than one option: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC

I set the 'Target' values to the countries included in the dataset, then encoded them with LabelEncoder as integers (0-75).

After fitting the model, I was then able to pass 6 variables to clf.predict and get a predicted country. Clearly, this model was unable to make an accurate prediction first time around, but we'll work on it.

I'm interested to get more ideas on how to implement this question and to compare the different approaches and improve upon them. Thanks!

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = df[['patience','risktaking','posrecip','negrecip','altruism','trust']].values
y = df['target'].values                  # 1-D array, as LabelEncoder expects

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)       # country names -> integers 0..75

X_train, X_test, y_train, y_test = train_test_split(X, y)
clf = SVC(C=1.0, kernel='rbf').fit(X_train, y_train)

# predict() returns an array; take its first element and map it back to the country
pred = clf.predict([[2.2529752, 2.466159, 0.95989114, -1.5864599, 0.47468176, 0.9504338]])
country = label_encoder.inverse_transform(pred)[0]
free_road