14

I know someone who is working on a project that involves ingesting files of data without regard to the columns or data types. The task is to take a file with any number of columns and various data types and output summary statistics on the numerical data.

However, he is unsure of how to go about dynamically assigning data types for certain number-based data. For example:

    CITY
    Albuquerque
    Boston
    Chicago

This is obviously not numerical data and will be stored as text. However,

    ZIP
    80221
    60653
    25525

are not clearly marked as categorical. His software would assign the ZIP code as numerical and output summary statistics for it, which does not make sense for that sort of data.


A couple ideas we had were:

  1. If a column is all integers, label it as categorical. This clearly wouldn't work, but it was an idea.
  2. If a column has fewer than n unique values and is numeric, label it categorical. This might be closer, but there could still be issues with numerical data falling through.
  3. Maintain a list of common numeric data that should actually be categorical and compare the column headers to this list for matches. For example, anything with "ZIP" in it would be categorical.

My gut tells me that there is no way to accurately assign numeric data as categorical or numerical, but was hoping for a suggestion. Any insight you have is greatly appreciated.

Poisson Fish
  • 243
  • 3
  • 6

4 Answers4

10

I'm not aware of a foolproof way to do this. Here's one idea off the top of my head:

  1. Treat values as categorical by default.
  2. Check for various attributes of the data that would imply it is actually continuous. Weight these attributes based on how likely they are to correlate with continuous data. Here are some possible examples:
    • Values are integers: +.7
    • Values are floats: +.8
    • Values are normally distributed: +.3
    • Values contain a relatively small number of unique values: +.3
    • Values aren't all the same number of characters: +.1
    • Values don't contain leading zeros: +.1
  3. Treat any columns that sum to greater than 1 as being numerical. Adjust the factors and weights based on testing against different data sets to suit your needs. You could even build and train a separate machine learning algorithm just to do this.
jncraton
  • 578
  • 5
  • 9
4

If you have, for example, number of children of a family (which could range, for example, between 0 and 5), is it a categorical or numerical variable? Actually it depends on your problem and how you intend to solve it. In this sense, you can do the following:

  • Compute the number of unique values of that column
  • Divide this number by the total number of rows
  • If this ratio is bellow some threshold (for example, 20%), you consider it categorical.

In case of discrete values, one additional test could be: use a regression model to estimate some of the parameters and check if the estimated values are contained in the original set of values. If this is not true, you are probably dealing with categorical data (as it is the case of ZIP).

It worked relatively well for me in the past...

Bernardo Aflalo
  • 323
  • 2
  • 9
2

Since this question has been cross-posted, the initial comments by @nickcox on Cross Validated are highly relevant and true. My views are slightly different. For instance, I would rephrase the question, decomposing it into two parts: first, there is the issue of how one would go about classifying a stream of unknown information by data type and, second, what kind misclassification rate can be expected. Both questions are immediately answerable, if someone wants to take the time to do it.

The first issue amounts to developing a rules-based expert system along the lines of what has been preliminarily proposed in the comments and answers on these threads using samples of real-world data that are readily available.

Second, given that this real-world data contains known data types, benchmarking the classification error.

A third step would involve iteratively updating the rules to more accurately reflect the information that is being passed through -- to the extent that this is possible.

In my mind, while tedious, such a project wouldn't involve a prohibitive amount of work.

---- Additional Comment ----

A good discussion of feature extraction as it relates to the data generating process popped up on CV here:

https://stats.stackexchange.com/questions/191935/what-does-the-process-that-generates-the-data-mean-and-how-does-feature-selec

Feature extraction, selection and the DGP is an excellent point of departure for developing an expert system.

DJohnson
  • 141
  • 5
0

What if we use LLM to classify the column as Categorical or Numerical .. it will be able to understand based on the column name to which category it will fall