12

I have data sets that contain, among many features, GPS coordinates (latitude and longitude). I'd like to use these data sets to explore problems such as: (1) computing ETA to drive between start and end points; and (2) estimating the amount of crime for a specific point.

I'd like to use a linear regression model. However, can I use these GPS coordinates directly in a linear model?

Latitude and longitude do not have an ordinal property the way a person's age does. For example, the two points (40.805996, -96.681473) and (41.226682, -95.986587) do not seem to have any meaningful ordering. They are just points in space. I was thinking of replacing them with categorical US zip codes and then one-hot encoding those, but that would result in a lot of variables.

3 Answers

7

You cannot use them directly, as it is unlikely there is a true linear relationship unless you're looking to predict "how far east or north" someone is. As mentioned in the comments, you need to convert them into zones. If you wanted to keep it really simple, you could use a k-means clustering algorithm with a low number of clusters, assign each instance a new feature holding its cluster ID, and then one-hot encode that feature.
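As a rough illustration of the clustering idea, here is a minimal sketch using only the Python standard library. The function names `kmeans` and `one_hot`, the deterministic initialisation, and every sample point except the two from the question are assumptions made for the example. In practice you would more likely reach for a library implementation such as scikit-learn's `KMeans`.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on (lat, lon) pairs; returns (centroids, labels).

    Assumption: Euclidean distance on raw degrees is good enough for a
    small region. It is only an approximation of real ground distance.
    """
    # Deterministic initialisation: pick k points spread evenly through the list.
    step = len(points) / k
    centroids = [points[int(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point gets the index of its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

def one_hot(labels, k):
    """Turn cluster IDs into 0/1 indicator columns, one column per cluster."""
    return [[1 if lab == c else 0 for c in range(k)] for lab in labels]

# The first and fourth points come from the question; the rest are made up.
points = [
    (40.805996, -96.681473),
    (40.81, -96.70),
    (40.79, -96.66),
    (41.226682, -95.986587),
    (41.25, -95.95),
    (41.21, -96.00),
]
centroids, labels = kmeans(points, k=2)
features = one_hot(labels, k=2)
print(labels)
print(features)
```

The indicator columns in `features` can then replace the raw latitude and longitude in the linear model. The number of clusters is a tuning choice: too few and the zones are too coarse to matter, too many and you are back to the "lot of variables" problem.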

You may also want to read about how people interpolate coordinates to predict values across a whole map. The first example is with temperature stations, but you can also imagine it being "hot zones" for crime.

(DOCS)
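To make the interpolation idea concrete, here is a small sketch of inverse distance weighting, one of the simplest spatial interpolation schemes. It is not necessarily the method described in the linked material. The function name `idw`, the `power` parameter, and the sample readings are all assumptions for illustration.

```python
from math import dist

def idw(query, samples, power=2):
    """Inverse-distance-weighted estimate at `query`.

    `samples` is a list of ((lat, lon), value) pairs, e.g. station readings
    or crime counts. Closer samples get larger weights.
    Assumption: plain Euclidean distance on degrees, fine for a sketch only.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for location, value in samples:
        d = dist(query, location)
        if d == 0:
            # The query sits exactly on a sample, so return its value.
            return value
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total

# Hypothetical readings at two locations.
samples = [((0.0, 0.0), 10.0), ((0.0, 2.0), 20.0)]
print(idw((0.0, 1.0), samples))  # halfway between the two samples
print(idw((0.0, 0.5), samples))  # closer to the first sample
```

An interpolated value like this could be added to each row as a "local intensity" feature, which a linear model can use far more easily than the raw coordinates.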

Stephen Rauch
CalZ
2

You could do whatever your heart desires, but unless your model predicts something like temperature or time difference, I cannot think of any target variable that depends solely on the coordinates.

What you probably want to do is use an external data source to enrich your data with country, zip code, climate, or other geographic features that will help your model perform.

GregA
1

GPS coordinates can be directly converted to a geohash. Geohash divides the Earth into "buckets" whose size depends on the number of characters in the code: short geohash codes cover big areas and longer codes cover smaller areas.

A geohash is a single value (usually written as a short base-32 string) that can be used as a feature in a model.

Geohash applies to the entire world; zip codes do not.
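For readers who want to see how the bucketing works, here is a minimal sketch of the standard geohash encoding written from scratch: bits alternate between longitude and latitude, each bit halves the remaining range, and every five bits become one base-32 character. The function name `geohash_encode` and its `precision` parameter are assumptions for this example, and existing geohash packages would normally be preferable to hand-rolled code.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0          # accumulates the current 5-bit group
    bit_count = 0
    use_lon = True    # geohash starts with a longitude bit
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid   # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

# The two points from the question, at a coarse and a finer precision.
a = (40.805996, -96.681473)
b = (41.226682, -95.986587)
print(geohash_encode(*a, precision=3), geohash_encode(*b, precision=3))
print(geohash_encode(*a, precision=6), geohash_encode(*b, precision=6))
```

Because nearby points tend to share a prefix, truncating the code gives you coarser zones for free. For a linear model the resulting code is best treated as a categorical feature and one-hot encoded, with the precision chosen to keep the number of distinct buckets manageable.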

Brian Spiering