6

I have GPS data in the form

1.('latitude', 'longitude','Timestamp').
2.('latitude', 'longitude','Timestamp').
3.('latitude', 'longitude','Timestamp').

I am transforming this data into the form below:

'latitude_1', 'longitude_1', 'Timestamp_1', 'latitude_2', 'longitude_2', 'Timestamp_2', Timestamp_2 - Timestamp_1.
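For illustration, the transformation described above could be sketched as follows. This is only a sketch under stated assumptions: timestamps are numeric seconds, the fixes are already sorted by time, and the name `pair_consecutive` is hypothetical rather than taken from my actual code.

```python
from typing import List, Tuple

# Assumed layout of one GPS fix: (latitude, longitude, timestamp in seconds).
Fix = Tuple[float, float, float]


def pair_consecutive(fixes: List[Fix]) -> List[Tuple[float, ...]]:
    """Join each fix with the next one and append the elapsed time."""
    rows = []
    for (lat1, lon1, t1), (lat2, lon2, t2) in zip(fixes, fixes[1:]):
        # Last column is the regression label: Timestamp_2 - Timestamp_1.
        rows.append((lat1, lon1, t1, lat2, lon2, t2, t2 - t1))
    return rows


rows = pair_consecutive([(0.0, 0.0, 0.0), (0.0, 1.0, 60.0), (1.0, 1.0, 150.0)])
```

Three fixes yield two rows, one per consecutive pair.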

With this format I am training Spark's LinearRegressionWithSGD model, where the label is Timestamp_2 - Timestamp_1 and the features are latitude_1, longitude_1, latitude_2 and longitude_2.

But when I give it an origin (latitude and longitude) and a destination (latitude and longitude), the predictions are very bad.

Is this the right approach? If not, how can I build a model from this data to predict the estimated time of arrival (ETA)?

Marcus D
user825828

4 Answers

4

I suggest calculating the Haversine distance between the two points and fitting a linear regression to find the relation between that distance and the trip duration. Your regression will then be

$duration_t = timestamp_t - timestamp_{t-1} = \alpha + \beta*d(point_t,point_{t-1})$

where $d$ is the Haversine distance and $point_t$ is the lat/long pair at time $t$.

Note, however, that this assumes the user travels at a roughly constant speed. If half of your data was gathered while walking and half while driving, the relation between distance and duration is probably not linear.
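A minimal sketch of the regression above in plain Python (not Spark), assuming coordinates in decimal degrees and durations in seconds. The names `haversine_km` and `fit_line` and the mean Earth radius of 6371 km are assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def fit_line(xs, ys):
    """Ordinary least squares for y = alpha + beta * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def predict_duration(alpha, beta, lat1, lon1, lat2, lon2):
    """Predicted duration for a new origin/destination pair."""
    return alpha + beta * haversine_km(lat1, lon1, lat2, lon2)
```

Here `xs` would hold the Haversine distance of each training pair and `ys` the matching Timestamp_2 - Timestamp_1 values. With consistent units, $1/\beta$ is the average speed implied by the fit.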

Omri374
2

To predict timestamps from the two predictor variables, longitude and latitude, you want to train a multiple linear regression model of the form

$$Timestamp = \alpha + \beta_0 \cdot Longitude + \beta_1 \cdot Latitude.$$

Given the latitude-longitude pair of your destination, you can then compute the ETA.

Spark's LinearRegressionWithSGD model should be able to perform multiple linear regression out of the box, using Timestamp as label and latitude and longitude as features. There's no need to transform the data beforehand.
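As a rough illustration of the model form above, here is a plain-Python least-squares fit via the normal equations. It is not Spark's LinearRegressionWithSGD, which uses stochastic gradient descent instead; the function name `fit_plane` is hypothetical, and the sketch assumes the points are not all collinear.

```python
def fit_plane(lons, lats, timestamps):
    """Fit Timestamp = alpha + beta_0 * Longitude + beta_1 * Latitude."""
    rows = [[1.0, lon, lat] for lon, lat in zip(lons, lats)]
    # Normal equations: (X^T X) w = X^T y, a 3x3 linear system.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * t for r, t in zip(rows, timestamps)) for i in range(3)]
    # Gaussian elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 3):
                a[r][c] -= factor * a[col][c]
            b[r] -= factor * b[col]
    # Back substitution.
    w = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        w[i] = (b[i] - sum(a[i][j] * w[j] for j in range(i + 1, 3))) / a[i][i]
    alpha, beta_0, beta_1 = w
    return alpha, beta_0, beta_1


alpha, beta_0, beta_1 = fit_plane(
    [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 1.0, 1.0], [10.0, 12.0, 13.0, 15.0]
)
```

The toy data is generated from Timestamp = 10 + 2 * Longitude + 3 * Latitude, so the fit should recover those coefficients up to rounding.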

0

The problem is with how you are training the model. If you use Timestamp_1 and Timestamp_2 as training features, they carry 100% of the predictive power, and the algorithm will completely disregard the location features. If you want to make predictions based only on an origin and a destination, you need to train your model using only those features.

j.a.gartner
0

This might not be of much help, but I wanted to point out that you might have to control for direction, because the GPS coordinates will decrease or increase depending on the direction of travel.
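One possible way to encode direction, offered only as a sketch, is the initial compass bearing from origin to destination, which could be added as an extra feature. The function name `initial_bearing_deg` is made up for this example, and it assumes coordinates in decimal degrees.

```python
import math


def initial_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2.

    Returned in degrees clockwise from north, in the range [0, 360).
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb))
    return (math.degrees(math.atan2(y, x)) + 360.0) % 360.0
```

Because a bearing wraps around at 360 degrees, feeding its sine and cosine to a linear model is usually safer than feeding the raw angle.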