Guest post: Clustering and predicting NYC taxi activity

October 27, 2014 Cathy O'Neil, mathbabe

This is a guest post by Deepak Subburam, a data scientist who works at Tessellate.

Greetings fellow Mathbabers! At Cathy’s invitation, I am writing here about NYCTaxi.info, a public service web app my co-founder and I have developed. It overlays estimated taxi activity on a Google map of the area around you, expressed as the expected number of passenger pickups and dropoffs in the current hour. We modeled these estimates from the recently released 2013 NYC taxi trips dataset comprising 173 million trips, the same dataset that Cathy’s post last week on deanonymization referenced. Our work will not help you stalk your favorite NYC celebrity, but it can guide your search for a taxi and maybe save you some commute time. My writeup below takes you through the four broad stages of our work: data extraction and cleaning, clustering, modeling, and visualization.

We extract three columns from the data: the longitude and latitude GPS coordinates of the passenger pickup or dropoff location, and the timestamp. We make no distinction between pickups and dropoffs, since both of these events imply an available taxicab at that location. The data was generally clean, with a very small fraction of a percent of coordinates looking bad, e.g. in the middle of the Hudson River. These coordinate errors get screened out by the clustering step that follows.

We cluster the pickup and dropoff locations into areas of high density, i.e. where many pickups and dropoffs happen, to determine where on the map it is worth making and displaying estimates of taxi activity. We rolled our own algorithm, a variation on heatmap generation, after finding existing clustering algorithms such as K-means unsuitable: we are seeking centroids of areas of high density rather than cluster membership per se. The figure below shows the cluster centers identified by our algorithm on a square-mile patch of Manhattan. The axes represent the longitude and latitude of the area; the small blue crosses are a random sample of pickups and dropoffs; and the red numbers mark the identified cluster centers, ranked in descending order of activity.

Taxi activity clusters
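
To make the idea of "centroids of high density" concrete, here is a minimal sketch of one heatmap-style approach: bin the points into a grid and keep the cells that are busier than all of their neighbors. This is an illustration only, not the algorithm used for NYCTaxi.info; the function name `find_density_peaks` and the parameters `cell_size` and `min_count` are assumptions made up for this example.

```python
from collections import Counter
from math import floor

def find_density_peaks(points, cell_size=0.001, min_count=3):
    """Return grid cells that are local maxima of point density.

    points    -- iterable of (longitude, latitude) pairs
    cell_size -- grid cell width in degrees (hypothetical default)
    min_count -- cells with fewer points are ignored, which also drops
                 isolated stray coordinates (hypothetical default)

    Returns a list of ((lon, lat) of cell center, count) pairs,
    sorted by descending count.
    """
    # Step 1: the "heatmap" is just a count of points per grid cell.
    counts = Counter(
        (floor(lon / cell_size), floor(lat / cell_size)) for lon, lat in points
    )

    # Step 2: keep cells that are strictly busier than all 8 neighbors.
    # Note that tied neighboring cells therefore produce no peak.
    peaks = []
    for (i, j), n in counts.items():
        if n < min_count:
            continue
        neighbors = [
            counts.get((i + di, j + dj), 0)
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)
        ]
        if all(n > m for m in neighbors):
            center = ((i + 0.5) * cell_size, (j + 0.5) * cell_size)
            peaks.append((center, n))

    # Step 3: rank the peaks in descending order of activity.
    peaks.sort(key=lambda peak: -peak[1])
    return peaks
```

A real implementation would probably smooth the grid before looking for maxima, so that one busy corner split across two cells is not missed. The sketch skips that step for brevity.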

We then model taxi activity at each cluster. We discretize time into hourly intervals: for each cluster, we sum all pickups and dropoffs that occur in each hour of 2013. So our datapoints are now triples of the form [<cluster>, <hour>, <activity>], with <hour> being some hour in 2013 and <activity> being the number of pickups and dropoffs that occurred in hour <hour> in cluster <cluster>. We then regress each <activity> against the <activity> values of neighboring clusters and neighboring times. This regression smooths the estimates across time and space, damping the effects of special events or weather in the prior year that don’t repeat this year. It required some tricky choices on arranging and aligning the various data elements; not technically difficult or maybe even interesting, but nevertheless likely the better part of an hour at a whiteboard to explain. In other words, typical data science. We then extrapolate these predictions to 2014 by mapping each hour in 2014 to the most similar hour in 2013. So we now have, for each cluster location and each hour in 2014, a prediction of the number of passenger pickups and dropoffs.
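
The bookkeeping in this step can be sketched in a few lines. The example below builds the [<cluster>, <hour>, <activity>] counts and maps a 2014 hour back to a 2013 hour. It leaves out the regression entirely, and the notion of "most similar hour" used here (same weekday and time of day, roughly 52 weeks earlier) is an assumption for illustration; the post does not spell out the actual similarity rule. The names `hourly_activity` and `matching_hour_2013` are likewise hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta

def hourly_activity(events):
    """Count pickups and dropoffs per (cluster, hour) pair.

    events -- iterable of (cluster_id, timestamp) pairs, one per pickup
              or dropoff, where timestamp is a datetime
    """
    counts = Counter()
    for cluster_id, ts in events:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(cluster_id, hour)] += 1
    return counts

def matching_hour_2013(hour_2014):
    """Map a 2014 hour to a 2013 hour with the same weekday and time of day.

    ASSUMPTION: "most similar" is taken to mean 52 weeks earlier. Stepping
    back in whole weeks preserves the weekday; the loop handles the last
    day of 2014, which would otherwise land on 2014-01-01.
    """
    mapped = hour_2014 - timedelta(weeks=52)
    while mapped.year > 2013:
        mapped -= timedelta(weeks=1)
    return mapped

# Usage: look up last year's count for the hour matching a 2014 timestamp.
events = [
    (1, datetime(2013, 1, 2, 9, 5)),
    (1, datetime(2013, 1, 2, 9, 50)),
    (2, datetime(2013, 1, 2, 9, 30)),
]
counts = hourly_activity(events)
estimate = counts[(1, matching_hour_2013(datetime(2014, 1, 1, 9)))]
```

Here `estimate` is the raw 2013 count, with none of the smoothing across neighboring clusters and hours that the regression provides.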

We display these predictions by overlaying them on a Google map at the corresponding cluster locations. We round <activity> to values like 20 or 30 to avoid giving users number dyslexia. We color the labels based on these values, using the black body radiation color temperatures for the color scale, as that is one of two color scales where the ordering of change is perceptually intuitive.
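
As a rough sketch of the display logic, the snippet below rounds an activity value to the nearest multiple of 10 and picks a label color from a black-to-red-to-yellow-to-white ramp. The ramp is a common piecewise-linear approximation of a black-body scale, not a physically computed color temperature and not necessarily the scale used on the site. The function names, the step of 10, and the `vmax` parameter are assumptions for this example.

```python
def round_activity(activity, step=10):
    """Round a non-negative activity value to the nearest multiple of step.

    Halves round up (25 -> 30), unlike Python's built-in round().
    """
    return int(activity / step + 0.5) * step

def heat_color(value, vmax):
    """Return a hex color on a black -> red -> yellow -> white ramp.

    value -- activity to color
    vmax  -- activity that maps to the top of the scale (hypothetical)
    """
    # Clamp to [0, 1], then let red, green and blue ramp up in turn.
    t = max(0.0, min(1.0, value / vmax))
    red = min(1.0, 3 * t)
    green = min(1.0, max(0.0, 3 * t - 1))
    blue = min(1.0, max(0.0, 3 * t - 2))
    return "#{:02x}{:02x}{:02x}".format(
        *(round(channel * 255) for channel in (red, green, blue))
    )
```

With these definitions, `round_activity(23)` gives 20 and `round_activity(27)` gives 30, while `heat_color` runs from `#000000` at zero activity to `#ffffff` at `vmax`.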

If you live in New York, we hope you find NYCTaxi.info useful. Regardless, we look forward to receiving any comments.

Comments (4)

Laurie Skelly (@laurieskelly)

October 27, 2014 at 9:57 am

Neat post! Definitely helps me get my brain warmed up on a Monday morning.

This map has some really useful information, but I have a question: Does it get us all the way to “wait times,” as advertised?
To estimate a wait time, I want to think about the probability that *I* will get a taxi at this location, but it seems like this map only tells me whether *someone* will.

As you mention, there are 3 kinds of data-generating taxi stops.
There are:
1. taxis that arrive empty and pick someone up
2. taxis that drop someone off and leave empty
but then there are also:
3. taxis that drop & pickup in the same place/time, (div this by 2 to correct over-counting?)

If I want to know where I can get a taxi quickly, I want to find a sweet spot where I can maximize the # of available taxis and minimize the competition I have to get a taxi. Is there a way to get a closer estimate of this from the above types of pickup and drop-off data?

Would it improve the utility of the map to combine these records in a way that rewards an area for # or % of [2] taxis that drop off and leave empty (available taxis!) and penalizes for # or % of [3] taxis that drop & pick (competition is waiting)?

- Min
  
  October 27, 2014 at 6:18 pm
  
  IIUC, this data is symmetrical between taxis and carloads of passengers. So taxi drivers could use the information to look for potential passengers. A person could also use the information to look for where other people are also. That being the case, if you are looking for a taxi, you are also likely to find other people who want to take a taxi.
  
  The data do not show people wanting to take a taxi who do not find one, nor do they show taxis looking for riders without finding them. I would think that there is an asymmetry here, because you can have several carloads of riders trying to take a specific taxi, but you can’t have several taxis trying to drop off a specific carload of people. So I think that the information is a better predictor of where people are than of where taxis are. If I were looking for a taxi I would also be more interested in dropoffs than pickups, because at a dropoff I know that some of the people I am predicted to find will **not** be looking for a taxi. 😉
  
abekohen

October 27, 2014 at 10:14 am

I guess I’m confused. At certain times of the day there will be 20-30 taxis rolling by empty, cruising for riders, along York Avenue off the FDR. These are NOT dropoffs or pickups, yet they mean instantly available cabs. I don’t see how this is represented.

At shift end time, a dropoff does NOT imply an available taxi, as taxis make their way to the 59th Street bridge back to their respective garages in Queens.

So help me out here and tell me how to use this data.

Deepak Subburam

October 27, 2014 at 12:12 pm

Hi Laurie and abekohen,

You’re both driving at limitations to the interpretation of the activity numbers on the map — the activity numbers are not the only determinants of waiting times to get a taxi. There may be competition for available taxis, so even though there may be high activity you could still end up waiting a long time (e.g. in line for a taxi at the airport). And as abekohen mentioned, low activity (few pickups and dropoffs) does not imply long wait times as there could be many empty cabs passing by.

Initially, we gave in to the temptation of converting activity to a waiting time estimate using the transformation:

<waiting minutes> = 60 / <activity>

but got quick feedback (from mathbabe no less) that the resulting estimate was too large at some locations (as in abekohen’s example). We could try to be more clever, looking at sources and sinks and the likely taxi routes between them, and/or use a more intelligent transformation like the one in Laurie’s comment, but this (just like the simpler transformation above) would likely require some empirical testing on the ground for calibration and confirmation. That would require resources and may simply not be worth it.

So we took the step of just presenting the activity estimates, and letting the users exercise their judgment, based on their knowledge of local conditions, to make the appropriate interpretation. So abekohen, treat this as a tool — an overlay of data that you didn’t have before — that you can use as an input in your taxi search. While the tool has limitations and is not super precise, there are a variety of use cases where it should be quite helpful. E.g. you are coming out of a party late at night somewhere unfamiliar and don’t see much traffic or taxis in the immediate vicinity.
