Our task is to predict the response time of an ambulance dispatched to a location in New York City from attributes such as zip code, severity level, and incident dispatch area. Such a model estimates how long Emergency Medical Services (EMS) will take to reach a given location, and it could also reveal a correlation between response times and socioeconomic factors. Both results could help the city reallocate resources to improve response times in certain areas.
The dataset we are using is the EMS Incident Dispatch Data from the City of New York OpenData repository. This dataset has 32 attributes and 4.83 million examples.
We first filtered the full dataset on the valid incident response time indicator, removing every example that did not have a valid response time. We then selected 13 attributes that were important or potentially relevant to predicting an EMS response time. Many of these attributes were originally text, so to analyze them efficiently we converted all text labels to numbers, making every attribute numeric. For example, binary "yes" and "no" values were converted to 1 and 0.
Attribute | Description |
---|---|
Initial call type | Type of incident based on information gathered during call |
Initial severity level | Priority assigned to incident at the time of the call |
Held indicator | Indicates if a unit could not be assigned immediately |
Borough | County-level administrative divisions of NYC |
Atom | Smallest subset of a borough where incident was located |
Incident dispatch area | Dispatch area of the incident |
Zipcode | Zip code of the incident |
Police Precinct | Police precinct of the incident |
City Council District | City council district of the incident |
Community District | Community district of the incident |
Community School District | Larger location subset |
Congressional District | Congressional district of the incident |
Special Event Indicator | Indicates whether the incident was a special event (NYC Marathon, etc.) |
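
The preprocessing described above (filter on the validity indicator, then turn text labels into numbers) could be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the column names, the "Y"/"N" flag values, and the `preprocess` helper are all assumptions made for the example, and the three rows are toy data.

```python
# Toy rows with hypothetical column names and "Y"/"N" flags (assumptions for illustration).
rows = [
    {"valid_response_time": "Y", "held": "N", "borough": "BRONX", "response_seconds": "412"},
    {"valid_response_time": "N", "held": "N", "borough": "QUEENS", "response_seconds": ""},
    {"valid_response_time": "Y", "held": "Y", "borough": "QUEENS", "response_seconds": "655"},
]


def preprocess(rows, binary_cols, categorical_cols, valid_col, label_col):
    """Drop rows without a valid response time, then encode text values as numbers."""
    # Keep only rows flagged as having a valid response time.
    kept = [r for r in rows if r[valid_col] == "Y"]

    # One text-to-integer codebook per categorical column, built in sorted
    # order so the mapping is reproducible across runs.
    codebooks = {
        col: {value: code for code, value in enumerate(sorted({r[col] for r in kept}))}
        for col in categorical_cols
    }

    encoded = []
    for r in kept:
        # Binary flags become 1/0.
        out = {col: 1 if r[col] == "Y" else 0 for col in binary_cols}
        # Other text attributes become integer codes.
        out.update({col: codebooks[col][r[col]] for col in categorical_cols})
        # The response time is kept as a continuous numeric label.
        out[label_col] = float(r[label_col])
        encoded.append(out)
    return encoded, codebooks


encoded, codebooks = preprocess(
    rows,
    binary_cols=["held"],
    categorical_cols=["borough"],
    valid_col="valid_response_time",
    label_col="response_seconds",
)
```

Note that integer codes impose an arbitrary ordering on categories such as borough; that is harmless for tree-based models but can matter for linear ones.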
We were interested in how well we could predict a response time using continuous labels vs. discretized labels. Our original dataset had a specific response time for each example, which we used for our continuous label analysis. From these continuous labels we derived discretized EMS response time labels using two binning methods. The first method gives each bin the same number of examples, so the range of response times covered varies from bin to bin. The second method gives each bin an equal interval of response time, so the number of examples varies from bin to bin.
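
To make the difference between the two binning methods concrete, here is a small sketch in plain Python. The function names and the sample response times are hypothetical, chosen only for illustration; libraries such as pandas provide equivalent helpers, but the logic is simple enough to show directly.

```python
def equal_frequency_bins(times, n_bins):
    """Label each time so that every bin holds (nearly) the same number of examples."""
    # Rank the examples by response time, then split the ranking into n_bins chunks.
    # Note: tied values can land in different bins with this rank-based approach.
    order = sorted(range(len(times)), key=lambda i: times[i])
    labels = [0] * len(times)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(times)
    return labels


def equal_width_bins(times, n_bins):
    """Label each time so that every bin covers the same span of response time."""
    lo, hi = min(times), max(times)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0] * len(times)
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((t - lo) / width), n_bins - 1) for t in times]


# Made-up response times in seconds, with one slow outlier.
times = [60, 120, 180, 240, 300, 3600]
freq_labels = equal_frequency_bins(times, 3)
width_labels = equal_width_bins(times, 3)
```

With these sample values the equal-frequency labels put two examples in each bin, while the equal-width labels put five of the six examples in the first bin because the single outlier stretches the overall range.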
We were also interested in how models trained on smaller datasets performed in comparison to those trained on larger ones. This is useful if a city wants to analyze a smaller area or only has a small amount of data available. To measure the effect of dataset size, we trained and tested on up to 10,000 examples using 10-fold cross validation with Random Forest, Gaussian Naive Bayes, Support Vector Machine, Multi-Layer Perceptron, Linear Regression, and AdaBoost models, and plotted the corresponding learning curves using the scikit-learn Python package.
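
A learning curve with 10-fold cross validation can be produced with scikit-learn's `learning_curve` function. The sketch below is illustrative only: it uses 300 synthetic examples rather than the EMS data, a single Random Forest rather than all six models, and arbitrary hyperparameters, so the scores it produces say nothing about our results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the encoded attributes and binned labels (assumption).
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(300, 5))
y = (X[:, 0] + X[:, 1]) // 14  # three classes derived from two of the features

model = RandomForestClassifier(n_estimators=50, random_state=0)

# Train on 10%, 32.5%, ..., 100% of each training fold; score with 10-fold CV.
train_sizes, train_scores, test_scores = learning_curve(
    model,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=10,
    scoring="accuracy",
)

# Average over the 10 folds to get one point per training-set size.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
for n, tr, te in zip(train_sizes, mean_train, mean_test):
    print(f"{n:4d} examples: train accuracy {tr:.3f}, CV accuracy {te:.3f}")
```

Plotting `mean_train` and `mean_test` against `train_sizes` gives the familiar pair of curves; swapping in another estimator only requires changing the `model` line.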
The learning curves show that binning by fixed time intervals was a much more successful strategy than binning by a fixed number of examples per bin. We also found that the number of bins affected the accuracy of the models. Overall, fewer bins gave higher accuracy because the models had fewer classes to choose from, so the likelihood of selecting the correct bin was higher. Using fewer bins, however, is less informative about the response time because each bin spans a wider range of response times.
Training and 10-fold Cross Validation for 30 response time intervals:
Training and 10-fold Cross Validation for 50 response time intervals:
We successfully calculated the average EMS response time for each zip code. We could not, however, find reliable and consistent average household income data for every zip code. With the income data we did find, comparing the zip codes with the fastest average response times to those with the slowest, we could not draw any conclusions about the relationship between average income and response time. In the future, we could look into other social factors that may contribute to slower response times in certain areas.
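
The per-zip-code averaging is a simple group-and-mean operation. A minimal sketch, assuming the data is available as (zip code, response time in seconds) pairs; the helper name and the sample values are made up for illustration.

```python
from collections import defaultdict


def mean_response_by_zip(records):
    """Return the mean response time for each zip code."""
    # Accumulate a running [sum, count] pair per zip code.
    totals = defaultdict(lambda: [0.0, 0])
    for zipcode, seconds in records:
        totals[zipcode][0] += seconds
        totals[zipcode][1] += 1
    return {zipcode: total / count for zipcode, (total, count) in totals.items()}


# Made-up sample values, not taken from the EMS dataset.
records = [("10001", 400.0), ("10001", 500.0), ("11368", 700.0)]
means = mean_response_by_zip(records)

# Sort from fastest to slowest to compare the two ends of the ranking.
ranked = sorted(means.items(), key=lambda item: item[1])
```

The `ranked` list can then be joined against any per-zip-code statistic, such as household income, to compare the fastest and slowest areas.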