Welcome to the new NFL Arrest blog, I have been wanting to document some of the technical aspects of the website for a while now. This post is going to focus on the database that drives http://NFLArrest.com and the API. Since the summer of 2015 my database has expanded not only in number of arrests, but also in the form of new data sources. Originally, I began only with a list of arrests from http://usatoday.com. Since then I have been manually recording new arrests as they have happened. This process is not completely automated yet, in the future I plan to have a post detailing that process.
Over the last 2 years I have added data from a few external sources:
- NFL Season data (Season Start, End & Playoff dates, superbowl outcomes & teams)
- NFL team data (Team Name, city, state, mascot, & prefered name)
- Team Details (Stadium name, capacity, location, coordinates, first team season, current coach, team colors and conference/division)
- NFL position data (Offense, Defense or Special Teams as well as full names related to abbreviations)
- Generalized Crime Categories to group the arrests for aggregate statistics
Take a look below at the diagram of my database structure.
The 6 boxes across the top represent my source tables, these are manually managed records that are later used to compile the reporting views & materialized tables. The table 'arrest_stats' is the table that holds the individual arrest records and gets updated most often. The remaining source tables are lookup tables for values stored as keys in arrest stats.
The grey and green boxes are views that relate the data using joins and/or derive new fields using SQL logic. For example the flow along the left side of the image determines if each arrest occurred during the season or offseason. ArrestsDateView is the main reporting view that contains the most details, it also is used the most heavily by the API.
Splitting my logic into multiple queries that are stored as views & stored procedures allows for easy modification that cascades to related reports. This probably isn't the best method for performance, but I like having separated logic similar to Object Oriented Programming.
View Full Size Image.
Performance & Caching
The amount of data and derived reports has been evolving over time as I add more organized information. For a long time the NFL Arrest API was running directly from the ArrestsDateView, which meant the server had to compile the dataset every time someone used the API to request data by executing the queries in the view. About 6 months ago I realized I could improve response time by creating stored procedures (red rounded boxes) to cache the data in a materialized table. These processes run once a day, or after an arrest is added. This means that I am using one static table rather than doing computationally expensive joins and calculations.
I am glad I was utilizing caching when recently I saw a huge spike in traffic, it has resulted in decreased CPU load, decreased response times and the ability to support more concurrent users. However, from September 24th to 26th at the peak of the site being shared, I was seeing ~400-600 concurrent users for a most of the day. This was still taxing on my server. My website went down for a few days as I worked to further improve performance.
One thing I have done is make the API serve up the precomputed aggregated data for the most often requested data, the homepage. This is another form of Caching, this made a huge difference as it was serving a static file rather than having to open DB connections, query, aggregate and serve the computed response. I also have setup the API to use your local cache if you request the same data twice, this means when someone changes the parameters like the date range they get new data but if they revisit the same page with the same date range my server is never hit. Using these and a long list of other small tweaks, in combination with decreasing traffic I was able to get the website back online and hopefully prepared for the next time it goes viral!
NFLArrest.com unique user traffic Sept. 22 to Sept. 29. (Google Analytics)
I have a lot of goals for the site, unfortunately NFL Arrest is just a side project so I don't always have time to dedicate as much energy to the site as I would like. Some of these are currently in progress but not yet released, some are goals that I haven't started yet. I would love feedback about what data you are most interested in or suggested other data dimensions you think might make a good addition to this data set.
- NFL Suspensions - I have a dataset imported but not utilized about players who have been suspended for various on and off the field reasons. I would like to relate this to my arrest data to be able to show suspensions as a result of an arrest, as well as display visualizations based on Suspensions by team, season, suspension reason etc similar to the current site.
- NFL Game Python API - There is an open source NFL.com scraping script for python that has a lot of great information I would like to import.
- I have begun logging NFL rosters and status, the goal here is to be able to accurately report whether a player is still active in the NFL. This would make for a great option to only show data from current players, calculate the percent of current players who have been arrested. (and current/historical injury status for my fantasy team predictive algorithm I want to develop 😊)
- Player game performance: I would like to import historical and ongoing on the field performance to have some empirical data to show if off the field activity effects on the field performance or if declining performance is a good predictor for potential arrests.
- FBI crime statistics - I would like to import FBI Unified Crime Reporting (UCR) statistics to compare NFL players to the USA as a whole.
- I would like to get detailed by state/metro data to allow users to compare statistics to a variety of different data sets, it would be kind of fun to see where the NFL sits in ranking of crimes by metropolitan area. For example in a list of top 50 highest crime cities would the NFL come in at #2, #10, #50? What cities rank higher or lower than the NFL?
- I would like to gather data sets for crime committed by Males from age ~20-35, by income and more. Again the goal would to create an interactive tool where a user could choose comparison data sets and explore the data.
These are the priorities right now, I would also like to start logging links to News stories & Official Announcements for each arrests to be able to cite sources, as well as show the progression (Arrested, Suspended, Dropped from Team, Convicted, Released etc). It would be great to have better data on the outcome of each incident, I would love to show the likelihood someone is dropped from a team for driving without a license or domestic violence for example or average suspension weeks by crime. Again I would love to hear any ideas of how I could use the data I mentioned above, or new sources I might be unaware of. Leave a comment below or contact me on my resume!
I hope this has been informative and not too confusing! In future posts I plan on detailing the API design, how the website interacts with the API & the semi-automated data collection process. I also want to use some of the additional data I mentioned above to write blog posts with specific statistics calculated around a topic. For example I have analytics data for my website, I would like to look into some aggregate trends in my audience by referral source in a post. Analyze some of the political claims made around the web using my website as a source. My goal is to remain impartial in this debate but present statistical evidence.