A Toe Dipped into Sports Analytics

One of the things I’m trying to do is to write about my experiences learning about sports analytics. I am learning more about it because it’s at the intersection of sports, data analysis, and statistical methods, three things I’m interested in. Recently, I read Scorecasting by Moskowitz and Wertheim, an interesting book that questions the “conventional wisdom” of sports in light of the data. In that sense, it’s pretty much a domain-limited version of Freakonomics.

My analysis will focus on football. 1 I wasn’t interested in football, or sports in general, growing up. As a Chicagoan, I watched the Bulls during their 90’s dynasty, and I would watch the occasional White Sox or Bears game, but it was never something I followed. But I went to college in Pittsburgh, and wound up sticking around in the city for over a decade. Pittsburgh being the epicenter of Steelers Nation, I developed an interest in football in general and the Steelers in particular, first as a lingua franca, and from there as an honest-to-goodness passion. And while I’ve developed an appreciation for other sports, especially baseball and (increasingly) soccer, football is the sport I look to the most; it’s the one I read a bunch of websites about, and I willingly watch NFL Total Access. 2 And so it seems natural that, as a means of furthering my analytics skills through an interesting project, I’d take a gander at doing some armchair football analysis.

From what I gather, it was not until recently that there has been a concerted effort to collect play-level football data and apply statistical analysis on par with what exists for baseball. What statistical data one could get would either have to be scraped from the NFL site, or purchased at high cost from a company like STATS or Elias Sports Bureau. ESPN has new APIs for their data, but only the headlines are available to the general public. Since I have neither tons of money nor a job where I can get my employer to pay for sports data, this generally meant I was kind of hosed.

As a way to fix the lack of decent NFL data for less than a king’s ransom, the Armchair Analysis site offers a collection of play-level data from every game from the 2000 season to the 2011 season. During the season, Dennis Erny, the proprietor of Armchair Analysis, sells play-level data updated weekly for handicapping purposes. I am more interested in doing long-term analytics, so having complete data, if old, is more important than frequent updates.

Last season, Erny offered the data as two CSV files — one on a game level, and one on a play level — and I had to modify the field names since many contained ‘/’ characters, and the sheer number of fields meant I spent a ton of time trying to figure things out to answer fairly simple questions. This season, Erny has done a masterful job of breaking apart the data into separate CSV files, using a naming scheme without special characters, and adding in a SQL schema so that I could import them into a DBMS. 3

One of the upsides of using a database to store the information is that I don’t have to read it in each time I want to do analysis. Moreover, since I plan on doing work on this in Stata, R, and Python, I just want one place where I make changes and updates, not several. It’s a little bit of extra work on the front end, but later on I think it’s going to pay off.
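To give a feel for that “one shared store, many tools” idea, here is a sketch using Python’s stdlib sqlite3 module as a stand-in for the MySQL database (footnote 3 mentions a possible SQLite port). The schema is a trimmed-down guess at the games table, keeping only the columns that show up in the queries in this post; the two rows are the data set’s two regular-season ties.

```python
import sqlite3

# Trimmed-down stand-in for the Armchair Analysis `games` table. The real
# schema has many more fields; these columns are just the ones queried below.
conn = sqlite3.connect(":memory:")  # a file path here gives every tool one shared copy
conn.execute("""
    CREATE TABLE games (
        gid  INTEGER PRIMARY KEY,  -- game id
        seas INTEGER,              -- season
        wk   INTEGER,              -- week number
        v    TEXT,                 -- visiting team
        h    TEXT,                 -- home team
        ptsv INTEGER,              -- visitor points
        ptsh INTEGER               -- home points
    )""")

# The data set's two regular-season ties as sample rows (the point scores
# are filled in for illustration; both games ended level).
conn.executemany(
    "INSERT INTO games VALUES (?, ?, ?, ?, ?, ?, ?)",
    [(657, 2002, 10, "ATL", "PIT", 34, 34),
     (2272, 2008, 11, "PHI", "CIN", 13, 13)])

ties = conn.execute(
    "SELECT gid, seas, v, h FROM games WHERE ptsv = ptsh AND wk < 18").fetchall()
print(ties)
```

Any of Stata, R, or Python can then point at that one database file, and a schema change happens in exactly one place.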

The initial loading process was fairly slow, and not without quirks:

  • The SQL script uses latin1 rather than utf-8. Since I don’t think there are characters outside of the standard ASCII printing characters in the file, this shouldn’t make too much difference, but it’s odd all the same.
  • When I used the included .sql file to build the database in phpMyAdmin, the latin1 character set got MySQL’s default latin1_swedish_ci (case-insensitive) collation. I do not know why, but Googling suggests this is expected behavior, and that the best thing to do is convert those text fields to utf8_unicode_ci. 4
  • Because phpMyAdmin has a maximum file size limit of 7MB, I sometimes had to split files into smaller chunks. Go go gadget split!
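The splitting step can also be scripted. A plain `split -b` can cut a CSV row in half; the sketch below (file names and the size cap are hypothetical) splits on line boundaries and repeats the header row in each chunk, so every piece imports cleanly on its own.

```python
def split_csv(path, max_bytes=5_000_000, out_prefix="chunk_"):
    """Split a CSV into pieces under max_bytes, repeating the header row
    in each piece so every chunk imports on its own."""
    paths = []
    with open(path, "r", encoding="latin-1") as src:  # the dump is latin1
        header = src.readline()
        out, size, n = None, 0, 0
        for line in src:
            # Start a new chunk when the next row would push us past the cap.
            # (In latin-1, one character is one byte, so len() counts bytes.)
            if out is None or size + len(line) > max_bytes:
                if out:
                    out.close()
                n += 1
                out_path = f"{out_prefix}{n:03d}.csv"
                out = open(out_path, "w", encoding="latin-1")
                out.write(header)
                size = len(header)
                paths.append(out_path)
            out.write(line)
            size += len(line)
        if out:
            out.close()
    return paths
```

Each chunk then stays under phpMyAdmin’s limit without ever breaking a row mid-line.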

Other than that, though, I seem to have the data in MySQL without a hitch.
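One loose end from the list above is that latin1_swedish_ci collation. Rather than changing all the text columns in all 30 tables by hand in phpMyAdmin, a short loop can emit the MySQL conversion statements to paste back into the SQL tab. The table names below are placeholders; the real list would come from SHOW TABLES.

```python
# Emit one MySQL conversion statement per table, so the collation fix
# is a single paste instead of 30 rounds of clicking in phpMyAdmin.
def charset_fix_sql(tables):
    return [f"ALTER TABLE {t} CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;"
            for t in tables]

# Placeholder table names standing in for the 30 real ones.
for stmt in charset_fix_sql(["games", "plays", "players"]):
    print(stmt)
```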

Since I was getting tired, but I wanted to play around with the data a bit, I posed a fundamental question: is there a home-field advantage? I didn’t want to probe the reasons for it; I just wanted to see the numbers.

Let’s ask the database:

mysql> select count(gid) from games where ptsh > ptsv and wk < 18;
+------------+
| count(gid) |
+------------+
| 1738       |
+------------+
1 row in set (0.10 sec)

mysql> select count(gid) from games where ptsv > ptsh and wk < 18;
+------------+
| count(gid) |
+------------+
| 1316       |
+------------+
1 row in set (0.10 sec)

mysql> select count(gid) from games where wk < 18;
+------------+
| count(gid) |
+------------+
| 3056       |
+------------+
1 row in set (0.10 sec)

So, there were 3,056 in-season games between 2000 and 2011. (I am ignoring playoff games.) 1,738 of those games ended with the home team winning, and 1,316 ended with the visiting team winning.

But that’s 3,054 games. The other two must have been ties:


mysql> select gid, seas, wk, v, h from games where ptsv = ptsh and wk < 18;
+------+------+----+-----+-----+
| gid  | seas | wk | v   | h   |
+------+------+----+-----+-----+
| 657  | 2002 | 10 | ATL | PIT |
| 2272 | 2008 | 11 | PHI | CIN |
+------+------+----+-----+-----+
2 rows in set (0.09 sec)

From the previous queries, 1738/3056 = 56.9% of games went to the home team, and 1316/3056 = 43.1% to the visitors. So, there’s a home-field advantage, and it’s pretty sizable.

Of course, I can break it down by season, by team, and so forth. But for now, the aggregate over the 2000-2011 time frame is sufficient.
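The per-season breakdown is a one-query extension of the counts above. Here is the shape of that GROUP BY, sketched against an in-memory SQLite copy of the table; the four rows are made-up sample games, just to make the snippet runnable, and the real columns may differ beyond the few used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (gid INTEGER, seas INTEGER, wk INTEGER, ptsv INTEGER, ptsh INTEGER)")
# Made-up sample rows -- the real table holds ~3,000 games.
conn.executemany("INSERT INTO games VALUES (?,?,?,?,?)", [
    (1, 2000, 1, 14, 21), (2, 2000, 2, 28, 10),
    (3, 2001, 1, 3, 6),   (4, 2001, 17, 24, 24),
])

# Home-win rate per season; ties count in the denominator but not the numerator.
rows = conn.execute("""
    SELECT seas,
           SUM(ptsh > ptsv) AS home_wins,
           COUNT(*)         AS games,
           ROUND(100.0 * SUM(ptsh > ptsv) / COUNT(*), 1) AS home_win_pct
    FROM games
    WHERE wk < 18
    GROUP BY seas
    ORDER BY seas""").fetchall()
for row in rows:
    print(row)
```

Swapping `seas` for the home-team column would give the same breakdown by team.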

  1. By which I mean “American football”, but I like other forms of football as well.
  2. I am more interested in pro football than college football.
  3. I am using MySQL for now, with phpMyAdmin as my import interface of choice. Maybe later I will try my hand at porting all of this over to SQLite.
  4. Since I am writing this late at night, and fixing it means manually changing all the text columns in all 30 tables, I will deal with that later. For now it’s not making a difference to the analysis, but I want it to be consistent.

Posted April 3, 2012 by techstep in blather, sports, statistics
