Back in the analytic groove

I haven’t done much of anything with the NFL play-level data in a while. Most of what I did involved getting it into MySQL. Since then, I’ve been terribly busy with work and personal life, and so analytics have fallen by the wayside.

But over the past week or so, I’ve been working on SQL as part of a skills-building process.

In the last post on this topic, I started out by asking about home field advantage. I noted there was one through the 12 years of the sample. But how persistent is it?

select seas season, count(gid) homewins,
 (select count(gid) from games where wk < 18 and seas=season) total,
 (count(gid)/(select count(gid) from games where wk < 18 and seas=season))*100 pct
 from games
 where ptsh > ptsv and wk < 18 group by season;

| season | homewins | total |   pct   |
| 2000   | 138      | 248   | 55.6452 |
| 2001   | 136      | 248   | 54.8387 |
| 2002   | 148      | 256   | 57.8125 |
| 2003   | 157      | 256   | 61.3281 |
| 2004   | 145      | 256   | 56.6406 |
| 2005   | 151      | 256   | 58.9844 |
| 2006   | 136      | 256   | 53.1250 |
| 2007   | 147      | 256   | 57.4219 |
| 2008   | 146      | 256   | 57.0313 |
| 2009   | 146      | 256   | 57.0313 |
| 2010   | 143      | 256   | 55.8594 |
| 2011   | 145      | 256   | 56.6406 |
12 rows in set (0.15 sec)

We get from this that there’s a persistent home-field advantage of at least 53.1% and as high as 61.3% in the 2003 season. I won’t get into the reasons for the home-field advantage here (Moskowitz and Wertheim cover the hypotheses at length in their book), but suffice it to say that it’s not merely a phenomenon we just think exists.

Another question is: in the sample, who’s the best passer? This question is certainly thorny. Before we do that, though, I’m going to create a table that holds some passer-related information:

create table qbs
 select distinct games.seas season, games.wk week, games.v visitor, games.h home, games.ptsv visitorpts,
 games.ptsh homepts, passatt, offense.pc passcomp, passyds,
 offense.tdp touchdowns, interceptions, offense.player, roster.fname, roster.lname
 from ((offense left join team on offense.gid = team.gid)
 left join games on offense.gid = games.gid)
 left join roster on offense.player = roster.player
 where offense.pa > 0 and offense.pc > 0;

Now, not everyone here who has made an attempt (PA) and has a completion (PC) is a quarterback. That said, having a high number of attempts and completions, all else being equal, is likelier to make one a QB than a WR or RB.

At first, I tried calculating the number of TDs the players had during the sample. But there’s an obvious problem: passers who have been in the NFL over the full sample are likely to have more TDs than those with shorter careers. (Peyton Manning’s 347 TDs blow away Ben Roethlisberger’s 165, but the former played from 2000-2010 (missing the 2011 season entirely), and the latter only played from 2004 onwards.) So, we want some measure of a passer’s performance that takes differences in career length into account.

One of these normalization methods is the passer rating, a measurement that takes into account completions, attempts, yardage, touchdowns, and interceptions. It’s not perfect (a QB’s ability to throw long-yardage passes depends on a solid O-line and a solid rhythm with wide receivers, among other non-QB factors, none of which the rating accounts for; rushing yards and TDs don’t count either). But it’s better than nothing, and it’s a widely understood metric. So, I implemented the equation for NFL passer rating:

delimiter |
create function passer_rating (tds int, ints int, yds int, comps int, atts int)
returns float
deterministic
begin
declare a float;
declare b float;
declare c float;
declare d float;
declare e float;
declare f float;
declare g float;
declare h float;
declare pr float;

set a = ((comps/atts) - 0.3) * 5;
set b = ((yds/atts) - 3) * 0.25;
set c = (tds/atts) * 20;
set d = 2.375 - ((ints/atts)*25);

set e = greatest(0, least(a, 2.375));
set f = greatest(0, least(b, 2.375));
set g = greatest(0, least(c, 2.375));
set h = greatest(0, least(d, 2.375));

set pr = ((e+f+g+h)/6) * 100;

return pr;
end|
delimiter ;
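The formula is easy to sanity-check outside the database. Here’s a quick Python sketch of the same passer-rating computation (mirroring the MySQL function, not tied to the schema):

```python
def passer_rating(tds, ints, yds, comps, atts):
    """NFL passer rating: four components, each clamped to [0, 2.375]."""
    clamp = lambda x: max(0.0, min(x, 2.375))
    a = clamp((comps / atts - 0.3) * 5)    # completion percentage
    b = clamp((yds / atts - 3) * 0.25)     # yards per attempt
    c = clamp((tds / atts) * 20)           # touchdowns per attempt
    d = clamp(2.375 - (ints / atts) * 25)  # interceptions per attempt
    return (a + b + c + d) / 6 * 100

# A "perfect" stat line maxes out all four components at 2.375,
# which gives the well-known ceiling of 158.3:
print(round(passer_rating(tds=3, ints=0, yds=400, comps=20, atts=25), 1))  # 158.3
```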

And from there I used this function for the following query:

select concat(fname,' ', lname) name, sum(touchdowns) tds, sum(interceptions) ints,
 passer_rating(sum(touchdowns), sum(interceptions), sum(passyds), sum(passcomp), sum(passatt)) pr
 from qbs
 where week < 18
 group by player
 having tds >100
 order by pr desc limit 10;
|      name          | tds  | ints |        pr        |
| Aaron Rodgers      | 132  |  38  | 103.741760253906 |
| Peyton Manning     | 347  | 155  | 97.2291412353516 |
| Tony Romo          | 149  |  72  | 96.8970489501953 |
| Tom Brady          | 300  | 115  | 96.2420272827148 |
| Philip Rivers      | 163  |  78  | 95.1152267456055 |
| Drew Brees         | 282  | 146  | 94.0112228393555 |
| Ben Roethlisberger | 165  | 100  | 92.0614852905273 |
| Kurt Warner        | 168  | 115  | 91.4776992797852 |
| Chad Pennington    | 102  |  64  | 89.9092712402344 |
| Jeff Garcia        | 150  |  72  | 88.5577087402344 |

Aaron Rodgers is at the top of this list. Interestingly, though, four of the people in the top 10 (Romo, Rivers, Pennington, Garcia) do not have Super Bowl titles. This doesn’t say much, other than that you need more than a QB with a high rating to win titles. (Obviously.)

Really, this was more of an exercise than anything else. As time goes on, I hope to say more interesting things.

Posted July 16, 2012 by techstep in analytics, programming, sports


Why Not Learn to Code?

I’m reading Jeff Atwood’s Please Don’t Learn to Code. He’s a software developer by trade, and I’m not. So take my comments with a grain of salt.

Atwood writes, of NYC Mayor Bloomberg tweeting that he is learning to code using Codecademy:

A noble gesture to garner the NYC tech community vote, for sure, but if the mayor of New York City actually needs to sling JavaScript code to do his job, something is deeply, horribly, terribly wrong with politics in the state of New York.

I do not disagree with that sentence. I’d be confused if he needed to write code as part of his job as well.

But suppose Bloomberg is doing this, not to debug JavaScript on the site, but to learn something new. Suppose he’s doing this as a hobby, to learn enough to do something interesting, because he finds it fun. Some people learn about the subtle details of historical events as a hobby. Some people solve crossword puzzles. Others learn to play an instrument (even if poorly), or draw (again, even if poorly). Some do all of these and more. Why not coding as a hobby? Nothing says that because he’s been working on these programming exercises, he has designs on working on this stuff in his daily job.

Later, Atwood lists a number of reasons why he considers what he calls the “‘everyone should learn to code’ movement” flawed. Some of these reasons are related to the relationship between coding and problem solving.

It assumes that coding is the goal. Software developers tend to be software addicts who think their job is to write code. But it’s not. Their job is to solve problems. Don’t celebrate the creation of code, celebrate the creation of solutions. We have way too many coders addicted to doing just one more line of code already.

Related to problem solving, he also writes:

It puts the method before the problem. Before you go rushing out to learn to code, figure out what your problem actually is. Do you even have a problem? Can you explain it to others in a way they can understand? Have you researched the problem, and its possible solutions, deeply? Does coding solve that problem? Are you sure?

In part, I agree with him. Coding is not an end in itself, but a means to an end. It is a method of solving problems. And without understanding the problem, coding is not going to help. I get that.

But I do believe that knowing coding, even if a little bit, allows you to approach problems differently than before. Instead of, say, using the menus in Stata to perform an analysis, I write a bit of code. It gives me flexibility; if I do it enough, I can abstract away some of the details and make a loop. I can do my own data cleaning and normalization rather than just being content with what I’ve found. I can even start working on different problems, because I can use what I know about coding to write web scrapers to extract data. It works both ways.

Atwood also writes:

It implies that there’s a thin, easily permeable membrane between learning to program and getting paid to program professionally. Just look at these new programmers who got offered jobs at an average salary of $79k/year after attending a mere two and a half month bootcamp! Maybe you too can teach yourself Perl in 24 hours! While I love that programming is an egalitarian field where degrees and certifications are irrelevant in the face of experience, you still gotta put in your ten thousand hours like the rest of us.

I do not understand. Learning the basics of coding does not imply there’s a “thin, easily permeable membrane between learning to program and getting paid to program professionally”, any more than taking a class in basic art implies the same “easily permeable membrane” between learning art and being a professional or semi-professional artist. Is the problem learning coding per se, or is it that perhaps some of the schools, books, and programs lead people to a false sense of rapid expertise?

In point of fact, I do not consider myself a programmer. I took some computer science courses as an undergrad, I’ve written programs in a variety of languages as a sysadmin, researcher, and business analyst. I know enough to solve some basic problems I run into where other approaches are irritating. But I have no illusion about being a competent programmer, or someone skilled enough to do it as a full-time job, much less a great programmer. It’s one of a few tools in my analytic arsenal, and taking courses on Udacity, Coursera, and Codecademy is only going to help with that.

It seems that Atwood’s problem is less with “learning to code is useful and interesting” than with:

  1. People overstating the importance of learning to code (especially vis-à-vis other skills, like critical thinking and clear writing);
  2. A promulgated illusion of going from “zero to expert” (or employable in the software development field in a few steps);
  3. The dangers that the overconfidence fostered in (2) can yield (e.g., so much bad code on projects).

Maybe I am wrong on this; perhaps he actually believes that only people who are willing to devote their careers to programming should program, and that either you should spend the “ten thousand hours” and become an expert, or not do it. (“Go big or go home”, anyone?) But there’s something that just doesn’t ring quite right about that; otherwise, why do much of anything unless we can devote our lives to it?

Hammer (yes, that Hammer) on Analytics

This morning, from the INFORMS Twitter feed, there was a link to a video from Hammer (formerly MC Hammer) about the importance of analytics in marketing rap music. It’s only a minute, and has clips from “U Can’t Touch This”, so you might as well watch it:

As someone who loves rap, and loves working with data, I don’t disagree with him about the power of analytics. I’d hesitate to call analytics the “endgame”, though. It’s a tool (more precisely, a set of tools), but not the outcome. But it’s a powerful toolkit, especially in this day of being able to directly communicate with fans via Twitter and Facebook, being able to release music when you feel like it on SoundCloud, DatPiff, and other services, and myriad routes to getting the music out there that didn’t exist a decade ago. While I don’t think it will replace promoters, and street teams, and all of that, I think it’s going to be a useful adjunct to make sure those limited resources are used better.

Posted May 3, 2012 by techstep in analytics, statistics


More Working with MySQL

In my previous post, I talked about setting up the SQL database based on Dennis Erny’s NFL play-by-play dataset. I spent some time generating a script with the appropriate ALTER TABLE statements to ensure that the character sets and collations are utf8, not latin1 with the latin1_swedish_ci collation as in the last post. I started doing it manually through phpMyAdmin, but decided after five minutes that just writing a script was going to be faster, especially across 30 tables.
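The script itself is mostly string generation. Something like this minimal Python sketch (the table names here are illustrative, not the dataset’s actual 30 tables):

```python
# Emit one ALTER TABLE statement per table, converting its character
# set and collation to utf8. The list below is illustrative only.
tables = ["games", "offense", "roster", "team"]

def alter_statements(tables, charset="utf8", collation="utf8_unicode_ci"):
    return [
        f"ALTER TABLE {t} CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
        for t in tables
    ]

for stmt in alter_statements(tables):
    print(stmt)
```

Piping the output into the `mysql` client then applies the conversion in one shot.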

So, I now have a remote database that I can reach. That’s quite exciting. (Again, later I can dump everything into SQLite to just have everything as a single file. But for now, I feel like playing with remote databases.)

I thought about using R, Python, and Stata for my analytics work, because each one has different strengths, and I feel as though I’m better served knowing all three, and being able to use them for statistics and visualization. Python and R have low-headache interfaces to MySQL — Python with the MySQLdb module, and R with the RMySQL library. They were both rather easy to set up and configure.

But…I also have Stata, and I like it. Except to get Stata to work with databases, I need to use ODBC. I used to have to deal with this in previous jobs, and I was hoping I would never have to deal with this ever again. MySQL offers an ODBC connector for various platforms, including Mac OS X. Except the documentation is inconsistent between the online documentation, the README file, and the actual library itself. It was a bear to install, and I couldn’t get it to work.

For now, at least, I’m going to use R and Python because those interfaces to MySQL work.

Posted April 8, 2012 by techstep in sports


A Toe Dipped into Sports Analytics

One of the things I’m trying to do is to write about my experiences learning about sports analytics. I am learning more about it because it’s at the intersection of sports, data analysis, and statistical methods, three things I’m interested in. Recently, I read Scorecasting by Moskowitz and Wertheim, a rather interesting book about the “conventional wisdom” of sports and questioning it in the light of data. To that extent, it’s pretty much a domain-limited version of Freakonomics.

My analysis will focus on football. 1 I wasn’t interested in football, or sports in general, growing up. As a Chicagoan, I watched the Bulls during their 90’s dynasty, and I would watch the occasional White Sox or Bears game, but it was never something I followed. But I went to college in Pittsburgh, and wound up sticking around in the city for over a decade. Pittsburgh, being the epicenter of Steelers Nation, meant that I developed an interest in football in general and the Steelers in particular, first as a lingua franca, and from there as an honest-to-goodness passion. And while I’ve developed an appreciation for other sports, especially baseball and (increasingly) soccer, football is the sport I look to the most; it’s the one I read a bunch of websites about, and I willingly watch NFL Total Access. 2 And so it seems natural that, as a means of furthering my analytics skills by finding an interesting project to work on, I’d take a gander at doing some armchair football analysis.

From what I gather, it was not until recently that there has been a concerted effort to apply statistical analysis, and to collect play-level data, on par with that found in baseball. At least, the statistical data one could get would either have to be scraped from the NFL site, or purchased at high cost from a company like STATS or Elias Sports Bureau. ESPN has new APIs for their data, but only the headlines are available to the general public. Since I neither have tons of money, nor do I have a job where I can get my employer to pay for sports data, this generally meant I was kind of hosed.

As a way to fix the lack of decent NFL data for less than a king’s ransom, the Armchair Analysis site offers a collection of play-level data from every game from the 2000 season to the 2011 season. During the season, Dennis Erny, the proprietor of Armchair Analysis, sells play-level data updated weekly for handicapping purposes. I am more interested in doing long-term analytics, so having complete data, if old, is more important than frequent updates.

Last season, Erny offered the data as two CSV files — one on a game level, and one on a play level — and I had to modify the field names since many contained ‘/’ characters, and the sheer number of fields meant I spent a ton of time trying to figure things out to answer fairly simple questions. This season, Erny has done a masterful job of breaking apart the data into separate CSV files, using a naming scheme without special characters, and adding in a SQL schema so that I could import them into a DBMS. 3

One of the upsides of using a database to store the information is that I don’t have to read it in each time I want to do analysis. Moreover, since I plan on doing work on this in Stata, R, and Python, I just want one place where I make changes and updates, not several. It’s a little bit of extra work on the front end, but later on I think it’s going to pay off.

The initial loading process was fairly slow, and not without quirks:

  • The SQL script uses latin1 rather than utf-8. Since I don’t think there are characters outside of the standard ASCII printing characters in the file, this shouldn’t make too much difference, but it’s odd all the same.
  • When I used the included .sql file to build the database in phpMyAdmin, latin1 is converted into latin1_swedish_ci (case-insensitive). I do not know why; Googling suggests that this just happens, and the best thing to do is convert those text fields to utf8_unicode_ci. 4
  • Because phpMyAdmin has a maximum file size limit of 7MB, I sometimes had to split files into smaller chunks. Go go gadget split!

Other than that, though, I seem to have the data in MySQL without a hitch.

Since I was getting tired, but I wanted to play around with the data a bit, I posed a fundamental question: is there a home-field advantage? I didn’t want to probe the reasons for it; I just wanted to see the numbers.

Let’s ask the database:

mysql> select count(gid) from games where ptsh > ptsv and wk < 18;
| count(gid) |
| 1738       |
1 row in set (0.10 sec)

mysql> select count(gid) from games where ptsv > ptsh and wk < 18;
| count(gid) |
| 1316       |
1 row in set (0.10 sec)

mysql> select count(gid) from games where wk < 18;
| count(gid) |
| 3056       |
1 row in set (0.10 sec)

So, there were 3,056 regular-season games between 2000 and 2011. (I am ignoring playoff games.) 1,738 of those games ended with the home team winning, and 1,316 ended with the visiting team winning.

But that’s 3,054 games. The other two must have been ties:

mysql> select gid, seas, wk, v, h from games where ptsv = ptsh and wk < 18;
| gid  | seas | wk | v   | h   |
| 657  | 2002 | 10 | ATL | PIT |
| 2272 | 2008 | 11 | PHI | CIN |
2 rows in set (0.09 sec)

From the previous queries, 1738/3056=56.9%, and 1316/3056=43.1%. So, there’s a home field advantage, and it’s pretty sizable.
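That arithmetic takes only a couple of lines of Python, with the counts taken straight from the query results:

```python
# Win counts from the queries above: 1,738 home wins, 1,316 road wins,
# 2 ties, 3,056 regular-season games in total.
home_wins, road_wins, ties, total = 1738, 1316, 2, 3056

assert home_wins + road_wins + ties == total  # everything accounted for

home_pct = 100 * home_wins / total
road_pct = 100 * road_wins / total
print(f"home: {home_pct:.1f}%, road: {road_pct:.1f}%")  # home: 56.9%, road: 43.1%
```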

Of course, I can break it down by season, by team, and so forth. But for now, the aggregate over the 2000-2011 time frame is sufficient.

  1. By which I mean “American football”, but I like other forms of football as well.
  2. I am more interested in pro football than college football.
  3. I am using MySQL for now, with phpMyAdmin as my import interface of choice. Maybe later I will try my hand at porting all of this over to SQLite.
  4. Since I am writing this late at night, and I have to manually change all the text columns in all 30 tables, I will deal with that later. At least for now, it’s not making a huge difference to the analysis, but I want it to be consistent.

Posted April 3, 2012 by techstep in blather, sports, statistics


On odds ratios

Even though I’ve had a fair amount of training in technical areas of statistics and probability, my work (primarily in the realm of “public policy”) does not generally lend itself to having to understand the intricacies of, say, GMM estimators. (Though sometimes I develop Monte Carlo simulations for sensitivity analysis.) The vast majority of what I do involves fairly straightforward things — writing proposals and plans; reading papers; occasionally generating spreadsheets to help write the proposals and plans. Very little math is required, and most of it is basic algebra. I don’t even use that much statistical knowledge except on a tacit level. And it turns out that a lot of things I run into are things that I’ve never formally studied (for example, ANOVA, odds ratios, experimental design, surveying methods), so I wind up having to teach myself fairly basic concepts.

A fair amount of what I am reading these days involves trying to make sense of medical literature. For example, I am trying to understand the risk of someone with particular characteristics getting a disease, like asthma or congestive heart failure. Or, I would read a set of papers outlining the effectiveness of various treatments (measured in either hospitalizations reduced, or length of time in the hospital reduced). A few concepts crop up over and over, some of which I have an understanding of from other fields (e.g. p-values). One of those concepts that crops up fairly frequently is that of the odds ratio, which I haven’t seen much outside of the medical literature thus far. It took me a while to understand what the odds ratio was, and I’m still not quite sure.

But I’ll try to explain, and if anyone is reading this wants to provide suggestions/corrections, please let me know!

So, we start with two groups of people. Let’s call them A and B. They’re a pool of people who have had heart attacks in the past year. We are running a clinical trial to determine if a new medication, jokeapril, will reduce the probability of a heart attack in the next year. 1 People are randomly assigned to group A or group B. Those in group A get jokeapril, while those in group B do not. Out of a total pool of 100 people, 35 are in group A, and 65 are in group B. 2

Over the next year, 10 of the people in group A have heart attacks, and 32 of the people in group B have heart attacks.  The probability of people in group A having a heart attack in the next year, then, is 10/35 = 0.286, which means that the probability of people in group A not having a heart attack is 25/35=0.714. The odds of having a heart attack while being in group A are

$latex \frac{\frac{10}{35}}{\frac{25}{35}} = 0.4&s=2$

In other words, the odds are 0.4 (that is, 2:5): for every person in the treatment group who has a heart attack in the next year, 2.5 people do not.

Similarly, the probability of people in group B getting heart attacks is 32/65=0.492, and that of people in group B not getting heart attacks is 33/65=1-0.492=0.508. The odds of someone in group B getting a heart attack in such a case are:

$latex \frac{\frac{32}{65}}{\frac{33}{65}} = 0.9697&s=2$

That is, the odds are about even (1:1) that someone in the next year in the control group will have a heart attack.

So, we have the odds of someone in group A getting a heart attack in the next year, and the odds of someone in group B getting a heart attack. The odds ratio, as the term implies, is the ratio between the two:

$latex OR = \frac{\frac{10}{25}}{\frac{32}{33}} = 0.4125&s=2$

Here, since the denominators in both parts of group A are the same, and those in both parts of group B are the same, I just canceled them. The odds ratio in this case is 0.4125, which means that the odds of a heart attack in the next year are about 59% (well, 1-0.4125=0.5875) lower for the treatment group (A) than for the control group (B).

Now, suppose what we were counting was the number of people who avoided getting heart attacks in the next year. What would change? Simply, the numbers would be flipped around. The odds for group A would be 25/10 = 2.5, and the odds for group B would be 33/32 = 1.03125. Dividing these two gives an odds ratio of 2.424 (the reciprocal of 0.4125), which means the odds of avoiding a heart attack in a year are about 142% higher for group A than for group B.
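A quick Python sketch of the whole computation, using the made-up trial numbers above:

```python
def odds(events, non_events):
    """Odds in favor of an event: events per non-event."""
    return events / non_events

def odds_ratio(events_a, non_events_a, events_b, non_events_b):
    """Ratio of group A's odds to group B's odds."""
    return odds(events_a, non_events_a) / odds(events_b, non_events_b)

# Heart attacks over the next year: group A (treatment) 10 of 35,
# group B (control) 32 of 65.
or_attack = odds_ratio(10, 25, 32, 33)
print(round(or_attack, 4))   # 0.4125

# Counting the complementary outcome (avoiding a heart attack)
# flips the odds ratio to its reciprocal:
or_avoid = odds_ratio(25, 10, 33, 32)
print(round(or_avoid, 3))    # 2.424
```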

What I’ve just done is an analysis of odds ratios in a simple situation — two groups of people, and two levels of treatment. What I am trying to figure out is to what extent it’s generalizable to multiple levels of treatment and multiple groups. I know there are ways, but I don’t know enough yet to make sense of it.

  1. As far as I know, this is not an actual medication — I hope the jokea– prefix makes that clear — but the -pril suffix denotes an ACE inhibitor, which expands blood vessels and decreases resistance.
  2. I am making up numbers here — this has no correspondence to any clinical trial that I know of.

Posted March 25, 2012 by techstep in blather, statistics


just a test of Jetpack

I decided I would try out Jetpack and its great feature set.

One of the reasons for installing the plugin is being able to insert $latex \LaTeX$ code in documents. Even though this works now:

$latex i\hbar\frac{\partial}{\partial t}\left|\Psi(t)\right> = H\left|\Psi(t)\right>$

rendering does not seem to consistently work. And I am not sure why it’s not working consistently, even when I follow the documentation provided by WordPress. So I am fairly puzzled.

I guess it is something to contemplate later.

Posted December 11, 2011 by techstep in administrivia

A Logorrheic First Post

It has been a long time since I have written in a blog. Other than Facebook and Twitter, most of my writing now is for work. To that extent, the vast majority of my writing focuses around policy briefs and proposals. This is not a bad thing, but I believe that I can write about things either unrelated to, or at most tangentially related to work.

The question is: What in the world should I write about? I do not want to turn it into a work blog for various reasons, but a lot of the things that interest me intersect with work. It will probably consist of musings/grousings about public policy, math, programming, sports, music, food, and other things that come to mind. Whether it is interesting remains to be seen; at best, I should just hope that it is interesting to me.

I will see how this goes.

Posted December 11, 2011 by techstep in administrivia