Back in the analytic groove   Leave a comment

I haven’t taken a look or done much of anything with the NFL play-level data in a while.  Most of what I did involved getting it into MySQL. Since then, I’ve been terribly busy with work and personal life, and so analytics have gone to the wayside.

But over the past week or so, I’ve been working on SQL as part of a skills-building process.

In the last post on this topic, I started out by asking about home field advantage. I noted there was one through the 12 years of the sample. But how persistent is it?

select seas season, count(gid) homewins,
 (select count(gid) from games where wk < 18 and seas=season) total,
 (count(gid)/(select count(gid) from games where wk < 18 and seas=season))*100 pct
 from games
 where ptsh > ptsv and wk < 18 group by season;

+--------+----------+-------+---------+
| season | homewins | total |   pct   |
+--------+----------+-------+---------+
| 2000   | 138      | 248   | 55.6452 |
| 2001   | 136      | 248   | 54.8387 |
| 2002   | 148      | 256   | 57.8125 |
| 2003   | 157      | 256   | 61.3281 |
| 2004   | 145      | 256   | 56.6406 |
| 2005   | 151      | 256   | 58.9844 |
| 2006   | 136      | 256   | 53.1250 |
| 2007   | 147      | 256   | 57.4219 |
| 2008   | 146      | 256   | 57.0313 |
| 2009   | 146      | 256   | 57.0313 |
| 2010   | 143      | 256   | 55.8594 |
| 2011   | 145      | 256   | 56.6406 |
+--------+----------+-------+---------+
12 rows in set (0.15 sec)

We get from this that there’s a persistent home-field advantage of at least 53.1% and as high as 61.3% in the 2003 season. I won’t get into the reasons for the home-field advantage here (Moskowitz and Wertheim cover the hypotheses at length in their book) but suffice to say that it’s not merely a phenomena that we just think exists.

Another question is: in the sample, who’s the best passer? This question is certainly thorny. Before we do that, though, I’m going to create a table that holds some passer-related information:


create table qbs
 select distinct games.seas season, games.wk week, games.v visitor, games.h home, games.ptsv visitorpts,
 games.ptsh homepts, offense.pa passatt, offense.pc passcomp, offense.py passyds,
 offense.tdp touchdowns, offense.int interceptions, offense.player, roster.fname, roster.lname
 from ((offense left join team on offense.gid = team.gid)
 left join games on offense.gid = games.gid)
 left join roster on offense.player = roster.player
 where offense.pa > 0 and offense.pc > 0;

Now, not everyone here who has made an attempt (PA) and has a completion (PC) is a quarterback. That said, having a high number of attempts and completions, all else being equal, is likelier to make one a QB than a WR or RB.

At first, I tried calculating the number of TDs the players had during the sample. But, there’s an obvious problem: passers who have been in the NFL over the full sample are likely to have more TDs than not. (Peyton Manning’s 347 TDs blow away Ben Roethlisberger’s 165, but the former played from 2000-2010 (missing the 2011 season entirely), and the latter played from 2004 onwards.) So, we want some determination of a passer’s performance, while taking into account differences in career.

One of these normalization methods is the passer rating, a measurement that takes into account completions, attempts, yardage, and interceptions. It’s not perfect (a QB’s ability to throw long-yardage passes depends on a solid O-line and solid rhythm with wide receivers, among other non-QB features, so the rating doesn’t take that into account; rushing yards and TDs don’t count). But it’s better than nothing, and it’s an understood metric.  So, I implemented the equation for NFL passer rating:

delimiter |
create function passer_rating (tds int, ints int, yds int, comps int, atts int)
returns float
deterministic
begin
declare a float;
declare b float;
declare c float;
declare d float;
declare e float;
declare f float;
declare g float;
declare h float;
declare pr float;

set a = ((comps/atts) - 0.3) * 5;
set b = ((yds/atts) - 3) * .25;
set c = (tds/atts) * 20;
set d = 2.375 - ((ints/atts)*25);

set e = greatest(0, least(a, 2.375));
set f = greatest(0, least(b, 2.375));
set g = greatest(0, least(c, 2.375));
set h = greatest(0, least(d, 2.375));

set pr = ((e+f+g+h)/6) * 100;

return pr;
end|

delimiter ;

And from there I used this function for the following query:

select concat(fname,' ', lname) name, sum(touchdowns) tds, sum(interceptions) ints,
 passer_rating(sum(touchdowns), sum(interceptions), sum(passyds), sum(passcomp), sum(passatt)) pr
 from qbs
 where week < 18
 group by player
 having tds >100
 order by pr desc limit 10;
+--------------------+------+------+------------------+
|      name          | tds  | ints |        pr        |
+--------------------+------+------+------------------+
| Aaron Rodgers      | 132  |  38  | 103.741760253906 |
| Peyton Manning     | 347  | 155  | 97.2291412353516 |
| Tony Romo          | 149  |  72  | 96.8970489501953 |
| Tom Brady          | 300  | 115  | 96.2420272827148 |
| Philip Rivers      | 163  |  78  | 95.1152267456055 |
| Drew Brees         | 282  | 146  | 94.0112228393555 |
| Ben Roethlisberger | 165  | 100  | 92.0614852905273 |
| Kurt Warner        | 168  | 115  | 91.4776992797852 |
| Chad Pennington    | 102  |  64  | 89.9092712402344 |
| Jeff Garcia        | 150  |  72  | 88.5577087402344 |
+--------------------+------+------+------------------+

Aaron Rodgers is at the top of this list. Interestingly, though, four of the people on the top 10 (Romo, Rivers, Pennington, Garcia) do not have Super Bowl titles. This doesn’t say much, other than to say that you need more than a QB with a high rating to win titles. (Obviously.)

Really, this was more of an exercise than anything else. As time goes on, I hope to say more interesting things.

Posted July 16, 2012 by techstep in analytics, programming, sports

Tagged with , , , , ,

Leave a Reply