Wednesday, October 9, 2013

The Data and the Simulation

The data consist of 5220 professional football scores from the 1994 through 2013 seasons, obtained from internet sources and partially cross-checked for accuracy. This was not an easy task: I used freely available sources and had to piece together some of the information. There are paid internet sports data services that would have made the job easier. Besides scores, I also have the point spread (line) and the over/under for each game. The point spread reflects the pregame expected margin of victory; the over/under reflects the expected total number of points scored by both teams. I chose 1994 as the starting year because it was the first season with the two-point conversion rule in effect, so a uniform set of scoring rules has been in place for the whole sample. A sample of 5220 games should be sufficient for the studies described in the following posts. Some studies of football squares use a very limited subset of scores, such as only playoff games or even just a few dozen Super Bowl scores. Even with such limited statistics, it becomes fairly obvious which numbers are better than others. In this series of blog posts, I try to uncover a real statistical strategy, which would not be possible without the large sample of games.
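For concreteness, a record in a data set like this might be organized as follows. This is a minimal Python sketch; the field names are my illustration, not the actual layout of my data files, and the line and over/under shown are placeholder values, not the actual pregame numbers.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GameRecord:
    season: int
    week: int
    home_team: str
    away_team: str
    # Cumulative scores at the end of Q1, Q2, Q3, and the final score
    home_scores: Tuple[int, int, int, int]
    away_scores: Tuple[int, int, int, int]
    line: float        # pregame expected margin of victory
    over_under: float  # expected total points by both teams

# The Packers vs. Jets game described later in this post
# (line and over/under are placeholders):
example = GameRecord(2000, 1, "GB", "NYJ",
                     (7, 10, 13, 16), (7, 10, 10, 20), -2.5, 38.5)
```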

An underlying assumption is that the games from 1994 through 2013 are representative of future scoring in professional football. One could study whether there is a trend toward higher scoring or more first-quarter touchdowns, for instance. Because I select previous games at random, my simulation would not capture such trends. For a specific game, you might do better by simulating the scoring tendencies of the particular teams involved.

The simulation is a program I wrote in VB6. I fill in the squares, assigning each to one of 26 players named by a single letter, A-Z. I enter the number of games I wish to simulate and then click "Begin Sim." For each simulated game, the program uses random numbers to pick one of the 5220 historical game scores and to assign a random order of the digits 0-9 to the rows and columns. I look in detail at the 60 unique ways a player can select up to 5 squares. The simulation therefore samples from a total of 10! x 10! x 5220 x 60 = 4,124,276,932,608,000,000 possibilities, over 4 quintillion (4 billion billion). I then tabulate the number of winning squares for each player and calculate additional statistics. Below is a screen capture of an earlier version of my simulation program.
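The core loop is simple enough to sketch. My program is written in VB6, but the logic, expressed here in Python with illustrative names (games is a list of records like the one above), amounts to:

```python
import random

def simulate_one_game(games):
    """One trial: a random historical game plus a random digit layout."""
    game = random.choice(games)           # one of the 5220 historical scores
    rows = random.sample(range(10), 10)   # row digits in random order (home team)
    cols = random.sample(range(10), 10)   # column digits in random order (away team)
    winners = []
    # One winning square per period: the last digit of each team's
    # cumulative score locates the row and column.
    for h, a in zip(game.home_scores, game.away_scores):
        winners.append((rows.index(h % 10), cols.index(a % 10)))
    return winners

# Size of the sampled space: 10! row orders x 10! column orders
# x 5220 games x 60 selection patterns
assert (3628800 ** 2) * 5220 * 60 == 4_124_276_932_608_000_000
```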

In the upper left is the 10 x 10 grid with a player (A-Z) assigned to most squares and a few squares left blank. Each row and column is assigned a random ordering of the digits 0-9. Below the grid is the "Begin Sim" button that starts the simulation; Num(Sim), in this case, started at 10,000,000 and now reads -1, indicating the simulation is finished. I wrote the program so that I can clear and load different configurations of selected squares. The display shows the most recently run simulation, which happened to pick a Packers vs. Jets game from week 1 of the 2000 season. The Packers were the home team, with cumulative scores at the end of each period (Q1 through final score) of 7, 10, 13, and 16, and thus winning row digits of 7, 0, 3, and 6; the Jets' cumulative scores of 7, 10, 10, and 20 gave winning column digits of 7, 0, 0, and 0. For this simulated game, the winning (row, column) squares were therefore (7,7) = Player G, (0,0) = Player J, (3,0) = Player M, and (6,0) = Player J. The boxes on the right hold some of the tabulated statistics. For each player, I record the number of squares won and the number of simulated games in which that player won two, three, or all four squares. In the example game, Player J won twice, so J's count in the second column was incremented to 170854.
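The bookkeeping can be sketched the same way: map each winning (row, column) back to the player letter in the grid and count multi-square wins per simulated game. Again, this is an illustrative Python sketch, not my actual VB6 code:

```python
from collections import Counter

def tally(grid, winners, stats):
    """grid[r][c] holds a player letter (or None for a blank square);
    stats[player] accumulates squares won and multi-win game counts."""
    per_game = Counter(grid[r][c] for r, c in winners if grid[r][c])
    for player, n in per_game.items():
        player_stats = stats.setdefault(player, Counter())
        player_stats['squares'] += n
        if n >= 2:
            player_stats[n] += 1   # games in which this player won n squares

# For the Packers-Jets example, winners = [(7, 7), (0, 0), (3, 0), (6, 0)];
# those squares belong to G, J, M, and J, so J's two-win count goes up by one.
```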

One detail was, to my surprise, an issue: the built-in RANDOMIZE/RND random number generator in VB6 was not sufficiently random. I observed consistent 1-2% variations in the squares won by players across repeated simulations when I expected no variation, given the players' symmetrical placement on the grid. I was forced to use system-level calls to generate my own random numbers from the built-in Windows cryptographic API (CryptGenRandom). This function returns random bytes with some computational overhead, so I modified my code to use the bits within the random bytes efficiently. My simulation time decreased by about a factor of 10 after optimizing the use of the random bits. A large simulation with 100 million trials takes around 10 hours on a modest personal computer with an i5-3570 CPU.
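The same fix can be sketched in Python, with os.urandom standing in for the CryptGenRandom call used in the VB6 program. The buffering and rejection-sampling scheme here is an illustration of the idea; a bit-level scheme would be thriftier still with the random bits:

```python
import os

class DigitSource:
    """Uniform digits 0-9 from buffered cryptographic bytes.
    Rejection sampling keeps the digits unbiased: 256 is not a
    multiple of 10, so byte values 250-255 are discarded."""

    def __init__(self, bufsize=4096):
        self.bufsize = bufsize
        self.buf = b""
        self.pos = 0

    def digit(self):
        while True:
            if self.pos >= len(self.buf):
                # One system call refills the buffer for thousands of draws,
                # amortizing the overhead of the cryptographic generator.
                self.buf = os.urandom(self.bufsize)
                self.pos = 0
            b = self.buf[self.pos]
            self.pos += 1
            if b < 250:
                return b % 10

src = DigitSource()
print([src.digit() for _ in range(10)])
```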
