In Part 2, we'll write a command-line interface to a Python program that handles grabbing all of the CFL play-by-play data for any season with just a few clicks and keystrokes.
In Part 1, we achieved the first part of our goal by implementing a scraper that grabs the play-by-play from a given CFL URL. The next step is to build a workflow that uses this scraper to collect a full season's worth of data and save it to a database. It should be flexible enough that the same code works across seasons (for instance, the Ottawa Redblacks were not part of the CFL in 2013 but have been from 2014 on: handling this should not require a rewrite).
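One way to get that cross-season flexibility is to parameterize the team list by year rather than hard-coding a single roster. The sketch below is an illustration, not the post's actual code: the `BASE_TEAMS` abbreviations and the `teams_for_season` helper are assumptions, though the Redblacks fact (Ottawa joining in 2014) comes straight from the example above.

```python
# Hypothetical season-aware team lookup: the eight franchises that played
# every season in this era, plus Ottawa from 2014 onward.
BASE_TEAMS = ["BC", "CGY", "EDM", "SSK", "WPG", "HAM", "TOR", "MTL"]


def teams_for_season(year):
    """Return the CFL teams active in the given season."""
    teams = list(BASE_TEAMS)
    if year >= 2014:
        # The Ottawa Redblacks joined the league in 2014.
        teams.append("OTT")
    return teams
```

With a helper like this, the scraping loop can iterate over `teams_for_season(year)` and the 2013 and 2014 seasons run through identical code.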
My first foray into sports statistics was a research project for Carleton University's Data Mining course, in which we were given our first chance to select a dataset to work with aside from the canonical cars and housing ones that come bundled with R. My class partner and I immediately jumped on Brian Burke's NFL play-by-play dataset. We poured hours into creating new variables and sorting the data into a format better suited to analysis, finally creating two models: the first to model the probability of a given team winning a game against a specific opponent, and the second to model the number of wins for each team in a given season. It was really satisfying to try our hand at what had always been a "black box" process to us, and to manage to produce some successful predictions.