CrossFit Games Data 2012-2015

CrossFit Open Data 2012-2015

 Mission part 1: Bring CrossFit data mainstream.  Success!

Part 1 of my mission to bring more data to the CrossFit Open is now a success. I always wondered why CrossFit HQ didn’t do more to bring the story of the Open to the fans and participants through its data. Now they have. The 4 x “M” team of Mike Macpherson and Megan Mitchell brought us the first-ever analytical piece from HQ, breaking down performances on 15.3. It bears a striking resemblance to the analytical approach I’ve taken here in my analyses of 2014, 15.1, and 15.2, and I’m happy to have made an impression. This is CrossFit: you can’t get out to a lead and expect that nobody will catch up. It’s better if they do.

Mission part 2: Take CrossFit data to the next level.

I’m definitely not the only one playing with this data.  I’ve linked to some of the other interesting projects this year:

I think there are a lot more skilled people out there, but the overhead of scraping all the data is somewhat prohibitive, especially in the case of athlete profile data. I’d be thrilled if more people could contribute interesting analyses, so I thought I’d share everything I’ve collected and see where it takes us. CrossFit has an appreciation for the quantitative approach in its genes. Let’s see if we can make it an example for other amateur sports of what’s possible.

Please let me know if you’re working on something interesting, and I’d be interested to collaborate if I have time.  If you use the data I’ve collected for something public, please just reference this post, thanks!

What are we working with

All data are posted as .csv, zipped .csv (.zip), and R binary (.RData) files for your convenience.

  • Leaderboard scores, ranks, athlete_ids, and scaled designations for 2012, 2013, 2014, and 2015 to date. The remaining 2015 data will be added as it becomes available.
    • fields refer to the URL parameters used by the games.crossfit site
      • division: 1 = male, 2 = female
      • stage: 0 = roster (all athletes who signed up), 1 = WoD 1 of that year, etc.
      • score: expressed in reps or seconds
      • scaled: 0 = Rx, 1 = scaled; all NA before 2015
    • I refer to 15.1A as 15.1.1 so that the database field can be numeric
    • the scraping process is not 100% successful, so there may be a small number of missing records.  Not a problem if you’re summarizing trends.  Might be a problem if you’re doing reporting for individuals.
  • Athlete profiles: everything on the athlete profile pages (age, weight, affiliate, team, lift PRs, workout PRs, background, training and diet descriptives).
    • athlete profiles can change at any time.  73% of records have a retrieved_datetime field to make this less ambiguous.  The rest were scraped before I thought to do that, but were scraped in March 2015.
    • some of the profile fields are very sparse and are optionally self-reported. amateur statisticians beware.
    • some (~20%) user profile pages do not exist.  these are mostly athletes who only participated in the earlier years, but there’s no real rhyme or reason to it as far as I can tell.
  • Code (written in R) to scrape, compile, and analyze as a starting point
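As a starting point outside R, the leaderboard field codes described above can be decoded with a few lines of Python. This is only a sketch: the column names (`athlete_id`, `division`, `stage`, `rank`, `score`, `scaled`) and the sample rows are assumptions made for illustration, so check them against the header of the file you actually download.

```python
import csv
import io

# Hypothetical sample rows; the real files may use different column names or order.
SAMPLE = """athlete_id,division,stage,rank,score,scaled
101,1,0,NA,NA,NA
101,1,1,250,180,0
202,2,1,40,195,1
303,2,2,12,NA,NA
"""

# Codes as documented in the field list above.
DIVISIONS = {1: "male", 2: "female"}
SCALED = {0: "Rx", 1: "scaled"}


def parse_na(value, cast):
    """Return None for R-style NA (or empty) values, otherwise cast the string."""
    if value is None or value.strip() in ("", "NA"):
        return None
    return cast(value)


def decode_row(row):
    """Translate one raw leaderboard row into readable values."""
    division = parse_na(row["division"], int)
    stage = parse_na(row["stage"], float)
    scaled = parse_na(row["scaled"], int)
    return {
        "athlete_id": row["athlete_id"],
        "division": DIVISIONS.get(division, "unknown"),
        "is_roster": stage == 0,  # stage 0 lists everyone who signed up
        "stage": stage,
        "rank": parse_na(row["rank"], int),
        "score": parse_na(row["score"], float),  # reps or seconds
        "scaled": SCALED.get(scaled),  # None when NA (all years before 2015)
    }


def load_leaderboard(handle):
    """Read every row from an open CSV file handle."""
    return [decode_row(r) for r in csv.DictReader(handle)]


rows = load_leaderboard(io.StringIO(SAMPLE))
workout_rows = [r for r in rows if not r["is_roster"]]
```

To use it on a downloaded file, pass `open(path, newline="")` instead of the `io.StringIO` sample. Missing values are kept as `None` rather than dropped, since a small number of records are known to be incomplete.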

 

  • Sara

    Hi Sam, this is great stuff. I was looking into the data you’ve scraped and found that score has a value of NA for all records in the 15.5 CSV, although there are rank values. Does the R data set you’re working with also have missing 15.5 scores for all rows?

    • Sam Swift

      Hi Sara,
      The first time I scraped 15.5 I forgot to tell my program that the workout was ‘for time’ rather than AMRAP, so it didn’t know how to deal with the MM:SS format. It’s all fixed and updated now, with scores in seconds. Thanks for asking, and let me know what you do with it!
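The fix described here comes down to parsing a score differently depending on the workout type. A minimal sketch of that idea follows; it is not the scraper's actual R code, and the function names and the `for_time` flag are hypothetical.

```python
def time_to_seconds(text):
    """Convert 'MM:SS' (or 'H:MM:SS') to a total number of seconds."""
    seconds = 0
    for part in text.strip().split(":"):
        # Each further field is worth 60x less than the one before it.
        seconds = seconds * 60 + int(part)
    return seconds


def parse_score(raw, for_time):
    """Parse a raw leaderboard score string.

    for_time=True  -> clock time string, returned in seconds
    for_time=False -> AMRAP rep count, returned as an int
    Returns None when the value is missing.
    """
    if raw is None or raw.strip() in ("", "NA"):
        return None
    return time_to_seconds(raw) if for_time else int(raw)
```

For example, `parse_score("08:05", for_time=True)` gives 485 seconds, while the same string parsed as an AMRAP score would raise a `ValueError`, which is roughly the failure that left the 15.5 scores as NA.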

      • Sara

        Thanks Sam! I am a consultant for SAS, and my customer is going to be purchasing SAS Visual Analytics/Visual Statistics. I haven’t had any other customers using VA so far, so I haven’t had much of an opportunity to get to know the product. I thought this data set would be great for setting up a playground and starting to familiarize myself with the product.

  • Martin H

    Sam, it looks like there is no field for age category, although the number of athletes with a ranking of 1 implies that all age groups are included. A quick glance makes it appear that each age group/gender category (i.e. division) are grouped together, although I am not sure in what order? Is there an easy way to extract data by division?

    • Sam Swift (http://swift.pw)

      Hi Martin,

      I didn’t scrape the age-group-specific leaderboards (you can see which URL I am getting the data from here: https://github.com/swiftsam/CrossfitRankings/blob/master/scrape_fns.R#L9). There should be many athletes with the same rank for some WODs, although I agree that many with rank 1 seems strange. If you point me to the specific file and values, I can investigate.

      You can also reconstruct the age group leaderboard using the age info in the athlete profile data and the scores.
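A rough sketch of that reconstruction: join scores to profile ages, bucket the ages, and re-rank within each bucket. The age brackets below are placeholders chosen for illustration, not the official division boundaries, and the input shapes (a list of `(athlete_id, score)` pairs and an `athlete_id -> age` mapping) are assumptions about how you have loaded the data.

```python
from collections import defaultdict

# Hypothetical brackets for illustration only; substitute the official ones.
BRACKETS = [(18, 39, "18-39"), (40, 44, "40-44"), (45, 49, "45-49")]


def age_group(age):
    """Return the bracket label for an age, or None if no bracket matches."""
    for low, high, label in BRACKETS:
        if low <= age <= high:
            return label
    return None


def age_group_leaderboard(scores, ages, higher_is_better=True):
    """Rank athletes within each age bracket.

    scores: list of (athlete_id, score) for one workout and one gender
    ages:   dict mapping athlete_id -> age from the profile data
    Set higher_is_better=False for workouts scored in seconds.
    """
    groups = defaultdict(list)
    for athlete_id, score in scores:
        age = ages.get(athlete_id)
        if age is None or score is None:
            continue  # no profile (or no age) for this athlete, or no score
        label = age_group(age)
        if label is not None:
            groups[label].append((athlete_id, score))

    boards = {}
    for label, entries in groups.items():
        entries.sort(key=lambda e: e[1], reverse=higher_is_better)
        ranked = []
        for position, (athlete_id, score) in enumerate(entries, start=1):
            # Tied scores share the rank of the first athlete with that score.
            if ranked and ranked[-1][2] == score:
                rank = ranked[-1][0]
            else:
                rank = position
            ranked.append((rank, athlete_id, score))
        boards[label] = ranked
    return boards
```

Two caveats apply. Athletes without a profile page simply drop out of the result, and the profile age is the age at the time of scraping, which may differ from the age used to assign a division in an earlier year.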

      • Martin H

        Thanks! I think the easiest way to scrape the age group leaderboards would be to use the division parameter directly. It looks like you scrape division=1 and division=2, but you can also use divisions 3-17, which include all of the masters age groups, teens, and teams.

        It’s a pity they got rid of the stage parameter for 2015 and 2016. It was a useful way of tracking how many points you needed after each WOD to be in the top 20 or 200, to make either Regionals or the Masters Qualifier.
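With per-workout ranks in hand, that cutoff can be recomputed offline. The sketch below assumes overall Open points are the sum of an athlete's ranks across workouts, with the lowest total placing best, and that the input is a list of `(athlete_id, stage, rank)` tuples; the function name and input shape are hypothetical.

```python
from collections import defaultdict


def cutoff_after_each_stage(ranks, top_n):
    """Return {stage: cumulative points held by the athlete in place top_n}.

    ranks: iterable of (athlete_id, stage, rank) with stage >= 1.
    Athletes missing a rank for any stage are dropped from that stage onward.
    The cutoff is None when fewer than top_n athletes remain.
    """
    by_stage = defaultdict(dict)
    for athlete_id, stage, rank in ranks:
        by_stage[stage][athlete_id] = rank

    totals = defaultdict(int)
    eligible = None
    cutoffs = {}
    for stage in sorted(by_stage):
        stage_ranks = by_stage[stage]
        # Keep only athletes who have posted a rank in every stage so far.
        if eligible is None:
            eligible = set(stage_ranks)
        else:
            eligible &= set(stage_ranks)
        for athlete_id in eligible:
            totals[athlete_id] += stage_ranks[athlete_id]
        standings = sorted(totals[a] for a in eligible)
        cutoffs[stage] = standings[top_n - 1] if len(standings) >= top_n else None
    return cutoffs
```

Calling it with `top_n=20` or `top_n=200` on one division's rows gives the running point total of the last qualifying place after each workout.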

  • Kevin D

    This is great stuff! Thanks! Do you happen to have any of the 2016 data yet? I am really interested – I just joined Qlik and am learning how to use their software so I’m downloading (thank you for all of your hard work) the data so I can then (hopefully!) learn how to create some cool visualizations! Once I get it looking good – happy to share. If you have any new data other than what I just downloaded as of 6/16/2016 – would love to have it…thanks so much! Cheers! Kevin