
On data provenance, data hygiene and replicable analysis (or being prepared for tough questions).

Jason Thatcher

As a Political Science student, I was taught to retain the data & syntax for all of my studies, because a reader might request them. I was trained to be a policy analyst, and if you are informing policy decisions, then you want to be able to replicate & confirm the analysis.

When I switched to a business discipline, my Ph.D. program did not emphasize retaining data or syntax nearly as much - the presumption was that we trust each other. My field was a small one - where everyone knew everyone - & the idea that someone you knew might cook data was anathema. The field was, & mostly remains, trust- & relationship-based, with strong social norms about how to conduct research.

However, my field is no longer a small one, & while the idea of cooking data remains anathema, things have changed. As strong ties have become weak ones, as the field has globalized, & as new subcommunities have emerged, relationship-based social norms about how to conduct research have fragmented. In parallel, the replication crisis emerged in behavioral science (e.g., no one seemed to be able to replicate key papers) & big data became problematic (e.g., folks secured proprietary firm data or assembled unique "mashed-up" datasets). Where discipline-based change eroded social norms, these external changes seem to have undermined trust in data analysis.

Today, across disciplines, we've seen a pivot from blind trust in analysis to greater transparency - as there has been in policy research for some time. In fact, many fields now go beyond asking authors to make data & syntax available to readers; they verify the analysis before papers appear in major publications.

So how can an early career scholar navigate this emerging landscape?

First, practice good data hygiene. When you create a dataset (e.g., download, mash up, or parse something out of a bigger dataset), store the original data in a separate, secure file. You may need to make it available to reviewers.

Second, develop a clear #datadictionary. The data dictionary should define each variable, how it is measured, and, if a variable is calculated from raw data, what calculations were performed. These should not be secrets if you are sending data out for peer review.

Third, think through how you will anonymize data. Few programs, if any, talk about anonymizing data. Take the time to ensure no individual or organization is identifiable, & document how you did it (a small sketch of one approach appears at the end of this post).

Fourth, retain records on data provenance. Keep a document in that core folder with the #raw data, including any agreements needed to secure the data, when the data were collected, & what rules were used to include & exclude cases.

Fifth, save your #syntax. You don't have to explain the syntax - a reader should be able to figure it out - but you do have to provide it for verification.

These steps make it possible for you, or a reviewer, to #replicate your work. Best of luck!
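For readers who want something concrete, here is a minimal sketch of what the first three steps might look like in practice - keeping the raw file untouched, pseudonymizing identifiers, & writing a bare-bones data dictionary. It is not a prescribed workflow: the file names, column names, & salt below are illustrative assumptions, & R, Stata, or SPSS syntax would serve just as well as Python.

```python
# A minimal sketch, assuming a hypothetical raw survey file with columns
# respondent_id, name, email, and some study variables. Paths, columns,
# and the salt are illustrative, not a recommended standard.
import hashlib
import pandas as pd

RAW_PATH = "data/raw/survey_2024.csv"          # original download, never edited
CLEAN_PATH = "data/clean/survey_2024_anon.csv" # what goes out for review

def pseudonymize(value, salt="project-salt"):
    """Replace an identifier with a salted hash so no person is recognizable."""
    return hashlib.sha256((salt + str(value)).encode()).hexdigest()[:12]

df = pd.read_csv(RAW_PATH)
df["respondent_id"] = df["respondent_id"].map(pseudonymize)
df = df.drop(columns=["name", "email"])        # drop direct identifiers entirely
df.to_csv(CLEAN_PATH, index=False)

# A bare-bones data dictionary: variable, how it is measured, any calculation.
data_dictionary = pd.DataFrame([
    {"variable": "respondent_id", "measurement": "salted SHA-256 pseudonym",
     "calculation": "hash of raw respondent ID"},
    {"variable": "job_satisfaction", "measurement": "1-7 Likert scale",
     "calculation": "mean of items js1-js3"},
])
data_dictionary.to_csv("data/clean/data_dictionary.csv", index=False)
```

The point is less the particular code than the habit it encodes: the raw file stays untouched in its own folder, the anonymization rule is written down somewhere a reviewer can find it, & the data dictionary travels with the dataset.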






 
 
 
