You can have anonymous data or you can have open science, but you can’t have both.
That’s the conclusion that several colleagues and I reach in an article now online at Queue and forthcoming in Communications of the Association for Computing Machinery.
The short version: many people have called for making science more open and transparent by posting data openly. This allows researchers to check each other’s work and to aggregate smaller datasets into larger ones. One saying that I’m fond of is: “the best use of your dataset is something that someone else will come up with.” The problem is that increasingly, all of this data is about us. In education, it’s about our demographics, our learning behavior, and our performance. Across the social sciences, it’s about our health, our beliefs, and our social connections. Sharing and merging data increases the risk of disclosing those data.
The article shares a case study of our efforts to strike a balance between anonymity and open science by de-identifying a dataset of learner data from HarvardX and releasing it to the public. In order to de-identify the data to a standard that we thought was reasonably resistant to re-identification efforts, we had to delete some records and blur some variables. If a learner’s combination of identifying variables was too unique, we either deleted the record or scrubbed the data to make it look less unique. The result was suitable for release (in our view), but as we looked more closely at the released dataset, it wasn’t suitable for science. We had scrubbed the data to the point where it was problematically dissimilar from the original dataset. If you do research using our data, you can’t be sure whether your findings are legitimate or an artifact of de-identification.
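The record-deletion step described above is essentially k-anonymity enforced by suppression: any record whose combination of quasi-identifying variables is shared by fewer than k learners gets dropped. A minimal sketch of that idea follows; the field names, the threshold, and the suppression-only approach are illustrative assumptions (the actual HarvardX pipeline also blurred variables rather than only deleting records):

```python
from collections import Counter

def k_anonymize(records, quasi_identifiers, k=5):
    """Keep only records whose quasi-identifier combination is shared
    by at least k records in the dataset; suppress (delete) the rest.
    A simplified, hypothetical stand-in for a real de-identification pipeline."""
    # Count how many records share each combination of quasi-identifiers.
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    # Suppress records whose combination is too rare (too unique).
    return [r for r in records
            if counts[tuple(r[q] for q in quasi_identifiers)] >= k]

# Toy data: country and gender stand in for quasi-identifying variables.
data = [
    {"id": 1, "country": "US", "gender": "f"},
    {"id": 2, "country": "US", "gender": "f"},
    {"id": 3, "country": "FR", "gender": "m"},  # unique combination -> suppressed
]
released = k_anonymize(data, ["country", "gender"], k=2)
print(len(released))  # prints 2
```

The sketch also makes the scientific cost visible: the suppressed records are not a random sample (they are precisely the most unusual learners), so analyses on the released data are systematically biased relative to the original.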
This was a powerful revelation for many of us, especially in the face of evidence that the weapons of re-identification, in the long run, will probably outpace the shields of de-identification. We all increasingly share so much about ourselves, and ultimately the datasets created outside learning platforms can be merged with datasets from learning platforms to re-identify people. It may simply not be possible to do science with anonymized data, in education or anywhere in the social sciences.
Right now, we conflate privacy with anonymity, though we need not. The Federalist Papers were anonymous but not private. Voting is private but not anonymous. If we are going to have open science with human subjects data, we’ll need to explore new approaches to balancing open science and privacy. We conclude our essay:
If we must have trust in researchers to enable open science, then researchers will need to earn that trust.
The opinions expressed in EdTech Researcher are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.