
Privacy, Anonymity, and Big Data in the Social Sciences

By Justin Reich — August 17, 2014 2 min read

You can have anonymous data or you can have open science, but you can’t have both.

That’s the conclusion that several colleagues and I reach in an article now online at Queue and forthcoming in Communications of the Association for Computing Machinery.

The short version: many people have called for making science more open and transparent by sharing data and posting data openly. This allows researchers to check each other’s work and to aggregate smaller datasets into larger ones. One saying that I’m fond of is: “the best use of your dataset is something that someone else will come up with.” The problem is that increasingly, all of this data is about us. In education, it’s about our demographics, our learning behavior, and our performance. Across the social sciences, it’s about our health, our beliefs, and our social connections. Sharing and merging data adds to the risk of disclosing those data.

The article shares a case study of our efforts to strike a balance between anonymity and open science by de-identifying a dataset of learner data from HarvardX and releasing it to the public. In order to de-identify the data to a standard that we thought was reasonably resistant to reidentification efforts, we had to delete some records and blur some variables. If a learner’s combination of identifying variables was too unique, we either deleted the record or scrubbed the data to make it look less unique. The result was suitable for release (in our view), but as we looked more closely at the released dataset, it wasn’t suitable for science. We scrubbed the data to the point where it was problematically dissimilar from the original dataset. If you do research using our data, you can’t be sure if your findings are legitimate or an artifact of de-identification.
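To make the tradeoff concrete, here is a minimal sketch of the "delete records that are too unique" step. It is not the procedure we used on the HarvardX data. The field names, the sample learners, the threshold `k`, and the helper `suppress_rare_records` are all hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter


def suppress_rare_records(records, quasi_identifiers, k):
    """Keep only records whose quasi-identifier combination occurs at least k times."""
    def key(record):
        return tuple(record[name] for name in quasi_identifiers)

    counts = Counter(key(record) for record in records)
    return [record for record in records if counts[key(record)] >= k]


def mean_grade(records):
    """Average the hypothetical 'grade' field over a list of records."""
    return sum(record["grade"] for record in records) / len(records)


# Hypothetical learner records; the last one has a rare combination of identifiers.
learners = [
    {"country": "US", "birth_year": 1990, "grade": 0.91},
    {"country": "US", "birth_year": 1990, "grade": 0.40},
    {"country": "US", "birth_year": 1990, "grade": 0.75},
    {"country": "IS", "birth_year": 1948, "grade": 0.98},
]

released = suppress_rare_records(learners, ["country", "birth_year"], k=3)

# The released data no longer has the same mean grade as the original.
print(mean_grade(learners), mean_grade(released))
```

Even in this toy case, dropping the one unusual learner changes the average grade, which is the kind of distortion that made our released dataset hard to trust for research.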

This was a powerful revelation for many of us, especially in the face of evidence that the weapons of re-identification, in the long run, will probably outpace the shields of de-identification. We all increasingly share so much about ourselves, and ultimately datasets created outside learning platforms can be merged with datasets from learning platforms to re-identify people. It may simply not be possible to do science with anonymized data, in education or anywhere in the social sciences.

Right now, we conflate privacy with anonymity, though we need not. The Federalist Papers were anonymous but not private. Voting is private but not anonymous. If we are going to have open science with human subjects data, we’ll need to explore new approaches to balancing open science and privacy. We conclude our essay:

This example of our efforts to de-identify a simple set of student data--a tiny fraction of the granular event logs available from the edX platform--reveals a conflict between open data, the replicability of results, and the potential for novel analyses on one hand, and the anonymity of research subjects on the other. This tension extends beyond MOOC data to much of social science data, but the challenge is acute in educational research because FERPA conflates anonymity--and therefore de-identification--with privacy. One conclusion could be that the data is too sensitive to share; so if de-identification has too large an impact on the integrity of a data set, then the data should not be shared. We believe that this is an undesirable position, because the few researchers privileged enough to have access to the data would then be working in a bubble where few of their peers have the ability to challenge or augment their findings. Such limits would, at best, slow down the advancement of knowledge. At worst, these limits would prevent groundbreaking research from ever being conducted. Neither abandoning open data nor loosening student privacy protections is a wise option. Rather, the research community should vigorously pursue technology and policy solutions to the tension between open data and privacy. A promising technological solution is differential privacy.3 Under the framework of differential privacy, the original data is maintained, but raw PII is not accessed by the researcher. Instead, it resides in a secure database that has the ability to answer questions about the data. A researcher can submit a model--a regression equation, for example--to the database, and the regression coefficients and R-squared are returned. 
Differential privacy has challenges of its own, and remains an open research question because implementing such a system would require carefully crafting limits around the number and specificity of questions that can be asked in order to prevent identification of subjects. For example, no answer could be returned if it drew upon fewer than k rows, where k is the same minimum cell size used in k-anonymity. Policy changes may be more feasible in the short term. An approach suggested by the U.S. PCAST (President's Council of Advisors on Science and Technology) is to accept that anonymization is an obsolete tactic made increasingly difficult by advances in data mining and big data.14 PCAST recommends that privacy policy emphasize that the use of data should not compromise privacy and should focus "on the 'what' rather than the 'how.'"14 One can imagine a system whereby researchers accessing an open data set would agree to use the data only to pursue particular ends, such as research, and not to contact subjects for commercial purposes or to rerelease the data. Such a policy would need to be accompanied by provisions for enforcement and audits, and the creation of practicable systems for enforcement is, admittedly, no small feat. We propose that privacy can be upheld by researchers bound to an ethical and legal framework, even if these researchers can identify individuals and all of their actions. If we want to have high-quality social science research and privacy of human subjects, we must eventually have trust in researchers. Otherwise, we'll always have a strict tradeoff between anonymity and science.
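The "no answer if it drew upon fewer than k rows" rule from the passage above can be sketched in a few lines. This is only an illustration under loud assumptions: the class `GuardedDataset`, its `mean` method, and the sample rows are hypothetical, and a simple mean stands in for the regression the essay mentions. It also shows only the minimum-row rule; a differentially private system would, in addition, add calibrated random noise to the answers and limit how many questions can be asked.

```python
class GuardedDataset:
    """Holds raw rows privately and answers only sufficiently aggregated queries."""

    def __init__(self, rows, k):
        self._rows = list(rows)  # raw data never leaves this object
        self._k = k              # minimum number of rows an answer may draw upon

    def mean(self, column, predicate):
        """Return the mean of `column` over matching rows, or None if fewer than k match."""
        matching = [row[column] for row in self._rows if predicate(row)]
        if len(matching) < self._k:
            return None  # refuse: the answer would describe too few people
        return sum(matching) / len(matching)


# Hypothetical rows for illustration only.
rows = [
    {"country": "US", "grade": 0.91},
    {"country": "US", "grade": 0.40},
    {"country": "US", "grade": 0.75},
    {"country": "IS", "grade": 0.98},
]

guarded = GuardedDataset(rows, k=3)
us_mean = guarded.mean("grade", lambda row: row["country"] == "US")
iceland_mean = guarded.mean("grade", lambda row: row["country"] == "IS")
print(us_mean, iceland_mean)
```

The researcher gets an answer about the three U.S. learners but no answer about the single learner from Iceland, and never sees an individual row.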

If we must have trust in researchers to enable open science, then researchers will need to earn that trust.

For regular updates, follow me on Twitter at @bjfr and for my papers, presentations and so forth, visit EdTechResearcher.


The opinions expressed in EdTech Researcher are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.





