
MITx and HarvardX Release De-Identified Dataset from First Year of MOOCs

By Justin Reich, May 30, 2014

I’m pleased to announce today that my colleagues at HarvardX and MITx have released a de-identified person-course dataset from 16 courses from the first year of edX, the same dataset that was used to produce HarvardX and MITx: The First Year of Open Online Courses. This massive effort was led by Jon Daries of the MIT Office of Institutional Research. We’re hopeful that this dataset signifies our commitment to providing data to the research community that advances the science of learning while protecting student privacy. We’re excited to see what folks do with it.

Here’s some background on what is in the dataset, and what we did with it.

A “person-course” dataset means that every row in the dataset corresponds to one registration in one course. A student who registered for Intro to Computer Science from HarvardX and Biology from MITx would have two rows in the dataset. The columns of the dataset are variables like age, gender, grade, whether you earned a certificate, the number of days you were active in a course, and so forth.
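To make the layout concrete, here is a minimal sketch of a person-course table in Python. The field names (`user_id`, `course_id`, `certified`, `ndays_act`) and all values are assumptions made up for illustration; they are not taken from the released files.

```python
from collections import Counter

# Hypothetical person-course rows: one row per (student, course) registration.
rows = [
    {"user_id": "u1", "course_id": "HarvardX/IntroCS", "certified": True, "ndays_act": 42},
    {"user_id": "u1", "course_id": "MITx/Biology", "certified": False, "ndays_act": 3},
    {"user_id": "u2", "course_id": "MITx/Biology", "certified": False, "ndays_act": 1},
]


def registrations_per_student(rows):
    """Count how many rows (registrations) each student contributes."""
    return Counter(row["user_id"] for row in rows)


counts = registrations_per_student(rows)
print(counts["u1"])  # the student registered in two courses has two rows
```

The point of the sketch is that the unit of analysis is the registration, not the student, so one person can appear several times.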

De-identified means two kinds of things. First, we removed all of the obvious things that would let a person identify an individual student, like a name or email. We also hashed (scrambled and replaced with an arbitrary string of characters) values like the user_id, so this dataset cannot be connected to other datasets, but each unique student can still be distinguished from the others within the dataset.
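The post does not say which hashing scheme was used, so the following is only a sketch of the general idea, assuming a keyed SHA-256 hash (HMAC) from the Python standard library. The secret key and the function name `pseudonymize` are hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept private and never released.
SECRET_KEY = b"example-secret-not-for-real-use"


def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Return a stable, opaque token for a user_id using keyed SHA-256 (HMAC)."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


token_a = pseudonymize("student-123")
token_b = pseudonymize("student-123")
token_c = pseudonymize("student-456")
print(token_a == token_b, token_a == token_c)
```

Because the same input always yields the same token, a student's rows stay linked to one another inside the dataset, while the token itself reveals nothing that would match an ID in another dataset.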

Next, we look at anyone who has a unique combination of variables, and we try to either modify those variables or remove rows from the dataset until they are no longer unique. Let’s say you are the only person in Biology from Bulgaria, and you introduce yourself as such. If we identify your country of origin as Bulgaria, people can go to the forums and figure out more things about you. So for all countries with fewer than 5,000 people in the dataset, we just list their general region in the world.
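A rough sketch of that kind of generalization is below. The 5,000 threshold comes from the text; everything else is an assumption: the `country` field name, the tiny region lookup, and the choice to count rows rather than distinct people.

```python
from collections import Counter

MIN_COUNTRY_COUNT = 5000  # threshold stated in the post

# Hypothetical lookup; a real one would need an entry for every country.
REGION_OF = {"Bulgaria": "Europe", "Peru": "South America"}


def generalize_countries(rows, min_count=MIN_COUNTRY_COUNT, region_of=REGION_OF):
    """Return copies of rows with rare countries replaced by a broad region."""
    counts = Counter(row["country"] for row in rows)
    result = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if counts[row["country"]] < min_count:
            new_row["country"] = region_of.get(row["country"], "Other region")
        result.append(new_row)
    return result


# Tiny demo with a lowered threshold so the effect is visible.
demo = [{"country": "United States"}] * 3 + [{"country": "Bulgaria"}]
coarse = generalize_countries(demo, min_count=2)
print([row["country"] for row in coarse])
```

In the demo, the single Bulgarian registrant is reported as being from Europe, while the more common country is left as it was.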

There are also people with extreme behavior or high activity. For instance, most people never post in the forum, but a few do often: 847 times or 93 times. If someone scraped the forums and counted everyone’s posts, then they could identify these people. Rather than modify their data, we delete these rows. *This is Important.* To protect anonymity, very active registrants are deleted from the data.

There are also people with unique combinations of values in the dataset. We’ve set things so that no person can be distinguished from at least four other people in the dataset (in technical terms, this means we maintain a k-anonymity of 5). For instance, lots of people sign up for unique combinations of courses. If you register for six courses, you might be the only one like that. So we start dropping your rows from the dataset (using crazy complicated algorithms to minimize data loss and prevent bias against big courses) until you start looking like at least four other people.
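For readers new to k-anonymity, here is a deliberately naive sketch of checking and enforcing it. It is not the team's algorithm, which the post describes only as complicated and tuned to minimize data loss. It works on single rows rather than on a student's full set of courses, and the field names are hypothetical.

```python
from collections import Counter

K = 5  # the k-anonymity level stated in the post


def quasi_key(row, quasi_identifiers):
    """Build the tuple of values that could be combined to single someone out."""
    return tuple(row[name] for name in quasi_identifiers)


def violating_groups(rows, quasi_identifiers, k=K):
    """Return the combinations shared by fewer than k rows, with their counts."""
    counts = Counter(quasi_key(row, quasi_identifiers) for row in rows)
    return {key: n for key, n in counts.items() if n < k}


def suppress_small_groups(rows, quasi_identifiers, k=K):
    """Naive suppression: drop every row whose combination is too rare."""
    bad = violating_groups(rows, quasi_identifiers, k)
    return [row for row in rows if quasi_key(row, quasi_identifiers) not in bad]


# Five rows share one combination; one row is unique and gets dropped.
sample = [{"region": "Europe", "gender": "f"}] * 5 + [{"region": "Asia", "gender": "m"}]
fields = ["region", "gender"]
released = suppress_small_groups(sample, fields)
print(len(sample), len(released), violating_groups(released, fields))
```

Plain suppression like this is the bluntest option. It shows why rare, highly active registrants are the ones most likely to disappear from a de-identified release.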

Finally, we got a smart group of Harvard computer science students and had them try to break the dataset and re-identify people. We feel pretty good about where we are right now.

It’s always possible that people could re-identify students. This is particularly true if, for instance, someone scraped all of the edX forums, and all data from social media like Facebook and Twitter where people say things like “OMG! I’m from Tobago and I signed up for Biology, Justice, and Heroes, and I’m going to post in the forums in each course exactly 6 times.” We think we’ve minimized these risks, but we know they are non-zero.

Our original person-course dataset had about 840,000 rows, and the released dataset has 740,000 rows. In the original dataset, 5% of registrants earned a certificate in a course. In the new dataset, 3% of registrants earned a certificate in a course. So de-identification has a non-trivial impact on the composition of the dataset, especially with regard to our most active users. It will be interesting to see people reproduce our results, and the kinds of discrepancies that come about.
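A back-of-the-envelope calculation shows how uneven that loss is. The sketch below takes the rounded figures above at face value (about 840,000 and 740,000 rows, 5% and 3% certificate rates), so its results are approximate. The function name `summarize_loss` is hypothetical.

```python
def summarize_loss(rows_before, rows_after, cert_rate_before, cert_rate_after):
    """Estimate what share of all rows and of certificate rows was removed."""
    certs_before = rows_before * cert_rate_before
    certs_after = rows_after * cert_rate_after
    return {
        "share_of_rows_removed": (rows_before - rows_after) / rows_before,
        "share_of_certificate_rows_removed": (certs_before - certs_after) / certs_before,
    }


summary = summarize_loss(840_000, 740_000, 0.05, 0.03)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

Under those assumptions, roughly 12% of all rows were removed, but close to half of the rows with a certificate, which is why analyses of the most active users are the most likely to diverge.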

Our team has tried to balance making data accessible to other researchers with protecting the privacy of our users. I hope we’ve done that well, but we look forward to hearing feedback from researchers, privacy advocates, and others.

For regular updates, follow me on Twitter at @bjfr, and for my papers, presentations, and so forth, visit EdTechResearcher.

The opinions expressed in EdTech Researcher are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.
