MITx and HarvardX Release De-Identified Dataset from First Year of MOOCs

By Justin Reich — May 30, 2014 3 min read

I’m pleased to announce today that my colleagues at HarvardX and MITx have released a de-identified person-course dataset from 16 courses from the first year of edX, the same dataset that was used to produce HarvardX and MITx: The First Year of Open Online Courses. This massive effort was led by Jon Daries of the MIT Office of Institutional Research. We’re hopeful that this dataset signifies our commitment to providing data to the research community that advances the science of learning while protecting student privacy. We’re excited to see what folks do with it.

Here’s some background on what is in the dataset, and what we did with it.

A “person-course” dataset means that every row in the dataset corresponds to one registration in one course. A student who registered for Intro to Computer Science from HarvardX and Biology from MITx would have two rows in the dataset. The columns of the dataset are variables like age, gender, grade, whether you earned a certificate, the number of days you were active in a course, and so forth.
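
To make the shape concrete, here is a minimal sketch of a person-course table in Python. The column names and values are made up for illustration and are not the column names of the released dataset.

```python
from collections import Counter

# Hypothetical rows: one row per (student, course) registration.
person_course = [
    {"user_id": "s1", "course": "HarvardX Intro to Computer Science",
     "certified": False, "days_active": 3},
    {"user_id": "s1", "course": "MITx Biology",
     "certified": True, "days_active": 41},
    {"user_id": "s2", "course": "MITx Biology",
     "certified": False, "days_active": 1},
]

# A student who registered for two courses shows up in two rows.
rows_per_student = Counter(row["user_id"] for row in person_course)
print(rows_per_student["s1"])
```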

De-identified means two kinds of things. First, we removed all of the obvious things that would let a person identify an individual student, like a name or email. We also hashed (scrambled and replaced with an arbitrary string of characters) values like the user_id, so this dataset cannot be connected to other datasets, but the same student can still be followed across their rows within the dataset.
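
As a rough illustration of the hashing idea, the sketch below replaces an ID with a keyed hash using Python’s standard hmac and hashlib modules. The choice of HMAC-SHA256, the key, and the function name are assumptions for this example; the post does not say which hashing method was actually used.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be random and kept private.
SECRET_KEY = b"replace-with-a-private-random-key"

def pseudonymize(user_id: str, key: bytes = SECRET_KEY) -> str:
    """Map a user_id to an arbitrary-looking but stable string."""
    digest = hmac.new(key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# The same input always gives the same pseudonym, so one student's rows
# still line up, but the original ID is not shown.
print(pseudonymize("12345") == pseudonymize("12345"))
print(pseudonymize("12345") == pseudonymize("12346"))
```

Using a secret key matters here: a plain unkeyed hash of a small set of numeric IDs could be reversed by hashing every possible ID and comparing.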

Next, we look at anyone who has a unique combination of variables, and we try to either modify those variables or remove rows from the dataset until they are no longer unique. Let’s say you are the only person in Biology from Bulgaria, and you introduce yourself as such. If we identify your country of origin as Bulgaria, people can go to the forums, and figure out more things about you. So for all countries with fewer than 5,000 people in the dataset, we just list their general region in the world.
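
The country rule can be sketched in a few lines of Python. This is a simplified illustration, not the code used for the release: the function name, the column names, and the region label are assumptions, and the toy example uses a threshold of 2 instead of 5,000 so it fits on screen.

```python
from collections import defaultdict

def generalize_small_countries(rows, region_of, min_people=5000):
    """Replace a country with its region when fewer than min_people students list it."""
    people = defaultdict(set)
    for row in rows:
        people[row["country"]].add(row["user_id"])

    released = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if len(people[row["country"]]) < min_people:
            new_row["country"] = region_of.get(row["country"], "Other/Unknown")
        released.append(new_row)
    return released

region_of = {"Bulgaria": "Europe"}  # hypothetical region label
rows = [
    {"user_id": "a", "country": "United States"},
    {"user_id": "b", "country": "United States"},
    {"user_id": "c", "country": "Bulgaria"},
]
released = generalize_small_countries(rows, region_of, min_people=2)
print([row["country"] for row in released])
```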

There are also people with extreme behavior or high activity. For instance, most people never post in the forum, but a few do often: 847 times or 93 times. If someone scraped the forums and counted everyone’s posts, then they could identify these people. Rather than modify their data, we delete these rows. *This is Important.* To protect anonymity, very active registrants are deleted from the data.
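
In code, that kind of deletion is just a filter. The sketch below assumes a hypothetical forum_posts column and a made-up cutoff of 20 posts; the post does not say what cutoff was actually used.

```python
def drop_high_activity(rows, max_posts):
    """Keep only registrations with at most max_posts forum posts."""
    return [row for row in rows if row["forum_posts"] <= max_posts]

# Toy data: most people post rarely, a couple post a lot.
rows = [
    {"user_id": "a", "forum_posts": 0},
    {"user_id": "b", "forum_posts": 2},
    {"user_id": "c", "forum_posts": 93},
    {"user_id": "d", "forum_posts": 847},
]
released = drop_high_activity(rows, max_posts=20)  # hypothetical cutoff
print([row["user_id"] for row in released])
```

Note what is lost: the rows that disappear are exactly the most engaged registrants, so averages computed on the released rows will understate activity.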

There are also people with unique combinations of values in the dataset. We’ve set things so that no person can be distinguished from at least four other people in the dataset (in technical terms, this means we maintain a k-anonymity of 5). For instance, lots of people sign up for unique combinations of courses. If you register for six courses, you might be the only one like that. So we start dropping your rows from the dataset (using crazy complicated algorithms to minimize data loss and prevent bias against big courses) until you start looking like at least four other people.
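
Here is a deliberately naive sketch of what k-anonymity of 5 means in practice: group rows by the identifying columns and keep only groups with at least five members. It is not the algorithm used for the release, which the paragraph above describes as far more careful about data loss and bias; the function and column names are assumptions.

```python
from collections import Counter

def suppress_small_groups(rows, quasi_identifiers, k=5):
    """Keep only rows whose quasi-identifier combination occurs at least k times."""
    def key(row):
        return tuple(row[name] for name in quasi_identifiers)

    group_sizes = Counter(key(row) for row in rows)
    return [row for row in rows if group_sizes[key(row)] >= k]

# Toy data: five look-alike registrants and one unique one.
rows = [{"gender": "f", "region": "Europe", "birth_year": 1990} for _ in range(5)]
rows.append({"gender": "m", "region": "Caribbean", "birth_year": 1950})

released = suppress_small_groups(rows, ["gender", "region", "birth_year"], k=5)
print(len(rows), "->", len(released))
```

Because whole groups are removed at once, every combination that survives still has at least k rows, so one pass is enough in this simple version.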

Finally, we got a smart group of Harvard computer science students and had them try to break the dataset and re-identify people. We feel pretty good about where we are right now.

It’s always possible that people could re-identify students. That is particularly true if, for instance, someone scraped all of the edX forums and all data from social media like Facebook and Twitter, where people say things like “OMG! I’m from Tobago and I signed up for Biology, Justice, and Heroes, and I’m going to post in the forums in each course exactly 6 times.” We think we’ve minimized these risks, but we know they are non-zero.

Our original person-course dataset had about 840,000 rows, and the released dataset has 740,000 rows. In the original dataset, 5% of registrants earned a certificate in a course. In the new dataset, 3% of registrants earned a certificate in a course. So de-identification has a non-trivial impact on the composition of the dataset, especially with regard to our most active users. It will be interesting to see people reproduce our results, and the kinds of discrepancies that come about.
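
A quick back-of-the-envelope calculation shows how lopsided the loss is. It uses only the rounded figures above and assumes the certificate percentages apply to rows, so the results are rough estimates rather than exact counts.

```python
# Rounded figures from the paragraph above.
original_rows = 840_000
released_rows = 740_000
original_cert_rate = 0.05
released_cert_rate = 0.03

# Share of all rows removed by de-identification.
removed_rows = original_rows - released_rows
removed_share = removed_rows / original_rows

# Approximate share of certificate-earning rows that were removed.
original_certs = original_rows * original_cert_rate
released_certs = released_rows * released_cert_rate
certs_removed_share = (original_certs - released_certs) / original_certs

print(f"rows removed: {removed_rows} ({removed_share:.0%})")
print(f"certificate rows removed: about {certs_removed_share:.0%}")
```

On these rounded numbers, roughly 12% of all rows were removed, but something like 47% of the certificate-earning rows were.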

Our team has tried to balance making data accessible to other researchers with protecting the privacy of our users. I hope we’ve done that well, but we look forward to hearing feedback from researchers, privacy advocates, and others.

For regular updates, follow me on Twitter at @bjfr and for my papers, presentations and so forth, visit EdTechResearcher.

The opinions expressed in EdTech Researcher are strictly those of the author(s) and do not reflect the opinions or endorsement of Editorial Projects in Education, or any of its publications.

