Privacy & Security

Ed. Data-Mining Research Effort Wins Federal Grant, Raises Privacy Questions

By Benjamin Herold, October 10, 2014

The National Science Foundation earlier this month awarded a $4.8 million grant to a coalition of prominent research universities aiming to build a massive repository for storing, sharing, and analyzing the information students generate when using digital learning tools.

The project, dubbed “LearnSphere,” highlights the continued optimism that “big” educational data might be used to dramatically transform K-12 schooling.

It also raises new questions in the highly charged debate over student-data privacy.

The federally funded initiative will be led by researchers at Carnegie Mellon University, in Pittsburgh, who propose to construct a new data-sharing infrastructure that is distributed across multiple institutions, including third-party and for-profit vendors. When complete, LearnSphere is likely to hold a massive amount of anonymous information, including:

  • “Clickstream” and other digital-interaction data generated by students using digital software provided to schools by LearnSphere participants;
  • Chat-window dialogue sent by students participating in some online courses and tutoring programs;
  • Potentially, “affect” and biometric data, including information generated from classroom observations, computerized analysis of students’ posture, and sensors placed on students’ skin.

Proponents say that facilitating the sharing and analysis of such information for research purposes can lead to new insights about how humans learn, as well as rapid improvements to the digital learning software now flooding schools.

“We are going to be able to understand the learning and instructional processes so much better than we already do,” said Ken Koedinger, a professor of human-computer interaction and psychology at Carnegie Mellon and the principal investigator on the NSF grant, in an interview.

LearnSphere, he said, “is going to be very powerful in making courses more adaptive and personalized and better able to anticipate the sticking points an average student encounters.”

But prominent student-data-privacy advocates expressed reservations about the initiative, saying that rapid expansion in the collection of students’ digital data, even when done primarily for research purposes, is fraught with potential problems related to notification, consent, and data ownership.

At minimum, said Khaliah Barnes, a lawyer with the Electronic Privacy Information Center, a Washington-based advocacy group, LearnSphere “warrants further evaluation” before moving forward.

“It’s not at all clear that parents and students are fine with having their information data-mined in this way,” Barnes said.

A Data-Sharing Infrastructure

All told, the NSF’s award to LearnSphere was one of 14 grants totaling $31 million that the federal agency announced earlier this month as part of its Data Infrastructure Building Blocks program, also known as DIBBS.

The goal of the program is to promote interdisciplinary collaboration and innovation in a wide variety of scientific fields.

In an interview, officials at the NSF said the creation of data-sharing infrastructure has helped fuel advances elsewhere. Education may finally be ready to catch up.

“We’re now able to collect massive amounts of information on individual students we weren’t able to collect 10 years ago,” John Cherniavsky, a senior advisor for research at the NSF, said. “It presents an opportunity in the education-research domain that has been available in the physical sciences for decades.”

LearnSphere won funding, the NSF officials said, in large part because the effort will build off of extensive work that researchers at Carnegie Mellon have already done.

Koedinger first won acclaim in the early 2000s for his involvement in the development of adaptive-learning software known as Cognitive Tutor, which uses big-data analysis to help spot the points at which students get stymied or disrupted in their learning process, then provides targeted help to get them back on track.

In 2004, with NSF funding, Carnegie Mellon and the University of Pittsburgh jointly founded the Pittsburgh Science of Learning Center. Led by Koedinger, researchers at the center study human learning and seek to apply the findings to the development of teaching tools.

The center in recent years created DataShop, an educational data repository that holds information from more than 550 datasets.

Cognitive Tutor software, currently used by about 600,000 middle- and high-school students in the U.S., provides a large chunk of data to the repository, Koedinger said.

Additional data are generated by students’ use of other interactive tutor systems and software programs, digital educational games and simulations, and massive open online courses, or MOOCs.

Some of those tools come from universities, and some come from private companies and developers.

The data to be stored in the LearnSphere database and analyzed by researchers will be far-reaching, Koedinger said. They would likely include, for example, records of every mouse click a student makes when using a software program and information demonstrating a student’s thought process when attempting to solve a problem in an online simulation.

The LearnSphere data will also likely include the text that college students type when participating in a discussion board for a MOOC, and that K-12 students type when interacting with a dialogue-based adaptive tutoring system.

And it may also include information on what Koedinger described as students’ “affective emotional states,” such as whether they are bored or frustrated, as gauged through either classroom observations or sensor technology that can detect an individual’s posture, their skin’s conductance of electricity, and more.

Upending Conventional Wisdom

Part of what Koedinger hopes will make LearnSphere powerful is the ability to connect such varying data streams to each other in order to conduct large-scale analyses.

Already, he said, “we have shown some pretty interesting results in being able to detect different [emotional] states from keystroke data.”

Such findings, Koedinger said, might be used to improve the ability of adaptive software to determine when a student is becoming uninterested in a digital lesson, allowing the program to provide on-the-spot encouragement or remedial help.

Analysis of the types of data that LearnSphere proposes to store can also lead to surprising insights about how to best teach students, Koedinger said. He cited recent findings by his team at Carnegie Mellon that, contrary to conventional wisdom, students actually seem to learn algebra better when they are first introduced to problems in the form of a story, rather than in the form of an equation.

James Gee, an education professor at Arizona State University, in Tempe, and an expert on the uses of data generated by digital games, said in an interview that “cherished theories” in other fields have already been upended by similar big-data analyses, especially those in which information is shared across institutions.

He pointed, for example, to medicine, where informational devices inserted into the body can now provide medical professionals with a constant, real-time stream of information on patients’ biochemistry, allowing for a much richer and more accurate portrait of an individual’s health than can be ascertained through check-ups or human monitoring.

Koedinger said the field of education is ripe for something similar.

“Our sense of learning from our conscious experience is just a slim sliver of what is actually happening in our brains,” Koedinger said. “It’s a lot more complex than we think it is.”

Data-Privacy Questions

But Gee is also among those worried that such large-scale educational data collection efforts will “create more noise, not more signal,” effectively obscuring the very things researchers are hoping to learn.

That’s especially true, he said, given the ways in which researchers and vendors are increasingly prioritizing digital over other types of data about how students learn, such as human observations of student-to-student interactions.

Such concerns about “over-collection” of digital learning data are also part of what troubles Barnes, the EPIC lawyer.

“We’re increasingly operating outside the parameters of FERPA,” she said, referring to the Family Educational Rights and Privacy Act, a 40-year-old federal statute that remains the primary law in place to protect students’ privacy.

“We talk about modern privacy as being about an individual’s right to control the information they’ve entrusted to others,” Barnes said, “but it appears [with LearnSphere] that students will lose significant control.”

Indeed, the new effort in some ways echoes the work attempted by Atlanta-based nonprofit inBloom, which closed its doors in April in the wake of stiff opposition from parents and advocates concerned with the privacy and security of children’s sensitive information.

Unlike LearnSphere, which will facilitate data-sharing among researchers and some private companies and developers, inBloom aimed to sit between schools and vendors. The nonprofit also sought to collect and store personally identifiable student information directly from schools, which LearnSphere will not do.

But like inBloom, the new initiative will facilitate greater sharing of the information students generate for a potentially wide variety of purposes, just as it will facilitate expanded collection of digital information from students in schools.

It’s the latter point that most worries Leonie Haimson, the co-chair of the Parent Coalition for Student Privacy and a leading voice in the opposition that ultimately toppled inBloom.

“In general, we have nothing against research that is done with fully anonymized data,” she said in an interview. “But I think that any university involved in such a data [repository] has to make sure that the original collection of data was done ethically, with full consent and notification. They shouldn’t leave it up to vendors.”

A Matter of Trust

For their part, both Koedinger and officials from the NSF acknowledged such concerns, but said protections will be in place.

Approval by institutional review boards at participating LearnSphere universities will be required for formal research studies involving the data being collected. That means stringent requirements for informed consent by participating students.

Koedinger conceded, however, that in many cases, the software and other digital learning tools that are feeding data to LearnSphere will be operating outside of formal research studies. In those cases, he said, the potential for anonymized information to be shared with third parties may or may not be disclosed to schools and districts through a formal statement.

Ultimately, he said, the level of concern people have about data privacy and ownership will likely come down to a combination of how much they trust the institutions involved and how much they believe in the potential of a project like LearnSphere to improve education.

For the time being, at least, there is clearly a wide spectrum of opinions on those questions.

“Contrary to popular belief,” said Barnes, the EPIC attorney, “students and parents don’t necessarily support being made into research subjects every time they interact with an ed-tech platform.”

Photo of Ken Koedinger, professor at Carnegie Mellon University and principal investigator on the NSF grant.


A version of this news article first appeared in the Digital Education blog.