'Big Data' Research Effort Faces Student-Privacy Questions
A coalition of prominent research universities is receiving federal support to redesign and scale up a massive repository for storing, sharing, and analyzing learning and behavioral data that students generate when using digital instructional tools, demonstrating the continued faith that many personalized-learning proponents have in the power of "big data" to transform schooling.
But the project, which is dubbed "LearnSphere" and in some respects echoes the ill-fated attempt by controversial nonprofit inBloom to facilitate the collection and sharing of large amounts of educational information, also raises new questions in the highly charged debate over student-data privacy.
The initiative—which was awarded a $4.8 million grant from the National Science Foundation—will be led by researchers at Carnegie Mellon University, in Pittsburgh, who propose to construct a new data-sharing infrastructure that is distributed across multiple institutions, including third-party and for-profit vendors. When complete, LearnSphere is likely to hold a massive amount of anonymous information, including:
• "Clickstream" and other digital-interaction data generated by students using digital software provided to schools by the universities and vendors participating in LearnSphere;
• Chat-window dialogue sent by students participating in some online courses and tutoring programs;
• Potentially, "affect" and biometric data, including information generated from classroom observations, computerized analysis of students' posture, and sensors placed on students' skin, in order to track measures such as student engagement.
Proponents say that facilitating the sharing and analysis of such information for research purposes can lead to new insights about how humans learn, as well as rapid improvements to the digital learning software now flooding schools.
"We are going to be able to understand the learning and instructional processes so much better than we already do," said Kenneth R. Koedinger, a professor of human-computer interaction and psychology at Carnegie Mellon and the principal investigator on the NSF grant. The other universities participating in the project are the Massachusetts Institute of Technology, Stanford University, and the University of Memphis.
But prominent student-data-privacy advocates warned that rapid expansion in the collection of students' digital data, even when done primarily for research purposes, is fraught with potential problems related to notification, consent, and data ownership.
At a minimum, LearnSphere "warrants further evaluation" before moving forward, said Khaliah Barnes, a lawyer with the Electronic Privacy Information Center, a Washington-based advocacy group.
"It's not at all clear that parents and students are fine with having their information data-mined in this way," Ms. Barnes said.
LearnSphere's grant was one of 14, totaling $31 million, that the NSF announced earlier this month as part of its Data Infrastructure Building Blocks program, also known as DIBBS.
The goal of the program is to promote interdisciplinary collaboration and innovation in a wide variety of scientific fields.
NSF officials said education may finally be ready to catch up to data-related advances in other fields.
"We're now able to collect massive amounts of information on individual students we weren't able to collect 10 years ago," said John C. Cherniavsky, a senior advisor for research at the NSF. "It presents an opportunity in the education-research domain that has been available in the physical sciences for decades."
LearnSphere won funding, the NSF officials said, in large part because the effort will build on extensive work that researchers at Carnegie Mellon have already done.
Mr. Koedinger first won acclaim in the early 2000s for his involvement in the development of adaptive-learning software known as Cognitive Tutor, which uses big-data analysis to help spot the points at which students get stymied or disrupted in their mathematics-learning process, then provides targeted help to get them back on track.
In 2004, with NSF funding, Carnegie Mellon and the University of Pittsburgh jointly founded the Pittsburgh Science of Learning Center to study human learning and apply the findings to the development of teaching tools. The center in recent years created DataShop, an educational data repository that holds information from more than 550 datasets.
Cognitive Tutor software, used by about 600,000 middle and high school students in the United States, provides a large chunk of data to the repository, Mr. Koedinger said.
Additional information is generated by students' use of other interactive tutor systems and software programs, digital educational games and simulations, and massive open online courses, or MOOCs. Some of those tools come from universities, and some come from private companies and developers.
Detecting Emotional States
The data to be stored in the LearnSphere database and analyzed by researchers will be far-reaching, Mr. Koedinger said, likely including records of every mouse click a student makes when using a software program and information demonstrating a student's thought process when attempting to solve a problem in an online simulation.
The LearnSphere data will also likely include the text that college students type when participating in a discussion board for a MOOC, and that K-12 students enter when interacting with a dialogue-based adaptive-tutoring system.
And it may also include information on what Mr. Koedinger described as students' "affective emotional states," such as whether they are bored or frustrated, as gauged through either classroom observations or sensor technology that can detect an individual's posture, his or her skin's conductance of electricity, and more.
Part of what Mr. Koedinger hopes will make LearnSphere powerful is the ability to connect such varying data streams to each other in order to conduct large-scale analyses.
Already, he said, "we have shown some pretty interesting results in being able to detect different [emotional] states from keystroke data."
Such findings, Mr. Koedinger said, might be used to improve the ability of adaptive software to determine when a student is losing interest in a digital lesson, allowing the program to provide on-the-spot encouragement or remedial help.
Analysis of the types of data that LearnSphere proposes to store can also lead to surprising insights about how to best teach students, Mr. Koedinger said. He cited recent findings by his team at Carnegie Mellon that, contrary to conventional wisdom, showed students seem to learn algebra better when they are first introduced to problems in the form of a story, rather than in the form of an equation.
James Paul Gee, an education professor at Arizona State University, in Tempe, and an expert on the uses of data generated by digital games, said that "cherished theories" in other fields have already been upended by similar big-data analyses, especially those in which information is shared across institutions.
He pointed, for example, to medicine, where informational devices inserted into the body can now provide medical professionals with a constant, real-time stream of information on patients' biochemistry, allowing for a much richer and more accurate portrait of an individual's health than can be ascertained through check-ups or human monitoring.
Mr. Koedinger said the field of education is ripe for something similar.
"Our sense of learning from our conscious experience is just a slim sliver of what is actually happening in our brains," Mr. Koedinger said. "It's a lot more complex than we think it is."
But Mr. Gee is also among those worried that such large-scale educational data collection efforts will "create more noise, not more signal," effectively obscuring the very things researchers are hoping to learn.
That's especially true, he said, given the ways in which researchers and vendors are increasingly prioritizing digital over other types of data about how students learn, such as human observations of student-to-student interactions.
Such concerns about "over-collection" of digital learning data are also part of what troubles Ms. Barnes, the EPIC lawyer.
"We're increasingly operating outside the parameters of FERPA," she said, referring to the Family Educational Rights and Privacy Act, a 40-year-old federal statute that remains the primary law in place to protect students' privacy.
"We talk about modern privacy as being about an individual's right to control the information they've entrusted to others," Ms. Barnes said, "but it appears [with LearnSphere] that students will lose significant control."
Indeed, Mr. Koedinger said the new effort bears some similarity to the work attempted by Atlanta-based nonprofit inBloom, which closed its doors in April in the wake of stiff opposition from parents and advocates concerned with the privacy and security of children's sensitive information. Like inBloom, LearnSphere will involve the storage of massive amounts of student information and enable those data to be shared more easily.
But unlike the researchers' effort, which will facilitate data-sharing among researchers and some private companies and developers, inBloom aimed to sit between schools and vendors. The nonprofit also sought to collect and store personally identifiable student information directly from schools, which LearnSphere will not do.
For their part, both Mr. Koedinger and officials from the NSF acknowledged potential privacy concerns, but said that protections, including approval by university institutional review boards in some instances, will be in place.
Mr. Koedinger conceded, however, that in many cases, the software and other digital learning tools that are feeding data to LearnSphere will be operating outside of formal research studies. In those cases, he said, the potential for "de-identified" or anonymous information to be shared with third parties may or may not be disclosed to schools and districts through a formal statement.
It's the latter point that most worries Leonie Haimson, the co-chairperson of the Parent Coalition for Student Privacy and a leading voice in the opposition that ultimately toppled inBloom.
"In general, we have nothing against research that is done with fully anonymized data," Ms. Haimson said. "But I think that any university involved in such a data [repository] has to make sure that the original collection of data was done ethically, with full consent and notification. They shouldn't leave it up to vendors."
Vol. 34, Issue 09, Page 6