Project will broaden access to geoscience data

8/28/2014 Celeste Arbogast

Team to develop framework to make extracting information from data collections easier.

Pictured, from left, CEE Professor Praveen Kumar, CEE Research Associate Mostafa Elag and National Center for Supercomputing Applications Senior Research Programmer Luigi Marini are part of a team working to make long-tail data and models easier to use.

Each day, millions of researchers across the geoscience field and around the world are busy collecting data for unique research projects. They develop equally specific models to use that data for their individual purposes. These collections are called “long-tail” data and models, because large numbers of researchers collecting relatively small amounts of data add up to an extensive data trail. They represent a potential goldmine for the scientific community, but for one problem: they are too different. Data collections and models differ in units, time intervals, locations and a host of other variables, and are generally so dissimilar that reusing them is harder for researchers than starting from scratch.

That may soon change, thanks to a project led by Illinois researchers to make long-tail data and models much easier to use, promising significant savings of money and time for researchers who want to draw on them. The team, led by CEE Professor Praveen Kumar, has received a $1 million grant from the National Science Foundation (NSF) to develop a semantic framework for integrating long-tail data and models.

“We’re developing the semantic framework in software so that we can extract all this information from the different data types and different model components and help them talk to each other,” Kumar said. “It cuts down the manual effort that is required in discovering suitable data for a model and converting data so that it’s in a suitable format for a model. All that could be done automatically.”

The effect will be similar to giving scientists access to a translator, said team member Mostafa Elag, a CEE Research Associate.

“Just as languages have a semantic structure, we need semantic techniques for data,” Elag said. “With French and English, the root language is Latin. They have a similar structure but not the same structure. It’s the same for data and models. They need an interpreter or a translator.”

The group plans to first develop a set of protocols that can ultimately be implemented in software. Researchers will be able to deploy the software to extract information from heterogeneous data sets and models so that they can put them to use.

“Our emphasis is on developing the design of the protocol as much as developing the technology wrapped around it to make it possible,” Kumar said. “The design of the protocol requires a lot of thought. The design document is where there will be a lot of effort which will require community input and approval.”

The community Kumar mentions comprises the hundreds of geoscientists involved in EarthCube, the larger NSF initiative of which this project is a component. Led by NSF’s Directorate for Geosciences and its Division of Advanced Cyberinfrastructure, EarthCube’s main objective is to develop a national cyberinfrastructure for earth system science researchers and educators. The work represents another aspect of what scientists in recent years have come to refer to as “big data,” Kumar said.

“Big data issues are related not just to large volumes of data but also to large heterogeneities in data,” Kumar said. “In our context, we’re thinking of big data because there are millions of people collecting little data, but when you multiply a little by millions, it’s very large. It’s a large volume of data which has not been effectively used for scientific exploration.”

In addition to making long-tail data and models easier for other researchers to use, the project will also bring to light scientific resources that researchers may not know exist, Elag said.

“The broader impact of this project will be helping to increase the visibility of our resources—either models or data,” Elag said. “Without this framework, we cannot see what other communities are doing. Through this framework, we will be able to reuse other resources that we are not currently aware of.”

Other researchers involved in this project are Luigi Marini of the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign, Scott Peckham of the University of Colorado and Leslie Hsu of the Lamont-Doherty Earth Observatory at Columbia University. 


This story was published August 28, 2014.