Kevin Kelly has a post on The Technium about whether Google-sized data sets will lead to a new way of doing science:
“There’s a dawning sense that extremely large databases of information, starting in the petabyte level, could change how we learn things. ”
Kevin’s post was pointed out to me by Jonah Stein and caught my attention because a couple of AppLogic users are starting to build very large data sets, and as a result we’re occasionally asked about how to access and distribute them. However, the post really is less about the technology of dealing with these data sets and more about whether they’ll change the way science is conducted – another subject I’m interested in, so I read on. Kevin’s post was inspired by a Wired cover story, “The End of Theory,” by Chris Anderson, who writes:
“There is now a better way. Petabytes allow us to say: “Correlation is enough.” We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
So, Chris’ theory is that if we have sufficiently large data sets, we’ll be able to act merely on the finding of a correlation rather than waiting to understand the actual mechanism that relates the data – the cause and effect. However, IMHO data without understanding isn’t knowledge. In fact, it can be dangerous, because conclusions reached through statistics are very susceptible to bias in what data is collected.
In the ’70s, Harvard medical school researchers had reams of data correlating breast feeding with juvenile cancer. Simply accepting the statistics could have led to some horrible decisions. Fortunately, they didn’t, and further research showed that breast feeding doesn’t cause cancer; rather, carcinogens in the mother’s environment and diet were being passed through to the baby.
The Eugenics movement a century ago was based on statistics correlating the skull shapes of different races with intelligence test scores. This pseudo-science was in part responsible for Nazi atrocities. Of course, we understand now that the intelligence tests were geared towards white Europeans, but without that knowledge the statistics seemed compelling to people of that era.
Even the low birthrate in Europe today could provide miscues if you just accept the statistics. Is the cause that people are less religious than in previous generations? Is it that government subsidies are higher than in other countries, or that they’re lower than 50 years ago? Is it pollution, or economics? There are statistics to support each of these.
In the pure mathematical sciences, huge data sets will give scientists interesting insights into where to focus their research. Used properly, they will provide a sort of shortcut to new theories to test, and those theories will feed back into the system, guiding what data to collect in the future.