Jeff Bertolucci at InformationWeek wrote an interesting article recently comparing the nature and opportunities of Long Data with those of the ever-popular Big Data. His main point is that we shouldn't overlook the depth of data collected over time, which can show historical trends, just because we have a breadth of real-time data that highlights consumer utility today. In research, this is simply the debate of longitudinal studies vs. cross-sectional studies, which academics like to argue about a lot.
Unfortunately, WASH research is almost always cross-sectional (though lacking the size or scale to be considered Big Data): functionality of water points at a single moment in time, or health impact studies of handwashing. Getting Long Data through longitudinal studies has historically been expensive, complicated, and too narrow to support large-scale conclusions. The primary reason for this scarcity of data is the nature of the beast in our sector: we work in rural settings, with populations who do not have access to modern amenities, unlike Google, GE, Ford, Amazon, and the rest of the private sector, who work with demographics that spend hours online each day and are easily tracked using modern analytic tools.
When I read reports like the UN Global Pulse on utilizing Big Data or Long Data for development, I always question how we will get there given our current data streams. Only a fraction of private sector businesses are truly utilizing the data revolution; if big business is struggling to keep up, can we expect small non-profits with little-to-no overhead budget to jump on board? Should we be focused on trying to amass a data set large enough to run through Hadoop when the people we serve and the local governments we work with do not have the capacity to monitor their existing delivery systems?
I debated the difficulties of gathering data with some researchers in my office this week, which led to an interesting point: getting data from other sources is hard. Many of them had previously reached out to other academics, requesting to see the data behind papers and reports published in the past 20 years. Most of them were told the data couldn't be found, that it was lost in a paper stack somewhere, or simply received no response. One success story of getting existing data was met with this response: "Sure! I'd love to give you the data. You know, in the 10 years since we finished this multi-country study, you are only the second person who's asked for it!" It's even more difficult when the reports and data we want are not written in English, which requires additional work just to unpack the data and harmonize it with our own datasets.
So if we amass all this information on rural water systems and service delivery, who will analyze it? Another WASH conference has come to the same conclusion that we need shared data, and suggests academia is the partner that governments and NGOs should look to for analysis. As a university researcher, I think this model could work, but only if funding is attached to the data. The real challenge in the Big Data debate is what we are able to analyze and learn from the resources available.
I believe we should shift our attention to a different buzzwordy data type: Smart Data. If we can start with a shared data platform, Smart Data will allow us to be efficient with our existing databases and make actionable decisions. Smart Data can harness the power of all of our data, whether it is large or small, cross-sectional or longitudinal, and put it to work. We need to crowdsource our skills and data scientists to analyze with actionable purpose, not just for another journal article.
Long Data hits a resonating note with recent WASH sustainability publications urging us to shift our focus to monitoring the life cycle of systems. Big Data is the carrot on the stick pulling us toward better analysis and shared collaboration. Smart Data is the link, the nexus point where ideas and information will be maximized for impact. It may be too soon for development professionals to concern themselves with Hadoop or analysis on that scale, but it should be the dream we are trying to make reality.