I love Twitter. Specifically, I love it when I stumble across a new article, idea, or link from someone I follow. This past week I followed a link posted by a fellow data geek that led me to downloading a book on Neo4j NoSQL databases. The book, Graphing Databases, explores the development and practical use of non-relational, non-traditional database structures for showing relationships between data points.
For the less nerdy, a typical database is refered to as a relational database. It is the type of format you see in Microsoft Excel. Each data point lives in a single row and each column is a different attribute. All the attributes in a given row are linked together, and each row is collectively linked together within the table. This is a great way for organizing small data sets with simple variables and is commonly referred to as a SQL database. It is not great when you have very large datasets, as the query and analysis time grows with the size of the database. Its even harder to use if you are trying to track data points from multiple sources that don’t necessarily fit into the same relational table.
I constantly run up against this challenge when working with WaSH data. For any given project, we typically collect data from multiple sources. User data from household surveys, management and financial data from water board surveys, georeferenced functionality data from water point mapping, water quality data from the local engineer, along with other sources from remote sensors or national databases, and they must all be combined to paint a picture of water and sanitation coverage in any given community, region, or country. Each source becomes it’s own Excel workbook tab or separate CSV of variables within STATA. Linking these data streams requires a lot of time and effort, and usually results in a messy set of files that require an intimate knowledge of their creation to make any sense of. I have both created these labyrinths of data and been dropped into the middle of other peoples data sets- and neither situation is pretty.
Learning about graphing data bases has changed my entire view of WaSH data.
Graphing databases are based upon 18c graph theory. Each data point is a node or single point on a graph. Nodes are connected to each other through relational edges. So instead of all data points having to relate within a table, data points can relate in any schema necessary.
Graphing Connections of Kevin Bacon
So what does that mean for WaSH data? We can take graph data points from households, communities, water points, and any other data source and link them together through their interconnectedness. Relational edges can be drawn between a set of households in a given community and a shared hand pump, allowing for analysis to take place between all points in that network to determine a total picture of functionality. These small groups of nodes can be partitioned by geography or rural/urban setting and aggregated with other network groups to develop an in depth view of service coverage. If the data points are georeferenced, graphing database models can even organize data using long/lat and conduct some geospatial analysis. And since each data point lives independent of each other, we can more easily deal with a common challenge in WaSH data- syncing data sets from varied sources and dealing with gaps in data.
The one big downside of graphing databases: ditching Excel. A lot of people involved in the collection and management of WaSH data rely heavily on using Excel (or a similar program). It is a challenge for me to move beyond this traditional way of thinking about Relational Database Management Systems (RDBMS), but I’m already enjoying learning new tools and reworking my datasets into a graphing format that highlights the complex relationships that exist between different M&E datasets.
As I continue down the rabbit hole of NoSQL graphing databases, I am going to see exactly what can be done with graphing of WaSH data. Stay tuned for Rethinking Data: Part II…