Data Analysis / Graphing Databases

Rethinking Data: Part 1-Graphing Data

I love Twitter. Specifically, I love it when I stumble across a new article, idea, or link from someone I follow. This past week I followed a link posted by a fellow data geek that led me to downloading a book on Neo4j NoSQL databases. The book, Graphing Databases, explores the development and practical use of non-relational, non-traditional database structures for showing relationships between data points.

For the less nerdy, a typical database is refered to as a relational database. It is the type of format you see in Microsoft Excel. Each data point lives in a single row and each column is a different attribute. All the attributes in a given row are linked together, and each row is collectively linked together within the table.  This is a great way for organizing small data sets with simple variables and is commonly referred to as a SQL database. It is not great when you have very large datasets, as the query and analysis time grows with the size of the database. Its even harder to use if you are trying to track data points from multiple sources that don’t necessarily fit into the same relational table.

I constantly run up against this challenge when working with WaSH data. For any given project, we typically collect data from multiple sources. User data from household surveys, management and financial data from water board surveys, georeferenced functionality data from water point mapping, water quality data from the local engineer, along with other sources from remote sensors or national databases,  and they must all be combined to paint a picture of water and sanitation coverage in any given community, region, or country. Each source becomes it’s own Excel workbook tab or separate CSV of variables within STATA. Linking these data streams requires a lot of time and effort, and usually results in a messy set of files that require an intimate knowledge of their creation to make any sense of.  I have both created these labyrinths of data and been dropped into the middle of other peoples data sets- and neither situation is pretty.

Learning about graphing data bases has changed my entire view of WaSH data.

Graphing databases are based upon 18c graph theory. Each data point is a node or single point on a graph. Nodes are connected to each other through relational edges. So instead of all data points having to relate within a table, data points can relate in any schema necessary.

Graphing Connections of Kevin Bacon

So what does that mean for WaSH data? We can take graph data points from households, communities, water points, and any other data source and link them together through their interconnectedness. Relational edges can be drawn between a set of households in a given community and a shared hand pump, allowing for analysis to take place between all points in that network to determine a total picture of functionality. These small groups of nodes can be partitioned by geography or rural/urban setting and aggregated with other network groups to develop an in depth view of service coverage. If the data points are georeferenced, graphing database models can even organize data using long/lat and conduct some geospatial analysis. And since each data point lives independent of each other, we can more easily deal with a common challenge in WaSH data- syncing data sets from varied sources and dealing with gaps in data.

The one big downside of graphing databases: ditching Excel. A lot of people involved in the collection and management of WaSH data rely heavily on using Excel (or a similar program). It is a challenge for me to move beyond this traditional way of thinking about Relational Database Management Systems (RDBMS), but I’m already enjoying learning new tools and reworking my datasets into a graphing format that highlights the complex relationships that exist between different M&E datasets.

As I continue down the rabbit hole of NoSQL graphing databases, I am going to see exactly what can be done with graphing of WaSH data. Stay tuned for Rethinking Data: Part II…

Advertisements

3 thoughts on “Rethinking Data: Part 1-Graphing Data

  1. Nice article and thinking. Looking forward to your results, would love to see the conclusions, visualisations and queries on top of your data.

    You might want to explore at least importing data from spreadsheets http://www.neo4j.org/develop/import and the graph query language cypher also allows you to return tabular data from a graph.

    With the http endpoints in Neo4j Server it would be an interesting challenge to create a set of macros that sync an spreadsheet with a graph database. I assume that macros can do http requests and json processing these days.

    http://www.neo4j.org/tracks/cypher

    Cheers

  2. Pingback: Rethinking Data: Part 2- Graphing “Open” Data | The Water Wonk

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s