…a continuation from Rethinking Data: Part 1
Two weeks ago I posted a blog about open data in WaSH: the challenges in gaining sector buy-in and our responsibility to promote open data (read it here: The Wild-Card: Open Data). After the Akvo Track-Day event, I had a great conversation with Henry Jewell about what we do with open data once we have it. Once it’s on the internet, is anyone really using it? What are they using it for? Is it making an impact?
I realized that while I promote open data, I do very little with it. I’ve used some open data to generate statistics for reports and infographics. I’ve even used some as a primary research data source. But the real point of open data is to improve public understanding of a given topic and influence future policy and decision making, from the national level down to the community level. And I haven’t used it for that.
Which brings us to today’s post. It’s time for an open data project! My research question: Within the private sector, who is funding who in WaSH? How do all of the various organizations connect via different funding streams? And most importantly, how do they compare in size and scope of funding given/received?
Gathering the data.
Surprisingly (or maybe not, depending on your experience with open data), open data comes in a lot of shapes and forms. Data is only truly open when it is easily extractable and in a machine-friendly format. Sadly, data is often promoted as open when it’s locked in a PDF or hidden within Flash objects. Let’s lay out a ground rule: if an API can’t read it, it’s not really “open”.
The data I decided to use was what I like to call “semi-open”. WASHFunders.org has a great dataset that the Foundation Center has curated, tracking grants and financial distributions from donors to recipients. It is hosted on their site in a great map format and can be viewed in a table format as well. However, the table is segmented into pages and there is no option to export any of the data. When I contacted the site manager, I was informed that the raw data could be accessed for a nominal fee. In an effort to harness the data in its public-facing format, I decided to copy and paste the data from each table page (194 pages, to be exact). While time consuming, this let me build the raw data set I needed.
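Turning 194 pages of pasted table text into usable records still takes a little parsing. Here is a minimal sketch of how one pasted page might be converted into structured rows; the column names (funder, recipient, amount) and the tab-separated layout are assumptions for illustration, since the actual WASHFunders table may paste differently:

```python
def parse_pasted_page(raw_text):
    """Parse one pasted table page (tab-separated, as text often
    copies from a browser table) into a list of dicts.
    Columns are hypothetical: funder, recipient, grant amount."""
    rows = []
    for line in raw_text.strip().splitlines():
        funder, recipient, amount = line.split("\t")
        rows.append({
            "funder": funder.strip(),
            "recipient": recipient.strip(),
            # strip currency formatting so the amount becomes numeric
            "amount": int(amount.replace("$", "").replace(",", "")),
        })
    return rows

# Two toy rows, formatted as they might paste from one table page
page = "Foundation A\tNGO X\t$250,000\nFoundation B\tNGO Y\t$1,100,000"
records = parse_pasted_page(page)
```

Running each pasted page through a parser like this (and appending the results) beats hand-cleaning the same formatting quirks 194 times.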
Analyzing the data.
I’ve been diving into graph databases and tools, and what better way to put some of these skills to work than in a project that combines open source with open data? I stumbled upon a graphing tool called Gephi a few weeks ago and have been tinkering with it on random data sets. Gephi allows users to upload edge and node tables, run layout algorithms, compute all sorts of fun stats (connectedness, centrality, graph density, etc.), and put together very visually appealing graphs.
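Two of the stats Gephi reports are simple enough to compute by hand, which helps build intuition for what the tool is showing. Below is a small sketch (the funder/NGO names are made up): density is the fraction of possible edges that actually exist, and degree counts how many grants each organization gives or receives.

```python
def graph_density(num_nodes, num_edges, directed=True):
    """Density = edges present / edges possible.
    A directed graph has n*(n-1) possible edges; undirected halves that."""
    possible = num_nodes * (num_nodes - 1)
    if not directed:
        possible //= 2
    return num_edges / possible

def degrees(edges):
    """Tally out-degree (grants given) and in-degree (grants received)
    for each node in a directed edge list."""
    out_deg, in_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
    return out_deg, in_deg

# Toy funding network: two funders, two recipients, three grants
edges = [("FunderA", "NGO1"), ("FunderA", "NGO2"), ("FunderB", "NGO1")]
density = graph_density(num_nodes=4, num_edges=3)  # 3 of 12 possible edges
out_deg, in_deg = degrees(edges)
```

In a funding network, a high in-degree flags a recipient drawing from many donors, and a high out-degree flags a donor spreading funds widely, which is exactly the "who funds whom" question this project asks.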
Graphing the data!
After some scrubbing, merging, and reformatting, I was able to build the two CSVs with the edge and node information Gephi needs. A combination of the Force Atlas and Noverlap layouts gave me the base for my graph. There was still a decent amount of overlap in the labels, so a couple of hours of manual tweaking went into finalizing the graph format.
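For anyone wanting to reproduce this step, here is a rough sketch of turning grant records into the two tables. The grant tuples are toy stand-ins, and the column headers (Id/Label for nodes, Source/Target/Weight for edges) reflect my understanding of what Gephi's CSV importer expects; check the import wizard against your Gephi version:

```python
import csv

# Toy grant records: (funder, recipient, amount)
grants = [
    ("Foundation A", "NGO X", 250000),
    ("Foundation A", "NGO Y", 100000),
    ("Foundation B", "NGO X", 500000),
]

def write_gephi_csvs(grants, nodes_path, edges_path):
    """Write node and edge tables for Gephi's CSV importer.
    Every funder and recipient becomes a node; every grant becomes
    a directed edge weighted by the grant amount."""
    names = sorted({n for funder, recipient, _ in grants
                    for n in (funder, recipient)})
    ids = {name: i for i, name in enumerate(names)}
    with open(nodes_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Label"])
        for name, node_id in ids.items():
            writer.writerow([node_id, name])
    with open(edges_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Weight"])
        for funder, recipient, amount in grants:
            writer.writerow([ids[funder], ids[recipient], amount])
    return ids
```

Using the grant amount as the edge weight lets Gephi scale edge thickness by funding size, which is what makes the size-and-scope comparison visible in the final graph.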
And voilà! Here is the graph:
Warning: Technical Part Beginning