Yesterday (July 11, 2012), UN Global Pulse hosted a roundtable on Big Data and Global Development. This follows the release of their white paper – Big Data for Development: Opportunities & Challenges – on May 29, 2012. In the same spirit, I would like to present some products of IREX Ukraine’s quest to become more data-driven. First, though, the last paragraph of the Global Pulse blog post announcing the white paper is worth quoting directly:
“It is important to recognize that Big Data and real-time analytics are no modern panacea for age-old development challenges. That said, the diffusion of data science to the realm of international development nevertheless constitutes a genuine opportunity to bring powerful new tools to the fight against poverty, hunger and disease.”
The white paper presents a number of key dialogue points for the movement, all of which are very applicable to global library development. The first asks what types of new, digital data sources are potentially useful to the field of international development. In our case, the data comes from computers in libraries in Ukraine. Each computer that the Bibliomist – Ukraine program provides to a library includes software that reports impact data back to our team. Upon deployment, the librarian completes a short questionnaire, including information on the library’s location and the size of the community in which it is located. The software then reports to our database every 15 minutes: a 1 if the computer is actively being used, a 0 if it is on but not in use, and Null if it is switched off. This simple ‘hand raising’ exercise from each of our 700 libraries provides a wealth of data. Over the first half of 2012, Bibliomist has accumulated a big data source that provides insight into public computer usage in international development settings.
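To make the 1 / 0 / Null encoding concrete, here is a minimal Python sketch of how a run of 15-minute reports could be turned into a usage rate. It is an illustration only, not the software deployed on the library computers: the function name `utilization` is hypothetical, and the choice to leave Null (switched off) reports out of the denominator is an assumption, since the definition of "usage" is not spelled out above.

```python
from typing import Iterable, Optional


def utilization(samples: Iterable[Optional[int]]) -> Optional[float]:
    """Share of powered-on reports in which the workstation was in use.

    Each sample is 1 (in use), 0 (on but idle), or None (switched off).
    Assumption: switched-off reports are excluded from the denominator.
    """
    powered_on = [s for s in samples if s is not None]
    if not powered_on:
        return None  # the computer was never on, so no rate is defined
    return sum(powered_on) / len(powered_on)


# One hour of 15-minute reports: in use, idle, switched off, in use.
hour = [1, 0, None, 1]
print(utilization(hour))
```

Counting switched-off reports as zeros instead would measure something different (use relative to total time rather than to powered-on time), so whichever definition is used should be stated alongside any chart.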
The second question posed in the UN’s white paper is what kind of analytical tools or methodologies for analyzing Big Data have already been tried and tested by academia and the private sector, and which of them could have utility for the public sector. As in most big data analysis, I use a combination of tools to retrieve, format, and visualize our project data. The raw data is stored in a MySQL database and, through a variety of views, is exported to a CSV file. From there, I use a few Perl scripts I had lying around from graduate school to clean things up. I then produce a monthly workstation usage report for our project team, which ranks libraries by Oblast. This Excel spreadsheet allows the whole team (representing a wide range of technical expertise) to be comfortable making more data-driven decisions. For the fun stuff, I bring my data into R. Using the wonderful IDE RStudio, I get down to visualizing the activity of over 2000 computers in libraries across Ukraine. Specifically, I use the ggplot2 visualization package written by Hadley Wickham to produce the graphics in this post. The animations below are not meant to be comprehensive data-driven representations of the program, but they provide what I feel are compelling displays of the beauty to be found in visualizing the data in our development work.
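The ranking step of the monthly report can be sketched in a few lines of Python. This is not the actual Perl and Excel workflow. The column names (`library`, `oblast`, `usage`), the library names, and the numbers are all invented for illustration, and the sketch assumes that "ranks libraries by Oblast" means ranking libraries against each other within each Oblast.

```python
import csv
import io
from collections import defaultdict

# Invented sample export; the real CSV layout is not described in this post.
SAMPLE_CSV = """library,oblast,usage
Library A,Lviv,0.62
Library B,Lviv,0.41
Library C,Odesa,0.55
Library D,Odesa,0.73
"""


def rank_by_oblast(csv_text):
    """Group libraries by Oblast and sort each group by usage, highest first."""
    by_oblast = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_oblast[row["oblast"]].append((row["library"], float(row["usage"])))
    return {
        oblast: sorted(libs, key=lambda lib: lib[1], reverse=True)
        for oblast, libs in by_oblast.items()
    }


for oblast, libs in rank_by_oblast(SAMPLE_CSV).items():
    for rank, (name, usage) in enumerate(libs, start=1):
        print(oblast, rank, name, usage)
```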
Click each of the images to watch the animation in full size.
(WordPress strips the animation when it resizes GIF files, and I think the images would lose too much visually if scaled down anyway.)
This first animation shows the evolution of computer usage by month, based on the size of the library’s community. Workstation usage is counted only during a library’s working hours (8am – 8pm). As can be seen, the vast majority of the computers are in rural locations. There is a surprising amount of change from month to month.
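The aggregation behind a chart like this one can be sketched as follows. Again, this is a hedged illustration rather than the code used to build the animation: the function name `monthly_usage_by_size`, the tuple layout, and the sample reports are made up, and treating the working day as the half-open interval from 8:00 up to but not including 20:00 is an assumption.

```python
from collections import defaultdict
from datetime import datetime


def monthly_usage_by_size(reports):
    """Average usage per (month, community size) during working hours.

    reports: iterable of (timestamp, community_size, status), where status
    is 1 (in use), 0 (idle), or None (switched off).
    """
    totals = defaultdict(lambda: [0, 0])  # key -> [in-use count, powered-on count]
    for ts, size, status in reports:
        # Skip switched-off reports and anything outside 08:00-19:59 (assumed hours).
        if status is None or not (8 <= ts.hour < 20):
            continue
        key = (ts.strftime("%Y-%m"), size)
        totals[key][0] += status
        totals[key][1] += 1
    return {key: used / on for key, (used, on) in totals.items()}


# Invented sample reports.
reports = [
    (datetime(2012, 3, 5, 9, 0), "village", 1),
    (datetime(2012, 3, 5, 9, 15), "village", 0),
    (datetime(2012, 3, 5, 21, 0), "village", 1),  # after hours, ignored
    (datetime(2012, 3, 5, 10, 0), "city", None),  # switched off, ignored
    (datetime(2012, 3, 5, 10, 15), "city", 1),
]
print(monthly_usage_by_size(reports))
```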
This animation is the same as the first, except that each of the 25 Oblasts of Ukraine is graphed separately. The variation between Oblasts is striking: some have many more computers than others, while others are much more urban. In this visualization you can almost feel the project breathing from month to month. The binary reports from each computer come together to paint this full picture.
In this final graphic, we return to community size vs. performance. Each dot represents a library, and the box-and-whisker plots show the distribution of workstation usage. Over the months, individual libraries move considerably. Some clearly separate themselves from the rest, most visibly among the village-sized communities. We can now reach out to those libraries and find out which aspects of their programming are making them successful and how they can be emulated.
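One way to turn "libraries that separate themselves" into a concrete list is to flag the dots that sit above the upper whisker of their box plot. The sketch below assumes the common 1.5 × IQR whisker convention and uses invented usage figures. The function name `standouts` is hypothetical, and this is offered as a possible follow-up step, not something the project team is described as doing above.

```python
from statistics import quantiles


def standouts(usage_by_library):
    """Names of libraries whose usage lies above Q3 + 1.5 * IQR."""
    # quantiles(..., n=4) returns the three quartile cut points.
    q1, _, q3 = quantiles(usage_by_library.values(), n=4)
    upper_fence = q3 + 1.5 * (q3 - q1)
    return sorted(
        name for name, usage in usage_by_library.items() if usage > upper_fence
    )


# Invented monthly usage rates for village libraries.
village_usage = {"A": 0.30, "B": 0.32, "C": 0.35, "D": 0.33, "E": 0.31, "F": 0.90}
print(standouts(village_usage))
```

Running this per community size and per month would give a short list of libraries worth contacting, although the threshold itself is a judgment call.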