Thoughts about open data and the future of librarianship

[Voyant word cloud of the most frequent words on this blog]

These are the most common words I have used on this blog since I began writing it back at the beginning of October. Looking at this representation exported from Voyant Tools, I feel I must have been on the right track. It was actually even more interesting, from a writerly point of view, to leave a few of the stop words in, as the result gave me an indication of the boring words I tend to overuse. “Particularly” seems to be one of them… I might have to give this some thought before I hand in my final essay!

In the final lab of DITA we attempted to obtain Twitter metrics reports. However, given that I have only tweeted a handful of times since the course began, my results were singularly uninteresting, and I couldn’t get the program to work properly, so I won’t publish them here. This is not to say I haven’t been using Twitter during this time. As well as following up my classmates’ links and suggestions, I have used it to track the protests following the Ferguson verdict, read feedback and comments from students using the library I work in, find out the details of various incidents I have passed while cycling to work, and discover what some of my favourite bands are doing. The data I have generated and accessed covers vast swathes of my life, and it has made me realise how useful open access to data, via APIs and beyond, can be for people developing apps to help us get on in our lives. It also scares me a bit when you look at the ways companies such as Uber are using data to invade people’s privacy.

The move towards open data generated from research has been prominent in the university in which I work – well, talk of open data has been prominent; whether the university eventually sets up a data repository alongside the institutional repository we currently have remains to be seen. Increasingly it is recognised that researchers providing their raw data will, as the Open Data Initiative says, contribute “economic, environmental, and social value” to society. If research is publicly funded, it stands to reason that the public should have access to the results. And as I mentioned before, the ability to utilise this kind of data and to mash up different applications really shouldn’t be underestimated, considering the kinds of things people are creating, such as this woman’s mission to make it easy for people to locate public toilets in Denmark. Someone needs to do that for London!

What has been interesting, and slightly uncomfortable, for me throughout the last 10 weeks of DITA is that, while on the one hand I can definitely see the need for librarians and information specialists to get a handle on these kinds of technologies, on the other hand it seems to run closely parallel to what companies and corporations (such as Uber) are doing. The difference, I guess, is that we’re not (necessarily) in it to make money off people, but much of it does feel a bit like we’re learning business analyst tools. In fact, a friend of mine recently got a job working with “big data”, and her company does market research and the like for various big companies. We’ve been able to share a lot of knowledge in the last few weeks, and while I realise it is reactionary (and probably a bit technophobic) of me to feel uncomfortable, there is a bit of “I am training to be a librarian, after all, not to help car companies sell cars!” about it.

But I think this will be the (future) role of librarians: to help the public gain or retain control of their own information and understand what is being done with their data, to help them navigate copyright limitations and, in an academic context, to promote useful data analysis tools to students. To that end I am pleased to have been given these leads to follow up, and I look forward to integrating them into my work within the library.

Semantic Web and the potential for opening up accessibility

I work with visually impaired library users every day in my job as Library Access Support Co-ordinator. The benefits of the development of the “semantic web” for these library users are immediately obvious (one thing I have learned in this job is that technology with assistive aspects benefits all of us, whether we consider ourselves disabled or not, hence the success of the iPhone and, naturally, the drive towards the semantic web).

Throughout the 10 sessions of DITA, in the back of my mind I have been applying the ideas to library users of my past (youth) and my present (disabled and dyslexic university students), which helps ground the theory in practice for me. Right at the beginning we learnt about Information Architecture, thinking about the importance of structuring web resources well, and now at the end we are investigating the semantic web, which involves the Text Encoding Initiative (which seeks to make documents machine-understandable) and the Resource Description Framework (which provides metadata for digital resources). Who better to judge the efficacy of these concepts and approaches than those who, in navigating digital resources, rely entirely on software that depends on the hierarchies of a webpage being meaningfully structured, or a document being correctly tagged? Think about the ‘skill’ we are taught to develop of quickly scanning a document to decide on its usefulness for our research. Without being able to physically “see” the text, imagine the benefits of text analysis tools and topic modelling for quickly pulling out the salient concepts of a document.
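To make the RDF part a little more concrete, here is a tiny sketch using Python’s rdflib library. The resource, title and values are all invented for illustration; the point is simply that the description lives in explicit statements that software (including the tools sitting behind a screen reader) can read, rather than in visual layout.

```python
# A minimal sketch of RDF metadata for a digital resource, using rdflib.
# The URI and values below are invented, purely for illustration.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

g = Graph()
doc = URIRef("http://example.org/resources/week1-lecture-notes")

g.add((doc, DC.title, Literal("Information Architecture: week 1 notes")))
g.add((doc, DC.creator, Literal("DITA teaching team")))
g.add((doc, DC.format, Literal("application/pdf")))

# The Turtle output is the kind of explicit, machine-readable description
# that does not depend on how a page happens to look.
print(g.serialize(format="turtle"))
```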

Reading further into the literature on the semantic web, however, I kept getting snagged on the discussions around the creation of ontologies and taxonomies (which will come as no surprise to any readers of this blog).

Take, for example, the Comic Book Markup Language (CBML). As an ex-comic-book-shop owner I was fascinated to see this exists. Our store, Cherry Bomb Comics (RIP), specifically sold only those graphic novels made by women, LGBT people and people of colour, as well as local New Zealand creators (our blunt-instrument way of rectifying the imbalance in the comics world). I remember trawling through distributors’ catalogues hoping to catch sight of a few keywords we had employed to identify the stock we wanted to hold. If all things comics were marked up with CBML (they’d have to be digitised first, though I imagine most comics are born digital these days) and publishers made them available for text analysis, what a far more accurate way of identifying what we needed that would have been. But… as well as removing the serendipity of browsing through catalogues, who decides how to interpret (and then subsequently mark up) a comic, or any image for that matter? The artist/author? The publisher? The person they have hired to do the marking up?
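To make the “who decides?” question concrete, here is a rough sketch of what a marked-up panel, and a script pulling keywords out of it, might look like. The element names are only loosely modelled on TEI/CBML and the description is invented, so treat it as an illustration of the problem rather than real CBML.

```python
# Illustrative only: a TEI/CBML-flavoured snippet (element names approximate)
# showing how a panel description bakes in someone's interpretation.
import xml.etree.ElementTree as ET

panel_xml = """
<panel>
  <desc>Two women argue outside a comic shop at night.</desc>
  <balloon who="#ana">You can't shelve superheroes next to memoir!</balloon>
  <balloon who="#mei">Watch me.</balloon>
</panel>
"""

panel = ET.fromstring(panel_xml)
# Whoever wrote <desc> has already decided what the image "is about":
# any text-analysis pipeline downstream can only see their interpretation.
keywords = panel.find("desc").text.lower().split()
print(keywords)
```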

At my place of work we came across similar philosophical difficulties when we OCR’d texts from art books for visually impaired (VI) students. Initially we tried to describe the artwork depicted (which a blind student would obviously not be able to see), but it quickly became apparent that this was inappropriate: we were describing things inconsistently and subjectively, effectively “telling” the VI student what an image represents.

The concept of the “semantic web” being about creating knowledge structures is as exciting as it is open to abuse of power and privilege.  What does the web get to “know”?  Whose knowledge?

I wanted to investigate the possibilities for Web 3.0 technologies to aid accessibility, and located this study by Kouroupetroglou et al. on the “Web For All” site, which looks at using semantic web frameworks to create applications to assist visually impaired users. Conducted in 2006, it’s rather old now in terms of digital technology, but I was interested in their focus on the extensibility that comes with using OWL (the Web Ontology Language, used to define the ontologies behind RDF data), and the fact that this openness to addition and change, they felt, leads to increased opportunity for co-operation amongst different groups with expertise in different areas of digital accessibility. The final paragraph of the study sums up the possibilities opened up by using semantic web technologies in a way that I think implicitly addresses the need to be aware of “whose knowledge?”: “Our community is not tightly connected to the web authoring society, which is quite large and difficult to educate in accessibility issues. However, it can work independently upon the products of the web authoring society.”

Reference: Kouroupetroglou, C., Salampasis, M. & Manitsaris, A. (2006). A Semantic-Web based Framework for Developing Applications to Improve Accessibility in the WWW. Retrieved from http://www.w4a.info/2006/prog/15-kouroupetroglou.pdf

Text analysis using the Old Bailey API & Annotated Books Online

The Old Bailey Online provides digitised proceedings of the Old Bailey from 1674 to 1913. It offers a general search function; however, using the open API allows the user to query the results in a more specific way, “undrilling” to modify a query, or breaking the query down into further subcategories. Using the API also allows results to be exported to the online reference management software Zotero, and to Voyant for further visualisation.

For my search, I used the keyword “Camberwell” (where I live), with gender of the defendant set to “female”, and punishment category set to “Death”. This returned 8 (highly interesting!) results.
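Scripted rather than clicked, that same search would look roughly like the sketch below. I have reconstructed the parameter names from memory of the lab, so check them against the Old Bailey API documentation before trusting them.

```python
# A rough sketch of querying the Old Bailey API with Python's requests library.
# The parameter names and values are reconstructed from memory and may need
# checking against the current API documentation.
import requests

BASE = "https://www.oldbaileyonline.org/obapi/ob"

params = {
    "term0": "trialtext_camberwell",   # keyword: Camberwell
    "term1": "defgen_female",          # defendant gender: female
    "term2": "punishment_death",       # punishment category: death
    "count": 10,
    "start": 0,
}

response = requests.get(BASE, params=params, timeout=30)
response.raise_for_status()
print(response.json())  # trial IDs and hit counts, ready to export elsewhere
```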

[Screenshot: Old Bailey Online results for “Camberwell”, female defendants, death sentences]

I exported these texts to Voyant, and the resulting word cloud looked like this:

[Voyant word cloud of the Old Bailey results]

The prominent words (“child”, “mr”, “mrs”, “death”, “house”, “room”, “seen”, “know”, “said”) paint an eerie picture of domestic mishap, which would definitely be a good starting point if you were looking for inspiration for a Victorian murder mystery. Aside from that, the word cloud doesn’t give you the kind of information you’d expect a researcher to be looking for while using this tool; you don’t get any kind of picture of what crimes these women committed or the kinds of evidence presented in court. This does seem to be one of those situations Jacob Harris mentions in his blog post at Nieman Lab, wherein the word cloud doesn’t provide much in the way of insight.
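Under the hood, a word cloud is little more than a frequency count with stop words stripped out. Here is a minimal sketch of that counting step; the stop-word list is a tiny stand-in for the much longer ones Voyant uses.

```python
# Minimal word-frequency sketch: the counting that sits behind a word cloud.
# The stop-word list is a tiny stand-in for the full lists Voyant uses.
import re
from collections import Counter

STOP_WORDS = {"the", "and", "of", "a", "to", "in", "was", "i", "he", "she", "it"}

def word_frequencies(text, top_n=10):
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)

sample = "Mr and Mrs Smith said the child was seen in the room of the house"
print(word_frequencies(sample))
# e.g. [('mr', 1), ('mrs', 1), ('smith', 1), ('said', 1), ('child', 1), ...]
```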

I was interested to read that these court proceedings were digitised through a process of text rekeying. Earlier texts were manually typed twice by two different typists, then the transcripts were compared by a computer, with editing performed manually. Later texts were keyed once, with the second version created using OCR software, and the two texts once again compared and manually corrected. In my place of work I use OCR on PDFs uploaded to Moodle in order to make them accessible for visually impaired students, so that they can use text-to-speech software. This is a time-consuming process, especially if the original text is old and the print quality poor (we have students studying Old English and witchcraft, and the OCR software really doesn’t like their texts). In some ways it was pleasing to learn that the technology to make this task easy just *isn’t* there at the moment, as demonstrated by the laborious processes performed by the people behind the Old Bailey Online. I am glad to know that in my place of work we aren’t wasting our time with all our manual editing; at present it seems this is the only way!
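The “type it twice and let the computer flag the differences” idea is easy to sketch; this is just an illustration of the principle, not the Old Bailey team’s actual pipeline.

```python
# Illustration of the double-rekeying idea (not the Old Bailey team's actual
# pipeline): two independent transcriptions are compared and only the
# disagreements are flagged for a human editor.
import difflib

typist_a = "The prisoner was indicted for the wilful murder of her child."
typist_b = "The prisoner was indicted for the wilfull murder of her child."

matcher = difflib.SequenceMatcher(None, typist_a.split(), typist_b.split())
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"Check words {i1}-{i2}: {typist_a.split()[i1:i2]} "
              f"vs {typist_b.split()[j1:j2]}")
# -> Check words 6-7: ['wilful'] vs ['wilfull']
```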

Later in the DITA lab, I looked at Universiteit Utrecht’s Digital Humanities Lab, specifically at their text mining research projects, and chose to explore the project Annotated Books Online. This project digitises early modern books with handwritten annotations, marking the text up in order to separate out the annotations themselves for closer inspection. Annotations can be highlighted in different colours and have transcriptions added to them. Well, that was the theory anyway. The first time I used ABO I could highlight the annotations and get them to change colour; however, I haven’t been able to since, for some reason.

[Screenshot: Annotated Books Online]

This research project really appealed to me. I have always found marginalia interesting, and I like that the present-day reader can, in a sense, “interact” with the annotator of the past by “doing stuff” with their scribblings in the margin. Considering these texts are quite old and no doubt delicate, it’s a treat to be able to manipulate them in this way (well, it would be if I could get the annotation features to work for me again!).

Word clouds: “mullets of the internet”? What would Tupac say?

The description of word clouds employed by Jeffrey Zeldman as the “mullets of the internet” made me laugh. I’ve never found them particularly attractive to look at. That said, using tools like Wordle, Many Eyes and Voyant was fun and, like the Altmetric doughnuts, made the data in the otherwise eye-strainingly dull Excel spreadsheets much easier to get my head around, though I’m not sure how useful they are beyond giving a very general picture of a situation.

That said, we used data collected from our altmetrics work in the last DITA lab, and a few things were revealed to me. Firstly, using Altmetric I performed a keyword search for “Aotearoa”, as I mentioned in my previous blog post. When I looked at the results produced by Altmetric, it seemed that some of the journal articles/blog posts/tweets etc. it gathered did not contain the word Aotearoa, and the results felt a bit random. However, using Voyant on the titles from the Altmetric data exported to Excel resulted in the following word cloud:

[Voyant word cloud of the “Aotearoa” result titles]

with the word “Aotearoa” (as well as “Zealand”) showing very prominently, which led me to realise that I had probably dismissed my Altmetric results too quickly; on further inspection they were more relevant than I thought, and the word cloud more than just a colourful mullet! (And yes, I did forget to apply “stop words”, which is why “and” and “of” appear so frequently – oops.)

I also gathered Altmetric data using the keyword “Bicycle” and exported these to Voyant as well. This screenshot shows the kinds of information Voyant pulled out for me:

[Screenshot: Voyant analysis of the “Bicycle” results]

One of the most useful features is being able to select a word from the corpus, in this case “helmets”, and see, in the bottom right of the screen, each instance of that word surrounded by the context of its sentence (which can be expanded). This is useful if the word the researcher is looking for is more ambiguous than “helmets” or “Aotearoa” and could perhaps be mentioned in a context irrelevant to the thing being studied. This more granular way of looking at the data ensures that the researcher gets an accurate picture of how the words are being used in the text, with minimal effort.
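This “word in context” view is what corpus linguists call a keyword-in-context (KWIC) concordance, and it is simple enough to sketch by hand if you ever need it outside Voyant (the sentences below are made up):

```python
# A simple keyword-in-context (KWIC) sketch, similar in spirit to Voyant's
# contexts panel: show each occurrence of a word with a few words either side.
def kwic(text, keyword, window=4):
    words = text.split()
    keyword = keyword.lower()
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?") == keyword:
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            print(f"...{left} [{w}] {right}...")

sample = ("Cycle helmets divide opinion. Some studies find helmets reduce "
          "head injuries, others argue helmet laws discourage cycling.")
kwic(sample, "helmets")
```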

I still can’t say I am convinced by the usefulness of the word cloud, or even 100% sold on text analysis when approached in this quantitative way. I did my undergraduate degree in English literature, so Franco Moretti’s concept of distant reading, which employs graphical and quantitative visualisations of a text, is a new one to me (though it would have been REALLY helpful when writing those essays on Victorian literature!). But I was interested in Julie Meloni’s blog post at the Chronicle of Higher Education regarding the use of word clouds for engaging students. I used to work in a youth library, and many of the teenagers I worked with were very interested in poetry and expressive language. “The Rose that Grew from Concrete” by Tupac Shakur was (perhaps unsurprisingly) one of the most popular books in the library. In a bid to get the kids to engage with how poems are written, I photocopied some of Tupac’s poetry and whited out some of the more visceral words. The kids then had to imagine/guess/decide what words should go in the spaces. I just used Voyant on poems from “The Rose that Grew from Concrete”, and I think this would’ve been a hit amongst all those emo teenagers at the library:

[Voyant word cloud of poems from “The Rose that Grew from Concrete”]

Using Altmetrics to measure societal impact

The value of using alternative metrics to collect evidence of impact for a scholarly work appeals to me, because it opens up the notion that the “general public” are thinking and reading too, not just academics. I am involved with lots of communities that I guess you’d call grassroots, activist and music communities particularly, which are not connected to universities but are often political in nature (feminist, post-colonial and queer theory contributing significantly). The open and cheap dissemination of information and ideas is important to these networks, and while zines have long played a big part in this, often introducing the theories of seminal thinkers (see, for example, Judy!, a tongue-in-cheek zine about Judith Butler, recently digitised by QZAP – the Queer Zine Archive Project), the internet has by and large taken over (though zines live on!), and these same communities continue to share theory on Tumblr, Twitter, Facebook, etc. Gathering evidence that scholarly works are being discussed outside the ivory tower is, I think, not only gratifying for the scholar but also provides an important channel of feedback, as academics are able to see the context in which their work is being used.

Working in an academic library I hear a lot about the Research Excellence Framework (REF), which is basically how funding decisions are made for research in universities. As the REF home page says: “The assessment provides accountability for public investment in research and produces evidence of the benefits of this investment”, which I guess is a fancier and more money-focused way of saying what I said above. Clearly, altmetrics will play a significant part in proving the worth of areas of research, particularly with the shift to Open Access that is also being hustled along by the REF.

As we discussed in our DITA lecture, altmetrics cannot be relied upon for the whole picture. Tools such as Altmetric rely on documents having DOIs, and, due to the ever-shifting nature of social media, results are not stable; they will only ever provide a snapshot of a moment in time. Five minutes later, things could be different. Not only that, but the way we share things on social media can often be flippant and superficial; just because I share a link doesn’t mean I’ve really read it. However, as Ernesto Priego points out on the Altmetric blog, using altmetrics often (but not always) means you can pinpoint data such as the geolocation of the person sharing the link, which can give added weight to the significance of the share.
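The reliance on DOIs is visible in Altmetric’s own public API, which is keyed on them. A minimal sketch of a lookup follows; the DOI is a placeholder, the free endpoint is rate-limited, and I am going from memory on the field names, so check the API documentation before relying on this.

```python
# Minimal sketch of looking up attention data for a single DOI via Altmetric's
# public API. The DOI below is a placeholder; the free endpoint is rate-limited
# and returns 404 if Altmetric has never seen the DOI mentioned anywhere.
import requests

def altmetric_for_doi(doi):
    resp = requests.get(f"https://api.altmetric.com/v1/doi/{doi}", timeout=30)
    if resp.status_code == 404:
        return None          # no mentions tracked for this DOI
    resp.raise_for_status()
    return resp.json()

data = altmetric_for_doi("10.1000/example-doi")
if data:
    print(data.get("title"), data.get("cited_by_tweeters_count"))
```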

[Screenshot: Altmetric Explorer results]

Using Altmetric Explorer last week was an interesting experience. I was a bit frustrated that the keyword search didn’t seem particularly accurate. For example, I wanted to search for mentions of “Aotearoa”, the Maori name for New Zealand, as I thought it would cut out the chance of picking up articles about Zealand in Denmark. However, despite the uniqueness of the name, some of the articles returned did not contain the word, or even have anything to do with NZ at all, and I couldn’t get to the bottom of this. I also noticed that mostly science-based journals were being discovered, but I guess this is probably due to these journals having a higher proportion of DOIs than journals in the humanities and literature, which is where I was more likely to find the topics I was interested in. One thing I wondered about was whether there is any kind of correlation between how “populist” an article’s topic is and the kind of social media used to share it; perhaps including Pinterest and Tumblr in my search scope would reveal something rather different than if I stuck to news sites and blogs. However, it was difficult for me to judge that from the results I received (partly because the notion of what’s “populist” is subjective, I think).

Looking at the Altmetric “doughnuts” was far more pleasing and easy to take in at a glance than using Excel spreadsheets, and I will definitely be going back to this tool and hopefully will be able to get more out of it with practice.

Learning to love the digital in order to understand the world

I was fumbling around for a way into a blog post this week, and was inspired by my classmate Judith’s entry, “If it’s boring, it’s important”, which made me laugh as it’s so painfully true.*

That said, I am loving the way DITA is being taught; it puts a whole new spin on things that have otherwise never interested me, and I am certainly not finding it boring. In fact it has me seriously questioning the ways I have been taught about digital technology in the past! Ernesto’s slideshow for our last lecture, on “Archiving, Understanding and Visualising Twitter data”, is a good example of this “angle”, ending with a cartoon by Randall Munroe, creator of xkcd.com, which has two stick figures in a dark and empty landscape full of possibilities saying “Let’s find out”. Ultimately, all of this is about being curious, having questions, and using information, such as that from Twitter, to find out about the world we live in. When information technology is looked at in this way, it is much less daunting.

Having my mind opened up in this way is leading to some important realisations. For example, it now seems clear to me that leaving Twitter data out of the equation when analysing modern communication networks, the topics groups of people care about, and the way current events unfold (events which will one day be of historical importance) is bordering on irresponsible. As Ernesto Priego says in his blog post ‘Twitter as public evidence and the ethics of Twitter research’, “these days what’s unethical is not to use Twitter as a research tool”. Indeed, the Library of Congress signed an agreement with Twitter in 2010 which gave it access to an archive of public tweets from 2006 to 2010, and Twitter continues to provide the Library with access to public tweets to this day. On the Library of Congress website, it is explained that the reason for this is that the Library’s core mission is to “collect the story of America”, demonstrating the importance of social media as document, and consequently its place in the way we understand our world. As Lyn Robinson states in ‘The future of documents’, networked technology is only going to become more pervasive, and as such social media will not be going away any time soon. Furthermore, the role Twitter and other social media play in political protest and world-changing events is the subject of much recent debate in the media, and even when the position taken is that it’s actually not very important, as in this article by Laurie Penny for the New Statesman, ‘Revolts don’t have to be Tweeted’, Twitter et al. are still central to the discussion. Either way, social media can’t be ignored.

That said, I had never used Twitter until I needed to for #citylis, and therefore I remain sceptical about privileging it (though I do understand that it is the public nature of Twitter which lends itself to study in the context of a classroom). Adding to my scepticism is my experience working in a public youth library in New Zealand. It was the early 2000s and Bebo and MySpace were new on the scene. Interestingly, many of my peers who lived in the central city and were interested in punk music took to using MySpace, while almost all of the kids who used the library (which was based in an economically underprivileged suburb), and who generally listened to hip-hop and R&B, used Bebo. These different groups were having conversations and building communities that didn’t seem to touch each other, though they were all living in the same city. And to this day I don’t quite understand why one social media platform would be chosen over the other on the basis of socio-cultural/economic factors, given that both were free.

I am therefore finding myself quite drawn to literature which points out the biases which occur in analyses of Twitter data, particularly in relation to when these analyses are used to explain social and historical events by the media.

In ‘Assessing the bias in samples of online networks’, Gonzalez-Bailon et al. describe using the Twitter search API (application programming interface) and the Twitter streaming API, with various filters, to compare the kinds of data each brings up about the Spanish ‘indignados’ protests in 2012. Probably not surprisingly, they found that smaller samples don’t reveal the diverse array of peripheral activity and conversations that was going on, and that their data from the Twitter search API with filters was biased towards the most central tweets and users. Unfortunately, unless you are the Library of Congress or some other big organisation which can pay for archives of “all” the tweets, you will be limited to smaller samples, and will therefore get a skewed picture of communication networks.
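The difference between the two APIs the authors compare shows up clearly in code: the search API is a one-off, retrospective query over a limited index, while the streaming API filters tweets as they arrive. Here is a rough sketch using the tweepy library; the credentials are placeholders and the method names vary between tweepy and Twitter API versions, so treat this as a shape rather than a recipe.

```python
# Rough sketch of the two ways of collecting tweets that Gonzalez-Bailon et al.
# compare, using the tweepy library. Credentials are placeholders and method
# names differ between tweepy / Twitter API versions.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# 1. Search API: a retrospective query over a limited index of recent tweets.
for tweet in api.search(q="#indignados", count=100):
    print(tweet.user.screen_name, tweet.text)

# 2. Streaming API: a filtered sample of tweets as they are posted.
class PrintListener(tweepy.StreamListener):
    def on_status(self, status):
        print(status.user.screen_name, status.text)

stream = tweepy.Stream(auth, PrintListener())
stream.filter(track=["indignados"])
```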

But this brings me back full circle to things that are, if not boring, at least seemingly impenetrable at first glance being the most important. As researchers, librarians and information specialists we need to be able to understand things such as how APIs work, and their inherent limitations, in order to best assess the data we collect. I am also interested in thinking about why certain people use Twitter and others don’t, and why some groups were using Bebo and others MySpace in the mid-2000s. How does this affect data visualisation? You only have to be a New Zealander, and look at the picture of the world-as-connected-by-Facebook on Facebook’s login page and see your country left out of it, to realise the limitations of Big Data, and to remember there is always another story going on beyond the one gleaned from the algorithms.

*Entertaining aside: I shared a link on Judith’s blog to an article by Charlie Brooker for the Guardian, “What is Drip and how, precisely, will it help the government ruin your life?”, about the Data Retention and Investigatory Powers bill, which Brooker describes as “the most tedious outrage ever”. This is how They will get us in the end: by boring us to death with the things that matter most.

Understanding APIs with the help of WhatsApp

Getting to grips with the concept of an “API”, particularly in contrast to a web service, took me quite some time.  I couldn’t figure out what the difference was initially, until I WhatsApped with a friend of mine in New Zealand who is a programming whiz genius-type person. She, succinctly in a text message, informed me that a web service is a type of API, but APIs themselves are not web specific.  She went on to give an example of an API that may exist in conjunction with, say, a kernel in the open-source operating system Linux, which would allow Linux app developers to write desktop applications.  As she said, APIs can be accessed on the same machine, rather than over a network, unlike web services which are almost always accessed over HTTP (K. Graham, personal communication October 26th, 2014). Hooray for WhatsApp allowing my friends back home to explain the intricacies of DITA to me 🙂
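Her distinction can be boiled down to two calls: one to an API that lives on the same machine, and one to an API exposed as a web service over HTTP (the URL below is just a placeholder).

```python
# Two calls that illustrate my friend's distinction.
import os
import requests

# A local API: Python asking the operating system something directly,
# no network involved.
print(os.cpu_count())

# A web service: an API that happens to be exposed over HTTP.
# (Placeholder URL; any JSON-returning endpoint would do.)
resp = requests.get("https://api.example.org/status", timeout=10)
print(resp.status_code)
```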

I had a quick look to see what kinds of APIs exist for WhatsApp. This link explains a couple of ways WhatsApp can be integrated into various apps: http://www.whatsapp.com/faq/en/iphone/23559013. WhatsApp supports iOS’s Document Interaction API, which is what allows multimedia created by other apps to be shared on WhatsApp.
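The other integration route, a custom URL scheme, is the easiest to picture: another app simply builds a specially formed link and hands it to WhatsApp. A small sketch follows; the whatsapp://send?text= form is the one I have seen documented, but check the FAQ linked above before relying on it.

```python
# Sketch of building a WhatsApp share link via its custom URL scheme.
# The whatsapp://send?text= form is the one I've seen documented; verify
# against the FAQ linked above before relying on it.
from urllib.parse import quote

def whatsapp_share_link(message):
    return "whatsapp://send?text=" + quote(message)

print(whatsapp_share_link("Explaining APIs over WhatsApp for #citylis"))
# -> whatsapp://send?text=Explaining%20APIs%20over%20WhatsApp%20for%20%23citylis
```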

Part of my difficulty in getting my head around what APIs and web services are is that they are so ubiquitous: we use them, or the mashups of data and services they provide, every day when we read tweets embedded in the Guardian, or copy and share media or text between applications.

For a bit of embedding practice, as well as a tie-in with my WhatsApp revelation, here is a talk by Toby Shapshak on the role of the mobile phone in Africa.