I’ve just returned from Silicon Valley and the 2013 IEEE International Conference on Big Data, which included a day-long workshop on big data in the humanities. Big Humanities, as it was called, was organized by Dr Mark Hedges, Dr Tobias Blanke, and Professor Richard Marciano, and the sessions were strikingly different from the technical talks on text mining and data visualization going on all around us. Here, the spotlight was on the potential value of big data.
What is Big Data?
Big Data is defined as data sets so large, and growing with such velocity, that they cannot be handled by traditional analytic tools. And it is not only in science that new tools and technologies are needed. We produce more blogs, emails, and comments on social websites than ever. We use our smart phones to record and transmit information about our locations and interests. Google logs the websites we browse, and Apple the music we download. Credit card companies log the purchases and trips we make. And all this information, including the additional storage and analysis of it by companies, is piling up into data sets that someone, somewhere wants to mine for vital information about social or musical trends, transport schedules, group behavior, climate impact, health and well-being. Unless you live as a recluse, you are part of big data already.
The usefulness of big data in the Humanities?
The advertised usefulness of such data all depends on key examples: cluster analysis of how income and health are related, or predictions of better ways to schedule flight times, design food products, or target goods and services. It may also provide a goldmine for social scientists trying to study changes in behavior, social and anti-social. But how will it be of use to the humanities, which concentrate on specific, small-scale objects and their peculiarities, like people, places and things, often studied not in the abstract but in their historical context? We deal with acts and utterances that stand in need of interpretation, or art works we aim to understand, not quantitative research on well-structured information. How will the tools of big data help us? Can they serve the traditional goals of humanities research, or will they create new goals and methods for humanities researchers in the digital age? Things may change when more of the objects of study have been digitised, or are born digital.
Part of the problem is that the term ‘data’ in big data is ambiguous, a fact often overlooked. Data can mean a data structure, a digital entity that doesn’t have to be meaningful, or it can mean evidence for a hypothesis. It is the latter people have in mind when they hope for insights from big data, but it is often just data structures that proliferate. One ICT director from a large museum told me that if we don’t organize the information we store in a meaningful way, it will become landfill.
Dealing with complexity: Data Visualisation
One way to make sense of the complexity of the results yielded by probing the data is visualization, and data visualization is all the rage right now. However, there’s a danger in hoping that we can go from large unstructured data sets to meaningful insights by relying on visualization: visualizing techniques are not, alone, the answer. And no visualization is neutral, as Kathleen Kerr, Virginia Tech University, pointed out in her talk on the rhetorical power of visual imagery.
An old motto of computer science is ‘garbage in, garbage out’. Unless the vital work of data curation goes into handling the information we store away, there may be little of use we can get out. Here, there is a role for curators to work with computer scientists at the input stage. There were calls for more of this collaboration at the Workshop, and evidence that it was already taking place in Google’s new Cultural Institute.
At the Cutting Edge
Datafication, as we’re learning to say, goes beyond digitization. Once you have the digital objects standing for the real, what can you do with them that you couldn’t do before? It’s here we find the cutting-edge projects: new tools that can link contents from vast stores of digitized newspapers to show patterns of convergence in nineteenth-century thinking about key issues and events, spanning several continents. This is fast becoming a reality. Knowing which newspapers corresponding writers were reading, and being able to see what was topical on a given day and how it was interpreted in the letters that were exchanged: this is what creates new questions and new methods of inquiry. And it is already happening in the humanities. Good examples were presented by Professor David Smith, Northeastern University; Dr Ben Miller, Georgia State University; Professor Brent Seales, University of Kentucky; Professor Richard Marciano, University of North Carolina, Chapel Hill; Neal Audenaert, Texas Center for Applied Technology, Texas A&M University; and Natalie Houston, University of Houston.
But will something be lost? Once new methods become available and lead to new questions, will researchers find themselves changing the subject? Will they follow what the new digital systems do best rather than what they wanted to know? How do we transform data into knowledge? A well-argued case was made by Dr John Simpson, University of Alberta, Canada, himself a practitioner of the new technology, that something of the specificity of what humanities researchers investigate could get lost through decisions about how to record and organize data for the convenience of search and linkage.
With the beautiful example of Michael Field, an author of many books who was actually two women writers using a pseudonym and aiming to write as a man, we see the difficulty of classifying the author of the works. Is it Michael Field? Is it Katharine Bradley or Edith Cooper, or both? Here, we encounter problems of ontology. Should the author be listed as a male author, as the two writers wished, or as a female author, and if so, which one?
Databases have to list the objects and properties of things in their domain. But how should one do that, and how will the decisions taken affect the linking of this data to other sources? Bibliographic data is highly structured, and is a pre-existing example of big data, as Professor Andrew Prescott, the Leadership Fellow for the AHRC Digital Transformations Theme, pointed out. Once you depart from that world and move into domains that literary scholars care about, things are not so well-behaved as computer science would like, as Dr Amalia Levi, University of Maryland, brought out in her talk on myths and challenges. There are technical and theoretical issues here in the ontologies of our linked data sets.
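To make the ontology problem concrete, here is a minimal sketch in Python of one way a catalogue might record a pseudonymous, jointly written body of work. It is an illustration under stated assumptions, not a description of any system presented at the workshop: the class names, field names, the placeholder people ‘Writer A’ and ‘Writer B’, and the title ‘Example Work’ are all hypothetical. The idea is to keep the name a work was published under separate from the people behind that name, so that neither answer to ‘who is the author?’ has to be thrown away.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Person:
    """A historical individual (placeholder records, not real catalogue data)."""
    person_id: str
    name: str
    gender: str


@dataclass(frozen=True)
class Persona:
    """A name under which work was published, and the identity it presented."""
    name: str
    presented_gender: str
    person_ids: Tuple[str, ...]  # the people writing under this name


@dataclass(frozen=True)
class Work:
    title: str
    attributed_to: Persona


def works_by_person(works: List[Work], person_id: str) -> List[str]:
    """Titles of works that a given individual had a hand in."""
    return [w.title for w in works if person_id in w.attributed_to.person_ids]


def works_by_gender(works: List[Work], people: Dict[str, Person],
                    gender: str, *, presented: bool) -> List[str]:
    """Titles filtered by gender.

    presented=True  uses the gender the published name presented;
    presented=False uses the gender of the people behind the name.
    """
    titles = []
    for w in works:
        persona = w.attributed_to
        if presented:
            match = persona.presented_gender == gender
        else:
            match = any(people[pid].gender == gender for pid in persona.person_ids)
        if match:
            titles.append(w.title)
    return titles


# Hypothetical sample data: two placeholder writers sharing one pseudonym.
writer_a = Person("p1", "Writer A", "female")
writer_b = Person("p2", "Writer B", "female")
people = {p.person_id: p for p in (writer_a, writer_b)}

michael_field = Persona("Michael Field", "male", ("p1", "p2"))
works = [Work("Example Work", michael_field)]
```

With this layout the same work turns up under a search for male authors when `presented=True` and under a search for female authors when `presented=False`. A schema with a single ‘author gender’ column would have forced one of those answers to be discarded at the input stage, which is exactly the kind of decision that affects later search and linkage.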
The move toward data-driven research suggests the automatic sifting of large-scale information in the hope of finding patterns that may be revealing; and indeed they may. But it runs counter to the scientific method of starting with a hypothesis and looking for the relevant data against which to test it. Data, in the sense of evidence, is only data for a theory or hypothesis. Of course, one can have smart hunches or prior questions about what one wants to look for. Merely asking, ‘What’s in my data?’ will hardly endear you to the collaborating computer scientist who is trying to understand what you want to find in all the data you have access to. Professor Lu Xiao, from University of Western Ontario, wants to help humanities researchers with big data, but to do so she needs to understand what they want to do, and she has started interviewing researchers before she attempts to build the tools. This kind of collaboration will be vital.
There is work to be done in the humanities and sciences in studying the phenomenon of big data. We need to ask more questions about the source and quality of the data, rather than its volume. We need to ask about the validity of the results produced. Are assumptions about people generated from data not gathered for that purpose going to lead to reliable and useful insights? Also, how much should we mind that the devil is no longer in the details but in massive-scale generalisations and abstractions? It is not only academics but marketing companies that are worried by these issues. From all this data gathering what they most desire is personalized data: to know something specific about you and not just about people who shop at a certain store and download certain music. They want to be able to address your particular interests and preferences. So small and personalized data is sometimes the aim of all this massive-scale data mining.
Finally, should we worry about our security and privacy? There are big issues here concerning the ethics of big data gathering and mining. These are issues not just about the ownership and control of data, the gathering by governments of swathes of information about their citizens, and the right to be forgotten and have information thrown away, but also about the morality of companies and policy makers taking decisions that will affect millions on the basis of generalizations made about those on the web and with smart phones who generate all the data. We need humanities scholars like Professor Peter Ludlow at Northwestern University, with advanced knowledge of the technologies, to keep an eye on dark data. There’s a lot of work to be done here and we’re just beginning. In the Humanities we’ll need to know about these developments and be ready to respond.
This blog post is written by Professor Barry C Smith as part of his role as the Leadership Fellow for the AHRC Science in Culture Theme. The Science in Culture Theme is a key area of AHRC funding and supports projects committed to developing reciprocal relationships between scientists and arts and humanities researchers.
For updates and latest news and information, follow us on Twitter @AHRCSciculture