Monday, June 22, 2015

A Bold And Arduous Project Of Arriving At Moral Perfection


Ben Franklin stands in the pages of history triumphantly mastering electricity, establishing universities, and writing constitutions. He is heroically armed with kite, with beaver hat, and with book. It seems odd to think of him doing something so quotidian as basic arithmetic.

But back in his younger days he was far less exalted: we must imagine him bent over account books, making a reckoning of each day's purchases and sales, furrowing his brow over sums our phones could do in a half second, trying to figure out whether he made a profit or loss on a printing of some evanescent pamphlet.

Accounting was incredibly important for people like Franklin, because accounting allowed merchants and like-minded souls a way of seeing otherwise invisible patterns. People kept accounts of their businesses to tell their profits and losses. They kept accounts of their own finances to see whom they owed money to. And many people, like Franklin, or Robert Hooke or Samuel Pepys, tried to do even more. They tried to keep accounts of their own selves.

A writing blank; from BiblioOdyssey. Note the banker in the lower picture working over the bank's account book.

Here's Franklin's Autobiography on the origin of his famous moral accounting:

I conceived the bold and arduous project of arriving at moral perfection. I wish'd to live without committing any fault at any time; I would conquer all that either natural inclination, custom, or company might lead me into.... [Bad habits] must be broken, and good ones acquired and established, before we can have any dependence on a steady, uniform rectitude of conduct. For this purpose I therefore contrived the following method.
I made a little book, in which I allotted a page for each of the virtues. I rul'd each page with red ink, so as to have seven columns, one for each day of the week, marking each column with a letter for the day. I cross'd these columns with thirteen red lines, marking the beginning of each line with the first letter of one of the virtues, on which line, and in its proper column, I might mark, by a little black spot, every fault I found upon examination to have been committed respecting that virtue upon that day.
It looked a little something like this:



              Monday    Tuesday
Temperance
Silence       *
Order
Resolution
Frugality     **        ***
Industry
Sincerity     *
Justice
Moderation
Cleanliness
Tranquility
Chastity                ****
Humility

The contemporary project of the quantified self promises to improve on Franklin's little spreadsheet, turning this thumbnail sketch into a warts-and-all big data portrait of an individual through time. The hype suggests that wearable technology like Fitbits and Apple Watches can turn our daily activity into numbers, and these numbers can be visualized, massaged, mined, and understood in ways we never could have dreamed of. This will allow us to uncover the hidden causes of our unhappiness, to graph our continuous climb towards self-improvement, to crunch regressions that show us just exactly where it all went wrong.

I don't buy the hype, though. Not yet anyway. The new quantified self might be able to measure a lot, but can that data answer many questions? Ben Franklin kept track of his moderation; the Fitbit can only count your heartbeat as you wait for your new girlfriend to pick you up at the BART stop. Ben Franklin accounted for his temperance; the Fitbit can only measure the number of calories you burn on your guilty post-pizza morning runs. Ben Franklin paid attention to his industry; the Fitbit can only tell you how many naps you took in the day, not what dreams you dreamt while napping, not the pleasure you get from putting a pillow over your eyes to block out the afternoon sun.

And it's those hard three-in-the-morning questions that really matter: Am I good? Am I happy? Am I just stumbling through life? Are we anything more than a bunch of over-proud apes clinging desperately to a rock falling forever through the vacuum of space?

Nobody expects Fitbits to answer questions like these. But they are the questions people expect humanities scholars to wrestle with. So what is digital humanities to do with the quantified self?

~*~

I don't know, but I want to take a crack at telling a story of my life through data. Not just a story of how many steps I took or stairs I climbed, but a weighty story about what my life actually means. So this summer I will take an account of each day, and then do some experiments with this data to try to better understand my self and my place in the world.

But which data should I collect?

This is one of the deep problems of digital humanities. We can work real magic with data, but before we can even begin to build a database, we need to come up with a good question to answer. Then we need to figure out which data might help us answer this question.

To that end, I've set up this Google spreadsheet to canvass suggestions from you, my legions of readers, about what I should track over the summer. Should I measure the number of books I've read? The number of meals I've cooked? The number of times I've thought of a butterfly? The minutes I spent pacing through my rooms, daydreaming? Which data really matter? Which data are just noise?

I'll start the data collection in early July. Afterwards I will post with some regularity about my findings. By the end of the summer I will be able to tell a story of my life, graphs and all. If I'm lucky, it will be the kind of story that Ben Franklin might be proud of. If not, at least there will be some pretty graphs.

This post is part of a Digital Humanities blogging challenge organized by the Berkeley Digital Humanities Working Group.

Monday, March 23, 2015

Content Analysis With Tableau: Intro

Bit of a change of pace here at Raise High the Roofbeam, Carpenters. Over these next few blog posts, I will write about how a method borrowed from the harder social sciences called content analysis might be useful for humanities scholars.
Content analysis takes qualitative data, boils it down into numbers, then analyzes it. A lot of humanities scholars are already doing content analysis in some form, largely without realizing it. This post will hopefully begin to bridge the gap between the ad hoc content analysis strategies of digital humanists and the formal content analysis of sociologists and psychologists.
Before we start, you might be asking what the pay-off of all this work is.
My current research uses content analysis to look at 18th-century Christmas. I looked through more than 250 diaries for entries written on or around Christmas day, which I then coded. The resulting visualization, which you can find here, displays the percentage of diary entries mentioning a given code, by decade.
Please keep in mind this is only a working example—results are not to be cited or circulated except as an example of this method.
The following blog series will give you step-by-step instructions into how you can turn your research question into a data visualization like the one above.
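As a preview of the arithmetic behind a chart like that, here is a minimal Python sketch of the percentage-per-decade calculation. The entries and codes below are invented for illustration, not drawn from my actual Christmas data.

```python
from collections import defaultdict

# Hypothetical coded diary entries: (year, set of codes assigned to that entry).
entries = [
    (1701, {"DINE", "CHURCH"}),
    (1705, {"CHURCH"}),
    (1712, {"DINE"}),
    (1718, set()),
]

def pct_mentioning(entries, code):
    """Percentage of entries per decade that mention `code`."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for year, codes in entries:
        decade = (year // 10) * 10
        totals[decade] += 1
        if code in codes:
            hits[decade] += 1
    return {d: 100 * hits[d] / totals[d] for d in sorted(totals)}

print(pct_mentioning(entries, "CHURCH"))  # → {1700: 100.0, 1710: 0.0}
```

The same handful of lines works whether the codes come from diaries, court records, or newspapers; only the coding itself is slow.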
Broadly speaking, content analysis studies communicative activity by turning qualitative data into quantitative data. At its simplest, a scholar counts the number of times a particular thing happens in a particular set of documents across a particular span of time. Kimberly Neuendorf, author of the most thorough content analysis textbook I’ve read, defines the method like this: “Content analysis may be briefly defined as the systematic, objective, quantitative analysis of message characteristics.” The field is thriving—the number of articles mentioning the method has skyrocketed over the past decade, in part due to the vast expansion of the number of machine-readable documents researchers can now access.
The appeal of this method for humanists is obvious. In some ways, content analysis simply formalizes the narrative synthesis humanists are already so good at.
But despite this promise, it’s hard to know where the humanist can start with content analysis. Textbooks are pitched towards sociologists, psychologists, and scholars in media studies struggling with that wonderful non-stop fire-hose of present-day data. The method is likewise oriented towards the harder social sciences, which wrestle with very different questions and hold very different theories of change and action than historians and literary scholars. Unlike other digital humanities methods, like text analysis, social network analysis, or geospatial analysis, there is no single out-of-the-box technical solution for content analysis projects—as far as I know. Finally, there is the problem of the name ‘content analysis’ itself. It is neither evocative nor punchy. It barely describes what the method does. Frankly, it feels boring, overly technical, and scientistic.
These blog posts will hopefully go some way towards providing the interested humanist with the essential background and tools to overcome these problems.
A warning first: I am an interested amateur, and these blogs represent merely what I’ve gleaned from trying out my own content analysis projects. There’s a certain here’s what I learned on my Summer Vacation quality to all of this. Experts in content analysis will likely find many faults with what follows. Other historians will certainly offer feedback about how I can make better questions and collect more comprehensive corpora. I look forward to their corrections.

Table of Contents

Content Analysis With Tableau 1: A Humanistic Map To A Content Analysis Project

This is Part 1 of a 4 part series on content analysis and Tableau. Maybe start with the intro?
Please note that this is a work in progress and I am more amateur than expert. I welcome questions, comments, and corrections.
In this section, I am going to map out the different stages of a content analysis project. This will be an adaptation (much reduced) of the flowchart found in Neuendorf's content analysis textbook.
First off, start with a good question. Digital humanities methods are great tools—the challenge is in using them to make great scholarship. Content analysis works particularly well for examining the often-glacial changes in social practices over time periods longer than a human lifetime. As more humanists gain expertise in content analysis, we will hopefully find new questions to play with.
Then, seek out some kind of body of texts to look through. You need either a preexisting corpus of material or a way of selecting material to build this corpus. You also need some way of sorting through this corpus so you know what material you will be coding.
Next, determine which variables you are going to be looking for. Broadly, in my own work I code for demographic variables and descriptive variables.
Then code! This means that you read your documents and assign codes to each one. These two steps will be outlined in Part 3.
Finally, you’re ready for visualization and analysis.
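The stages above can be sketched as a toy pipeline in Python. Everything here (the miniature archive, the keyword-based selection rule, the cue-word codebook) is invented to show the shape of the workflow, not a real implementation of it:

```python
def select_corpus(archive, keyword):
    """Stage 2: pick out documents mentioning the keyword (a toy selection rule)."""
    return [doc for doc in archive if keyword in doc.lower()]

def code_document(doc, codebook):
    """Stages 3-4: assign every code whose cue word appears in the document."""
    return {code for code, cue in codebook.items() if cue in doc.lower()}

# A toy archive and codebook, purely illustrative.
archive = [
    "We dined on goose this Christmas day.",
    "Attended church; a plain sermon.",
    "Christmas kept quietly at home with prayers.",
]
codebook = {"DINE": "dined", "CHURCH": "church", "PRAYER": "prayer"}

corpus = select_corpus(archive, "christmas")           # stage 2: build the corpus
coded = [code_document(doc, codebook) for doc in corpus]  # stages 3-4: code it
# Stage 5 would count, analyze, and visualize `coded`.
```

In real work stages 3 and 4 are done by a human reader, not a cue-word match; the sketch only shows how the stages hand data to one another.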


Content Analysis With Tableau 2: Data Collection

This is Part 2 of a 4 part series on content analysis and Tableau. Things will make more sense if you start with the intro.
Please note that this is a work in progress and I am more amateur than expert. I welcome questions, comments, and corrections.

You have your research question in hand, and you think a content analysis approach might be useful. What’s next?
The first step is to select which data you want to be working with—your so-called corpus. This can be a certain kind of source: diary entries written on December 25th in the British Isles from 1688 to 1850. Alternatively, it can be a certain section of an archive: court cases resulting in a hanging. You may also use search terms to home in on documents in a larger archive mentioning a particular word or set of words. In a previous project I found descriptions of London coffeehouses by searching online archives for material mentioning the word ‘coffee.’ You may also combine multiple archives into one project, if you are careful to balance the different biases of each.
Where do you find all this text? There are many digital corpora available, some of which have reliable text-searching capability. In the British context (the one I know best) there is an embarrassment of riches—London Lives, ECCO, the Burney Collection and other fully-searchable archives are all available to most academic institutions. But you don’t need to use digital material! You can just as easily code printed books, sheet music, or paintings.
A side note: Remember to be very wary about how you use search in digital archives. It is easy to think of search as a flat mirror of the archive, but search engines have their own philosophical assumptions, often occluded under a miasma of proprietary algorithms. My tips are as follows: Search for words with fewer letters, to reduce the chances of OCR errors. (The longer your word or phrase, the greater chance that OCR will garble it up.) Avoid words with the dreaded long s that can confuse OCR. Be aware that the absence of mentions of a term over a long period of time might not be reflective of the actual practice itself, but rather a change in the use of words describing that practice.
Next, you should consider whether the corpus you’ve collected is the right size. The longer the time period you are working with, the more sources you need. If you don’t have a very large corpus (and I would prefer to have more than ten observations per year at the very least), think about expanding your corpus. But not too much. Content analysis is incredibly time-consuming, as it involves hand-coding each and every instance of the terms you are dealing with. If your selection process has given you far too much data, you can select a random sample of it. (Use a sample-size calculator to figure out how many entries you need.) Even doing this, the whole process will likely take solid weeks of work.
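Sampling is one step a few lines of code genuinely help with. Here is a minimal Python sketch; the corpus of 2,000 entry IDs is invented, and 322 is roughly the target a sample-size calculator gives for a population of 2,000 at a 95% confidence level and 5% margin of error.

```python
import random

# Hypothetical pool of coded-candidate entries, one ID per diary entry.
entry_ids = [f"entry_{i:04d}" for i in range(2000)]

random.seed(42)  # fix the seed so the sample can be reproduced later
sample = random.sample(entry_ids, 322)  # sampling without replacement

print(len(sample))  # → 322
```

Recording the seed alongside your codebook means another scholar can regenerate exactly the same sample from the same corpus.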
Once you have selected your data collection method, download your sources and assign each document a unique ID. In my Christmas project, I have done this by date of publication, as you can see here.
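For what it’s worth, ID assignment is easy to automate. This sketch mimics the year-plus-letter pattern (like 1906a) used in my own project; it is illustrative only, and this naive scheme breaks down past 26 documents in a single year.

```python
import string
from collections import defaultdict

def assign_ids(years):
    """Give each document a year-plus-letter ID (1796a, 1796b, ...) in the
    order the documents are processed. Assumes <= 26 documents per year."""
    counts = defaultdict(int)
    ids = []
    for year in years:
        suffix = string.ascii_lowercase[counts[year]]
        counts[year] += 1
        ids.append(f"{year}{suffix}")
    return ids

print(assign_ids([1796, 1796, 1906, 1796]))  # → ['1796a', '1796b', '1906a', '1796c']
```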

If manually downloading thousands of books doesn’t appeal to you, keep in mind that some intrepid scholars have written scripts that can help you automate the process. You will likely need some familiarity with coding to take full advantage of these tools, and many are in dubious standing with archives’ Terms of Service.
Now that you have built up your little library of digital (and dead-tree) books, it’s time to jump into coding!
NEXT: Coding!

Content Analysis With Tableau 3: How To Fit The Qualitative World Into The Quantitative Nut-shell

This is Part 3 of a 4 part series on content analysis and Tableau. Things will make more sense if you start with the first part.
Please note that this is a work in progress and I am more amateur than expert. I welcome questions, comments, and corrections.

“Oh God, I could be bounded in a nut shell and count myself a king of infinite space.”

Digital humanities has often been described as a kind of ‘distant reading.’ In traditional close reading, the scholar contemplates every last comma. In distant reading, hundreds of thousands of documents are dispatched in the time it takes your computer to crunch a few billion calculations. So where does content analysis fit?
Content analysis is mid-way between close and distant reading. The scholar still reads—but she reads only a small portion of each text relating to the activity in question. The scholar uses computers and statistics to crunch the data—but the meaning remains something she has given to the data herself. If ‘mid-range reading’ didn’t sound so silly, it’d be apt.
Coding is the hardest part of the content analysis workflow, the point where method becomes more art than science. Coding your data flattens it, irons out its particularities, smoothes clean the distinctive wrinkles humanists usually have so much fun analyzing. What’s more, coding threatens to lock much of our interpretive work in a black box of numbers and codebooks. But finding regularities in large amounts of your material does not necessarily blind us to variety or difference. It allows us to zoom out—to see things from a perspective that might be otherwise hard to get hold of. As content analysis visualizations improve, it will be possible to link each datapoint with the exact quote it is derived from, leading to the option for scholars to critique one another's coding strategies. The black box will soon be opened.

Coding Workflow
Before I delve into the more technical considerations, it may be helpful to lay out how I go about coding my material.
I open up an Excel document where I keep my records. I use Excel simply because I am familiar with it—other data management programs (like Access and LibreOffice) may be more appropriate. Then I open up one of my target documents and find an observation of the practice I’m looking for. This gets a unique identifier I call an INSTANCE. Then I code what this instance is saying, and the place of observation if applicable. Then I record the citation this instance is connected to. When I am finished with the document, I also code in the writer’s name, religious denomination, and class. Each of these is kept in a separate field in my Excel spreadsheet.
Each field is connected to other fields by a set of common variables. Some fields are linked to one another by a shared ‘instance.’ (Instance 1 might have a ‘CODE’ of ‘DINE’ and in another field have a ‘PERSON’ of ‘William Bagshawe.’) Other fields are linked together by a shared ‘person.’ (‘William Bagshawe’ might have an entry in the ‘gender’ field: male.)
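That linking can be sketched with plain Python dictionaries standing in for Excel sheets. The records echo the William Bagshawe example above; the structure itself is my own simplification, not a description of any particular tool.

```python
# One "field" (table) per observational unit, joined by shared variables.
instances = [
    {"instance": 1, "code": "DINE", "person": "William Bagshawe"},
    {"instance": 2, "code": "CHURCH", "person": "William Bagshawe"},
]
people = [
    {"person": "William Bagshawe", "gender": "male"},
]

def gender_of_instance(instance_id):
    """Follow the shared 'person' variable from the instances table to the people table."""
    person = next(r["person"] for r in instances if r["instance"] == instance_id)
    return next(r["gender"] for r in people if r["person"] == person)

print(gender_of_instance(1))  # → male
```

Because the person's demographics live in one row of one table, correcting a mistake there corrects it for every instance at once.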
Let’s take a closer look at this whole process.

What to code
Before anything else, you should think about what sort of material you need to discover to answer your research questions. I find it helpful to separate this material into two categories: observations and demographics.
Observations are the behavior you are trying to describe. This can be the activity in coffeehouses, Christmas rituals, the amount of money mentioned in a text, or the mood of a poem.
Demographics are the ‘metadata’ surrounding each observation. The most obvious variable to keep track of is date. But you can also code for gender of observer, place of observation, social class of observer, and so on. You also need to include enough information in each observation so that other people will know where you’re getting your data from. That means that every observation should be connected to a reference. I do this by recording two variables: citation, with the actual page number, and source, which has the code standing for the book the reference is gleaned from. (Remember those codes we gave our books when we downloaded them? They come in handy now.)

Setting Up Your Excel Sheet                                                                  
So where does coding happen? You can do it in many programs, but I use Excel—not because it is the most advanced, or the best suited for this particular project. I just know it the best. It is also helpful that you can find Excel on almost every computer you come across.
Before we start coding, we need to set up our spreadsheet. The problem is that we are used to making spreadsheets that humans can read, but it’s a lot easier in the long run to make spreadsheets that computers can read.
When I think of a spreadsheet, I think of something like this:


A human reads this spreadsheet from left to right, clearly seeing which data belongs to which observation. Instance 10 is of a male lawyer in London named Daniel O'Connell who, in 1796, does not mention Christmas. You can check page 91 of book 1906a to see the diary entry I derived this data from.
The problem is that computers don’t like this kind of database architecture. What they like is ‘tidy’ data. In the words of Hadley Wickham “Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table.”
The data above are NOT tidy. Although each observation is a row and each variable is a column, the observational units are not split out by table.
So instead of a single sheet jammed full of data, we have to imagine a number of sheets, all of which interact relationally (a so-called ‘relational database’). Each ‘table’ (a sheet in Excel, also sometimes called a ‘field’) is a different observational unit—religious denomination, year, and so on. Each row gets its own observation. And each column is its own variable.
It is often helpful to understand your relational database visually: every ‘field’ (each tab in Excel) is related to other ‘fields’ by shared variables. It is important to keep the names of your shared variables EXACTLY THE SAME otherwise you’ll have complications later. (There are programs available for turning a human database into a relational database, but why not start out on the right foot?)
Much work has been done on how to keep your data tidy—consult that literature if you’re curious about how to structure your database. There are also some computer programs, like Data Wrangler, that can help turn a normal ‘flat’ database into a tidy one, if you need them.
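To make the flat-versus-tidy distinction concrete, here is a Python sketch that splits the Daniel O'Connell row described above into two observational-unit tables. The field names and the NO_CHRISTMAS code label are my own inventions for illustration.

```python
# A "human-readable" flat row, one column per fact, as in the example above.
flat = {
    "instance": 10, "person": "Daniel O'Connell", "gender": "male",
    "occupation": "lawyer", "place": "London", "year": 1796,
    "source": "1906a", "citation": 91, "code": "NO_CHRISTMAS",
}

def tidy(rows):
    """Split flat rows into one table per observational unit:
    observations (one row per instance) and people (one row per person)."""
    observations, people = [], {}
    for r in rows:
        observations.append({k: r[k] for k in
                             ("instance", "person", "year", "code", "source", "citation")})
        people[r["person"]] = {"person": r["person"], "gender": r["gender"],
                               "occupation": r["occupation"], "place": r["place"]}
    return observations, list(people.values())

obs, ppl = tidy([flat])
```

Notice that the demographic columns now appear once per person rather than once per instance; the shared 'person' variable is what ties the two tables back together.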

Open-ended coding versus pre-coding             
Now that you have your relational database set up, it is time for the fun part: actually reading and coding your material.
But before you do this, a huge question arises: what do you code? How do you ball the infinite variety and difference of human life into a dozen marbles of rigid ‘codes’? There are two options.
The first is called open-ended coding. In this method, you write brief descriptions of each practice as you go along. After a certain amount of time, you go back over these and impose a kind of order, balling up particular descriptions into a limited number of broader categories. I prefer to do this at the very end of the coding process, but to each his own.
The second approach is called pre-coding. In this method, the researcher comes up with a pre-determined set of codes before she starts looking at her data.
Whatever you do, make a list of your codes and write a description of each one. This is called a codebook. I like to keep representative quotes of each practice alongside my definitions as well.
There are benefits and drawbacks to each approach. Open-ended coding leaves your research more open to the sources themselves. It also means that you are not pre-judging the material. The downside, however, is that it adds a hugely time-consuming step to an already mind-numbingly-tedious process: you have to go back through your notes after you’ve read everything and impose a set of codes over your inevitably sometimes thin notes. Pre-coding is faster, and it allows collaborative work because you are working from a clear set of definitions. However, there is a very real risk that important phenomena will not be included in these pre-determined coding schemes, and so will either have to be ignored, or your code-book updated and the whole process started over again. (This has happened to me. Not fun.)
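The consolidation step of open-ended coding can be sketched in a few lines of Python. The notes, cue words, and codes below are all invented; real consolidation is a judgment call that no keyword match can fully capture, so treat this as a picture of the workflow rather than a tool.

```python
# Open-ended notes written during a first pass through the sources.
notes = {
    1: "ate roast goose with the family",
    2: "attended morning service",
    3: "great feast with neighbours",
}

# Codebook imposed at the end: broad code -> cue words that collapse into it.
codebook = {"DINE": ("goose", "feast"), "CHURCH": ("service", "sermon")}

def consolidate(note):
    """Ball a free-form description up into the first matching broad code."""
    for code, cues in codebook.items():
        if any(cue in note for cue in cues):
            return code
    return "OTHER"  # flag notes the codebook doesn't yet cover

coded = {i: consolidate(n) for i, n in notes.items()}
print(coded)  # → {1: 'DINE', 2: 'CHURCH', 3: 'DINE'}
```

The OTHER bucket is worth watching: if it grows large, that is the signal that your codebook needs updating, which is exactly the restart-the-whole-process risk described above.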

Working with multiple coders
If you have a codebook handy, you can outsource your coding to others. Qualitative researchers have developed a number of methods for gauging inter-coder reliability. Because I have not yet worked with other coders myself, I can only point interested readers to the fact that these methods exist.
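One common reliability measure is Cohen's kappa, which corrects raw percent agreement for the agreement two coders would reach by chance. A minimal Python sketch, with invented labels:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two coders' labels on the same instances."""
    assert len(a) == len(b) and a
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n               # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

coder1 = ["DINE", "DINE", "CHURCH", "DINE", "CHURCH", "OTHER"]
coder2 = ["DINE", "CHURCH", "CHURCH", "DINE", "CHURCH", "OTHER"]
print(round(cohens_kappa(coder1, coder2), 3))  # → 0.739
```

Kappa runs from 1.0 (perfect agreement) down through 0 (no better than chance); rules of thumb vary, but values much below about 0.7 usually mean the codebook's definitions need tightening before the coding can be trusted.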

Text Analysis and Content Analysis?
Inevitably, sometime during the long and frustrating process of coding, you will ask yourself: can’t a computer do this? I am sure that many people reading this will answer yes.
I am nervous about bringing machine learning into the content analysis of historical materials. Although machine learning approaches may be appropriate for data coming from one solid time period, over the incredibly long time spans historians deal with, language changes in a way that machine learning cannot account for. This does not mean that the challenge cannot be met, merely that it awaits more technically adept practitioners than myself.

NEXT: Visualizing your data!