This is Part 3 of a four-part series on content analysis and Tableau. Things will make more sense if you start with the first part.
Please note that this is a work in progress and I am more amateur than expert. I welcome questions, comments, and corrections.
“Oh God, I could be bounded in a nutshell and count myself a king of infinite space.”
Digital humanities has often been described as a kind of ‘distant reading.’ In traditional close reading, the scholar contemplates every last comma. In distant reading, hundreds of thousands of documents are dispatched in the time it takes your computer to crunch a few billion calculations.
What is content analysis?
Content analysis is mid-way between close and distant reading. The scholar still reads—but she reads only a small portion of each text relating to the activity in question. The scholar uses computers and statistics to crunch the data—but the meaning remains something she has given to the data herself. If ‘mid-range reading’ didn’t sound so silly, it’d be apt.
Coding is the hardest part of the content analysis workflow, the point
where method becomes more art than science. Coding your data flattens it, irons
out its particularities, smoothes clean the distinctive wrinkles humanists usually have so much fun analyzing. What’s
more, coding threatens to lock much of our interpretive work in a black box of
numbers and codebooks. But finding regularities in large amounts of material does not necessarily blind us to variety or difference. It allows us to zoom out, to see things from a perspective that might otherwise be hard to get hold of. As content analysis visualizations improve, it will become possible to link each data point to the exact quote it derives from, letting scholars critique one another's coding strategies. The black box will soon be opened.
Coding Workflow
Before I delve into the more technical considerations, it may be helpful to
lay out how I go about coding my material.
I open up an Excel document where I keep my records. I use Excel simply
because I am familiar with it; other data management programs (like Access or LibreOffice) may be more appropriate. Then I open up one of my target documents
and find an observation of the practice I’m looking for. This gets a unique
identifier I call an INSTANCE. Then I code what this instance is saying, and the
place of observation if applicable. Then I record the citation this instance is
connected to. When I am finished with the document, I also code in the writer’s
name, religious denomination, and class. Each of these is kept in a separate field in my Excel spreadsheet.
Each field is connected to other fields by a set of common variables. Some fields are linked to one another by a shared ‘instance’. (Instance 1 might have a ‘CODE’ of ‘DINE’ and, in another field, a ‘PERSON’ of ‘William Bagshawe’.) Other fields are linked together by a shared ‘person’. (‘William Bagshawe’ might have an entry in the ‘gender’ field: male.)
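If it helps to see this linking concretely, here is a minimal sketch in Python with pandas. The table and column names are my own, and the second code (‘PRAY’) is invented for illustration:

```python
import pandas as pd

# One row per coded instance; 'instance' is the unique identifier.
instances = pd.DataFrame({
    "instance": [1, 2],
    "code": ["DINE", "PRAY"],
    "person": ["William Bagshawe", "William Bagshawe"],
})

# One row per person; 'person' is the shared variable.
persons = pd.DataFrame({
    "person": ["William Bagshawe"],
    "gender": ["male"],
})

# The shared 'person' column links the two tables together.
linked = instances.merge(persons, on="person")
print(linked)
```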
Let’s take a closer look at this whole process.
What to code
Before anything else, you should think about what sort of material you need
to discover to answer your research questions. I find it helpful to separate
this material into two categories: observations
and demographics.
Observations are the behaviors you are trying to describe: the activity in coffeehouses, Christmas rituals, the amount of money mentioned in a text, or the mood of a poem.
Demographics are the ‘metadata’ surrounding each observation.
The most obvious variable to keep track of is date. But you can also code for
gender of observer, place of observation, social class of observer, and so on. You also
need to include enough information in each observation so that other people
will know where you’re getting your data from. That means that every
observation should be connected to a reference. I do this by recording two
variables: citation, with the actual page number, and source, which has the
code standing for the book the reference is gleaned from. (Remember those codes
we gave our books when we downloaded them? They come in handy now.)
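Putting those pieces together, a single coded observation might look something like this. This is only a sketch: the field names are mine, and the code ‘NOXMAS’ (for a diary entry that does not mention Christmas) is invented:

```python
# One coded observation: the behavior itself, plus the demographics
# and the reference needed to trace it back to its source.
observation = {
    "instance": 10,       # unique identifier for this observation
    "code": "NOXMAS",     # invented code: 'does not mention Christmas'
    "person": "Daniel O'Connell",
    "place": "London",    # demographic: place of observation
    "year": 1796,         # demographic: date
    "citation": 91,       # the actual page number
    "source": "1906a",    # code standing for the book itself
}
```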
Setting Up Your Excel Sheet
So where does coding happen? You can do it in many programs, but I use Excel, not because it is the most advanced or the best suited to this particular project, but simply because I know it best. It also helps that you can find Excel on almost every computer you come across.
Before we start coding, we need to set up our spreadsheet. The problem is that
we are used to making spreadsheets that humans can read, but it’s a lot easier
in the long run to make spreadsheets that computers can read.
When I think of a spreadsheet, I think of something like this:
A human reads this spreadsheet from left to right, clearly seeing which
data belongs to which observation. Instance 10 is of a male lawyer in London named Daniel O'Connell who, in 1796, does not mention Christmas. You can check page 91 of book 1906a to see the diary entry I derived this data from.
The problem is that computers don’t like this kind of database
architecture. What they like is ‘tidy’ data.
In the words of Hadley Wickham: “Tidy datasets are easy to manipulate, model and
visualise, and have a specific structure: each variable is a column, each
observation is a row, and each type of observational unit is a table.”
The data above are NOT tidy. Although each observation is a row and each
variable is a column, the observational units are not split out by table.
So instead of a single sheet jammed full of data, we have to imagine a number of sheets, all of which interact relationally (a so-called ‘relational database’). Each ‘table’ (a sheet in Excel, also sometimes called a ‘field’) is a different observational unit: religious denomination, year, and so on. Each row gets its own observation. And each column is its own variable.
It is often helpful to understand your relational database visually: every
‘field’ (each tab in Excel) is related to other ‘fields’ by shared variables.
It is important to keep the names of your shared variables EXACTLY THE SAME; otherwise you'll have complications later. (There are programs available for turning a human-readable database into a relational database, but why not start out on the right foot?)
Much work has been done on how to keep your data tidy; consult that literature if you're curious about how best to structure your database. There are also some computer programs, like Data Wrangler, that can help turn a normal ‘flat’ database into a tidy one if you need it.
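For what it's worth, a few lines of pandas can also do this kind of un-flattening. A rough sketch, assuming a flat sheet with columns like the ones in my example above (instance 11 and its values are invented):

```python
import pandas as pd

# A flat, human-readable sheet: every variable crammed into one table.
flat = pd.DataFrame({
    "instance": [10, 11],
    "person": ["Daniel O'Connell", "Daniel O'Connell"],
    "gender": ["male", "male"],
    "occupation": ["lawyer", "lawyer"],
    "code": ["NOXMAS", "DINE"],
    "year": [1796, 1797],
})

# Split by observational unit: one table per kind of thing.
instances = flat[["instance", "person", "code", "year"]]
persons = flat[["person", "gender", "occupation"]].drop_duplicates()

# Sanity check: every person in 'instances' appears in 'persons',
# so the shared variable really does link the tables.
assert instances["person"].isin(persons["person"]).all()
```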
Open-ended coding versus pre-coding
Now that you have your relational database set up, it is time for the fun
part: actually reading and coding your material.
But before you do this, a huge question arises: what do you code? How do
you ball the infinite variety and
difference of human life into a dozen marbles of rigid 'codes'? There
are two options.
The first is called open-ended coding.
In this method, you write brief descriptions of each practice as you go along.
After a certain amount of time, you go back over these and impose a kind of
order, balling up particular descriptions into a limited number of broader
categories. I prefer to do this at the very end of the coding process, but to
each his own.
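That final balling-up step can be done with a simple lookup. A sketch in pandas, with invented notes and codes:

```python
import pandas as pd

# Open-ended notes taken during a first pass through the sources.
notes = pd.DataFrame({
    "instance": [1, 2, 3],
    "description": ["ate dinner with family",
                    "shared a meal at an inn",
                    "morning prayers at home"],
})

# After reading everything, map each description onto a broader code.
recode = {
    "ate dinner with family": "DINE",
    "shared a meal at an inn": "DINE",
    "morning prayers at home": "PRAY",
}
notes["code"] = notes["description"].map(recode)
print(notes)
```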
The second approach is called pre-coding.
In this method, the researcher comes up with a pre-determined set of codes before
she starts looking at her data.
Whatever you do, make a list of your codes and write a description of each
one. This is called a codebook. I like to keep representative quotes of each
practice alongside my definitions as well.
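In its simplest form, a codebook is just a table with one row per code: the code, its definition, and a representative quote. Something like this sketch (the entries are invented):

```python
# A minimal codebook: a definition plus a representative quote per code.
codebook = {
    "DINE": {
        "definition": "Any shared meal outside the writer's own household.",
        "quote": "We dined together at the Swan.",
    },
    "PRAY": {
        "definition": "Private or household prayer, not public worship.",
        "quote": "Spent the morning at prayer in my chamber.",
    },
}
```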
There are benefits and drawbacks to each approach. Open-ended coding leaves
your research more open to the sources themselves. It also means that you are
not pre-judging the material. The downside, however, is that it adds a hugely time-consuming step to an already mind-numbingly tedious process: you have to go back through your notes after you've read everything and impose a set of codes on notes that will inevitably be thin in places. Pre-coding is faster, and it
allows collaborative work because you are working from a clear set of
definitions. However, there is a very real risk that important phenomena will not be covered by your pre-determined coding scheme, and so will either have to be ignored, or your codebook updated and the whole process started over again. (This has happened to me. Not fun.)
Working with multiple coders
If you have a codebook handy, you can outsource your coding to others.
Qualitative researchers have developed a number of methods for gauging
inter-coder reliability. Because I have not worked with other coders myself, I can only point interested readers toward the fact that these methods exist.
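For the curious: one common measure is Cohen's kappa, which corrects raw agreement between two coders for the agreement you'd expect by chance. scikit-learn implements it; a minimal sketch, with invented codes:

```python
from sklearn.metrics import cohen_kappa_score

# The same five instances, coded independently by two coders.
coder_a = ["DINE", "PRAY", "DINE", "DINE", "PRAY"]
coder_b = ["DINE", "PRAY", "PRAY", "DINE", "PRAY"]

# 1.0 means perfect agreement; 0.0 means no better than chance.
print(cohen_kappa_score(coder_a, coder_b))
```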
Text Analysis and Content Analysis?
Inevitably, sometime during the long and frustrating process of coding, you
will ask yourself: can’t a computer do this? I am sure that many people reading
this will answer yes.
I am nervous about bringing machine learning into the content analysis of
historical materials. Machine learning approaches may be appropriate for data drawn from a single, compact time period, but over the incredibly long spans historians deal with, language changes in ways that current models struggle to account for. This does not mean the challenge cannot be met, merely that it awaits more technically adept practitioners than myself.
NEXT: Visualizing your data!