Monday, March 23, 2015

Content Analysis With Tableau 2: Data Collection

This is Part 2 of a 4 part series on content analysis and Tableau. Things will make more sense if you start with the intro.
Please note that this is a work in progress and I am more amateur than expert. I welcome questions, comments, and corrections.

You have your research question in hand, and you think a content analysis approach might be useful. What’s next?
The first step is to select which data you want to be working with—your so-called corpus. This can be a certain kind of source: diaries entries on December 25th written in the British Isles from 1688 to 1850. Alternatively, it can be a certain section of an archive: court cases resulting in a hanging. You may also use search terms to hone in on documents in an larger archive mentioning a particular word or set of words. In a previous project I found descriptions of London coffeehouses by searching online archives for material mentioning the word ‘coffee.’ You may also combine multiple archives into one project, if you are careful with how you use balance the different biases of each.
Where do you find all this text? There are many digital corpuses available, some of which have reliable text searching capability. In the British context (the one I know best) there is an embarrassment of riches—London Lives, ECCO, the Burney Collection and other fully-searchable archives are all available to most academic institutions. But you don’t need to use digital material! You can just as easily code printed books, sheet music, or paintings.
A side note: Remember to be very wary about how you use search in digital archives. It is easy to think of search as a flat mirror of the archive, but search engines have their own philosophical assumptions, often occluded under a miasma of proprietary algorithms. My tips are as follows: Search for words with fewer letters, to reduce the chances of OCR errors. (The longer your word or phrase, the greater chance that OCR will garble it up.) Avoid words with the dreaded long s that can confuse OCR. Be aware that the absence of mentions of a term over a long period of time might not be reflective of the actual practice itself, but rather a change in the use of words describing that practice.
Next, you should consider whether the corpus you’ve collected is the right size. The longer a time period you are working with, the more sources you need. If you don’t have a very large corpus (and I would prefer to have more than ten observations per year at the very least) think about expanding your corpus. But not too much. Content analysis is incredibly time-consuming, as it involves hand-coding each and every instance of the terms you are dealing with. If your selection process has given you way too much data, you can select a random sample of it. (Use a statistical significance calculator to figure out how many entries you need.) Even doing this, the whole process will likely take solid weeks of work.
Once you have selected your data collection method, download your sources and assign each document a unique ID. In my Christmas project, I have done this by date of publication, as you can see here.

If manually downloading thousands of books doesn’t appeal to you, keep in mind that some intrepid scholars have written scripts that can help you automate the process. You will likely need some familiarity with coding to take full advantage of these tools, and many are in dubious standing with archives’ Terms of Service.
Now that you have built up your little library of digital (and dead-tree) books, it’s time to jump into coding!
NEXT: Coding!

No comments: