Topic Modeling

Topic modeling reminds me a lot of pre-field work in archaeology. An archaeological team can’t look everywhere on a site. The weeks and months before any ground is broken or test pits are dug are spent mining archives, running geophysical surveys, and doing anything else that can point archaeologists in the right direction. Depending on who you ask, this portion of archaeology may be the most important. Digging is great, but if you have no educated guess as to where you’ll find artifacts or significant archaeological features, you may as well be searching for a needle in a haystack.

The digital world is much the same. A huge swath of digital records is pretty much useless unless you know what to home in on. That is why topic modeling, in all of its forms, seems so important to me. We could use wget to mine websites, Markdown to write thousands of files, or SPARQL to interpret a plethora of .xml files. But it all comes down to needing to know what you have before you work with it. And if you have too much data, there isn’t enough time in the day to go over every single piece. This is where I see topic modeling coming in: it offers a first map of what a collection contains, so you know where to look. So far, it seems like the digital tool that would be the most useful to me.

However, this week’s tutorial cautions against assuming that topic modeling lets us infer meaning directly from the topics. A computer can tell us where to look, but it cannot give us the human answers we need. The temptation to use topic modeling this way is clearly high. But just as in archaeological work, finding features together is only a first step; interpreting them in relation to one another is work the archaeologist must still do after the dig.

On a side note, this tutorial was made even better by Andrew Gelman’s “How many zombies do you know?”. It reminds me of a physicist named J.H. Hetherington, who listed his cat as second author on a 1975 paper rather than retype the paper. His cat went by the pen name F.D.C. Willard, or Felis Domesticus Chester Willard. Scientists are often willing to put themselves into their work, and they have a sense of humour. I like that, and making yourself present as a researcher/author/creator is always important in digital history. We could borrow some of their transparency in historical writing.

Written on March 7, 2016