In his poem A Sort of a Song, William Carlos Williams wrote “no ideas but in things” and “saxifrage is my flower that splits the rocks.” What I’m doing here almost certainly isn’t what he meant; in fact, I may be doing the reverse, in that I am taking a poem and its words and, in a sense, converting them back to, or at least representing them as, their component “things.” Even though it isn’t quite what Williams intended, these lines kept coming to mind as I worked on this post, and they seem related to the things in poetry I’m discussing here.
Early in this project, when we were experimenting with some of the tools and technologies we thought we might use to improve the process of identifying and tagging names in XML text, I noticed some strange output when I ran some of the poetry from the Belfast Group sheets against the DBpedia Spotlight annotation service. Because I wasn’t restricting the identified resources to persons, places, or organizations (which is what our tools usually do when we’re trying to identify names to be tagged, e.g. in the NameDropper OxygenXML plugin we’re developing), it was identifying things like “potato”, “rock”, “eye”, “mouth”, “hand”, and “root” in the text. We’re now at the point in the project where we’re starting to shift towards using the tools we’ve been developing to enhance the EAD and TEI XML associated with the Belfast Group. As I began tagging some of the poetry, I was reminded of this output and thought it might be worth a little more investigation.
For this experiment, I restricted myself to Seamus Heaney’s poem Digging, as it appears in the draft on one of the Belfast Group sheets (there are some slight wording differences from the published version).
Below are the things that DBpedia Spotlight identifies in the poem. I’m using the DBpedia thumbnails (or Wikipedia thumbnails, in the few cases where the DBpedia thumbnail image link was broken) to emphasize the “thingness” of the entities that Spotlight recognizes. Each image links to the corresponding DBpedia resource, and if you hover your mouse over the image you should see a snippet of the poem where the entity was recognized. I’ve sorted them into three groups semi-manually, since I’m still having difficulty filtering on support and similarity scores without losing useful data. In this case, very few of the identified resources seemed to have high certainty, which I suspect is due to the poetic language.
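To make the filtering problem concrete, here is a minimal sketch of a threshold filter on support and similarity scores. The field names and every score below are invented for illustration; they are not the values Spotlight returned for the poem, and the real output may be structured differently.

```python
def filter_entities(entities, min_support=0, min_similarity=0.0):
    """Keep only the entities that meet both thresholds."""
    return [
        e for e in entities
        if e["support"] >= min_support and e["similarity"] >= min_similarity
    ]


# Made-up example records (hypothetical field names and scores).
entities = [
    {"surface": "potato", "uri": "http://dbpedia.org/resource/Potato",
     "support": 1500, "similarity": 0.12},
    {"surface": "squat", "uri": "http://dbpedia.org/resource/Squatting",
     "support": 300, "similarity": 0.09},
    {"surface": "lug", "uri": "http://dbpedia.org/resource/Lugger",
     "support": 120, "similarity": 0.15},
]

# A loose filter keeps everything, right and wrong matches alike.
loose = filter_entities(entities, min_support=100, min_similarity=0.05)

# A stricter similarity cutoff drops the correct "potato" match
# while keeping the mistaken "lug" match.
strict = filter_entities(entities, min_support=100, min_similarity=0.14)

print([e["surface"] for e in loose])
print([e["surface"] for e in strict])
```

With scores this low across the board, no single cutoff separates good matches from bad ones, which is why a manual pass was still needed.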
First, the things that DBpedia Spotlight recognized accurately, in the order that they occur in the poem.
It’s sort of an odd way to read a poem, but it’s also kind of intriguing. Among other things, I think this highlights how full of actual physical items, especially body parts, the text is.
Second, a few of the resources that aren’t quite correctly matched up to the text, but are still interesting and semi-relevant.
I actually found these mis-identifications somewhat thought-provoking. To some degree, they betray the extent to which DBpedia is thing-centric, so that verbs and adjectives are mis-identified as nouns (again, with low confidence or support scores). But I find the notion of the poet’s pen “squatting” between thumb and finger, in the sense of taking up residence in an abandoned space without permission, rather appealing and fascinating. In the case of some of the other mis-identifications, it seems that Spotlight is picking up the context of digging and working outdoors, hence the mountains and archeological entities. And in the case of the lugger ship, this mis-identification actually drove me back to the text, and when I looked at “lug” in context I discovered that I didn’t actually know what it was, and had to go looking to figure out that the lug and shaft are parts of a shovel or spade.
Third, some of the mis-identified things that are humorously, obviously wrong. In this case we have actors and musicians or bands, conceptually unrelated items, and even a video game. I’m including these here partly because they make me laugh, but also to demonstrate that the technology still has limitations and we need to be careful how we apply it.
For those who are interested, here are some technical notes on how I generated this post.
Got a copy of the TEI XML for the Heaney Belfast Group sheets from the current Beck Center Belfast Group site (now available on GitHub!).
Ran the NameDropper lookup-names Python script on the TEI file, restricting it to the poem I was interested in and setting the certainty pretty low, to generate a CSV file.
lookup-names heaney1.xml -c 0.1 \
    --tei-xpath '//t:body[@xml:id="heaney1_1045"]' \
    --scores --csv /tmp/heaney-digging.csv
Wrote a simple Python script to iterate through the CSV file and generate the HTML I wanted for each item, pulling the label and thumbnail from DBpedia, and using the context pulled from the poem.
Manually sorted out the entities I wanted into the three groups, preserving order, and fixed missing thumbnails where I could (some of the DBpedia thumbnail references are invalid; I’m guessing this is because they have been updated on Wikipedia since the last time the current DBpedia data was regenerated).
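The CSV-to-HTML step above could be sketched roughly as follows. This is not the original script: the column names (`uri`, `context`), the sample rows, and the thumbnail URLs are all assumptions made up for illustration, and the DBpedia lookup is replaced by a hard-coded stand-in so the example runs without network access.

```python
import csv
import html
import io

# Hypothetical stand-in for the CSV written by lookup-names; the real
# column names and values may differ.
SAMPLE_CSV = """uri,context
http://dbpedia.org/resource/Potato,"the cold smell of potato mould"
http://dbpedia.org/resource/Thumb,"between my finger and my thumb"
"""

# Hypothetical stand-in for a DBpedia lookup: URI -> (label, thumbnail URL).
FAKE_DBPEDIA = {
    "http://dbpedia.org/resource/Potato":
        ("Potato", "https://example.org/thumbs/potato.jpg"),
    "http://dbpedia.org/resource/Thumb":
        ("Thumb", "https://example.org/thumbs/thumb.jpg"),
}


def lookup(uri):
    """Return (label, thumbnail) for a resource URI from the stand-in data."""
    return FAKE_DBPEDIA[uri]


def render_item(uri, label, thumbnail, context):
    """Build a linked thumbnail whose hover text is the poem snippet."""
    return '<a href="{}" title="{}"><img src="{}" alt="{}"/></a>'.format(
        html.escape(uri, quote=True),
        html.escape(context, quote=True),
        html.escape(thumbnail, quote=True),
        html.escape(label, quote=True),
    )


def csv_to_html(csv_file, lookup_fn):
    """Render one HTML snippet per CSV row, preserving row order."""
    snippets = []
    for row in csv.DictReader(csv_file):
        label, thumbnail = lookup_fn(row["uri"])
        snippets.append(render_item(row["uri"], label, thumbnail, row["context"]))
    return "\n".join(snippets)


page = csv_to_html(io.StringIO(SAMPLE_CSV), lookup)
print(page)
```

Passing the lookup in as a function keeps the rendering separate from wherever the label and thumbnail come from, so a real DBpedia query could be swapped in, and broken thumbnail links could be patched by hand as described above.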
Background image: Photo by cotinis on Flickr