Detecting Linguistic Signal in Cather’s Early Journalism: Polishing the Bibliography

Background and History
Willa Cather, a novelist whose works include My Ántonia and O Pioneers!, served her literary apprenticeship writing drama, music, and book reviews for literary magazines and small-town newspapers between 1893 and 1903. Most of these articles were unsigned or pseudonymous, and they were usually attributed to Cather on the basis of circumstance and opportunity, reference, and style – something William Curtain failed to define beyond “the style and swing [being] unmistakably Cather’s” (Curtain 972). Authorship attribution is hardly a new field for digital humanities. David Hoover, Matthew Jockers, Hugh Craig, Patrick Juola and others have done work on literary authorship attribution in which they examined and refined the different approaches (e.g. function words and lexical entropy) to the authorial question. Our project focuses on a small corpus of short texts: we attempt to verify the attribution of Willa Cather’s early unsigned journalism from 1893 to 1903.

Early efforts to create a bibliography of all of Cather’s writing, including documentation of her journalism, were complicated by Cather’s pseudonymous and anonymous publications. In 1950, John Hinz created an initial list of Cather’s pseudonyms, including Mary K. Hawley, Clara Wood Shipman, and Helen Delay, her most famous pseudonym (Hinz 201). In 1966, Bernice Slote published the first definitive bibliography of Cather’s early nonfiction writings in The Kingdom of Art, where she uncovered 44 previously unattributed columns she believed Cather wrote for the Nebraska State Journal between 1895 and 1896. In 1970, William Curtain compiled The World and the Parish, a two-volume bibliography of Cather’s journalistic writings. JoAnna Lathrop followed up with a 1975 checklist. Joan Crane’s 1982 bibliography of Cather’s work is a departure point for much recent scholarship, including the online Willa Cather Archive (cather.unl.edu). Tim Bintrim’s 2004 dissertation attributed 19 additional pseudonymous articles to Cather during her Pittsburgh years and successfully challenged at least two pseudonyms proposed by Hinz. In 2013, Kari Ronning of the University of Nebraska-Lincoln and Robert Thacker of St. Lawrence University cleared up confusion about a purported Cather pseudonym, “Clara Wood Shipman.” It is not a Cather pseudonym, something Slote had hinted at in a footnote in The Kingdom of Art (Slote 28).

In this research, we investigate Hinz’s original attributions as well as those suggested by the scholars who came after him. We apply the tools and techniques of authorship attribution and stylometrics to assess the extent to which unsigned works and works attributed to Cather possess a stylistic and linguistic voice that is consistent with her known works.

Methodology
We approached the initial problem of authorship attribution first by breaking down the content of each of Cather’s journalistic articles to the word level. Our sample consisted of 158 texts. Of these, 126 texts were made available through the Willa Cather Archive. The remaining 32 texts were articles written by Fanny Fern (Sara Payson Willis), which were used as a control group in order to establish a clear stylistic signal that was not Cather’s. We chose Fern’s journalistic work not only because it was available in digital form, but also because it constitutes a body of work that is comparable to Cather’s in terms of genre, size, number of works, and available metadata. At the same time, the Fern material is distinct enough chronologically and geographically to avoid any artistic overlap.

Our combined corpus consisted of 216,506 words, or tokens: 192,062 belonging to the Cather corpus and 24,444 belonging to the Fern corpus. The identification of unique token types resulted in 20,868 types: 15,656 appearing only in Cather’s corpus, 5,212 appearing only in Fern’s corpus, and 3,591 shared between the two. Our algorithm then used simple relative frequency (each word’s count divided by the length of the text in which it appears) in order to account for differences in article length. The resulting plots are based on the occurrence of the most frequent words in the corpora, such as “the,” “a,” and “of,” above an arbitrary threshold. Instead of imposing structural rules, we practiced unsupervised clustering by allowing the algorithm to group the works itself based on the frequency of these common words. This produced hierarchically clustered dendrograms, which provide a visual representation of the groupings. At first, we seemed to detect a clear linguistic signal for Cather. When we introduced the control set, Fern’s works also tended to cluster together in our initial dendrograms (see figure 1).
Fig. 1: Example of one of the original, first-run results dendrograms (here, run with a threshold factor of 0.89).

After closely reading the pieces, we discovered that there was a substantial amount of quoted material from other authors (such as William Shakespeare and Rudyard Kipling), which might have skewed the accuracy of our approach. In order to produce the most reliable results, we manually stripped the quoted material from the files.
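
The frequency-and-clustering procedure described in this section can be sketched in a few lines of Python. This is an illustrative sketch only, not the code used in the project: the toy texts, the function names, the `min_mean_freq` cutoff (which is not necessarily the "threshold factor" reported in the figure captions), and the choice of Euclidean distance with average linkage are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of the text's tokens (lowercased)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {word: n / len(tokens) for word, n in counts.items()}


def frequent_word_profiles(texts, min_mean_freq):
    """Keep words whose mean relative frequency across all texts meets the cutoff."""
    freqs = {name: relative_frequencies(body) for name, body in texts.items()}
    vocab = set().union(*freqs.values())
    kept = sorted(
        w for w in vocab
        if sum(f.get(w, 0.0) for f in freqs.values()) / len(freqs) >= min_mean_freq
    )
    profiles = {name: [f.get(w, 0.0) for w in kept] for name, f in freqs.items()}
    return profiles, kept


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(profiles):
    """Agglomerative clustering; returns the merges as (left, right, distance)."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Invented toy texts standing in for two authors ("a" and "b"); not corpus data.
texts = {
    "a1": "the sea and the sky and the wind of the north",
    "a2": "the hill and the road and the light of the town",
    "b1": "i will not go but i will not stay",
    "b2": "i can not see but i can not hear",
}

profiles, kept = frequent_word_profiles(texts, min_mean_freq=0.05)
merges = average_linkage(profiles)
for left, right, distance in merges:
    print(left, "+", right, round(distance, 3))
```

The list of merges is the information a dendrogram draws: texts joined early, at small distances, sit on neighboring branches. In this toy case the content words fall below the cutoff, so the grouping rests on common words alone, which is the premise of the method described above.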

Observations
After carefully examining the dendrograms from our analysis, we have evidence that the works analyzed were correctly attributed to Cather, but we cannot be sure that we detected her style or signal.[2] Additionally, we found that the third-party quotations in Cather’s work had no significant effect on the clustering of works. Contrary to our expectations and conventional wisdom, the same works consistently clustered together before and after the extraction of quoted material. No matter which frequency threshold we used to produce the hierarchically clustered models, the results (see figures 2 and 3) did not offer a solid pattern for determining Cather’s early journalistic style or for identifying possible misattributions.
Fig. 2: One of the final dendrograms (here, run with a threshold factor of 0.55).
Fig. 3: One of the final dendrograms (here, run with a threshold factor of 0.96).

However, this computational method does corroborate the attributions previously made through traditional scholarship. The clusters formed by signed or consistently attributed pieces support the view that such articles were likely written by Willa Cather. Our control set of Fanny Fern’s work generally clustered together as a set, indicating that our analysis distinguished reliably between the two authors.
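
One way to put a number on how consistently an author's texts "cluster together" is to cut the dendrogram into flat groups and ask what share of that author's texts land in the same group. The sketch below is a hypothetical illustration, not a measure used in this project; the function name `majority_share`, the author labels, and the cluster assignments are all invented for the example.

```python
from collections import Counter


def majority_share(cluster_ids, authors, author):
    """Share of one author's texts that fall in that author's most common cluster."""
    own = [c for c, a in zip(cluster_ids, authors) if a == author]
    return Counter(own).most_common(1)[0][1] / len(own)


# Invented labels for six texts after cutting a dendrogram into two groups.
authors = ["A", "A", "A", "A", "B", "B"]
cluster_ids = [0, 0, 0, 1, 1, 1]

for name in ("A", "B"):
    print(name, majority_share(cluster_ids, authors, name))
```

A share of 1.0 means every text by that author sits in a single group; lower values indicate that the author's texts are scattered across branches.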

Conclusions and Further Research
Our initial efforts at detecting a Cather signal to dispute the attribution of anonymous and pseudonymous texts were not successful. While our results are firm enough to support the attributions of Curtain, Bintrim, Slote, and Crane, they are not reliable enough to dispute any dubious attributions. Burrows, Jockers, Hoover, and Craig suggest that machine reading of texts and unsupervised clustering of shorter texts and smaller corpora are still better than chance. The question is, how much better?

The context of our work seemed to be appropriate for quantitative methods. We set out to allow a machine to do that for which few people have much patience: counting the frequency of function words. As Hoover suggested, “Only when external evidence fails is it reasonable to apply quantitative methods, and the presence or absence of a closed set of possible authors and differences in the size and number of documents available for analysis are usually more significant than the kind of text involved.” The external evidence seemed compelling: Cather’s output as a young journalist was copious, and scholars using traditional methods were able to identify more than three pseudonyms incorrectly attributed to her. We analyzed a closed set of authors, Willa Cather and Fanny Fern, and we expected to find more attribution errors, given the tenor of early Cather scholarship, which often smacked of hagiography. What we were lacking was a larger corpus.[3] This project indicates the need to digitize more of Cather’s early work in order to further clarify the historic bibliography using computational stylometry.

Bibliography
Bintrim, Timothy W. (2004). Recovering the Extra-Literary: The Pittsburgh Writings of Willa Cather. Diss. Duquesne University. Print.
Burrows, J.F. (2004). “Textual Analysis.” A Companion to Digital Humanities. Ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell. Web.
Craig, Hugh (2004). “Stylistic Analysis and Authorship Studies.” A Companion to Digital Humanities. Ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell. Web.
Crane, Joan (1982). Willa Cather: A Bibliography. Lincoln, NE: U of Nebraska P. Print.
Curtain, William (1970). The World and the Parish: Willa Cather’s Articles and Reviews, 1893-1902. Lincoln, NE: U of Nebraska P. Print.
Fanny Fern in the New York Ledger (2013). Center for Digital Research in the Humanities at the University of Nebraska-Lincoln. Web. 9 Dec. 2013.
Hinz, John P. (1950). “Willa Cather in Pittsburgh.” The New Colophon. New York: Duschnes Crawford Inc. Print.
Hoover, David L. (2004). “Quantitative Analysis and Literary Studies.” A Companion to Digital Literary Studies. Ed. Susan Schreibman and Ray Siemens. Oxford: Blackwell. Web.
Jewell, Andrew, and Janis Stout (2013). The Selected Letters of Willa Cather. New York: Knopf. Print.
Jewell, Andrew, ed. (2004-2013). The Willa Cather Archive. U of Nebraska-Lincoln. Web. 9 Dec. 2013.
Jockers, Matthew L., and Daniela M. Witten (2010). “A Comparative Study of Machine Learning Methods for Authorship Attribution.” Literary and Linguistic Computing 25.2: 215-224. Web. 9 Dec. 2013.
Jockers, Matthew L., Daniela M. Witten, and Craig S. Criddle (2008). “Reassessing Authorship of the Book of Mormon Using Delta and Nearest Shrunken Centroid Classification.” Literary and Linguistic Computing 23.4: 465-492. Web. 9 Dec. 2013.
Jockers, Matthew L. (2013). “Testing Authorship in the Personal Writings of Joseph Smith Using NSC Classification.” Literary and Linguistic Computing 28.3: 371-381. Web. 9 Dec. 2013.
Jockers, Matthew (2013). Text Analysis with R for Students of Literature. 2 Sept. TS.
Juola, Patrick (2006). “Authorship Attribution.” Foundations and Trends in Information Retrieval 1.3: 233-334. Web. 1 Dec. 2013.
Lathrop, JoAnna (1975). Willa Cather: A Checklist of her Published Writing. Lincoln, NE: U of Nebraska P. Print.
Ronning, Kari (2013). “Re: theatres in Lincoln.” Message to the author. 9 Dec. 2013. E-mail.
Slote, Bernice (1966). The Kingdom of Art. Lincoln, NE: U of Nebraska P. Print.
Van Dalen-Oskam, Karina (2013). “Epistolary Voices: The Case of Elisabeth Wolff and Agatha Deken.” Digital Humanities 2013 Lincoln NE. Eds. Katherine Walker and Kenneth Price. Lincoln, NE: Center for Digital Research in the Humanities. Web. 9 Dec. 2013.

Notes

1. This research began in a class taught by Matthew Jockers and has continued under his direction as a project of the newly formed Nebraska Literary Lab at the University of Nebraska-Lincoln.

2. Had we found Cather’s signal, it would best be described as a distribution of works that cluster together, in which works that share similar word frequencies are taken to share a style.

3. According to Andrew Jewell, editor of the online Cather Archive, there are a total of approximately 600 pieces of journalism written by Cather. The Cather Archive has digital scans of about 400 of those. About 230 are transcribed in some fashion, and about 206 are in some kind of XML format.