Can Statistics Reveal the Secrets of Great Writing?


In an article published by Smithsonian Magazine, Megan Gambino interviews data journalist Ben Blatt on his recent efforts to apply data analysis to literary works.

Here are the opening paragraphs of Gambino’s article, which frame the interview:

In most college-level literature courses, you find students dissecting small portions of literary classics: Shakespeare’s soliloquies, Joyce’s stream of consciousness and Hemingway’s staccato sentences. No doubt, there is so much that can be learned about a writer, his or her craft and a story’s meaning by this type of close reading.

But Ben Blatt makes a strong argument for another approach. By focusing on certain sentences and paragraphs, he posits in his new book, Nabokov’s Favorite Word is Mauve, readers are neglecting all of the other words, which, in an average-length novel amount to tens of thousands of data points.

The journalist and statistician created a database of the text from a smattering of 20th century classics and bestsellers to quantitatively answer a number of questions of interest. His analysis revealed some quirky patterns that might otherwise go unnoticed:

By the numbers, the best opening sentences to novels do tend to be short. Prolific author James Patterson averages 160 clichés per 100,000 words (that’s 115 more than the revered Jane Austen), and Vladimir Nabokov used the word mauve 44 times more often than the average writer in the past two centuries. [Smithsonian] talked with Blatt about his method, some of his key findings and why big data is important to the study of literature.

As the following two charts included with the interview demonstrate, Blatt’s findings are certainly of interest:

Clearly, data analysis can be a valuable tool for literary critics—a sort of advancement on the concordances that scholars once spent decades compiling but that can now be digitally compiled in a matter of hours or days.
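To give a rough sense of how little effort a digital concordance now takes, here is a minimal Python sketch. It is not Blatt’s method, which the interview does not spell out; the function names, the simple tokenizing rule, and the sample sentence are all invented here for illustration. It counts how often a word appears per 100,000 words and lists each occurrence with a few words of context on either side.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    Assumption: a "word" is a run of ASCII letters and straight apostrophes.
    """
    return re.findall(r"[a-z']+", text.lower())


def rate_per_100k(tokens, word):
    """Return the number of occurrences of `word` per 100,000 tokens."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] / len(tokens) * 100_000


def concordance(tokens, word, width=3):
    """List every occurrence of `word` with `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, not drawn from any novel.
sample = "The mauve sky faded. She wore mauve, and the mauve light lingered."
tokens = tokenize(sample)

print(rate_per_100k(tokens, "mauve"))
for left, tok, right in concordance(tokens, "mauve", width=2):
    print(f"{left:>12} [{tok}] {right}")
```

Run over the full text of a novel rather than a single sentence, the same few functions would produce the kind of per-100,000-word figures quoted above, though a serious study would need more careful handling of punctuation, spelling variants, and multi-word phrases such as clichés.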

But I don’t believe that data analysis is going to provide a substitute for literary analysis anytime soon.

I have been re-reading Richard Powers’ novels and reading the two books of criticism on his work. In each of his novels, Powers has explored multiple areas of science and technology within a fictional framework that ultimately becomes metafictional. The linkages and disjunctions among all of these narrative elements are multi-layered and often extremely nuanced. In the best of the criticism of Powers’ work, the relationships between these elements and layers are delineated with meticulous attention to detail, located deftly within the broader sources on which Powers is drawing, and elucidated with great precision and insight, with perceptions informed by an awareness of language that is matched by a facility with language approaching Powers’ own. Here is just one illustration:

In the decades during which the public imagination has assimilated the premises and promise of genetics, we have come to accept certain tropes about the way DNA works: that it is coded and translated like a language; that gene mutations function very much like the noise in an information system; that successive generations are like messages sent forward to the future—a process partly enabled by novelists who have drawn on sciences like genetics not only for subject matter, but for the metaphors and cognitive structures that inform their own narratives. Richard Powers is a good example of this development since he builds his 1991 novel The Gold Bug Variations around the DNA double-helix. Like Thomas Pynchon, whose fictions brought the notion (and process) of entropy into narrative, Powers makes genetics both the foundation for and the explanation of the events that take place in the novel—and then shapes the narrative to mirror the concept. . . .

But while, as critics have argued, a novel like Gold Bug is a system through which knowledge circulates, we must not be surprised to find the novel deeply concerned not only with those things that foster such circulation, but also with those that impede circulation or misdirect the flow of information. In addition to his attention to the delicate information system that structures all life, Powers addresses the many systems of communication that structure human interaction. Threaded throughout the dissertation on genetics contained in the novel, we find the record of attempts to send a variety of messages through a series of constrained and limited channels. Like motivated DNA, characters encode information, send instructions about its transport and delivery, and attempt to decode and act upon the information they receive. They struggle to minimize the mutations caused by noise in the system. And they deploy a peculiar postal rhetoric as they try to manage the chaos of competing interests between friends and lovers.


Those paragraphs are from the beginning of the essay “The Rhetoric of the Genetic Postcard: Writing and Reading in The Gold Bug Variations,” contributed by Patti White to the collection Intersections: Essays on Richard Powers. [Eds. Stephen J. Burn and Peter Dempsey. Champaign, IL: Dalkey Archive P, 2008. 90-104.]

Megan Gambino’s complete article–the interview with Ben Blatt–is available at:


