Natural Language Processing and Literary Style

How Can one Determine the Style of these Authors?

Literary Relevance of the Question

We began by examining the writing style of the authors in the two corpora. Before analysing their ideas, we wanted to understand how they structured their writing. Although the notion of “style” is difficult to define, it lies at the very core of critical practice, and it is just as central to the current debate between print reviews and digital reviews.

While newspaper critics blame bloggers for a lack of seriousness, bloggers claim the right to write differently. On its “About Exeunt” page, the magazine neatly summarises the stylistic stakes of these new forms of digital criticism:

Exeunt believes in making beautifully written, experimental, fierce and longform writing about theatre available for free.

However, for Michael Billington, a long-standing critic at The Guardian, a blog is more an “informal letter” than a true critique. Danielle Tarento, director of the Menier Chocolate Factory theatre in London, states loud and clear that bloggers are not “genuine writers”:

They do not have the intellectual background or historical background or time to know what they are writing about.

Which Technical Tools Shall we use to Answer this Question?

To answer this question, we drew on work in computational linguistics. This field of research borrows from computer science, linguistics and statistics. Among other things, it makes it possible to model natural language phenomena using logical approaches. We took as our basis the work of D. I. Holmes, who defined writing style as a set of measurable variables that together help establish an author’s “fingerprint.”

This first line of research aimed to measure a series of simple stylistic features in both corpora and to compare them: the number of words and sentences per review, the most frequent nouns, verbs and adjectives, the distribution of sentence types, the use of punctuation, and so on.
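As an illustration of this kind of measurement, here is a minimal sketch in Python that counts words, sentences and punctuation marks in a single review. The function name `style_features`, the regular expressions and the sample text are assumptions made for the example; they are not the project's actual code, and the sentence splitting is deliberately naive.

```python
import re
from collections import Counter


def style_features(text):
    """Return a few simple stylistic measurements for one review."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Words: runs of letters, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    # Count only a small, fixed set of punctuation marks.
    punctuation = Counter(ch for ch in text if ch in ".,;:!?")
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "punctuation": dict(punctuation),
    }


# Made-up sample text, used only to show the shape of the output.
sample = "The show was bold. Was it perfect? No, but I loved it!"
features = style_features(sample)
print(features)
```

Running the same function over every review in each corpus and averaging the values would give one comparable profile per corpus. Counting nouns, verbs and adjectives would additionally require a part-of-speech tagger, which this sketch leaves out.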


What Can we Conclude so far?

So far, two important points have been highlighted thanks to these first experiments:

1. The five most frequent nouns suggest that the two groups focus on different topics. The two most frequent words largely overlap between the corpora (“Production” and “Show” in the first corpus, “Theatre” and “Show” in the second). The third most frequent word among theatre critics sheds more light on what interests them. The word “Stage” suggests that these critics concentrate on what is happening on stage, or at least on the show itself, on what takes place in front of them. In the second corpus, the third most frequent word stays close to the first two (“Theatre”, “Show”, “Production”). The fourth, however, is more revealing. “Audience” suggests that digital reviewers attend first to what is happening around them, among the members of the audience, rather than to what takes place in front of them on stage. Could these be two different ways of experiencing theatre? The theatre critics’ experience would be more rational, focused on analysing the show, while the bloggers’ would be more emotional, focused on human reactions, and in particular on those of the audience.

2. The distribution of personal pronouns across the two corpora tends to support this hypothesis. The second graph shows the largest percentage differences between the two datasets. While the first-person singular accounts for 10% of all personal pronouns in the first corpus, its share doubles in the second corpus (20%). In other words, bloggers use the pronoun “I” proportionally twice as often in their reviews. Does this mean that a more subjective opinion is accepted in digital reviews?
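The pronoun percentages discussed in the second point can be obtained by counting each pronoun group and dividing by the total number of personal pronouns. The sketch below shows one way to do this; the grouping in `PRONOUN_GROUPS`, the function name `pronoun_shares` and the two sample sentences are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical grouping of English personal pronouns by person and number.
# "her" can also be a possessive determiner; this sketch does not disambiguate.
PRONOUN_GROUPS = {
    "1st singular": {"i", "me"},
    "1st plural": {"we", "us"},
    "2nd": {"you"},
    "3rd singular": {"he", "him", "she", "her", "it"},
    "3rd plural": {"they", "them"},
}


def pronoun_shares(texts):
    """Return each pronoun group's share (in %) of all personal pronouns."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            for group, forms in PRONOUN_GROUPS.items():
                if token in forms:
                    counts[group] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: 100 * counts[group] / total for group in PRONOUN_GROUPS}


# Made-up sentences standing in for a corpus of reviews.
toy_corpus = ["I think they liked it.", "We saw it and she thanked me."]
shares = pronoun_shares(toy_corpus)
for group, share in shares.items():
    print(f"{group}: {share:.1f}%")
```

Applying `pronoun_shares` to each corpus separately yields two distributions that can be compared group by group.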

Machine Learning: Study of Critiques’ and Reviews’ Structures

How do Journalists and Bloggers Organise their Arguments within a Review?

Literary Relevance of the Question

The second experiment focused on the structure of critiques and reviews. We wanted to understand how arguments were laid out in the two corpora. To do so, we drew on How to Write About Theatre (2015) by Mark Fisher, a theatre critic for The Guardian, which describes the different steps of a critique (introduction, plot summary, etc.). We then spent several hours studying the two corpora in order to better understand the topics these authors address in their work. Here are the categories we found and the labels assigned to them:

After that, we manually annotated 1000 critiques from the first corpus according to those labels. This step consisted of randomly selecting a critique and changing the colour of each passage according to the category it belonged to. Here is an example taken from a review of Sam Shepard’s play A Lie of the Mind (1985), which was staged in May 2017 at the Southwark Playhouse Theatre in London. The review was written on 11 May 2017 by Fergus Morgan, a theatre critic for The Stage newspaper.

Initial Review

Labelled Review

Which Technical Tools Shall we use to Answer this Question?

We used machine learning, with the scikit-learn library in this case, to train an algorithm to recognise those categories in texts that had not been annotated. The software still needs to be improved, and the results shown here lack precision. However, they already reveal some patterns.
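To make the approach concrete, here is a minimal sketch of a text classifier built with scikit-learn. The choice of TF-IDF features with logistic regression is an assumption, since the model actually used is not specified here, and the training passages are invented for the example. Only the category names come from the labels described in this study.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented passages standing in for the manually annotated critiques.
train_texts = [
    "A family gathers after a violent incident tears two households apart.",
    "The story follows a husband who flees after attacking his wife.",
    "The lead actor delivers a raw and fearless performance.",
    "Her acting is subtle, and the whole cast performs with conviction.",
    "The audience laughed nervously and a few people left at the interval.",
    "Spectators around me gasped, then applauded for several minutes.",
]
train_labels = [
    "Plot",
    "Plot",
    "Performance of the Actors",
    "Performance of the Actors",
    "Remarks on the Audience",
    "Remarks on the Audience",
]

# Assumed model: TF-IDF word weights fed into a logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)

# Predict a category for a passage that was not annotated.
unlabelled = ["The cast performs the final scene with great intensity."]
predicted = model.predict(unlabelled)
print(list(zip(unlabelled, predicted)))
```

With only a handful of training passages the predictions are not reliable; the sketch is meant to show the workflow of fitting on annotated text and predicting on unannotated text, not to reproduce the reported results.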


What Can we Conclude so far?

Although the debate in the artistic field highlights only the differences between journalists and bloggers, these experiments suggest that there are also similarities between the two communities. Each of the eight categories we identified is represented in both datasets, which suggests that journalists and bloggers use similar kinds of argument.

There are, however, significant differences. When we take a closer look at the percentage of each category within the two corpora, we can see that bloggers tend to focus on affect-related categories. The “Visual and Auditory Details”, the “Performance of the Actors” and the “Remarks on the Audience” are all elements that bring the subjectivity of the critic to the fore.

In the first corpus, by contrast, that is, in journalistic criticism, the highest values belong to categories linked to factual arguments. The “Analyses”, the “Plot” and the remarks on the “Structure of the play” rely on rational analysis.

Could we then see two ways of approaching the stage? One factual, the other more emotional?
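The comparison of category percentages described in this section amounts to counting the labels in each corpus and normalising by the total. The sketch below shows that computation; the function name `category_shares` and the two short label lists are made up for illustration and do not reflect the real figures.

```python
from collections import Counter


def category_shares(labels):
    """Return each category's share (in %) of all labelled passages."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {category: 100 * n / total for category, n in counts.items()}


# Made-up label sequences, one label per annotated or predicted passage.
journalist_labels = ["Plot", "Analyses", "Analyses", "Performance of the Actors"]
blogger_labels = [
    "Remarks on the Audience",
    "Performance of the Actors",
    "Performance of the Actors",
    "Plot",
]

journalist_shares = category_shares(journalist_labels)
blogger_shares = category_shares(blogger_labels)

# Positive differences mean the category is more frequent among bloggers.
for category in sorted(set(journalist_shares) | set(blogger_shares)):
    diff = blogger_shares.get(category, 0.0) - journalist_shares.get(category, 0.0)
    print(f"{category}: {diff:+.1f} points")
```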

Sentiment Analysis