Research field(s): Computational Stylistics, Natural Language Processing, Data Mining
Thesis title: On Computational Stylistics: Mining Literary Texts for the Extraction of Characterizing Stylistic Patterns
Degree: PhD in Computer Science
Supervisor(s): Prof. Jean-Gabriel Ganascia
Where did you undertake your research?
I carried out my PhD at the Computer Science laboratory of Paris-6 University, one of the leading French universities, located in the heart of Paris. I worked under the guidance of Prof. Jean-Gabriel Ganascia as a member of the ACASA (Cognitive Agents and Symbolic Machine Learning) team. My thesis is part of the effort started in the OBVIL (Observatory of Literary Life) laboratory. This observatory aims to develop and exploit the resources offered by computer applications to examine French literature. It promotes scientific research in the digital humanities by bringing together researchers from literary and social disciplines on the one hand, and computer scientists and engineers on the other.
What is your thesis about?
My thesis belongs to the relatively long tradition of computational stylistics, namely the application of statistical methods to the study of literary style. My research therefore lay broadly in the areas of computational stylistics and text mining. More specifically, I worked on modelling and developing sequential data mining techniques for extracting relevant syntactic patterns that characterize an author’s writing style.
How did you get into this research?
Before joining the Computer Science laboratory of Paris-6 University, I completed a Master’s degree in Artificial Intelligence at the University of Grenoble-1, France, where I discovered the exciting research field of natural language processing (NLP). This is precisely why I wrote my Master’s thesis in the university’s NLP group. My growing interest in this field eventually led me to pursue a PhD on a closely related topic.
Why is it important?
As a text producer, an author must make many linguistic decisions for the text to be well formed. These decisions are made at different linguistic levels. Since language is a highly regulated and complex phenomenon, we can reasonably assume that these decisions are not made at random, but are chosen in a specific and deliberate manner that embeds additional information in the text. Together, propositional content and stylistic effects end up characterizing not only a single piece of written text but also a manner of writing in general. Among many other factors, these elements give style an essential role in the content and meaning the author conveys to the reader.
Being able to automatically capture those stylistic effects as patterns can not only give us a better understanding of the investigated texts, but also open the door to many interesting applications such as style-based text generation and authorship identification.
What is your contribution to your field of research?
Historically, most of the work done in computational stylistics focused on lexical aspects, especially in the early decades of the discipline. In our thesis, however, we tackled a linguistic level other than the lexicon. We focused on the syntactic aspect of style, which is considerably harder to capture and analyse given its abstract nature.
We worked on an approach to the computational stylistic study of classic French literary texts based on a hermeneutic point of view, in which interesting linguistic patterns are discovered without any prior knowledge. More concretely, we focused on developing and extracting complex yet computationally feasible stylistic features that are linguistically motivated, namely morpho-syntactic patterns. We argued that computational stylistic methods need to be grounded in the hermeneutic, unsupervised paradigm rather than in the classification-based one.
Following this line of thought, we proposed a knowledge discovery process for stylistic characterization that emphasizes the syntactic dimension of style by extracting relevant patterns from a given text (see Figure 1). The process consists of two main steps: a sequential data mining step, followed by the application of interestingness measures. We proposed and evaluated three interestingness measures, each based on a different theoretical linguistic background, and reported results for all three.
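To make the two-step process more concrete, here is a minimal Python sketch under strong simplifying assumptions. It is not the method developed in the thesis: the mining step is reduced to counting contiguous part-of-speech n-grams, and the interestingness measure is a stand-in (a smoothed frequency ratio against a reference corpus), since the three measures used in the thesis are not detailed here. The function names, the tag labels and the toy sentences are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def mine_patterns(sentences: List[List[str]], min_len: int = 2,
                  max_len: int = 3, min_support: int = 2) -> Dict[Pattern, int]:
    """Step 1 (simplified mining): collect contiguous POS-tag n-grams.

    Support is the number of sentences containing the pattern at least once.
    """
    support: Counter = Counter()
    for tags in sentences:
        seen = set()  # count each pattern once per sentence
        for n in range(min_len, max_len + 1):
            for i in range(len(tags) - n + 1):
                seen.add(tuple(tags[i:i + n]))
        support.update(seen)
    return {p: s for p, s in support.items() if s >= min_support}


def rank_by_ratio(target: Dict[Pattern, int], reference: Dict[Pattern, int],
                  n_target: int, n_reference: int,
                  smoothing: float = 1.0) -> List[Tuple[Pattern, float]]:
    """Step 2 (stand-in measure): smoothed ratio of sentence-level rates.

    A score above 1 means the pattern is relatively more frequent in the
    target text than in the reference text.
    """
    scores = {}
    for pattern, count in target.items():
        t_rate = (count + smoothing) / (n_target + smoothing)
        r_rate = (reference.get(pattern, 0) + smoothing) / (n_reference + smoothing)
        scores[pattern] = t_rate / r_rate
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy, hand-tagged sentences (hypothetical data, not from the thesis corpus).
author = [["DET", "ADJ", "NOUN", "VERB"],
          ["DET", "ADJ", "NOUN"],
          ["PRON", "VERB", "DET", "ADJ", "NOUN"]]
others = [["DET", "NOUN", "VERB"],
          ["PRON", "VERB", "DET", "NOUN"],
          ["DET", "ADJ", "NOUN", "VERB"]]

frequent = mine_patterns(author, min_support=2)
background = mine_patterns(others, min_support=1)
for pattern, score in rank_by_ratio(frequent, background, len(author), len(others)):
    print(" ".join(pattern), round(score, 2))
```

In a real setting the tag sequences would come from a morpho-syntactic tagger, and the mining step would typically allow richer patterns than contiguous n-grams; the sketch only shows how a frequency threshold and a separate ranking measure fit together.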
The experimental results indicate that the presented techniques are fairly effective in extracting interesting syntactic patterns. This seems particularly promising for a computer-assisted literary analysis tool that supports linguists and literary researchers in their critical analysis, especially given the unsupervised nature of the process.