Analysis of Text: Tolstoy’s War and Peace
In this post, we will look at an example of analyzing text data: the text of Tolstoy’s War and Peace. You can download a plain text file from http://www.gutenberg.org/ebooks/2600. Of course, there are many other possible approaches, so please feel free to come up with your own analysis. The following code is written in Python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
[1] Load the book and parse the text into a structured format (Python Dictionary)
with open('War_and_Peace.txt', 'r') as f:
    book_txt = f.read().split('\n\n')
The raw text is split on blank lines into the list book_txt. Each element of the list is a title, a header, a paragraph, etc. The following are examples:
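As a minimal illustration (on a made-up snippet, not the actual book text), splitting on '\n\n' separates blocks while single newlines stay inside a paragraph:

```python
# Toy snippet (not from the book) illustrating the '\n\n' split
raw = "TITLE\n\nFirst paragraph,\nwrapped over two lines.\n\nSecond paragraph."
segments = raw.split('\n\n')
print(segments)
# Each element is one title/paragraph; line wraps ('\n') remain inside it
```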
book_txt[405: 413]
[1.1] Separate the body from the header and the footer
def separate_sections(book_txt):
    section = "header"
    header_body_footer = {'header': [], 'body': [], 'footer': []}
    for seg in book_txt:
        seg = seg.replace('\n', ' ')  # Remove '\n' inside paragraphs
        if seg == " BOOK ONE: 1805":
            section = "body"
        elif seg == " End of the Project Gutenberg EBook of War and Peace, by Leo Tolstoy":
            section = "footer"
        header_body_footer[section].append(seg.lstrip())
    return header_body_footer
This function splits the list book_txt into header, body, and footer sections of the dictionary header_body_footer. header_body_footer['header'] holds the header including the table of contents, header_body_footer['body'] holds the body, and header_body_footer['footer'] holds the footer.
[1.2] Initialize the data structure
def init_data(header):
    data = {}
    book_num = None
    chapter_idx = 0
    for seg in header:
        if "BOOK " in seg and ":" in seg:
            book_num = seg
            data[book_num] = {}
            chapter_idx = 0
        elif "CHAPTER" in seg and book_num is not None:
            chapter_idx += 1
            data[book_num]["C" + str(chapter_idx)] = {}
        elif "EPILOGUE" in seg:
            break
    return data
This function is to initialize the data structure (dictionary) with the index of books and chapters. The initial structure looks like {‘BOOK ONE: 1805’: {‘C1’: {}, ‘C2’: {}, … , ‘C28’: {}}, ‘BOOK TWO: 1805’: {‘C1’: {}, …}, …}. ‘C’ indicates a chapter and the following number is a chapter number. The epilogues are ignored.
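The behavior can be checked on a toy table of contents. The heading strings below are made up to match the patterns the function looks for, and the demo is a minimal re-run of the same logic:

```python
# Minimal re-run of the init_data logic on a toy header list
def init_data_demo(header):
    data = {}
    book_num = None
    chapter_idx = 0
    for seg in header:
        if "BOOK " in seg and ":" in seg:
            book_num = seg
            data[book_num] = {}
            chapter_idx = 0
        elif "CHAPTER" in seg and book_num is not None:
            chapter_idx += 1
            data[book_num]["C" + str(chapter_idx)] = {}
        elif "EPILOGUE" in seg:
            break
    return data

toy_header = ["BOOK ONE: 1805", "CHAPTER I", "CHAPTER II",
              "BOOK TWO: 1805", "CHAPTER I", "EPILOGUE"]
print(init_data_demo(toy_header))
# {'BOOK ONE: 1805': {'C1': {}, 'C2': {}}, 'BOOK TWO: 1805': {'C1': {}}}
```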
[1.3] Functions for text preprocessing
def to_lowercase(words):
    new_words = []
    for word in words:
        new_words.append(word.lower())
    return new_words
This function converts every word in a list to lowercase.
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

def remove_stopwords(words):
    new_words = []
    for word in words:
        if word not in stopWords:  # Use the precomputed set instead of re-reading the corpus on every word
            new_words.append(word)
    return new_words
This function removes stop words.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize(words):
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words
This lemmatization maps each word to its canonical form based on the word’s lemma. Since pos='v' is passed, words are treated as verbs (e.g., ran -> run); an example like better -> good would require pos='a'.
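Taken together, the three preprocessing steps can be sketched without downloading the NLTK corpora. The tiny stop-word set and the trailing-'s' rule below are crude stand-ins for stopwords.words('english') and WordNetLemmatizer, not the real behavior:

```python
# Stand-in pipeline: lowercase -> drop stop words -> crude "lemmatize"
TOY_STOPWORDS = {'the', 'a', 'and', 'was'}  # stand-in for the NLTK stop-word list

def toy_preprocess(words):
    words = [w.lower() for w in words]
    words = [w for w in words if w not in TOY_STOPWORDS]
    # crude stand-in for lemmatization: strip a trailing 's'
    return [w[:-1] if w.endswith('s') else w for w in words]

print(toy_preprocess(['The', 'guests', 'Was', 'arriving']))
# ['guest', 'arriving']
```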
[1.4] Parser
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') # This is to remove special characters
def parser(body, data):
    book_num = None
    ch_idx = 0
    para_idx = 0
    for seg in body:
        if seg == '':
            continue
        if "BOOK" in seg and ":" in seg:
            book_num = seg
            ch_idx = 0
            continue
        elif "EPILOGUE" in seg:
            break
        if "CHAPTER" in seg:
            ch_idx += 1
            ch = "C" + str(ch_idx)
            para_idx = 0
        else:
            para_idx += 1
            data[book_num][ch]["P" + str(para_idx)] = {}
            list_sent = nltk.sent_tokenize(seg)
            for i, sent in enumerate(list_sent):
                data[book_num][ch]["P" + str(para_idx)]["S" + str(i+1)] = {"text": sent}
                words = tokenizer.tokenize(sent)
                words = to_lowercase(words)
                words = remove_stopwords(words)
                words = lemmatize(words)
                for j, word in enumerate(words):
                    data[book_num][ch]["P" + str(para_idx)]["S" + str(i+1)]["W" + str(j+1)] = word
    return data
This parser uses NLTK to split paragraphs into sentences, tokenize them, convert words to lowercase, remove stop words, and lemmatize. The chapter, paragraph, sentence, and word indices look like C1, P1, S1, and W1, respectively. The epilogues are ignored. We will look at an example of some parsed data later.
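To make the key scheme concrete, here is a hand-built fragment (with made-up content) in the same shape the parser produces:

```python
# Hand-built fragment showing the key scheme:
# book -> chapter -> paragraph -> sentence -> words
fragment = {
    'BOOK ONE: 1805': {
        'C1': {
            'P1': {
                'S1': {'text': 'A made-up sentence.', 'W1': 'make', 'W2': 'sentence'},
            },
        },
    },
}
sent = fragment['BOOK ONE: 1805']['C1']['P1']['S1']
words = [sent[k] for k in sent if k.startswith('W')]
print(sent['text'], words)
```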
[1.5] Parse the book
header_body_footer = separate_sections(book_txt)
data = init_data(header_body_footer['header'])
data = parser(header_body_footer['body'], data)
This completes the text preprocessing, from separating the body from the header and footer to parsing the contents. The following example shows results from the first paragraph of the first chapter of the first book:
data['BOOK ONE: 1805']['C1']['P1']
[1.6] Save the dictionary into a pickle file
import pickle
with open("data.pickle", "wb") as pickle_out:
    pickle.dump(data, pickle_out)
pickle is used to serialize the dataset into a file.
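Loading the file back later is symmetric; for example (using a temporary file and a small made-up dictionary):

```python
import os
import pickle
import tempfile

# Round-trip a small dictionary through a pickle file
sample = {'BOOK ONE: 1805': {'C1': {}}}
path = os.path.join(tempfile.gettempdir(), 'sample.pickle')
with open(path, 'wb') as f:
    pickle.dump(sample, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored == sample)  # True
```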
[2] Analysis
[2.1] Ratio of direct speech to indirect speech
def get_num_dir_indir_speech(book, ch, para):
    num_direct = 0
    num_indirect = 0
    start_quotation = "“"
    end_quotation = "”"
    quoted = False
    for sent in data[book][ch][para]:
        sentence = data[book][ch][para][sent]['text']
        if start_quotation in sentence and end_quotation in sentence:
            # Check punctuation inside the quoted span, so that a bare quoted
            # word such as “Uncle” counts as indirect speech (case 3 below)
            quoted_span = sentence.split(start_quotation)[1].split(end_quotation)[0]
            if any(p in quoted_span for p in ['.', ',', '!', '?']):
                num_direct += 1
            else:
                num_indirect += 1
                continue
            # Narration before or after the quote counts as indirect speech
            if len(sentence.split(start_quotation)[0]) > 0 or len(sentence.split(end_quotation)[1]) > 0:
                num_indirect += 1
            quoted = False
        elif start_quotation in sentence or quoted:
            num_direct += 1
            quoted = end_quotation not in sentence  # Reset once the quote closes
        elif end_quotation in sentence:
            if len(sentence.split(end_quotation)[1]) > 0:
                num_indirect += 1
            num_direct += 1
            quoted = False
        else:
            num_indirect += 1
    return num_direct, num_indirect
Given a paragraph, this returns the number of direct speech sentences and the number of indirect speech sentences. At the sentence level, it checks whether a sentence contains direct and/or indirect speech using the quotation marks. The following examples show the cases considered:
1) “I thought today’s fete had been canceled. I confess all these festivities and fireworks are becoming wearisome.”
-> 2 direct speech
2) “No, Princess, I have lost your affection forever!” said Mademoiselle Bourienne.
-> 1 direct and 1 indirect speech
3) “Uncle” played another song and a valse.
-> 1 indirect speech
4) She suddenly paused, smiling at her own impetuosity.
-> 1 indirect speech
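The quoted-span check behind these cases can be sketched in isolation. The helper below is a simplified single-sentence classifier for illustration, not the function used in this post:

```python
# Simplified check: a sentence contains direct speech if the span between
# the curly quotes contains sentence punctuation; a bare quoted word does not
def has_direct_speech(sentence):
    if '“' not in sentence or '”' not in sentence:
        return False
    quoted_span = sentence.split('“')[1].split('”')[0]
    return any(p in quoted_span for p in ['.', ',', '!', '?'])

print(has_direct_speech('“No, Princess, I have lost your affection forever!” '
                        'said Mademoiselle Bourienne.'))          # True
print(has_direct_speech('“Uncle” played another song and a valse.'))  # False
```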
def get_ratio_direct(data):
    ratio_direct = {'book': [], 'chapter': [], 'num_direct': [], 'num_indirect': [], 'ratio_direct_to_indirect': []}
    total_direct = 0
    total_indirect = 0
    for book in data:
        for ch in data[book]:
            num_direct_ch = 0
            num_indirect_ch = 0
            for para in data[book][ch]:
                num_direct, num_indirect = get_num_dir_indir_speech(book, ch, para)
                num_direct_ch += num_direct
                num_indirect_ch += num_indirect
            ratio_direct['book'].append(book)
            ratio_direct['chapter'].append(int(ch[1:]))
            ratio_direct['num_direct'].append(num_direct_ch)
            ratio_direct['num_indirect'].append(num_indirect_ch)
            ratio_direct['ratio_direct_to_indirect'].append(num_direct_ch / num_indirect_ch)
            total_direct += num_direct_ch
            total_indirect += num_indirect_ch
    return pd.DataFrame.from_dict(ratio_direct)
ratio_direct = get_ratio_direct(data)
It returns a pandas dataframe that shows the number of direct and indirect speech sentences and the ratio of direct to indirect speech for each chapter. The ratio is a fraction with the number of direct speech sentences as numerator and the number of indirect speech sentences as denominator: # of direct speech / # of indirect speech. The output looks like:
ratio_direct.head()
[2.1.1] Ratio for chapters
The following table shows the list of chapters for which the ratio of direct to indirect speech is 0. These chapters contain no direct speech.
ratio_direct[ratio_direct['ratio_direct_to_indirect'] == 0]
The following table shows chapters for which the ratio is at least 2. These chapters contain at least twice as much direct speech as indirect speech.
ratio_direct[ratio_direct['ratio_direct_to_indirect'] >= 2]
[2.1.2] Ratio for books
The following table and bar chart show the ratio for each book.
ratio_direct_book = ratio_direct.groupby(['book'], sort=False)[['num_direct', 'num_indirect']].sum().reset_index()
ratio_direct_book['ratio_direct_to_indirect'] = ratio_direct_book['num_direct'] / ratio_direct_book['num_indirect']
ratio_direct_book
[2.1.3] Visualization – Bar chart
ratio_direct_book[['book', 'ratio_direct_to_indirect']].plot.bar(x='book')
The ratio over all the books is around 0.694.
ratio_direct_book['num_direct'].sum() / ratio_direct_book['num_indirect'].sum()
[2.1.4] Summary
- We looked at how direct and indirect speech are found and how the ratio is calculated.
- We also found which chapters have no direct speech and which chapters have a lot of direct speech.
- The bar chart shows the ratio of direct speech to indirect speech for each book.
- The earlier books (Books One ~ Eight, except Six) have a higher ratio than the later books (Books Nine ~ Fifteen).
- The ratio over all the books is around 0.694.
[2.2] Lengths of all books and chapters
def get_book_ch_count(data):
    book_ch_count = {'book': [], 'chapter': [], 'word_count': []}
    book_idx = 0
    for book in data:
        book_idx += 1
        for ch in data[book]:
            len_ch = 0
            for para in data[book][ch]:
                for sent in data[book][ch][para]:
                    word_key = list(data[book][ch][para][sent].keys())
                    word_key = word_key[-1]
                    if 'W' in word_key:
                        # Word keys are 1-indexed (W1..Wn), so the last key gives the count
                        num_words = int(word_key[1:])
                        len_ch += num_words
            book_ch_count['book'].append(book)
            book_ch_count['chapter'].append(int(ch[1:]))
            book_ch_count['word_count'].append(len_ch)
    return pd.DataFrame.from_dict(book_ch_count)
book_ch_count = get_book_ch_count(data)
This function returns the word count for each chapter. The output looks like:
book_ch_count.head()
The following converts the above table into a different format: rows are books and columns are chapters. This makes the word counts easier to scan. We will visualize the word counts with a stacked bar chart later.
pivot_book_ch_count = book_ch_count.groupby(['book', 'chapter'], sort=False)['word_count'].sum().unstack('chapter')
pivot_book_ch_count.head()
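As a side note, the reshaping relies on the pandas groupby/unstack pattern, which can be sketched on a toy frame (the numbers below are made up, not from the book):

```python
import pandas as pd

# Toy frame with the same columns as book_ch_count (made-up numbers)
toy = pd.DataFrame({
    'book': ['B1', 'B1', 'B2', 'B2'],
    'chapter': [1, 2, 1, 2],
    'word_count': [100, 200, 300, 400],
})
# groupby + unstack turns (book, chapter) rows into a books x chapters grid
pivot = toy.groupby(['book', 'chapter'], sort=False)['word_count'].sum().unstack('chapter')
print(pivot)
```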
[2.2.1] Visualization – Stacked bar chart
The lengths of all books and chapters can be shown in a stacked bar chart. Here, I define the length as the number of words. In the following chart, the X axis indicates the books, and the Y axis shows the number of words for each book. The legend indicates chapter indices. A colored bar segment in each bar indicates the number of words for each chapter.
pivot_book_ch_count.plot(kind='bar', stacked=True, figsize=(10,6), legend=False, zorder=2)
plt.grid()
plt.legend(title='chapter', loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()
[2.2.2] Summary
- The lengths of all books and chapters can be shown in a stacked bar chart.
- In this case, the length is defined as word counts.
- It seems that the author put more effort in the books in 1805 and 1812 than others.
[2.3] Outliers
[2.3.1] Visualization – Boxplot
In this case, a box plot can be used to find outliers. The following box plot shows results from word counts for all chapters. The top and bottom of the box are the 75th and 25th percentiles, respectively. The green line in the box indicates the median, which is 759. The whiskers (vertical solid lines) indicate the extensions from the top and bottom. In the following box plot, the top whisker ranges from 1010 to 1635, and the bottom whisker ranges from 83 to 588. The circles indicate outliers.
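The whiskers follow matplotlib’s default 1.5 × IQR rule, so the cut-offs can also be computed directly. This is sketched here on made-up counts rather than the actual word counts:

```python
import numpy as np

# Toy word counts (made up); values beyond Q3 + 1.5*IQR are flagged as
# outliers, which is matplotlib's boxplot default
counts = np.array([100, 200, 300, 400, 500, 2000])
q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outlier_values = counts[counts > upper_fence]
print(q1, q3, upper_fence, outlier_values)
```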
_, bp = book_ch_count.boxplot(column='word_count', return_type='both')
outliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
medians = [median.get_ydata() for median in bp["medians"]]
whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]
The following table shows the list of outliers: chapters that have more than 1635 words.
book_ch_count[book_ch_count['word_count'].isin(outliers[0])]
[2.3.2] Summary
- We looked at how a box plot can be used to find outliers.
- In this case, the outliers are the chapters that have more than 1635 words.
- The median is 759 words.
[2.4] Word frequency analysis
[2.4.1] Visualization – Line chart
Now, let’s look at some of the most frequent words. The following shows the 50 most frequent words. We can see verbs that the author used often in these books, as well as names that appear frequently. The names are Pierre, Rostóv, Natásha and Andrew. These people are likely main characters. Not surprisingly, the word say was used most frequently because there is a lot of direct speech.
def get_list_words(data):
    list_words = []
    for book in data:
        for chapter in data[book]:
            for para in data[book][chapter]:
                for sent in data[book][chapter][para]:
                    for key in data[book][chapter][para][sent].keys():
                        if 'W' in key:
                            list_words.append(data[book][chapter][para][sent][key])
    return list_words
list_words = get_list_words(data)
freqdist = nltk.FreqDist(list_words)
plt.figure(figsize=(16,5))
freqdist.plot(50)
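As a side note, nltk.FreqDist is essentially a counter over the word list, and the counts can be reproduced with the standard library. This is shown here on a made-up word list:

```python
from collections import Counter

# Toy word list (made up); FreqDist(list_words).most_common(n) behaves
# like Counter(list_words).most_common(n)
toy_words = ['say', 'go', 'say', 'pierre', 'say', 'go']
freq = Counter(toy_words)
print(freq.most_common(2))  # [('say', 3), ('go', 2)]
```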
[2.4.2] Summary
- Here, a line chart was used to visualize word counts.
- We found verbs that the author used often in these books and names that appear frequently.
- The names are Pierre, Rostóv, Natásha and Andrew.
- These people are likely main characters.
- The word say was used most frequently because there is a lot of direct speech.