Analysis of Text: Tolstoy’s War and Peace
In this post, we will look at an example of analyzing text data: the text of Tolstoy’s War and Peace. You can download a plain text file from http://www.gutenberg.org/ebooks/2600. Of course, there are many other possible approaches, so please feel free to come up with your own analysis. The following code is written in Python:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
[1] Load the book and parse the text into a structured format (Python Dictionary)
with open('War_and_Peace.txt', 'r') as f:
    book_txt = f.read().split('\n\n')
The raw text is split on blank lines into the list book_txt. Each element of the list is a title, a header, a paragraph, etc. The following are examples:
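As a minimal illustration (on a made-up snippet, not the actual book text), splitting on '\n\n' separates blocks while single newlines stay inside a paragraph:

```python
# Toy snippet (not from the book) illustrating the '\n\n' split
raw = "TITLE\n\nFirst paragraph,\nwrapped over two lines.\n\nSecond paragraph."
segments = raw.split('\n\n')
print(segments)
# Each element is one title/paragraph; line wraps ('\n') remain inside it
```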
book_txt[405: 413]
[1.1] Separate the body from the header and the footer
def separate_sections(book_txt):
    section = "header"
    header_body_footer = {'header': [], 'body': [], 'footer': []}
    for seg in book_txt:
        seg = seg.replace('\n', ' ')  # Remove '\n' inside paragraphs
        if seg == " BOOK ONE: 1805":
            section = "body"
        elif seg == " End of the Project Gutenberg EBook of War and Peace, by Leo Tolstoy":
            section = "footer"
        header_body_footer[section].append(seg.lstrip())
    return header_body_footer
This function splits the list book_txt into header, body, and footer sections of the dictionary header_body_footer. header_body_footer['header'] holds the header including the table of contents, header_body_footer['body'] holds the body, and header_body_footer['footer'] holds the footer.
[1.2] Initialize the data structure
def init_data(header):
    data = {}
    book_num = None
    chapter_idx = 0
    for seg in header:
        if "BOOK " in seg and ":" in seg:
            book_num = seg
            data[book_num] = {}
            chapter_idx = 0
        elif "CHAPTER" in seg and book_num is not None:
            chapter_idx += 1
            data[book_num]["C" + str(chapter_idx)] = {}
        elif "EPILOGUE" in seg:
            break
    return data
This function is to initialize the data structure (dictionary) with the index of books and chapters. The initial structure looks like {‘BOOK ONE: 1805’: {‘C1’: {}, ‘C2’: {}, … , ‘C28’: {}}, ‘BOOK TWO: 1805’: {‘C1’: {}, …}, …}. ‘C’ indicates a chapter and the following number is a chapter number. The epilogues are ignored.
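The behavior can be checked on a toy table of contents. The heading strings below are made up to match the patterns the function looks for, and the demo is a minimal re-run of the same logic:

```python
# Minimal re-run of the init_data logic on a toy header list
def init_data_demo(header):
    data = {}
    book_num = None
    chapter_idx = 0
    for seg in header:
        if "BOOK " in seg and ":" in seg:
            book_num = seg
            data[book_num] = {}
            chapter_idx = 0
        elif "CHAPTER" in seg and book_num is not None:
            chapter_idx += 1
            data[book_num]["C" + str(chapter_idx)] = {}
        elif "EPILOGUE" in seg:
            break
    return data

toy_header = ["BOOK ONE: 1805", "CHAPTER I", "CHAPTER II",
              "BOOK TWO: 1805", "CHAPTER I", "EPILOGUE"]
print(init_data_demo(toy_header))
# {'BOOK ONE: 1805': {'C1': {}, 'C2': {}}, 'BOOK TWO: 1805': {'C1': {}}}
```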
[1.3] Functions for text preprocessing
def to_lowercase(words):
    new_words = []
    for word in words:
        new_words.append(word.lower())
    return new_words
This function converts every word in a list to lowercase.
from nltk.corpus import stopwords
stopWords = set(stopwords.words('english'))

def remove_stopwords(words):
    new_words = []
    for word in words:
        if word not in stopWords:  # Use the precomputed set instead of re-reading the corpus on every word
            new_words.append(word)
    return new_words
This function removes stop words.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def lemmatize(words):
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words
This lemmatization maps each word to its canonical form based on the word’s lemma. Since pos='v' is passed, words are treated as verbs (e.g., ran -> run); an example like better -> good would require pos='a'.
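Taken together, the three preprocessing steps can be sketched without downloading the NLTK corpora. The tiny stop-word set and the trailing-'s' rule below are crude stand-ins for stopwords.words('english') and WordNetLemmatizer, not the real behavior:

```python
# Stand-in pipeline: lowercase -> drop stop words -> crude "lemmatize"
TOY_STOPWORDS = {'the', 'a', 'and', 'was'}  # stand-in for the NLTK stop-word list

def toy_preprocess(words):
    words = [w.lower() for w in words]
    words = [w for w in words if w not in TOY_STOPWORDS]
    # crude stand-in for lemmatization: strip a trailing 's'
    return [w[:-1] if w.endswith('s') else w for w in words]

print(toy_preprocess(['The', 'guests', 'Was', 'arriving']))
# ['guest', 'arriving']
```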
[1.4] Parser
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r'\w+') # This is to remove special characters
def parser(body, data):
    book_num = None
    ch_idx = 0
    para_idx = 0
    for seg in body:
        if seg == '':
            continue
        if "BOOK" in seg and ":" in seg:
            book_num = seg
            ch_idx = 0
            continue
        elif "EPILOGUE" in seg:
            break
        if "CHAPTER" in seg:
            ch_idx += 1
            ch = "C" + str(ch_idx)
            para_idx = 0
        else:
            para_idx += 1
            data[book_num][ch]["P" + str(para_idx)] = {}
            list_sent = nltk.sent_tokenize(seg)
            for i, sent in enumerate(list_sent):
                data[book_num][ch]["P" + str(para_idx)]["S" + str(i+1)] = {"text": sent}
                words = tokenizer.tokenize(sent)
                words = to_lowercase(words)
                words = remove_stopwords(words)
                words = lemmatize(words)
                for j, word in enumerate(words):
                    data[book_num][ch]["P" + str(para_idx)]["S" + str(i+1)]["W" + str(j+1)] = word
    return data
This parser uses NLTK to split paragraphs into sentences, tokenize them, convert words to lowercase, remove stop words, and lemmatize. The chapter, paragraph, sentence, and word indices look like C1, P1, S1, and W1, respectively. The epilogues are ignored. We will look at an example of some parsed data later.
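To make the key scheme concrete, here is a hand-built fragment (with made-up content) in the same shape the parser produces:

```python
# Hand-built fragment showing the key scheme:
# book -> chapter -> paragraph -> sentence -> words
fragment = {
    'BOOK ONE: 1805': {
        'C1': {
            'P1': {
                'S1': {'text': 'A made-up sentence.', 'W1': 'make', 'W2': 'sentence'},
            },
        },
    },
}
sent = fragment['BOOK ONE: 1805']['C1']['P1']['S1']
words = [sent[k] for k in sent if k.startswith('W')]
print(sent['text'], words)
```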
[1.5] Parse the book
header_body_footer = separate_sections(book_txt)
data = init_data(header_body_footer['header'])
data = parser(header_body_footer['body'], data)
This completes the text preprocessing, from separating the body from the header and footer to parsing the contents. The following example shows results from the first paragraph of the first chapter of the first book:
data['BOOK ONE: 1805']['C1']['P1']
[1.6] Save the dictionary into a pickle file
import pickle
with open("data.pickle", "wb") as pickle_out:
    pickle.dump(data, pickle_out)
pickle is used to serialize the dataset into a file.
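Loading the file back later is symmetric; for example (using a temporary file and a small made-up dictionary):

```python
import os
import pickle
import tempfile

# Round-trip a small dictionary through a pickle file
sample = {'BOOK ONE: 1805': {'C1': {}}}
path = os.path.join(tempfile.gettempdir(), 'sample.pickle')
with open(path, 'wb') as f:
    pickle.dump(sample, f)
with open(path, 'rb') as f:
    restored = pickle.load(f)
print(restored == sample)  # True
```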
[2] Analysis
[2.1] Ratio of direct speech to indirect speech
def get_num_dir_indir_speech(book, ch, para):
    num_direct = 0
    num_indirect = 0
    start_quotation = "“"
    end_quotation = "”"
    quoted = False
    for sent in data[book][ch][para]:
        sentence = data[book][ch][para][sent]['text']
        if start_quotation in sentence and end_quotation in sentence:
            # Check punctuation inside the quoted span, so that a bare quoted
            # word such as “Uncle” counts as indirect speech (case 3 below)
            quoted_span = sentence.split(start_quotation)[1].split(end_quotation)[0]
            if any(p in quoted_span for p in ['.', ',', '!', '?']):
                num_direct += 1
            else:
                num_indirect += 1
                continue
            # Narration before or after the quote counts as indirect speech
            if len(sentence.split(start_quotation)[0]) > 0 or len(sentence.split(end_quotation)[1]) > 0:
                num_indirect += 1
            quoted = False
        elif start_quotation in sentence or quoted:
            num_direct += 1
            quoted = end_quotation not in sentence  # Reset once the quote closes
        elif end_quotation in sentence:
            if len(sentence.split(end_quotation)[1]) > 0:
                num_indirect += 1
            num_direct += 1
            quoted = False
        else:
            num_indirect += 1
    return num_direct, num_indirect
Given a paragraph, this returns the number of direct speech sentences and the number of indirect speech sentences. At the sentence level, it checks whether a sentence contains direct and/or indirect speech using the quotation marks. The following examples show the cases considered:
1) “I thought today’s fete had been canceled. I confess all these festivities and fireworks are becoming wearisome.”
-> 2 direct speech
2) “No, Princess, I have lost your affection forever!” said Mademoiselle Bourienne.
-> 1 direct and 1 indirect speech
3) “Uncle” played another song and a valse.
-> 1 indirect speech
4) She suddenly paused, smiling at her own impetuosity.
-> 1 indirect speech
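The quoted-span check behind these cases can be sketched in isolation. The helper below is a simplified single-sentence classifier for illustration, not the function used in this post:

```python
# Simplified check: a sentence contains direct speech if the span between
# the curly quotes contains sentence punctuation; a bare quoted word does not
def has_direct_speech(sentence):
    if '“' not in sentence or '”' not in sentence:
        return False
    quoted_span = sentence.split('“')[1].split('”')[0]
    return any(p in quoted_span for p in ['.', ',', '!', '?'])

print(has_direct_speech('“No, Princess, I have lost your affection forever!” '
                        'said Mademoiselle Bourienne.'))          # True
print(has_direct_speech('“Uncle” played another song and a valse.'))  # False
```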
def get_ratio_direct(data):
    ratio_direct = {'book': [], 'chapter': [], 'num_direct': [], 'num_indirect': [], 'ratio_direct_to_indirect': []}
    total_direct = 0
    total_indirect = 0
    for book in data:
        for ch in data[book]:
            num_direct_ch = 0
            num_indirect_ch = 0
            for para in data[book][ch]:
                num_direct, num_indirect = get_num_dir_indir_speech(book, ch, para)
                num_direct_ch += num_direct
                num_indirect_ch += num_indirect
            ratio_direct['book'].append(book)
            ratio_direct['chapter'].append(int(ch[1:]))
            ratio_direct['num_direct'].append(num_direct_ch)
            ratio_direct['num_indirect'].append(num_indirect_ch)
            ratio_direct['ratio_direct_to_indirect'].append(num_direct_ch / num_indirect_ch)
            total_direct += num_direct_ch
            total_indirect += num_indirect_ch
    return pd.DataFrame.from_dict(ratio_direct)
ratio_direct = get_ratio_direct(data)
It returns a pandas dataframe that shows the number of direct and indirect speech sentences and the ratio of direct to indirect speech for each chapter. The ratio is a fraction with the number of direct speech sentences as numerator and the number of indirect speech sentences as denominator: # of direct speech / # of indirect speech. The output looks like:
ratio_direct.head()
[2.1.1] Ratio for chapters
The following table shows the list of chapters for which the ratio of direct to indirect speech is 0. These chapters contain no direct speech.
ratio_direct[ratio_direct['ratio_direct_to_indirect'] == 0]
The following table shows chapters for which the ratio is at least 2. These chapters contain at least twice as much direct speech as indirect speech.
ratio_direct[ratio_direct['ratio_direct_to_indirect'] >= 2]
[2.1.2] Ratio for books
The following table and bar chart show the ratio for each book.
ratio_direct_book = ratio_direct.groupby(['book'], sort=False)[['num_direct', 'num_indirect']].sum().reset_index()
ratio_direct_book['ratio_direct_to_indirect'] = ratio_direct_book['num_direct'] / ratio_direct_book['num_indirect']
ratio_direct_book
[2.1.3] Visualization – Bar chart
ratio_direct_book[['book', 'ratio_direct_to_indirect']].plot.bar(x='book')
The ratio over all the books is around 0.694.
ratio_direct_book['num_direct'].sum() / ratio_direct_book['num_indirect'].sum()
[2.1.4] Summary
- We looked at how direct and indirect speech are found and how the ratio is calculated.
- We also found which chapters have no direct speech and which chapters have a lot of direct speech.
- The bar chart shows the ratio of direct speech to indirect speech for each book.
- The earlier books (Books One ~ Eight, except Six) have a higher ratio than the later books (Books Nine ~ Fifteen).
- The ratio over all the books is around 0.694.
[2.2] Lengths of all books and chapters
def get_book_ch_count(data):
    book_ch_count = {'book': [], 'chapter': [], 'word_count': []}
    book_idx = 0
    for book in data:
        book_idx += 1
        for ch in data[book]:
            len_ch = 0
            for para in data[book][ch]:
                for sent in data[book][ch][para]:
                    word_key = list(data[book][ch][para][sent].keys())
                    word_key = word_key[-1]
                    if 'W' in word_key:
                        # Word keys are 1-indexed (W1..Wn), so the last key gives the count
                        num_words = int(word_key[1:])
                        len_ch += num_words
            book_ch_count['book'].append(book)
            book_ch_count['chapter'].append(int(ch[1:]))
            book_ch_count['word_count'].append(len_ch)
    return pd.DataFrame.from_dict(book_ch_count)
book_ch_count = get_book_ch_count(data)
This function returns the word count for each chapter. The output looks like:
book_ch_count.head()
The following converts the above table into a different format: rows are books and columns are chapters. This makes the word counts easier to scan. We will visualize the word counts with a stacked bar chart later.
pivot_book_ch_count = book_ch_count.groupby(['book', 'chapter'], sort=False)['word_count'].sum().unstack('chapter')
pivot_book_ch_count.head()
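As a side note, the reshaping relies on the pandas groupby/unstack pattern, which can be sketched on a toy frame (the numbers below are made up, not from the book):

```python
import pandas as pd

# Toy frame with the same columns as book_ch_count (made-up numbers)
toy = pd.DataFrame({
    'book': ['B1', 'B1', 'B2', 'B2'],
    'chapter': [1, 2, 1, 2],
    'word_count': [100, 200, 300, 400],
})
# groupby + unstack turns (book, chapter) rows into a books x chapters grid
pivot = toy.groupby(['book', 'chapter'], sort=False)['word_count'].sum().unstack('chapter')
print(pivot)
```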
[2.2.1] Visualization – Stacked bar chart
The lengths of all books and chapters can be shown in a stacked bar chart. Here, I define the length as the number of words. In the following chart, the X axis indicates the books, and the Y axis shows the number of words for each book. The legend indicates chapter indices. A colored bar segment in each bar indicates the number of words for each chapter.
pivot_book_ch_count.plot(kind='bar', stacked=True, figsize=(10,6), legend=False, zorder=2)
plt.grid()
plt.legend(title='chapter', loc='center left', bbox_to_anchor=(1.0, 0.5))
plt.show()
[2.2.2] Summary
- The lengths of all books and chapters can be shown in a stacked bar chart.
- In this case, the length is defined as word counts.
- It seems that the author put more effort in the books in 1805 and 1812 than others.
[2.3] Outliers
[2.3.1] Visualization – Boxplot
In this case, a box plot can be used to find outliers. The following box plot shows results from word counts for all chapters. The top and bottom of the box are the 75th and 25th percentiles, respectively. The green line in the box indicates the median, which is 759. The whiskers (vertical solid lines) indicate the extensions from the top and bottom. In the following box plot, the top whisker ranges from 1010 to 1635, and the bottom whisker ranges from 83 to 588. The circles indicate outliers.
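The whiskers follow matplotlib’s default 1.5 × IQR rule, so the cut-offs can also be computed directly. This is sketched here on made-up counts rather than the actual word counts:

```python
import numpy as np

# Toy word counts (made up); values beyond Q3 + 1.5*IQR are flagged as
# outliers, which is matplotlib's boxplot default
counts = np.array([100, 200, 300, 400, 500, 2000])
q1, q3 = np.percentile(counts, [25, 75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outlier_values = counts[counts > upper_fence]
print(q1, q3, upper_fence, outlier_values)
```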
_, bp = book_ch_count.boxplot(column='word_count', return_type='both')
outliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
medians = [median.get_ydata() for median in bp["medians"]]
whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]
The following table shows the list of outliers: chapters that have more than 1635 words.
book_ch_count[book_ch_count['word_count'].isin(outliers[0])]
[2.3.2] Summary
- We looked at how a box plot can be used to find outliers.
- In this case, the outliers are the chapters that have more than 1635 words.
- The median is 759 words.
[2.4] Word frequency analysis
[2.4.1] Visualization – Line chart
Now, let’s look at some of the most frequent words. The following shows the 50 most frequent words. We can see verbs that the author used often in these books, as well as names that appear frequently. The names are Pierre, Rostóv, Natásha and Andrew. These people are likely main characters. Not surprisingly, the word say was used most frequently because there is a lot of direct speech.
def get_list_words(data):
    list_words = []
    for book in data:
        for chapter in data[book]:
            for para in data[book][chapter]:
                for sent in data[book][chapter][para]:
                    for key in data[book][chapter][para][sent].keys():
                        if 'W' in key:
                            list_words.append(data[book][chapter][para][sent][key])
    return list_words
list_words = get_list_words(data)
freqdist = nltk.FreqDist(list_words)
plt.figure(figsize=(16,5))
freqdist.plot(50)
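As a side note, nltk.FreqDist is essentially a counter over the word list, and the counts can be reproduced with the standard library. This is shown here on a made-up word list:

```python
from collections import Counter

# Toy word list (made up); FreqDist(list_words).most_common(n) behaves
# like Counter(list_words).most_common(n)
toy_words = ['say', 'go', 'say', 'pierre', 'say', 'go']
freq = Counter(toy_words)
print(freq.most_common(2))  # [('say', 3), ('go', 2)]
```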
[2.4.2] Summary
- Here, a line chart was used to visualize word counts.
- We found verbs that the author used often in these books and names that appear frequently.
- The names are Pierre, Rostóv, Natásha and Andrew.
- These people are likely main characters.
- The word say was used most frequently because there is a lot of direct speech.