Box Plot

A box plot (also called a box-and-whisker plot) is useful for visualizing the distribution of data and spotting outliers. It displays the five-number summary: minimum, first quartile, median, third quartile, and maximum. Let’s look at the details using a box plot generated from Tolstoy’s War and Peace (http://www.gutenberg.org/ebooks/2600). Another post, “Analysis of Text: Tolstoy’s War and Peace”, covers preprocessing this text. In this post, we focus on visualizing the data with a box plot.
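To make the five-number summary concrete, here is a minimal sketch that computes it with NumPy. The helper name five_number_summary and the sample word counts are made up for illustration; they are not taken from the novel.

```python
import numpy as np

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a sequence of numbers."""
    data = np.asarray(values, dtype=float)
    # np.percentile interpolates linearly between data points by default
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    return float(data.min()), float(q1), float(median), float(q3), float(data.max())

# Made-up word counts, for illustration only
sample_counts = [120, 300, 450, 600, 750, 900, 1050, 1200, 2500]
summary = five_number_summary(sample_counts)
print(summary)
```

Different libraries interpolate quartiles slightly differently, so the values can shift a little on small samples.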

Python Code

I preprocessed the text into a pandas DataFrame, book_ch_count, with three columns: book, chapter, and word_count. The text has 15 books and 337 chapters in total, and word_count holds the number of words in each chapter. The following table shows the first 5 rows.

book_ch_count.head()

# return_type='both' returns (axes, lines); lines is a dict of matplotlib artists
_, bp = book_ch_count.boxplot(column='word_count', return_type='both')
# Read the y-coordinates of each plot element
outliers = [flier.get_ydata() for flier in bp["fliers"]]
boxes = [box.get_ydata() for box in bp["boxes"]]
medians = [median.get_ydata() for median in bp["medians"]]
whiskers = [whisker.get_ydata() for whisker in bp["whiskers"]]
print(medians)

[array([759., 759.])]

print(whiskers)

[array([588., 83.]), array([1010., 1635.])]

The box plot above shows the word counts per chapter across all books. The top and bottom of the box are the 75th and 25th percentiles (1010 and 588), respectively. The green line in the box marks the median, which is 759. The whiskers (vertical solid lines) extend from the top and bottom of the box: the top whisker runs from 1010 up to 1635, and the bottom whisker runs from 588 down to 83. The circles indicate outliers.
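Where the whiskers stop depends on the plotting defaults. Assuming the plot was drawn with matplotlib's default setting (whis=1.5), each whisker ends at the most extreme data point that lies within 1.5 times the interquartile range of the box. The sketch below applies that rule to the quartiles printed above and to a made-up sample; the helper whisker_ends is hypothetical, not part of pandas or matplotlib.

```python
import numpy as np

def whisker_ends(values, whis=1.5):
    """Return (lower_end, upper_end) of the whiskers under the whis * IQR rule."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    lower_fence = q1 - whis * iqr
    upper_fence = q3 + whis * iqr
    # Whiskers stop at the most extreme observations still inside the fences
    inside = data[(data >= lower_fence) & (data <= upper_fence)]
    return float(inside.min()), float(inside.max())

# Fences implied by the box edges in the printed whisker output (588 and 1010)
iqr = 1010 - 588
fences = (588 - 1.5 * iqr, 1010 + 1.5 * iqr)
print(fences)

# Made-up word counts, for illustration only
sample_counts = [120, 300, 450, 600, 750, 900, 1050, 1200, 2500]
ends = whisker_ends(sample_counts)
print(ends)
```

Under this assumption the upper fence for the chapter data is 1643, which agrees with a top whisker ending at 1635, the largest word count at or below the fence. The lower fence is negative, so the bottom whisker simply ends at the smallest chapter, 83 words.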

The following table lists the outliers, that is, the chapters with more than 1635 words.

book_ch_count[book_ch_count['word_count'].isin(outliers[0])]
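The same rows can also be selected without drawing the plot first, by building a boolean mask from the quartiles. This is a sketch under the assumption that the default 1.5 * IQR rule applies; the demo frame is a hypothetical stand-in for book_ch_count with made-up values.

```python
import pandas as pd

# Hypothetical stand-in for book_ch_count; the values are made up
demo = pd.DataFrame({
    "book": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "chapter": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "word_count": [120, 300, 450, 600, 750, 900, 1050, 1200, 2500],
})

q1 = demo["word_count"].quantile(0.25)
q3 = demo["word_count"].quantile(0.75)
iqr = q3 - q1

# Rows outside the fences are the ones a default box plot draws as circles
is_outlier = (demo["word_count"] < q1 - 1.5 * iqr) | (demo["word_count"] > q3 + 1.5 * iqr)
outlier_rows = demo[is_outlier].sort_values("word_count", ascending=False)
print(outlier_rows)
```

Sorting by word_count in descending order puts the longest chapters at the top of the table.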