A General Approach to Preprocessing Text Data

You can find the whole process from preprocessing to visualization in another post, Analysis of Text: Tolstoy’s War and Peace, if you are interested in the full source code. In this post, we will discuss a general approach to preprocessing text data and walk through Python code that uses the NLTK (Natural Language Toolkit) library. The following diagram shows the general idea of the preprocessing pipeline from text (e.g., books) to a list of words:

This pipeline can be generalized as: Text -> Tokenization -> Normalization -> A list of words. Let’s look at the details step by step.

Tokenization

Tokenization is a step that splits text into smaller pieces (also called tokens). For example, it splits a paragraph into sentences or a sentence into words. It is also referred to as text segmentation or lexical analysis.

import nltk

# The sentence tokenizer needs the Punkt models: nltk.download('punkt')
list_sent = nltk.sent_tokenize(paragraph)

This Python code outputs a list of sentences from a paragraph. The following examples show the results (sentences and words) from tokenizing the first paragraph of Tolstoy’s War and Peace (http://www.gutenberg.org/ebooks/2600):

“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes. But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself! But how do you do? I see I have frightened you—sit down and tell me all the news.”

The list of output sentences:

['“Well, Prince, so Genoa and Lucca are now just family estates of the Buonapartes.', 'But I warn you, if you don’t tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist—I really believe he is Antichrist—I will have nothing more to do with you and you are no longer my friend, no longer my ‘faithful slave,’ as you call yourself!', 'But how do you do?', 'I see I have frightened you—sit down and tell me all the news.”']

To split each sentence into words, we can use a RegexpTokenizer, which also strips out special characters:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r'\w+')  # keep only word characters, removing special characters

words = tokenizer.tokenize(sent)

The tokenized word list (with special characters removed) from the first sentence:

['Well', 'Prince', 'so', 'Genoa', 'and', 'Lucca', 'are', 'now', 'just', 'family', 'estates', 'of', 'the', 'Buonapartes']
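
For comparison, NLTK’s built-in word tokenizer keeps punctuation as separate tokens instead of dropping it. A minimal sketch, reusing the nltk import from above and the same first sentence stored in sent:

tokens = nltk.word_tokenize(sent)
# punctuation such as ',' and '.' stays in the list as separate tokens

Whether you keep or drop punctuation depends on the later analysis; for word-frequency counts, dropping it with RegexpTokenizer as above is usually enough.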

Normalization

Before analyzing text, we need to normalize the words. If we do not normalize them, the words ‘Family’ and ‘family’ might be treated as different words. Also, you might want to treat the words ‘better’ and ‘good’ as the same word; it depends on your purpose. The most common normalization approaches to consider are:

  • convert all characters to lowercase
  • remove stop words
  • remove numbers, or convert them to strings (see the sketch after this list)
  • stemming
  • lemmatization
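
The lowercase, stop-word, stemming, and lemmatization steps are shown in the code below. Number handling is not, so here is a minimal sketch of the removal option; the helper name remove_numbers is just for illustration:

def remove_numbers(words):
    # drop tokens that consist only of digits, e.g., '1812'
    new_words = []
    for word in words:
        if not word.isdigit():
            new_words.append(word)
    return new_words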

The following code converts characters to lowercase and removes stop words:

from nltk.corpus import stopwords

stopWords = set(stopwords.words('english'))  # requires nltk.download('stopwords')

def to_lowercase(words):
    new_words = []
    for word in words:
        new_words.append(word.lower())
    return new_words

def remove_stopwords(words):
    new_words = []
    for word in words:
        if word not in stopWords:
            new_words.append(word)
    return new_words
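
To get the normalized list for the first sentence, apply both functions in sequence (a small usage sketch, where words is the tokenized list from above):

words = remove_stopwords(to_lowercase(words))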

The result would be:

['well', 'prince', 'genoa', 'lucca', 'family', 'estates', 'buonapartes']

The stop words (‘so’, ‘and’, ‘are’, ‘now’, ‘just’, ‘of’ and ‘the’) were removed.

Stemming is the process of removing affixes from a word, e.g., running -> run.

from nltk.stem import LancasterStemmer

stemmer = LancasterStemmer()

def stem_words(words):
    new_words = []
    for word in words:
        new_words.append(stemmer.stem(word))
    return new_words
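
A quick usage sketch (the exact stems depend on the stemmer; the Lancaster stemmer is fairly aggressive and may truncate some words heavily):

stems = stem_words(['running', 'frightened', 'estates'])  # 'running' -> 'run'

If the aggressive truncation is a problem, NLTK also provides milder alternatives such as PorterStemmer and SnowballStemmer.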

Lemmatization maps a word to its canonical form (its lemma), e.g., better -> good.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def lemmatize_verbs(words):
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos='v'))
    return new_words
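
Note that the lemmatizer needs the WordNet data (nltk.download('wordnet')) and benefits from a part-of-speech hint. A small sketch of how the pos argument changes the result:

lemmatizer.lemmatize('was', pos='v')     # -> 'be'
lemmatizer.lemmatize('better', pos='a')  # -> 'good' (adjective lookup, not handled by lemmatize_verbs)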

You might want to choose either stemming or lemmatization, whichever is appropriate for your purpose. You can find more details in Stemming and lemmatization.
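Putting the steps together, here is a minimal end-to-end sketch of the pipeline (Text -> Tokenization -> Normalization -> A list of words). It uses the functions defined above and assumes the full text is stored in a string called text; the resulting list_words feeds the visualization below:

list_words = []
for sent in nltk.sent_tokenize(text):
    words = tokenizer.tokenize(sent)    # split into words, dropping special characters
    words = to_lowercase(words)
    words = remove_stopwords(words)
    words = lemmatize_verbs(words)      # or stem_words(words), depending on your purpose
    list_words.extend(words)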

Visualization

There are many approaches to visualizing the distribution of words. The following graph (line chart) shows the 50 most frequent words in Tolstoy’s War and Peace.

import matplotlib.pyplot as plt

freqdist = nltk.FreqDist(list_words)  # list_words is the normalized word list built above
plt.figure(figsize=(16, 5))
freqdist.plot(50)  # plot the 50 most frequent words
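
If you want the counts themselves rather than a chart, FreqDist also exposes most_common:

print(freqdist.most_common(10))  # the 10 most frequent (word, count) pairs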

We can see some verbs that the author used often, as well as names that appear frequently: Pierre, Rostóv, Natásha, and Andrew. These are likely the main characters. Not surprisingly, the word ‘say’ was used most frequently, because there is a lot of direct speech (i.e., quoted text). We will look at other visualization approaches in future posts.