How to Lemmatize a Dataframe in Python

In this tutorial, we will show you how to lemmatize a dataframe in Python using the NLTK library.

Updated March 29, 2023

Hello and welcome to this beginner’s tutorial on how to lemmatize a dataframe in Python. Lemmatization is the process of converting a word to its base (dictionary) form, or lemma. It’s a useful technique in natural language processing because it lets inflected forms of the same word be counted as one term, which can make text analysis more consistent. We will use the NLTK library throughout.

What is a Dataframe?

A dataframe is a two-dimensional tabular data structure in which data is organized in rows and columns, similar to a spreadsheet or a SQL table. Dataframes are a popular data structure in data science and machine learning and are used for data manipulation, analysis, and visualization.
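As a minimal sketch of what this looks like in practice (assuming the pandas library is installed; the column name and sentences are just sample data), a dataframe with a single text column can be built from a dictionary:

```python
import pandas as pd

# Each dictionary key becomes a column; each list item becomes a row
df = pd.DataFrame({"text": ["I am running", "He ran fast", "They are runners"]})

# Rows and columns, like a small spreadsheet
print(df.shape)

# A single column can be selected by name
print(df["text"].tolist())
```

Selecting a column by name like this is what lets us run a function over every sentence later on.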

What is Lemmatization?

Lemmatization is the process of reducing a word to its base form, or lemma, using morphological analysis. The lemma is the dictionary form of the word, and the correct lemma depends on the word’s part of speech (noun, verb, adjective, or adverb). For example, when treated as verbs, the words “running,” “ran,” and “runs” all have the lemma “run.”

Method: Using the NLTK Library

The Natural Language Toolkit (NLTK) is a popular Python library for natural language processing. It provides a wide range of tools and resources for text processing, including lemmatization.

Here’s an example of how to lemmatize a dataframe using the NLTK library:

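import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer

# Download the tokenizer and WordNet data on first run
# (resource names vary by NLTK version; newer releases use 'punkt_tab')
for resource in ('punkt', 'punkt_tab', 'wordnet'):
    nltk.download(resource, quiet=True)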

# Create a sample dataframe
df = pd.DataFrame({'text': ['I am running', 'He ran fast', 'They are runners']})

# Create a lemmatizer object
lemmatizer = WordNetLemmatizer()

# Define a function to lemmatize text
def lemmatize_text(text):
    # Tokenize the text into words
    words = nltk.word_tokenize(text)

    # Lemmatize each word (lemmatize() treats words as nouns unless pos is given)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

    # Join the lemmatized words back into a string
    lemmatized_text = ' '.join(lemmatized_words)

    return lemmatized_text

# Apply the lemmatize_text function to the text column of the dataframe
df['lemmatized_text'] = df['text'].apply(lemmatize_text)

# Print the dataframe, showing the original and lemmatized columns side by side
print(df)

In this code, we first import the necessary libraries, pandas and NLTK. We then create a sample dataframe with a text column containing three sentences, and a lemmatizer object from NLTK’s WordNetLemmatizer class. The lemmatize_text function takes a text string as input, tokenizes it into words, lemmatizes each word, and joins the lemmatized words back into a string. We apply this function to the text column using the apply() method and store the result in a new column called lemmatized_text. Finally, we print the dataframe, which now holds the original and lemmatized text side by side. Note that lemmatize() treats every word as a noun unless a part of speech is passed, so in this example “runners” becomes “runner” while verb forms such as “running” and “ran” are left unchanged.

Conclusion

In this tutorial, we have shown you how to lemmatize a dataframe in Python using the NLTK library. We have covered the basics of lemmatization, creating a lemmatizer object, defining a lemmatization function, applying the function to a dataframe column, and printing the dataframe with its original and lemmatized columns.

Lemmatization is a common preprocessing step in natural language processing, and it can help text analysis and classification tasks by grouping the inflected forms of a word under a single lemma.
