Learn Text Cleaning | Strings, Dates, Missing Data
Data Wrangling with Polars

Text Cleaning

When working with real-world data, text columns often contain unwanted whitespace, extra symbols, or inconsistent formatting. In Polars, you can use the .str namespace to efficiently clean and transform these columns. Suppose you have a DataFrame with a name column that sometimes includes leading or trailing whitespace and punctuation, and a genres column where genres are stored as a single string separated by slashes, like "Drama / Comedy".

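To clean the name column, you can use .str.strip_chars() to remove whitespace and specific symbols from both ends of each string. For the genres column, note that .str.split() splits on a literal separator rather than a regular expression, so you can first normalize the separator with .str.replace_all() and a regular expression, then call .str.split() to turn the string into a list of genres.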

Here is a script that demonstrates these techniques:

    import polars as pl

    df = pl.DataFrame({
        "name": [" Alice! ", "Bob.", " Carol ", "David-"],
        "genres": ["Drama / Comedy", "Action/Thriller", "Sci-Fi / Adventure", "Romance"]
    })

    # Strip whitespace, then trailing symbols, from 'name'
    cleaned_df = df.with_columns([
        pl.col("name").str.strip_chars().str.strip_chars("!.-").alias("name_clean")
    ])

    # Use a regex to normalize the slash separator, then split genres into a list
    cleaned_df = cleaned_df.with_columns([
        pl.col("genres").str.replace_all(r"\s*/\s*", ",").str.split(",").alias("genres_list")
    ])

    print(cleaned_df)

Which .str method would you use to check if a genre column contains the word "Comedy"?

Section 3. Chapter 1

Ask AI

expand

Ask AI

ChatGPT

Ask anything or try one of the suggested questions to begin our chat

Section 3. Chapter 1
some-alt