Learn Text Cleaning | Strings, Dates, Missing Data


When working with real-world data, text columns often contain unwanted whitespace, extra symbols, or inconsistent formatting. In Polars, you can use the .str namespace to efficiently clean and transform these columns. Suppose you have a DataFrame with a name column that sometimes includes leading or trailing whitespace and punctuation, and a genres column where genres are stored as a single string separated by slashes, like "Drama / Comedy".

To clean the name column, you can use .str.strip_chars() to remove whitespace and specific symbols from both ends of each string. For the genres column, you can first normalize the separators with .str.replace_all() and a regular expression, then call .str.split() to turn each string into a list of genres.

Here is a script that demonstrates these techniques:


import polars as pl

df = pl.DataFrame({
    "name": ["  Alice! ", "Bob.", "  Carol  ", "David-"],
    "genres": ["Drama / Comedy", "Action/Thriller", "Sci-Fi / Adventure", "Romance"]
})

# Strip whitespace and symbols from 'name'
cleaned_df = df.with_columns([
    pl.col("name").str.strip_chars().str.strip_chars("!.-").alias("name_clean")
])

# Normalize the slash separators with a regex, then split genres into a list
cleaned_df = cleaned_df.with_columns([
    pl.col("genres").str.replace_all(r"\s*/\s*", ",").str.split(",").alias("genres_list")
])

print(cleaned_df)


Section 3. Chapter 1
