Text Cleaning
When working with real-world data, text columns often contain unwanted whitespace, extra symbols, or inconsistent formatting. In Polars, you can use the .str namespace to efficiently clean and transform these columns. Suppose you have a DataFrame with a name column that sometimes includes leading or trailing whitespace and punctuation, and a genres column where genres are stored as a single string separated by slashes, like "Drama / Comedy".
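To clean the name column, you can use .str.strip_chars(). Called with no argument, it removes whitespace from both ends of each string; called with a string of characters, it removes any of those characters from both ends instead. For the genres column, note that .str.split() splits on a literal separator, not a regular expression. You can first normalize the separators with .str.replace_all(), which does accept a regular expression, and then split the result into a list of genres.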
Here is a script that demonstrates these techniques:
```python
import polars as pl

df = pl.DataFrame({
    "name": [" Alice! ", "Bob.", " Carol ", "David-"],
    "genres": ["Drama / Comedy", "Action/Thriller", "Sci-Fi / Adventure", "Romance"]
})

# Strip whitespace and symbols from 'name'
cleaned_df = df.with_columns([
    pl.col("name").str.strip_chars().str.strip_chars("!.-").alias("name_clean")
])

# Use regex to split genres into a list
cleaned_df = cleaned_df.with_columns([
    pl.col("genres").str.replace_all(r"\s*/\s*", ",").str.split(",").alias("genres_list")
])

print(cleaned_df)
```