Word Count

Now we would like to explore the most frequent words in our DataFrame. To do this, we will create a collection that stores the most frequent words and then visualize it.

Methods description

from collections import Counter; import nltk: Imports the Counter class from the collections module and the nltk library;
from nltk.corpus import stopwords: Imports the stopwords corpus from NLTK;
nltk.download("stopwords"): Downloads the stopwords dataset from NLTK;
def remove_stopword(x): Defines a function named remove_stopword that takes a list x as input and returns a new list with stopwords removed;
return [y for y in x if y not in stopwords.words("english")]: This list comprehension keeps only the items of x that are not in NLTK's list of English stopwords;
Counter: A class from the collections module used to count occurrences of elements in a list or iterable;
stopwords.words("english"): A method from NLTK that returns a list of stopwords for the English language;
temp.most_common(25): Returns the 25 most common elements (words) and their counts from the Counter object temp;
temp.iloc[1:,:]: Selects every row of the DataFrame temp except the first one, keeping all columns;
temp.style.background_gradient(...): Returns a styled view of the DataFrame temp with a background color gradient applied to its cells.
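
To make the Counter and most_common behaviour described above more concrete, here is a minimal sketch on a toy token list. The sample sentences and the small TOY_STOPWORDS set are invented for illustration; they only stand in for the tweet texts and for NLTK's English stopword list.

```python
from collections import Counter

# Hypothetical stand-ins for the tweet texts and NLTK's stopword list.
sample_texts = ["the cat sat on the mat", "the dog chased the cat"]
TOY_STOPWORDS = {"the", "on"}

# Split each text into words, then drop the stopwords.
token_lists = [
    [word for word in text.split() if word not in TOY_STOPWORDS]
    for text in sample_texts
]

# Flatten the per-text lists and count every remaining word.
word_counts = Counter(word for tokens in token_lists for word in tokens)

# most_common(n) returns (word, count) pairs, highest count first.
top_two = word_counts.most_common(2)
print(top_two)
```

Each entry in the result is a (word, count) tuple, which is why the list can be passed directly to a DataFrame constructor to get one column of words and one column of counts.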

Task


Create a collection that counts word occurrences using the Counter class:

Remove stopwords from the tweet texts.
Create a Counter collection from the remaining words.
Create a DataFrame from the 25 most common words.
Set the background gradient colormap to "Blues".

Solution

import pandas as pd  # assumed to be imported already in earlier chapters
from collections import Counter

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

def remove_stopword(x):
    # Keep only the words that are not English stopwords
    return [y for y in x if y not in stopwords.words("english")]

# List of words in every row of the text column
data["temp_list1"] = data["text"].apply(lambda x: str(x).split())
# Remove stopwords from every list
data["temp_list1"] = data["temp_list1"].apply(remove_stopword)

# Count every word across all rows
top = Counter([item for sublist in data["temp_list1"] for item in sublist])

# Build a DataFrame from the 25 most common words, then drop the first row
temp = pd.DataFrame(top.most_common(25))
temp = temp.iloc[1:,:]
temp.columns = ["Common_words", "count"]
temp.style.background_gradient(cmap = "Blues")
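
The solution above calls stopwords.words("english") once for every word it checks, and each check scans a list. One possible alternative, sketched here under the assumption that the stopword list does not change between calls, is to build a set once and reuse it. The function name remove_stopwords_fast and the stop_set parameter are hypothetical and not part of the course solution; the small set in the usage lines stands in for set(stopwords.words("english")).

```python
def remove_stopwords_fast(words, stop_set):
    """Return the words that are not in stop_set (hypothetical helper)."""
    # Set membership is a hash lookup, so each check is O(1) on average.
    return [word for word in words if word not in stop_set]


# With NLTK this would be: stop_set = set(stopwords.words("english"))
stop_set = {"i", "am", "so", "to"}  # toy stand-in for illustration
filtered = remove_stopwords_fast("i am so happy to see you".split(), stop_set)
print(filtered)
```

NLTK's English stopwords are lowercase, so capitalized tokens such as "The" pass through either version unless the text is lowercased before splitting.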
