Web Scraping with Python
Advanced Search
Many HTML tags depend on a particular attribute: an anchor tag (<a>) needs href to act as a link, and <img> requires src. If you are interested in a specific attribute, call the .get() method on the tag's .attrs dictionary; it returns None when the attribute is missing. For example, let's retrieve the src attribute of every <img> element.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img'):
    print(img.attrs.get('src'))
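To see why .get() is a convenient choice here, the sketch below compares it with direct indexing. It uses a small inline HTML string (a made-up stand-in for the course page, so no network access is needed) in which one <img> has no src attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used instead of the course page
html = """
<img src="cat.png" alt="A cat">
<img alt="Image without a source">
"""

soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img")

# .attrs is a dictionary, so .get() returns None for a missing attribute
sources = [img.attrs.get("src") for img in images]
print(sources)  # ['cat.png', None]

# Tag.get() is a shortcut for .attrs.get() and also accepts a default value
with_default = [img.get("src", "no-src") for img in images]
print(with_default)  # ['cat.png', 'no-src']

# Indexing the tag directly raises KeyError when the attribute is absent
try:
    images[1]["src"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)  # True
```

On real pages, where attributes may be absent, .get() avoids wrapping every lookup in a try/except block.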
You will also often encounter the id attribute, which is meant to identify a single element within a page. If you are interested in specific attribute values, you can pass them as a dictionary (in the format {attr_name: attr_value}) as the second argument of .find_all(), immediately after the tag you are searching for. For example, suppose we want only the <div> elements whose class attribute is set to 'box', or the <p> element whose id attribute has the value "id2".
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

# Filtering by class attribute value
for div in soup.find_all("div", {"class": "box"}):
    print(div)

# Filtering by id attribute value
print(soup.find("p", {"id": "id2"}))
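The same filters can be tried offline. The following is a minimal sketch on made-up inline HTML (the markup and texts are assumptions, not the contents of the course page). It also shows the keyword-argument form that BeautifulSoup accepts as an alternative to the dictionary; class_ carries a trailing underscore because class is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used instead of the course page
html = """
<div class="box">First box</div>
<div class="note">A note</div>
<div class="box">Second box</div>
<p id="id1">First paragraph</p>
<p id="id2">Second paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")

# Dictionary form: {attr_name: attr_value} right after the tag name
boxes_dict = soup.find_all("div", {"class": "box"})
box_texts = [div.get_text() for div in boxes_dict]
print(box_texts)  # ['First box', 'Second box']

# Keyword form: equivalent filter for the class attribute
boxes_kw = soup.find_all("div", class_="box")

# .find() returns the first match, or None when nothing matches
second_p = soup.find("p", id="id2")
missing = soup.find("p", {"id": "id3"})
print(second_p.get_text())  # Second paragraph
print(missing)              # None
```

Because .find() returns None when nothing matches, it is worth checking the result before calling methods on it.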
We used the .find() method (instead of .find_all()) to retrieve the element with a specific id because an id is meant to be a unique identifier, so a well-formed page has at most one element with a given value. To check that the class filter returned only the intended <div> elements, let's print the class attribute of every <div> element on the page.
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for div in soup.find_all('div'):
    print(div.attrs.get('class'))
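When reading this output, it helps to know that BeautifulSoup treats class as a multi-valued attribute: with html.parser, its value is a list of class names, and a filter such as {"class": "box"} matches any element that has box among its classes. The sketch below illustrates this on made-up inline HTML (an assumption for illustration, not the course page).

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used instead of the course page
html = """
<div class="box">Plain box</div>
<div class="box highlighted">Box with two classes</div>
<div>No class at all</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class is returned as a list; a missing class attribute gives None
classes = [div.attrs.get("class") for div in soup.find_all("div")]
print(classes)  # [['box'], ['box', 'highlighted'], None]

# The filter matches every <div> that has 'box' among its classes
matched = [div.get_text() for div in soup.find_all("div", {"class": "box"})]
print(matched)  # ['Plain box', 'Box with two classes']

# To keep only elements whose class list is exactly ['box'], compare the list
exact = [
    div.get_text()
    for div in soup.find_all("div")
    if div.get("class") == ["box"]
]
print(exact)  # ['Plain box']
```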