Advanced Search | Beautiful Soup: Part II
Web Scraping with Python

Advanced Search

Some HTML tags are only useful with particular attributes: an anchor tag <a> normally carries href, and <img> requires src. To read a specific attribute, call .get() on the tag's .attrs dictionary. For example, let's retrieve the src attribute of every <img> element.

```python
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for img in soup.find_all('img'):
    print(img.attrs.get('src'))
```
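As a small offline sketch of the same idea, the snippet below compares three ways of reading an attribute. The inline HTML and its file names are made up for illustration and are not taken from the course page; the point is only to show what happens when an attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, used so the example runs without a network request
html = """
<img src="cat.png" alt="A cat">
<img alt="Image without a source">
"""

soup = BeautifulSoup(html, "html.parser")
images = soup.find_all("img")

# .attrs is a plain dictionary, so .get() returns None for a missing key
sources = [img.attrs.get("src") for img in images]
print(sources)

# Calling .get() directly on the tag does the same lookup and accepts a default
sources_with_default = [img.get("src", "no src") for img in images]
print(sources_with_default)

# Square-bracket access raises KeyError when the attribute is absent
try:
    images[1]["src"]
    missing_raises = False
except KeyError:
    missing_raises = True
print(missing_raises)
```

Using .get() is usually the safer choice when scraping, since real pages often contain tags that lack the attribute you expect.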

You will also encounter the id attribute, which is very common and identifies a single element among others with the same tag. To filter by attribute value, pass a dictionary of the form {attr_name: attr_value} as the second argument to .find_all(), right after the tag name. For example, we can select only the <div> elements whose class attribute is 'box', or find the <p> element whose id is "id2".

```python
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for div in soup.find_all("div", {"class": "box"}):
    print(div)

# Filtering by id attribute value
print(soup.find("p", {"id": "id2"}))
```
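The dictionary filter also has a keyword-argument form in Beautiful Soup: class_ (with a trailing underscore, because class is a reserved word in Python) and id. The sketch below, run on a made-up inline document rather than the course page, suggests that both forms select the same elements.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, chosen only to illustrate the filters
html = """
<div class="box"><p id="id1">First</p></div>
<div class="box wide"><p id="id2">Second</p></div>
<div class="footer"><p>Third</p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as used in the lesson
boxes_dict = soup.find_all("div", {"class": "box"})
p_dict = soup.find("p", {"id": "id2"})

# Equivalent keyword form
boxes_kw = soup.find_all("div", class_="box")
p_kw = soup.find("p", id="id2")

print(len(boxes_dict), len(boxes_kw))
print(p_dict.get_text(), p_kw.get_text())
```

Note that a class filter matches any element that has the given class among its classes, so the <div> with class "box wide" is selected as well.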

We used .find() instead of .find_all() to retrieve the element by id because an id is meant to be unique within a page, so at most one element should match. To check that the class filter returned only the intended <div> elements, let's print the class attribute of every <div>.

```python
# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

for div in soup.find_all('div'):
    print(div.attrs.get('class'))
```
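Both points above, that .find() returns a single element and that the class filter should match only the intended <div> elements, can be checked on a small document. The HTML here is a hypothetical stand-in for the course page; the sketch also shows that Beautiful Soup returns the class attribute as a list, because an element may have several classes.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML, not the page used in the lesson
html = """
<div class="box">One</div>
<div class="box highlighted">Two</div>
<div class="footer">Three</div>
<div>Four</div>
<p id="id2">Paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")

# class is multi-valued, so it comes back as a list (None if the attribute is absent)
all_classes = [div.get("class") for div in soup.find_all("div")]
print(all_classes)

# Every <div> returned by the filter should carry "box" among its classes
filtered = soup.find_all("div", {"class": "box"})
only_boxes = all("box" in div.get("class", []) for div in filtered)
print(len(filtered), only_boxes)

# .find() returns one Tag, or None when nothing matches
found = soup.find("p", {"id": "id2"})
missing = soup.find("p", {"id": "no-such-id"})
print(found.get_text(), missing)
```

Because .find() can return None, it is worth checking the result before calling methods on it when the element might not exist.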

Section 3. Chapter 5