Web Scraping with Python

CSS Selectors in BeautifulSoup

To extract data with CSS Selectors you have already written, you can use the Selector class from the scrapy library. Here, however, we will work with CSS Selectors through the BeautifulSoup library from the previous section. To select data from the document, call the .select() method of an existing BeautifulSoup object:

css_locator = "html > body > div"
print(soup.select(css_locator))
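For comparison, here is a minimal sketch of the scrapy approach mentioned above. The HTML string is a made-up example, and the Selector is built directly from text rather than from a crawled response:

from scrapy.selector import Selector

# A tiny hand-written page, just to have something to select from
html = "<html><body><div>Hello</div></body></html>"
selector = Selector(text=html)

# The same CSS Selector as in the BeautifulSoup snippet above
print(selector.css("html > body > div").getall())  # ['<div>Hello</div>']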

We already know how to navigate through HTML files using attributes. However, we can also select all elements with a given class or id without specifying the tag's name or path. For example:

print(soup.select("#id-1"))
print(soup.select(".class-1"))

In the first line, we select all elements whose id equals id-1. In the second line, the CSS Selector matches all tags that belong to the class class-1.
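To see the difference between the two selectors, here is a small self-contained example; the HTML snippet and the id/class names are made up for illustration:

from bs4 import BeautifulSoup

html = """
<div id="id-1">First</div>
<p class="class-1">Second</p>
<span class="class-1">Third</span>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("#id-1"))    # [<div id="id-1">First</div>]
print(soup.select(".class-1")) # the <p> and the <span> elements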

You can also go through all elements of a class with a for loop:

for link in soup.select(".class-link > a"):
    page = urlopen(link["href"])
    html = page.read().decode("utf-8")
    new_soup = BeautifulSoup(html, "html.parser")

Here we go through all the links of the class class-link and create a BeautifulSoup object for each new page. Note that .select() returns Tag objects, so we pass each link's href attribute to urlopen().
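In practice, href attributes often hold relative URLs that urlopen() cannot fetch directly. Below is a sketch of a more defensive version of the same loop; it assumes soup already exists and that base_url holds the address of the page the links came from:

from urllib.parse import urljoin
from urllib.request import urlopen
from bs4 import BeautifulSoup

for link in soup.select(".class-link > a"):
    href = link.get("href")   # None if the <a> tag has no href attribute
    if href is None:
        continue
    page = urlopen(urljoin(base_url, href))  # resolve relative URLs
    new_soup = BeautifulSoup(page.read().decode("utf-8"), "html.parser")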

Keep in mind that instead of the urllib.request library, you can send a GET request (a request to view a webpage) to each page using the requests library, and read the page's HTML from the .content attribute of the response:

import requests

# Send a GET request and read the raw HTML from the response
page_response = requests.get(url)
page = page_response.content
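Putting it together, a requests-based version of the whole flow might look like the sketch below; url is assumed to point at the page you want to scrape:

import requests
from bs4 import BeautifulSoup

page_response = requests.get(url)
page_response.raise_for_status()  # stop early on HTTP errors (404, 500, ...)

# .content holds the raw bytes of the page; BeautifulSoup accepts them directly
soup = BeautifulSoup(page_response.content, "html.parser")
print(soup.title)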
Task


Go through all the links on the main webpage, get their HTML code, and print the title of each page. First, go through all a tags, saving them into a list; then go through the href attributes of the extracted tags to get the URLs of the pages.

  1. Import the library for opening URLs.
  2. Select all a tags using the method .select() with a CSS Selector as the parameter. Assign the result to the variable a_tags.
  3. Create the empty list links.
  4. Go through the list a_tags with a for loop and the variable a, extract each href attribute, and append it to the list links.
  5. While going through each webpage by its link, get the title tag using the .title attribute of the BeautifulSoup object and print it.

Solution

# Import libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Open the page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Create the BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# Select all a tags
a_tags = soup.select("a")

# Define and fill the list with links to the pages
links = []
for a in a_tags:
    links.append(a["href"])

# Go through all links
for link in links:
    # Open the page
    page = urlopen(link)
    html = page.read().decode("utf-8")

    # Create the BeautifulSoup object
    new_soup = BeautifulSoup(html, "html.parser")

    # Print the title tag of each page
    print(new_soup.title)


Starter code
# Import libraries
from bs4 import BeautifulSoup
from ___ import ___

# Open the page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/page.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Create the BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")

# Select all a tags
a_tags = ___

# Define and fill the list with links to the pages
___
___:
    ___.___(a["href"])

# Go through all links
for link in links:
    # Open the page
    page = urlopen(link)
    html = page.read().decode("utf-8")

    # Create the BeautifulSoup object
    new_soup = BeautifulSoup(html, "html.parser")

    # Print the title tag of each page
    ___
