Web Scraping with Python
What is Beautiful Soup?

Beautiful Soup is a Python library that offers extensive functionality for parsing HTML pages. In the previous section, you worked with HTML as a plain string, which made it hard to reach individual tags and their contents.

To set up Beautiful Soup, follow these two steps:

  • Install the package by running this command in your terminal or command prompt: pip install beautifulsoup4

  • Import the BeautifulSoup class from the bs4 package: from bs4 import BeautifulSoup

# Importing the library
from bs4 import BeautifulSoup
print(BeautifulSoup)
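If the import succeeds, the installation worked. As a small optional check, the sketch below also prints the installed version; it assumes the package exposes the version as bs4.__version__, which current releases do.

```python
# Optional check that the package is installed and importable
import bs4
from bs4 import BeautifulSoup

# Version string of the installed beautifulsoup4 package
print(bs4.__version__)
```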

Beautiful Soup only parses HTML; it does not download pages from URLs. However, you already know how to do that using urlopen from urllib.request. To start parsing, pass two arguments to the BeautifulSoup constructor: the first is the HTML markup (a string or an open file object), and the second is the name of the parser (we will use Python's built-in html.parser). The call returns a BeautifulSoup object. For example, let's open and read a web page.

# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(type(soup))
print(soup)
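To try the constructor without a network connection, the same two arguments can be applied to an inline string. This is only a minimal sketch: the sample markup and the names sample_html and sample_soup are made up for illustration and do not come from the course page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used here instead of a downloaded page
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, <b>soup</b>!</p></body></html>"
)

# First argument: the markup; second argument: the parser name
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup))         # the BeautifulSoup class from the bs4 package
print(sample_soup.title.string)  # text inside the <title> tag
print(sample_soup.p.get_text())  # text of the first <p>, nested tags included
```

Unlike a plain string, the resulting object lets you reach a tag by name instead of searching for substrings.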

The first method we will explore is .prettify(), which returns the parsed document as a formatted string, with each tag on its own line and indented according to its nesting level.

# Importing libraries
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Reading web page
url = "https://codefinity-content-media.s3.eu-west-1.amazonaws.com/18a4e428-1a0f-44c2-a8ad-244cd9c7985e/jesus.html"
page = urlopen(url)
html = page.read().decode("utf-8")

# Reading HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())
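The effect of .prettify() is easier to see on a short snippet written on a single line. The sketch below uses a made-up fragment (compact_html is a hypothetical name, not part of the course page) and compares the number of lines before and after formatting.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line fragment with two nested paragraphs
compact_html = "<div><p>First</p><p>Second</p></div>"
compact_soup = BeautifulSoup(compact_html, "html.parser")

# prettify() returns a new string; the soup object itself is unchanged
pretty = compact_soup.prettify()

print(len(compact_html.splitlines()))  # the input is a single line
print(len(pretty.splitlines()))        # the formatted output spans several lines
print(pretty)
```

Because the result is an ordinary string, it is meant for reading and debugging; parsing work is still done on the BeautifulSoup object.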

Section 2. Chapter 1