Learn Scraping Search Engine Results Pages (SERPs) | Automating SEO Tasks with Python
Python for SEO Specialists

Scraping Search Engine Results Pages (SERPs)

Understanding how to extract data from search engine results pages (SERPs) is a powerful skill for any SEO specialist. SERPs are the pages displayed by search engines in response to a user's query, and they contain a wealth of information: page titles, URLs, snippets, and more. Scraping SERPs allows you to gather this data in bulk, enabling you to analyze competitors, track keyword rankings, and uncover new optimization opportunities. By automating this process with Python, you can save time and gain deeper insights into the search landscape.

```python
from bs4 import BeautifulSoup

# Hardcoded HTML snippet representing a SERP
serp_html = """
<html>
  <body>
    <div class="result">
      <h3><a href="https://example.com/page1">First Result Title</a></h3>
      <span class="snippet">This is a summary of the first result.</span>
    </div>
    <div class="result">
      <h3><a href="https://example.com/page2">Second Result Title</a></h3>
      <span class="snippet">This is a summary of the second result.</span>
    </div>
  </body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(serp_html, "html.parser")

# Extract titles and URLs from each result block
for result in soup.find_all("div", class_="result"):
    link = result.find("a")
    title = link.get_text()
    url = link["href"]
    print(f"Title: {title}")
    print(f"URL: {url}")
```

The scraping process involves several clear steps. First, you obtain the HTML content of the page—this can come from a file, a web request, or a hardcoded string. Next, you parse the HTML using a library that understands its structure. Once parsed, you can search for specific elements, such as the div tags with a class of "result" that contain each search result. Within each result, you look for the a tag to find the title and URL. By extracting the text from the a tag, you get the result's title, and by accessing its href attribute, you obtain the link. This method allows you to systematically collect structured data from the unstructured HTML of a SERP.
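In practice, the first step usually means fetching the HTML with a live HTTP request rather than using a hardcoded string. Here is a minimal sketch of that step, assuming the target site permits automated access (real search engines often block or rate-limit scrapers, so check the terms of service); the URL in the usage comment and the `User-Agent` string are placeholders for illustration:

```python
import requests

def fetch_serp_html(url, timeout=10):
    """Download a results page and return its HTML as a string."""
    # Many sites reject requests that carry no User-Agent header.
    headers = {"User-Agent": "Mozilla/5.0 (compatible; seo-script/0.1)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # fail fast on 4xx/5xx status codes
    return response.text

# Usage (not run here, to avoid hitting a live site):
# html = fetch_serp_html("https://example.com/search?q=python+seo")
```

The returned string can then be passed straight to BeautifulSoup, exactly as the hardcoded `serp_html` is above.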

```python
from bs4 import BeautifulSoup

def extract_titles_and_urls(html):
    """Return a list of (title, url) tuples from SERP-style HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for result in soup.find_all("div", class_="result"):
        link = result.find("a")
        title = link.get_text()
        url = link["href"]
        results.append((title, url))
    return results

# Example usage:
serp_html = """
<html>
  <body>
    <div class="result">
      <h3><a href="https://example.com/page1">First Result Title</a></h3>
      <span class="snippet">This is a summary of the first result.</span>
    </div>
    <div class="result">
      <h3><a href="https://example.com/page2">Second Result Title</a></h3>
      <span class="snippet">This is a summary of the second result.</span>
    </div>
  </body>
</html>
"""
print(extract_titles_and_urls(serp_html))
```
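The same pattern extends to the snippet text: each result's summary sits in a `span` with class `snippet`, so one extra `find` call per result captures it. A sketch under that assumption about the markup (the empty-string fallback covers result blocks that happen to lack a snippet):

```python
from bs4 import BeautifulSoup

def extract_results(html):
    """Return a list of (title, url, snippet) tuples from SERP-style HTML."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for result in soup.find_all("div", class_="result"):
        link = result.find("a")
        snippet = result.find("span", class_="snippet")
        results.append((
            link.get_text(),
            link["href"],
            # Not every result block is guaranteed to contain a snippet.
            snippet.get_text() if snippet else "",
        ))
    return results

serp_html = """
<div class="result">
  <h3><a href="https://example.com/page1">First Result Title</a></h3>
  <span class="snippet">This is a summary of the first result.</span>
</div>
"""
print(extract_results(serp_html))
# [('First Result Title', 'https://example.com/page1', 'This is a summary of the first result.')]
```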

1. What Python library is commonly used for parsing HTML?

2. What information can you extract from a SERP for SEO analysis?

3. Fill in the blank: To find all 'a' tags in BeautifulSoup, use ____.



Section 1. Chapter 2

