How to Extract Text from a Web Page with Python

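When you need to extract specific text from a website, Python's BeautifulSoup library makes the job straightforward. This article covers the following approaches:

Selecting an element by id
Selecting an element by class
Selecting elements with a CSS selector
Getting the text of the whole page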

Installing the Required Libraries

First, install the required libraries with the following command.

pip install beautifulsoup4 requests

Fetching the Web Page's HTML

Fetch the HTML of the target web page and parse it with BeautifulSoup.

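import requests
from bs4 import BeautifulSoup

# URL of the web page to fetch
url = "https://example.com"

# Fetch the page's HTML (requests has no default timeout, so set one explicitly)
response = requests.get(url, timeout=10)
response.raise_for_status()  # Raise an exception on a 4xx/5xx response

# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")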
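
As a slightly more defensive variant of the fetch step above, the sketch below wraps it in a function. The name fetch_soup and the 10-second default timeout are assumptions made for illustration and do not come from the original example. It passes response.content (raw bytes) instead of response.text so that Beautiful Soup can choose the encoding itself, which may help on pages whose Content-Type header omits or misstates the charset.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Download a page and return it parsed (hypothetical helper)."""
    # Without a timeout, requests can wait indefinitely on a stalled server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    # Raw bytes let Beautiful Soup detect the encoding, for example
    # from a <meta charset> declaration inside the page.
    return BeautifulSoup(response.content, "html.parser")
```

Calling fetch_soup("https://example.com") would return a soup object that can be used with the find(), select(), and get_text() calls shown in this article.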

1. Getting the Text of a Specific Element by id

# Find the <div id="blockb"> element; find() returns None if nothing matches
blockb = soup.find("div", id="blockb")
if blockb is not None:
    print(blockb.get_text(strip=True))
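
To make the behavior of the id lookup concrete without touching the network, here is a minimal sketch that runs on an inline HTML string. The sample markup and the id "does-not-exist" are made up for illustration. It also shows a detail that is easy to miss: with strip=True and no separator, text from nested tags is joined without any space.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div id="blocka">First block</div>
<div id="blockb">
  <p>Second <b>block</b></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A matching id returns a Tag object.
found = soup.find("div", id="blockb")

# strip=True trims each text fragment, then joins them with the separator.
# The default separator is an empty string, so the words run together.
concatenated = found.get_text(strip=True)
spaced = found.get_text(" ", strip=True)

# A missing id returns None, so calling get_text() on it would raise an error.
missing = soup.find("div", id="does-not-exist")

print(concatenated)  # Secondblock
print(spaced)        # Second block
print(missing)       # None
```

Passing a separator such as " " or "\n" is usually the safer choice when the element contains inline tags like b or a.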

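2. Getting the Text of a Specific Element by class

# Find the first <div class="content"> element
# (the keyword is class_ because class is a reserved word in Python)
content = soup.find("div", class_="content")
if content is not None:
    print(content.get_text(strip=True))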
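
Since find() returns only the first match, a page with several elements of the same class needs find_all() instead. The following sketch, which uses invented sample markup, collects the text of every matching div. It also illustrates that class_ matches an element when the given name is any one of its classes.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div class="content">Intro</div>
<div class="content highlight">Featured</div>
<div class="sidebar">Links</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every match; the second div matches too,
# because "content" is one of its two classes.
texts = [
    div.get_text(strip=True)
    for div in soup.find_all("div", class_="content")
]

print(texts)  # ['Intro', 'Featured']
```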

3. Selecting Elements with a CSS Selector

# select_one() returns the first element matching the selector (here, div.content)
content = soup.select_one("div.content")
if content is not None:
    print(content.get_text(strip=True))

# select() returns a list of every match (here, all p tags)
paragraphs = soup.select("p")
for p in paragraphs:
    print(p.get_text(strip=True))
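
CSS selectors become more useful than find() once the target depends on structure or attributes. The sketch below, run against invented sample markup, contrasts the child combinator (>) with the descendant combinator (a space) and uses an attribute selector to pick only links that carry an href.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<div class="content">
  <p>Direct child</p>
  <section><p>Nested</p></section>
  <a href="/about">About</a>
  <a>No link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# "div.content > p" matches only p tags that are direct children of the div.
direct = [p.get_text(strip=True) for p in soup.select("div.content > p")]

# "div.content p" matches p tags at any depth inside the div.
nested = [p.get_text(strip=True) for p in soup.select("div.content p")]

# "a[href]" matches only anchors that have an href attribute.
links = [a["href"] for a in soup.select("a[href]")]

print(direct)  # ['Direct child']
print(nested)  # ['Direct child', 'Nested']
print(links)   # ['/about']
```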

4. Getting the Text of the Whole Page

# Get all text on the page, one text fragment per line
full_text = soup.get_text(separator="\n", strip=True)
print(full_text)
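
Whole-page text often contains navigation menus, footers, and similar noise, and depending on the Beautiful Soup version it may also include the contents of script and style tags. One possible approach, sketched below on invented sample markup, is to remove the unwanted elements with decompose() before calling get_text(). The list of tag names to drop is an assumption and would need adjusting for a real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<html><head><title>Sample</title><style>p { color: red; }</style></head>
<body>
  <nav>Home | About</nav>
  <p>Main article text.</p>
  <script>console.log("tracking");</script>
  <footer>Copyright notice</footer>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Calling soup(...) is shorthand for soup.find_all(...).
# decompose() removes each tag and its contents from the tree.
for tag in soup(["script", "style", "nav", "footer"]):
    tag.decompose()

text = soup.get_text(separator="\n", strip=True)
print(text)
```

Note that decompose() modifies the soup in place, so parse the HTML again if the removed elements are needed later.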

Reading from a Local HTML File

To parse a local HTML file instead of a live website, write the following.

with open("sample.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# Get a specific element (here, the div with id="main")
main = soup.find("div", id="main")
if main is not None:
    print(main.get_text(strip=True))
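
The example above assumes the file is UTF-8. When the encoding of a saved page is not known in advance, one option is to open the file in binary mode and let Beautiful Soup work out the encoding from the bytes. The sketch below shows this with a hypothetical helper named soup_from_file and a throwaway file created only so that the example is self-contained.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def soup_from_file(path: Path) -> BeautifulSoup:
    """Parse a local HTML file (hypothetical helper)."""
    # Binary mode hands raw bytes to Beautiful Soup, which then
    # detects the encoding instead of relying on a hard-coded one.
    with path.open("rb") as file:
        return BeautifulSoup(file, "html.parser")


# Create a temporary sample file purely for demonstration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.html"
    sample.write_text(
        '<html><head><meta charset="utf-8"></head>'
        '<body><div id="main">Local content</div></body></html>',
        encoding="utf-8",
    )
    soup = soup_from_file(sample)

main = soup.find("div", id="main")
main_text = main.get_text(strip=True) if main is not None else None
print(main_text)  # Local content
```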