How to Fix Garbled Text (Mojibake) When Scraping with Python

When you scrape the web with Python, the text you retrieve sometimes comes out garbled. This article explains why that happens during scraping and how to fix it.

Causes of garbled text
Fixes for garbled text
Summary

Causes of garbled text

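The response is decoded with the wrong encoding
– requests chooses the encoding for response.text from the Content-Type header. If that header has no charset, requests falls back to ISO-8859-1 for text responses, so a page that is actually UTF-8 or Shift_JIS comes out garbled.
The site uses an encoding other than UTF-8, such as Shift_JIS or EUC-JP
– Decoding those bytes as UTF-8 does not match the actual encoding, so the text is garbled.
A local HTML file is read with the wrong encoding
– If encoding is not set correctly when calling open(), the file is not decoded correctly.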
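To make the causes above concrete, here is a minimal sketch that reproduces the mismatch without any network access. The sample string and the variable names are illustrative assumptions, not taken from a real site.

```python
# Illustrative sample text (an assumption, not from a real page).
original = "日本語のテキスト"

# The bytes a server might send for a Shift_JIS page.
raw = original.encode("shift_jis")

# Decoding with a codec that does not match can raise an error...
try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# ...or, with ISO-8859-1 (which maps every byte to a character),
# succeed silently and produce garbled text.
garbled = raw.decode("iso-8859-1")

# Decoding with the codec that matches the bytes restores the text.
restored = raw.decode("shift_jis")

print(utf8_failed)
print(garbled != original)
print(restored == original)
```

The ISO-8859-1 case is the one that tends to go unnoticed, because no exception is raised.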

文字化けの対処法

1. Set response.encoding appropriately

Use response.apparent_encoding from the requests library to detect a suitable encoding automatically.

import requests
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

# Set the encoding to the one guessed from the response body
response.encoding = response.apparent_encoding

# Parse with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Get the page text
print(soup.get_text(separator="\n", strip=True))

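response.apparent_encoding is a guess made from the bytes of the response body, not the encoding declared by the server. Assigning it to response.encoding makes response.text decode with that guess, which usually fixes pages whose headers carry a missing or wrong charset.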
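Overwriting response.encoding unconditionally also discards a charset that the server declared correctly. As one possible refinement, the sketch below prefers the declared charset and only uses the guess when the declared value is missing or is the ISO-8859-1 fallback. The helper choose_encoding is hypothetical (it is not part of requests), and treating ISO-8859-1 as "unknown" is an assumption that would be wrong for a page that really is Latin-1.

```python
def choose_encoding(header_encoding, apparent_encoding):
    """Pick an encoding for response.text (hypothetical helper)."""
    # requests falls back to ISO-8859-1 for text responses whose
    # Content-Type header has no charset, so treat that value (and None)
    # as "unknown" and prefer the guess made from the body.
    if header_encoding is None or header_encoding.lower() == "iso-8859-1":
        # If there is no guess either, assume UTF-8 (an assumption).
        return apparent_encoding or "utf-8"
    # Otherwise trust the charset the server declared.
    return header_encoding


print(choose_encoding("ISO-8859-1", "SHIFT_JIS"))
print(choose_encoding("utf-8", "Windows-1252"))
```

With a real response, this would be used as response.encoding = choose_encoding(response.encoding, response.apparent_encoding) before reading response.text.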

2. Detect the encoding with chardet

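If you want to run the detection yourself and inspect the result (chardet.detect() also returns a confidence score), you can call chardet directly.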

Installing chardet

pip install chardet

Detecting the encoding with chardet

import requests
import chardet
from bs4 import BeautifulSoup

url = "https://example.com"
response = requests.get(url)

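# Guess the encoding from the raw bytes
detected = chardet.detect(response.content)
# chardet may return None as the encoding; fall back to UTF-8 in that case
encoding = detected["encoding"] or "utf-8"

# Decode with the detected encoding
text = response.content.decode(encoding, errors="replace")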

soup = BeautifulSoup(text, "html.parser")

# Get the page text
print(soup.get_text(separator="\n", strip=True))

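With errors="replace", bytes that cannot be decoded are replaced with the Unicode replacement character (U+FFFD) instead of raising UnicodeDecodeError, so processing can continue even if part of the page is invalid.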
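The difference between the default error handling and errors="replace" can be seen with a small sketch. The byte string below is a made-up example: valid UTF-8 followed by one stray byte.

```python
# Valid UTF-8 for "abc" followed by one byte that is not valid UTF-8.
raw = "abc".encode("utf-8") + b"\xff"

# The default (errors="strict") raises on the invalid byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# errors="replace" substitutes U+FFFD for the invalid byte instead.
replaced = raw.decode("utf-8", errors="replace")

print(strict_ok)       # False
print(repr(replaced))  # 'abc\ufffd'
```

Note that the replaced bytes are lost, so errors="replace" hides a wrong encoding guess rather than correcting it.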

3. Specify the encoding when opening a local HTML file

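When opening a local HTML file, pass the file's actual encoding explicitly, for example encoding="utf-8" for a UTF-8 file. Without it, open() uses the platform default, which may not match the file.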

from bs4 import BeautifulSoup

with open("sample.html", "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

print(soup.get_text(separator="\n", strip=True))

If the file's encoding is unknown, you can estimate it with chardet. The result is a guess, so it can be wrong, especially for short files.

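import chardet
from bs4 import BeautifulSoup

# Read the raw bytes and guess the encoding
with open("sample.html", "rb") as file:
    raw_data = file.read()
    detected = chardet.detect(raw_data)
    # chardet may return None as the encoding; fall back to UTF-8 in that case
    encoding = detected["encoding"] or "utf-8"

# Open the file with the detected encoding
with open("sample.html", "r", encoding=encoding, errors="replace") as file: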
    soup = BeautifulSoup(file, "html.parser")

print(soup.get_text(separator="\n", strip=True))
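When detection returns nothing usable, one alternative (a different technique from chardet, shown here only as a sketch) is to try a short list of likely encodings in order. The helper decode_with_fallback and the candidate list are assumptions for illustration. The order matters, because a wrong candidate can sometimes decode without error and still produce garbled text.

```python
def decode_with_fallback(raw, candidates=("utf-8", "cp932", "euc_jp")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Hypothetical helper: the candidate list is an assumption aimed at
    Japanese pages and should be adjusted for other sites.
    """
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # No candidate matched: decode with the first one and accept
    # replacement characters rather than failing.
    return raw.decode(candidates[0], errors="replace"), candidates[0]


# Illustrative input: bytes of a made-up string encoded as cp932.
text, used = decode_with_fallback("日本語のテキスト".encode("cp932"))
print(used)
print(text)
```

UTF-8 is tried first because its decoder is strict, so bytes in another encoding rarely pass it by accident.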