Progress - okuiyusaku's Blog

12/23 21:30~23:30

I wrote code to web-scrape the latest news from YOMIURI ONLINE.


import csv

import requests
from bs4 import BeautifulSoup

# Fetch the latest-news page and parse it.
result = requests.get('https://www.yomiuri.co.jp/latestnews/')
result.raise_for_status()
soup = BeautifulSoup(result.text, 'html.parser')

# utf_8_sig writes a BOM, which helps Excel detect UTF-8.
with open('yomiuri_news.csv', 'w', encoding='utf_8_sig', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerow(['headline', 'datetime'])

    for headline_html in soup.find_all(class_='headline'):
        # Assumption: the date (class "update") is nested inside each headline element.
        # Read it per headline, then remove it so it does not leak into the headline text.
        update = headline_html.find(class_='update')
        datetime_text = ''
        if update is not None:
            datetime_text = update.get_text().strip()
            update.decompose()
        headline_text = headline_html.get_text().strip()

        writer.writerow([headline_text, datetime_text])

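I tried the .string attribute, but it was tricky and I could not get it to work.
Instead, I take the elements one at a time from the list returned by the find_all method and call the get_text method on each.
I think this code is really messy.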
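
A note on the .string problem: in Beautiful Soup, .string returns None when a tag has more than one child, which may be why it did not work here. Below is a minimal sketch of the difference, assuming the date element is nested inside the headline element. The HTML is made up for illustration and is not the actual Yomiuri markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; the real page structure may differ.
html = (
    '<li class="headline">'
    '<a href="#">Sample headline</a>'
    '<span class="update">12/23 21:00</span>'
    '</li>'
)
soup = BeautifulSoup(html, 'html.parser')
item = soup.find(class_='headline')

# .string is None here because the tag has two children (<a> and <span>).
string_value = item.string

# Read the nested date first, remove it, then read what is left as the headline.
update = item.find(class_='update')
datetime_text = update.get_text().strip()
update.decompose()
headline_text = item.get_text().strip()

print(string_value, headline_text, datetime_text)
```

get_text joins the text of all descendants, so it works no matter how many children the tag has, as long as the unwanted parts are removed first.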