繁体   English   中英

使用抓取的数据创建数据框

[英]Creating a dataframe with scraped data

我正在尝试按如下方式抓取数据:

import requests
from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

daterange = pd.date_range('02-25-2015', '09-16-2020', freq='D')

def main(req, date):
    r = req.get(f"https://it.sputniknews.com/politica/{date.strftime('%Y%m%d')}")
    print(r, r.content)
    soup = BeautifulSoup(r.content, 'html.parser')
    tag=None
    print (soup.select("b-plainlist"))
    #for tag in soup.select(".b-plainlist "):
        #print(tag.select_one(".b-plainlist__date").text)
        #print(tag.select_one(".b-plainlist__title").text)
        #print(tag.find_next(class_="b-plainlist__announce").text.strip())
    
    return tag.select_one(".b-plainlist__date").text, tag.select_one(".b-plainlist__title").text, tag.find_next(class_="b-plainlist__announce").text.strip()


with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as req:
        fs = [executor.submit(main, req, date) for date in daterange]
        allin = []
        for f in fs:
            allin.append(f.result())
        df = pd.DataFrame.from_records(
            allin, columns=["Date", "Title", "Content"])
        print(df)

试图通过时间创建一个带有日期、标题和内容的数据框。

这段代码应该没问题,但我无法创建一个“干净”的数据框,所以我认为标签有问题。 你能看看吗? 谢谢

我无法按原样运行您的代码,但要获得“干净”的数据框,您需要注意每个新闻项目都包含在.b-plainlist__item元素中,因此您必须全选,迭代它们并获得b-plainlist__date

以下对我有用:

import datetime

from bs4 import BeautifulSoup
from concurrent.futures import ThreadPoolExecutor
from dateutil import rrule
import pandas as pd
import requests


def get_items(session, date):
    url_template = 'https://it.sputniknews.com/politica/{}'
    r = session.get(url_template.format(date.strftime("%Y%m%d")))

    soup = BeautifulSoup(r.content, 'html.parser')
    items = soup.select(".b-plainlist__item")

    for item in items:
        date = datetime.datetime.strptime(
            item.select_one(".b-plainlist__date").text,
            '%H:%M %d.%m.%Y'
        )
        title = item.select_one(".b-plainlist__title").text
        content = item.select_one(".b-plainlist__announce").text

        yield {'Date': date, 'Title': title, 'Content': content}


daterange = rrule.rrule(
    freq=rrule.DAILY,
    dtstart=datetime.datetime(2015, 2, 25),
    until=datetime.datetime(2020, 9, 16),
)

with ThreadPoolExecutor(max_workers=30) as executor:
    with requests.Session() as session:
        fs = [
            executor.submit(get_items, session, date)
            for date in daterange
        ]

        result = []
        for f in fs:
            result.extend(f.result())

        df = pd.DataFrame.from_records(result, index='Date')

输出

df.head()

                                                                 Title                                            Content
Date                                                                                                                     
2015-02-25 18:21:00          Che cosa accadrà dopo la tregua di Minsk?  Nonostante il cessate il fuoco entrato nel vig...
2015-02-26 14:12:00       Renzi va a Mosca con un pacchetto misterioso  Il viaggio di Matteo Renzi a Mosca – previsto ...
2015-02-26 13:47:00       Turchia all’UE: non aspetteremo all’infinito  Lo scopo della Turchia è appunto quello di ade...
2015-02-26 11:24:00  Tutte le portaerei mediatiche sparano ad alzo ...  Riassumiamo le ultime vicende della crisi est-...
2015-02-26 11:01:00  Nel dopo Minsk gli ostacoli prevalgono sull'ot...  Nella ridda di intepretazioni dei risultati de...

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM