Scraping tweets from wapo gone awry
Two problems.
The goal is a 3-column csv with the column headers date, time, and tweet. My attempt to extract the span text/time from each li resulted in the span info being duplicated inside both the time and tweet columns. This is my first week using Python; I tried to .replace() the 'time' in the tweet column with "", but ended up deleting both instances of 'time'.
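On the .replace() issue: str.replace accepts an optional count argument that limits how many occurrences are replaced. A minimal sketch with a made-up sample string:

```python
# str.replace(old, new, count) limits how many occurrences are replaced.
s = "9:40 AM some tweet text 9:40 AM"

# Without count, every copy of the time string is removed:
print(s.replace("9:40 AM", "").strip())     # -> "some tweet text"

# With count=1, only the first (leading) copy is removed:
print(s.replace("9:40 AM", "", 1).strip())  # -> "some tweet text 9:40 AM"
```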
How do I combine these columns in order, or zip them together correctly? The code I wrote produces either 30,000 or 1,000 rows. The correct csv file should be around 520 rows.
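The row-count blowup comes from appending dates and time/tweet pairs as separate rows. Pairing the parallel lists with zip() before writing gives one row per tweet. A minimal sketch with made-up stand-in data:

```python
import csv
import io

# Hypothetical parallel lists, standing in for the scraped values.
dates = ["2017-01-20", "2017-01-21"]
times = ["9:40 AM", "7:10 AM"]
tweets = ["first tweet", "second tweet"]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["date", "time", "tweet"])   # one header row, three cells
writer.writerows(zip(dates, times, tweets))  # one csv row per tweet
print(buf.getvalue())
```

With two tweets this yields exactly three lines: the header plus one row per tweet, instead of one row per list.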
import bs4 as bs
import requests, urllib.request, csv
from urllib.request import urlopen
sauce = urllib.request.urlopen('https://www.washingtonpost.com/graphics/politics/100-days-of-trump-tweets/?utm_term=.0c2052f6d858').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')
lists = soup.find_all('li', class_='visible')
dates = soup.find_all("li", attrs={"data-date": True})
tweet_data = ['date, time, tweets']
for li in dates[1:]:
    date = li['data-date']
    tweet_data.append([date])

for list in lists[1:]:
    time = list.find_all('span', {"class": "gray"})[0].text
    tweets = list.text
    tweet_data.append([time, tweets])

with open('tweets_attempt_8.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(tweet_data)
Here is the code you need... hope you are satisfied with this answer.
import bs4 as bs
import urllib2,csv
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
url = 'https://www.washingtonpost.com/graphics/politics/100-days-of-trump-tweets/?utm_term=.0c2052f6d858'
sauce = urllib2.Request(url, headers={'User-Agent' : "Magic Browser"})
con = urllib2.urlopen(sauce)
data = con.read()
soup = bs.BeautifulSoup(data, 'html.parser')
lists = soup.find_all('li', class_='visible')
dates = soup.find_all("li", attrs={"data-date": True})
tweet_data = [['date', 'time', 'tweets']]
for li, item in zip(dates[1:], lists[1:]):
    date = li['data-date']
    time = item.find_all('span', {"class": "gray"})[0].text
    tweets = item.text
    tweet_data.append([date, time, tweets])

with open('/tmp/tweets_attempt_8.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(tweet_data)
Try this. There are 504 items in that page which you want to parse, and you will get all of them in the csv output.
import csv
import requests
from bs4 import BeautifulSoup

with open('tweets_attempt_8.csv', 'w', newline='', encoding='utf8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['date', 'time', 'tweets'])
    sauce = requests.get('https://www.washingtonpost.com/graphics/politics/100-days-of-trump-tweets/?utm_term=.0c2052f6d858', headers={"User-Agent": "Existed"}).text
    soup = BeautifulSoup(sauce, "html.parser")
    for item in soup.select("li.pg-excerpt.visible"):
        date = item.get('data-date')
        time = item.select("span.gray")[0].text
        title = item.text.strip()
        print(date, time, title[10:])
        writer.writerow([date, time, title[10:]])
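The select()/get() pattern used above can be tried on a small static snippet, independent of the live page. The HTML below is made up to mirror the page's structure:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking one tweet entry on the page.
html = """
<ul>
  <li class="pg-excerpt visible" data-date="2017-01-20">
    <span class="gray">12:00 PM</span> Example tweet text
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
for item in soup.select("li.pg-excerpt.visible"):
    # .get() reads an attribute; .select() matches CSS selectors.
    print(item.get("data-date"), item.select("span.gray")[0].text)
```

select("li.pg-excerpt.visible") only matches li elements carrying both classes, which is why it skips the page's other list items without any manual filtering.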