
Pandas to_csv only writes the data from a certain page

I tried to scrape data from TripAdvisor across several pages, but when I export the results to CSV, only one line of data is written, and I get an error message like this:

AttributeError: 'NoneType' object has no attribute 'text'
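This error means one of the `find(...)` calls found no matching element: BeautifulSoup's `find` returns `None` on a miss, and chaining `.text` onto `None` raises exactly this `AttributeError`. A minimal illustration, without any scraping involved:

```python
# find() returns None when no element matches its selector;
# None has no .text attribute, which produces the error above.
element = None  # stand-in for what soup.find(...) returns on a miss
try:
    element.text
except AttributeError as err:
    print(err)  # 'NoneType' object has no attribute 'text'
```

So any review block that lacks one of the four selectors (user, country, date, content) will crash the loop.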

This is my code:

import requests
import pandas as pd
from bs4 import BeautifulSoup

URL = 'https://www.tripadvisor.com/Attraction_Review-g469404-d3780963-Reviews-oa'

for offset in range(0, 30, 10):
    
    url = URL + str(offset) + '-Double_Six_Beach-Seminyak_Kuta_District_Bali.html'
    headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
    
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, "html.parser")    
    
    container = soup.find_all('div', {'class':'_2rspOqPP'})
    
    for r in container:
        reviews = r.find_all('div', {'class': None})

        # The container holding the elements I want has no class attribute,
        # so I first access the div with class _2rspOqPP, then the
        # attribute-less divs inside it.

        records = []
        for review in reviews:
            user = review.find('a', {'class':'_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
            country = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy'}).span.text
            date = review.find('div', {'class' : '_3JxPDYSx'}).text
            content = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text

            records.append((user, country, date, content))
            df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
            df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
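Note that in the code above, `records` is re-initialized for every container and `to_csv` is called inside the innermost loop, so `doublesix_.csv` is rewritten on every iteration and only the last write survives. A minimal sketch of that overwrite behaviour (the `demo.csv` filename is just for illustration):

```python
import pandas as pd

# Each to_csv call replaces the whole file, so after the loop
# only the header and the final row remain in demo.csv.
for i in range(3):
    df = pd.DataFrame([(i,)], columns=['n'])
    df.to_csv('demo.csv', index=False)

print(open('demo.csv').read())  # header plus the row from the last iteration
```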

Code updated:

for r in container:
    reviews = r.find_all('div', {'class': None})
    records = []
    for review in reviews:
        try:
            user = review.find('a', {'class':'_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
            country = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy'}).span.text
            date = review.find('div', {'class' : '_3JxPDYSx'}).text
            content = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
            
            records.append((user, country, date, content))
        except:
            pass
        

print(records)
df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')

You should move `records` out of the for loops and un-indent the last few lines.

See this:

import pandas as pd
import requests
from bs4 import BeautifulSoup

main_url = 'https://www.tripadvisor.com/Attraction_Review-g469404-d3780963-Reviews-oa'

country_class = "DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy"
records = []

for offset in range(0, 30, 10):
    url = main_url + str(offset) + '-Double_Six_Beach-Seminyak_Kuta_District_Bali.html'
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    }

    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    container = soup.find_all('div', {'class': '_2rspOqPP'})
    for r in container:
        reviews = r.find_all('div', {'class': None})
        for review in reviews:
            try:
                user = review.find('a', {'class': '_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
                country = review.find('div', {'class': country_class}).span.text
                date = review.find('div', {'class': '_3JxPDYSx'}).text
                content = review.find('div', {'class': 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
                records.append((user, country, date, content))
            except AttributeError:
                pass

df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
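One refinement worth considering: `except AttributeError: pass` silently drops any review where even a single field is missing. A small helper (the name `safe_text` is my own, not part of the answer) keeps the row and records a placeholder instead:

```python
def safe_text(node, default=""):
    """Return the stripped text of a bs4 tag, or `default` if find() missed."""
    return node.get_text(strip=True) if node is not None else default

# With this helper, a missing country or date no longer discards
# the whole review; the field just falls back to the default.
print(safe_text(None, default="N/A"))  # N/A
```

Each `review.find(...)` result would then be wrapped in `safe_text(...)` instead of chaining `.text` directly.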

Output from the .csv file:

(screenshot of the resulting CSV)
