[英]Pandas to_csv only write the data from certain page
I tried to scrape data from tripadvisor, but from several pages that I tried to scrape, when I try to export it to csv it only shows 1 line of data and gives an error message like this我试图从tripadvisor刮取数据,但是从我试图刮取的几个页面中,当我尝试将其导出到csv时,它只显示1行数据并给出这样的错误消息
AttributeError: 'NoneType' object has no attribute 'text'
this is my code这是我的代码
import requests
import pandas as pd
from requests import get
from bs4 import BeautifulSoup
URL = 'https://www.tripadvisor.com/Attraction_Review-g469404-d3780963-Reviews-oa'
for offset in range(0, 30, 10):
url = URL + str(offset) + '-Double_Six_Beach-Seminyak_Kuta_District_Bali.html'
headers = {"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, "html.parser")
container = soup.find_all('div', {'class':'_2rspOqPP'})
for r in container:
reviews = r.find_all('div', {'class': None})
#the container that contains the elements that I want to scrape has no attributes and use DOM element. So I tried to access div with _2rspOqPP class first then access the div with no attributes from there
records = []
for review in reviews:
user = review.find('a', {'class':'_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
country = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy'}).span.text
date = review.find('div', {'class' : '_3JxPDYSx'}).text
content = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
records.append((user, country, date, content))
df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
Code updated代码更新
for r in container:
reviews = r.find_all('div', {'class': None})
records = []
for review in reviews:
try:
user = review.find('a', {'class':'_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
country = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy'}).span.text
date = review.find('div', {'class' : '_3JxPDYSx'}).text
content = review.find('div', {'class' : 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
records.append((user, country, date, content))
except:
pass
print(records)
df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
You should move the records
out of the for loops
and unindent the last few lines.您应该将records
移出for loops
并取消缩进最后几行。
See this:看到这个:
import pandas as pd
import requests
from bs4 import BeautifulSoup
main_url = 'https://www.tripadvisor.com/Attraction_Review-g469404-d3780963-Reviews-oa'
country_class = "DrjyGw-P _26S7gyB4 NGv7A1lw _2yS548m8 _2cnjB3re _1TAWSgm1 _1Z1zA2gh _2-K8UW3T _1dimhEoy"
records = []
for offset in range(0, 30, 10):
url = main_url + str(offset) + '-Double_Six_Beach-Seminyak_Kuta_District_Bali.html'
headers = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
}
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
container = soup.find_all('div', {'class': '_2rspOqPP'})
for r in container:
reviews = r.find_all('div', {'class': None})
for review in reviews:
try:
user = review.find('a', {'class': '_7c6GgQ6n _37QDe3gr WullykOU _3WoyIIcL'}).text
country = review.find('div', {'class': country_class}).span.text
date = review.find('div', {'class': '_3JxPDYSx'}).text
content = review.find('div', {'class': 'DrjyGw-P _26S7gyB4 _2nPM5Opx'}).text
records.append((user, country, date, content))
except AttributeError:
pass
df = pd.DataFrame(records, columns=['Name', 'Country', 'Date', 'Content'])
df.to_csv('doublesix_.csv', index=False, encoding='utf-8')
Output from the .csv
file: Output 来自.csv
文件:
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.