Python BeautifulSoup - Scrape Multiple Web Pages with Iframes from Given URLs
We have this code (thanks to Cody and Alex Tereshenkov):
import pandas as pd
import requests
from bs4 import BeautifulSoup
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 50)
url = "https://www.aliexpress.com/store/feedback-score/1665279.html"
s = requests.Session()
r = s.get(url)
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("#detail-displayer").attrs["src"]
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
rows = []
for row in soup.select(".history-tb tr"):
    # print("\t".join([e.text for e in row.select("th, td")]))
    rows.append([e.text for e in row.select("th, td")])
df = pd.DataFrame.from_records(
    rows,
    columns=['Feedback', '1 Month', '3 Months', '6 Months'],
)
# remove first row with column names
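# (copy the slice so that adding the 'Shop' column does not trigger a SettingWithCopyWarning)
df = df.iloc[1:].copy()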
df['Shop'] = url.split('/')[-1].split('.')[0]
pivot = df.pivot(index='Shop', columns='Feedback')
pivot.columns = [' '.join(col).strip() for col in pivot.columns.values]
column_mapping = dict(
    zip(pivot.columns.tolist(), [col[:12] for col in pivot.columns.tolist()]))
# column_mapping
# {'1 Month Negative (1-2 Stars)': '1 Month Nega',
# '1 Month Neutral (3 Stars)': '1 Month Neut',
# '1 Month Positive (4-5 Stars)': '1 Month Posi',
# '1 Month Positive feedback rate': '1 Month Posi',
# '3 Months Negative (1-2 Stars)': '3 Months Neg',
# '3 Months Neutral (3 Stars)': '3 Months Neu',
# '3 Months Positive (4-5 Stars)': '3 Months Pos',
# '3 Months Positive feedback rate': '3 Months Pos',
# '6 Months Negative (1-2 Stars)': '6 Months Neg',
# '6 Months Neutral (3 Stars)': '6 Months Neu',
# '6 Months Positive (4-5 Stars)': '6 Months Pos',
# '6 Months Positive feedback rate': '6 Months Pos'}
pivot.columns = [column_mapping[col] for col in pivot.columns]
pivot.to_excel('Report.xlsx')
The code extracts the "Feedback History" table for the given URL (the table sits inside an iframe) and transforms all the table data into a single row for that shop.
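To make the reshaping step concrete without touching the network, here is a minimal sketch that runs the same pivot on a hand-written table. The numbers are placeholders invented for illustration, not real feedback data for any shop; the column names and the shop id are taken from the code above.

```python
import pandas as pd

# Placeholder values only: these are NOT real feedback numbers for any shop.
rows = [
    ["Positive (4-5 Stars)", "10", "30", "60"],
    ["Neutral (3 Stars)", "1", "3", "6"],
    ["Negative (1-2 Stars)", "0", "2", "4"],
    ["Positive feedback rate", "90%", "91%", "92%"],
]
df = pd.DataFrame(rows, columns=["Feedback", "1 Month", "3 Months", "6 Months"])
df["Shop"] = "1665279"

# One row per shop; the columns become a (period, feedback type) MultiIndex.
pivot = df.pivot(index="Shop", columns="Feedback")

# Flatten the MultiIndex into single strings such as "1 Month Neutral (3 Stars)".
pivot.columns = [" ".join(col).strip() for col in pivot.columns.values]

print(pivot.shape)
print(pivot.columns.tolist())
```

One thing worth noticing: the script then truncates each name to 12 characters, so "1 Month Positive (4-5 Stars)" and "1 Month Positive feedback rate" both become "1 Month Posi" (as the commented mapping shows), and the spreadsheet ends up with duplicate headers.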
On the other hand, we have a file in the same project folder ("urls.txt") with 50 URLs like this:
https://www.aliexpress.com/store/feedback-score/4385007.html
https://www.aliexpress.com/store/feedback-score/1473089.html
https://www.aliexpress.com/store/feedback-score/3085095.html
https://www.aliexpress.com/store/feedback-score/2793002.html
https://www.aliexpress.com/store/feedback-score/4656043.html
https://www.aliexpress.com/store/feedback-score/4564021.html
We just need to extract the same data for all the URLs in the file.
How do we do it?
Since there are only about 50 URLs, you can read them into a list and send a request to each one in turn. I have tested the 6 URLs shown above and the solution works for them. You may still want to add try-except handling for any exceptions that occur.
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Read one URL per line, dropping trailing newlines and blank lines
with open('urls.txt', 'r') as f:
    urls = [line.strip() for line in f if line.strip()]

master_list = []
for idx, url in enumerate(urls):
    s = requests.Session()
    r = s.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # The feedback table lives inside an iframe, so follow its src
    iframe_src = soup.select_one("#detail-displayer").attrs["src"]
    r = s.get(f"https:{iframe_src}")
    soup = BeautifulSoup(r.content, "html.parser")
    rows = []
    for row in soup.select(".history-tb tr"):
        rows.append([e.text for e in row.select("th, td")])
    df = pd.DataFrame.from_records(
        rows,
        columns=['Feedback', '1 Month', '3 Months', '6 Months'],
    )
    # Drop the header row; copy so that adding a column does not warn
    df = df.iloc[1:].copy()
    shop = url.split('/')[-1].split('.')[0]
    df['Shop'] = shop
    pivot = df.pivot(index='Shop', columns='Feedback')
    # One flat record per shop: shop id followed by the 12 table values
    master_list.append([shop] + pivot.values.tolist()[0])
    if idx == len(urls) - 1:  # last item in the list
        final_output = pd.DataFrame(master_list)
        pivot.columns = [' '.join(col).strip() for col in pivot.columns.values]
        column_mapping = dict(zip(pivot.columns.tolist(), [col[:12] for col in pivot.columns.tolist()]))
        final_output.columns = ['Shop'] + [column_mapping[col] for col in pivot.columns]
        final_output.set_index('Shop', inplace=True)
final_output.to_excel('Report.xlsx')
Output:
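Returning to the try-except suggestion, one way it might look is sketched below. The helper names `fetch_history_rows` and `scrape_all`, the 10-second timeout, and the choice of which exceptions to catch are assumptions made for illustration; the selectors are the ones used in the script above. The idea is only that a single bad URL gets recorded and skipped instead of aborting the whole run.

```python
import requests
from bs4 import BeautifulSoup


def fetch_history_rows(session, url, timeout=10):
    """Return the feedback-history table rows for one store URL (hypothetical helper)."""
    r = session.get(url, timeout=timeout)
    r.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
    soup = BeautifulSoup(r.content, "html.parser")
    iframe = soup.select_one("#detail-displayer")
    if iframe is None:
        raise ValueError("iframe #detail-displayer not found")
    r = session.get(f"https:{iframe.attrs['src']}", timeout=timeout)
    r.raise_for_status()
    soup = BeautifulSoup(r.content, "html.parser")
    return [[e.text for e in row.select("th, td")]
            for row in soup.select(".history-tb tr")]


def scrape_all(urls, fetch=fetch_history_rows):
    """Try every URL; collect successes and failures instead of stopping early."""
    results, failures = {}, {}
    with requests.Session() as session:
        for url in urls:
            try:
                results[url] = fetch(session, url)
            except (requests.RequestException, ValueError, KeyError) as exc:
                # Remember what went wrong and carry on with the next URL.
                failures[url] = str(exc)
    return results, failures
```

After a run, `failures` maps each problem URL to its error message, so those shops can be retried or inspected by hand. Passing `fetch` as a parameter also makes it possible to exercise the loop with a stand-in function and no network access.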
Perhaps a better solution to consider is avoiding pandas altogether. After you get the data, you could manipulate it into a list and write it out with XlsxWriter.
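As a rough sketch of that pandas-free route, assuming the scraped `rows` keep the shape used above (a header row followed by one row per feedback type): `flatten` and `write_report` are hypothetical helper names, and the column order here is simply period by period, which is not guaranteed to match the order the pandas pivot produces.

```python
import xlsxwriter


def flatten(shop, rows):
    """Turn one shop's table (header row + data rows) into a flat header and record."""
    periods = rows[0][1:]  # e.g. ['1 Month', '3 Months', '6 Months']
    header, record = ["Shop"], [shop]
    for j, period in enumerate(periods, start=1):
        for row in rows[1:]:
            header.append(f"{period.strip()} {row[0].strip()}")
            record.append(row[j].strip())
    return header, record


def write_report(path, header, records):
    """Write the header and one line per shop to an .xlsx file."""
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    worksheet.write_row(0, 0, header)
    for i, record in enumerate(records, start=1):
        worksheet.write_row(i, 0, record)
    workbook.close()
```

In the scraping loop you would call `flatten(shop, rows)` for each URL, keep the header from any one shop, append each record to a list, and finish with `write_report('Report.xlsx', header, records)`.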