
Python Pandas compare two dataframes then edit

First of all, sorry if this is a duplicate. I'm pretty much a beginner, so I don't always fully understand other people's questions.

I'm trying to write a script for a school project that goes through a big Excel file containing multiple links, scrapes the price from each webpage, and compares it to the price in the price column of the Excel file. If there is no difference, great. But if there is, it should replace the old price with the one it just scraped.

For example:

Excel file:

| Link        | Price          |
| --------    | -------------- |
| Product1    | 119            |
| Product2    | 89             |

Scraped data : 

| Price          |
| -------------- |
| 119            |
| 91             |

If this scenario happens, the Excel file should be edited to look like this:

| Link        | Price          |
| --------    | -------------- |
| Product1    | 119            |
| Product2    | 91             |

For now I have only been able to scrape the prices into a list and turn the Excel file into a DataFrame, but I really have no idea what to do next.

Here's my code:

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

b =[]
tableau = pd.read_excel (r'PriceTrackerTrails\liens.xlsx')

links = pd.DataFrame(tableau, columns=['prix','liens'])

for i in links.index:
    html = requests.get(links['liens'][i]).text
    soup = bs(html, 'lxml')
    a = soup.find('span', {"itemprop":"price"}).text
    b.append(a)

print(links['prix'])
print(b)

output:

0    139,00
1        98
Name: prix, dtype: object
['139,00', '112,00']

And the Excel file looks like this: Excel (link to source file)

Thank you in advance !

In the for loop, update the value in links['prix'] with the value from a as follows:

for i in links.index:
    html = requests.get(links['liens'][i]).text
    soup = bs(html, 'lxml')
    a = soup.find('span', {"itemprop":"price"}).text
    b.append(a)
    links.loc[i, 'prix'] = a
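
The loop above only changes the DataFrame in memory, so the Excel file itself stays as it was until the frame is written back. A minimal sketch of that last step, assuming the scraped prices are already available as a list: `apply_scraped_prices`, `save_links`, the example URLs and the output path are all hypothetical names, while `DataFrame.to_excel` is the standard pandas writer (it needs an Excel engine such as openpyxl installed).

```python
import pandas as pd


def apply_scraped_prices(links: pd.DataFrame, scraped: list) -> pd.DataFrame:
    """Return a copy of `links` whose 'prix' column holds the scraped values."""
    if len(scraped) != len(links):
        raise ValueError("expected exactly one scraped price per row")
    updated = links.copy()
    # Align on the original index so each price lands on its own row.
    updated["prix"] = pd.Series(scraped, index=links.index)
    return updated


def save_links(links: pd.DataFrame, path: str) -> None:
    """Write the frame to an Excel file without the row index column."""
    links.to_excel(path, index=False)


# Stand-in data shaped like the question's output; the URLs are placeholders.
links = pd.DataFrame(
    {
        "prix": ["139,00", 98],
        "liens": ["https://example.com/p1", "https://example.com/p2"],
    }
)
updated = apply_scraped_prices(links, ["139,00", "112,00"])
print(updated["prix"].tolist())

# Hypothetical output path; uncomment to actually write the file.
# save_links(updated, "liens_updated.xlsx")
```

Writing to a new file first, rather than over the original, makes it easy to check the result before replacing anything.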

To keep track of which rows had a price update, I would do this a bit differently: store the new price in a new column of the links dataframe instead of overwriting the value in the 'prix' column. This makes the data easier to validate, and the separate list b is no longer needed. All you have to change is the column name passed to loc:

for i in links.index:
    html = requests.get(links['liens'][i]).text
    soup = bs(html, 'lxml')
    a = soup.find('span', {"itemprop":"price"}).text
    links.loc[i, 'prix_new'] = a
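
One thing to watch in the loop: `soup.find` returns `None` when a page has no matching `span`, and calling `.text` on `None` raises an `AttributeError` that stops the whole run. A small sketch of a more defensive lookup, assuming it is acceptable to record a missing price as `None`: `extract_price` is a hypothetical helper, and the built-in `html.parser` is used here only so the example does not depend on lxml.

```python
from typing import Optional

from bs4 import BeautifulSoup


def extract_price(html: str) -> Optional[str]:
    """Return the text of the first <span itemprop="price">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", {"itemprop": "price"})
    if tag is None:
        return None
    # strip=True removes surrounding whitespace from the price text.
    return tag.get_text(strip=True)


# Stand-in HTML snippets, not taken from any real product page.
sample = '<div><span itemprop="price"> 112,00 </span></div>'
print(extract_price(sample))
print(extract_price("<div>sold out</div>"))
```

Inside the loop, the result could then be assigned with `links.loc[i, 'prix_new'] = extract_price(html)`.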

Then create a new dataframe that contains only the rows where there was a change in price as follows:

price_updated = links[links['prix']!=links['prix_new']].reset_index()
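
Note that the output in the question shows 'prix' with dtype object, holding values such as `139,00` and `98`, while the scraped prices are strings. A raw `!=` therefore also flags rows where only the formatting differs. A possible sketch of comparing the prices as numbers instead, assuming a comma is always the decimal separator and there is no thousands separator: `to_number` is a hypothetical helper and the frame below is stand-in data.

```python
import pandas as pd


def to_number(value) -> float:
    """Convert '139,00', '98' or 98 to a float (comma treated as decimal point)."""
    return float(str(value).strip().replace(",", "."))


# Stand-in data: row 2 differs only in formatting ('45' vs '45,00').
links = pd.DataFrame(
    {
        "prix": ["139,00", 98, "45"],
        "prix_new": ["139,00", "112,00", "45,00"],
    }
)

# Raw comparison: flags rows 1 and 2.
raw_changed = links["prix"] != links["prix_new"]

# Numeric comparison: flags only row 1, where the price really changed.
old = links["prix"].map(to_number)
new = links["prix_new"].map(to_number)
price_updated = links[old != new].copy()

print(raw_changed.tolist())
print(price_updated)
```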

reset_index returns a new dataframe whose row index starts at 0; by default the old row labels are kept in a column named index. If you would rather keep the original row index so that it matches the links dataframe, replace reset_index() with copy(). Either way price_updated is an independent dataframe, so modifying it later does not trigger a SettingWithCopyWarning, the warning pandas gives when you assign to a copy of a slice of a DataFrame.
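
The difference between the two options can be seen on a small stand-in frame (the data and the 'checked' column are made up for illustration):

```python
import pandas as pd

links = pd.DataFrame(
    {"prix": ["139,00", "98"], "prix_new": ["139,00", "112,00"]}
)
changed = links["prix"] != links["prix_new"]

# reset_index: rows are renumbered from 0, old labels go to an 'index' column.
renumbered = links[changed].reset_index()

# copy: the original row labels are preserved.
kept = links[changed].copy()

print(renumbered.index.tolist(), renumbered["index"].tolist())
print(kept.index.tolist())

# Both results are independent of `links`, so adding a column here
# leaves the original frame unchanged.
kept["checked"] = True
print("checked" in links.columns)
```

Keeping the original labels is handy if you later want to look the changed rows up in links again, for example with `links.loc[kept.index]`.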


 