
Scrape data from a website into a CSV file using Python and BeautifulSoup

I am trying to get all the graphics card details into a CSV file but am not able to scrape the data (doing this as a project to scrape data for learning purposes). I am new to Python and HTML. I am using the requests and BeautifulSoup libraries.

import bs4
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'
uClient = uReq(my_url)
Negg = uClient.read()
uClient.close()  # close() needs parentheses to actually close the connection
Complete_Graphics_New_Egg = soup(Negg,"html.parser")

Container_Main = Complete_Graphics_New_Egg.findAll("div",{"class":"item-container"})

Container_Main5 = str(Container_Main[5])
path_file='C:\\Users\\HP\\Documents\\Python\\Container_Main5.txt'
file_1 = open(path_file,'w')
file_1.write(Container_Main5)
file_1.close()

##Container_Main_details = Container_Main5.a

#div class="item-badges"

Container_5_1 = str(Container_Main[5].findAll("ul",{"class":"item-features"}))
path_file='C:\\Users\\HP\\Documents\\Python\\Container_test_5_1.txt'
file_5_1 = open(path_file,'w')
file_5_1.write(Container_5_1)
file_5_1.close()
## Container_5_1.li  # Container_5_1 is a str here, so .li won't work; grab the <li> tags from the Tag object before converting to str

Container_5_2 = str(Container_Main[5].findAll("p",{"class":"item-promo"}))
path_file='C:\\Users\\HP\\Documents\\Python\\Container_test_5_2.txt'
file_5_2 = open(path_file,'w')
file_5_2.write(Container_5_2)
file_5_2.close()
##p class="item-promo"
##div class="item-info"

This should get you started. I'll also break it down a bit so you can modify and experiment while you're learning. I suggest using pandas as well, as it's a popular library for data manipulation and you'll likely be using it in the near future if you aren't already.

I first initialize a results dataframe to store all the data you'll be parsing:

import bs4
import requests
import pandas as pd

results = pd.DataFrame()

Next, get the HTML from the site and pass it into BeautifulSoup:

my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'

response = requests.get(my_url)
html = response.text

soup = bs4.BeautifulSoup(html, 'html.parser')

Then find all the tags you were interested in, just as you did. The only thing I added was to have it iterate over each of those tags/elements it finds:

Container_Main = soup.find_all("div",{"class":"item-container"})
for container in Container_Main:

Then, in each of those containers, grab the data you want from the item features and item promo. I store that data in a temporary dataframe (of 1 row) and then append it to my results dataframe. After each iteration, the temp dataframe is overwritten with the new info, but the results won't be overwritten; rows just keep getting added on.
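A minimal sketch of that loop body (it mirrors the full listing below):

    item_features = container.find("ul",{"class":"item-features"})
    if item_features is None:
        continue  # nothing to parse for this container

    temp_df = pd.DataFrame(index=[0])      # one-row dataframe for this item
    for feature in item_features.find_all('li'):
        split_str = feature.text.split(':', 1)
        temp_df[split_str[0]] = split_str[1].strip()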

Lastly, use pandas to save the dataframe to csv.

results.to_csv('path/file.csv', index=False)

So full code:

import bs4
import requests
import pandas as pd

results = pd.DataFrame()

my_url = 'https://www.newegg.com/Product/ProductList.aspx?Submit=ENE&DEPA=0&Order=BESTMATCH&Description=graphics+card&N=-1&isNodeId=1'

response = requests.get(my_url)
html = response.text

soup = bs4.BeautifulSoup(html, 'html.parser')

Container_Main = soup.find_all("div",{"class":"item-container"})
for container in Container_Main:

    item_features = container.find("ul",{"class":"item-features"})

    # if there are no item-features, move on to the next container
    if item_features is None:
        continue

    temp_df = pd.DataFrame(index=[0])
    features_list = item_features.find_all('li')
    for feature in features_list:
        split_str = feature.text.split(':', 1)  # split on the first colon only, in case a value contains one
        header = split_str[0]
        data = split_str[1].strip()
        temp_df[header] = data

    promo_tag = container.find("p",{"class":"item-promo"})
    temp_df['promo'] = promo_tag.text if promo_tag else ''  # guard against items with no promo

    results = results.append(temp_df, sort = False).reset_index(drop = True)


results.to_csv('path/file.csv', index=False)
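One note if you're on a recent pandas release: DataFrame.append was deprecated and removed in pandas 2.0, so on newer versions the same idea can be written by collecting one dict per container and building the dataframe once at the end. A rough sketch of that variant, reusing the soup object from above:

rows = []
for container in soup.find_all("div",{"class":"item-container"}):
    item_features = container.find("ul",{"class":"item-features"})
    if item_features is None:
        continue

    row = {}
    for feature in item_features.find_all('li'):
        split_str = feature.text.split(':', 1)
        if len(split_str) == 2:
            row[split_str[0]] = split_str[1].strip()

    promo_tag = container.find("p",{"class":"item-promo"})
    row['promo'] = promo_tag.text if promo_tag else ''
    rows.append(row)

results = pd.DataFrame(rows)
results.to_csv('path/file.csv', index=False)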
