
I'm using Python 3.7 and BS4 (BeautifulSoup 4) for web scraping, and I've run into a problem I can't solve. I hope someone knows how to fix it.

I'm supposed to get product information from the page source. The data I want is in an HTML tag (em in the code below), but some entries have another tag (a span) nested inside it, so when I save the data to local storage the output looks very bad.

Here is my code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://list.jd.com/list.html?cat=9987,653,655&ev=exbrand_15127&page=1'

#opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

filename = "params.csv"
f = open(filename,"w")
#grabs each product
li_containers =  page_soup.findAll("li",{"class":"gl-item"})
for i in range(0,len(li_containers)):
   p_name_div = li_containers[i].find("div",{"class":"p-name"})
   p_name = p_name_div.a.em.text.strip()
   print(p_name)
   f.write(p_name)
f.close()

Here are some screenshots.

I wanted it to be like this:


But it ended up looking like this:


Without span tag

With span tag

Try this

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

my_url = 'https://list.jd.com/list.html?cat=9987,653,655&ev=exbrand_15127&page=1'

#opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

#html parsing
page_soup = soup(page_html, "html.parser")

filename = "params.csv"
f = open(filename,"w")
#grabs each product
li_containers =  page_soup.findAll("li",{"class":"gl-item"})
for i in range(0,len(li_containers)):
   p_name_div = li_containers[i].find("div",{"class":"p-name"})
   p_name = p_name_div.a.em.text.strip()
   print(p_name.strip(" "))
   #write one product name per line
   f.write(p_name.strip(" ") + "\n")
f.close()
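
If the unwanted text comes from the span nested inside the em tag, stripping spaces alone may not be enough, because the span's own text is still part of em.text. One possible approach, shown here as a rough sketch rather than a tested fix for the live page, is to remove the nested span tags before reading the text and then collapse the remaining whitespace. The SAMPLE_HTML markup, the p-tag class, and the helper names extract_names and write_names are made up for illustration; only the gl-item and p-name classes come from the question.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup imitating two product entries; the real page may differ.
SAMPLE_HTML = """
<ul>
  <li class="gl-item">
    <div class="p-name">
      <a href="#"><em><span class="p-tag">Global</span>
        Phone A 64GB</em></a>
    </div>
  </li>
  <li class="gl-item">
    <div class="p-name">
      <a href="#"><em>Phone B 128GB</em></a>
    </div>
  </li>
</ul>
"""


def extract_names(html):
    """Return product names with any nested <span> label removed."""
    page_soup = BeautifulSoup(html, "html.parser")
    names = []
    for li in page_soup.find_all("li", {"class": "gl-item"}):
        em = li.find("div", {"class": "p-name"}).a.em
        # Drop nested <span> tags so their text does not leak into the name.
        for span in em.find_all("span"):
            span.decompose()
        # Collapse runs of whitespace (including newlines) into single spaces.
        names.append(" ".join(em.get_text().split()))
    return names


def write_names(names, file_obj):
    """Write one product name per CSV row."""
    writer = csv.writer(file_obj)
    for name in names:
        writer.writerow([name])


names = extract_names(SAMPLE_HTML)
buffer = io.StringIO()  # stands in for a real file in this sketch
write_names(names, buffer)
print(names)
```

To write to disk instead of the in-memory buffer, the same write_names call should work with a file opened as open("params.csv", "w", newline="", encoding="utf-8"), which also keeps non-ASCII product names intact.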
