Beautiful Soup Python: extracting data
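I'm new to Python. I'm a long-time Stack Overflow user, but this is my first question. I'm trying to extract data from a website using BeautifulSoup. What I want to extract is the data under "Listed in" and "Tagged in".
I was able to get the paragraphs into a list, but I couldn't extract the actual data. The goal here is to extract Listed in: Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women, and Tagged in: Makeup, Beauty, Nail polish.
Could you tell me how to achieve this?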
import requests
from bs4 import BeautifulSoup
l1 = []
url = 'http://boxes.mysubscriptionaddiction.com/box/julep-maven'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "lxml")
for item in soup.find_all('p'):
    l1.append(item.contents)
search = '\nListed in:\n'
for a in l1:
    if a[0] in ('\nTagged in:\n', '\nListed in:\n'):
        print(a)
Since you're already using lxml, why not use it more directly (lxml is considered faster than BeautifulSoup):
import requests
from lxml import html
url='http://boxes.mysubscriptionaddiction.com/box/julep-maven'
source_code=requests.get(url)
tree = html.fromstring(source_code.content) #parses the html
paras = tree.xpath('//div[@class="box-information"]/p') #gets the para elements
# This loop prints the desired para elements' text.
for ele in paras[1:]:
    print(ele.text_content())
Output:
Listed in:
Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women
Tagged in:
Makeup, Beauty, Nail polish
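Note: the site is protected by a CAPTCHA, so for this code to work you may need to copy the page's source HTML as a string from your browser's dev tools and pass it in as tree = html.fromstring(copied_string).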
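As a rough sketch of the copied-HTML workaround described in the note, the snippet below parses an inline string instead of a live response. The copied_string markup is a made-up stand-in: it only assumes that the page wraps these paragraphs in a div with class "box-information", as the XPath above implies, and the real page may be structured differently.

```python
from lxml import html

# Hypothetical stand-in for HTML copied from the browser's dev tools.
copied_string = """
<div class="box-information">
  <p>Some description of the box.</p>
  <p>
Listed in:
<a href="#">Nail Polish Subscription Boxes</a>, <a href="#">Subscription Boxes for Women</a>
  </p>
  <p>
Tagged in:
<a href="#">Makeup</a>, <a href="#">Beauty</a>
  </p>
</div>
"""

tree = html.fromstring(copied_string)  # parse the string, no network request
paras = tree.xpath('//div[@class="box-information"]/p')

# Skip the first paragraph and collapse whitespace so each entry is one line.
lines = [" ".join(p.text_content().split()) for p in paras[1:]]
for line in lines:
    print(line)
```

Collapsing whitespace with split() and join() is optional; it just avoids the blank lines that text_content() keeps from the original markup.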
import re
# plain_text is the page source fetched in the question's code.
soup = BeautifulSoup(plain_text, 'html.parser')
# Find the text node holding the label, then step up to its enclosing element.
context = soup(text=re.compile(r'Listed in:'))
for item in context:
    listed_in = item.parent
    tagged_in = listed_in.find_next_siblings()[0]
    print(listed_in.text.strip('\n').replace('\n', ''))
    print(tagged_in.text.strip('\n').replace('\n', ''))
This will display everything on one line:
Listed in:Nail Polish Subscription Boxes, Subscription Boxes for Beauty Products, Subscription Boxes for Women, Tagged in: Makeup, Beauty, Nail polish
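If the values are needed as separate items rather than as printed text, something along these lines might work. It is only a sketch: sample_html is invented markup that mimics the "Listed in" and "Tagged in" paragraphs, and the fields dictionary is a hypothetical name introduced for illustration.

```python
from bs4 import BeautifulSoup

# Invented markup resembling the paragraphs described in the question.
sample_html = """
<p>
Listed in:
<a href="#">Nail Polish Subscription Boxes</a>, <a href="#">Subscription Boxes for Women</a>
</p>
<p>
Tagged in:
<a href="#">Makeup</a>, <a href="#">Beauty</a>, <a href="#">Nail polish</a>
</p>
"""

soup = BeautifulSoup(sample_html, "html.parser")

fields = {}
for p in soup.find_all("p"):
    # get_text() flattens nested tags; split/join collapses the newlines.
    text = " ".join(p.get_text().split())
    for label in ("Listed in:", "Tagged in:"):
        if text.startswith(label):
            # Keep only the comma-separated values that follow the label.
            fields[label] = [v.strip() for v in text[len(label):].split(",")]

print(fields)
```

Using get_text() on each paragraph, rather than item.contents as in the question, returns the link text together with the label, which is why the actual values become available.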
Hope that helps.