Extract data from HTML using BeautifulSoup
I'm trying to extract the data under the EXPERIENCE tag, and I'm using BeautifulSoup to do it. Below is my HTML:
<div><span>EXPERIENCE
<br/></span></div><div><span>
<br/></span></div><div><span>
<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018
<br/></span></div><div><span> I worked on JAVA platform
<br/></span></div><div><span>From then i worked in ABC company
</br>2018- Till date
</br></span></div><div><span>I got handson on Python Language
</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...
My work so far:
with open('E:/cvparser/test.html','rb') as h:
    dh = h.read().splitlines()
out = str(dh)
soup = BeautifulSoup(out,'html.parser')
for tag in soup.select('div:has(span:contains("EXPERIENCE"))'):
    final = (tag.get_text(strip = True, separator = '\n'))
    print(final)
Expected output:
I worked in XYZ company from 2016 - 2018
I worked on JAVA platform
From then i worked in ABC company
2018- Till date
I got handson on Python Language
My code returns nothing. Can someone help me out here?
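One detail in the snippet above that may be worth checking: calling str() on the list returned by splitlines() produces the Python repr of a list of bytes objects, not the markup itself. The following is only an illustration with a made-up two-line byte string (the names raw, as_repr and as_text are hypothetical, and this is not the actual file), not a diagnosis of the problem:

```python
# Made-up stand-in for the bytes read from the file in binary mode.
raw = b"<div><span>EXPERIENCE\n<br/></span></div>"

# What str() of the splitlines() result looks like: a list repr with b'...' pieces.
as_repr = str(raw.splitlines())

# Decoding the bytes keeps the markup intact (assuming the file is UTF-8).
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

Passing the decoded text (or the raw bytes themselves) to BeautifulSoup avoids feeding the list brackets and b'...' quoting into the parser.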
As I understand it, you want the text in the span elements between EXPERIENCE and PROJECTS. Here is what you need:
from bs4 import BeautifulSoup as soup
html = """<div><span>EXPERIENCE
<br/></span></div><div><span>
<br/></span></div><div><span>
<br/></span></div><div><span></span><span> </span><span>I worked in XYZ company from 2016 - 2018
<br/></span></div><div><span> I worked on JAVA platform
<br/></span></div><div><span>From then i worked in ABC company
</br>2018- Till date
</br></span></div><div><span>I got handson on Python Language
</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""
page = soup(html, "html.parser")
save = False
final = ''
for div in page.find_all('div'):
    text = div.get_text()
    if text and text.strip().replace('\n','') == 'PROJECTS':
        save = False
    if save and text and text.strip().replace('\n', ''):
        # last if is to avoid new line in final result
        final = '{0}\n{1}'.format(final,text.replace('\n',''))
    else:
        if text and 'EXPERIENCE' in text:
            save = True
print(final)
Output:
I worked in XYZ company from 2016 - 2018
I worked on JAVA platform
From then i worked in ABC company
I got handson on Python Language
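The same idea (collect everything between the EXPERIENCE and PROJECTS headings) can also be sketched on the flattened text rather than per div, which keeps "2018- Till date" on its own line. This is a minimal sketch under some assumptions: the helper name section_lines is hypothetical, the HTML is a trimmed-down copy of the sample from the question, and each heading is assumed to appear exactly as a line of its own.

```python
from bs4 import BeautifulSoup

# Trimmed-down copy of the HTML from the question.
html = """<div><span>EXPERIENCE
<br/></span></div><div><span>I worked in XYZ company from 2016 - 2018
<br/></span></div><div><span>From then i worked in ABC company
</br>2018- Till date
</br></span></div><div><span>PROJECTS
</br></span></div><div><span>Developed and optimized many application, etc...</span></div>"""


def section_lines(html, start="EXPERIENCE", stop="PROJECTS"):
    """Return the non-empty text lines found between two heading lines."""
    text = BeautifulSoup(html, "html.parser").get_text("\n")
    # Split on newlines and drop blank or whitespace-only lines.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    begin = lines.index(start) + 1   # raises ValueError if the heading is missing
    end = lines.index(stop, begin)   # first stop heading after the start heading
    return lines[begin:end]


experience = section_lines(html)
print("\n".join(experience))
```

Working on lines instead of div elements means the result does not depend on how the parser treats the stray </br> tags, at the cost of assuming the headings never share a line with other text.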
I'm not sure about your HTML example, but try this:
import requests
from bs4 import BeautifulSoup
result2 = requests.get("") # your url here
src2 = result2.content
soup = BeautifulSoup(src2, 'lxml')
# note: the dict filters on div attributes, i.e. <div span="Experience">
for item in soup.find_all('div', {'span': 'Experience'}):
    print(item.text)
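For comparison, here is a small sketch of the difference between filtering on an attribute and filtering on the text inside a div. The two-div HTML string and the names by_attr and by_text are made up for illustration. The second argument of find_all is matched against tag attributes, so it finds nothing when "span" is a child element rather than an attribute:

```python
from bs4 import BeautifulSoup

# Made-up sample shaped like the question's markup.
html = (
    "<div><span>EXPERIENCE<br/></span></div>"
    "<div><span>I worked on JAVA platform<br/></span></div>"
)
soup = BeautifulSoup(html, "html.parser")

# Attribute filter: looks for <div span="EXPERIENCE">, which does not exist here.
by_attr = soup.find_all("div", {"span": "EXPERIENCE"})

# Text filter: keep the divs whose text contains the heading.
by_text = [div for div in soup.find_all("div") if "EXPERIENCE" in div.get_text()]

print(len(by_attr), len(by_text))
print(by_text[0].get_text(strip=True))
```

Note that the text filter only locates the heading's own div; the content the question asks for sits in the sibling div elements that follow it.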
You can use itertools.groupby to match all relevant sub-contents to their appropriate header:
import re
from itertools import groupby
from bs4 import BeautifulSoup as soup
# html is the HTML string from the question
# flatten the parse tree into a list of its text nodes
d = lambda x:[i for b in x.contents for i in ([b] if b.name is None else d(b))]
# remove newlines and leading whitespace, then drop empty strings
data = list(filter(None, map(lambda x:re.sub(r'\n+|^\s+', '', x), d(soup(html, 'html.parser')))))
# split into alternating runs of upper-case headers and their content
new_d = [list(b) for _, b in groupby(data, key=lambda x:x.isupper())]
result = {new_d[i][0]:new_d[i+1] for i in range(0, len(new_d), 2)}
Output:
{'EXPERIENCE': ['\uf0b7', 'I worked in XYZ company from 2016 - 2018', 'I worked on JAVA platform', 'From then i worked in ABC company', 'I got handson on Python Language'], 'PROJECTS': ['Developed and optimized many application, etc...']}
To get your desired output:
print('\n'.join(result['EXPERIENCE']))
Output:
I worked in XYZ company from 2016 - 2018
I worked on JAVA platform
From then i worked in ABC company
2018- Till date
I got handson on Python Language
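To make the grouping step easier to follow, here is a stand-alone sketch of just the groupby part, run on a hand-written list of already cleaned lines instead of parsed HTML. The list lines is made up for illustration, and the sketch assumes that the list starts with a header, that headers are the only all-upper-case lines, and that every header is followed by at least one content line:

```python
from itertools import groupby

# Hand-written stand-in for the cleaned text nodes.
lines = [
    "EXPERIENCE",
    "I worked in XYZ company from 2016 - 2018",
    "2018- Till date",
    "PROJECTS",
    "Developed and optimized many application, etc...",
]

# groupby starts a new run each time the key flips between True and False,
# so the runs alternate: [header], [its content...], [header], [its content...].
groups = [list(run) for _, run in groupby(lines, key=str.isupper)]

# Pair each header run (even index) with the content run that follows it.
result = {groups[i][0]: groups[i + 1] for i in range(0, len(groups), 2)}

print(result["EXPERIENCE"])
```

If two headers appear back to back, they fall into the same run and the pairing shifts, so real input may need a check on the run lengths first.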