简体   繁体   中英

Python extract italic content from html

I am trying to extract 'Italic' Content from a pdf in python. I have converted the pdf to html so that I can use the italic tag to extract the text. Here is how the html looks like

<br></span></div><div style="position:absolute; border: textbox 1px
solid; writing-mode:lr-tb; left:71px; top:225px; width:422px;
height:15px;"><span style="font-family: TTPGFA+Symbol; font-
size:12px">•</span><span style="font-family: YUWTQX+ArialMT; font-
size:14px">  Kornai, Janos. 1992. </span><span style="font-family:
PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: The
Political Economy of Communism</span><span style="font-family:
YUWTQX+ArialMT; font-size:14px">.

This is how the code looks:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open("/../..myfile.html"))
bTags = []
for i in soup.findAll('span'):
    bTags.append(i.text)

I am not sure how can I get only the italic text.

Try this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html)
bTags = []
for i in soup.find_all('span', style=lambda x: x and 'Italic' in x):
    bTags.append(i.text)

print bTags

Passing a function to the style argument will filter results by the result of that function, with its input as the value of the style attribute. We check to see if the string Italic is inside the attribute, and if so, return True.

You may need a more sophisticated algorithm depending on the rest of what your HTML looks like.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM