Python extract italic content from html

Question

I am trying to extract italic content from a PDF in Python. I have converted the PDF to HTML so that I can use the italic markup to extract the text. Here is what the HTML looks like:

<br></span></div>
<div style="position:absolute; border: textbox 1px solid; writing-mode:lr-tb; left:71px; top:225px; width:422px; height:15px;">
<span style="font-family: TTPGFA+Symbol; font-size:12px">•</span>
<span style="font-family: YUWTQX+ArialMT; font-size:14px">  Kornai, Janos. 1992. </span>
<span style="font-family: PUCJZV+Arial-ItalicMT; font-size:14px">The Socialist System: The Political Economy of Communism</span>
<span style="font-family: YUWTQX+ArialMT; font-size:14px">.

This is how the code looks:

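from bs4 import BeautifulSoup

# Parse the HTML file produced from the PDF
with open("/../..myfile.html") as f:
    soup = BeautifulSoup(f, "html.parser")

# This collects the text of every span, italic or not
bTags = []
for i in soup.find_all('span'):
    bTags.append(i.text)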

I am not sure how I can get only the italic text.

Answer 1

Try this:

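from bs4 import BeautifulSoup

# html is a string holding the converted document
soup = BeautifulSoup(html, "html.parser")
bTags = []
# Keep only the spans whose style attribute mentions an italic font
for i in soup.find_all('span', style=lambda x: x and 'Italic' in x):
    bTags.append(i.text)

print(bTags)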

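Passing a function as the style argument filters the results by that function's return value. BeautifulSoup calls it with the value of each tag's style attribute (None when the tag has no style attribute) and keeps the tags for which it returns something truthy. Here the function first checks that the attribute exists, then checks whether the string Italic appears in it.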
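To see the filter in action, here is a small self-contained sketch that runs it on a shortened copy of the snippet from the question. The last span (without a style attribute) and the names is_italic and italic_texts are my own additions for illustration, and I am assuming the built-in html.parser is good enough for this markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question, plus one span
# without a style attribute (added for illustration).
html = (
    '<div><span style="font-family: YUWTQX+ArialMT; font-size:14px">'
    '  Kornai, Janos. 1992. </span>'
    '<span style="font-family: PUCJZV+Arial-ItalicMT; font-size:14px">'
    'The Socialist System: The Political Economy of Communism</span>'
    '<span>no style attribute here</span></div>'
)


def is_italic(style):
    # style is None for tags that have no style attribute at all
    return style is not None and 'Italic' in style


soup = BeautifulSoup(html, 'html.parser')
italic_texts = [span.text for span in soup.find_all('span', style=is_italic)]
print(italic_texts)
# ['The Socialist System: The Political Economy of Communism']
```

A named function behaves the same as the lambda; it just makes the None check explicit.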

You may need a more sophisticated algorithm depending on what the rest of your HTML looks like.
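As one possible direction, and only a sketch: instead of searching the whole style string, you could split it into CSS declarations and look at font-family and font-style specifically, matching case-insensitively and also accepting Oblique. The helper names (parse_style, is_italic_style, italic_texts) and the font name ABCDEF+Times-Oblique in the sample are hypothetical. Whether your converter ever emits font-style or Oblique fonts is an assumption you would need to check against your own files.

```python
import re

from bs4 import BeautifulSoup

# Font names and font-style values that usually indicate slanted text
ITALIC_FONT = re.compile(r'italic|oblique', re.IGNORECASE)


def parse_style(style):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} with lowercase keys."""
    declarations = {}
    for part in (style or '').split(';'):
        if ':' in part:
            name, value = part.split(':', 1)
            declarations[name.strip().lower()] = value.strip()
    return declarations


def is_italic_style(style):
    """True if font-family or font-style marks the text as italic."""
    css = parse_style(style)
    return bool(
        ITALIC_FONT.search(css.get('font-family', ''))
        or ITALIC_FONT.search(css.get('font-style', ''))
    )


def italic_texts(html):
    """Return the stripped text of every span styled as italic."""
    soup = BeautifulSoup(html, 'html.parser')
    spans = soup.find_all('span', style=is_italic_style)
    return [span.get_text(strip=True) for span in spans]


# Hypothetical sample covering the three cases
sample = (
    '<span style="font-family: ABCDEF+Times-Oblique">Title A</span>'
    '<span style="font-style: italic">Title B</span>'
    '<span style="font-family: YUWTQX+ArialMT">plain</span>'
)
print(italic_texts(sample))
# ['Title A', 'Title B']
```

Looking only at the relevant declarations avoids false positives when the word Italic shows up somewhere else in the style attribute.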

solution1
4 ACCEPTED 2016-09-12 20:06:41
