
HTML text extraction

I have the list below, which came from Beautiful Soup.

soup = BeautifulSoup(page.content, 'html.parser')
area = soup.select("td strong")

For example

area=[
<strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong>, 
<strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong>, 
<strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong>
]

I need to get rid of all the text except the trailing label (Cinema, Theatre).

I've come up with the expression below, but I can't apply it to the list:

x[x.find('<br/><br/>')+10:].replace('</strong>','')

Any ideas how I can apply this expression to every item in the list to build a new list? I've tried this:

clean_area=[]
for x in area:
   clean_area.append(x[x.find('<br/><br/>')+10:].replace('</strong>',''))

But I get this error: TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
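The error most likely comes from the fact that each item in area is a bs4 Tag, not a string. On a Tag, find() searches for a child tag with that name and returns None when there is none, so None + 10 fails. A minimal sketch of the same slicing idea applied to str(x) instead; the sample markup is an assumption standing in for page.content:

```python
from bs4 import BeautifulSoup

# Sample markup standing in for page.content (an assumption for illustration)
html = """
<table><tr>
<td><strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong></td>
<td><strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
area = soup.select("td strong")

marker = "<br/><br/>"
clean_area = []
for x in area:
    s = str(x)  # Tag -> plain string, so str.find returns an int
    clean_area.append(s[s.find(marker) + len(marker):].replace("</strong>", ""))

print(clean_area)
```

This keeps the original expression but replaces the hard-coded 10 with len(marker), so the offset stays correct if the marker changes.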

I was answering your first post about an hour ago but you removed it.

I'm not sure if this is the best way to do it but here is what I came up with:

text = [
"""<strong><span style="font-size:1.4em;">120 Beats Per Minute (15)</span><br/><br/>Cinema</strong>""", 
"""<strong><span style="font-size:1.4em;">A Little Night Music</span><br/><br/>Theatre</strong>""", 
"""<strong><span style="font-size:1.4em;">A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong>"""
]

text = ''.join(text)  # Convert the list of strings to one string

start = "<br/><br/>"  # Start marker
end = "</"  # End marker

clean_area = []

index = 0
while index < len(text):
    index = text.find(start, index)
    if index == -1:
        break
    clean_area.append(text[index+len(start):text.find(end, index)])
    index += len(start)

print(clean_area)

What you want is decompose(), which removes any tags you do not want from the parse tree.

In this case that is the span, so:

for x in soup.find_all("span"):
    x.decompose()

print(soup.text)

returns

Cinema, Theatre
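Note that soup.text covers whatever text is left in the whole document, so on a real page it may be easier to collect the text per element. A rough sketch of that variation, assuming markup shaped like the question's (the html string and the venues name are made up for illustration):

```python
from bs4 import BeautifulSoup

# Illustrative markup shaped like the question's data (an assumption)
html = (
    '<table><tr>'
    '<td><strong><span>120 Beats Per Minute (15)</span><br/><br/>Cinema</strong></td>'
    '<td><strong><span>A Little Night Music</span><br/><br/>Theatre</strong></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

venues = []
for strong in soup.select("td strong"):
    # Remove the title span from this element only
    for span in strong.find_all("span"):
        span.decompose()
    # What remains inside <strong> is the two <br/> tags and the label text
    venues.append(strong.get_text(strip=True))

print(venues)
```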

I could only get this working with 2 passes. I'm sure it's not the best way but it at least works.

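soup = BeautifulSoup(result.content, "html.parser")

# First pass: drop every <span> so only the trailing text is left in each <strong>
for x in soup.find_all("span"):
    x.decompose()

area = soup.select("td strong")

# Second pass: re-parse the stringified list so its tags can be walked
a = str(area)
soup2 = BeautifulSoup(a, "html.parser")

tr = []
for tag in soup2.find_all(True):
    tr.append(tag.text)

# Each <strong> is followed by its two <br/> tags, so every third entry is the text we want
clean_area = tr[::3]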
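The second pass can probably be avoided by reading the text nodes of each strong element directly. A minimal sketch, assuming the label is always the last non-empty text node inside the element (the html string and the labels name are illustrative, not from the original page):

```python
from bs4 import BeautifulSoup

# Illustrative markup shaped like the question's data (an assumption)
html = (
    '<table><tr>'
    '<td><strong><span>120 Beats Per Minute (15)</span><br/><br/>Cinema</strong></td>'
    '<td><strong><span>A Little Night Music</span><br/><br/>Theatre</strong></td>'
    '<td><strong><span>A Wrinkle in Time (PG)</span><br/><br/>Cinema</strong></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each non-empty text node in document order;
# the last one is the text that follows the <br/> tags
labels = [list(strong.stripped_strings)[-1] for strong in soup.select("td strong")]

print(labels)
```

This leaves the parse tree untouched, which helps if the titles are needed later. It would raise IndexError on a strong element with no text at all, so a guard is needed if such elements can occur.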

