Extract text between HTML td tags

I have some <td> elements and want to extract the text from them; that is, I need just the text Tom Cruz, Homer Simpson, Bill Clinton, which is inside each <td> tag, using one Python regular expression.

<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>

<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Home Simpson</td>

<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>

Any ideas?

Update 1: If an HTML parser is the standard way, how should I go about it?

I know you asked for a regex-only solution, but I would urge you to consider a safer, faster and simpler approach: a real HTML parser such as lxml, html5lib or BeautifulSoup. These libraries can parse invalid HTML, and BeautifulSoup can use lxml as its underlying parser.

With BeautifulSoup:

html = """
<td class="clic-cul manga" template=".woxColumnyd" maz="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Tom Cruz</td>
<td class="clic-cul manga" template=".woxColumnx" mac="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Home Simpson</td>
<td class="clic-cul manga" template=".woxColumnz" max="/ajax/blac-woxm/xom-line/expanded/2002-2012/11-05-2022/01/fam.json">Bill Clinton</td>
"""

import bs4
doc = bs4.BeautifulSoup(html, 'lxml')           # the 'lxml' parser requires the lxml package to be installed
print([el.text for el in doc.find_all('td')])   # text content of every <td> element

The output is then

['Tom Cruz', 'Home Simpson', 'Bill Clinton']
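
For Update 1, if installing a third-party package is not an option, the standard library's html.parser module can probably do the same job. The following is only a minimal sketch and not part of the original answer; the class name TdTextParser and the trimmed-down sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TdTextParser(HTMLParser):
    """Collects the text found inside each <td> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_td = False   # True while we are between <td> and </td>
        self.cells = []      # one string per <td> encountered

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        # Text outside <td> (e.g. the newlines between rows) is ignored.
        if self.in_td:
            self.cells[-1] += data


# Sample rows with most attributes trimmed for brevity (assumption).
html = (
    '<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>\n'
    '<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>\n'
    '<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>\n'
)

parser = TdTextParser()
parser.feed(html)
parser.close()
print(parser.cells)
```

Unlike BeautifulSoup, this parser does not build a tree; it only reports tags and text as it encounters them, which is enough for flat cells like these.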

If you are looking for a one-liner regex: >\w+(\s\w+)?</
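
In Python, the regex approach might look like the sketch below. It assumes the intended pattern is >\w+(\s\w+)?</ and adds a capture group so that only the text is returned; the second, more general pattern and the trimmed sample markup are my own assumptions, not something from the original answer.

```python
import re

# Sample rows with most attributes trimmed for brevity (assumption).
html = (
    '<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>\n'
    '<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>\n'
    '<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>\n'
)

# The one-liner with a capture group: one word, optionally followed by
# whitespace and a second word, sitting between ">" and "</".
names = re.findall(r'>(\w+(?:\s\w+)?)</', html)
print(names)

# A more general variant: capture whatever sits between <td ...> and </td>.
# The non-greedy .*? stops at the first closing tag; re.DOTALL lets the
# text span several lines.
names_general = re.findall(r'<td[^>]*>(.*?)</td>', html, flags=re.DOTALL)
print(names_general)
```

The first pattern only matches cells holding one or two words. The second accepts any cell text but still breaks on nested markup, which is the usual argument for using a parser instead.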

If not, let's say you have that HTML stored in a file named dat.txt. I don't know Python, but I know Ruby; maybe you can adapt the following.

html = File.read("dat.txt")     # HTML stored in dat.txt
arr = html.split(/[<>]/)        # split html into arr whenever < or > is encountered
i = 2                           # each <td ...>text</td> yields four pieces, so the names sit at indices 2, 6, 10, ...
while i < arr.length            # stop at the end of the array instead of after a fixed count
    puts arr[i]
    i += 4
end

Working proof
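
Since the question is about Python, the same split-on-angle-brackets idea could be ported roughly as follows. This is a sketch under the same assumption as the Ruby version, namely that the input consists only of flat <td ...>text</td> rows; the sample markup is trimmed for illustration.

```python
import re

# Sample rows with most attributes trimmed for brevity (assumption).
html = (
    '<td class="clic-cul manga" template=".woxColumnyd">Tom Cruz</td>\n'
    '<td class="clic-cul manga" template=".woxColumnx">Home Simpson</td>\n'
    '<td class="clic-cul manga" template=".woxColumnz">Bill Clinton</td>\n'
)

# Split on every "<" or ">". Each row contributes four pieces:
# the text before "<", the opening tag, the cell text, and "/td".
pieces = re.split(r'[<>]', html)

# The cell text is therefore at indices 2, 6, 10, ...
names = pieces[2::4]
print(names)
```

This relies entirely on the position of the pieces, so any extra tag in the input shifts the indices and yields the wrong items.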
