
Regex html dynamic table

I am stuck on regex syntax. I am trying to write a regex for HTML that finds a specific string inside a table cell and returns the value of the next column, i.e. the cell right after the one containing the search string.

 [u'<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table>', u'<table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan \xe1llapota')

So I would like a regex that looks for "Ingatlan állapota" (shown as Ingatlan \xe1llapota in the escaped data above) and returns the value of the next cell, "felújított" (fel\xfaj\xedtott).

My current regex is the following: \bIngatlan állapota\s+(.*) I need it to account for the td tags and to limit how much of the string it returns after the search string (Ingatlan állapota).

Any help is much appreciated. Thanks!

As pointed out before, use XPath or CSS selectors instead of a regex:

from scrapy.http import TextResponse

# Label to look up; \xe1 is "á", so this is "Ingatlan állapota".
sterm = 'Ingatlan \xe1llapota'
# Sample HTML from the question (the second table was cut off in the original paste).
txt = '''<table> <tr> <td>Ingatlan \xe1llapota</td> <td>fel\xfaj\xedtott</td> </tr> <tr> <td>\xc9p\xedt\xe9s \xe9ve</td> <td>2018</td> </tr> <tr> <td>Komfort</td> <td>luxus</td> </tr> <tr> <td>Energiatan\xfas\xedtv\xe1ny</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Emelet</td> <td>1</td> </tr> <tr> <td>\xc9p\xfclet szintjei</td> <td class="is-empty">nincs megadva</td> </tr> <tr> <td>Lift</td> <td>van</td> </tr> <tr> <td>Belmagass\xe1g</td> <td>3 m vagy magasabb</td> </tr> <tr> <td>F\u0171t\xe9s</td> <td>g\xe1z (cirko)</td> </tr> <tr> <td>L\xe9gkondicion\xe1l\xf3</td> <td>van</td> </tr> </table> <table> <tr> <td>Akad\xe1lymentes\xedtett</td> <td>nem</td> </tr> <tr> <td>F\xfcrd\u0151 \xe9s WC</td> <td>k\xfcl\xf6n \xe9s atlan </td></tr></table>
'''

# Wrap the HTML in a response object so Scrapy's selectors can be used on it.
resp = TextResponse(url='abc', body=txt, encoding='utf-8')
# Find the <td> whose text equals the label, then take the text of the <td> right after it.
print(resp.xpath('.//td[.="' + sterm + '"]/following-sibling::td[1]/text()').extract())

Result:

$ python3 so_51590811.py 
['felújított']
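
If you still want a regex, as the question asks, a minimal sketch could look like the one below. It assumes the label and its value sit in two adjacent <td> cells with no nested tags inside them, which holds for the sample data but not for arbitrary HTML. The helper name next_cell and the shortened sample string are made up for illustration; only the standard re module is used.

```python
import re

# Shortened sample taken from the question's data (assumption: cells contain no nested tags).
html = (
    "<table> <tr> <td>Ingatlan állapota</td> <td>felújított</td> </tr> "
    "<tr> <td>Építés éve</td> <td>2018</td> </tr> "
    '<tr> <td>Energiatanúsítvány</td> <td class="is-empty">nincs megadva</td> </tr> </table>'
)


def next_cell(html, label):
    """Return the text of the <td> right after the <td> holding `label`, or None."""
    pattern = (
        # The cell that contains exactly the label (attributes on <td> are allowed).
        r"<td[^>]*>\s*" + re.escape(label) + r"\s*</td>"
        # The following cell; the lazy (.*?) stops at the first closing </td>,
        # which is what limits how much text is returned.
        r"\s*<td[^>]*>(.*?)</td>"
    )
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


print(next_cell(html, "Ingatlan állapota"))    # felújított
print(next_cell(html, "Energiatanúsítvány"))   # nincs megadva
```

The lazy quantifier is what addresses the "limit how much it returns" part of the question: the greedy (.*) in the original pattern runs to the end of the string, while (.*?)</td> stops at the end of the next cell. The XPath version above remains the more robust choice if the markup can vary.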

