简体   繁体   中英

lxml xpath - Get all text within span tags

I'm trying to scrape pages that look something like this, that have 3 or more span tags per set. The goal is to get a list of dicts ex:

{'ctl02_lblAppearanceInfo1': 'Text',
'ctl02_lblAppearanceInfo2': 'Text'}

html:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............   </span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE..........</span>


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE..........</span>


<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE..............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE.............</span>

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE..........</span>

I've used

tree.xpath('//span[starts-with(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl")]')

with success, as it returns a element object with id and text properties, but if I come across something like this:

<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 1
<br>TEXT LINE 2
<br>TEXT LINE 3
<br>TEXT LINE 4</span>

It will only return back "TEXT LINE 1"

Use contains() and text() .

Here is the code:

from lxml import html

HTML = """<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText"> TEXT HERE 1..............   </span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2" class="ParamText"> TEXT HERE 2..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace" class="ParamText">TEXT HERE 3..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1" class="ParamText"> TEXT HERE 4..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2" class="ParamText"> TEXT HERE 5..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace" class="ParamText">TEXT HERE 6..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1" class="ParamText"> TEXT HERE 7..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2" class="ParamText"> TEXT HERE 8..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace" class="ParamText">TEXT HERE 9..............</span>
<span id="ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1" class="ParamText">
TEXT LINE 10.............
<br>TEXT LINE 11.............
<br>TEXT LINE 12.............
<br>TEXT LINE 13.............</span>
"""

tree = html.fromstring(HTML)
text_lines = tree.xpath('//span[contains(@id, "ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl")]')

results = dict()

for i, text_line in enumerate(text_lines):
    span_id = text_line.xpath('.//@id')[0]
    span_text = [x.strip() for x in text_line.xpath('.//text()')]
    results[i] = dict(id=span_id, texts=span_text)

print results

Output:

{
    0: {
        'texts': ['TEXT HERE 1..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1'
    },
    1: {
        'texts': ['TEXT HERE 2..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo2'
    },
    2: {
        'texts': ['TEXT HERE 3..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppSpace'
    },
    3: {
        'texts': ['TEXT HERE 4..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo1'
    },
    4: {
        'texts': ['TEXT HERE 5..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppearanceInfo2'
    },
    5: {
        'texts': ['TEXT HERE 6..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl03_lblAppSpace'
    },
    6: {
        'texts': ['TEXT HERE 7..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo1'
    },
    7: {
        'texts': ['TEXT HERE 8..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppearanceInfo2'
    },
    8: {
        'texts': ['TEXT HERE 9..............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl04_lblAppSpace'
    },
    9: {
        'texts': ['TEXT LINE 10.............', 'TEXT LINE 11.............', 'TEXT LINE 12.............', 'TEXT LINE 13.............'],
        'id': 'ctl00_ContentPlaceHolder1_CaseDetailParties1_gvParties_ctl02_gvAttyInfo_ctl02_lblAppearanceInfo1'
    }
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM