Extract from table (Scrapy)

I would like to ask for help with parsing a table using Scrapy in Python 2. Here is my table: link to table. I need to get the values of the <td> tags. I tried the following Python code:

rows = resp.xpath('//*[@id="Vorlage_Infobox_Unternehmen"]')
if not rows:
    rows = resp.xpath('.//*[@id="Vorlage_Infobox_Unternehmen"]//table')
try:
    if rows:
        extract = lambda row, path: row.xpath(path).extract_first().strip()
        if '<th>' in str(rows):
            infobox = {extract(row, 'string(./th)'): extract(row, 'string(./td)') for row in rows}
        elif '<tr>' in str(rows):
            infobox = {extract(row, 'string(./td[1])'): extract(row, 'string(./td[2])') for row in rows}
        elif '<table>' in str(rows):
            infobox = {extract(row, 'string(./th)'): extract(row, 'string(./td)') for row in rows}
        else:
            infobox = {extract(row, 'string(./table/tbody/tr[1])'): extract(row, 'string(./td[1])') for row in rows}

But I am doing something wrong and cannot get what I want. Please help me understand my mistake.

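If you want to get the values of the <td> elements inside the <table>, you can select the table first and then query it with a relative XPath:

    table = resp.xpath('//table[@id="Vorlage_Infobox_Unternehmen"]')
    if table:
        # the leading dot keeps the search inside the selected table
        all_table_data = table.xpath('.//td')

When you call table.xpath('some_xpath'), the expression is evaluated relative to the element that was already selected only if it starts with a dot, as in .//td. A path that starts with // searches the whole document again, no matter which selector you call it on. You could also skip that test and do it directly: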

    all_table_data = resp.xpath('//table[@id="Vorlage_Infobox_Unternehmen"]//td')
