Not able to extract data with Python Scrapy

I am not able to scrape the checkbox value and the address field from the following HTML:

<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'>
<tr id='row618534' >
   <td style='border-bottom:1px solid silver;background:#ffffff;' padding-bottom :10px;>
      <div id='r618534'>
      <div style='color:red; font-weight:bold; '>
         Warning... Duplicate Found!
      </div>
      <table width=100% border=0 cellpadding=2 cellspacing=0 style='margin-top:15px;border:4px #70797a; border-radius: 5px;'>
         <tr>
            <td style='background:lightgreen; width:55px;' valign=top> 
               <img src='../images/checkwhite.png' style='width:30px;'>
            </td>
            <td style='background:lightgreen;'>
               <input checked type=checkbox name=jobs[] value='618534'>
               <strong>2 Colonial Dr Newport Beach CA  92660</strong> &nbsp; &nbsp;   
            <td style='background:lightgreen;' align=right><input type='hidden' id='miles618534'><span style='margin-left:0px;' onclick="sub618534()"  class='button_input'> Process this order</span></span></td>
         <tr>
            <td>Your Input</td>
            <td  style='padding-left:28px;'>2 COLONIAL DR NEWPORT BEACH CA 92660</td>
            <td align=right><a href='customer_multi_jobs_review.php?del=1&djob=NjE4NTM0' style='color:blue;'><b><img title='Remove / Delete Order' src='../images/deletorder.png' style='width:30px;'></b></a></td>
         </tr>
      </table>
      <div style=' margin-left:40px;'>
      Exterior BPO - Light Photo Set (3 photos*)  &nbsp; &nbsp; &nbsp; <br>$9.00 &nbsp; &nbsp; &nbsp; We found a rep 4.6 miles from order.  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; <span style='color:silver'> Resolution 640x480  &nbsp; &nbsp; &nbsp;  GPS REQUIRED:  Yes  <span style='margin-left:10px;'>Datestamped </span> </span><br clear=all>
      <div style=float:left;'>

I need the ID from the value attribute of <input checked type=checkbox name=jobs[] value='618534'>, and the address from the cell that follows the text 'Your Input'.

I have tried many approaches, but I only get the ID; I cannot capture the address. My code is below:

for input_node in response.xpath('//input[@name="jobs[]"]'):
    id = input_node.xpath('./@value').extract_first()
    address = input_node.xpath('./following-sibling::table[1]//td[.="Your Input"]/following-sibling::td[1]/text()').extract_first()

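Try the following; it should fetch both fields. Your address XPath returns nothing because the input has no following-sibling table: the input sits inside a td of that table, and the 'Your Input' cell is in a later row. The following:: axis matches any td that comes after the input in document order, regardless of nesting.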

from scrapy import Selector

htmldoc = """
<table width=900 cellspacing=0 border=0 cellpadding=5 style='border-top:1px solid silver;border-left:1px solid silver;border-right:1px solid silver;'><tr id='row618534' ><td style='border-bottom:1px solid silver;background:#ffffff;' padding-bottom :10px;><div id='r618534'><div style='color:red; font-weight:bold; '>Warning... Duplicate Found!</div> <table width=100% border=0 cellpadding=2 cellspacing=0 style='margin-top:15px;border:4px #70797a; border-radius: 5px;'><tr><td style='background:lightgreen; width:55px;' valign=top><img src='../images/checkwhite.png' style='width:30px;'></td><td style='background:lightgreen;'><input checked type=checkbox name=jobs[] value='618534'>  <strong>2 Colonial Dr Newport Beach CA  92660</strong> &nbsp; &nbsp;   <td style='background:lightgreen;' align=right><input type='hidden' id='miles618534'><span style='margin-left:0px;' onclick="sub618534()"  class='button_input'> Process this order</span></span></td><tr><td>Your Input</td><td  style='padding-left:28px;'>2 COLONIAL DR NEWPORT BEACH CA 92660</td><td align=right><a href='customer_multi_jobs_review.php?del=1&djob=NjE4NTM0' style='color:blue;'><b><img title='Remove / Delete Order' src='../images/deletorder.png' style='width:30px;'></b></a></td></tr></table><div style=' margin-left:40px;'> Exterior BPO - Light Photo Set (3 photos*)  &nbsp; &nbsp; &nbsp; <br>$9.00 &nbsp; &nbsp; &nbsp; We found a rep 4.6 miles from order.  &nbsp; &nbsp; &nbsp;  &nbsp; &nbsp; &nbsp; <span style='color:silver'> Resolution 640x480  &nbsp; &nbsp; &nbsp;  GPS REQUIRED:  Yes  <span style='margin-left:10px;'>Datestamped </span> </span><br clear=all><div style=float:left;'>
"""
sel = Selector(text=htmldoc)
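for input_node in sel.xpath('//tr//input[@name="jobs[]"]'):
    # The checkbox's value attribute holds the order id.
    id_num = input_node.xpath('./@value').extract_first()
    # following:: reaches the "Your Input" cell even though it is not a sibling
    # of the input; [1] keeps only the cell right after it.
    address = input_node.xpath('./following::td[contains(text(),"Your Input")]/following-sibling::td[1]/text()').extract_first(default='').strip()
    print(f'{id_num}\n{address}')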

Output it produces:

618534
2 COLONIAL DR NEWPORT BEACH CA 92660
