
How do I extract an element's HTML source using Scrapy's XPath selectors?

I wrote the following code:

from scrapy import Selector
html = '''
<html><head></head><body><table>

<tr> <td>a1</td> <td>b1</td> </tr>
<tr> <td>a2</td> <td>b2</td> </tr>

</table></body></html>
'''

selector = Selector(text=html)
temp = selector.xpath("//td").extract()
print(temp)

and expected to get the following result:

[
'<td>a1</td>',
'<td>b1</td>',
'<td>a2</td>',
'<td>b2</td>'
]

But what I actually got was this, where each entry runs on past the closing </td> to the end of the document:

[
'<td>a1</td> <td>b1</td> </tr>\n<tr> <td>a2</td> <td>b2</td> </tr>\n</table>\n</body>\n</html>\n', 
'<td>b1</td> </tr>\n<tr> <td>a2</td> <td>b2</td> </tr>\n</table>\n</body>\n</html>\n', 
'<td>a2</td> <td>b2</td> </tr>\n</table>\n</body>\n</html>\n', 
'<td>b2</td> </tr>\n</table>\n</body>\n</html>\n'
]


However, with '/text()' appended to the XPath expression

temp = selector.xpath("//td/text()").extract()

the result is what I would expect:

['a1', 'b1', 'a2', 'b2']

This is probably a simple question, but I can't find the cause.

I tried 'extract', 'extract_first', 'get', and 'getall'; they all have the same problem.

I don't know what's wrong. Any help would be appreciated.

I solved this problem by uninstalling Anaconda and installing a plain Python instead... which is strange.
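Since switching environments made the problem go away, it may be worth comparing package versions between the broken and the working install. Below is a minimal sketch of how that could be done. The helper names `installed_versions` and `libxml2_versions` are made up for illustration, and the idea that a mismatch between lxml and the libxml2 it loads is the culprit is only an assumption, not something confirmed in this thread.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def libxml2_versions():
    """Return (runtime, compiled-against) libxml2 versions as seen by lxml.

    Returns None when lxml is not importable in this environment.
    """
    try:
        from lxml import etree
    except ImportError:
        return None
    return etree.LIBXML_VERSION, etree.LIBXML_COMPILED_VERSION


if __name__ == "__main__":
    # Hypothetical check: run this in both environments and compare the output.
    for name, ver in installed_versions(["scrapy", "parsel", "lxml"]).items():
        print(f"{name}: {ver or 'not installed'}")
    print("libxml2 (runtime, compiled):", libxml2_versions())
```

If the two libxml2 tuples differ, or the versions differ between the Anaconda and the plain Python install, that would at least narrow down where to look.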
