This is source code from a website, http://www.example.com, and I want to use a Scrapy crawler to extract every occurrence of THIS IS A TEXT.
<tr>
<td>
<table>
<tr>
<td colspan="5" style="text-align:left;padding-left:4px;" class="category"> <img src="http://www.example.com/images/menu.gif">
THIS IS A TEXT </td>
</tr>
<tr>
<td class="date" colspan="5">THIS IS A TEXT</td>
</tr>
<tr>
<td style="test-align:left;width:40px;">THIS IS A TEXT</td>
<td style="padding-right:4px; width:180px;text-align:right">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"> <nobr><a id="I1" name="I1"
href="javascript:MoreInformation(1,'1141','1563513','TT','home');">
THIS IS A TEXT</a></nobr>
</td>
<td style="padding-left:5px; width:180px;text-align:left">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"></td>
</tr>
<tr>
<td style="test-align:left;width:40px;">THIS IS A TEXT </td>
<td style="padding-right:4px; width:180px;text-align:right">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"> THIS IS A TEXT </td>
<td style="padding-left:5px; width:180px;text-align:left">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"></td>
</tr>
</table>
</td>
</tr>
This is my scrapy_project.py. I tried to extract everything from the td elements with rows = hxs.select('.//td'), but I don't know how to extract each "THIS IS A TEXT" separately. The output also contains strings like this: u'\\n\\t\\t\\t\\t\\t\\t\\t\\t. Can someone help me, please?
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dirbot.items import Website


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/",
        "",
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@id="content"]//table/tr')
        items = []
        for row in rows:
            item = Website()
            item["job"] = row.select("td[1]/text()").extract()
            item["description"] = row.select("td[0]/a/nobr/text()").extract()
            item["name"] = row.select("td[2]/text()").extract()
            items.append(item)
        return items
Another question: how can I remove this: u'\\n\\t\\t\\t\\t\\t\\t\\t\\t
To remove the \\n\\t\\t\\t\\t\\t\\t\\t\\t you can use a regular expression. In your code, call .re() instead of .extract(), like this:
row.select("td[0]/a/nobr/text()").re('[^\t\n]+')
This keeps only the runs of characters that are not tabs or newlines, so the \\n\\t is dropped. Hope this helps :)
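As a rough illustration of the cleanup step, here is a minimal sketch that uses only the Python standard library as a stand-in for Scrapy's selectors. The ROW fragment is a simplified, well-formed copy of one row from the question with made-up cell values (TEXT 1 to TEXT 4), and the helper names clean and cell_texts are hypothetical, not part of Scrapy. It shows that the pattern [^\t\n]+ drops tabs and newlines but keeps surrounding spaces, and that collapsing all whitespace and stripping the ends may give a tidier result.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for one data row of the page.
# Assumption: cell values are made up; the real page is messier HTML.
ROW = """<tr>
<td style="width:40px;">
\t\t\tTEXT 1 </td>
<td style="width:180px;">
\t\t\tTEXT 2 </td>
<td style="width:40px;"> <nobr><a id="I1" name="I1" href="#">
\t\t\tTEXT 3</a></nobr>
</td>
<td style="width:180px;">
\t\t\tTEXT 4 </td>
<td style="width:40px;"></td>
</tr>"""


def clean(text):
    """Collapse runs of whitespace (spaces, tabs, newlines) and trim the ends."""
    return re.sub(r"\s+", " ", text or "").strip()


def cell_texts(row):
    """Return the cleaned text of every non-empty <td> in a row element."""
    texts = (clean("".join(td.itertext())) for td in row.findall("td"))
    return [t for t in texts if t]


raw = "\n\t\t\tTEXT 1 "
# Same pattern as the .re() call above: tabs/newlines go, the trailing space stays.
print(re.findall(r"[^\t\n]+", raw))   # ['TEXT 1 ']
print(clean(raw))                     # TEXT 1

row = ET.fromstring(ROW)
print(cell_texts(row))                # ['TEXT 1', 'TEXT 2', 'TEXT 3', 'TEXT 4']
```

In the spider itself, the same clean function could be applied to each string in the list that .extract() returns. Note also that XPath positions start at 1, so td[0] matches nothing, and in the posted HTML the a element sits inside nobr (td[3]/nobr/a), not the other way round.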