Scrapy: TypeError: string indices must be integers, not str?
I wrote a spider that scrapes data from a news website:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from items import CravlingItem
import re


class CountrySpider(CrawlSpider):
    name = 'Post_and_Parcel_Human_Resource'
    allowed_domains = ['postandparcel.info']
    start_urls = ['http://postandparcel.info/category/news/human-resources/']
    rules = (
        Rule(LinkExtractor(allow='',
                           restrict_xpaths=(
                               '//*[@id="page"]/div[4]/div[1]/div[1]/div[1]/h1/a',
                               '//*[@id="page"]/div[4]/div[1]/div[1]/div[2]/h1/a',
                               '//*[@id="page"]/div[4]/div[1]/div[1]/div[3]/h1/a'
                           )),
             callback='parse_item',
             follow=False),
    )

    def parse_item(self, response):
        i = CravlingItem()
        i['title'] = " ".join(response.xpath('//div[@class="cd_left_big"]/div/h1/text()')
                              .extract()).strip() or " "
        i['headline'] = self.clear_html(
            " ".join(response.xpath('//div[@class="cd_left_big"]/div//div/div[1]/p')
                     .extract()).strip()) or " "
        i['text'] = self.clear_html(
            " ".join(response.xpath('//div[@class="cd_left_big"]/div//div/p').extract()).strip()) or " "
        i['url'] = response.url
        i['image'] = (" ".join(response.xpath('//*[@id="middle_column_container"]/div[2]/div/img/@src')
                               .extract()).strip()).replace('wp-content/', 'http://postandparcel.info/wp-content/') or " "
        i['author'] = " "
        # print("\n")
        # print(i)
        return i

    @staticmethod
    def clear_html(html):
        text = re.sub(r'<(style).*?</\1>(?s)|<[^>]*?>|\n|\t|\r', '', html)
        return text
I also wrote some code in the pipeline to refine the extracted text. Here is the pipeline:
from scrapy.conf import settings
from scrapy import log
import pymongo
import json
import codecs
import re


class RefineDataPipeline(object):
    def process_item(self, item, spider):
        # In this section: the below edits will be applied to all scrapy crawlers.
        item['text'] = str(item['text'].encode("utf-8"))
        replacements = {"U.S.":" US ", " M ":"Million", "same as the title":"", " MMH Editorial ":"", " UPS ":"United Parcel Service", " UK ":" United Kingdom "," Penn ":" Pennsylvania ", " CIPS ":" Chartered Institute of Procurement and Supply ", " t ":" tonnes ", " Uti ":" UTI ", "EMEA":" Europe, Middle East and Africa ", " APEC ":" Asia-Pacific Economic Cooperation ", " m ":" million ", " Q4 ":" 4th quarter ", "LLC":"", "Ltd":"", "Inc":"", "Published text":" Original text "}
        allparen = re.findall('\(.+?\)', item['text'])
        for item in allparen:
            if item[1].isupper() and item[2].isupper():
                replacements[str(item)] = ''
            elif item[1].islower() or item[2].islower():
                replacements[str(item)] = item[1:len(item)-1]
            else:
                try:
                    val = int(item[1:len(item)-1])
                    replacements[str(item)] = str(val)
                except ValueError:
                    pass

        def multireplace(s, replacements):
            substrs = sorted(replacements, key=len, reverse=True)
            regexp = re.compile('|'.join(map(re.escape, substrs)))
            return regexp.sub(lambda match: replacements[match.group(0)], s)

        item['text'] = multireplace(item['text'], replacements)
        item['text'] = re.sub('\s+', ' ', item['text']).strip()
        return item
But there is a serious problem that prevents the spider from scraping the data successfully:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/home/hathout/Desktop/updataed portcalls/thomas/thomas/pipelines.py", line 41, in process_item
    item['text'] = multireplace(item['text'], replacements)
TypeError: string indices must be integers, not str
I really do not know how to overcome the "TypeError: string indices must be integers, not str" error.
Short answer: by the time the failing line runs, the variable item is a string.
Long answer: in this section
allparen = re.findall('\(.+?\)', item['text'])
for item in allparen:
    ...
you are looping over allparen, which is a list of strings (possibly empty), and using the same variable name, item, as the loop variable. So once the loop has run at least once, item is a string, not a dict/Item object. Use a different name for the loop variable, for example:
for paren in allparen:
    if paren[1].isupper() and paren[2].isupper():
        ...
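To see the effect in isolation, the rebinding can be reproduced without Scrapy. This is a minimal sketch, assuming a plain dict stands in for the Scrapy item; the function name shadowing_demo and the sample sentences are made up for illustration.

```python
import re


def shadowing_demo(text):
    """Return the type name of `item` after the loop, and whether indexing failed."""
    item = {"text": text}  # a plain dict standing in for the Scrapy item (assumption)

    allparen = re.findall(r"\(.+?\)", item["text"])
    for item in allparen:  # rebinds `item` to each matched string
        pass

    try:
        item["text"]  # same kind of lookup as the failing pipeline line
    except TypeError:
        return type(item).__name__, True
    return type(item).__name__, False


print(shadowing_demo("Parcel volumes (UK) rose (slightly)"))  # ('str', True)
print(shadowing_demo("No parentheses here"))                  # ('dict', False)
```

The second call shows why the error may only appear on some pages: when the text contains no parentheses, the loop body never runs and item is never rebound.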
Basically, your original item variable is overwritten because you reuse the same name as the loop variable.
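Putting the rename into the whole pipeline could look roughly like the sketch below. It is written for Python 3 and makes a few assumptions that are not in the question: the replacement table is trimmed to two entries, the Python 2 specific str(item['text'].encode("utf-8")) step and the unused imports are left out so the example runs on its own, and the text handling is moved into a hypothetical helper named refine_text so it can be tried without Scrapy.

```python
import re

# Trimmed copy of the question's replacement table (shortened here for brevity).
BASE_REPLACEMENTS = {" UK ": " United Kingdom ", " m ": " million "}


def multireplace(s, replacements):
    # Longest keys first so a longer key wins over a shorter one at the same position.
    substrs = sorted(replacements, key=len, reverse=True)
    regexp = re.compile("|".join(map(re.escape, substrs)))
    return regexp.sub(lambda match: replacements[match.group(0)], s)


def refine_text(text):
    """Hypothetical helper: the body of process_item, working on a plain string."""
    replacements = dict(BASE_REPLACEMENTS)
    for paren in re.findall(r"\(.+?\)", text):  # `paren`, not `item`
        inner = paren[1:-1]
        if paren[1].isupper() and paren[2].isupper():
            replacements[paren] = ""      # drop acronyms such as "(DHL)"
        elif paren[1].islower() or paren[2].islower():
            replacements[paren] = inner   # keep the words, drop the brackets
        else:
            try:
                replacements[paren] = str(int(inner))  # keep bare numbers
            except ValueError:
                pass
    text = multireplace(text, replacements)
    return re.sub(r"\s+", " ", text).strip()


class RefineDataPipeline(object):
    def process_item(self, item, spider):
        # `item` is never rebound, so it is still the item when we return it.
        item["text"] = refine_text(item["text"])
        return item


item = {"text": "Sales in the UK rose to 5 m (estimated) for DHL (DHL) in (2016)"}
print(RefineDataPipeline().process_item(item, spider=None)["text"])
# Sales in the United Kingdom rose to 5 million estimated for DHL in 2016
```

Keeping the string work in a separate function is only one way to arrange it; simply renaming the loop variable inside process_item, as shown earlier, fixes the error just as well.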