tweepy elasticsearch - streaming sample tweets and hashtags
I am trying to set up tweepy to stream tweets into Elasticsearch. However, I seem to be having problems streaming sample tweets without filtering by hashtag or location. I have tried stream.sample(), but it gives me errors:
{u'delete': {u'status': {u'user_id_str': u'1538141671', u'user_id': 1538141671, u'id': 972190631614406656, u'id_str': u'972190631614406656'}, u'timestamp_ms': u'1520623506593'}}
Traceback (most recent call last):
File "sentiment2.py", line 98, in <module>
stream.sample()
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 419, in sample
self._start(async)
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 361, in _start
self._run()
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 294, in _run
raise exception
KeyError: 'text'
or this error:
File "sentiment2.py", line 98, in <module>
stream.sample()
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 419, in sample
self._start(async)
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 361, in _start
self._run()
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 294, in _run
raise exception
IndexError: list index out of range
These errors don't necessarily happen straight away. I can see some tweets being printed to the console, but none of them are actually indexed, since the number of documents in the Elasticsearch index is not increasing.
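The first payload printed above is a "delete" notice rather than a tweet, and the sample stream mixes such notices in with real statuses, which is one plausible cause of the KeyError on 'text'. A minimal sketch of guarding against non-tweet messages (is_tweet is a hypothetical helper name, and the payloads below are made-up examples):

```python
import json

def is_tweet(dict_data):
    """Return True only for real tweet payloads; the sample stream also
    delivers 'delete' and 'limit' notices that carry no 'text' key."""
    return "text" in dict_data

# A delete notice shaped like the one printed above (hypothetical payload):
delete_msg = json.loads('{"delete": {"status": {"id": 972190631614406656}}}')
tweet_msg = json.loads('{"text": "hello world", "user": {"screen_name": "someone"}}')

print(is_tweet(delete_msg))  # False: skip it instead of indexing
print(is_tweet(tweet_msg))   # True: safe to read dict_data["text"]
```

Inside on_data, returning True early for such messages would let the stream continue without raising.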
Also, I seem to be having a problem getting the hashtags from the JSON object. When I switch to filtering by hashtags to test retrieving them, I get the error below. I believe it is some sort of incompatible object type, but I'm not sure how to fix it:
File "sentiment2.py", line 99, in <module>
stream.filter(track=['#EUref', '#Brexit'])
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 445, in filter
self._start(async)
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 361, in _start
self._run()
File "/usr/local/lib/python2.7/dist-packages/tweepy/streaming.py", line 294, in _run
raise exception
elasticsearch.exceptions.RequestError: TransportError(400, u'mapper_parsing_exception', u'object mapping for [hashtags] tried to parse field [hashtags] as object, but found a concrete value')
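The mapping (shown further down) declares hashtags as an object with subfields, while the indexing code sends a bare string, hence the mapper_parsing_exception; and tweets with no hashtags make entities.hashtags an empty list, which would explain the IndexError. One way to sidestep both, sketched here with a hypothetical helper name (extract_hashtags):

```python
# entities.hashtags is a list of objects like {"text": "...", "indices": [...]};
# indexing [0]["text"] raises IndexError when the tweet has no hashtags.
def extract_hashtags(dict_data):
    """Collect every hashtag text; returns [] for tweets without hashtags."""
    return [h["text"] for h in dict_data.get("entities", {}).get("hashtags", [])]

with_tags = {"entities": {"hashtags": [{"text": "Brexit", "indices": [0, 7]}]}}
no_tags = {"entities": {"hashtags": []}}

print(extract_hashtags(with_tags))  # ['Brexit']
print(extract_hashtags(no_tags))    # []
```

A list of strings also indexes cleanly into a flat Elasticsearch field, whereas a single string cannot be parsed into an object-mapped field.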
My code:
import json
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
from elasticsearch import Elasticsearch
from datetime import datetime
# import twitter keys and tokens
from config import *
# create instance of elasticsearch
es = Elasticsearch()
indexName = "test_new_fields"
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
class TweetStreamListener(StreamListener):

    # on success
    def on_data(self, data):
        # decode json
        dict_data = json.loads(data)  # data is a json string
        print(dict_data)

        # pass tweet into TextBlob
        tweet = TextBlob(dict_data["text"])

        # determine if sentiment is positive, negative, or neutral
        if tweet.sentiment.polarity < 0:
            sentiment = "negative"
        elif tweet.sentiment.polarity == 0:
            sentiment = "neutral"
        else:
            sentiment = "positive"

        # output polarity sentiment and tweet text
        print(str(tweet.sentiment.polarity) + " " + sentiment + " " + dict_data["text"])

        coord = dict_data["coordinates"]
        if coord is not None:
            lan = dict_data["coordinates"][0]
            lat = dict_data["coordinates"][1]
        else:
            coord = "None"

        es.indices.put_settings(index=indexName, body={"index.blocks.write": False})

        # add text and sentiment info to elasticsearch
        es.index(index=indexName,
                 doc_type="test-type",
                 body={"author": dict_data["user"]["screen_name"],
                       "date": dict_data["created_at"],  # unfortunately this gets stored as a string
                       "location": dict_data["user"]["location"],  # user location
                       "followers": dict_data["user"]["followers_count"],
                       "friends": dict_data["user"]["friends_count"],
                       "time_zone": dict_data["user"]["time_zone"],
                       "lang": dict_data["user"]["lang"],
                       # "timestamp": float(dict_data["timestamp_ms"]),  # double not recognised as date
                       "timestamp": dict_data["timestamp_ms"],
                       "datetime": datetime.now(),
                       "message": dict_data["text"],
                       "hashtags": dict_data["entities"]["hashtags"][0]["text"],
                       # "retweetCount": dict_data["retweet_count"],
                       "polarity": tweet.sentiment.polarity,
                       "subjectivity": tweet.sentiment.subjectivity,
                       "sentiment": sentiment,
                       # handle geo data
                       "coordinates": coord
                       # if coord is not None:
                       #     "coordinates": dict_data["coordinates"]
                       #     "lan": dict_data["coordinates"][0]
                       #     "lat": dict_data["coordinates"][1]
                       # else:
                       #     "coordinates": "None"
                       })
        return True

    # on failure
    def on_error(self, status):
        print(status)


if __name__ == '__main__':
    # create instance of the tweepy tweet stream listener
    listener = TweetStreamListener()
    # set twitter keys/tokens
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    # create instance of the tweepy stream
    stream = Stream(auth, listener)
    stream.sample()
    # search twitter for these keywords
    # stream.filter(track=['#EUref', '#Brexit'])
The mapping:
{
"test_new_fields" : {
"mappings" : {
"test-type" : {
"properties" : {
"author" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"coordinates" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"country" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"countrycode" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"date" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"datetime" : {
"type" : "date"
},
"followers" : {
"type" : "long"
},
"friends" : {
"type" : "long"
},
"geoEnabled" : {
"type" : "boolean"
},
"hashtags" : {
"properties" : {
"indices" : {
"type" : "long"
},
"text" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
},
"lang" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"location" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"message" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"polarity" : {
"type" : "float"
},
"sentiment" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"subjectivity" : {
"type" : "float"
},
"time_zone" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"timestamp" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
}
}
}
Your hashtags field is an object field with a subfield called indices - why build an object with only one field? It makes no sense:
"hashtags" : {
"properties" : {
"indices" : {
"type" : "long"
}
If you don't want to change your index, you have to declare the subfield when indexing:
"hashtags": {"indices": int(dict_data["entities"]["hashtags"][0]["text"])},
#"retweetCount": dict_data["'retweet_count'"],
"polarity": tweet.sentiment.polarity,
But, if you can, I suggest making your hashtags field not an object wrapping a subfield, but a flat field directly.
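That flat-field suggestion could look like the following sketch, assuming you are free to recreate the index (the index name, mapping shape, and build_body helper are illustrative, not from the original post):

```python
# Hypothetical sketch: map "hashtags" as a plain keyword field rather than
# an object, then index hashtag texts as a simple list of strings.
new_mapping = {
    "mappings": {
        "test-type": {
            "properties": {
                "hashtags": {"type": "keyword"},  # flat field, no sub-object
                "message": {"type": "text"},
            }
        }
    }
}

def build_body(dict_data):
    """Build a document body whose hashtags value matches the flat mapping."""
    return {
        "message": dict_data["text"],
        "hashtags": [h["text"] for h in dict_data["entities"]["hashtags"]],
    }

doc = build_body({"text": "hi #EUref", "entities": {"hashtags": [{"text": "EUref"}]}})
print(doc["hashtags"])  # ['EUref']
```

With a mapping like this, the same document would be created with something like es.indices.create(index=..., body=new_mapping), and the mapper_parsing_exception goes away because the indexed value and the mapped type agree.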