Pyspark: how to load online .gz log file into pyspark.sql.dataframe.DataFrame
So I have a .gz log file hosted online, like this:
https://example.com/sample.log.gz
I can load this into a Python list using this:
import urllib2
from StringIO import StringIO
import gzip
request = urllib2.Request('https://example.com/sample.log.gz')
response = urllib2.urlopen(request)
buf = StringIO(response.read())
f = gzip.GzipFile(fileobj=buf)
data = f.readlines() # Python list
I then tried to convert this list to a DataFrame using
sqlContext.createDataFrame(data)
but got
TypeError: Can not infer schema for type: <type 'str'>
What would be an effective way to load the .gz log file directly into a pyspark.sql.dataframe.DataFrame?
Appreciate your help!
The problem comes from the form of your data variable. It is ['qwr', 'asd', 'wer'] but needs to be [['qwr'], ['asd'], ['wer']] — each row passed to createDataFrame must be a sequence, not a bare string, or schema inference fails with the TypeError you saw.
To do so you can use data = [[x] for x in data]
Then sqlContext.createDataFrame(data)
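The reshaping step on its own is plain Python and can be sanity-checked without a Spark cluster:

```python
# What f.readlines() effectively yields: a flat list of log-line strings.
data = ['qwr', 'asd', 'wer']

# Wrap each line in its own one-element list, so that createDataFrame
# sees one row per line with a single (string) column.
rows = [[x] for x in data]
# rows is now [['qwr'], ['asd'], ['wer']]
```

When you then call sqlContext.createDataFrame(rows), you can also pass a column-name list such as ['line'] as the second argument so the column gets a meaningful name instead of the auto-generated `_1`.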
Another solution could be to load your file directly as a textFile (this requires saving the file somewhere first), then convert as presented above:
import tempfile
import shutil

# Keep the .gz suffix so sc.textFile decompresses the file transparently;
# response is the urllib2.urlopen result from the question.
f = tempfile.NamedTemporaryFile(suffix='.gz', delete=True)
shutil.copyfileobj(response, f)
f.flush()  # make sure the bytes are on disk before Spark reads the file
rdd = sc.textFile(f.name)
# same transformation as previously
rdd_list = rdd.map(lambda x: [x])
df = sqlContext.createDataFrame(rdd_list)
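Since urllib2 and StringIO exist only in Python 2, here is a minimal Python 3 sketch of the same download-decompress-reshape pipeline. The helper below is hypothetical (not from the original answer); only the final createDataFrame call assumes a running Spark context:

```python
import gzip

def gz_bytes_to_rows(raw):
    """Decompress gzipped bytes and wrap each log line in a one-element
    list -- the shape createDataFrame expects for one string column."""
    text = gzip.decompress(raw).decode('utf-8')
    return [[line] for line in text.splitlines()]

# Downloading in Python 3 would use urllib.request instead of urllib2
# (same example URL as in the question):
#   from urllib.request import urlopen
#   raw = urlopen('https://example.com/sample.log.gz').read()
#   df = sqlContext.createDataFrame(gz_bytes_to_rows(raw), ['line'])
```

This keeps everything in memory, so it is only suitable for log files small enough to fit on the driver; for large files the textFile route above is preferable.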