Pyspark: how to load online .gz log file into pyspark.sql.dataframe.DataFrame
So I have a .gz log file hosted online, like this:
https://example.com/sample.log.gz
I can load this into a Python list using this:
import urllib2
from StringIO import StringIO
import gzip
request = urllib2.Request('https://example.com/sample.log.gz')
response = urllib2.urlopen(request)
buf = StringIO(response.read())
f = gzip.GzipFile(fileobj=buf)
data = f.readlines() # Python list
I then tried to convert this list to a DataFrame using
sqlContext.createDataFrame(data)
but got
TypeError: Can not infer schema for type: <type 'str'>
What would be an effective way to load the .gz log file directly into a pyspark.sql.dataframe.DataFrame?
Appreciate your help!
The problem comes from the form of your data variable. It is ['qwr', 'asd', 'wer'] but needs to be [['qwr'], ['asd'], ['wer']] — each row passed to createDataFrame must be a sequence, not a bare string, or schema inference fails with the TypeError you saw.
To do so you can use data = [[x] for x in data]
Then sqlContext.createDataFrame(data)
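The reshaping step on its own is plain Python and can be sanity-checked without a Spark cluster:

```python
# What f.readlines() effectively yields: a flat list of log-line strings.
data = ['qwr', 'asd', 'wer']

# Wrap each line in its own one-element list, so that createDataFrame
# sees one row per line with a single (string) column.
rows = [[x] for x in data]
# rows is now [['qwr'], ['asd'], ['wer']]
```

When you then call sqlContext.createDataFrame(rows), you can also pass a column-name list such as ['line'] as the second argument so the column gets a meaningful name instead of the auto-generated `_1`.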
Another solution could be to load your file directly as a textFile (this requires saving the file somewhere first), then convert as presented above:
import tempfile
import shutil

# Keep the .gz suffix so sc.textFile decompresses the file transparently;
# response is the urllib2.urlopen result from the question.
f = tempfile.NamedTemporaryFile(suffix='.gz', delete=True)
shutil.copyfileobj(response, f)
f.flush()  # make sure the bytes are on disk before Spark reads the file
rdd = sc.textFile(f.name)
# same transformation as previously
rdd_list = rdd.map(lambda x: [x])
df = sqlContext.createDataFrame(rdd_list)
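Since urllib2 and StringIO exist only in Python 2, here is a minimal Python 3 sketch of the same download-decompress-reshape pipeline. The helper below is hypothetical (not from the original answer); only the final createDataFrame call assumes a running Spark context:

```python
import gzip

def gz_bytes_to_rows(raw):
    """Decompress gzipped bytes and wrap each log line in a one-element
    list -- the shape createDataFrame expects for one string column."""
    text = gzip.decompress(raw).decode('utf-8')
    return [[line] for line in text.splitlines()]

# Downloading in Python 3 would use urllib.request instead of urllib2
# (same example URL as in the question):
#   from urllib.request import urlopen
#   raw = urlopen('https://example.com/sample.log.gz').read()
#   df = sqlContext.createDataFrame(gz_bytes_to_rows(raw), ['line'])
```

This keeps everything in memory, so it is only suitable for log files small enough to fit on the driver; for large files the textFile route above is preferable.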