
Pyspark: how to load online .gz log file into pyspark.sql.dataframe.DataFrame

I have a .gz log file hosted online, like this:

https://example.com/sample.log.gz

I can load it into a Python list with:

import urllib2
from StringIO import StringIO
import gzip

request = urllib2.Request('https://example.com/sample.log.gz')
response = urllib2.urlopen(request)
buf = StringIO(response.read())   # buffer the gzipped bytes in memory
f = gzip.GzipFile(fileobj=buf)    # wrap the buffer in a gzip reader
data = f.readlines()              # Python list of log lines

I then tried to convert this list to a DataFrame with:

sqlContext.createDataFrame(data)

but got:

TypeError: Can not infer schema for type: <type 'str'>

What would be an effective way to load the .gz log file directly into a pyspark.sql.dataframe.DataFrame?

Appreciate your help!

The problem comes from the form of your data variable. It is ['qwr', 'asd', 'wer'] but needs to be [['qwr'], ['asd'], ['wer']]: createDataFrame cannot infer a schema from bare strings, so each row must itself be a list with one element per column.

To do so you can use data = [[x] for x in data].

Then sqlContext.createDataFrame(data) works.
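
Putting the two pieces together, a minimal sketch (Python 2 API to match the question; the rstrip and the column name 'line' are my own additions, not from the original):

rows = [[line.rstrip('\n')] for line in data]    # one single-column row per log line
df = sqlContext.createDataFrame(rows, ['line'])  # 'line' is an assumed column name
df.show(5)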


Another solution is to load the file directly as a textFile (though this requires saving the file somewhere first), then apply the same transformation as above:

import tempfile
import shutil

# the .gz suffix lets Spark pick the gzip codec and decompress the lines
f = tempfile.NamedTemporaryFile(suffix='.gz', delete=True)
shutil.copyfileobj(response, f)
f.flush()
rdd = sc.textFile(f.name)
# same transformation as previously: wrap each line in a list
rdd_list = rdd.map(lambda x: [x])
df = sqlContext.createDataFrame(rdd_list)
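
Note that sc.textFile expects a filesystem path rather than a Python file object, hence the temporary file; and since Hadoop selects the decompression codec from the file extension, the .gz suffix on the temporary file is what makes the lines come back decompressed rather than as raw gzip bytes.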


