
Import python library on pyspark

I'm pretty new to Python.

I would like to read some XML files from S3 and query them. I am connected to AWS and have spun up some EC2 clusters, but I am not sure how to import the libraries I need to get at the data.

I think converting the XML to JSON with the xmlutils library and then using read.json from the SQLContext (which I do have access to) will work (see below):

from xmlutils.xml2json import xml2json

# convert the XML input to a JSON file that Spark can read
converter = xml2json("S3 logs", "output.json", encoding="utf-8")
converter.convert()

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

logs = sqlContext.read.json("output.json")
logs.registerTempTable("logs")

query_results = sqlContext.sql("SELECT * from logs...")
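Since read.json expects one JSON object per line, the conversion step can be sanity-checked outside Spark with only the standard library. This is a minimal sketch with hypothetical data and a made-up helper name; xmlutils does the same job for files on disk:

```python
import json
import xml.etree.ElementTree as ET

def xml_records_to_json(xml_text):
    """Convert a flat XML document into a JSON-lines string.

    Each child of the root becomes one JSON object whose keys are
    that child's sub-element tags -- the one-object-per-line shape
    that sqlContext.read.json expects.
    """
    root = ET.fromstring(xml_text)
    lines = []
    for record in root:
        lines.append(json.dumps({field.tag: field.text for field in record}))
    return "\n".join(lines)

sample = "<logs><log><ip>1.2.3.4</ip><status>200</status></log></logs>"
print(xml_records_to_json(sample))
# -> {"ip": "1.2.3.4", "status": "200"}
```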

EDIT

I am trying to use this block of code, from the Cloudera link, to get xmlutils installed in my virtual environment on Spark (SparkConf and SparkContext are already set):

def import_my_special_package(x):
    # importing inside the mapped function forces the import to run
    # on whichever worker processes each element
    import my.special.package
    return x

int_rdd = sc.parallelize([1, 2, 3, 4])
# collect() triggers the lazy map, so the import actually executes on the workers
int_rdd.map(lambda x: import_my_special_package(x)).collect()

I tried passing both xmlutils and 'xmlutils' as the x argument to the function, but neither worked. Am I doing something wrong? Thanks
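Note that the snippet above only tests whether a package is already importable on the executors; mapping an import over an RDD does not install anything, and the argument x is just data that passes through. A variant that reports availability instead of raising could be sketched like this (the sc.parallelize call assumes a live SparkContext, so it is shown as a comment):

```python
import importlib

def can_import(name):
    """Return True if the named module can be imported in this interpreter."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False

# With a live SparkContext this would report per-element results from the executors:
#   sc.parallelize(range(4)).map(lambda _: can_import("xmlutils")).collect()

print(can_import("json"))      # stdlib module, prints True
print(can_import("xmlutils"))  # False unless xmlutils is installed on this machine
```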

pip and virtualenv are installed by default for Python 2.7 on the 2015.03 AMIs: https://aws.amazon.com/amazon-linux-ami/2015.03-release-notes/

The site above shows how to access pip on the new AMI images.
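Since each node already ships with pip, the practical fix is to install xmlutils on every cluster node. A hedged sketch of driving pip from Python itself (nothing here is cluster-aware; it installs only into the local interpreter, and the helper name is made up):

```python
import subprocess
import sys

def pip_install(package):
    """Install a package into the current interpreter's environment by
    shelling out to `python -m pip`; returns True on success. Running
    this on each node mirrors `pip install xmlutils` at the shell."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", package],
        capture_output=True, text=True,
    )
    return result.returncode == 0

# Verify pip itself is reachable before trying to install anything:
version = subprocess.run(
    [sys.executable, "-m", "pip", "--version"],
    capture_output=True, text=True,
)
print(version.stdout.strip())
```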
