native python - dataframe for loop and insert records to db
I am working on AWS Glue, so I cannot currently use Pandas/NumPy, etc.
I have a dataframe of records that I need to process and update in a MySQL database. I need to check whether each record already exists and, if it does, do an INSERT ... ON DUPLICATE KEY UPDATE. For this reason, I need to loop through the dataframe using native Python. All the dataframe iterators I have found use pandas, but is there a way to do this without pandas?
Here is a sample dataframe:
df1 = sqlContext.createDataFrame([
    ('4001', '81A01', 'Portland, ME', 'NY'),
    ('4002', '44444', 'Portland, ME', 'NY'),
    ('4022', '33333', 'BANGALORE', 'KA'),
    ('5222', '88888', 'CHENNAI', 'TN')],
    ("zip_code_new", "territory_code_new", "territory_name_new", "state_new"))
I tried the following, but I got an error: "AttributeError: 'DataFrame' object has no attribute 'values'" (that attribute is pandas-specific; a Spark DataFrame does not have it):
for i in df1.values():
print i
UPDATE: The following code seems to work with native Python to loop through the dataframe. Psidom's code should also work, but I could not see the print results (foreach runs on the executors, so its output goes to the executor logs rather than the driver console).
arr = df1.collect()
for r in arr:
    print(r.zip_code_new)
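The collect-then-loop pattern above can be combined with a parameterized upsert. A minimal runnable sketch follows; it uses plain tuples in place of the collected Row objects (a Spark Row is a tuple subclass, so the same code applies) and the standard-library sqlite3 module as a stand-in for MySQL, since a live MySQL server is not assumed here. The table name `territories` is hypothetical; with MySQL you would open a connection via a driver such as pymysql and write `INSERT ... ON DUPLICATE KEY UPDATE` instead of SQLite's `ON CONFLICT` clause.

```python
import sqlite3

# Rows as they would come back from df1.collect(); Row objects unpack
# the same way as these plain tuples.
rows = [
    ('4001', '81A01', 'Portland, ME', 'NY'),
    ('4002', '44444', 'Portland, ME', 'NY'),
    ('4022', '33333', 'BANGALORE', 'KA'),
    ('5222', '88888', 'CHENNAI', 'TN'),
]

conn = sqlite3.connect(':memory:')
conn.execute("""
    CREATE TABLE territories (
        zip_code_new TEXT PRIMARY KEY,
        territory_code_new TEXT,
        territory_name_new TEXT,
        state_new TEXT
    )
""")

# SQLite upsert syntax; the MySQL equivalent is
# INSERT ... ON DUPLICATE KEY UPDATE territory_code_new = VALUES(territory_code_new), ...
UPSERT = """
    INSERT INTO territories VALUES (?, ?, ?, ?)
    ON CONFLICT(zip_code_new) DO UPDATE SET
        territory_code_new = excluded.territory_code_new,
        territory_name_new = excluded.territory_name_new,
        state_new = excluded.state_new
"""

for r in rows:
    conn.execute(UPSERT, r)
conn.commit()
```

Because the statement is an upsert keyed on the primary key, re-running the loop (or feeding it a row with an existing zip code) updates the row in place instead of raising a duplicate-key error.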
Thanks
You don't use a for loop on a Spark dataframe; it has a foreach method to loop through rows. For example, we can print the zip_code_new in each row as follows:
def process_row(r):
    # your sql statement may go here
    print('zip_code_new: ', r.zip_code_new)

df1.foreach(process_row)
#('zip_code_new: ', u'4002')
#('zip_code_new: ', u'5222')
#('zip_code_new: ', u'4022')
#('zip_code_new: ', u'4001')
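For the database-update use case in the question, foreachPartition is usually preferred over foreach, because it lets each executor open one DB connection per partition instead of one per row. Below is a hedged sketch of that pattern: the function body is self-contained and runnable (sqlite3 and a local `territories.db` file stand in for a MySQL connection, which is not assumed to be available here), and the table name and column list mirror the sample dataframe. With MySQL you would replace the sqlite3.connect call with a driver connection (e.g. pymysql) and the `ON CONFLICT` clause with `ON DUPLICATE KEY UPDATE`.

```python
import sqlite3

DB_PATH = 'territories.db'  # placeholder path for this sketch

def process_partition(rows):
    """Runs once per partition on an executor: one connection, batched upserts."""
    conn = sqlite3.connect(DB_PATH)
    try:
        conn.execute("""
            CREATE TABLE IF NOT EXISTS territories (
                zip_code_new TEXT PRIMARY KEY,
                territory_code_new TEXT,
                territory_name_new TEXT,
                state_new TEXT
            )
        """)
        # Row objects are tuple subclasses, so tuple(r) yields the values
        # in schema order for the parameterized statement.
        conn.executemany(
            """
            INSERT INTO territories VALUES (?, ?, ?, ?)
            ON CONFLICT(zip_code_new) DO UPDATE SET
                territory_code_new = excluded.territory_code_new,
                territory_name_new = excluded.territory_name_new,
                state_new = excluded.state_new
            """,
            (tuple(r) for r in rows),
        )
        conn.commit()
    finally:
        conn.close()

# On the Spark side this would be wired up as:
# df1.rdd.foreachPartition(process_partition)
```

Note that process_partition executes on the executors, which is also why print output from foreach does not appear on the driver console.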