How to prevent tabular format when writing a parquet file into a CSV file using pandas.DataFrame?

I read a parquet file, the output of Spark MLlib, using pyarrow.parquet. The output consists of rows, each containing a pair: a word and a vector (each row is a word2vec pair), like the following:

  word1 "[-0.10812066 0.04352815 0.00529436 -0.0492562 -0.0974493533 0.275364409 -0.06501597 -0.3123745185 0.28186324 -0.05055101 0.06338456 -0.0842542 -0.10491376 -0.09692618 0.02451115 0.10766134]" word2 "[-0.10812066 0.04352815 0.1875908 -0.0492562 ... ... 

When I used DataFrame to write the results to a CSV file, I got this:

  word1 "[-0.10812066 0.04352815 0.00529436 -0.0492562 -0.0974493533 0.275364409 -0.06501597 -0.3123745185 0.28186324 -0.05055101 0.06338456 -0.0842542 -0.10491376 -0.09692618 0.02451115 0.10766134]" word2 "[-0.10812066 0.04352815 0.1875908 -0.0492562 ... ... 

As you can see, each vector is split across several lines at particular positions. How can I get CSV output that looks like what I read from the parquet file? My source code is here:

import pandas as pd
import pyarrow.parquet as pq

data = pq.read_pandas('C://Users//...//p.parquet', columns=['word', 'vector']).to_pandas()

df = pd.DataFrame(data)

df.to_csv('C://Users/...//p.csv', sep=" ", encoding='utf-8', columns=['word', 'vector'], index=False, header=False)

The DataFrame size is 47524 and its shape is (23762, 2).
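
A likely cause, sketched here as an assumption rather than something confirmed in the post: if the vector column comes back from pyarrow as NumPy arrays, each CSV cell holds the string form of an array, and NumPy wraps that string at its default line width. The vector below is made-up data standing in for one word2vec row.

```python
import numpy as np

# Hypothetical 16-component vector standing in for one word2vec row.
vec = np.random.default_rng(0).normal(size=16)

wrapped = str(vec)                    # string form of the array, as written to a text cell
flat = " ".join(str(x) for x in vec)  # the same numbers joined on a single line

print("\n" in wrapped)  # True: NumPy inserts line breaks into long arrays
print("\n" in flat)     # False
```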

After a lot of searching, I didn't find a direct solution to my problem, but I solved it using lists in Python.

import csv

import pyarrow.parquet as pq

data = pq.read_pandas('C://...//p.parquet', columns=['word', 'vector']).to_pandas()

# Build one flat row per word: the word followed by each vector component.
rows = [[word, *vector] for word, vector in zip(data['word'], data['vector'])]

# newline="" stops csv.writer from emitting blank lines between rows on Windows.
with open('C://...//f.csv', "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

With this, the output had the expected shape.
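
An alternative that stays within pandas, offered as a minimal sketch on made-up data rather than the method used above: join each vector into a single-line string before calling to_csv, so no cell contains a line break. The toy frame, the bracketed cell format, and the in-memory buffer are assumptions for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy frame standing in for the parquet data (hypothetical values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "word": ["word1", "word2"],
    "vector": [rng.normal(size=16), rng.normal(size=16)],
})

# Replace each array with a one-line string such as "[0.1 -0.2 ...]".
out = df.assign(
    vector=df["vector"].map(lambda v: "[" + " ".join(str(x) for x in v) + "]")
)

# Write to an in-memory buffer here; a file path works the same way.
buf = io.StringIO()
out.to_csv(buf, sep=" ", index=False, header=False)
csv_text = buf.getvalue()

print(len(csv_text.splitlines()))  # 2: one line per word
```

Because the separator is a space, pandas quotes the vector cell, which gives rows in the `word "[...]"` form shown in the question.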
