How do I create a new, rearranged CSV file from an existing file based on specific conditions?

Question

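I have a CSV file of tweets with 4 columns (user_id, status, tweet_id, tweet_text) and more than 50,000 rows. The first column, user_id, holds 4 unique IDs that repeat throughout the column. The second column, status, is a binary classification, 0 or 1 for each tweet. The third column is the tweet ID, and the fourth is the text of the tweet.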

The input file is already sorted on two columns, first by tweet_id and then by user_id. The file looks like this:

  Sr#,       user_id,     status,      tweet_id,                 tweet_text

   1,         3712,          1,         444567,       It is not easy to to do this you know...

   2,         3713,          0,         444567,       It is not easy to to do this you know...

   3,         3714,          1,         444567,       It is not easy to to do this you know...

   4,         3715,          1,         444567,       It is not easy to to do this you know...

   5,         3712,          1,         444572,       The process is yet to start

   6,         3713,          0,         444572,       The process is yet to start

   7,         3714,          0,         444572,       The process is yet to start

   8,         3712,          1,         444580,       I am betting on this

   9,         3714,          0,         444580,       I am betting on this

  10,         3715,          0,         444580,       I am betting on this

    and so on.......

Looking at the first 4 rows, the user_id values differ but the tweet_id and text are the same. Likewise for rows 5, 6 and 7: the user_id differs but the tweet_id and text are the same.

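I need to write a new CSV file in which, for each tweet_id and text, every user ID from the first column (4 in this example) becomes a new column, and the status value for that tweet is written under the matching user ID column. If a user_id has no status value for a tweet, the cell for that user_id is left blank.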

The output file might look like this:

Sr#,         tweet_text,                        tweet_id,    3712,    3713,    3714,   3715

1,    It is not easy to to do this you know...,  444567,       1,       0,       1,     1

2,    The process is yet to start,               444572,       1,       0,       0,

3,    I am betting on this,                      444580,       1,                0,     0

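My idea was that whenever the tweet_id changes, the tweet_id, the tweet_text, and the statuses for the four unique IDs are written to the new file. The code I used is as follows: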

 import csv
 import pandas as pd

 with open('combined_csvFinalSortedClean2.csv', 'w', newline='') as csvfile:
   filewriter = csv.writer(csvfile, delimiter=',', quotechar='|', quoting=csv.QUOTE_MINIMAL)
   filewriter.writerow(['tweet_id','tweet_text', '3712', '3713', '3714', '3714'])

 df = pd.read_csv('combined_csvFinalSortedClean2.csv', sep=',', header=None, index_col=False)

 with open("combined_csvFinalSorted2.csv", "r", encoding="utf-8") as csv_file:
   reader = csv.reader(csv_file, delimiter=',')
   header = next(reader) # get header
   curr_tweet=0
   curr_wid=0
   count=0

   for row in reader:
     wid=row[0]
     id=row[2]

     if (curr_tweet!=id) and (curr_wid!=wid):
      curr_tweet=id
      curr_wid=wid
      count=1
      df[0]=id
      df[1]=row[3]

     if wid==3712:
       df[2]=row[2]
     else: 
       df[2] = None

     if wid==3713:
       df[3]=row[2]
     else: 
       df[3]= None

     if wid==3714:
       df[4]=row[2]
     else: 
       df[4] = None

     if wid==3715:
       df[5]=row[2]
     else: 
       df[5] = None

     df.to_csv('output_file.csv', sep=',', encoding='utf-8', index=False)
     count+=1

     #else:
       #None
       #count+=1

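I tried this, but the problem is that pandas' to_csv method writes only the last row to the new output file, and nothing is written to the four unique ID columns according to the given if...else conditions. I would appreciate some help.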

Thanks.

Answer 1

Here is one way to do it using pivot_table:

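import pandas as pd

# Load the sorted input file named in the question.
df = pd.read_csv('combined_csvFinalSorted2.csv')

# One row per (tweet_id, tweet_text), one column per user_id.
newdf = (pd
        .pivot_table(df,
              index=['tweet_id', 'tweet_text'],
              columns=['user_id'],
              values='status',
              fill_value=0)   # missing (tweet, user) pairs are filled with 0
        .reset_index())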

print(newdf)

user_id  tweet_id                                       tweet_text  3712  \
0          444567         It is not easy to to do this you know...     1   
1          444572                      The process is yet to start     1   
2          444580                             I am betting on this     1   

user_id  3713  3714  3715  
0           0     1     1  
1           0     0     0  
2           0     0     0
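Note that `fill_value=0` turns a missing status into 0, whereas the question asks for those cells to stay blank. A minimal sketch of one way to keep them blank and write the requested layout follows. It assumes the column names from the question; the inline `sample` text is a hypothetical stand-in for the real input file, and `aggfunc="first"` assumes each (tweet_id, user_id) pair occurs at most once.

```python
import io
import pandas as pd

# Hypothetical inline sample standing in for the real input file.
sample = io.StringIO(
    "user_id,status,tweet_id,tweet_text\n"
    "3712,1,444567,It is not easy to to do this you know...\n"
    "3713,0,444567,It is not easy to to do this you know...\n"
    "3714,1,444567,It is not easy to to do this you know...\n"
    "3715,1,444567,It is not easy to to do this you know...\n"
    "3712,1,444572,The process is yet to start\n"
    "3713,0,444572,The process is yet to start\n"
    "3714,0,444572,The process is yet to start\n"
    "3712,1,444580,I am betting on this\n"
    "3714,0,444580,I am betting on this\n"
    "3715,0,444580,I am betting on this\n"
)
df = pd.read_csv(sample)

# Column order for the user IDs, taken from the data rather than hard-coded.
user_cols = sorted(df["user_id"].unique().tolist())

wide = (
    df.pivot_table(index=["tweet_id", "tweet_text"],
                   columns="user_id",
                   values="status",
                   aggfunc="first")  # no fill_value: missing pairs stay NaN
      .astype("Int64")               # nullable integers keep 0/1 from becoming 0.0/1.0
      .reset_index()
)

# Reorder to match the requested layout and number the rows from 1.
wide = wide[["tweet_text", "tweet_id", *user_cols]]
wide.index = pd.RangeIndex(1, len(wide) + 1, name="Sr#")

# Missing values are written as empty fields by default (na_rep="").
csv_text = wide.to_csv()
print(csv_text)
```

To write a file instead of a string, pass a path to `to_csv`, for example `wide.to_csv('output_file.csv')`. With the sample above, the row for tweet 444572 should end with an empty field for 3715, and the row for tweet 444580 should have an empty field for 3713.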

Answer 1 (0, accepted, 2020-01-28 13:16:00)
