Parsing specific columns of a CSV in Python

So I have this CSV and I would like to do the following:

Original data:

  Name  Place      Item
0   N1     P1        I1
1   N2     P2  I1,I3,I4
2   N3  P2,P3     I2,I5

Parsed data:

  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5

To put it in words: if a cell contains comma-separated values, I want to create new rows that each hold a single value, and drop the row that holds multiple values.

For example, N2 has I1, I3 and I4. Hence the new data gets 3 rows for N2, each containing one value only.

I want to make it dynamic so that all the combinations are reflected, as in the case of N3, which has 2 places and 2 items.

I am trying to use Python's pandas to do this. Some help would be appreciated.

Here is another option:

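import pandas as pd

# Turn the comma-separated strings into lists.
df['Place'] = df['Place'].str.split(',')
df['Item'] = df['Item'].str.split(',')

# Emit one row per (place, item) pair; *a collects the leading columns.
exploded = pd.DataFrame([
    a + [p, t] for *a, P, T in df.values
    for p in P for t in T
], columns=df.columns)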

And the output:

  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5
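
The same split-then-expand idea can also be sketched with `DataFrame.explode`, assuming pandas 0.25 or later is available. The helper name `explode_columns` and the sample frame are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample frame mirroring the data in the question (assumed values).
df = pd.DataFrame({
    'Name': ['N1', 'N2', 'N3'],
    'Place': ['P1', 'P2', 'P2,P3'],
    'Item': ['I1', 'I1,I3,I4', 'I2,I5'],
})


def explode_columns(frame, columns, sep=','):
    """Hypothetical helper: split each listed column on `sep`, one row per value."""
    out = frame.copy()
    for col in columns:
        # Turn 'a,b' into ['a', 'b'], then give every list element its own row.
        out[col] = out[col].str.split(sep)
        out = out.explode(col).reset_index(drop=True)
        # Remove whitespace left over from inputs such as 'a, b'.
        out[col] = out[col].str.strip()
    return out


exploded = explode_columns(df, ['Place', 'Item'])
print(exploded)
```

Exploding the columns one after another yields every place/item combination per name, because each exploded row carries the still-unsplit values of the remaining columns along with it.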

Here is a solution:

split_place = df['Place'].str.split(',', expand=True)\
    .stack().str.strip().reset_index(level=1, drop=True)
split_item = df['Item'].str.split(',', expand=True)\
    .stack().str.strip().reset_index(level=1, drop=True)

df_temp = df[['Name']].merge(
    split_place.rename('split_place'), 
    left_index=True, 
    right_index=True, 
    how='outer'
)

exploded_df = df_temp.merge(
    split_item.rename('split_item'), 
    left_index=True, right_index=True, 
    how='outer'
).reset_index(drop=True)\
.rename(columns={'split_place': 'Place', 'split_item': 'Item'})

PS: You need pandas v0.24.0 or later; earlier versions cannot merge a DataFrame with a named Series, so the merge won't work there.

You are effectively attempting to take the Cartesian product of the values in each row, then binding the results back into a DataFrame. As such, you could use itertools and do something like:

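from itertools import chain, product
import pandas as pd
# Split every cell into a list (pandas 2.1+ deprecates applymap in favor of DataFrame.map).
df_lists = df.applymap(lambda s: s.split(','))
# Cartesian product within each row, then chain all rows together.
pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)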

With your example input:

In [334]: df
Out[334]:
  Name  Place      Item
0   N1     P1        I1
1   N2     P2  I1,I3,I4
2   N3  P2,P3     I2,I5

In [336]: df_lists = df.applymap(lambda s: s.split(','))

In [337]: pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)
Out[337]:
  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5
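
To see what the Cartesian product does for a single row, here is a minimal sketch using the N3 row from the example; the variable names `row` and `combos` are only illustrative.

```python
from itertools import product

# The N3 row of the example, with each cell already split on commas.
row = [['N3'], ['P2', 'P3'], ['I2', 'I5']]

# product(*row) picks one value from each cell in every possible way.
combos = list(product(*row))
print(combos)
# [('N3', 'P2', 'I2'), ('N3', 'P2', 'I5'), ('N3', 'P3', 'I2'), ('N3', 'P3', 'I5')]
```

Each tuple becomes one output row, which is why N3 (2 places, 2 items) expands to 4 rows.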

You can use iterrows():

import pandas as pd

df = pd.DataFrame({'Name': ['N1', 'N2', 'N3'], 'Place': ['P1', 'P2', 'P2,P3'], 'Item': ['I1,', 'I1,I3,I4', 'I2,I5']})

# Drop stray leading/trailing commas (e.g. 'I1,').
df['Place'] = df['Place'].str.strip(',')
df['Item'] = df['Item'].str.strip(',')

# First pass: one row per place.
# DataFrame.append was removed in pandas 2.0, so rows are collected in a list.
rows = []
for _, row in df.iterrows():
    for place in row['Place'].split(','):
        new_row = row.copy()
        new_row['Place'] = place
        rows.append(new_row)
result = pd.DataFrame(rows).reset_index(drop=True)

# Second pass: one row per item.
rows = []
for _, row in result.iterrows():
    for item in row['Item'].split(','):
        new_row = row.copy()
        new_row['Item'] = item
        rows.append(new_row)
new_result = pd.DataFrame(rows).reset_index(drop=True)

Output:

  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5

This is a straightforward way to achieve your desired output.

You can avoid the use of pandas. If you want to stick with the standard csv module, you simply have to split each field on commas (',') and then iterate over the split elements.

Assuming the input delimiter is a semicolon (;) (I cannot know what it actually is, except that it cannot be a comma), the code could be:

import csv

with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd = csv.DictReader(fd, delimiter=';')
    wr = csv.writer(fdout)
    wr.writerow(rd.fieldnames)
    for row in rd:
        # Split each field on commas and skip empty pieces (e.g. after a trailing comma).
        for i in row['Item'].split(','):
            i = i.strip()
            if len(i) != 0:
                for p in row['Place'].split(','):
                    p = p.strip()
                    if len(p) != 0:
                        for n in row['Name'].split(','):
                            n = n.strip()
                            if len(n) != 0:
                                wr.writerow((n, p, i))

Output is:

Name,Place,Item
N1,P1,I1
N2,P2,I1
N2,P2,I3
N2,P2,I4
N3,P2,I2
N3,P3,I2
N3,P2,I5
N3,P3,I5
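
The nested loops can also be collapsed with itertools.product. The following is only a sketch under the same semicolon-delimiter assumption; the helper name `expand_rows` is hypothetical, and in-memory `io.StringIO` buffers stand in for the real input and output files.

```python
import csv
import io
from itertools import product


def expand_rows(src, dst, delimiter=';'):
    """Hypothetical helper: write one output row per combination of field values."""
    reader = csv.reader(src, delimiter=delimiter)
    writer = csv.writer(dst, lineterminator='\n')
    writer.writerow(next(reader))  # copy the header row unchanged
    for row in reader:
        # Split every field on commas and drop empty pieces such as the one after 'I1,'.
        cells = [[v.strip() for v in field.split(',') if v.strip()] for field in row]
        writer.writerows(product(*cells))


# In-memory stand-ins for the input and output files (assumed sample data).
src = io.StringIO('Name;Place;Item\nN1;P1;I1,\nN2;P2;I1,I3,I4\nN3;P2,P3;I2,I5\n')
dst = io.StringIO()
expand_rows(src, dst)
print(dst.getvalue())
```

Unlike the nested loops above, this version handles any number of columns, and it varies the rightmost column fastest, so the N3 rows come out as P2/I2, P2/I5, P3/I2, P3/I5.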
