Parsing specific columns of a CSV in Python

Question

I have this CSV and I would like to do the following:

Original data:

Parsed data:

In other words, if a column contains comma-separated values, I want to create a new column that holds only one value per row and delete the column that has multiple values.

For example, N2 has I1, I3 and I4. Hence the new data gets 3 rows for N2, each containing only one value.

I want to make it dynamic so that all combinations are reflected, as in the case of N3, which has 2 places and 2 items.

I am trying to use Python's pandas to do this. Some help would be appreciated.

Answer 1

Here is another option:

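import pandas as pd

# Turn the comma-separated strings into lists
df['Place'] = df['Place'].str.split(',')
df['Item'] = df['Item'].str.split(',')

# One output row per (place, item) pair; *a collects the leading columns (here: Name)
exploded = pd.DataFrame([
    a + [p, t] for *a, P, T in df.values
    for p in P for t in T
], columns=df.columns)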

And the output:

  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5
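
For comparison, the same result can probably be reached with `DataFrame.explode`, which is a different technique from the list comprehension above and needs pandas 0.25 or later. This is a minimal sketch, assuming the input frame matches the example shown in Answer 3; the index is reset between the two `explode` calls so that no step has to deal with duplicate index labels.

```python
import pandas as pd

# Assumed input, reconstructed from the example frame printed in Answer 3
df = pd.DataFrame({
    'Name': ['N1', 'N2', 'N3'],
    'Place': ['P1', 'P2', 'P2,P3'],
    'Item': ['I1', 'I1,I3,I4', 'I2,I5'],
})

exploded = (
    df.assign(Place=df['Place'].str.split(','),   # strings -> lists
              Item=df['Item'].str.split(','))
      .explode('Place')                           # one row per place
      .reset_index(drop=True)
      .explode('Item')                            # one row per item
      .reset_index(drop=True)
)
print(exploded)
```

Exploding one column after the other yields every place/item combination of a row, which is the Cartesian product the question asks for.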

Answer 2

Here is a solution:

# Split each column on commas and stack the pieces into one Series,
# keeping the original row index so the pieces can be merged back
split_place = df['Place'].str.split(',', expand=True)\
    .stack().str.strip().reset_index(level=1, drop=True)
split_item = df['Item'].str.split(',', expand=True)\
    .stack().str.strip().reset_index(level=1, drop=True)

df_temp = df[['Name']].merge(
    split_place.rename('split_place'),
    left_index=True,
    right_index=True,
    how='outer'
)

# Merging on the (now duplicated) index yields every place/item combination
exploded_df = df_temp.merge(
    split_item.rename('split_item'),
    left_index=True, right_index=True,
    how='outer'
).reset_index(drop=True)\
.rename(columns={'split_place': 'Place', 'split_item': 'Item'})

PS: You need pandas v0.24.0 or later, because merging a DataFrame with a named Series is not supported in earlier versions.

Answer 3

You are effectively attempting to take the Cartesian product of each row, then binding the result back into a DataFrame. As such, you could use itertools and do something like

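from itertools import chain, product
import pandas as pd

# Split every cell into a list (applymap is deprecated since pandas 2.1; DataFrame.map replaces it)
df_lists = df.applymap(lambda s: s.split(','))
pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)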

With your example input:

In [334]: df
Out[334]:
  Name  Place      Item
0   N1     P1        I1
1   N2     P2  I1,I3,I4
2   N3  P2,P3     I2,I5

In [336]: df_lists = df.applymap(lambda s: s.split(','))

In [337]: pd.DataFrame(chain.from_iterable(df_lists.apply(lambda row: product(*row), axis=1)), columns=df.columns)
Out[337]:
  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5
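
To make the per-row Cartesian product easier to follow, here is a minimal sketch of the same idea without pandas. The helper name `expand_row` and the sample `rows` are illustrative assumptions, not part of the answer; the sample values mirror the example input above.

```python
from itertools import product

def expand_row(row):
    """Split every field on commas and return all combinations as tuples."""
    fields = [
        [part.strip() for part in value.split(',') if part.strip()]
        for value in row
    ]
    # product(*fields) picks one element from each field in every possible way
    return list(product(*fields))

# Sample rows mirroring the example input
rows = [
    ('N1', 'P1', 'I1'),
    ('N2', 'P2', 'I1,I3,I4'),
    ('N3', 'P2,P3', 'I2,I5'),
]

expanded = [combo for row in rows for combo in expand_row(row)]
for combo in expanded:
    print(combo)
```

A row with 2 places and 2 items expands to 2 x 2 = 4 tuples, and a field without a comma contributes a single-element list, so it is simply repeated in every combination.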

Answer 4

You can use iterrows():

import pandas as pd

df = pd.DataFrame({'Name': ['N1', 'N2', 'N3'], 'Place':['P1', 'P2','P2,P3'], 'Item':['I1,', 'I1,I3,I4', 'I2,I5']})

# Drop stray leading/trailing commas such as the one in 'I1,'
df['Place'] = df['Place'].apply(lambda x: x.strip(','))
df['Item'] = df['Item'].apply(lambda x: x.strip(','))

# First pass: one row per place. DataFrame.append was removed in pandas 2.0,
# so the rows are collected in a list and the frame is built once.
rows = []
for _, curr_row in df.iterrows():
    for x in curr_row['Place'].split(','):
        rows.append({**curr_row.to_dict(), 'Place': x})
result = pd.DataFrame(rows)

# Second pass: one row per item
rows = []
for _, curr_row in result.iterrows():
    for x in curr_row['Item'].split(','):
        rows.append({**curr_row.to_dict(), 'Item': x})
new_result = pd.DataFrame(rows)

Output:

  Name Place Item
0   N1    P1   I1
1   N2    P2   I1
2   N2    P2   I3
3   N2    P2   I4
4   N3    P2   I2
5   N3    P2   I5
6   N3    P3   I2
7   N3    P3   I5

This is a straightforward way to achieve your desired output.

Answer 5

You can avoid the use of pandas. If you want to stick with the standard csv module, you simply have to split each field on the comma (',') and then iterate over the split elements.

The code could look like this, assuming the input delimiter is a semicolon (;) (I cannot know what it is, except that it cannot be a comma):

import csv

with open('input.csv', newline='') as fd, open('output.csv', 'w', newline='') as fdout:
    rd = csv.DictReader(fd, delimiter=';')
    wr = csv.writer(fdout)
    wr.writerow(rd.fieldnames)
    for row in rd:
        # Split each field on commas and skip empty pieces (e.g. from a trailing comma)
        for i in row['Item'].split(','):
            i = i.strip()
            if len(i) != 0:
                for p in row['Place'].split(','):
                    p = p.strip()
                    if len(p) != 0:
                        for n in row['Name'].split(','):
                            n = n.strip()
                            if len(n) != 0:
                                wr.writerow((n, p, i))

Output is:

Name,Place,Item
N1,P1,I1
N2,P2,I1
N2,P2,I3
N2,P2,I4
N3,P2,I2
N3,P3,I2
N3,P2,I5
N3,P3,I5
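
The nested loops can also be flattened with `itertools.product`. The following is a minimal sketch, assuming the same semicolon-delimited input; it reads from an in-memory string instead of `input.csv` so that it runs on its own, and `split_field` is a hypothetical helper name. Note that the rows of one record come out ordered by place first, then item, which differs from the order in the output above.

```python
import csv
import io
from itertools import product

# Assumed sample input; the semicolon delimiter is a guess, as in the answer
raw = "Name;Place;Item\nN1;P1;I1\nN2;P2;I1,I3,I4\nN3;P2,P3;I2,I5\n"

def split_field(value):
    """Split on commas, strip whitespace and drop empty pieces."""
    return [part.strip() for part in value.split(',') if part.strip()]

out = io.StringIO()
reader = csv.DictReader(io.StringIO(raw), delimiter=';')
writer = csv.writer(out, lineterminator='\n')
writer.writerow(reader.fieldnames)
for row in reader:
    columns = [split_field(row[name]) for name in reader.fieldnames]
    # One output row per combination of the split values
    writer.writerows(product(*columns))

result_text = out.getvalue()
print(result_text)
```

Because the columns are taken from `reader.fieldnames`, this version does not hard-code the column names and would handle additional comma-separated columns the same way.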
