
Need to create a Pandas dataframe by reading csv file with random columns

I have the following csv file with records:

  • A 1, B 2, C 10, D 15
  • A 5, D 10, G 2
  • D 6, E 7
  • H 7, G 8

My column headers/names are: A, B, C, D, E, F, G

So my initial dataframe after using "read_csv" becomes:

A     B     C      D       E      F      G   
A 1   B 2   C 10   D 15   NaN    NaN    NaN
A 5   D 10  G 2    NaN    NaN    NaN    NaN
D 6   E 7   NaN    NaN    NaN    NaN    NaN
H 7   G 8   NaN    NaN    NaN    NaN    NaN
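
For reference, a frame like the one above can be reproduced with a short sketch along these lines. It assumes the file content is held in an in-memory string (the variable name `raw` is made up for illustration) and passes `skipinitialspace=True` so the cells do not keep the space that follows each comma.

```python
from io import StringIO

import pandas as pd

# Assumed file content, copied from the records listed in the question
raw = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

# names= supplies the seven headers; rows with fewer fields are padded with NaN
df = pd.read_csv(StringIO(raw), header=None, names=list("ABCDEFG"),
                 skipinitialspace=True)
print(df)
```

The cells simply fill the columns from left to right, which is why "D 10" ends up under B in the second row.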

Each value can be separated into [column name][column value], so A 1 means col=A and value=1, D 15 means col=D and value=15, and so on.
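
As a minimal illustration of that convention, a single cell can be split in plain Python. The helper name `parse_cell` is hypothetical and only meant to make the rule concrete; it assumes every cell holds exactly one name and one numeric value.

```python
def parse_cell(cell):
    """Split a cell such as 'D 15' into its column name and numeric value."""
    # str.split() with no argument splits on any whitespace and ignores
    # leading or trailing spaces
    name, value = cell.split()
    return name, float(value)


print(parse_cell("A 1"))
print(parse_cell("D 15"))
```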

What I want is to assign each value to the appropriate column based on its column name, giving a dataframe that looks like this:

A     B     C      D       E      F      G   
A 1   B 2   C 10   D 15   NaN    NaN    NaN
A 5   NaN   NaN    D 10   NaN    NaN    G 2
NaN   NaN   NaN    D 6    E 7    NaN    NaN
NaN   NaN   NaN    NaN    NaN    NaN    G 8

Even better, just the values alone:

A     B     C      D       E      F      G   
1     2     10     15      NaN    NaN    NaN
5     NaN   NaN    10      NaN    NaN    2
NaN   NaN   NaN    6       7      NaN    NaN
NaN   NaN   NaN    NaN     NaN    NaN    8

You can loop through the rows with apply (axis=1) and construct a pandas Series for each row from the key-value pairs obtained by splitting each cell. The newly constructed Series are aligned automatically by their index. Note that the result has no F column but an extra H column, which may not be what you need; dropping H and adding an all-NaN F column should be straightforward:

# Split each cell on whitespace into [name, value]; NaN cells are not lists and are skipped
df.apply(lambda r: pd.Series({x[0]: x[1] for x in r.str.split()
                                    if isinstance(x, list) and len(x) == 2}), axis = 1)


#     A   B   C   D   E   G   H
#0    1   2  10  15 NaN NaN NaN
#1    5 NaN NaN  10 NaN   2 NaN
#2  NaN NaN NaN   6   7 NaN NaN
#3  NaN NaN NaN NaN NaN   8   7
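
The final adjustment mentioned above (dropping H and adding an empty F) can be sketched with reindex. The frame `out` below is rebuilt by hand from the printed result, purely for illustration, and the target column list is assumed to be the A to G headers from the question.

```python
import pandas as pd

# Hand-built copy of the result printed above (values are strings, as produced by split)
out = pd.DataFrame(
    [
        {"A": "1", "B": "2", "C": "10", "D": "15"},
        {"A": "5", "D": "10", "G": "2"},
        {"D": "6", "E": "7"},
        {"G": "8", "H": "7"},
    ],
    columns=list("ABCDEGH"),
)

# reindex keeps only the listed columns: H is dropped, F is created as all NaN
wanted = out.reindex(columns=list("ABCDEFG"))
print(wanted)
```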

Apply solution:

Split by whitespace, remove NaN rows with dropna, set the index with set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:

print (df.apply(lambda x: x.str.split(expand=True)
                               .dropna()
                               .set_index(0)
                               .squeeze(), axis=1)
         .reindex(columns=list('ABCDEFGH')))

     A    B    C    D    E   F    G    H
0    1    2   10   15  NaN NaN  NaN  NaN
1    5  NaN  NaN   10  NaN NaN    2  NaN
2  NaN  NaN  NaN    6    7 NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN NaN    8    7
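
To see what the lambda does to one row, the chain can be unrolled step by step. This is only a sketch on a hand-written row; the intermediate names (`parts`, `named`, `series`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# One row of the raw frame: three filled cells and one missing cell
row = pd.Series(["A 5", "D 10", "G 2", np.nan])

parts = row.str.split(expand=True)   # two columns: 0 = name, 1 = value
parts = parts.dropna()               # drop the row produced by the missing cell
named = parts.set_index(0)           # names become the index, one column (1) remains
series = named.squeeze(axis=1)       # one-column DataFrame -> Series

print(series)
```

Passing `axis=1` to squeeze is a small deviation from the answer's code: a plain `squeeze()` on a row that contains a single pair would return a scalar instead of a Series.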

Stack solution:

Use stack to create a Series, split by whitespace to create new columns, append the column holding the new column names (A, B, ...) to the index with set_index, convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, then unstack, reindex by the new column names (which adds the missing columns filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):

print (df.stack()
         .str.split(expand=True)
         .set_index(0, append=True)
         .squeeze()
         .reset_index(level=1, drop=True)
         .unstack()
         .reindex(columns=list('ABCDEFGH'))
         .astype(float)
         .rename_axis(None, axis=1))

     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
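
The same chain can be written with named intermediates, which may make each reshaping step easier to inspect. This is a sketch rather than the answer's exact code: the input frame is typed in by hand, an explicit dropna() is added after stack in case the installed pandas version keeps missing cells, and the columns are restricted to A to G as in the question.

```python
import pandas as pd

# Hand-written copy of the raw frame (None marks a missing cell)
df = pd.DataFrame([
    ["A 1", "B 2", "C 10", "D 15"],
    ["A 5", "D 10", "G 2", None],
    ["D 6", "E 7", None, None],
    ["H 7", "G 8", None, None],
])

stacked = df.stack().dropna()                  # long Series indexed by (row, old column)
pairs = stacked.str.split(expand=True)         # column 0 = name, column 1 = value
pairs[1] = pairs[1].astype(float)              # values as floats
keyed = pairs.set_index(0, append=True)[1]     # index is now (row, old column, name)
keyed = keyed.reset_index(level=1, drop=True)  # drop the old column level
wide = keyed.unstack()                         # names become the columns
wide = wide.reindex(columns=list("ABCDEFG")).rename_axis(None, axis=1)

print(wide)
```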

Here is the code:

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

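def classifier(row):
    # First token of each cell is the target column name, second is the value
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    # Write the values into the matching columns of the pre-built result frame
    res.loc[row.name, cols] = vals

# Called for its side effect on res; the return value is not used
df.apply(classifier, axis=1)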

Input:

from io import StringIO
import pandas as pd
import numpy as np

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)
print("df:\n", df)

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)

print("\nres:\n", res)

Output:

df:
    0    1     2     3
0   A 1  B 2   C 10  D 15
1   A 5  D 10  G 2   NaN
2   D 6  E 7   NaN   NaN
3   H 7  G 8   NaN   NaN

res:
    A   B   C   D   E   F   G   H
0   1   2   10  15  NaN NaN NaN NaN
1   5   NaN NaN 10  NaN NaN 2   NaN
2   NaN NaN NaN 6   7   NaN NaN NaN
3   NaN NaN NaN NaN NaN NaN 8   7
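
Because the values written into res come from a string split, they are stored as strings in object columns. If numeric columns are wanted, one possible follow-up is pd.to_numeric applied column by column. The small frame below is a stand-in for res, built by hand for illustration only.

```python
import numpy as np
import pandas as pd

# Small stand-in for res: object columns holding digit strings and NaN
res = pd.DataFrame({"A": ["1", "5", np.nan], "F": [np.nan, np.nan, np.nan]},
                   dtype=object)

# Convert every column from strings to numbers; NaN stays NaN
numeric = res.apply(pd.to_numeric)
print(numeric.dtypes)
```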
