
Modify pandas dataframe imported from csv file

I have a large dataset imported from a CSV file (using pd.read_csv). Here is how it looks in the CSV file:

      0       1       2
0   Milan   Draw    Juventus
1    2.47    3.24    3.03
2    2.45    3.23    3.06
0   Napoli  Draw    Parma
1    1.45    4.41    7.38
2    1.45    4.40    7.36
3    1.46    4.39    7.33
4    1.47    4.33    7.14
5    1.47    4.33    7.13
6    1.47    4.34    7.10
7    1.43    4.54    7.70
0   Fiorentina  Draw    Pisa
1    2.86    3.50    2.45
2    2.92    3.51    2.40
3    3.14    3.55    2.25
4    2.79    3.45    2.61

I need the dataframe to look like this:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61

This is very easy to do in Excel with a formula, but I would like to do it in Python: the CSV file is very large, so handling it with pandas is much faster. I don't know whether it is possible or how to do it. Thanks!

Here's a way to do what your question asks:

import pandas as pd

df = pd.read_csv('eestlane.txt', sep=r"\s+")
# the original index (0 on name rows) becomes a regular column
df = df.reset_index().rename(columns={'index':'zero_for_names'})
# take both team names (columns '0' and '2') from the name rows and forward-fill them
df[['new1','new2']] = df.loc[df['zero_for_names'] == 0, ['0','2']].reindex(df.index, method='ffill')
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop=True)
df = df[['new1','new2','0','1','2']]
df.columns = [str(i) for i in range(len(df.columns))]

Output:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61

Explanation:

  • use read_csv to get a 3-column dataframe whose index is 0 only on the rows that contain names
  • use reset_index to get an index without duplicates, and rename to turn the original index into a column named zero_for_names
  • create two new columns new1, new2: mask on zero_for_names to select the name rows, then use reindex with method='ffill' to carry the names down, so these columns can become the first two columns of the target output specified in the question
  • use zero_for_names to filter out the original name rows, then drop this column and use reset_index to get a new index without gaps
  • rearrange the columns into the desired order
  • update df.columns to match the desired column names (integers as strings) shown in the question.
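
The steps above can be tried on a small in-memory sample without touching the real file. This is only a sketch: the sample string is a shortened copy of the data in the question, and the column names team1, team2, odds1, odds_draw, odds2 are hypothetical labels chosen for readability, not names from the original post. It also converts the odds to float, which the question does not explicitly ask for.

```python
import io

import pandas as pd

# Shortened copy of the question's data, used here instead of the real file.
sample = """\
0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
"""

# The header has 3 fields and the data rows have 4,
# so read_csv uses the first field of each row as the index.
raw = pd.read_csv(io.StringIO(sample), sep=r"\s+")
raw = raw.reset_index().rename(columns={"index": "zero_for_names"})

is_name_row = raw["zero_for_names"] == 0

# Team names exist only on name rows; reindex with ffill carries them down.
teams = raw.loc[is_name_row, ["0", "2"]].reindex(raw.index, method="ffill")
teams.columns = ["team1", "team2"]  # hypothetical labels

# Odds live on the remaining rows; convert them from text to float.
odds = raw.loc[~is_name_row, ["0", "1", "2"]].astype(float)
odds.columns = ["odds1", "odds_draw", "odds2"]  # hypothetical labels

result = teams.loc[~is_name_row].join(odds).reset_index(drop=True)
print(result)
```

Keeping the names and the odds in two separate frames until the final join avoids mixing text and numbers in one column, so the odds columns end up with a numeric dtype.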

Try:

import numpy as np
import pandas as pd


def isfloat(x):
    try:
        float(x)
        return True
    except ValueError:
        return False


df = pd.read_csv("your_file.csv", sep=r"\s+")  # <-- you may need to adjust sep= accordingly

# make sure the columns are of type int
df.columns = map(int, df.columns)

# True where a cell holds a number (odds), False where it holds text (names)
mask = df.applymap(isfloat)
x = df[mask].copy()
df[mask] = np.nan
df[[3, 4, 5]] = x

df[[0, 1, 2]] = df[[0, 1, 2]].ffill()
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))

print(df)

Prints:

             0         1     2     3     4
0        Milan  Juventus  2.47  3.24  3.03
1        Milan  Juventus  2.45  3.23  3.06
2       Napoli     Parma  1.45  4.41  7.38
3       Napoli     Parma  1.45  4.40  7.36
4       Napoli     Parma  1.46  4.39  7.33
5       Napoli     Parma  1.47  4.33  7.14
6       Napoli     Parma  1.47  4.33  7.13
7       Napoli     Parma  1.47  4.34  7.10
8       Napoli     Parma  1.43  4.54  7.70
9   Fiorentina      Pisa  2.86  3.50  2.45
10  Fiorentina      Pisa  2.92  3.51  2.40
11  Fiorentina      Pisa  3.14  3.55  2.25
12  Fiorentina      Pisa  2.79  3.45  2.61
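
As a possible variation on the isfloat mask, the numeric cells can also be detected column by column with pd.to_numeric(errors="coerce"), which avoids applymap (deprecated since pandas 2.1 in favour of DataFrame.map). This is a sketch under assumptions, not part of the original answer: the sample string is a shortened copy of the question's data, the labels a, b, c, team1, team2 are hypothetical, and a row is treated as a name row only if none of its three cells parses as a number.

```python
import io

import pandas as pd

# Shortened copy of the question's data, used here instead of the real file.
sample = """\
0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+").reset_index(drop=True)
df.columns = ["a", "b", "c"]  # hypothetical labels

# Text cells become NaN; a name row is one where every cell is NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
name_rows = numeric.isna().all(axis=1)

# Take the two team names from the name rows and carry them down.
teams = df.loc[name_rows, ["a", "c"]].reindex(df.index, method="ffill")
teams.columns = ["team1", "team2"]  # hypothetical labels

# Put names next to the numeric odds and drop the original name rows.
out = pd.concat([teams, numeric], axis=1)[~name_rows].reset_index(drop=True)
print(out)
```

One caveat of this variant: a team name that happens to parse as a number would not break the row detection as long as the middle cell ("Draw") stays non-numeric, but that is an assumption about the data rather than something the question guarantees.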
