Modify pandas dataframe imported from csv file
I have a large dataset imported from a CSV file (using pd.read_csv). Here is how it looks in the CSV file:
0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
2 1.45 4.40 7.36
3 1.46 4.39 7.33
4 1.47 4.33 7.14
5 1.47 4.33 7.13
6 1.47 4.34 7.10
7 1.43 4.54 7.70
0 Fiorentina Draw Pisa
1 2.86 3.50 2.45
2 2.92 3.51 2.40
3 3.14 3.55 2.25
4 2.79 3.45 2.61
I need the dataframe to look like this:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
This is very easy to do in Excel with a formula, but I would like to do it in Python: the CSV file is very large, so handling it with pandas is much faster. I don't know whether it is possible or how to do it. Thanks!
Here's a way to do what your question asks:
import pandas as pd

# The header has one field fewer than the data rows, so the first field becomes the index.
df = pd.read_csv('eestlane.txt', sep=r"\s+")
df = df.reset_index().rename(columns={'index':'zero_for_names'})
# Take the home ('0') and away ('2') names from the name rows and forward-fill them.
df[['new1','new2']] = df.loc[df['zero_for_names'] == 0, ['0','2']].reindex(df.index, method='ffill')
df = df[df['zero_for_names'] != 0].drop(columns='zero_for_names').reset_index(drop=True)
df = df[['new1','new2','0','1','2']]
df.columns = [str(i) for i in range(len(df.columns))]
Output:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
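The forward-fill step is probably the least obvious part of the code above, so here is a minimal sketch of it in isolation. It assumes a small hand-built frame; the column names marker, home and away are made up for illustration and do not come from the original file.

```python
import pandas as pd

# Hand-built stand-in for the frame after reset_index(); the column names
# marker, home and away are hypothetical and chosen only for readability.
df = pd.DataFrame({
    "marker": [0, 1, 2, 0, 1],
    "home": ["Milan", "2.47", "2.45", "Napoli", "1.45"],
    "away": ["Juventus", "3.03", "3.06", "Parma", "7.38"],
})

# Keep only the name rows; their index labels are 0 and 3.
names = df.loc[df["marker"] == 0, ["home", "away"]]

# Reindex to the full index; method="ffill" copies each name row forward
# onto the labels that follow it, until the next name row.
filled = names.reindex(df.index, method="ffill")
print(filled)
```

Note that reindex only accepts a method argument when the index being reindexed is monotonic. That holds here because reset_index produces the labels 0..n-1 and the name rows are a subset of them.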
Explanation:
Use read_csv to get a 3-column dataframe with an index that contains 0 only for rows with names.
Use reset_index to get an index without duplicates, and rename to change the original index to a column named zero_for_names.
Create columns new1, new2 and use masking on zero_for_names together with reindex and its ffill method arg to prepare these columns to be the first two columns of the target output specified in the question.
Use zero_for_names to filter out original name rows, then drop this column and use reset_index to get a new index without gaps.
Assign to df.columns to match the desired column names (integers as strings) shown in the question.
Try:
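import numpy as np
import pandas as pd

def isfloat(x):
    """Return True if x can be parsed as a float."""
    try:
        float(x)
        return True
    except ValueError:
        return False

df = pd.read_csv("your_file.csv", sep=r"\s+")  # <-- you may need to adjust sep= accordingly

# make sure the columns are of type int
df.columns = map(int, df.columns)

# True where a cell holds a number (odds), False where it holds a name
mask = df.applymap(isfloat)  # on pandas >= 2.1, DataFrame.map is the non-deprecated equivalent
x = df[mask].copy()
df[mask] = np.nan
df[[3, 4, 5]] = x
df[[0, 1, 2]] = df[[0, 1, 2]].ffill()
df = df.dropna().reset_index(drop=True).drop(columns=1)
df.columns = range(len(df.columns))
print(df)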
Prints:
0 1 2 3 4
0 Milan Juventus 2.47 3.24 3.03
1 Milan Juventus 2.45 3.23 3.06
2 Napoli Parma 1.45 4.41 7.38
3 Napoli Parma 1.45 4.40 7.36
4 Napoli Parma 1.46 4.39 7.33
5 Napoli Parma 1.47 4.33 7.14
6 Napoli Parma 1.47 4.33 7.13
7 Napoli Parma 1.47 4.34 7.10
8 Napoli Parma 1.43 4.54 7.70
9 Fiorentina Pisa 2.86 3.50 2.45
10 Fiorentina Pisa 2.92 3.51 2.40
11 Fiorentina Pisa 3.14 3.55 2.25
12 Fiorentina Pisa 2.79 3.45 2.61
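If the odds are needed as real numbers rather than strings, a variation on the same idea is to detect the name rows with pd.to_numeric instead of a per-cell function. This is only a sketch and not the method used in the answers above. It assumes the file is whitespace-separated exactly as shown in the question, reads a small in-memory sample in place of the real file, and the field names row, home, draw and away are made up for illustration.

```python
import io
import pandas as pd

# Small stand-in for the real file, in the layout shown in the question.
raw = """0 1 2
0 Milan Draw Juventus
1 2.47 3.24 3.03
2 2.45 3.23 3.06
0 Napoli Draw Parma
1 1.45 4.41 7.38
"""

# Skip the "0 1 2" header line and name the four fields ourselves
# (row, home, draw, away are hypothetical names); read everything as text.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+", skiprows=1, header=None,
                 names=["row", "home", "draw", "away"], dtype=str)

# Non-numeric text becomes NaN, so name rows are all-NaN in `odds`.
odds = df[["home", "draw", "away"]].apply(pd.to_numeric, errors="coerce")
is_name_row = odds.isna().all(axis=1)

# Team names exist only on name rows; forward-fill them onto the odds rows.
teams = df.loc[is_name_row, ["home", "away"]].reindex(df.index).ffill()

# Give the two pieces distinct column names before joining them side by side.
teams.columns = ["home_team", "away_team"]
odds.columns = ["odds_home", "odds_draw", "odds_away"]
result = pd.concat([teams, odds], axis=1)[~is_name_row].reset_index(drop=True)

# Renumber the columns 0..4 to match the layout asked for in the question.
result.columns = range(result.shape[1])
print(result)
```

For the real file, io.StringIO(raw) would be replaced by the file path. The three odds columns come out as float64, so they can be used in arithmetic directly.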