How to filter elements containing only specific repeated characters in a dataframe
I want to create a new dataframe that filters out redundant information from an existing dataframe. The original dataframe is built by walking through many file folders; it has one column in which each element is a string holding the full path to a file. Each file is named by trial number and score and sits in the corresponding test folder. I need to remove every repeated score of 100 within each test folder, but the first score of 100 in each test folder must remain.
With Python pandas, I know I can use df[df[col_header].str.contains('text')] to keep only the rows I need, and '~' as a boolean NOT.
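For reference, the filtering idiom described above can be sketched as follows. The sample rows and the column name `col` are assumptions for illustration only; `regex=False` makes `-100` match as plain text.

```python
import pandas as pd

# Hypothetical sample data; the column name "col" is an assumption.
df = pd.DataFrame({"col": [
    r"\\desktop\Test_Scores\test1\trial1-98",
    r"\\desktop\Test_Scores\test1\trial2-100",
    r"\\desktop\Test_Scores\test2\trial1-95",
]})

# Boolean mask: True where the path contains the literal text "-100".
is_100 = df["col"].str.contains("-100", regex=False)

only_100 = df[is_100]   # rows that match
not_100 = df[~is_100]   # "~" inverts the mask, keeping rows that do not match
```

On its own this keeps or drops every 100, so it cannot express "keep only the first 100 per test folder".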
The unfiltered dataframe column with redundant scores looks like this:
\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test1\trial3-100 #<- must remove
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
\\desktop\Test_Scores\test2\trial3-100 #<- must remove
\\desktop\Test_Scores\test2\trial3-100 #<- must remove
.
.
.
n
The expected result after applying the filter would be a dataframe that looks like this:
\\desktop\Test_Scores\test1\trial1-98
\\desktop\Test_Scores\test1\trial2-100
\\desktop\Test_Scores\test2\trial1-95
\\desktop\Test_Scores\test2\trial2-100
.
.
.
.
n
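This one line should solve your problem. Note that it keeps a row only when its "-100" match differs from the previous row's, so it assumes that no two consecutive rows both have a score other than 100.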
df = df.loc[df["col"].shift().str.contains("-100") != df["col"].str.contains("-100")]
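If consecutive rows can both hold scores other than 100, a variant that drops a row only when both it and the previous row are 100s may be safer. This is a minimal sketch, not the answer's original method: it assumes the column is named `col`, that every path ends with the score, that the rows are already ordered by folder and trial as in the question, and that no test folder begins with a 100 directly after another folder ends with one.

```python
import pandas as pd

# Sample rows taken from the question; the column name "col" is an assumption.
df = pd.DataFrame({"col": [
    r"\\desktop\Test_Scores\test1\trial1-98",
    r"\\desktop\Test_Scores\test1\trial2-100",
    r"\\desktop\Test_Scores\test1\trial3-100",
    r"\\desktop\Test_Scores\test2\trial1-95",
    r"\\desktop\Test_Scores\test2\trial2-100",
    r"\\desktop\Test_Scores\test2\trial3-100",
    r"\\desktop\Test_Scores\test2\trial3-100",
]})

# True where the path ends with a score of exactly 100.
is_100 = df["col"].str.endswith("-100")

# The same mask moved down one row; the first row has no predecessor.
prev_is_100 = is_100.shift(fill_value=False)

# Drop a row only when it is a 100 that directly follows another 100.
filtered = df.loc[~(is_100 & prev_is_100)]
```

Unlike the one-liner, this never removes a row whose score is not 100.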
Update:
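# Turn any literal tab characters back into a backslash followed by "t"
df["col"] = df["col"].str.replace('\t', '\\t', regex=False)
# Test folder name: the second-to-last path component
df['test_number'] = df["col"].str.split('-').str[0].str.split('\\').str[-2]
# Score: the text after the hyphen
df['score'] = df["col"].str.split('-').str[1]
# Keep only the first row for each (test folder, score) pair
df = df.drop_duplicates(["test_number", "score"])
# Remove the helper columns
df = df.drop(columns=["test_number", "score"])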
Check this solution out. The reason I do the replace in the very first line is that your data contains \t (as in \test1 and \trial1), which in programming is the escape sequence for a tab character.
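Note that `drop_duplicates` on test folder and score collapses every repeated score within a folder, not only the 100s. If only repeated 100s should go, one possible sketch is below. The sample rows (including the extra `trial4-95` row), the regular expression, and the names `parts`, `test`, and `score` are assumptions for illustration; the paths are written as raw strings so that `\t` is never interpreted as a tab.

```python
import pandas as pd

# Rows from the question plus one hypothetical repeated 95 (last row).
df = pd.DataFrame({"col": [
    r"\\desktop\Test_Scores\test1\trial1-98",
    r"\\desktop\Test_Scores\test1\trial2-100",
    r"\\desktop\Test_Scores\test1\trial3-100",
    r"\\desktop\Test_Scores\test2\trial1-95",
    r"\\desktop\Test_Scores\test2\trial2-100",
    r"\\desktop\Test_Scores\test2\trial3-100",
    r"\\desktop\Test_Scores\test2\trial3-100",
    r"\\desktop\Test_Scores\test2\trial4-95",
]})

# Extract the test folder (second-to-last component) and the trailing score.
parts = df["col"].str.extract(r"\\(?P<test>[^\\]+)\\[^\\]+-(?P<score>\d+)$")

# Rows whose score is exactly 100.
is_100 = parts["score"] == "100"

# duplicated() flags every repeat of a (test, score) pair after its first
# occurrence; combining with is_100 limits the removal to repeated 100s.
repeat_100 = is_100 & parts.duplicated(["test", "score"])

filtered = df[~repeat_100]  # keeps rows 0, 1, 3, 4 and 7
```

Because the test folder is part of the key, this does not depend on the rows being in any particular order, only on the first 100 of each folder appearing before its repeats.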