How to remove duplicates without pandas?

This is the data:

row1| sbkjd nsdnak ABC 
row2| vknfe edcmmi ABC
row3| fjnfn msmsle XYZ
row4| sdkmm tuiepd XYZ
row5| adjck rulsdl LMN

I have already tried this using pandas and got help from Stack Overflow. But I want to be able to remove the duplicates without using pandas, or any library at all. Only one of the rows containing "ABC" should be chosen, only one of the rows containing "XYZ" should be chosen, and the last row is unique, so it should be chosen too. How do I do this? My final output should contain this:

[ row1 or row2 + row3 or row4 + row5 ]

This should select only the unique rows from your original table. If two or more rows share a value, only the first of them is kept. Note that the check compares every column, not just the last one.

data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

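def check_list_uniqueness(candidate_row, unique_rows):
    # Return False as soon as any value in candidate_row already
    # appears in one of the rows kept so far (any column counts).
    for element in candidate_row:
        for unique_row in unique_rows:
            if element in unique_row:
                return False
    return True

# Walk the rows in order, so the first row of each group is the one kept.
final_rows = []
for row in data:
    if check_list_uniqueness(row, final_rows):
        final_rows.append(row)

print(final_rows)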
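
If only the last column should decide whether two rows are duplicates, as the question describes, a set of already-seen keys avoids the nested loops. The following is a minimal sketch under that assumption; the function name first_row_per_key and its key_index parameter are made up for illustration and do not come from the answer above.

```python
def first_row_per_key(rows, key_index=-1):
    """Keep the first row seen for each distinct value at key_index."""
    seen = set()   # key values already encountered
    kept = []      # rows in their original order, one per key
    for row in rows:
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


data = [["sbkjd", "nsdnak", "ABC"],
        ["vknfe", "edcmmi", "ABC"],
        ["fjnfn", "msmsle", "XYZ"],
        ["sdkmm", "tuiepd", "XYZ"],
        ["adjck", "rulsdl", "LMN"]]

print(first_row_per_key(data))
```

For the sample data this keeps the first "ABC" row, the first "XYZ" row and the "LMN" row. Unlike the element-wise check, a value that happens to repeat in one of the other columns does not cause a row to be dropped.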

This Bash command would do it (assuming your data is in a file called test, and that values of column 4 do not appear in other columns):

cut -d ' ' -f 4 test | tr '\n' ' ' | sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g' | tr ' ' '\n' | while read str; do grep -m 1 $str test; done

cut -d ' ' -f 4 test: selects the data in the fourth column
tr '\n' ' ': turns the column into a row (translating each newline character to a space)
sed 's/\([a-zA-Z][a-zA-Z]*[ ]\)\1/\1/g': deletes adjacent repetitions
tr ' ' '\n': turns the row of unique values back into a column
while read str; do grep -m 1 $str test; done: reads the unique words and prints the first line of test that matches each word
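
For comparison with the Python answer, the same idea (print the first line for each distinct value in the fourth column) can be sketched without any library. This is only an illustration: the function name first_line_per_field is hypothetical, fields are assumed to be separated by whitespace, and field=3 is the zero-based equivalent of cut's -f 4.

```python
def first_line_per_field(lines, field=3):
    """Yield the first line for each distinct value in the given field."""
    seen = set()
    for line in lines:
        parts = line.split()
        if len(parts) <= field:
            continue  # skip blank or too-short lines
        key = parts[field]
        if key not in seen:
            seen.add(key)
            yield line.rstrip("\n")


sample = ["row1| sbkjd nsdnak ABC",
          "row2| vknfe edcmmi ABC",
          "row3| fjnfn msmsle XYZ",
          "row4| sdkmm tuiepd XYZ",
          "row5| adjck rulsdl LMN"]

# To read from the file instead: with open("test") as fh: pass fh as lines.
for line in first_line_per_field(sample):
    print(line)
```

Because the seen values are tracked in a set, this also handles duplicates that are not on neighbouring lines, and it does not rely on the key being absent from the other columns.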
