
Is there a way to compare two csv files cell by cell?

For a research project I need to find the cells that two csv files have in common. I want Python code that takes each cell of the first file and searches for it in the second file. For instance, if the second file has 2500 cells, the code should check whether the first cell of the first file appears anywhere among those 2500 cells. Once it has done that for the first cell, it moves on to the second one, and so on until it reaches the last one. Finally, I need all of the common cells to be written to either a txt or a csv file.

The two csv files can be found here: https://drive.google.com/drive/folders/18Au23kD33Za25GNEaLIY1bT18xKtZWtY?usp=sharing

!pip install pandas

import pandas as pd

# read_csv already returns a DataFrame; on_bad_lines="skip" replaces
# error_bad_lines=False in pandas >= 1.3
table_2 = pd.read_csv('/content/drive/MyDrive/Bioinformatique 2/DATAs/Etude chinoise /En cours/list_chinese_sorted.csv', sep=";", on_bad_lines="skip")
table_1 = pd.read_csv('/content/drive/MyDrive/Bioinformatique 2/DATAs/Data_carveme/list.csv', sep=";", on_bad_lines="skip")

for a in range(0,5589):
  for b in range(0,10):
    for x in range(0,2637):
      scoreline = 0
      for y in range(0,16):
        # .iloc is needed for positional (row, column) access
        if table_1.iloc[a,b] == table_2.iloc[x,y]:
          scoreline += 1
      print(x,scoreline)

I looked into your files last evening. As you said in a comment, the content looks like tree structures, so what you're asking here seems a bit odd (I'm not saying it is odd). It would help to know what your goal beyond the question is.

I did the following (as hinted at in my comment):

  1. Identify the common elements as the intersection of two sets, one set of items per file.
  2. For each common item, check the second dataframe for occurrences of the item (per row index, the number of occurrences in that row) and print the non-zero results.

import pandas as pd

# Identification of common items: build one set of cells per file
with open("list.csv", "r") as file:
    table_1_set = {
        item.strip().strip("()")
        for line in file
        for item in line.strip().split(";") if item != ''
    }
with open("list_chinese_sorted.csv", "r") as file:
    table_2_set = {
        item.strip().strip("()")
        for line in file
        for item in line.strip().split(";") if item != ''
    }
# Intersect the set of the first file with the set of the second file
common = table_1_set & table_2_set

# Finding and printing the non-zero occurrences in the second dataframe
# (on_bad_lines="skip" replaces error_bad_lines=False in pandas >= 1.3)
df = pd.read_csv(
    "list_chinese_sorted.csv", sep=";", on_bad_lines="skip"
)
for item in sorted(common):
    print(f"item: {item}")
    # Number of cells per row that are equal to the item
    df_num_matches = df.eq(item).sum(axis="columns")
    for i, num in df_num_matches[~df_num_matches.eq(0)].items():
        print(f"{i}: {num}")

The printed results look like:

...
item: acetanaerobacterium
1381: 1
1382: 2
item: acetatifactor
1242: 1
item: acetethylicum
1246: 1
item: acetitomaculum
1243: 1
item: acetivibrio
1383: 1
1384: 2
item: acetobacter
1819: 1
item: acetobacteraceae
1818: 1
1819: 1
1820: 1
1821: 1
1822: 1
1823: 1
1824: 1
1825: 1
1826: 1
1827: 1
1828: 1
1829: 1
1830: 1
1831: 1
item: acetobacterium
1200: 1
...
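
The question also asks for the common cells to end up in a txt or csv file, which the code above does not do. A minimal sketch of that step, using only the standard library, could look like the following. The helper names `collect_items` and `write_common` and the sample rows are made up for illustration; they are not taken from the real files.

```python
import csv
import io


def collect_items(lines):
    """Return the set of non-empty cells, stripped of whitespace and parentheses."""
    items = set()
    for row in csv.reader(lines, delimiter=";"):
        for cell in row:
            cleaned = cell.strip().strip("()")
            if cleaned:
                items.add(cleaned)
    return items


def write_common(lines_1, lines_2, out):
    """Write the items found in both inputs to `out`, one per line, sorted."""
    common = sorted(collect_items(lines_1) & collect_items(lines_2))
    writer = csv.writer(out, lineterminator="\n")
    for item in common:
        writer.writerow([item])
    return common


# Tiny in-memory stand-ins for the two files (hypothetical rows, not the real data)
sample_1 = io.StringIO(
    "acetobacter;aceti;;\nblattabacterium;sp.;(cryptocercus;punctulatus)\n"
)
sample_2 = io.StringIO("acetobacteraceae;acetobacter;;\ncryptocercus;;\n")
buffer = io.StringIO()
result = write_common(sample_1, sample_2, buffer)
print(result)
```

With the real data, the in-memory buffers would presumably be replaced by file handles, e.g. `open("list.csv", newline="")` for the inputs and `open("common.txt", "w", newline="")` for the output, where `common.txt` is just a placeholder name.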

But you might want to do some data cleansing: it seems to me that there is some stuff in the files that prevents proper processing. E.g., in the first file you'll find lines like

blattabacterium;sp.;(cryptocercus;punctulatus);str.;cpu;;;;

which will lead to confusing results (because of the ( and ) characters).
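
One possible way to approach that cleansing is to normalize every cell before comparing anything. The sketch below is only an illustration under assumptions: the helper names `normalize` and `clean_rows` are hypothetical, and lower-casing the cells is my own guess at what "clean" should mean here, not something the files require.

```python
import csv
import io


def normalize(cell):
    """Lower-case a cell and drop surrounding whitespace and parentheses."""
    # Lower-casing is an assumption; drop .lower() if case matters in your data
    return cell.strip().strip("()").strip().lower()


def clean_rows(lines, delimiter=";"):
    """Yield each row as a list of normalized, non-empty cells."""
    for row in csv.reader(lines, delimiter=delimiter):
        cleaned = [normalize(cell) for cell in row]
        yield [cell for cell in cleaned if cell]


# The problematic line quoted above, fed in from memory instead of a file
raw = io.StringIO("blattabacterium;sp.;(cryptocercus;punctulatus);str.;cpu;;;;\n")
rows = list(clean_rows(raw))
print(rows)
```

Applying the same normalization to both files before building the sets and before counting matches keeps a cell like `(cryptocercus` from being treated as different from `cryptocercus`.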
