How to compare two csv files according to a specific variable (not line by line) in Python?

I have two csv files that I converted from a JSON file (I copied the text into Excel and saved it as csv), so the format is a bit messy. I want to compare whole lines according to their ID number, but the ID number is in a different column in each row. I want to print the differences between two lines that have the same ID number.

Here is a data sample (I can't rename the columns because each column holds different variables from row to row):

CSV_01:

age 10   height 150   ID  1001     sex F
age 10   height 150   ID  1001     sex M
ID 1001  height 150   age  12      sex M
age 10   ID  2002     height 151   sex F
age 10   height 150   ID  2002     sex M

CSV_02:

age 10   height 150   ID  2002     sex F
age 10   height 150   ID  1001     sex M
ID 1001  height 150   age  12      sex M
age 10   ID  1001     height 151   sex F
age 10   height 150   ID  2002     sex M

I have almost 1000 rows and about 500 columns, and a single row can also contain the same ID more than once, something like: age 10 height 150 ID 1001 sex M ... ID 1001 ...

I assume that doesn't matter. The variables appear in different orders, though, which means that ultimately I want to compare the first 3 rows in CSV_01 with the 2nd, 3rd and 4th rows in CSV_02 (because they have the same ID). This is just an example, so the row numbers will be different in my large data set.

Here's what I've tried after importing the csv files in Python:

# CSV_01 and CSV_02 are pandas DataFrames; pd is pandas
resultBool01 = (CSV_01 != CSV_02).stack()  # Create Frame of comparison booleans
resultdiff01 = pd.concat(
    [CSV_01.stack()[resultBool01], CSV_02.stack()[resultBool01]],
    axis=1,
)
resultdiff01.columns = ["output_01", "output_02"]

This gave me the differences between rows at the same position (i.e. the first row of one file against the first row of the other), but this is not what I want, because the first rows have different IDs. I've been stuck for a few days already and I'm not sure if this is the right direction, but it could be even more difficult if I compared the json or txt files. Can someone help me? Many thanks.

Have you tried converting your data into a dictionary?

While the csv files are a bit messy, at least they have a clearly defined structure, and each field name appears directly before the value it refers to.

Whitespace and other special characters notwithstanding, you could first parse each csv file line by line, save each line (or entry) as a dictionary of unique data points, and append it to a list of dictionaries. Then you can either operate on that list directly or export a properly sorted and aligned csv file for later use.
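A minimal sketch of the export step, assuming the rows have already been parsed into a list of dictionaries. The names `write_aligned`, `rows` and `aligned_text` are illustrative and not from the original post, and the sample rows are made up to mirror the question's data.

```python
import csv
import io


def write_aligned(rows, out):
    """Write a list of dicts as a csv with one column per field name."""
    # Collect every field name seen in any row, in first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    # restval fills in fields that a particular row does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Hypothetical sample: the same fields in a different order per row.
rows = [
    {"age": "10", "height": "150", "ID": "1001", "sex": "F"},
    {"ID": "1001", "height": "150", "age": "12", "sex": "M"},
]
buffer = io.StringIO()
write_aligned(rows, buffer)
aligned_text = buffer.getvalue()
```

Writing to a real file instead of `io.StringIO` should work the same way if the file is opened with `newline=''`, as the `csv` module documentation recommends.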

(I cannot comment yet, so I hope this suffices; otherwise I am happy to help with the actual code, too.)

Addendum:

The code might not be perfect for your specific file, but it can serve as a blueprint for developing your own.

Basically, each row as represented in csv (really tsv) format is:

fieldname   value   fieldname2   value2   fieldname3   value3

This code reads it in and saves value2 as the value of the key "fieldname2" in a new dictionary, which is then appended to a list that the function returns at the end.

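import csv

def filereader(filename, encoding='utf-8'):
    # The encoding was left as a placeholder in the original code; 'utf-8' is
    # only an assumed default, so pass your file's actual encoding.
    _out = []
    with open(filename, 'r', newline='', encoding=encoding) as csvfile:
        reader = csv.reader(csvfile, delimiter='\t')
        for row in reader:
            # Cells alternate: fieldname, value, fieldname2, value2, ...
            rowDict = {}
            for fieldname, value in zip(row[0::2], row[1::2]):
                rowDict[fieldname] = value
            _out.append(rowDict)
    return _out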
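Once both files are available as lists of dictionaries (the kind of list `filereader` returns), the comparison by ID could look roughly like the sketch below. It is only one possible approach: the helper names `group_by_id` and `diff_by_id` are made up for illustration, the field name "ID" is taken from the sample data, and rows that share an ID are paired simply by their order of appearance in each file, which is an assumption that may not fit the real data.

```python
from collections import defaultdict
from itertools import zip_longest


def group_by_id(rows, id_field="ID"):
    """Map each ID value to the list of rows carrying it, in file order."""
    groups = defaultdict(list)
    for row in rows:
        groups[row.get(id_field)].append(row)
    return groups


def diff_by_id(rows_a, rows_b, id_field="ID"):
    """Return (id, position, field, value_a, value_b) for every mismatch."""
    groups_a = group_by_id(rows_a, id_field)
    groups_b = group_by_id(rows_b, id_field)
    diffs = []
    for id_value in sorted(set(groups_a) | set(groups_b), key=str):
        # Assumption: the n-th row with this ID in file A is compared with
        # the n-th row with this ID in file B; a missing partner is an empty row.
        pairs = zip_longest(
            groups_a.get(id_value, []),
            groups_b.get(id_value, []),
            fillvalue={},
        )
        for position, (row_a, row_b) in enumerate(pairs):
            for field in sorted(set(row_a) | set(row_b)):
                if row_a.get(field) != row_b.get(field):
                    diffs.append(
                        (id_value, position, field,
                         row_a.get(field), row_b.get(field))
                    )
    return diffs


# Hypothetical sample rows, shaped like the data in the question.
rows_a = [
    {"age": "10", "height": "150", "ID": "1001", "sex": "F"},
    {"age": "10", "ID": "2002", "height": "151", "sex": "F"},
]
rows_b = [
    {"age": "10", "height": "150", "ID": "2002", "sex": "F"},
    {"ID": "1001", "height": "150", "age": "12", "sex": "F"},
]
differences = diff_by_id(rows_a, rows_b)
for id_value, position, field, value_a, value_b in differences:
    print(id_value, position, field, value_a, value_b)
```

Because each row is a dictionary, the order of the columns no longer matters. If the order of duplicate-ID rows differs between the two files, pairing by position will report spurious differences, and a different matching rule would be needed.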
