Python: how to match dictionary value to file name?

I am relatively new to Python and struggling with the following:

I have a list of about 52,000 dictionaries containing metadata on PDFs (which are stored separately). Now, I want to match 5,000 of these PDFs to their corresponding metadata dictionaries, but I'm not sure how to do this.

Metadata:

[{'Title': 'This is the title', 'Author': 'John A.', 'Code': '8372', ...}, {'Title': 'This is another title', 'Author': 'Peter B.', 'Code': '5837_c', ...}, ...]

The PDF file names correspond to the 'Code' values (i.e. the file names are 5346, 8372, 3475_c, 0294, 5837_c, etc., always either three, four or five digits, or three, four or five digits followed by _c). Is there a way to match the PDFs to the right dictionaries in the list of metadata dictionaries, using the PDFs' file names as the key?

Other solutions are also very welcome!

Edit: My aim is to create a Textacy Corpus, in which every entry is a Textacy Doc (i.e. the content of one PDF) and its corresponding Textacy Metadata (i.e. that PDF's metadata).

textacy_corpus = textacy.Corpus(u'en', texts=pdfs_list, metadatas=metadata_list)

From Textacy's documentation: "[Metadata] stream must align exactly with texts or docs, or else metadata will be mis-assigned. More concretely, the first item in metadatas will be assigned to the first item in texts or docs, and so on from there." This is why I want to match the PDFs to the right metadata.

dict((x['Code'],x) for x in <YOUR_LIST>)
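
The lookup above can be extended into a rough sketch of the alignment step the question asks for. It assumes the PDFs carry a .pdf extension and that the file name without the extension equals the 'Code' value. The function name align_pdfs_with_metadata, the pdfs/ directory and the sample data are hypothetical and only there for illustration; extracting the text from each PDF is left out.

```python
from pathlib import Path


def align_pdfs_with_metadata(pdf_paths, metadata_list):
    """Return (paths, metadatas) in matching order, plus file names without metadata.

    Hypothetical helper: assumes each file name minus '.pdf' equals a 'Code' value.
    """
    # Lookup table: 'Code' value -> metadata dictionary
    by_code = {m['Code']: m for m in metadata_list}

    matched_paths, matched_meta, unmatched = [], [], []
    for path in sorted(Path(p) for p in pdf_paths):
        code = path.stem  # file name without the '.pdf' suffix
        meta = by_code.get(code)
        if meta is None:
            unmatched.append(path.name)  # no dictionary carries this code
        else:
            matched_paths.append(path)
            matched_meta.append(meta)
    return matched_paths, matched_meta, unmatched


# Made-up sample data in the shape shown in the question
metadata_list = [
    {'Title': 'This is the title', 'Author': 'John A.', 'Code': '8372'},
    {'Title': 'This is another title', 'Author': 'Peter B.', 'Code': '5837_c'},
]
pdf_paths = ['pdfs/5837_c.pdf', 'pdfs/8372.pdf', 'pdfs/0294.pdf']

paths, metadatas, unmatched = align_pdfs_with_metadata(pdf_paths, metadata_list)
```

Because paths and metadatas are built in the same loop, item i of one belongs to item i of the other, which is the ordering the Textacy documentation quoted in the question requires. If two dictionaries share a 'Code', the lookup keeps only the last one, so it may be worth checking for duplicates first.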
