How to create a dictionary from a CSV file
I have a CSV file like this:
w syn
0 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
1 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
2 abaca http://kaiko.getalp.org/dbnary/fra/tagal
3 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
4 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
.. ... ...
95 abandon http://kaiko.getalp.org/dbnary/fra/apostasie
96 abandon http://kaiko.getalp.org/dbnary/fra/capitulation
97 abandon http://kaiko.getalp.org/dbnary/fra/cession_de_...
98 abandon http://kaiko.getalp.org/dbnary/fra/confiance
99 abandon http://kaiko.getalp.org/dbnary/fra/défection
[100 rows x 2 columns]
6
{'abaca': 'tagal', 'abdomen': 'ventre', 'abricot': 'michemis', 'ADN': 'acide désoxyribonucléique', 'Indien': 'sauvage', 'abandon': 'défection'}
I am trying to build a dictionary of each word and its synonyms. I came up with the code below, but the final dictionary only contains one synonym per word, even though, as you can see in the CSV file, a word can have several synonyms.
# read specific columns of csv file using Pandas
import os
import pandas as pd

df = pd.read_csv("sparql.csv", usecols=["w", "syn"])  # usecols = ["l","f","s","w","syn","synonyme"]
print(df)
liste_mot = df['w'].tolist()
liste_mot = set(liste_mot)
print(len(liste_mot))
liste_sys = []
dict_syn = {}
for index, row in df.iterrows():
    k, v = row
    sys = os.path.basename(v)
    if "_" in sys:
        sys = sys.split("_")
        sys = " ".join(sys)
        dict_syn[k] = sys
    else:
        dict_syn[k] = sys
print(dict_syn)
What I want is each word as a key with a list of all its synonyms as the value, but so far I only get one synonym (syn) per word (w), not all of them.
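The underlying problem is that `dict_syn[k] = sys` overwrites the previous value on every iteration. A minimal sketch of the accumulation pattern with `dict.setdefault` (the hard-coded sample rows are my assumption, standing in for the CSV):

```python
import os

# a few rows standing in for the (w, syn) columns of the CSV
rows = [
    ("abaca", "http://kaiko.getalp.org/dbnary/fra/Musa_textilis"),
    ("abaca", "http://kaiko.getalp.org/dbnary/fra/tagal"),
    ("abandon", "http://kaiko.getalp.org/dbnary/fra/apostasie"),
]

dict_syn = {}
for k, v in rows:
    syn = os.path.basename(v).replace("_", " ")
    # setdefault creates the empty list the first time a word is seen,
    # then every synonym is appended instead of replacing the value
    dict_syn.setdefault(k, []).append(syn)

print(dict_syn)
# {'abaca': ['Musa textilis', 'tagal'], 'abandon': ['apostasie']}
```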
Here is a working example, partly based on your code. The synonyms are put in a list:
from io import StringIO
import pandas as pd
text = """
w syn
0 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
1 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
2 abaca http://kaiko.getalp.org/dbnary/fra/tagal
3 abaca http://kaiko.getalp.org/dbnary/fra/Musa_textilis
4 abaca http://kaiko.getalp.org/dbnary/fra/chanvre_de_...
95 abandon http://kaiko.getalp.org/dbnary/fra/apostasie
95 abandon http://kaiko.getalp.org/dbnary/fra/apostasie
96 abandon http://kaiko.getalp.org/dbnary/fra/capitulation
97 abandon http://kaiko.getalp.org/dbnary/fra/cession_de_...
98 abandon http://kaiko.getalp.org/dbnary/fra/confiance
99 abandon http://kaiko.getalp.org/dbnary/fra/défection
"""
# read in data
df = pd.read_csv(StringIO(text), sep=r'\s+')
# get the synonym out of the url
df['real_syn'] = df['syn'].str.extract(r'.*/(.*)')
# dictionary to write results to
result = {}
# loop over every row of the dataframe
for _, row in df[['w', 'real_syn']].iterrows():
    word = row['w']
    syn = row['real_syn']
    if word not in result:
        # if word is not yet in the dictionary, add it as a key with the synonym in a list
        result[word] = [syn]
    elif syn not in result[word]:
        # make sure the same synonym is not added twice
        result[word].append(syn)
print(result)
I'm not sure whether your CSV is actually fixed-width or whether that is just a pretty printout.
If you don't need Pandas, Python's standard csv module is up to the job.
import csv
import os
import pprint
from collections import defaultdict

def syn_splitter(s):
    syn = os.path.basename(s)
    syn = syn.replace('_', ' ')
    return syn

# So we can just start appending syns, without having to "prime" the dictionary with an empty list
word_syn_map = defaultdict(list)

with open('sample.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # discard header
    for row in reader:
        w, syn = row
        syn = syn_splitter(syn)
        word_syn_map[w].append(syn)

pprint.pprint(word_syn_map)
# word_syn_map = dict(word_syn_map) if you want to get rid of the defaultdict wrapper
I mocked up sample.csv:
w,syn
abaca,http://kaiko.getalp.org/dbnary/fra/Musa_textilis
abaca,http://kaiko.getalp.org/dbnary/fra/tagal
abaca,http://kaiko.getalp.org/dbnary/fra/Musa_textilis
abandon,http://kaiko.getalp.org/dbnary/fra/apostasie
abandon,http://kaiko.getalp.org/dbnary/fra/capitulation
abandon,http://kaiko.getalp.org/dbnary/fra/confiance
abandon,http://kaiko.getalp.org/dbnary/fra/défection
and got:
defaultdict(<class 'list'>,
{'abaca': ['Musa textilis', 'tagal', 'Musa textilis'],
'abandon': ['apostasie',
'capitulation',
'confiance',
'défection']})
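The same approach also works without a file on disk, since csv.reader accepts any iterable of lines. A sketch feeding it a StringIO of a few of the sample rows:

```python
import csv
import io
import os
from collections import defaultdict

# in-memory stand-in for sample.csv
sample = """\
w,syn
abaca,http://kaiko.getalp.org/dbnary/fra/Musa_textilis
abaca,http://kaiko.getalp.org/dbnary/fra/tagal
abandon,http://kaiko.getalp.org/dbnary/fra/apostasie
"""

word_syn_map = defaultdict(list)
reader = csv.reader(io.StringIO(sample))
next(reader)  # discard header
for w, syn in reader:
    # last path segment of the URL, underscores turned into spaces
    word_syn_map[w].append(os.path.basename(syn).replace("_", " "))

print(dict(word_syn_map))
# {'abaca': ['Musa textilis', 'tagal'], 'abandon': ['apostasie']}
```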
Another approach:
import os
import pandas as pd

df = pd.read_csv("sparql.csv", usecols=["w", "syn"])
df["syn_new"] = df.syn.map(os.path.basename).str.replace("_", " ")
dict_syn = {
    key: group.syn_new.to_list()
    for key, group in df[["w", "syn_new"]].groupby("w")
}
Result for your sample:
{'abaca': ['Musa textilis',
'chanvre de ...',
'tagal',
'Musa textilis',
'chanvre de ...'],
'abandon': ['apostasie',
'capitulation',
'cession de ...',
'confiance',
'défection']}
You could also try whether
df["syn_new"] = df.syn.str.rsplit("/", n=1, expand=True)[1].str.replace("_", " ")
works too; it might be faster.
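A quick sanity check (on a couple of made-up URLs in the same shape as the sample) that the rsplit variant extracts the same segment as os.path.basename:

```python
import os
import pandas as pd

# two URLs in the same shape as the sample data
syn = pd.Series([
    "http://kaiko.getalp.org/dbnary/fra/Musa_textilis",
    "http://kaiko.getalp.org/dbnary/fra/apostasie",
])

# variant 1: basename of the URL path
via_basename = syn.map(os.path.basename).str.replace("_", " ")
# variant 2: split once from the right on "/" and keep the last piece
via_rsplit = syn.str.rsplit("/", n=1, expand=True)[1].str.replace("_", " ")

assert via_basename.equals(via_rsplit)
print(via_rsplit.tolist())
```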
And maybe you don't want lists but sets as the dict_syn values, to avoid duplicates:
...
key: set(group.syn_new.to_list())
...
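Spelled out, the set variant of the comprehension looks like this (a small hard-coded DataFrame stands in for sparql.csv here):

```python
import os
import pandas as pd

# small stand-in for sparql.csv (assumed columns: w, syn)
df = pd.DataFrame({
    "w": ["abaca", "abaca", "abaca"],
    "syn": [
        "http://kaiko.getalp.org/dbnary/fra/Musa_textilis",
        "http://kaiko.getalp.org/dbnary/fra/tagal",
        "http://kaiko.getalp.org/dbnary/fra/Musa_textilis",
    ],
})
df["syn_new"] = df.syn.map(os.path.basename).str.replace("_", " ")

dict_syn = {
    key: set(group.syn_new)  # the set silently drops the duplicate 'Musa textilis'
    for key, group in df[["w", "syn_new"]].groupby("w")
}
print(dict_syn)
```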