Search methods and string matching in Python
I have a task to search for a group of specific terms (around 138,000 terms) in a table with 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column may contain more than one term.
I should end up with a CSV table containing the id of each row where a term was found, along with the term itself. What would be the best and fastest way to do this?
In my script, I tried creating phrases by iterating, in order, over the different words in a term and comparing them with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
    for place_length in range(len(sentence_array)):
        last_element = place_length + 1
        # Build the phrase made of the first last_element words
        phrase = ' '.join(sentence_array[0:last_element])
        if phrase in literalhash:
            # Record each matched phrase once per trial id
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
How should I be doing this?
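For reference, the phrase-building idea described above can be sketched as a sliding window over the words of each cell, with the terms held in a set so that each lookup is a hash lookup. This is only a minimal sketch under stated assumptions: `find_terms`, `match_rows`, the `max_words` cap and the sample rows are hypothetical names and data, not part of the original script, and punctuation handling (the job of `string_preparation`) is left out. Note that in the snippet as posted, the slice always starts at index 0, so unless the start is advanced somewhere not shown, only phrases beginning at the first word would be tested; the sketch tries every start position.

```python
import csv
import io

def find_terms(text, terms, max_words=6):
    """Return the dictionary terms found in text, in order of first appearance."""
    words = text.lower().split()
    found = []
    for start in range(len(words)):
        # Try every phrase beginning at this word, up to max_words words long.
        for end in range(start + 1, min(start + max_words, len(words)) + 1):
            phrase = " ".join(words[start:end])
            if phrase in terms and phrase not in found:
                found.append(phrase)
    return found

def match_rows(rows, terms, columns=("title", "scientific_title", "synonyms")):
    """Collect (id, term) pairs, reporting each term once per row."""
    results = []
    for row in rows:
        seen = []
        for col in columns:
            for term in find_terms(row[col], terms):
                if term not in seen:
                    seen.append(term)
                    results.append((row["id"], term))
    return results

# Hypothetical stand-ins for the real term list and table (assumptions).
terms = {"heart disease", "diabetes mellitus", "renal impairment"}
rows = [
    {"id": "T1",
     "title": "Sarpogrelate on Ischemic Heart Disease in Patients With Diabetes Mellitus",
     "scientific_title": "", "synonyms": ""},
    {"id": "T2", "title": "A study of sleep", "scientific_title": "", "synonyms": ""},
]

results = match_rows(rows, terms)

# Write the (id, term) pairs as CSV; an in-memory buffer stands in for a file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "term"])
writer.writerows(results)
print(out.getvalue())
```

Capping the phrase length keeps the work per cell roughly proportional to the number of words, which matters with 187,000 rows; the cap should be at least the word count of the longest term in the list.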
To clarify the problem: we are running a small scientific project in which we need to extract all text passages that contain particular keywords. We used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but something does not seem to be working properly.
For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! We are a biologist and a medical doctor with only a little knowledge of Python.
If you need more code, I can post it online.
The code on the website you link to is case-sensitive: it will only work when the terms in tumorabs.txt and neocl.xml have exactly the same case. If you can't change your data, then change the script:
After:
for line in text:
add:
    line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the otherwise unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
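To illustrate why the case matters, here is a small self-contained sketch, not the website's script. It assumes, hypothetically, that the dictionary stores its terms in lowercase while the title uses title case; `phrases`, `lookup` and `raw_terms` are made-up names for this example only.

```python
# Hypothetical dictionary entries; the real ones would come from neocl.xml.
raw_terms = ["heart disease", "diabetes mellitus"]

title = ("A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate "
         "on Ischemic Heart Disease After Drug-eluting Stent Implantation "
         "in Patients With Diabetes Mellitus or Renal Impairment")

def phrases(text):
    """Yield every contiguous run of words in text as a single string."""
    words = text.split()
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            yield " ".join(words[start:end])

def lookup(text, terms, normalize):
    """Return the sorted terms found in text after applying normalize to both sides."""
    keys = {normalize(t) for t in terms}
    return sorted({p for p in map(normalize, phrases(text)) if p in keys})

# Exact comparison: "Heart Disease" in the title never equals "heart disease".
exact = lookup(title, raw_terms, lambda s: s)

# Lowercasing both the dictionary and the text makes the comparison case-insensitive.
folded = lookup(title, raw_terms, str.lower)

print(exact)
print(folded)
```

The important point is that the same normalization is applied to both sides, which is what the two `.lower()` changes above do for the dictionary lines and for the phrases built from the text.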