I have a task to search for a group of specific terms(around 138000 terms) in a table made of 4 columns and 187000 rows. The column headers are id
, title
, scientific_title
and synonyms
, where each column might contain more than one term inside it.
I should end up with a csv table with the id where a term has been found and the term itself. What could be the best and the fastest way to do so?
In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.
It looks something like this:
title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)
for i in range(length):
for place_length in range(len(sentence_array)):
last_element = place_length + 1
phrase = ' '.join(sentence_array[0:last_element])
if phrase in literalhash:
final_dict.setdefault(id,[])
if not phrase in final_dict[id]:
final_dict[trial_id].append(phrase)
How should I be doing this?
To clarify the problem: we are running small scientific project where we need to extract all text parts with particular keywords. We have used coded dictionary and python script posted on http://www.julesberman.info/coded.htm ! But it seems that something does not working properly.
For exemple the script do not recognize a keyword "Heart Disease" in string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".
Thanks for understanding! we are a biologist and medical doctor, with little bit knowlege of python!
If you need some more code i would post it online.
The code on the website you link to is case-sensitive - it will only work when the terms in tumorabs.txt and neocl.xml are the exact same case. If you can't change your data then change:
After:
for line in text:
add:
line = line.lower()
(this is indented four spaces)
And change:
phrase = ' '.join(sentence_array[0:last_element])
to:
phrase = ' '.join(sentence_array[0:last_element]).lower()
AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.