Search methods and string matching in Python

Question

I need to search for a large set of specific terms (around 138,000) in a table with 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column may contain more than one term.

I should end up with a CSV table containing the id where a term was found and the term itself. What would be the best and fastest way to do this?

In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.

It looks something like this:

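title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)

for i in range(length):

    # Build every phrase that starts at the current first word
    for place_length in range(len(sentence_array)):

        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])

        if phrase in literalhash:
            # trial_id is the id value of the current row
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)

    # Drop the first word so the next pass starts one word later;
    # without this, every pass rebuilds the same phrases
    sentence_array.pop(0)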

How should I be doing this?

Answer 1

To clarify the problem: we are running a small scientific project in which we need to extract all text parts that contain particular keywords. We used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but something does not seem to be working properly.

For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".

Thanks for understanding! We are a biologist and a medical doctor with only a little knowledge of Python.

If you need more code, I can post it online.

Answer 2

The code on the website you link to is case-sensitive: it only works when the terms in tumorabs.txt and neocl.xml have exactly the same case. If you can't change your data, change the script as follows.

After:

for line in text:

add:

    line = line.lower()

(this is indented four spaces)

And change:

  phrase = ' '.join(sentence_array[0:last_element])

to:

  phrase = ' '.join(sentence_array[0:last_element]).lower()

AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
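For the table described in the question, the same idea (lowercase both sides, then look up word phrases in a hash) can be sketched as below. This is only a rough sketch under stated assumptions, not the code from the website: `normalize`, `find_terms` and the sample id `T1` are hypothetical names, rows are assumed to be dicts such as `csv.DictReader` yields, and punctuation is assumed to be irrelevant for matching.

```python
import csv
import io
import re


def normalize(text):
    """Lowercase and collapse every run of non-alphanumerics to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def find_terms(rows, terms, columns=("title", "scientific_title", "synonyms")):
    """Yield (id, term) for each term found as a whole-word phrase in a row."""
    # Map normalized term -> original spelling; dict lookups are O(1) on average.
    lookup = {normalize(t): t for t in terms}
    lookup.pop("", None)
    # No phrase longer than the longest term can match, so cap the window.
    max_words = max((len(k.split()) for k in lookup), default=0)

    for row in rows:
        seen = set()  # report each term once per id, even across columns
        for col in columns:
            words = normalize(row.get(col) or "").split()
            for start in range(len(words)):
                stop = min(start + max_words, len(words))
                for end in range(start + 1, stop + 1):
                    phrase = " ".join(words[start:end])
                    if phrase in lookup and phrase not in seen:
                        seen.add(phrase)
                        yield row["id"], lookup[phrase]


# Hypothetical sample data; the title is the one quoted in Answer 1.
terms = ["Heart Disease", "Diabetes Mellitus"]
rows = [{
    "id": "T1",
    "title": ("A Multicenter Randomized Trial Evaluating the Efficacy of "
              "Sarpogrelate on Ischemic Heart Disease After Drug-eluting "
              "Stent Implantation in Patients With Diabetes Mellitus or "
              "Renal Impairment"),
    "scientific_title": "",
    "synonyms": "",
}]
matches = list(find_terms(rows, terms))

# Write the (id, term) pairs as CSV; a real run would write to a file instead.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["id", "term"])
writer.writerows(matches)
```

Capping the phrase length at the longest term keeps the work per title roughly proportional to its word count, which matters with 187,000 rows. Matching is by whole words, so "Heart Disease" is found inside "Ischemic Heart Disease" but a term is not matched inside a longer word.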
