
Search methods and string matching in Python

I have a task to search for a group of specific terms (around 138,000 terms) in a table with 4 columns and 187,000 rows. The column headers are id, title, scientific_title and synonyms, and each column may contain more than one term.

I should end up with a CSV table containing the id where a term was found and the term itself. What would be the best and fastest way to do this?

In my script, I tried creating phrases by iterating over the different words in a term in order and comparing each word with each row of each column of the table.

It looks something like this:

title_prepared = string_preparation(title)
sentence_array = title_prepared.split(" ")
length = len(sentence_array)

for i in range(length):

    for place_length in range(len(sentence_array)):

        # Join the words up to and including position place_length
        last_element = place_length + 1
        phrase = ' '.join(sentence_array[0:last_element])

        # Record each matching phrase once per id
        if phrase in literalhash:
            final_dict.setdefault(trial_id, [])
            if phrase not in final_dict[trial_id]:
                final_dict[trial_id].append(phrase)
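
For comparison, here is a minimal sketch of the phrase enumeration, assuming the intent is to test every contiguous run of words and not only runs that begin at the first word. The names `find_terms` and `max_words` and the sample `terms` set are hypothetical; `terms` stands in for the lookup against `literalhash`, and the text is assumed to be lowercased already.

```python
def find_terms(text, terms, max_words):
    """Return the terms that occur in text as a contiguous run of words."""
    words = text.split()
    found = []
    for start in range(len(words)):
        # Grow the phrase one word at a time from this start position,
        # but never beyond the longest term we could possibly match.
        last = min(start + max_words, len(words))
        for end in range(start + 1, last + 1):
            phrase = " ".join(words[start:end])
            if phrase in terms and phrase not in found:
                found.append(phrase)
    return found


# Hypothetical sample data standing in for the real term list.
terms = {"heart disease", "diabetes mellitus"}
max_words = max(len(term.split()) for term in terms)
title = ("efficacy of sarpogrelate on ischemic heart disease "
         "in patients with diabetes mellitus")
print(find_terms(title, terms, max_words))
```

Capping the phrase length at the longest term keeps the number of set lookups per title roughly proportional to the number of words times `max_words`, which matters with 187,000 rows.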

How should I be doing this?

To clarify the problem: we are running a small scientific project in which we need to extract all parts of a text that contain particular keywords. We used the coded dictionary and Python script posted at http://www.julesberman.info/coded.htm, but something does not seem to be working properly.

For example, the script does not recognize the keyword "Heart Disease" in the string "A Multicenter Randomized Trial Evaluating the Efficacy of Sarpogrelate on Ischemic Heart Disease After Drug-eluting Stent Implantation in Patients With Diabetes Mellitus or Renal Impairment".

Thanks for understanding! We are a biologist and a medical doctor with only a little knowledge of Python.

If you need more of the code, I can post it online.

The code on the website you link to is case-sensitive: it will only match when the terms in tumorabs.txt and neocl.xml have exactly the same case. If you can't change your data, then change the script as follows.

After:

for line in text:

add:

    line = line.lower()

(this is indented four spaces)

And change:

  phrase = ' '.join(sentence_array[0:last_element])

to:

  phrase = ' '.join(sentence_array[0:last_element]).lower()

AFAICT this works with the unmodified code from the website when I change the case of some of the data in tumorabs.txt and neocl.xml.
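
To illustrate why the case matters, here is a minimal sketch, assuming the term dictionary is stored in lowercase. The dictionary contents, the `lookup` helper, and the `matches` list are hypothetical stand-ins, not taken from the website's script. The second half shows one way to write the (id, term) pairs with the standard `csv` module.

```python
import csv
import io

# Hypothetical stand-in for the term dictionary, stored in lowercase.
literalhash = {"heart disease": "code-1", "diabetes mellitus": "code-2"}


def lookup(phrase, fold_case):
    """Check a phrase against literalhash, optionally lowercasing it first."""
    key = phrase.lower() if fold_case else phrase
    return key in literalhash


phrase = "Heart Disease"
print(lookup(phrase, fold_case=False))  # False: the case differs
print(lookup(phrase, fold_case=True))   # True: matches after lowercasing

# Hypothetical (id, term) pairs, written as CSV to an in-memory buffer.
matches = [("trial-1", "heart disease"), ("trial-1", "diabetes mellitus")]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "term"])
writer.writerows(matches)
print(buffer.getvalue())
```

To write to disk instead of the in-memory buffer, `open("matches.csv", "w", newline="")` can be used, where the file name is again only an example.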
