Find a specific word in a docx file

I am working on a group assignment to read a docx file and then output the word 'carrier' or 'carriers' together with the word directly to the right of it. Our output contains only 26 of the 82 mentions of the word carrier in the document. I would appreciate suggestions as to what might be causing this. My hunch is that it has something to do with the for loop.

from docx import Document

emptyString = {}
tupl = ()
doc = Document('Interstate Commerce Act.docx')

for i, paragraph in enumerate(doc.paragraphs):
    text = paragraph.text
    text = text.split()
    #text = text.lower()

    if 'carrier' in text:
        next = text.index('carrier') + 1
        now = text.index('carrier')
        #print(text[now], text[next]) 
        tupl = (text[now], text[next])
        emptyString[i] = tupl

    if 'carriers' in text:
        next = text.index('carriers') + 1
        now = text.index('carriers')
        #print(text[now], text[next])
        tupl = (text[now], text[next])
        emptyString[i] = tupl

    if 'Carriers' in text:
        next = text.index('Carriers') + 1
        now = text.index('Carriers')
        #print(text[now], text[next])
        tupl = (text[now], text[next])
        emptyString[i] = tupl

    if 'Carrier' in text:
        next = text.index('Carrier') + 1
        now = text.index('Carrier')
        #print(text[now], text[next])
        tupl = (text[now], text[next])   
        emptyString[i] = tupl

print(emptyString)

Your text = text.split() line splits on whitespace only, so punctuation stays attached to the neighbouring word and some occurrences end up "hidden". For example, "The carrier is a Carrier." will produce the word list:

["The", "carrier", "is", "a", "Carrier."]

Since the last item is "Carrier." and not "Carrier", it will not be found by your exact-match test.

It is probably better to split the text into words and then check whether the lowercase version of each word contains "carrier":

words = text.split()
for i, word in enumerate(words):
    if "carrier" in word.lower():
        print("word %d is a match" % i)

Using the lowercase comparison avoids the need for separate tests for all the case varieties.
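Two other details of the posted loop also reduce the count: list.index() returns only the first occurrence in a paragraph, and emptyString[i] is keyed by paragraph number, so each paragraph can keep at most one tuple. A minimal sketch that records every match together with the word to its right is shown below. The function name find_carrier_pairs and the sample sentences are made up for illustration, and the function works on plain strings so it can be tried without the document; with python-docx you would presumably pass in (p.text for p in doc.paragraphs).

```python
def find_carrier_pairs(paragraph_texts):
    """Return (paragraph_index, word, next_word) for every word containing 'carrier'.

    Hypothetical helper: takes an iterable of paragraph strings rather than
    a docx Document, so it has no dependency on python-docx.
    """
    pairs = []
    for p_idx, text in enumerate(paragraph_texts):
        words = text.split()
        for w_idx, word in enumerate(words):
            # Substring test on the lowercase word, so "Carrier." and
            # "carriers," are matched despite case and punctuation.
            if "carrier" in word.lower():
                # The last word of a paragraph has no right-hand neighbour.
                next_word = words[w_idx + 1] if w_idx + 1 < len(words) else None
                # Append to a list instead of overwriting a per-paragraph key.
                pairs.append((p_idx, word, next_word))
    return pairs


# Made-up sample text, not taken from the actual document.
sample = [
    "The carrier is a Carrier.",
    "Common carriers shall file; no carrier, however, may refuse.",
]
for pair in find_carrier_pairs(sample):
    print(pair)
```

Because the test is a substring match, it also picks up forms such as "carrier's"; if only the exact words are wanted, strip the punctuation from each word and compare against "carrier" and "carriers" instead.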
