
Find a substring within a textfile in Python

I am trying to extract a link from a text file in Python. The link varies from file to file but always has the same format. I tried using the re library but keep getting errors.

The syntax of the link is:

docs.com/searchres.aspx?docformat=all&docid=[SOME NUMBER] - 

The link ends with an identifying number in the SOME NUMBER field, followed by ' - '. How can I search for, find, and save this link from a text file? Thank you. This is my first time posting on SO.

Here's a Python solution that uses memory maps. A few caveats:

  1. You said there was only one instance in the file; if there is more than one, it will return the first.
  2. I put this together quickly from some old code. If ] is not in the text file, it will continue reading. Take a look at the mmap documentation to see how you might modify the code to be more robust.

EDIT: Python's code formatter hates me, so I had to make some minor changes to get it to block properly. Sorry about that.

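import mmap

# Placeholders, not defined in the original post: the file path and the fixed start of the link.
db = 'textfile.txt'
target = 'docs.com/searchres.aspx?docformat=all&docid='

strOut = b''
try:
    with open(db, 'rb') as match:
        # mmap works on bytes, so open in binary mode and search for bytes.
        search = mmap.mmap(match.fileno(), 0, access=mmap.ACCESS_READ)
        index = search.find(target.encode())
        if index != -1:
            # This entry exists. Seek to its index and read one byte at a time.
            search.seek(index)
            read = search.read(1)
            # Stop at ']' or at end of file, where read(1) returns b''.
            while read and read != b']':
                strOut += read
                read = search.read(1)
        # An index of -1 means the target is not in the file; strOut stays empty.
        search.close()
except (OSError, ValueError) as err:
    # ValueError covers an empty file, which cannot be memory-mapped.
    print(err)

print(strOut.decode())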
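
The snippet above stops at ']', while the question says the link is followed by ' - '. As a rough sketch of the same memory-map idea with that terminator, something like the following might work. The function name find_link_mmap and the default prefix b"docs.com/" are made up for illustration, and returning None when no terminator is found is just one possible choice.

```python
import mmap


def find_link_mmap(path, prefix=b"docs.com/", terminator=b" - "):
    """Return the first link that starts with prefix and ends before terminator.

    Returns None if the file is empty, the prefix is missing, or no
    terminator follows the prefix.
    """
    with open(path, "rb") as f:
        try:
            # Length 0 maps the whole file; an empty file raises ValueError.
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        except ValueError:
            return None
        with mm:
            start = mm.find(prefix)
            if start == -1:
                return None
            # Search for the terminator only after the start of the link.
            end = mm.find(terminator, start)
            if end == -1:
                return None
            return mm[start:end].decode("utf-8", errors="replace")
```

Using mmap.find for both ends avoids the byte-by-byte loop, so there is nothing to run away if the terminator is missing.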

This answer is simple, but it reads the whole file into memory, so it is best suited to small files. When you say "save this link" I assume having the URL in a string variable is good enough.

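import re

# filename_str is the path to the text file (left undefined in the original answer).
with open(filename_str, 'r') as f:
    file_content = f.read()

# Escape the literal dot and match lazily, stopping just before the ' - ' that ends the link.
p = re.compile(r'docs\.com.*?(?= - )')
m = p.search(file_content)
# link is None if the file contains no match.
link = m.group(0) if m is not None else None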
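
If the file is large, or might contain more than one link, a line-by-line variant is one option. This is only a sketch: the names LINK_RE and find_links are invented here, and the pattern assumes the docid value contains no whitespace and is always followed by ' - ', as described in the question.

```python
import re

# Assumed pattern: the fixed prefix from the question, then a docid with no
# whitespace, ending just before the ' - ' separator.
LINK_RE = re.compile(r"docs\.com/searchres\.aspx\?docformat=all&docid=\S+?(?= - )")


def find_links(lines):
    """Return every matching link found in an iterable of text lines."""
    links = []
    for line in lines:
        # finditer yields each non-overlapping match on the line, left to right.
        links.extend(m.group(0) for m in LINK_RE.finditer(line))
    return links
```

Because find_links accepts any iterable of strings, it can be given an open file object, which Python reads one line at a time, or a plain list of lines.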
