How to perform a regex substitution starting from a specific index in Python

I have two files. I parse the first one, looking for regular-expression matches to replace with strings from the second file. The first file is a .csv file that contains the strings at index 3 of each row; indices 0-2 are just additional data.

A string from file 1 looks like this:

"foo http://abc bar http://123."
...
...

File 2 contains just a list of URLs that are meant to replace the ones found in file 1.

File 2 looks like this:

"http://def"
"http://456"
...
...

I start by iterating through file 1, looking for URLs. When I find one or more URLs, I replace each one with a URL from file 2 and then move on to the next. This is all done in order, so no URL from file 2 is repeated when replacing URLs in file 1.

The resulting string, after the parsing is complete, should look like this:

"foo http://def bar http://456"

My problem is that, when using re.sub to perform the substitution, I can either replace only the first URL or replace both of them at once with the same URL from file 2. For example, my string ends up looking like this:

"foo http://def bar http://def"

Is there a way that I can use re.sub to replace the first URL, then keep track of where it is in the string so that when it hits the second URL, it will replace it with the corresponding URL from file 2?

The code I have written is as follows:

import csv
import re

shortened = open('shortenedURLs.txt','r')
linesReadfromFile = shortened.readlines()
newRetweet = open('new_Retweet.csv','w')
with open('tweets_nurl.csv','rb') as inputfile1:
    read=csv.reader(inputfile1, delimiter=',')
    a = 0
    for row in read:
        url = re.findall('https*://', row[3])
        if url:
            for i in xrange(len(url)):
                currentLine=row[3].rstrip('\n')
                if re.search('http://', row[3]):
                    iter = re.finditer(r'http://',row[3])
                    indices = [m.start(0) for m in iter]
                    print indices
                    currentLine=re.sub(r'http://[^\s]*', linesReadfromFile[a].rstrip('\n'),  currentLine, count=1)
                    a=a+1
                if re.search('https://',row[3]):
                    currentLine = re.sub(r'https://[^\s]*', linesReadfromFile[a].rstrip('\n'), currentLine)
                    a=a+1
            newRetweet.write(row[0]+","+row[1]+","+row[2]+","+currentLine+'\n')
        else:
            newRetweet.write(','.join(row)+'\n')

shortened.close()
newRetweet.close()

The "print indices" line tells me where the matches are found, but I'm not sure how to use those positions to specify where the substitutions should take place.

Thanks for any help!

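import re

site_urls = list(open("urls.txt"))  # one replacement URL per line

def replacer(match):
    # called once per match; hands out the next unused URL, in order
    return site_urls.pop(0).strip()

new_text = re.sub("http[^ ]*", replacer, my_file_text)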

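I think this would do what you want: re.sub accepts a function as the replacement and calls it once for each match, so every URL gets a different replacement.

Of course, since the function is one line, you could easily replace it with a lambda. I just used a normal function for illustrative purposes.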
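
To make this easier to try out, here is a minimal self-contained sketch of the same idea in Python 3, using the sample strings from the question. The helper name replace_urls_in_order, the pattern https?://\S+, and the choice to leave a URL unchanged once the replacement list runs out are my own assumptions, not something from the original code:

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")

def replace_urls_in_order(texts, replacements):
    """Replace each URL found in `texts` with the next unused replacement."""
    pool = iter(replacements)

    def next_url(match):
        # re.sub calls this once per match, left to right. If the pool is
        # exhausted, fall back to the URL that was matched (an assumption).
        return next(pool, match.group(0))

    return [URL_PATTERN.sub(next_url, text) for text in texts]

# Sample data modelled on the question; "see https://xyz" is made up.
rows = ["foo http://abc bar http://123.", "no links here", "see https://xyz"]
new_urls = ["http://def", "http://456", "https://789"]

result = replace_urls_in_order(rows, new_urls)
print(result[0])  # foo http://def bar http://456
```

Because the iterator is shared across all the strings, the position in the replacement list carries over from one row to the next, so there is no need to track match indices by hand. Note that \S+ also swallows the trailing period after the last URL, which happens to match the expected output in the question.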

