
Python splitting text between every instance of two specific strings (Regex)

I'm trying to do some web scraping with Python and I'm having some trouble. I have a big mass of scraped text and I'm trying to generate a list that contains every piece of text found between two specific strings.

A bunch of lines contain something in the format of "href= /profile/pc/WORD/matches" and I want to create a list of all the WORDs (Every word between an instance of " /profile/pc/" and "/matches").

I tried starting with something like this but I'm not even getting any output. Any help on where to go from here?

import re
from urllib.request import Request, urlopen

url = "http://examplewebsite.com"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
q = webpage.replace('"', '_')    # Replace quotation marks with underscores
print(re.split(r'href=_/profile/pc/', q))

PS Previously I did something like this but I was only getting the first result.

 substring1 = '<a href=_/profile/pc/'   #Starting string before name
 substring2 = '/matches_>'   #Ending string after name
 my_string = q[(q.index(substring1)+len(substring1)):q.index(substring2)]

The page contains many lines, so you need to loop over all of them and try the match on each one.

import re
from urllib.request import Request, urlopen

url = "http://examplewebsite.com"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
q = webpage.replace('"', '_')    # Replace quotation marks with underscores
# Iterate over lines; looping over q directly would yield single characters
for row in q.splitlines():
    print(re.split(r'href=_/profile/pc/', row))
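Since the goal is a list of every WORD rather than the pieces around it, re.findall with a capture group may be a better fit than re.split: it returns all matches at once, so the quote replacement and the line loop are not needed. This is only a sketch. The names extract_names and PROFILE_RE and the sample HTML are made up for illustration, and the pattern assumes the links look like href="/profile/pc/WORD/matches" in the raw page.

```python
import re

# Hypothetical pattern: capture whatever sits between "/profile/pc/" and
# "/matches", stopping at a slash or a double quote.
PROFILE_RE = re.compile(r'/profile/pc/([^/"]+)/matches')


def extract_names(html):
    """Return every WORD found between '/profile/pc/' and '/matches'."""
    # With exactly one capture group, findall returns a list of the captured strings.
    return PROFILE_RE.findall(html)


# Made-up sample standing in for the decoded page text (webpage above).
sample = (
    '<a href="/profile/pc/Alice/matches">Alice</a>\n'
    '<a href="/profile/pc/Bob-42/matches">Bob</a>\n'
    '<a href="/about">About</a>\n'
)

names = extract_names(sample)
print(names)  # ['Alice', 'Bob-42']
```

With the real page you would call extract_names(webpage) on the decoded text. This also explains why the q.index approach returned only one result: str.index reports only the first occurrence of each substring.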
