
How to remove double quotes nested within other double quotes using regex

I am trying to use json.loads to load data that I fetched and parsed with Beautiful Soup. However, some of the fields contain double quotes inside the field value. Example:

"rComments":"He is a very easy grader, but gets boring occasionally. I wish he would quit saying "Without further ado..." Cancer Bio is a great class because there is a different lecturer each time."

This is causing the following error:

JSONDecodeError: Expecting ',' delimiter: line 1 column 3556 (char 3555)

Is there a way to replace only the double quotes around "Without further ado..." with single quotes (or remove them) using regex or another method? I need to keep the other double quotes because JSON requires them.

Here is a copy of my code. It fails for any professor ID whose comments contain a nested double quote.

import json
import re

import requests
from bs4 import BeautifulSoup

# Make Request
url1 = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid={}&filter=&courseCode=&page=1'.format(124880)
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.text, "html.parser")
soup1 = str(soup1)

# Remove Double Quotes in Comments
soup1 = re.sub(r'(?:[\b\s\:]\".*)(?:.*)(\")(?:.*\")', '', soup1)

# Create Dictionary
Dict1 = json.loads(soup1)

I also tried the regex below, and it didn't work either.

:r"(\".*?)\"(.*?)\"(.*\")

For reference, this is what is returned by repr(soup1).

'\'{"ratings":[{"attendance":"N/A","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":2,"id":29366967,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL4015","rComments":"One of my favorite professors at Tech. Really cares about his students, and even brought us apples from Elijay and snacks during the final. His tests are not too bad and the group project is pretty easy. Good teacher and even better human being.","rDate":"01/01/2018","rEasy":3.0,"rEasyString":"3.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":5.0,"rOverallString":"5.0","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1514816343000,"rWouldTakeAgain":"Yes","sId":361,"takenForCredit":"Yes","teacher":null,"teacherGrade":"B+","teacherRatingTags":["Inspirational","Caring"],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"Not Mandatory","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":0,"id":28805507,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL3450","rComments":"GOAT","rDate":"10/30/2017","rEasy":3.0,"rEasyString":"3.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":5.0,"rOverallString":"5.0","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1509404689000,"rWouldTakeAgain":"Yes","sId":361,"takenForCredit":"Yes","teacher":null,"teacherGrade":"A","teacherRatingTags":["Caring","Get ready to read","Accessible outside class"],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"average","easyColor":"good","helpColor":"poor","helpCount":0,"id":19977224,"notHelpCount":0,"onlineClass":"","quality":"poor","rClarity":2,"rClass":"BIOL3450","rComments":"Dr Merril is a really, really nice person, and I\\\'m sure he\\\'s great doing his research but he is just not a good professor for a lecture based class with 150ish people. He\\\'s soft spoken, moves too fast in lecture and goes into unnecessary detail. 
Also does not hold office hours. Would rather defer students to TA.","rDate":"03/31/2012","rEasy":4.0,"rEasyString":"4.0","rErrorMsg":null,"rHelpful":1,"rInterest":"Low","rOverall":1.5,"rOverallString":"1.5","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1333212949000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"good","easyColor":"good","helpColor":"average","helpCount":0,"id":15545116,"notHelpCount":0,"onlineClass":"","quality":"good","rClarity":5,"rClass":"BIOL3340","rComments":"Dr. Merrill is a very nice man and a decent teacher. Class attendance isn\\\'t necessary, however, he does offer extra credit for attendence occasionally. The class is all memorization and a lot of nit-picky information. Didn\\\'t like the class too much, but he was a fine teacher.","rDate":"03/18/2009","rEasy":4.0,"rEasyString":"4.0","rErrorMsg":null,"rHelpful":2,"rInterest":"Meh","rOverall":3.5,"rOverallString":"3.5","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1237418592000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"good","easyColor":"poor","helpColor":"good","helpCount":1,"id":10944025,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL8802","rComments":"He is a very easy grader, but gets boring occasionally. I wish he would quit saying "Without further ado..." 
Cancer Bio is a great class because there is a different lecturer each time.","rDate":"11/18/2005","rEasy":1.0,"rEasyString":"1.0","rErrorMsg":null,"rHelpful":4,"rInterest":"It\\\'s my life","rOverall":4.5,"rOverallString":"4.5","rStatus":1,"rTextBookUse":"N/A","rTimestamp":1132303531000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"person"},{"attendance":"N/A","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":0,"id":614809,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":4,"rClass":"3331","rComments":"Not very challenging","rDate":"02/22/2003","rEasy":2.0,"rEasyString":"2.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":4.5,"rOverallString":"4.5","rStatus":1,"rTextBookUse":"N/A","rTimestamp":1045879151000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"}],"remaining":0}\''

It looks like the API you're downloading from returns JSON, not HTML, so you don't need to parse it with BeautifulSoup. You can simply do the following:

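import requests

# Request the ratings endpoint directly; no HTML parsing is needed
url = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid={}&filter=&courseCode=&page=1'.format(124880)
page = requests.get(url)

# Decode the response body as JSON into a Python dict
data = page.json()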
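
If the original response is no longer available and all you have is text with unescaped inner quotes, a regex heuristic may be able to repair it. The following is only a sketch and not part of the answer above: the names INNER_QUOTE and escape_inner_quotes are hypothetical, and it assumes compact JSON (no whitespace around structural characters) in which a quote is structural only when it directly follows one of `{ [ , :` or directly precedes one of `, : } ]`. Every other double quote is treated as belonging to the text and gets escaped.

```python
import json
import re

# Heuristic (an assumption, not a general JSON repair): a double quote is
# "inner" when it is NOT directly preceded by { [ , : or a backslash and
# NOT directly followed by , : } ].
INNER_QUOTE = re.compile(r'(?<![{\[,:\\])"(?![,:}\]])')


def escape_inner_quotes(text):
    """Backslash-escape double quotes that look like part of a string value."""
    return INNER_QUOTE.sub(lambda match: '\\"', text)


# Shortened version of the problematic record from the question
broken = (
    '{"rComments":"I wish he would quit saying "Without further ado..." '
    'Cancer Bio is a great class.","rDate":"11/18/2005"}'
)

fixed = escape_inner_quotes(broken)
data = json.loads(fixed)
print(data["rComments"])
```

This will misfire on comments in which a quote happens to sit next to a structural character, for example `"Hello", he said`, so fetching the JSON directly remains the safer route whenever it is possible.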

