
How to remove double quotes nested within other double quotes using regex

I am trying to use json.loads to load data that I gathered with Beautiful Soup. However, some of the fields in the data contain double quotes inside the field value. Example:

"rComments":"He is a very easy grader, but gets boring occasionally. I wish he would quit saying "Without further ado..." Cancer Bio is a great class because there is a different lecturer each time."

This causes the following error:

JSONDecodeError: Expecting ',' delimiter: line 1 column 3556 (char 3555)
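
The same kind of error can be reproduced with a much smaller string. The sketch below uses a shortened, hypothetical payload (not the real response) in which the value contains an unescaped quote pair; the decoder treats the first inner quote as the end of the string and then expects a comma or closing brace.

```python
import json

# Hypothetical, shortened payload: the value contains an unescaped quote pair.
broken = '{"rComments":"He keeps saying "Without further ado..." in class."}'

try:
    json.loads(broken)
    error = None
except json.JSONDecodeError as exc:
    # The decoder stops right after the first inner quote.
    error = exc

print(error)
```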

Is there a way to replace only the double quotes around "Without further ado..." with single quotes, or remove them, using regex or another method? I need to keep the other double quotes because JSON requires them.

Here is a copy of my code. It fails for any Prof ID whose data contains a nested double quote.

import json
import re

import requests
from bs4 import BeautifulSoup

# Make Request
url1 = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid={}&filter=&courseCode=&page=1'.format(124880)
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.text, "html.parser")
soup1 = str(soup1)

# Remove Double Quotes in Comments
soup1 = re.sub(r'(?:[\b\s\:]\".*)(?:.*)(\")(?:.*\")', '', soup1)

# Create Dictionary
Dict1 = json.loads(soup1)

I have also tried the regex below, and it didn't work either.

:r"(\".*?)\"(.*?)\"(.*\")

For reference, this is what repr(soup1) returns.

'\'{"ratings":[{"attendance":"N/A","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":2,"id":29366967,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL4015","rComments":"One of my favorite professors at Tech. Really cares about his students, and even brought us apples from Elijay and snacks during the final. His tests are not too bad and the group project is pretty easy. Good teacher and even better human being.","rDate":"01/01/2018","rEasy":3.0,"rEasyString":"3.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":5.0,"rOverallString":"5.0","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1514816343000,"rWouldTakeAgain":"Yes","sId":361,"takenForCredit":"Yes","teacher":null,"teacherGrade":"B+","teacherRatingTags":["Inspirational","Caring"],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"Not Mandatory","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":0,"id":28805507,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL3450","rComments":"GOAT","rDate":"10/30/2017","rEasy":3.0,"rEasyString":"3.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":5.0,"rOverallString":"5.0","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1509404689000,"rWouldTakeAgain":"Yes","sId":361,"takenForCredit":"Yes","teacher":null,"teacherGrade":"A","teacherRatingTags":["Caring","Get ready to read","Accessible outside class"],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"average","easyColor":"good","helpColor":"poor","helpCount":0,"id":19977224,"notHelpCount":0,"onlineClass":"","quality":"poor","rClarity":2,"rClass":"BIOL3450","rComments":"Dr Merril is a really, really nice person, and I\\\'m sure he\\\'s great doing his research but he is just not a good professor for a lecture based class with 150ish people. He\\\'s soft spoken, moves too fast in lecture and goes into unnecessary detail. 
Also does not hold office hours. Would rather defer students to TA.","rDate":"03/31/2012","rEasy":4.0,"rEasyString":"4.0","rErrorMsg":null,"rHelpful":1,"rInterest":"Low","rOverall":1.5,"rOverallString":"1.5","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1333212949000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"good","easyColor":"good","helpColor":"average","helpCount":0,"id":15545116,"notHelpCount":0,"onlineClass":"","quality":"good","rClarity":5,"rClass":"BIOL3340","rComments":"Dr. Merrill is a very nice man and a decent teacher. Class attendance isn\\\'t necessary, however, he does offer extra credit for attendence occasionally. The class is all memorization and a lot of nit-picky information. Didn\\\'t like the class too much, but he was a fine teacher.","rDate":"03/18/2009","rEasy":4.0,"rEasyString":"4.0","rErrorMsg":null,"rHelpful":2,"rInterest":"Meh","rOverall":3.5,"rOverallString":"3.5","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1237418592000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"good","easyColor":"poor","helpColor":"good","helpCount":1,"id":10944025,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL8802","rComments":"He is a very easy grader, but gets boring occasionally. I wish he would quit saying "Without further ado..." 
Cancer Bio is a great class because there is a different lecturer each time.","rDate":"11/18/2005","rEasy":1.0,"rEasyString":"1.0","rErrorMsg":null,"rHelpful":4,"rInterest":"It\\\'s my life","rOverall":4.5,"rOverallString":"4.5","rStatus":1,"rTextBookUse":"N/A","rTimestamp":1132303531000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"person"},{"attendance":"N/A","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":0,"id":614809,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":4,"rClass":"3331","rComments":"Not very challenging","rDate":"02/22/2003","rEasy":2.0,"rEasyString":"2.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":4.5,"rOverallString":"4.5","rStatus":1,"rTextBookUse":"N/A","rTimestamp":1045879151000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"}],"remaining":0}\''

It looks like the API you're downloading from returns JSON, not HTML, so you don't need to parse it with BeautifulSoup. You can simply do the following:

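import requests

# The endpoint already returns JSON, so no HTML parsing is needed.
url = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid={}&filter=&courseCode=&page=1'.format(124880)
page = requests.get(url)
data = page.json()  # decode the response body into a Python dict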
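
If a payload does still contain unescaped inner quotes, one possible fallback is a heuristic repair before calling json.loads. The sketch below is only that: it assumes compact JSON with no whitespace between tokens, and it treats a quote as structural when it directly follows one of { [ , : or directly precedes one of , } ] :. The names INNER_QUOTE and escape_inner_quotes are hypothetical, and the sample string is a shortened version of the comment from the question.

```python
import json
import re

# Heuristic: match a double quote that is NOT preceded by { [ , : or a
# backslash and NOT followed by , } ] :. Such a quote is assumed to sit
# inside a string value rather than delimit it.
INNER_QUOTE = re.compile(r'(?<![{\[,:\\])"(?![,}\]:])')


def escape_inner_quotes(text):
    """Backslash-escape quotes that look like they are nested in a value."""
    return INNER_QUOTE.sub(lambda match: '\\"', text)


# Shortened sample based on the comment in the question.
broken = (
    '{"rComments":"I wish he would quit saying "Without further ado..." '
    'Cancer Bio is a great class.","rDate":"11/18/2005"}'
)

fixed = escape_inner_quotes(broken)
data = json.loads(fixed)
print(data["rComments"])
```

This heuristic misfires when an inner quote happens to touch a comma, colon, or bracket (for example, a comment containing "hi", he said), so it is best kept as a last resort when the source cannot supply valid JSON.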
