
How to remove double quotes nested within other double quotes regex

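I am trying to use json.loads to load data collected with Beautiful Soup. However, the data I am working with has a problem: some fields contain double quotes inside the field value. Example: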

"rComments":"He is a very easy grader, but gets boring occasionally. I wish he would quit saying "Without further ado..." Cancer Bio is a great class because there is a different lecturer each time."

This causes the following error:

JSONDecodeError: Expecting ',' delimiter: line 1 column 3556 (char 3555)
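The failure can be reproduced on a much shorter string. The snippet below is a minimal sketch, not the actual API payload: `broken` and `escaped` are shortened, made-up variants of the comment above. It illustrates that the parser treats the first inner quote as the end of the string value, while the same text with backslash-escaped quotes parses fine.

```python
import json

# Shortened, hypothetical versions of the problematic field.
broken = '{"rComments":"quit saying "Without further ado..." each time."}'
escaped = '{"rComments":"quit saying \\"Without further ado...\\" each time."}'

try:
    json.loads(broken)
    error_message = None
except json.JSONDecodeError as exc:
    # The string value ends at the first inner quote, so the parser
    # expects a ',' or '}' next and finds the letter W instead.
    error_message = exc.msg

parsed = json.loads(escaped)

print(error_message)
print(parsed["rComments"])
```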

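Is there a way, using regex or some other method, to replace the double quotes around "Without further ado..." with single quotes or no quotes at all? I need to keep the other double quotes, because JSON requires them.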

Here is a copy of my code. It fails for any Prof ID whose data contains nested double quotes.

import json
import re

import requests
from bs4 import BeautifulSoup

# Make Request
url1 = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid={}&filter=&courseCode=&page=1'.format(124880)
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.text, "html.parser")
soup1 = str(soup1)

# Remove Double Quotes in Comments
soup1 = re.sub(r'(?:[\b\s\:]\".*)(?:.*)(\")(?:.*\")', '', soup1)

# Create Dictionary
Dict1 = json.loads(soup1)

I also tried the regex below, but it did not work either.

:r"(\".*?)\"(.*?)\"(.*\")

For reference, this is what repr(soup1) returns.

'\'{"ratings":[{"attendance":"N/A","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":2,"id":29366967,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL4015","rComments":"One of my favorite professors at Tech. Really cares about his students, and even brought us apples from Elijay and snacks during the final. His tests are not too bad and the group project is pretty easy. Good teacher and even better human being.","rDate":"01/01/2018","rEasy":3.0,"rEasyString":"3.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":5.0,"rOverallString":"5.0","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1514816343000,"rWouldTakeAgain":"Yes","sId":361,"takenForCredit":"Yes","teacher":null,"teacherGrade":"B+","teacherRatingTags":["Inspirational","Caring"],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"Not Mandatory","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":0,"id":28805507,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL3450","rComments":"GOAT","rDate":"10/30/2017","rEasy":3.0,"rEasyString":"3.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":5.0,"rOverallString":"5.0","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1509404689000,"rWouldTakeAgain":"Yes","sId":361,"takenForCredit":"Yes","teacher":null,"teacherGrade":"A","teacherRatingTags":["Caring","Get ready to read","Accessible outside class"],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"average","easyColor":"good","helpColor":"poor","helpCount":0,"id":19977224,"notHelpCount":0,"onlineClass":"","quality":"poor","rClarity":2,"rClass":"BIOL3450","rComments":"Dr Merril is a really, really nice person, and I\\\'m sure he\\\'s great doing his research but he is just not a good professor for a lecture based class with 150ish people. He\\\'s soft spoken, moves too fast in lecture and goes into unnecessary detail. 
Also does not hold office hours. Would rather defer students to TA.","rDate":"03/31/2012","rEasy":4.0,"rEasyString":"4.0","rErrorMsg":null,"rHelpful":1,"rInterest":"Low","rOverall":1.5,"rOverallString":"1.5","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1333212949000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"good","easyColor":"good","helpColor":"average","helpCount":0,"id":15545116,"notHelpCount":0,"onlineClass":"","quality":"good","rClarity":5,"rClass":"BIOL3340","rComments":"Dr. Merrill is a very nice man and a decent teacher. Class attendance isn\\\'t necessary, however, he does offer extra credit for attendence occasionally. The class is all memorization and a lot of nit-picky information. Didn\\\'t like the class too much, but he was a fine teacher.","rDate":"03/18/2009","rEasy":4.0,"rEasyString":"4.0","rErrorMsg":null,"rHelpful":2,"rInterest":"Meh","rOverall":3.5,"rOverallString":"3.5","rStatus":1,"rTextBookUse":"Yes","rTimestamp":1237418592000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"},{"attendance":"N/A","clarityColor":"good","easyColor":"poor","helpColor":"good","helpCount":1,"id":10944025,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":5,"rClass":"BIOL8802","rComments":"He is a very easy grader, but gets boring occasionally. I wish he would quit saying "Without further ado..." 
Cancer Bio is a great class because there is a different lecturer each time.","rDate":"11/18/2005","rEasy":1.0,"rEasyString":"1.0","rErrorMsg":null,"rHelpful":4,"rInterest":"It\\\'s my life","rOverall":4.5,"rOverallString":"4.5","rStatus":1,"rTextBookUse":"N/A","rTimestamp":1132303531000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"person"},{"attendance":"N/A","clarityColor":"good","easyColor":"average","helpColor":"good","helpCount":0,"id":614809,"notHelpCount":0,"onlineClass":"","quality":"awesome","rClarity":4,"rClass":"3331","rComments":"Not very challenging","rDate":"02/22/2003","rEasy":2.0,"rEasyString":"2.0","rErrorMsg":null,"rHelpful":5,"rInterest":"N/A","rOverall":4.5,"rOverallString":"4.5","rStatus":1,"rTextBookUse":"N/A","rTimestamp":1045879151000,"rWouldTakeAgain":"N/A","sId":361,"takenForCredit":"N/A","teacher":null,"teacherGrade":"N/A","teacherRatingTags":[],"unUsefulGrouping":"people","usefulGrouping":"people"}],"remaining":0}\''

It looks like the API you are downloading from returns JSON, not HTML, so you don't need to parse it with BeautifulSoup. You can simply do the following:

import requests


url = 'https://www.ratemyprofessors.com/paginate/professors/ratings?tid={}&filter=&courseCode=&page=1'.format(124880)
page = requests.get(url)
data = page.json()  # let requests parse the JSON body directly
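If the text you are holding really does contain unescaped inner quotes (for example, a saved copy of the string shown in the question), a regex heuristic can sometimes repair it before parsing. The sketch below is not part of the answer above. It assumes compact JSON with no whitespace around delimiters, and the helper name `escape_inner_quotes` and the sample string are made up for illustration. It treats a double quote as structural only when it sits directly next to one of `{ [ : , ] }` and escapes every other one.

```python
import json
import re

# A quote is assumed to be structural if it is preceded by one of { [ : ,
# or followed by one of ] } : , and quotes that are already escaped
# (preceded by a backslash) are left alone.
INNER_QUOTE = re.compile(r'(?<![\[{:,\\])"(?![\]}:,])')


def escape_inner_quotes(raw: str) -> str:
    """Backslash-escape quotes that look like they are inside a string value.

    Hypothetical helper: a heuristic for compact JSON only.
    """
    return INNER_QUOTE.sub(r'\\"', raw)


# Shortened, made-up sample in the same shape as the data in the question.
raw = (
    '{"rComments":"I wish he would quit saying "Without further ado..." '
    'Cancer Bio is great.","rDate":"11/18/2005","tags":[],"onlineClass":""}'
)

repaired = escape_inner_quotes(raw)
data = json.loads(repaired)
print(data["rComments"])
```

This is only a heuristic: a comment such as `he said "hi", then left` has an inner quote followed by a comma, which the pattern would mistake for a structural quote. Fetching the JSON directly, as shown above, avoids the problem entirely.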
