Regex/Python: Find everything before one symbol, if it's after another symbol
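If the string contains a long dash (" ―"), I want to return the full string up to and including it, and then everything before the first comma (",") that follows it. How can I do this with regex in Python?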
from bs4 import BeautifulSoup
import requests
import json
import pandas as pd
request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'lxml')
# for loop
s = soup.find_all("div", class_="quoteText")[0].text
s = " ".join(s.split())
s[:s.index(",")]
s
Original output:
“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare, City of Ashes //<![CDATA[ function submitShelfLink(unique_id, book_id, shelf_id, shelf_name, submit_form, exclusive) { var checkbox_id = \'shelf_name_\' + unique_id + \'_\' + shelf_id; var element = document.getElementById(checkbox_id) var checked = element.checked if (checked && exclusive) { // can\'t uncheck a radio by clicking it! return } if(document.getElementById("savingMessage")){ Element.show(\'savingMessage\') } var element_id = \'shelfInDropdownName_\' + unique_id + \'_\' + shelf_id; Element.upda
Desired output:
“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare
I'm not sure I understood correctly, but I think you mean:
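import re

example_string = "part to return,example__text"
result = None
if "__" in example_string:
    # Capture everything before the first comma; group(1) excludes the comma itself.
    match = re.search(r"(.*?),", example_string)
    result = match.group(1) if match else None
print(result)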
This prints "part to return".
If you mean the part of the string between '__' and ',', I would use:
import re

example_string = "lala__part to return, lala"
# group(1) is the text between "__" and the first comma that follows it.
match = re.search(r"__(.*?),", example_string)
result = match.group(1) if match else None
print(result)
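The difference between group(0) and group(1) is easy to miss, so a small sketch may help. It reuses the pattern and sample string from the snippet above; the variable names whole_match, captured, and no_match are only illustrative.

```python
import re

example_string = "lala__part to return, lala"
pattern = r"__(.*?),"

match = re.search(pattern, example_string)

# group(0) is the entire match, delimiters included.
whole_match = match.group(0)
# group(1) is only what the parentheses captured.
captured = match.group(1)

# re.search returns None when nothing matches, so test the result
# before calling .group() instead of relying on a bare except.
no_match = re.search(pattern, "no marker here")

print(repr(whole_match))  # '__part to return,'
print(repr(captured))     # 'part to return'
print(no_match)           # None
```

Checking the match object for None keeps unrelated errors from being silently swallowed.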
from bs4 import BeautifulSoup
from bs4.element import NavigableString
import requests
request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')
# for loop
s = soup.find_all("div", class_="quoteText")[0]
text = ''
text += "".join([t.strip() for t in s.contents if type(t) == NavigableString])
for book_or_author_tag in s.find_all("a", class_="authorOrTitle"):
    text += "\n" + book_or_author_tag.text.strip()
print(text)
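The quote you want is split across several children of the initial quoteText div, but calling .text on that div returns all the CDATA junk you were trying to remove with a regex.
By iterating over each child of that div and checking whether its type is NavigableString, we can extract only the actual text we want. Then append the author and book, and hopefully your regex becomes much simpler.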
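Once the text is reduced to the quote, the dash, the author, and the title, plain string methods may be enough and no regex is needed. The sketch below is one possible approach, not part of the answer above. It assumes the cleaned text has the form "quote ― author, title"; the function name trim_after_author is hypothetical, and returning the input unchanged when there is no dash is an assumption about the desired behavior.

```python
def trim_after_author(text, dash="―"):
    """Keep everything up to the first comma that follows the last dash."""
    # rpartition splits on the LAST dash, so a dash inside the quote
    # itself does not confuse the split.
    head, sep, tail = text.rpartition(dash)
    if not sep:
        # Assumption: with no dash present, return the text as-is.
        return text
    # Keep only the part of the tail before its first comma (the author).
    author = tail.split(",", 1)[0]
    return head + sep + author


cleaned = "“Quote, with a comma.” ― Cassandra Clare, City of Ashes"
print(trim_after_author(cleaned))
```

Commas inside the quote are left alone because only the text after the dash is split.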
Here is one solution:
import re

s = 'adflakjd, fkljlkjdf ― Cassandra Clare, City of Ash, adflak'
# Greedy .* runs up to the dash; lazy .*? then stops just before the next comma.
x = re.findall(r'.*―.*?(?=,)', s)
print(x)
['adflakjd, fkljlkjdf ― Cassandra Clare']
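As a sketch of how the same pattern might be applied to a string shaped like the scraped output, the version below uses re.search, which returns at most one match, instead of re.findall, which returns a list. The function name quote_and_author, the shortened sample string, the re.DOTALL flag, and the fallback of returning the input unchanged are all assumptions, not part of the answer above.

```python
import re

# DOTALL lets "." cross newlines in case the scraped text was not
# collapsed onto a single line first.
PATTERN = re.compile(r".*―.*?(?=,)", re.DOTALL)


def quote_and_author(text):
    """Return the text up to the first comma after the last dash."""
    match = PATTERN.search(text)
    # Assumption: fall back to the full text when there is no dash,
    # or no comma after it.
    return match.group(0) if match else text


s = '“That does it," said Jace.” ― Cassandra Clare, City of Ashes //<![CDATA[ function f(a, b) { return }'
print(quote_and_author(s))
```

Because the leading .* is greedy, the match anchors on the last dash in the string, so this only works if the trailing script text contains no "―" of its own.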