
Regex/Python: Find everything before one symbol, if it's after another symbol

I want to return the string only if it contains a long dash ("―"), and in that case return everything before the first comma (",") that follows the dash. How would I do this in Python with a regex?

from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'lxml')
# for loop
s = soup.find_all("div", class_="quoteText")[0].text
s = " ".join(s.split()) 
s[:s.index(",")]
s

Raw Output:

“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare, City of Ashes //<![CDATA[ function submitShelfLink(unique_id, book_id, shelf_id, shelf_name, submit_form, exclusive) { var checkbox_id = \'shelf_name_\' + unique_id + \'_\' + shelf_id; var element = document.getElementById(checkbox_id) var checked = element.checked if (checked && exclusive) { // can\'t uncheck a radio by clicking it! return } if(document.getElementById("savingMessage")){ Element.show(\'savingMessage\') } var element_id = \'shelfInDropdownName_\' + unique_id + \'_\' + shelf_id; Element.upda

Desired Output:

“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare

I'm not sure I understand it properly, but I think you mean:

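import re

example_string = "part to return,example__text"
result = None
if "__" in example_string:
    # Lazily capture everything before the first comma;
    # group(1) excludes the comma, whereas group(0) would include it.
    match = re.search(r"(.*?),", example_string)
    if match:
        result = match.group(1)
print(result)

This prints 'part to return'.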

If you mean the part of the string between the '__' and the ',', I would use:

import re

example_string = "lala__part to return, lala"
# group(1) holds only the text between '__' and the first comma after it.
match = re.search(r"__(.*?),", example_string)
result = match.group(1) if match else None
print(result)

This also prints 'part to return'.

from bs4 import BeautifulSoup
from bs4.element import NavigableString
import requests


request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')
# for loop
s = soup.find_all("div", class_="quoteText")[0]

text = ''

text += "".join([t.strip() for t in s.contents if type(t) == NavigableString])

for book_or_author_tag in s.find_all("a", class_ = "authorOrTitle"):
    text += "\n" + book_or_author_tag.text.strip()

print(text)

The quote you want is split across the children of the initial quoteText div, but calling .text on that div also returns all the CDATA junk you're trying to remove with the regex.

By looping over every direct child of that div and keeping only the children whose type is NavigableString, we can extract just the text you want. Then tack on the author and book, and hopefully your regex becomes a lot simpler.
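As a rough illustration of how much simpler things get once the string holds only the quote and its attribution, plain string methods may be enough and no regex is needed. This is only a sketch: the function name `trim_after_author` and the sample string are made up for the example, and it assumes the author name is followed by a comma as in the question's raw output.

```python
DASH = "―"  # the long dash from the question, placed before the author name

def trim_after_author(text, dash=DASH):
    """Return text up to the first comma after the dash, or None if there is no dash."""
    head, sep, tail = text.partition(dash)
    if not sep:
        # No long dash found, so there is nothing to return.
        return None
    # Keep only the part of the tail that precedes its first comma.
    author = tail.split(",", 1)[0]
    return head + sep + author

# Hypothetical sample; commas inside the quote itself are left alone.
sample = "“Some quote, with a comma.” ― Cassandra Clare, City of Ashes"
print(trim_after_author(sample))
```

Because the split happens only on the part after the dash, commas inside the quote do not cut the result short.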

Here's one solution:

import re

s = 'adflakjd, fkljlkjdf ― Cassandra Clare, City of Ash, adflak'

x = re.findall(r'.*―.*?(?=,)', s)

print(x)

['adflakjd, fkljlkjdf ― Cassandra Clare']
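If a single string (or None when there is no long dash) is more convenient than a list, a variant along these lines might work. It is a sketch rather than the answer's own method: the helper name `quote_and_author` is hypothetical, and the pattern swaps the lookahead for a negated character class, so unlike the version above it also matches when no comma follows the dash.

```python
import re

# Greedy ".*" runs to the last long dash; "[^,]*" then stops before the next comma.
# re.DOTALL lets ".*" cross line breaks in case the quote spans several lines.
PATTERN = re.compile(r".*―[^,]*", re.DOTALL)

def quote_and_author(s):
    """Return everything up to the first comma after the dash, or None without a dash."""
    match = PATTERN.search(s)
    return match.group(0) if match else None

s = 'adflakjd, fkljlkjdf ― Cassandra Clare, City of Ash, adflak'
print(quote_and_author(s))
print(quote_and_author('no long dash, nothing to return'))
```

The first call should give the same string as the list element above, and the second call returns None because the input has no long dash.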
