Regex/Python: Find everything before one symbol, if it's after another symbol

Question

I want to check whether a string contains a long dash ("―") and, if it does, return everything from the start of the string up to the first comma (",") that follows the dash. How would I do this in Python with a regex?

from bs4 import BeautifulSoup
import requests
import json
import pandas as pd

request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'lxml')
# for loop
s = soup.find_all("div", class_="quoteText")[0].text
s = " ".join(s.split()) 
s[:s.index(",")]
s

Raw Output:

“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare, City of Ashes //<![CDATA[ function submitShelfLink(unique_id, book_id, shelf_id, shelf_name, submit_form, exclusive) { var checkbox_id = \'shelf_name_\' + unique_id + \'_\' + shelf_id; var element = document.getElementById(checkbox_id) var checked = element.checked if (checked && exclusive) { // can\'t uncheck a radio by clicking it! return } if(document.getElementById("savingMessage")){ Element.show(\'savingMessage\') } var element_id = \'shelfInDropdownName_\' + unique_id + \'_\' + shelf_id; Element.upda

Desired Output:

“That does it," said Jace. "I\'m going to get you a dictionary for Christmas this year.""Why?" Isabelle said."So you can look up \'fun.\' I\'m not sure you know what it means.” ― Cassandra Clare

Answer 1

I'm not sure I understand it properly, but I think you mean:

import re

example_string = "part to return,example__text"
result = None
if '__' in example_string:
    # Capture everything before the first comma (non-greedy).
    match = re.search(r'(.*?),', example_string)
    if match:
        result = match.group(1)
print(result)

This prints 'part to return'

If you mean the part of the string between the '__' and the ',', I would use:

import re

example_string = "lala__part to return, lala"
# Capture only the text between '__' and the next comma.
match = re.search(r'__(.*?),', example_string)
result = match.group(1) if match else None
print(result)
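
A small sketch of how match groups behave with a pattern like this one, since it decides whether the delimiters end up in the result. The variable names `whole_match` and `captured` are only illustrative. `group(0)` is the entire matched text, while `group(1)` is only what the parentheses captured:

```python
import re

example_string = "lala__part to return, lala"
match = re.search(r"__(.*?),", example_string)

whole_match = match.group(0)  # entire match, including '__' and the comma
captured = match.group(1)     # only the text inside the parentheses

print(repr(whole_match))  # '__part to return,'
print(repr(captured))     # 'part to return'
```

So if the delimiters should not appear in the output, reading `group(1)` is the safer choice.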

Answer 2

from bs4 import BeautifulSoup
from bs4.element import NavigableString
import requests


request = requests.get('https://www.goodreads.com/quotes/tag/fun?page=1')
soup = BeautifulSoup(request.text, 'html.parser')
# for loop
s = soup.find_all("div", class_="quoteText")[0]

text = ''

text += "".join([t.strip() for t in s.contents if type(t) == NavigableString])

for book_or_author_tag in s.find_all("a", class_ = "authorOrTitle"):
    text += "\n" + book_or_author_tag.text.strip()

print(text)

The quote you want is split across the initial quoteText div, but calling .text on it also returns all the CDATA junk you're trying to remove with the regex.

By looping over every child of that div and keeping only the ones that are plain NavigableString objects, we can extract just the text you want. Then tack on the author and book, and hopefully your regex becomes a lot simpler.

Answer 3

Here's one solution:

import re

s = 'adflakjd, fkljlkjdf ― Cassandra Clare, City of Ash, adflak'

# Greedy .* runs to the last dash; lazy .*? then stops just before the next comma.
x = re.findall(r'.*―.*?(?=,)', s)

print(x)

['adflakjd, fkljlkjdf ― Cassandra Clare']
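
As a rough sketch, the same pattern can be wrapped in a helper that also covers the "no dash" case from the question. The function names `cut_with_regex` and `cut_with_partition` are made up for illustration, and the second one is an alternative that avoids regex by splitting at the last dash with `str.rpartition`. The two are only meant to agree on inputs shaped like the sample string.

```python
import re

# Same pattern as above: greedy up to the last dash, then lazy up to the next comma.
PATTERN = re.compile(r".*―.*?(?=,)")


def cut_with_regex(text):
    """Return text up to (not including) the first comma after the dash, or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


def cut_with_partition(text):
    """Regex-free variant: split at the last dash, then cut the tail at its first comma."""
    head, dash, tail = text.rpartition("―")
    if not dash:
        # No dash in the string at all.
        return None
    return head + dash + tail.split(",", 1)[0]


s = 'adflakjd, fkljlkjdf ― Cassandra Clare, City of Ash, adflak'
print(cut_with_regex(s))
print(cut_with_partition(s))
```

One difference to keep in mind: if there is a dash but no comma after it, the regex version returns `None`, while the partition version returns the whole string.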

solution1
1 2018-01-25 16:52:41

solution2
1 2018-01-25 16:57:30

solution3
1 ACCEPTED 2018-01-25 17:07:53
