
String separation by commas, but with a condition (ignore comma separated single word)

With the following code (a bit messy, I admit) I split a string on commas, with one condition: it should not split where the commas only set off single words. For example, it does not split "Yup, there's a reason why you want to hit the sack just minutes after climax", but it splits "The increase in heart rate, which you get from masturbating, is directly beneficial to the circulation, and can reduce the likelihood of a heart attack" into ['The increase in heart rate', 'which you get from masturbating', 'is directly beneficial to the circulation', 'and can reduce the likelihood of a heart attack']

The problem is that the code fails on a string such as: "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow." I don't want a split after oxytocin, only after prolactin. I need a regex that does that.

import re
from textblob import TextBlob


string = str(input_string)  # input_string: the text to split, defined elsewhere

listy = [x.strip() for x in string.split(',')]
listy = [x.replace('\n', '') for x in listy]
# Drop periods that have no digit on either side (decimal points are kept)
listy = [re.sub(r'(?<!\d)\.(?!\d)', '', x) for x in listy]
# Remove empty strings; list() keeps the result indexable in Python 3
listy = list(filter(None, listy))

newstring = []

for segment in listy:

    wc = TextBlob(segment).word_counts

    if listy[-1] != segment:

        if len(wc) > 3:  # len(segment.split(' ')) > 7:
            newstring.append(segment + "&&")  # long segment: mark a split point
        else:
            newstring.append(segment + ",")   # short segment: keep the comma

    else:

        newstring.append(segment)

sep = [x.strip() for x in (' '.join(newstring)).split('&&')]

Consider the following:

import re

mystr = "When men ejaculate, it releases a slew of chemicals including oxytocin, vasopressin, and prolactin, all of which naturally help you hit the pillow."

rExp = r",(?!\s+(?:and\s+)?\w+,)"
mylst = re.compile(rExp).split(mystr)
print(mylst)

This should give the following output:

['When men ejaculate', ' it releases a slew of chemicals including oxytocin, vasopressin, and prolactin', ' all of which naturally help you hit the pillow.']

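Let's look at how the pattern splits the string. Start with:

,(?!\s+\w+,)

This matches every comma that is not followed by whitespace, a single word, and another comma. Here (?! ... ) is a negative lookahead, and \s+\w+, is the whitespace, the word, and the comma.
That pattern still splits after vasopressin, because what follows that comma is " and prolactin," and the word "and" is followed by a space rather than a comma. So add an optional and\s+ inside the lookahead: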

,(?!\s+(?:and\s+)?\w+,)
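
To see the difference between the two patterns, here is a small sketch that runs both against the sentence from the question. The variable names (basic, with_and and so on) are only illustrative and are not part of the original code.

```python
import re

sentence = ("When men ejaculate, it releases a slew of chemicals including "
            "oxytocin, vasopressin, and prolactin, all of which naturally "
            "help you hit the pillow.")

# First attempt: keep a comma only when it is followed by "<space>word,".
basic = re.compile(r",(?!\s+\w+,)")
# Refined: also keep it when an "and" sits in front of that word.
with_and = re.compile(r",(?!\s+(?:and\s+)?\w+,)")

basic_parts = [p.strip() for p in basic.split(sentence)]
and_parts = [p.strip() for p in with_and.split(sentence)]

print(len(basic_parts), basic_parts)  # 4 parts: 'and prolactin' is cut off on its own
print(len(and_parts), and_parts)      # 3 parts: the list of chemicals stays together
```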

You may prefer the variant below, which also allows "or" before the last item:

,(?!\s+(?:(?:and|or)\s+)?\w+,)
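
As a quick check of what the extra alternative buys you, the sketch below compares the and-only pattern with the and|or pattern. The sample sentence is made up for illustration and does not come from the question.

```python
import re

# Hypothetical sample sentence whose list ends with "or" instead of "and".
text = "You can pay by cash, card, or cheque, whichever is easier."

and_only = re.compile(r",(?!\s+(?:and\s+)?\w+,)")
and_or = re.compile(r",(?!\s+(?:(?:and|or)\s+)?\w+,)")

and_only_parts = [p.strip() for p in and_only.split(text)]
and_or_parts = [p.strip() for p in and_or.split(text)]

print(and_only_parts)  # the comma before "or cheque" is still treated as a split point
print(and_or_parts)    # the whole "cash, card, or cheque" list stays in one piece
```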


In essence, consider replacing your line

listy= [x.strip() for x in string.split(',')]

with

listy= [x.strip() for x in re.split(r",(?!\s+(?:and\s+)?\w+,)",string)]
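
If you want to reuse this outside the original script, one option is to wrap it in a small helper. This is only a sketch: the name split_clauses and both sample sentences are assumptions made for illustration. The second call shows a limit of the approach: because the lookahead uses \w+, only single-word list items are protected, so a list containing multi-word items is still split.

```python
import re

# Commas inside a run of single-word list items are kept; other commas split.
CLAUSE_COMMA = re.compile(r",(?!\s+(?:and\s+)?\w+,)")


def split_clauses(text):
    """Split text on clause-level commas, trimming whitespace and dropping empties."""
    return [part.strip() for part in CLAUSE_COMMA.split(text) if part.strip()]


# Single-word list items: the list survives as one segment.
colours = split_clauses(
    "Red, green, and blue, which are the primary colours, mix into white."
)

# Multi-word list items ("heart rate", "blood pressure") are not recognised,
# so the list is broken apart.
vitals = split_clauses(
    "We tracked heart rate, blood pressure, and mood, then averaged them."
)

print(colours)
print(vitals)
```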
