
Split text in a dataframe column based on another column's value with regex and lambda

I have a Pandas dataframe made out of two columns: "headline" and "content".

A "headline" looks like:

"The Disastrous Employment Numbers Show Almost Every Job Is at Risk"

and the corresponding "content" would be:

"SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

Now what I want to do is split the content at the headline so that the junk before it is removed. I chose the headline as the split point because it's the only constant I have in each document.

As you can see, there are no spaces before or after the title where it shows up in the content. Would anyone have an idea how to go about this? I was thinking of regex and a lambda, but I have no idea how to write it.

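The string you supplied contains the symbol • (char code 8226); I'm not sure whether that is intentional or a typo. Either way, to drop everything before the "|" and keep the rest of that line (the match stops at the first line break, so the "nThe Coronavirus Outbreak" fragment stays attached to the headline):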

import re

s = "SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

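# "." does not match a line break by default, so (.+) stops at the first "\n" in s
m = re.search(r'\|(.+)', s)
print(m.group(1))  # group(1) is the captured text after the "|"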

# output
The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak
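
The output above ends at "Outbreak" because "." does not match a line break unless re.DOTALL is set. A minimal sketch of the difference, using a shortened, made-up string rather than the real article text:

```python
import re

# Shortened stand-in for the article text (hypothetical, not the real content)
s = "junk|Headline here\nLatest Updates\nBody text."

# Default behaviour: "." stops at the first newline
first_line = re.search(r"\|(.+)", s).group(1)

# With re.DOTALL, "." also matches newlines, so the match runs to the end of s
everything = re.search(r"\|(.+)", s, flags=re.DOTALL).group(1)

print(repr(first_line))
print(repr(everything))
```

Which of the two you want depends on whether the text after the headline should be kept.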

To remove only the junk before the headline:

import re

# NOTE: s is a raw string (r"..."), so each \n stays a literal backslash followed by "n"
# instead of a line break; with real line breaks, "." in the pattern would stop at the first one
s = r"SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

m = re.search(r'The Disastrous Employment Numbers Show Almost Every Job Is at Risk(.+)', s)
print(m.group())

# output
The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time.
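
Neither snippet above touches the dataframe itself, and cell values would normally hold real line breaks rather than the literal \n of a raw string. As a rough sketch of applying the same idea per row: the toy frame, the helper cut_before and the new column names from_headline and body are all made up for illustration. re.escape guards against regex metacharacters in a headline, and re.DOTALL lets "." cross line breaks.

```python
import re
import pandas as pd

# Hypothetical two-row frame standing in for the real data
df = pd.DataFrame({
    "headline": ["Job Is at Risk", "Rates (Still) Low?"],
    "content": [
        "SectionsSEARCH|Job Is at Risk\nLatest Updates\n"
        "UpshotJob Is at RiskEven if concerns are resolved soon.",
        "Menu|Rates (Still) Low?Body text here.",
    ],
})

def cut_before(headline, content):
    """Return content starting at the first occurrence of headline."""
    m = re.search(re.escape(headline) + r".*", content, flags=re.DOTALL)
    return m.group(0) if m else content  # leave the text alone if there is no match

# Regex version: keep everything from the first occurrence of the headline
df["from_headline"] = df.apply(
    lambda row: cut_before(row["headline"], row["content"]), axis=1
)

# Plain-string alternative: keep only the text after the last occurrence
df["body"] = df.apply(
    lambda row: row["content"].rpartition(row["headline"])[2], axis=1
)

print(df[["from_headline", "body"]])
```

Since the headline shows up twice in the sample content, the rpartition variant (last occurrence) may be closer to "article body only", while the regex variant keeps everything from the first occurrence. Which one fits depends on what the real pages look like.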
