
Split text in column dataframe based on another column value with regex and lambda

I have a Pandas dataframe consisting of two columns: "title" and "content".

A "title" looks like:

"The Disastrous Employment Numbers Show Almost Every Job Is at Risk"

The corresponding "content" would be:

"SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

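What I want to do now is split the content on the title, so that the junk before it disappears. I chose to split on the title because it is the only constant I have in every document.

As you can see, the title appears in the content with no whitespace before or after it. Does anyone know how to go about this? I was thinking of a regex and a lambda, but I don't know how to write it.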

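The string you provided contains the symbol • (character code 8226). I am not sure whether that is correct or a typo, but to remove all the junk before and after the title: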

import re

s = "SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

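# Match everything after the first "|" up to the first real newline
# ("." does not match a newline by default).
m = re.search(r'\|(.+)', s)
# group(1) is the captured text without the leading "|".
print(m.group(1))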

# output
The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak
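The match above ends at "Outbreak" because, in a non-raw string, `\n` is a real newline and `.` does not match newlines by default. A minimal sketch of that behaviour follows, using a shortened, hypothetical stand-in string rather than the full content, and showing how the `re.DOTALL` flag changes the result:

```python
import re

# Shortened, hypothetical stand-in for the scraped content (not the real string).
s = "The Upshot|Some Title\nLatest Updates\nBody text."

# By default "." stops at the first newline.
default_match = re.search(r'\|(.+)', s).group(1)

# With re.DOTALL, "." also matches newlines, so the match runs to the end.
dotall_match = re.search(r'\|(.+)', s, flags=re.DOTALL).group(1)

print(repr(default_match))
print(repr(dotall_match))
```

Whether the shorter or the longer match is wanted depends on the data, so the flag is an option to consider rather than a fix.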

To remove only the junk before the title:

import re

# NOTE: use a raw string (r"...") so that each \n stays as the two literal
# characters backslash and n; "." would not match a real newline.
s = r"SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."  # spaced out or it wont work

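# re.search finds the first occurrence of the title; the greedy (.+) then
# consumes the rest of the string.
m = re.search(r'The Disastrous Employment Numbers Show Almost Every Job Is at Risk(.+)', s)
# group() returns the whole match, title included.
print(m.group())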

# output
The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time.
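The regex above hardcodes one title, while the question asks for a per-row split driven by the "title" column. A possible sketch of that is below. It uses `str.rpartition` instead of a regex, which avoids having to escape regex metacharacters that may occur in titles. The helper name `strip_before_title`, the sample rows, and the choice to keep only what follows the last occurrence of the title are all assumptions for illustration, not part of the original answer.

```python
def strip_before_title(title: str, content: str) -> str:
    """Return the text after the last occurrence of title in content.

    If the title does not occur, the content is returned unchanged.
    """
    before, sep, after = content.rpartition(title)
    return after if sep else content


# Hypothetical, shortened rows standing in for the "title"/"content" columns.
rows = [
    {"title": "Every Job Is at Risk",
     "content": "SectionsSEARCH|Every Job Is at RiskAdvertisementUpshot"
                "Every Job Is at RiskEven if public health concerns can be resolved"},
    {"title": "Missing Title", "content": "No match here."},
]

cleaned = [strip_before_title(r["title"], r["content"]) for r in rows]
print(cleaned)
```

With a DataFrame `df` holding these two columns, the same helper could be applied row by row with `df.apply(lambda row: strip_before_title(row["title"], row["content"]), axis=1)`. If the first occurrence is the right cut point for your data, `str.partition` would be the counterpart to try.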
