Split the text in a dataframe column based on another column's value, using regex and lambda

Question

I have a Pandas dataframe with two columns: "title" and "content".

A "title" looks like this:

"The Disastrous Employment Numbers Show Almost Every Job Is at Risk"

The corresponding "content" would be:

"SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

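What I want to do is split the content on the title so that the junk before it disappears. I chose to split on the title because it is the only constant I have in every document.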

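As you can see, the title appears in the content with no spaces before or after it. Does anyone know how to go about this? I was thinking of regex and a lambda, but I don't know how to write it.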

Answer 1

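The string you provided contains the symbol • (character code 8226); I'm not sure whether that is intended or a typo. To remove the junk before and after the title (this keeps everything after the first | up to the first line break):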

import re

s = "SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."

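# '.' does not match a newline, so the match stops at the first line break
m = re.search(r'\|(.+)', s)
print(m.group(1))  # group(1) is the text after the '|'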

# output
The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak
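
The output above ends at "Outbreak" because, by default, `.` in a Python regex does not match a newline. As a rough illustration, the sketch below runs the same pattern with and without `re.DOTALL`. The `sample` string is a shortened, hypothetical stand-in for the scraped content, not the real text.

```python
import re

# Shortened, hypothetical stand-in for the scraped content
sample = "Today's PaperThe Upshot|Some TitlenThe Coronavirus Outbreak\n•\nLatest Updates"

# Default behaviour: '.' does not match '\n', so the match ends at the first line break
first_line = re.search(r'\|(.+)', sample).group(1)

# With re.DOTALL, '.' also matches '\n', so the match runs to the end of the string
to_end = re.search(r'\|(.+)', sample, flags=re.DOTALL).group(1)

print(first_line)
print(repr(to_end))
```

Whether stopping at the first line break is desirable depends on where the real article text starts, so treat the choice of flag as something to check against your own data.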

To remove only the junk before the title:

import re

# NOTE: a raw string (r"...") keeps each \n as a literal backslash followed by n, so '.+' can match across them
s = r"SectionsSEARCHSkip to contentSkip to site indexToday’s PaperThe Upshot|The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time."  # spaced out or it wont work

m = re.search(r'The Disastrous Employment Numbers Show Almost Every Job Is at Risk(.+)', s)
print(m.group())

# output
The Disastrous Employment Numbers Show Almost Every Job Is at RisknThe Coronavirus Outbreak\n•\nLatest Updates\nMaps and Tracker\nImpact on Workers\nLife at Home\nNewsletter\nAdvertisementContinue reading the main storyUpshotThe Disastrous Employment Numbers Show Almost Every Job Is at RiskEven if public health concerns can be resolved relatively soon, a hole in aggregate demand could persist for some time.
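
To apply the same idea row by row in the dataframe, one possible sketch is shown below. It makes several assumptions: the column names `title` and `content` follow the question, the new `body` column and the shortened sample row are hypothetical, and the article text is taken to start after the last occurrence of the title. Since the title is a plain string, `str.rpartition` is used here instead of a regex. If a regex is preferred, the title should be passed through `re.escape` first so that characters such as `|` or `?` are treated literally.

```python
import pandas as pd

# Hypothetical, shortened sample row; real data would come from the scraped pages
df = pd.DataFrame({
    "title": ["Every Job Is at Risk"],
    "content": [
        "SEARCHThe Upshot|Every Job Is at RisknThe Outbreak\nNewsletter\n"
        "UpshotEvery Job Is at RiskEven if public health concerns can be resolved"
    ],
})

# rpartition splits on the LAST occurrence of the title and returns
# (before, title, after); element [2] is the text after it.
# If the title is not found, element [2] is the unchanged content.
df["body"] = df.apply(
    lambda row: row["content"].rpartition(row["title"])[2],
    axis=1,
)

print(df["body"].iloc[0])
```

To cut at the first occurrence of the title instead, `str.partition` could be swapped in for `str.rpartition`.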

Solution 1
0 2020-05-13 22:30:07
