Remove URLs from a column in a Pandas DataFrame

Question

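I have a small dataframe and am trying to remove the URLs from the end of the strings in the "Links" column. I tried the code below, and it works where the URL is on its own. The problem is that whenever a sentence comes before the URL, the code does not remove it.

Here is the data: https://docs.google.com/spreadsheets/d/10LV8BHgofXKTwG-MqRraj0YWez-1vcwzzTJpRhdWgew/edit?usp=sharing (spreadsheet link)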

import pandas as pd  

df = pd.read_csv('TestData.csv')    

df['Links'] = df['Links'].replace(to_replace=r'^https?:\/\/.*[\r\n]*',value='',regex=True)

df.head()

Thanks!

Answer 1

Try this:

import re

# Keep only the text before the first URL; a raw string avoids invalid-escape warnings
df['cleanLinks'] = df['Links'].apply(lambda x: re.split(r'https?://.*', str(x))[0])

Output:

df['cleanLinks']

    cleanLinks
0   random words to see if it works now 
1   more stuff that doesn't mean anything 
2   one last try please work
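To see why the pattern in the question misses these rows, it may help to compare an anchored and an unanchored pattern on one string. This is only a sketch in plain `re` terms; the sample text and URL are hypothetical, not taken from the asker's spreadsheet.

```python
import re

# Hypothetical row: a sentence followed by a URL
text = "one last try please work https://example.com/x"

# '^' only allows a match at the very start of the string,
# so a URL that follows a sentence is never matched
anchored = re.sub(r"^https?://.*", "", text)

# Without '^' the URL can be matched anywhere in the string
unanchored = re.sub(r"https?://.*", "", text)

print(repr(anchored))
print(repr(unanchored))
```

The unanchored version leaves a trailing space behind, which `.strip()` can remove if that matters.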

Answer 2

Try a cleaner regex:

df['example'] = df['example'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

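Before using a regex in pandas .replace() or anywhere else, you should test the pattern with re.sub() on a single, basic example string. When faced with a big problem, break it down into a small one.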
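A minimal sketch of that kind of check, assuming a made-up sample string (the sentence and URL below are illustrative only):

```python
import re

# Hypothetical sample string, not taken from the asker's data
sample = "random words to see if it works now https://example.com/page?x=1"

# The same two patterns as in the pandas call above, applied in sequence
without_http = re.sub(r"http\S+", "", sample)
cleaned = re.sub(r"www\S+", "", without_http).strip()

print(repr(cleaned))
```

Once the result looks right on a single string, the same patterns can be passed to the pandas call.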

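We can also use the str.replace method:

# regex=True is needed on recent pandas versions, where str.replace treats the pattern as a literal by default
df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)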
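Here is a small self-contained run of the `str.replace` approach, as a sketch only. The rows are invented for illustration, and the whitespace cleanup at the end is an optional extra that the answer itself does not include.

```python
import pandas as pd

# Hypothetical data; the column name follows the answer above
df = pd.DataFrame({
    "status_message": [
        "check this out HTTP://Example.com/a now",
        "visit www.example.org/page today",
        "no link here",
    ]
})

df["status_message"] = (
    df["status_message"]
    # Remove URLs; case=False also catches 'HTTP://' and 'WWW.'
    .str.replace(r"http\S+|www\.\S+", "", case=False, regex=True)
    # Optional: collapse the double spaces left behind and trim the ends
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(df["status_message"].tolist())
```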

Answer 3

For a DataFrame df, URLs can be removed with a cleaner regex as follows:

import pandas as pd

df = pd.read_csv('./data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Replace each URL in the 'text' column with a single space (modifies dataframe in place)
    dataframe['text'] = dataframe['text'].str.replace(
        r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+',
        ' ',
        regex=True,
    )

clean_data(df)
print(df['text'])

Answer 1
5 Accepted 2018-08-23 21:21:37

Answer 2
4 2018-08-23 21:14:01

Answer 3
1 2021-01-21 09:59:25
