Removing URL from a column in Pandas Dataframe

I have a small dataframe and am trying to remove the URL from the end of the string in the Links column. I have tried the following code, and it works on cells where the URL is on its own. The problem is that as soon as there is text before the URL, the code no longer removes it.

Here is the data: https://docs.google.com/spreadsheets/d/10LV8BHgofXKTwG-MqRraj0YWez-1vcwzzTJpRhdWgew/edit?usp=sharing (link to spreadsheet)

import pandas as pd  

df = pd.read_csv('TestData.csv')    

df['Links'] = df['Links'].replace(to_replace=r'^https?:\/\/.*[\r\n]*',value='',regex=True)

df.head()

Thanks!

Try this:

import re
# Keep only the text before the first URL; str(x) guards against non-string cells
df['cleanLinks'] = df['Links'].apply(lambda x: re.split(r'https?://.*', str(x))[0])

Output:

df['cleanLinks']

    cleanLinks
0   random words to see if it works now 
1   more stuff that doesn't mean anything 
2   one last try please work 

Try a cleaner regex:

df['example'] = df['example'].replace(r'http\S+', '', regex=True).replace(r'www\S+', '', regex=True)

Before using a regex in pandas .replace(), or anywhere else for that matter, you should test the pattern using re.sub() on a single basic string example. When faced with a big problem, break it down into a smaller one.
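As a rough illustration of that advice, the sketch below runs both the pattern from the question and the unanchored pattern through re.sub() on one string. The sample string and the example.com URL are made up for the demonstration and are not taken from the spreadsheet.

```python
import re

# Hypothetical sample string (not from the original data).
sample = "random words to see if it works now https://example.com/page?id=1"

# Pattern from the question: the leading ^ anchors the match to the start
# of the string, so a URL that follows other text is never matched.
anchored = re.sub(r'^https?:\/\/.*[\r\n]*', '', sample)

# Unanchored pattern: matches "http" followed by non-whitespace anywhere.
unanchored = re.sub(r'http\S+', '', sample)

print(repr(anchored))    # the string comes back unchanged
print(repr(unanchored))  # the URL is gone, the space before it remains
```

Once the pattern behaves as expected on a single string, it can be passed to .replace() with regex=True.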

Additionally, we could go with the str.replace method:

df['status_message'] = df['status_message'].str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
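A minimal sketch of the str.replace approach on a tiny DataFrame follows. The three rows are invented for illustration, and the extra whitespace cleanup is an addition that is not part of the original answer. regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical rows standing in for the real 'status_message' column.
df = pd.DataFrame({'status_message': [
    'check this out HTTP://example.com/a',
    'visit www.example.org today',
    'no link here',
]})

df['status_message'] = (
    df['status_message']
    # Remove URLs; case=False lets "HTTP" and "WWW" match as well.
    .str.replace(r'http\S+|www\.\S+', '', case=False, regex=True)
    # Assumed extra step: collapse the whitespace left behind by the removal.
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

print(df['status_message'].tolist())
```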

For a DataFrame df, URLs can be removed by using a cleaner regex as follows:

import pandas as pd

df = pd.read_csv('./data-set.csv')
print(df['text'])

def clean_data(dataframe):
    # Replace every URL in the 'text' column with a single space (modifies dataframe in place)
    dataframe['text'] = dataframe['text'].str.replace(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', regex=True)

clean_data(df)
print(df['text'])
