How to remove links from tweets (properly)?
I got this:
import re
from nltk.tokenize import RegexpTokenizer

def non_ascii(s):
    return "".join(i for i in s if ord(i) < 128)

def lower(text):
    return text.lower()

def clean_links(text):
    txt = re.compile('http[s]{1}://[\w+][.]{1}[\w+][.]{1}[\w]{2,3}')
    return txt.sub(r'', text)

def clean_html(text):
    html = re.compile('<.*?>')
    return html.sub(r'', text)

def punct(text):
    token = RegexpTokenizer(r'\w+')
    text = token.tokenize(text)
    text = " ".join(text)
    return text
Then, later, I call these functions like this:
data['cleaned'] = data['tweet'].apply(non_ascii)
data['cleaned'] = data['tweet'].apply(lower)
data['cleaned'] = data['tweet'].apply(clean_links)
data['cleaned'] = data['tweet'].apply(clean_html)
data['cleaned'] = data['tweet'].apply(punct)
The problem is that the links are still in the data['cleaned'] column; I need those pesky links erased!
The original tweets are in data['tweet'].
Please share your way of removing these links.
Links still in the data look like:
https t co OR1IkVzzgO
You have to run the second function (and the following ones) on data['cleaned'], not on data['tweet']:
data['cleaned'] = data['tweet'].apply(non_ascii)
data['cleaned'] = data['cleaned'].apply(lower)
data['cleaned'] = data['cleaned'].apply(clean_links)
data['cleaned'] = data['cleaned'].apply(clean_html)
data['cleaned'] = data['cleaned'].apply(punct)
Or you can chain them:
data['cleaned'] = data['tweet'].apply(non_ascii).apply(lower).apply(clean_links).apply(clean_html).apply(punct)
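As a quick illustration of how chained .apply calls feed each result forward (using hypothetical toy stand-ins, not the real cleaning functions):

```python
import pandas as pd

# Toy stand-ins (hypothetical) for the real cleaning functions, just to
# show that each .apply receives the output of the previous one.
data = pd.DataFrame({'tweet': ['Hello WORLD http://example.com']})
strip_link = lambda s: s.replace('http://example.com', '')

data['cleaned'] = data['tweet'].apply(str.lower).apply(strip_link).apply(str.strip)
print(data['cleaned'][0])  # 'hello world'
```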
Or you can put all the functions in one function and run apply() only once:
def clean(text):
    text = non_ascii(text)
    text = lower(text)
    text = clean_links(text)
    text = clean_html(text)
    text = punct(text)
    return text
data['cleaned'] = data['tweet'].apply(clean)
EDIT:
Instead of text.lower() you can use str.lower(text), and then you don't have to create your own lower() function.
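For example (a minimal sketch with made-up sample text), str.lower is an unbound method, so it can be passed to apply directly:

```python
import pandas as pd

data = pd.DataFrame({'tweet': ['Some TEXT Here']})

# str.lower(text) is equivalent to text.lower(), so the unbound
# method itself can be handed to apply without a wrapper function
data['cleaned'] = data['tweet'].apply(str.lower)
print(data['cleaned'][0])  # 'some text here'
```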
Your regex doesn't match the links, so I used something a little better: 'http(s)?://\w+(\.\w+){1,}(/\w+)*' - but it may not work with more complex links, and you should use the regex suggested in the comments.
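A quick comparison of the two patterns on the problematic link (a minimal sketch using Python's re module):

```python
import re

link = 'line https://t.co/OR1IkVzzgO end'

old = re.compile(r'http[s]{1}://[\w+][.]{1}[\w+][.]{1}[\w]{2,3}')
new = re.compile(r'http(s)?://\w+(\.\w+){1,}(/\w+)*')

# the old pattern demands two dots in the host (like www.example.com),
# so it never matches https://t.co/... and the link survives
print(old.sub('', link))
# the new pattern matches scheme, host and path, so the link is removed
print(new.sub('', link))  # 'line  end'
```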
Stack Overflow doesn't allow https:// t.co/ OR1IkVzzgO in code, so you have to remove the spaces from the link.
Minimal working code with example data:
import re
import nltk.tokenize
import pandas as pd

def non_ascii(s):
    return "".join(i for i in s if ord(i) < 128)

def clean_links(text):
    txt = re.compile(r'http(s)?://\w+(\.\w+){1,}(/\w+)*')
    return txt.sub(r'', text)

def clean_html(text):
    html = re.compile('<.*?>')
    return html.sub(r'', text)

def punct(text):
    token = nltk.tokenize.RegexpTokenizer(r'\w+')
    text = token.tokenize(text)
    text = " ".join(text)
    return text

def clean(text):
    text = non_ascii(text)
    text = str.lower(text)
    text = clean_links(text)
    text = clean_html(text)
    text = punct(text)
    return text

# --- main ---

data = pd.DataFrame({
    'tweet': ['Example https://stackoverflow.com/ . And <tag>other</tag> line https:// t.co/ OR1IkVzzgO. Any question?']
})

data['cleaned'] = data['tweet'].apply(clean)

print(data.to_string())
EDIT:
A more universal version, which takes a list of functions:
def clean(text, *functions):
    for func in functions:
        text = func(text)
    return text

data['cleaned'] = data['tweet'].apply(clean, args=[non_ascii, str.lower, clean_links, clean_html, punct])
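The same loop over functions can also be written as a fold; a minimal sketch with functools.reduce, demonstrated here with plain built-in string methods rather than the tweet cleaners:

```python
import functools

def clean(text, *functions):
    # thread the text through each function in turn, left to right
    return functools.reduce(lambda t, f: f(t), functions, text)

print(clean('  HELLO World  ', str.strip, str.lower))  # 'hello world'
```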