
How to remove links from tweets (proper)?

I got this:

def non_ascii(s):
    return "".join(i for i in s if ord(i)<128)

def lower(text):
    return text.lower()

def clean_links(text):
    txt = re.compile('http[s]{1}://[\w+][.]{1}[\w+][.]{1}[\w]{2,3}')
    return txt.sub(r'', text)

def clean_html(text):
    html = re.compile('<.*?>')
    return html.sub(r'', text)

def punct(text):
    token = RegexpTokenizer(r'\w+')
    text = token.tokenize(text)
    text = " ".join(text)
    return text

Then later I call these functions like this:

data['cleaned'] = data['tweet'].apply(non_ascii)
data['cleaned'] = data['tweet'].apply(lower)
data['cleaned'] = data['tweet'].apply(clean_links)
data['cleaned'] = data['tweet'].apply(clean_html)
data['cleaned'] = data['tweet'].apply(punct)

The problem is that the links are still in the data['cleaned'] column; I need those pesky links erased!

The original tweets are in data['tweet'].

Please share your way of doing this "removing links".

Links still in the data look like:

https t co OR1IkVzzgO

You have to run the second function (and the ones after it) on data['cleaned']:

data['cleaned'] = data['tweet'].apply(non_ascii)
data['cleaned'] = data['cleaned'].apply(lower)
data['cleaned'] = data['cleaned'].apply(clean_links)
data['cleaned'] = data['cleaned'].apply(clean_html)
data['cleaned'] = data['cleaned'].apply(punct)

OR you can chain it:

data['cleaned'] = data['tweet'].apply(non_ascii).apply(lower).apply(clean_links).apply(clean_html).apply(punct)

OR you can put all the functions into one function and run apply() only once:

def clean(text):
    text = non_ascii(text)
    text = lower(text)
    text = clean_links(text)
    text = clean_html(text)
    text = punct(text)
    return text

data['cleaned'] = data['tweet'].apply(clean)

EDIT:

Instead of text.lower() you can use str.lower(text), so you don't have to create your own lower() function.
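A quick sketch of that equivalence (plain Python, no DataFrame needed):

```python
text = "Remove LINKS From TWEETS"

# str.lower(text) is the unbound form of text.lower(), so the bare
# name str.lower can be handed to Series.apply without a wrapper.
assert str.lower(text) == text.lower()
print(str.lower(text))  # remove links from tweets
```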

Your regex doesn't match the links, so I used something a little better: r'http(s)?://\w+(\.\w+){1,}(/\w+)*' - but it may not work with more complex links, and you should use the regex suggested in the comments.

Stack Overflow doesn't allow https:// t.co/ OR1IkVzzgO (written here with extra spaces) in code, so you have to remove the spaces from the link.
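A hedged alternative for the regex (a common broader pattern, not the one from the comments): match the scheme plus every following non-whitespace character, which also catches short t.co links:

```python
import re

# Broad pattern: "http" or "https", then "://", then any run of
# non-whitespace. Also matches short links like the t.co form above.
link_re = re.compile(r'https?://\S+')

text = "line with a link https://t.co/OR1IkVzzgO in it"
print(link_re.sub('', text))
```

The trade-off is that it strips everything up to the next space, including trailing punctuation glued to the URL.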


Minimal working code with example data:

import re
import nltk.tokenize
import pandas as pd

def non_ascii(s):
    return "".join(i for i in s if ord(i)<128)

def clean_links(text):
    txt = re.compile(r'http(s)?://\w+(\.\w+){1,}(/\w+)*')
    return txt.sub(r'', text)

def clean_html(text):
    html = re.compile('<.*?>')
    return html.sub(r'', text)

def punct(text):
    token = nltk.tokenize.RegexpTokenizer(r'\w+')
    text = token.tokenize(text)
    text = " ".join(text)
    return text

def clean(text):
    text = non_ascii(text)
    text = str.lower(text)
    text = clean_links(text)
    text = clean_html(text)
    text = punct(text)
    return text

# --- main ---

data = pd.DataFrame({
    'tweet': ['Example https://stackoverflow.com/ . And <tag>other</tag> line https:// t.co/ OR1IkVzzgO. Any question?']
})

data['cleaned'] = data['tweet'].apply(clean)

print(data.to_string())

EDIT:

A more universal version which takes a list of functions:

def clean(text, *functions):
    for func in functions:
        text = func(text)
    return text

data['cleaned'] = data['tweet'].apply(clean, args=(non_ascii, str.lower, clean_links, clean_html, punct))
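The same folding of the text through a list of functions can also be written with functools.reduce (just a sketch, equivalent to the loop above):

```python
import functools

def clean(text, *functions):
    # fold text through each function, left to right
    return functools.reduce(lambda acc, func: func(acc), functions, text)

print(clean("  HeLLo  ", str.strip, str.lower))  # hello
```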
