How to remove digits of a certain length from text?

I want to clean my text by removing digit sequences of a certain length, so I am trying to define a rule for it. I thought isdigit would be a good fit, but when I use it, it discards every all-digit token in the text. In my test data, the last 10 digits of each URL contribute nothing to the text, so they can be removed. Here is what I tried:

import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here
    make_me.append([x for x in lst if not x.isdigit() and not x == ""])

df = pd.DataFrame(make_me, columns=cols)
df

res=[]
for i in df.c4: 
    lst=i.split("-") 
    res.append([''.join(x) for x in lst if not x.isdigit()])

My attempt discarded all the digits in the text, including the year 2018. I simply want this kind of output:

tax march donald trump protest
list 2018 oscar nominations

How should I write the rule to get this output? Any ideas?

Assuming all the URLs you want to extract from have the same format, use a regular expression:

import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
news = []
# greedy (.*) captures everything after /news/ up to the last hyphen
regex = re.compile(r'/news/(.*)-')
for url in urls:
    extract_id = regex.search(url)
    if extract_id:
        data = extract_id.group(1)
        news.append(data.replace('-',' '))

print(news)

Output:

['tax march donald trump protest', 'list 2018 oscar nominations']

Edited the format to suit the question.

A pure Python way of doing this without additional modules looks like this:

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

for x in urls:
    print(' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]))

# tax march donald trump protest
# list 2018 oscar nominations

If you need the output as a list, use a list comprehension:

[' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]) for x in urls]
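The one-liner above drops the last hyphen-separated token unconditionally. If some slugs might not end in an ID, a slightly more defensive sketch is to drop only tokens that are all digits and exactly the expected length. The names ID_LENGTH and slug_words, and the third URL, are made up for illustration; the code also assumes every URL ends with a trailing slash, as in the question.

```python
ID_LENGTH = 10  # assumption: the IDs to remove are exactly 10 digits long

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/',
        'variety.com/2018/film/news/no-id-here/']  # hypothetical URL without an ID


def slug_words(url, id_length=ID_LENGTH):
    """Return the last path element with hyphens replaced by spaces,
    skipping any token that is all digits and exactly id_length long."""
    slug = url.rsplit('/', 2)[-2]  # relies on the trailing slash
    kept = [t for t in slug.split('-')
            if not (t.isdigit() and len(t) == id_length)]
    return ' '.join(kept)


titles = [slug_words(u) for u in urls]
for title in titles:
    print(title)
```

Because the rule checks the token length, shorter numbers such as 2018 are kept.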

There are many approaches to this. Use .rfind('-') to get the index of the rightmost '-' and then slice your string. After that you can process the string further.
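A minimal sketch of the .rfind('-') idea, assuming the ID is always the part after the last hyphen of the final path element. The helper name strip_trailing_id is made up for this example.

```python
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']


def strip_trailing_id(url):
    """Cut the last path element at its rightmost hyphen."""
    slug = url.rstrip('/').rsplit('/', 1)[-1]  # last path element
    cut = slug.rfind('-')                      # -1 if there is no hyphen
    if cut != -1:
        slug = slug[:cut]                      # drop the hyphen and what follows
    return slug.replace('-', ' ')


titles = [strip_trailing_id(u) for u in urls]
print(titles)
```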

In this case you have a very specific rule that helps: remove the trailing 10-digit ID, together with the hyphen in front of it, from the last interesting element. Adding lst[-2] = lst[-2][:-11] right before the make_me.append call would do the trick (10 digits plus one hyphen is 11 characters).
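As a sketch of that slicing rule inside the question's loop: the 10-digit ID plus the hyphen before it is 11 characters, so the slice below drops 11. ID_LENGTH and titles are names introduced here for illustration, and pandas is left out so the example is self-contained.

```python
ID_LENGTH = 10  # assumption: every slug ends with a hyphen and a 10-digit ID

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

make_me = []
for url in urls:
    lst = url.split("/")
    # lst[-1] is the empty string after the trailing slash, so lst[-2] is the slug
    lst[-2] = lst[-2][:-(ID_LENGTH + 1)]  # drop the ID and its leading hyphen
    make_me.append([x for x in lst if not x.isdigit() and not x == ""])

# the slug is the last element of each row
titles = [row[-1].replace("-", " ") for row in make_me]
print(titles)
```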

If you do want to do it with a regex, I'd use the end-of-string anchor, $, to make sure the digits are at the end. After importing re, it would look like lst = re.sub('[0-9]{10}/$','',url). This reads as:

re.sub is the substitution function in the regular expressions module. It replaces the matches of the regex in the first parameter with the content of the second parameter; the third parameter is the string in which you want to make the substitution.

The regex I wrote matches "a sequence of 10 characters, each one of 0123456789, followed by a / and the end of the string".
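A short sketch of how this substitution behaves on the question's URLs. Note that the pattern as written leaves the hyphen in front of the ID in place; the variant with a leading '-' is an adjustment made here, not part of the answer above, and the names TRAILING_ID, cleaned and titles are illustrative.

```python
import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

# Pattern as given above: removes the 10 digits and the final slash,
# but the hyphen before the digits stays.
original_result = re.sub('[0-9]{10}/$', '', urls[0])
print(original_result)

# Variant (an assumption of this sketch): also consume the leading hyphen.
TRAILING_ID = re.compile(r'-[0-9]{10}/$')

cleaned = [TRAILING_ID.sub('', url) for url in urls]
titles = [c.rsplit('/', 1)[-1].replace('-', ' ') for c in cleaned]
print(titles)
```

Because the pattern requires exactly 10 digits right before the final slash, shorter numbers such as 2018 in the middle of the slug are left alone.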
