How to remove digits of a certain length from text?

Question

I want to clean up my text by removing digit sequences of a certain length, so I need to define a rule for that. I thought isdigit would be a good fit, but using it discards all the digits in the text. In my test data, the last 10 digits contribute nothing to the text, so I can remove them. Here is what I tried:

import pandas as pd

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

cols = ['c1', 'c2', 'c3', 'c4']
make_me = []
for url in urls:
    lst = url.split("/")
    # your business rules go here
    make_me.append([x for x in lst if not x.isdigit() and not x == ""])

df = pd.DataFrame(make_me, columns=cols)
df

res = []
for i in df.c4:
    lst = i.split("-")
    res.append([''.join(x) for x in lst if not x.isdigit()])

My attempt discarded every digit in the text, including the 2018 that I want to keep. I simply want this kind of output:

tax march donald trump protest
list 2018 oscar nominations

How should I write the rule to get this output? Any ideas?

Answer 1

Assuming all the URLs you want to process have the same format, use a regular expression:

import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']
news = []
regex = re.compile(r'/news/(.*)-')
for url in urls:
    extract_id = regex.search(url)
    if extract_id:
        data = extract_id.group(1)
        news.append(data.replace('-',' '))

print(news)

Output

['tax march donald trump protest', 'list 2018 oscar nominations']

Edited format to suit the question.

Answer 2

A pure Python way of doing this, without additional modules, looks like this:

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

for x in urls:
    print(' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]))

# tax march donald trump protest
# list 2018 oscar nominations

If you need the output as a list, use a list comprehension:

[' '.join(x.rsplit('/', 2)[-2].split('-')[:-1]) for x in urls]

Answer 3

There can be many approaches to this. Use .rfind('-') to get the index of the rightmost '-' and then slice your string. After that you can process the string further.
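
A minimal sketch of the .rfind('-') idea, assuming each URL ends with a slash and that the part to drop is everything after the last hyphen of the final path segment (the names slug, cut and titles are only illustrative):

```python
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

titles = []
for url in urls:
    # Last path segment, e.g. 'tax-march-donald-trump-protest-1202031487'
    slug = url.rstrip('/').rsplit('/', 1)[-1]
    # Index of the rightmost hyphen; rfind returns -1 if there is none
    cut = slug.rfind('-')
    if cut != -1:
        slug = slug[:cut]          # drop the hyphen and everything after it
    titles.append(slug.replace('-', ' '))

print(titles)
```

Note that this drops whatever follows the last hyphen, whether or not it is a 10-digit number, so it only suits URLs that always carry the trailing ID.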

Answer 4

In this case you have a very specific rule that would help you: just remove the trailing 10-digit ID, together with the hyphen in front of it, from the last interesting element. Because the URL ends with a slash, that element is lst[-2], so lst[-2] = lst[-2][:-11] right before the make_me.append call would do the trick.
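
As a rough sketch, here is that slicing rule dropped into the loop from the question (the DataFrame step is left out). It assumes every URL ends with a hyphen, a 10-digit ID and a trailing slash, so 11 characters are cut; ID_LEN and titles are names introduced only for this illustration:

```python
urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

ID_LEN = 10  # assumed fixed length of the trailing numeric ID

make_me = []
for url in urls:
    lst = url.split("/")
    # lst[-1] is '' because of the trailing slash, so the slug is lst[-2];
    # cut the ID plus the hyphen in front of it
    lst[-2] = lst[-2][:-(ID_LEN + 1)]
    make_me.append([x for x in lst if not x.isdigit() and x != ""])

# The last kept element of each row is the cleaned slug
titles = [row[-1].replace("-", " ") for row in make_me]
print(titles)
```

The slice is blind to content: if a URL lacks the ID, it will still cut 11 characters, so a length or isdigit check may be worth adding for less uniform data.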

If you do want to do it with a regex, I'd use the end-of-string anchor, $, to make sure the digits are at the end. It would look like url = re.sub('[0-9]{10}/$', '', url), placed before the split and after importing re, of course. This reads as:

re.sub is the substitution function in the regular expressions module: it replaces every match of the regex given as the first parameter with the content of the second parameter; the third parameter is the string in which you want to make the substitution.

The regex I wrote matches "a sequence of 10 characters, each one of 0123456789, followed by a / and the end of the string".
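
A short sketch of the re.sub approach end to end. The pattern here is a slight variation on the one above, not the answer's exact regex: it adds an optional leading hyphen (-?) so that no dangling '-' is left behind. The names pattern and cleaned are illustrative:

```python
import re

urls = ['variety.com/2017/biz/news/tax-march-donald-trump-protest-1202031487/',
        'variety.com/2018/film/news/list-2018-oscar-nominations-1202668757/']

# Optional hyphen, exactly ten digits, a slash, then the end of the string
pattern = re.compile(r'-?[0-9]{10}/$')

cleaned = []
for url in urls:
    trimmed = pattern.sub('', url)          # strip the trailing ID and slash
    slug = trimmed.rsplit('/', 1)[-1]       # keep only the last path segment
    cleaned.append(slug.replace('-', ' '))

print(cleaned)
```

Because the pattern requires exactly ten digits at the very end, shorter numbers inside the slug, such as 2018, are left alone.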
