Create custom whitespace tag based on number of spaces for NLP pre-processing

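To avoid being wrongly flagged as a duplicate (though I will happily be proven wrong if my Googling missed something), I have done some research of my own on handling whitespace, and here is what I have found so far.

Most of what I could find online addresses either (1) finding whitespace and replacing it with something static, or (2) ways of quantifying the whitespace in a given string as a whole rather than run by run.

What is hard to find is how to slide along a string, stop on reaching a run of whitespace, and replace that part of the string with a value that depends on the size of that run.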
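
For illustration, the scanning step described above can be sketched with `re.finditer`, which walks the string and reports each run of spaces together with its position and length. The helper name `whitespace_runs` and its `min_len` parameter are made up for this sketch and are not part of the original post.

```python
import re

def whitespace_runs(text: str, min_len: int = 2):
    """Yield (start_index, run_length) for each run of at least min_len spaces."""
    # Hypothetical helper: builds a pattern such as " {2,}" and scans left to right.
    pattern = re.compile(r" {%d,}" % min_len)
    for match in pattern.finditer(text):
        yield match.start(), len(match.group())

sample = "a  b    c d"
print(list(whitespace_runs(sample)))  # [(1, 2), (4, 4)]
```

The single space between `c` and `d` is not reported because it is shorter than `min_len`.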

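My question:

I am doing some NLP work, and my data often has varying amounts of whitespace between values (and sometimes at the start of a line).

For example: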

field_header          field_value    field_code\n

.      Sometimes   there are gaps      at the beginning too.

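The data also contains some standard text with single spaces between words:

There are standard sentences which are embedded in the documents as well.\n

I want to replace every run of whitespace longer than a single space, so that my documents look like this: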

field_header WS_10 field_value WS_4 field_code\n

. WS_6 Sometimes WS_3 there are gaps WS_6 at the beginning too.

There are standard sentences which are embedded in the documents as well.\n

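Here WS_n is a token representing the number of spaces between words (n >= 2), padded with a single space on either side.

I tried using a regex to find the whitespace and .count() to count the spaces separately, but that obviously does not work. I know how to use re.sub, but I do not see how to make the replacement depend on what the regex actually matched.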

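import re

s = 'Some part      of my     text file   with irregular     spacing.\n'
pattern = r' {2,}'

# ??? is a placeholder for the matched run of spaces, which I don't know how to reference here
substitution = ' WS_' + str(???.count(' ')) + ' '

re.sub(pattern, substitution, s)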

If the example above did what it should, I would get:

'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

Without regex:

s1 = 'Some part      of my     text file   with irregular     spacing.\n'
s2 = '          Some part      of my     text file   with irregular     spacing.\n'

def fix_sentence(sentence: str) -> str:
    # True until the first word is seen; a leading run of n spaces yields n empty strings
    ws_1st_char = True
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                # between words, a run of n spaces yields only n - 1 empty strings
                z = count if ws_1st_char else count + 1
                new_sentence = new_sentence + f'WS_{z} '
            new_sentence = new_sentence + f'{x} '
            count = 0
            ws_1st_char = False
        else:
            count += 1
    return new_sentence.rstrip(' ')

fixed1 = fix_sentence(s1)
fixed2 = fix_sentence(s2)

print(repr(fixed1))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

print(repr(fixed2))
# 'WS_10 Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

If sentences never start with whitespace, the function can be simplified:

def fix_sentence(sentence: str) -> str:
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                new_sentence = new_sentence + f'WS_{count + 1} '
            new_sentence = new_sentence + f'{x} '
            count = 0
        else:
            count+=1
    return new_sentence.rstrip(' ')
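
As a possible alternative that also avoids regex, the same idea can be sketched with `itertools.groupby`, which splits the string into alternating runs of spaces and non-spaces so that leading runs need no special flag. The function name `tag_spaces` is made up for this sketch; it assumes only the space character counts as whitespace and that single leading or trailing spaces may be dropped.

```python
from itertools import groupby

def tag_spaces(sentence: str) -> str:
    """Replace each run of two or more spaces with a WS_n token (no regex)."""
    tokens = []
    # groupby yields alternating runs of space and non-space characters
    for is_space, run in groupby(sentence, key=lambda ch: ch == " "):
        chunk = "".join(run)
        if not is_space:
            tokens.append(chunk)
        elif len(chunk) >= 2:
            tokens.append(f"WS_{len(chunk)}")
        # a single space is only a separator; the join below restores it
    return " ".join(tokens)

s2 = "          Some part      of my     text file   with irregular     spacing.\n"
print(repr(tag_spaces(s2)))
# 'WS_10 Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'
```

Because each run's length is read directly from the grouped characters, no off-by-one adjustment is needed for runs in the middle of the sentence.
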

import re

def replace_whitespace(string):
    while True:
        # look for the next run of two or more spaces
        whitespace = re.search(r" {2,}", string)
        if whitespace:
            # splice in a token carrying the length of this particular run
            token = f" WS_{len(whitespace.group())} "
            string = string[:whitespace.start()] + token + string[whitespace.end():]
        else:
            break
    return string


string = "Some part      of my     text file   with irregular     spacing.\n"
print(replace_whitespace(string))

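This function finds each run of whitespace and replaces it with a string built for that run. The repl (replacement) argument of re.sub cannot itself be a regex pattern, so here the replacement value is computed inside the loop each time a run is found; a regex on its own cannot compute the length of what it matched. (re.sub does also accept a function as repl, which receives the match object.)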

Output:
Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n
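
For comparison, here is a minimal sketch that passes a function as the `repl` argument of `re.sub`, so the replacement is computed from each match without an explicit loop. The name `tag_whitespace` is made up for this sketch, and it assumes that only the space character should be counted (tabs and newlines are left untouched).

```python
import re

def tag_whitespace(text: str) -> str:
    """Replace each run of two or more spaces with ' WS_n ', n being the run length."""
    # re.sub calls the function once per match and inserts whatever it returns
    return re.sub(r" {2,}", lambda m: f" WS_{len(m.group())} ", text)

s = "Some part      of my     text file   with irregular     spacing.\n"
print(repr(tag_whitespace(s)))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'
```

A run at the very start of a line becomes ' WS_n ' with a leading space; stripping that space afterwards is one option if it is unwanted.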
