Create custom whitespace tag based on number of spaces for NLP pre-processing

To avoid being incorrectly flagged as a duplicate (though if I have missed something in my Google searches, I will happily be proven wrong), I have done some research of my own. This is what I have found so far regarding handling whitespace:

A lot of what I could find on the web seems to be geared towards (1) finding whitespace and replacing it with something static, or (2) quantifying the whitespace in a given string as a whole, not in chunks.

What has been difficult to find is how to slide along a string, stop when a run of whitespace is reached, and replace that run with a value that depends on how long the run is.

My question:

I am doing some NLP work and my data often has varying amounts of whitespace between values (and sometimes at the very beginning of the line)

e.g.:

field_header          field_value    field_code\n

.      Sometimes   there are gaps      at the beginning too.

The data also contains some standard text with single spaces in between:

There are standard sentences which are embedded in the documents as well.\n

I want to replace every run of whitespace longer than a single space, so that my document looks something like this:

field_header WS_10 field_value WS_4 field_code\n

. WS_6 Sometimes WS_3 there are gaps WS_6 at the beginning too.

There are standard sentences which are embedded in the documents as well.\n

Here WS_n is a token that represents the number of whitespace characters (n >= 2) between two words, and it is padded by a single space on either side.

I tried to find the whitespace using regex and separately count the number of whitespaces using .count() - but that obviously doesn't work. I know how to use re.sub, but as far as I can tell it doesn't allow for substitutions which depend on what is picked up by the regex.

import re

s = 'Some part      of my     text file   with irregular     spacing.\n'
pattern = r'\ {2,}'

# ??? stands for the matched whitespace, which is the part I don't know how to get
substitution = ' WS_'+str(???.count(' '))+' '

re.sub(pattern, substitution, s)

If the above example did what it was supposed to, I'd get back:

'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

Without regular expressions:

s1 = 'Some part      of my     text file   with irregular     spacing.\n'
s2 = '          Some part      of my     text file   with irregular     spacing.\n'

def fix_sentence(sentence: str) -> str:
    ws_1st_char = True  # True until the first word is seen; leading runs are counted differently
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                # split(' ') yields one empty string per leading space, but one
                # fewer than the run length for a run between two words.
                z = count if ws_1st_char else count + 1
                if z >= 2:  # a single leading space is not tagged
                    new_sentence = new_sentence + f'WS_{z} '
            new_sentence = new_sentence + f'{x} '
            count = 0
            ws_1st_char = False
        else:
            count += 1
    return new_sentence.rstrip(' ')

fixed1 = fix_sentence(s1)
fixed2 = fix_sentence(s2)

print(fixed1)
# Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.

print(fixed2)
# WS_10 Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.
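The "slide along the string and stop at each run" idea can also be sketched with `itertools.groupby`, which groups consecutive characters that share a key. This is only an illustrative variant, not part of the original answer; the function name `tag_spaces` is made up here, and it assumes that only the space character (not tabs or newlines) counts as whitespace. Unlike the loop above, it pads a leading token with a space on both sides.

```python
from itertools import groupby


def tag_spaces(sentence: str) -> str:
    """Replace each run of two or more spaces with a ' WS_n ' token."""
    parts = []
    # groupby yields consecutive runs of space / non-space characters.
    for is_space, run in groupby(sentence, key=lambda ch: ch == " "):
        chunk = "".join(run)
        if is_space and len(chunk) >= 2:
            # The token depends on the length of this particular run.
            parts.append(f" WS_{len(chunk)} ")
        else:
            # Words and single spaces pass through unchanged.
            parts.append(chunk)
    return "".join(parts)


s1 = "Some part      of my     text file   with irregular     spacing.\n"
print(tag_spaces(s1), end="")
# Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.
```

Because every character is copied into exactly one chunk, trailing spaces and newlines are preserved rather than stripped.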

If there is never whitespace at the beginning of the sentence:

def fix_sentence(sentence: str) -> str:
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                new_sentence = new_sentence + f'WS_{count + 1} '
            new_sentence = new_sentence + f'{x} '
            count = 0
        else:
            count+=1
    return new_sentence.rstrip(' ')

import re

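def replace_whitespace(string):
    # Repeatedly find the next run of two or more spaces and replace it
    # with a token that records the length of the run.
    while True:
        match = re.search(r" {2,}", string)
        if match is None:
            break
        token = f" WS_{len(match.group())} "
        # Splice by position so that leading runs are handled too and
        # every iteration removes exactly one run (the loop always ends).
        string = string[:match.start()] + token + string[match.end():]
    return string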


string = "Some part      of my     text file   with irregular     spacing.\n"
print(replace_whitespace(string))

This function finds each run of whitespace and replaces it with a token built from the length of that run. A replacement string passed to re.sub is fixed text (plus backreferences) and cannot express the length of a match, so the token is computed in the loop for each run as it is found.

Output:
Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.
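Worth noting: `re.sub` also accepts a function as its `repl` argument. The function is called once per match with the match object, and its return value is used as the replacement, so the token can depend on the length of each run without an explicit loop. The sketch below assumes that only runs of the space character should be tagged (as in the question's pattern); the name `tag_whitespace` is made up for illustration.

```python
import re


def tag_whitespace(text: str) -> str:
    """Replace each run of two or more spaces with a ' WS_n ' token."""
    # re.sub calls the lambda for every match; m.group() is the matched
    # run of spaces, so its length gives n.
    return re.sub(r" {2,}", lambda m: f" WS_{len(m.group())} ", text)


s = "Some part      of my     text file   with irregular     spacing.\n"
print(tag_whitespace(s), end="")
# Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.
```

A run at the very start of a line becomes ` WS_n ` with a leading space; call `.lstrip(' ')` on the result if that padding is unwanted. Since newlines are not matched, the same call works on a whole multi-line document.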
