
Create custom whitespace tag based on number of spaces for NLP pre-processing

To avoid being incorrectly flagged as a duplicate (though if I have missed something in my Google searches, I will happily be proven wrong), here is what I have found so far about handling whitespace:

A lot of what I could find on the web seems to be geared towards (1) finding whitespace and replacing it with something static, or (2) quantifying the whitespace in a given string in total, not in chunks.
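For reference, those two patterns typically look something like the sketch below. The sample string is made up for illustration; neither approach says anything about how long each individual run of spaces is.

```python
import re

# Made-up sample: a run of 2 spaces and a run of 5 spaces.
s = "a" + " " * 2 + "b" + " " * 5 + "c"

# (1) Replace every run of two or more spaces with a fixed string.
collapsed = re.sub(r" {2,}", " ", s)

# (2) Count all spaces in the string, with no notion of separate runs.
total_spaces = s.count(" ")

print(collapsed)     # a b c
print(total_spaces)  # 7
```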

What has been difficult to find is how to slide along a string, stop when a section of whitespace has been reached, and replace that section of the string with a variable which depends on how large that whitespace is.

My question:

I am doing some NLP work, and my data often has discrete amounts of whitespace between values (and sometimes at the very beginning of the line).

e.g.:

field_header          field_value    field_code\n

.      Sometimes   there are gaps      at the beginning too.

The data also contains some standard text with single spaces in between:

There are standard sentences which are embedded in the documents as well.\n

I want to replace all whitespace that is larger than a single space so my document now looks something like this:

field_header WS_10 field_value WS_4 field_code\n

. WS_6 Sometimes WS_3 there are gaps WS_6 at the beginning too.

There are standard sentences which are embedded in the documents as well.\n

Where WS_n is a token which represents the amount (n >= 2) of whitespace between each word and is padded by a space on either side.

I tried to find the whitespace using regex and separately count the number of spaces using .count(), but that obviously doesn't work. I know how to use re.sub, but as far as I can tell it doesn't allow for substitutions that depend on what the regex matched.

s = 'Some part      of my     text file   with irregular     spacing.\n'
pattern = r' {2,}'

# ??? should be the text matched by the pattern, which is the part I can't work out
substitution = ' WS_'+str(???.count(' '))+' '

re.sub(pattern, substitution, s)

If the above example did what it was supposed to, I'd get back:

'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

Without regular expressions:

s1 = 'Some part      of my     text file   with irregular     spacing.\n'
s2 = '          Some part      of my     text file   with irregular     spacing.\n'

def fix_sentence(sentence: str) -> str:
    ws_1st_char = True  # used to properly count whitespace at the beginning of the sentence
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                if ws_1st_char: z = count
                else: z = count + 1
                new_sentence = new_sentence + f'WS_{z} '
            new_sentence = new_sentence + f'{x} '
            count = 0
            ws_1st_char = False
        else:
            count+=1
    return new_sentence.rstrip(' ')

fixed1 = fix_sentence(s1)
fixed2 = fix_sentence(s2)

print(repr(fixed1))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'

print(repr(fixed2))
# 'WS_10 Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'
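Another way to stay away from regular expressions is to group consecutive characters into runs with itertools.groupby. This is only a sketch: the name tag_whitespace is made up for illustration, and it assumes that only plain spaces (not tabs or newlines) should be counted.

```python
from itertools import groupby

def tag_whitespace(sentence: str) -> str:
    """Replace each run of 2+ spaces with a ' WS_n ' token (hypothetical helper)."""
    parts = []
    # groupby yields consecutive runs of space / non-space characters.
    for is_space, group in groupby(sentence, key=lambda ch: ch == " "):
        run = "".join(group)
        if is_space and len(run) >= 2:
            # n is the exact length of this run of spaces.
            parts.append(f" WS_{len(run)} ")
        else:
            # Single spaces and ordinary text are kept unchanged.
            parts.append(run)
    return "".join(parts)

sample = ("Some part" + " " * 6 + "of my" + " " * 5 + "text file"
          + " " * 3 + "with irregular" + " " * 5 + "spacing.\n")
print(repr(tag_whitespace(sample)))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'
```

Unlike fix_sentence, this sketch keeps the padding space in front of a leading token (ten leading spaces become ' WS_10 ') and also tags runs at the end of the line.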

If there is never whitespace at the beginning of the sentence:

def fix_sentence(sentence: str) -> str:
    count, new_sentence = 0, ''
    for x in sentence.split(' '):
        if x != '':
            if count != 0:
                new_sentence = new_sentence + f'WS_{count + 1} '
            new_sentence = new_sentence + f'{x} '
            count = 0
        else:
            count+=1
    return new_sentence.rstrip(' ')

import re

def replace_whitespace(string):
    while True:
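        # Search for runs of spaces only, so a newline is never counted.
        whitespace = re.search(r" {2,}", string)
        if whitespace:
            # Replace only the matched span, which also handles a run at the
            # very start or end of the line.
            token = f" WS_{len(whitespace.group())} "
            string = string[:whitespace.start()] + token + string[whitespace.end():]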
        else:
            break
    return string


string = "Some part      of my     text file   with irregular     spacing.\n"
print(replace_whitespace(string))

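This function finds each run of whitespace and replaces it with a token built from the length of that run. A replacement string given to re.sub can refer back to captured groups, but it cannot compute anything from them, such as a length, so the token is built inside the loop as each run is found. Note that repl may also be a function, which is called with each match object and returns the replacement text.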

Output:
Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.
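re.sub accepts a function as its repl argument: the function is called with each match object and returns the replacement text, so the length of each run can be computed in a single pass without a loop. A minimal sketch, assuming only runs of plain spaces should be tagged (the name tag_whitespace_re is made up for illustration):

```python
import re

def tag_whitespace_re(text: str) -> str:
    """Replace each run of 2+ spaces with a ' WS_n ' token (hypothetical helper)."""
    # The lambda receives the match object for each run; m.group() is the run itself.
    return re.sub(r" {2,}", lambda m: f" WS_{len(m.group())} ", text)

sample = ("Some part" + " " * 6 + "of my" + " " * 5 + "text file"
          + " " * 3 + "with irregular" + " " * 5 + "spacing.\n")
print(repr(tag_whitespace_re(sample)))
# 'Some part WS_6 of my WS_5 text file WS_3 with irregular WS_5 spacing.\n'
```

Runs at the start of a line are handled the same way, so ten leading spaces come out as ' WS_10 ' followed by the first word.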
