简体   繁体   中英

Deleting large continuous blocks of lines separated by empty lines

I have a large dataset, composed of text in the following form.

Vil [SENT]2 [POS]AUX [NUM]4 [DEP]aux O
du [SENT]2 [POS]PRON [NUM]4 [DEP]nsubj O
gerne [SENT]2 [POS]ADV [NUM]4 [DEP]advmod O
arbejde [SENT]2 [POS]VERB [NUM]0 [DEP]root O
med [SENT]2 [POS]ADP [NUM]9 [DEP]case O
et [SENT]2 [POS]DET [NUM]9 [DEP]det O
globalt [SENT]2 [POS]ADV [NUM]8 [DEP]advmod O
anerkendt [SENT]2 [POS]VERB [NUM]9 [DEP]amod O
brand [SENT]2 [POS]NOUN [NUM]4 [DEP]obl O
? [SENT]2 [POS]PUNCT [NUM]4 [DEP]punct O
XXX. [SENT]3 [POS]ADP [NUM]2 [DEP]case O
XXX [SENT]3 [POS]PROPN [NUM]0 [DEP]root O
XXX [SENT]3 [POS]NOUN [NUM]2 [DEP]flat O
, [SENT]3 [POS]PUNCT [NUM]2 [DEP]punct O
XXX [SENT]3 [POS]PROPN [NUM]2 [DEP]flat O
. [SENT]3 [POS]PUNCT [NUM]2 [DEP]punct O

Deltidsjob [SENT]4 [POS]NOUN [NUM]0 [DEP]root O
i [SENT]4 [POS]ADP [NUM]3 [DEP]case O
XXX [SENT]4 [POS]PROPN [NUM]1 [DEP]nmod O
XXX [SENT]4 [POS]NOUN [NUM]1 [DEP]nmod O
XXX [SENT]4 [POS]ADJ [NUM]6 [DEP]amod O
XXX [SENT]4 [POS]PROPN [NUM]1 [DEP]nmod O
. [SENT]4 [POS]PUNCT [NUM]1 [DEP]punct O

I am using python and would like to delete "line blocks" that are longer than a certain threshold. That is, the length from one empty line to the next crosses the threshold. I thought about iterating over lines, keeping a count and keeping track with a while loop, but can anyone come up with a more elegant solution?

sample = """Vil [SENT]2 [POS]AUX [NUM]4 [DEP]aux O
du [SENT]2 [POS]PRON [NUM]4 [DEP]nsubj O
gerne [SENT]2 [POS]ADV [NUM]4 [DEP]advmod O
arbejde [SENT]2 [POS]VERB [NUM]0 [DEP]root O
med [SENT]2 [POS]ADP [NUM]9 [DEP]case O
et [SENT]2 [POS]DET [NUM]9 [DEP]det O
globalt [SENT]2 [POS]ADV [NUM]8 [DEP]advmod O
anerkendt [SENT]2 [POS]VERB [NUM]9 [DEP]amod O
brand [SENT]2 [POS]NOUN [NUM]4 [DEP]obl O
? [SENT]2 [POS]PUNCT [NUM]4 [DEP]punct O
XXX. [SENT]3 [POS]ADP [NUM]2 [DEP]case O
XXX [SENT]3 [POS]PROPN [NUM]0 [DEP]root O
XXX [SENT]3 [POS]NOUN [NUM]2 [DEP]flat O
, [SENT]3 [POS]PUNCT [NUM]2 [DEP]punct O
XXX [SENT]3 [POS]PROPN [NUM]2 [DEP]flat O
. [SENT]3 [POS]PUNCT [NUM]2 [DEP]punct O

Deltidsjob [SENT]4 [POS]NOUN [NUM]0 [DEP]root O
i [SENT]4 [POS]ADP [NUM]3 [DEP]case O
XXX [SENT]4 [POS]PROPN [NUM]1 [DEP]nmod O
XXX [SENT]4 [POS]NOUN [NUM]1 [DEP]nmod O
XXX [SENT]4 [POS]ADJ [NUM]6 [DEP]amod O
XXX [SENT]4 [POS]PROPN [NUM]1 [DEP]nmod O
. [SENT]4 [POS]PUNCT [NUM]1 [DEP]punct O"""

samples = sample.split('\n\n')
threshold = 300

remaining_samples = []

for s in samples:
    if len(s) < threshold:
        remaining_samples.append(s)

print(remaining_samples)

I would do this in two steps, first you find the position of the empty lines, then extracting blocks that are smaller than you threshold:

cleaned_lines = []
empty_lines = [i for i, line in enumerate(lines) if not line.strip()]
for start, end in zip(empty_lines[:-1], empty_lines[1:]):
    if end-start < THRESHOLD:
        cleaned_lines.extend(lines[start + 1:end])

I assume that your file starts and end with an empty line a delimiter for a block, but if that is not the case it is as easy as adding -1 at the beginning of empty_lines and len(lines) at the end.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM