Split string into list of two words, repeating the last word

I need to split a string into a list of two-word pairs, where the last word of each pair is repeated as the first word of the next pair. Here is what I tried, based on examples I found in other questions:

line = """Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua."""

def split_line(in_line):
    line_sp = in_line.split(" ")
    line_two = [" ".join(line_sp[i:i + 2]) for i in range(0, len(line_sp), 2)]
    return line_two

print(split_line(line))

This results in:

['Lorem ipsum', 'dolor sit', 'amet, consectetur', 'adipiscing elit,', 'sed do', 'eiusmod tempor', 'incididunt ut', 'labore et', 'dolore magna', 'aliqua.']

But what I actually need is this:

['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet', 'amet, consectetur', 'consectetur adipiscing', ...]

How can I make it work? Thanks!

You can use zip on the following two slices of words:

words = line.split()
print(list(map(' '.join, zip(words[:-1], words[1:]))))

This outputs:

['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,', 'amet, consectetur', 'consectetur adipiscing', 'adipiscing elit,', 'elit, sed', 'sed do', 'do eiusmod', 'eiusmod tempor', 'tempor incididunt', 'incididunt ut', 'ut labore', 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
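To see why the two slices line up, it may help to run the same idea on a short input. The sketch below wraps it in a helper called word_pairs (a hypothetical name, not part of the answer above) and prints the intermediate tuples that zip produces:

```python
def word_pairs(text):
    """Return every pair of adjacent words in text as 'first second' strings."""
    words = text.split()
    # words[:-1] drops the last word and words[1:] drops the first,
    # so zip pairs each word with the word that follows it.
    return [" ".join(pair) for pair in zip(words[:-1], words[1:])]


sample = "Lorem ipsum dolor sit"
sample_words = sample.split()
print(list(zip(sample_words[:-1], sample_words[1:])))
# [('Lorem', 'ipsum'), ('ipsum', 'dolor'), ('dolor', 'sit')]
print(word_pairs(sample))
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit']
```

With fewer than two words there is nothing to pair, so the helper returns an empty list.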

A simple for loop works too:

l = line.split(' ')
result = []
for i in range(len(l) - 1):
    result.append(l[i] + ' ' + l[i+1])
print(result)
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,', 'amet, consectetur', 'consectetur adipiscing', 'adipiscing elit,', 'elit, sed', 'sed do', 'do eiusmod', 'eiusmod tempor', 'tempor incididunt', 'incididunt ut', 'ut labore', 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']

What you are looking for is nltk.bigrams(). It yields each pair as a tuple, so join the tuples if you need strings:

import nltk
bigrm = [" ".join(pair) for pair in nltk.bigrams(line.split())]

You can start by constructing a list of the words in the line:

words = line.split()

Then you can build a list of lists containing consecutive pairs by slicing:

pairs = [words[i:i + 2] for i in range(len(words))]

Finally, you can take each pair and join it with ' ', skipping the last slice, which holds only one word:

result = [" ".join(pair) for pair in pairs if len(pair) > 1]
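Put together, the three steps might look like the sketch below. The function name adjacent_pairs is made up for illustration. The short input shows that the last slice contains only one word, which is why the len(pair) > 1 filter is needed:

```python
def adjacent_pairs(text):
    """Combine the three steps: split, slice into pairs, join."""
    words = text.split()
    # The slice starting at the last index has only one element.
    pairs = [words[i:i + 2] for i in range(len(words))]
    # Keep only full pairs and join each one with a space.
    return [" ".join(pair) for pair in pairs if len(pair) > 1]


demo_words = "a b c".split()
print([demo_words[i:i + 2] for i in range(len(demo_words))])
# [['a', 'b'], ['b', 'c'], ['c']]
print(adjacent_pairs("a b c"))
# ['a b', 'b c']
```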

You can try something like the following. I don't know Python syntax, so I am answering in Java; you may be able to convert it to Python:

String line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.";
String[] split = line.split(" ");
String[] line_two = new String[split.length - 1];

for (int i = 1; i < split.length; i++) {
    line_two[i - 1] = split[i - 1] + " " + split[i];
}

You can use a lazy generator with zip:

def split_line(in_line):
    line_sp = in_line.split()
    yield from map(' '.join, zip(line_sp, line_sp[1:]))

print(list(split_line(line)))

['Lorem ipsum', 'ipsum dolor', 'dolor sit', 'sit amet,',
 ...
 'labore et', 'et dolore', 'dolore magna', 'magna aliqua.']
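On Python 3.10 or later, itertools.pairwise gives the same adjacent pairs without building a second slice of the word list. This is a variation on the answer above, not the answerer's code. The function name iter_word_pairs is an assumption, and the fallback for older Python versions follows the well-known tee-based recipe:

```python
from itertools import tee

try:
    from itertools import pairwise  # available since Python 3.10
except ImportError:
    def pairwise(iterable):
        """Fallback: yield (s0, s1), (s1, s2), ... for older Pythons."""
        first, second = tee(iterable)
        next(second, None)  # advance the second iterator by one item
        return zip(first, second)


def iter_word_pairs(text):
    """Lazily yield 'first second' strings for adjacent words in text."""
    yield from map(" ".join, pairwise(text.split()))


print(list(iter_word_pairs("Lorem ipsum dolor sit")))
# ['Lorem ipsum', 'ipsum dolor', 'dolor sit']
```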

You can try it with regex, too:

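import re

rslt = [" ".join(tup) for tup in re.findall(r"(\w+)\W+(?=(\w+))", line)]

\w+ matches one or more word characters;

(\w+) captures the matched word;

\W+ matches one or more non-word characters;

(?=(\w+)) is a lookahead: it captures the next word without consuming it, so the next match can start at that word.

Note that \w does not match punctuation, so the commas and the final period are dropped from the result.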
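A small run on part of the sentence shows how the lookahead lets consecutive matches share a word, and also that punctuation is not kept. The names PAIR_PATTERN and regex_word_pairs are introduced here only for illustration:

```python
import re

# Same pattern as in the answer: a word, a separator, then a lookahead
# that captures the following word without consuming it.
PAIR_PATTERN = re.compile(r"(\w+)\W+(?=(\w+))")


def regex_word_pairs(text):
    """Return adjacent word pairs found by the regex, joined with a space."""
    return [" ".join(pair) for pair in PAIR_PATTERN.findall(text)]


print(regex_word_pairs("sit amet, consectetur adipiscing"))
# ['sit amet', 'amet consectetur', 'consectetur adipiscing']
```

Because \w matches only letters, digits, and underscores, words containing apostrophes or hyphens would be split into separate pieces. The zip-based versions may be a better fit when the text must be kept exactly as written.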
