
Python - grouping words into sets of 3

I'm trying to build a list of three-element groups, one per word in a string: the preceding word (blank if the word is at the beginning of the string), the word itself, and the following word (blank if the word is at the end of the string).

I've tried the following code:

def parse_group_words(text):
    groups = []
    words = re_sub("[^\w]", " ",  text).split()
    number_words = len(words)
    for i in xrange(number_words):
        print i
        if i == 0:
            groups[i][0] = ""
            groups[i][1] = words[i]
            groups[i][2] = words[i+1]
        if i > 0 and i != number_words:
            groups[i][0] = words[i-1]
            groups[i][1] = words[i]
            groups[i][2] = words[i+1]
        if i == number_words:
            groups[i][0] = words[i-1]
            groups[i][1] = words[i]
            groups[i][2] = ""            
    print groups

print parse_group_words("this is an example of text are you ready")

But I'm getting:

0

Traceback (most recent call last):
  File "/home/akf/program.py", line 82, in <module>
    print parse_group_words("this is an example of text are you ready")
  File "/home/akf/program.py", line 69, in parse_group_words
    groups[i][0] = ""
IndexError: list index out of range

Any idea how to fix this?

Here's a generic way to do it for windows of arbitrary size, using Python's collections and itertools modules (the sequence is padded with one blank at each end):

import re
import collections
import itertools

def window(seq, n=3):
    d = collections.deque(maxlen=n)
    for x in itertools.chain(('', ), seq, ('', )):
        d.append(x)
        if len(d) >= n:
            yield tuple(d)

def windows(text, n=3):
    return list(window((x.group() for x in re.finditer(r'\w+', text)), n=n))
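As a quick check of what this produces, here is a small Python 3 sketch run on a shortened version of the question's sentence. The two functions are repeated only so the snippet runs on its own; the sample strings are illustrative. Note that the groups come back as tuples rather than lists, and that an empty input yields no groups at all.

```python
import collections
import itertools
import re

def window(seq, n=3):
    # Sliding window over seq, padded with one blank at each end.
    d = collections.deque(maxlen=n)
    for x in itertools.chain(('',), seq, ('',)):
        d.append(x)
        if len(d) >= n:
            yield tuple(d)

def windows(text, n=3):
    # Words are maximal runs of \w characters.
    return list(window((m.group() for m in re.finditer(r'\w+', text)), n=n))

for group in windows("this is an example"):
    print(group)
# ('', 'this', 'is')
# ('this', 'is', 'an')
# ('is', 'an', 'example')
# ('an', 'example', '')
```

If lists are needed instead of tuples, wrapping each result in list() should be enough.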

What about...:

import itertools, re

def parse_group_words(text):
    groups = []
    words = re.finditer(r'\w+', text)
    prv, cur, nxt = itertools.tee(words, 3)
    next(cur); next(nxt); next(nxt)
    for previous, current, thenext in itertools.izip(prv, cur, nxt):
        # in Py 3, use `zip` in lieu of itertools.izip
        groups.append([previous.group(0), current.group(0), thenext.group(0)])
    print(groups)

parse_group_words('tanto va la gatta al lardo che ci lascia')

This is almost what you require -- it emits:

[['tanto', 'va', 'la'], ['va', 'la', 'gatta'], ['la', 'gatta', 'al'], ['gatta', 'al', 'lardo'], ['al', 'lardo', 'che'], ['lardo', 'che', 'ci'], ['che', 'ci', 'lascia']]

...missing the first required group ['', 'tanto', 'va'] and the last one, ['ci', 'lascia', ''].

To fix it, just before the print, you could add:

groups.insert(0, ['', groups[0][0], groups[0][1]])
groups.append([groups[-1][1], groups[-1][2], ''])

This feels like a mildly unsavory hack; I can't easily find an elegant way to have such edge groups "just emerge" from the general logic of the rest of the function.
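One way to let the blank-padded groups fall out of the general logic, sketched here for Python 3 and assuming the whole word list fits in memory, is to pad the word list with an empty string at each end and zip three offset slices of it. Building each group whole and collecting it also sidesteps the IndexError from the question, which comes from assigning to groups[i][0] while groups is still an empty list. The name group_words_padded is made up for this illustration.

```python
import re

def group_words_padded(text):
    # Hypothetical helper: returns [previous, word, next] for every word,
    # with '' standing in for the missing neighbour at either end.
    words = re.findall(r'\w+', text)
    padded = [''] + words + ['']
    # zip stops at the shortest slice, so there is exactly one group per word.
    return [list(group) for group in zip(padded, padded[1:], padded[2:])]

groups = group_words_padded("this is an example of text are you ready")
print(groups[0])    # ['', 'this', 'is']
print(groups[-1])   # ['you', 'ready', '']
```

The trade-off against the tee-based version is that this copies the word list a few times instead of streaming it, which is unlikely to matter for sentence-sized input.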


 