Python! Finding pairs depending on maximum distance from words in list

I am writing a program that analyzes words in text files. After some grueling code, I have been able to parse all the words in the text file and append them to a list. I have now hit a bump: for every word, I am supposed to find the pairs of words whose indices are no more than a maximum distance apart. Here is the input and the list of strings I was able to get:

dist_max = int(input('Enter the maximum distance between words ==> '))

list_for_pairs = ['station', 'apple', 'chivalry', 'mansion', 'bear', \
                  'website', 'vest', 'amazing', 'mansion', 'apple', 'card', \
                  'station', 'card', 'book', 'same', 'tree', 'honor', \
                  'leaf', 'trace', 'tractor', 'bucket', 'bread', 'pears', 'book', \
                  'tractor', 'mouse', 'mansion', 'scratch', 'matter', 'trace']

In this case, the maximum distance should be 2. For example, the word 'amazing' should pair up with 'website', 'vest', 'mansion', and 'apple', because those are the words at most 2 positions away from it in the list. This is also an example output.

The pairs must be in alphabetical order, and only the first 5 and last 5 should be printed, along with the total number of pairs. Finally, my code:

pair_list = []
for i in range(len(list_for_pairs)+1):
    range_pos = int(range(0, dist_max)) # This is the range for the maximum distance
    # between words in the positive (+) direction
    range_neg = int(range(0, dist_max, -1))# This is the range for the maximum distance
    # between words in the negative (-) direction
    pair_list.append('({} {})'.format(list_for_pairs[i], list_for_pairs[range_pos]))
    pair_list.append('({} {})'.format(list_for_pairs[i], list_for_pairs[range_neg]))

It's not much, but basically, I want to make a list to put all the pairs in, which will make the length part easier, and I need to make sure I don't add anything if the maximum distance is out of the list range. Any tips are appreciated, thank you in advance!

Use:

pair_list = []
for i in range(len(list_for_pairs)):
    # Words up to dist_max positions before i (empty range when i == 0)
    for j in range(max(0, i - dist_max), i):
        pair_list.append('({} {})'.format(list_for_pairs[i], list_for_pairs[j]))
    # Words up to dist_max positions after i (empty range at the end of the list)
    for j in range(i + 1, min(len(list_for_pairs), i + dist_max + 1)):
        pair_list.append('({} {})'.format(list_for_pairs[i], list_for_pairs[j]))

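For each i, j covers two ranges, skipping any index that falls outside the list: first i - dist_max to i - 1, then i + 1 to i + dist_max. With dist_max = 2, that is i - 2 to i - 1 and i + 1 to i + 2.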
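To see which indices those two ranges cover, including at the ends of the list, here is a small sketch. The helper name neighbor_indices is made up for illustration and is not part of the answer's code.

```python
def neighbor_indices(i, n, dist_max):
    """Return the indices within dist_max of i (excluding i), clamped to 0..n-1."""
    before = range(max(0, i - dist_max), i)          # i - dist_max .. i - 1
    after = range(i + 1, min(n, i + dist_max + 1))   # i + 1 .. i + dist_max
    return list(before) + list(after)

print(neighbor_indices(0, 30, 2))   # start of the list: nothing before index 0
print(neighbor_indices(7, 30, 2))   # middle of the list: two on each side
print(neighbor_indices(29, 30, 2))  # end of the list: nothing after index 29
```

For a 30-word list, index 7 (the position of 'amazing' in the question) gets the neighbors 5, 6, 8 and 9, while indices 0 and 29 only get neighbors on one side.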

You could have a nested for loop that is an offset from the current index plus and minus dist_max. Then just make sure the offset isn't 0 and would be in bounds.

pair_list = []
for i, word in enumerate(list_for_pairs):
    for offset in range(-dist_max, dist_max+1):
        if offset and 0 <= i + offset < len(list_for_pairs): # Ignore when offset is 0 or would be out of bounds
            otherword = list_for_pairs[i + offset]
            pair_list.append((word, otherword))

print(pair_list)
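As a quick check against the example in the question, the same loop can be wrapped in a function and filtered for 'amazing'. This is only a sketch: the function name pairs_within is hypothetical, and the list is shortened to the first ten words of the question's list.

```python
# First ten words of the question's list, enough to cover the 'amazing' example.
words = ['station', 'apple', 'chivalry', 'mansion', 'bear',
         'website', 'vest', 'amazing', 'mansion', 'apple']
dist_max = 2

def pairs_within(words, dist_max):
    """Collect (word, other) for every other word at most dist_max positions away."""
    result = []
    for i, word in enumerate(words):
        for offset in range(-dist_max, dist_max + 1):
            # Skip the word itself (offset 0) and positions outside the list.
            if offset and 0 <= i + offset < len(words):
                result.append((word, words[i + offset]))
    return result

amazing_partners = [other for word, other in pairs_within(words, dist_max)
                    if word == 'amazing']
print(amazing_partners)  # ['website', 'vest', 'mansion', 'apple']
```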

This constructs the whole list of pairs. Note that I use a set to eliminate the duplicates.


pairs = set()
for i in range(len(list_for_pairs)):
    for j in range(-dist_max,dist_max+1):
        if not j:
            continue
        if 0 <= i+j < len(list_for_pairs):
            w1, w2 = list_for_pairs[i], list_for_pairs[i+j]
            if w1 > w2:
                w2,w1 = w1,w2
            pairs.add( (w1,w2) )
pairs = sorted(list(pairs))
#print(pairs)
print(len(pairs), "distinct pairs")
for i in range(5):
    print( pairs[i][0], pairs[i][1])
print("...")
for i in range(-5,0):
    print( pairs[i][0], pairs[i][1])

Output:

C:\tmp>python x.py  
Enter the maximum distance between words ==> 2
54 distinct pairs   
apples bakery       
apples basket       
apples bike         
apples truck        
bakery basket       
...                 
puppy weather       
safety vest         
scratch trash       
track truck         
vest whistle        
                    
C:\tmp>             
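One caveat with the printing loops above: indexing pairs[i] for i in range(5) assumes there are at least 5 pairs, and with fewer than 10 the first and last groups overlap. A possible way to guard against that is sketched below. The helper summarize_pairs and its edge parameter are assumptions for illustration, not part of the original answer.

```python
def summarize_pairs(pairs, edge=5):
    """Return the lines to print: every pair if they fit, else the first and last `edge`."""
    lines = ["{} distinct pairs".format(len(pairs))]
    if len(pairs) <= 2 * edge:
        # Short list: show everything once, with no "..." separator.
        lines += ["{} {}".format(a, b) for a, b in pairs]
    else:
        lines += ["{} {}".format(a, b) for a, b in pairs[:edge]]
        lines.append("...")
        lines += ["{} {}".format(a, b) for a, b in pairs[-edge:]]
    return lines

demo = sorted({('apple', 'bear'), ('bear', 'card'), ('apple', 'card')})
print("\n".join(summarize_pairs(demo)))
```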

You don't need to search before and behind, since the pairs are added alphabetically irrespective of order. In your list, replicated below, notice that there is no need to analyze 'weather + challenge' and 'challenge + weather' twice.

list_for_pairs = ['weather', 'puppy', 'challenge', 'house', 'whistle', \
                  'nation', 'vest', 'safety', 'house', 'puppy', 'card', \
                  'weather', 'card', 'bike', 'equality', 'justice', 'pride', \
                  'orange', 'track', 'truck', 'basket', 'bakery', 'apples', 'bike', \
                  'truck', 'horse', 'house', 'scratch', 'matter', 'trash']
dist_max = 2
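One way to convince yourself that looking backward is enough is to compute the pairs both ways and compare. This is a sketch with hypothetical function names, run on the first six words of the list above.

```python
def pairs_both_directions(words, dist_max):
    """Alphabetized pairs found by scanning before and after each index."""
    found = set()
    for i in range(len(words)):
        for j in range(max(0, i - dist_max), min(len(words), i + dist_max + 1)):
            if j != i:
                found.add(tuple(sorted((words[i], words[j]))))
    return found

def pairs_backward_only(words, dist_max):
    """Alphabetized pairs found by scanning only the positions before each index."""
    found = set()
    for i in range(len(words)):
        for j in range(max(0, i - dist_max), i):
            found.add(tuple(sorted((words[i], words[j]))))
    return found

sample = ['weather', 'puppy', 'challenge', 'house', 'whistle', 'nation']
print(pairs_both_directions(sample, 2) == pairs_backward_only(sample, 2))  # True
print(len(pairs_backward_only(sample, 2)))  # 9
```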

If your list does not contain duplicates, you don't need a set to avoid duplication. All you need to do is avoid adding each pair twice. A simple implementation would look like this:

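pairs = []
for i in range(1, len(list_for_pairs)):
    # Look back at most dist_max positions without running past the start of the list
    for j in range(max(0, i - dist_max), i):
        pair = list_for_pairs[i], list_for_pairs[j]
        if pair[1] < pair[0]:
            pair = pair[::-1]  # keep each pair in alphabetical order
        pairs.append(pair)
pairs.sort()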

This is well suited to a comprehension (here, a generator expression passed to sorted), especially if you use sorted instead of manually swapping the pair:

pairs = sorted(sorted([list_for_pairs[i], list_for_pairs[j]])
               for i in range(1, len(list_for_pairs))
               for j in range(max(0, i - dist_max), i))

You can replace [list_for_pairs[i], list_for_pairs[j]] with list_for_pairs[j:i+1:i-j]. In my opinion it looks prettier, though I'm not sure there's any other advantage to doing that:

pairs = sorted(sorted(list_for_pairs[j:i+1:i-j])
               for i in range(1, len(list_for_pairs))
               for j in range(max(0, i - dist_max), i))
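The slice works because it starts at j, stops just after i, and steps by i - j, so it picks up exactly the two elements at positions j and i. A small sketch on a made-up list of letters checks this for every index pair at most 2 apart:

```python
letters = ['a', 'b', 'c', 'd', 'e']
for i in range(1, len(letters)):
    for j in range(max(0, i - 2), i):
        # Start at j, stop after i, step i - j: only positions j and i are selected.
        assert letters[j:i + 1:i - j] == [letters[j], letters[i]]

print(letters[1:4:2])  # ['b', 'd']
```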

Since in practice your list does contain duplicates, you can use a set to aggregate the result. Lists are not hashable, so each sorted pair is converted to a tuple first. Since sets are unordered, you can sort the result after the fact:

pairs = sorted(set(tuple(sorted(list_for_pairs[j:i+1:i-j]))
                   for i in range(1, len(list_for_pairs))
                   for j in range(max(0, i - dist_max), i)))

As a fun corollary, you can also use itertools.groupby to remove duplicates once the list has been sorted:

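from itertools import groupby

pairs = sorted(sorted(list_for_pairs[j:i+1:i-j])
               for i in range(1, len(list_for_pairs))
               for j in range(max(0, i - dist_max), i))
# Equal pairs are adjacent after sorting, so groupby yields each distinct pair once
pairs = [k for k, g in groupby(pairs)]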

Notice that you can write that last one as a one-liner too, but I think it's too long to be easily legible.
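As a sanity check that the set-based and groupby-based deduplication agree, here is a sketch on a small made-up list with repeated words. The pairs are built as tuples so that they can go into a set, and the variable names are illustrative only.

```python
from itertools import groupby

words = ['card', 'weather', 'bike', 'card', 'weather', 'bike']
dist_max = 2

# Every alphabetized pair of words at most dist_max positions apart, duplicates included.
sorted_pairs = sorted(
    tuple(sorted((words[i], words[j])))
    for i in range(1, len(words))
    for j in range(max(0, i - dist_max), i)
)

via_set = sorted(set(sorted_pairs))                      # dedupe, then sort again
via_groupby = [key for key, _ in groupby(sorted_pairs)]  # dedupe adjacent equal pairs

print(via_set == via_groupby)  # True
print(via_groupby)  # [('bike', 'card'), ('bike', 'weather'), ('card', 'weather')]
```

Note that groupby only merges adjacent equal items, so it relies on the list having been sorted first.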
