Replace two consecutive items in the list with one item

To analyze a text, we convert it into a list P1 of words. We then apply bigram collocation methods and get a list X of word pairs (ai, bi) such that ai and bi occur next to each other in P1 fairly often. In Python 3, how can I build a list P2 from P1 in which every two adjacent items ai and bi whose pair (ai, bi) is in X are replaced by the single item ai_bi? My ultimate goal is to prepare the text as a list of words for analysis with Word2Vec. I have my own code and it works, but I think it will be slow on large texts.

import nltk
from nltk.collocations import BigramCollocationFinder
import gensim
bigram_measures = nltk.collocations.BigramAssocMeasures()
sentences=["Total internal reflection ! is the;phenomenon",
"Abrasive flow machining :is an ? ( interior surface finishing process)",
"Technical Data[of Electrical Discharge wire cutting and] Cutting Machine",
"The greenhouse effect. is the process by which, radiation from a {planet atmosphere warms }the planet surface",
"Absolute zero!is the lowest limit ;of the thermodynamic temperature scale:",
"The term greenhouse effect ?is mentioned (a lot)",
"[An interesting] effect known as total internal reflection.",
"effect on impact energies ,Electrical discharge wire cutting of ADI",
"{Absolute zero represents} the coldest possible temperature",
"total internal reflection at an air water interface",
"What is Electrical Discharge wire cutting Machining and how does it work",
"Colder than Absolute Zero",
"A Mathematical Model for  Electrical Discharge Wire Cutting Machine Parameters"]
P1=[]
for f in sentences:
    f1=gensim.utils.simple_preprocess (f.lower())
    P1.extend(f1)
print("First 100 items from P1")
print(P1[:100])
#  bigram
finder = BigramCollocationFinder.from_words(P1)
# filter only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return up to 10000 bigrams, ranked by PMI (highest first)
X=finder.nbest(bigram_measures.pmi, 10000)
print()
print("Number of bigrams= ",len(X))
print("10 first bigrams with the highest PMI")
print(X[:10])
# replace adjacent ai, bi in P1 with ai_bi whenever (ai,bi) is in X
P2=[]
n=len(P1)
i=0
while i<n:
    P2.append(P1[i])
    if i<n-1:    # P1[i+1] must exist; i<n-2 would never check the last pair
        for c in X:
            if c[0]==P1[i] and c[1]==P1[i+1]:
                P2[len(P2)-1]=c[0]+"_"+c[1]
                i+=1    # skip second item of couple from X  
                break
    i+=1
print()
print( "first 50 items from P2 - results")
print(P2[:50])

I guess you are looking for something like this.

P2 = []
prev = P1[0]
for this in P1[1:]:
    P2.append(prev + "_" + this)
    prev = this

This implements a simple sliding window in which each token is joined to the token that follows it with an underscore. Note that it joins every adjacent pair in P1, not only the pairs listed in X.
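
For the case described in the question, where only pairs from X should be merged, a minimal sketch might look like the following. It is not taken from the original posts: the function name merge_bigrams, its sep parameter, and the toy lists demo_tokens and demo_pairs are hypothetical names introduced for illustration. The idea is to convert X to a set once, so that each pair lookup is a constant-time membership test rather than a scan over all of X for every token.

```python
def merge_bigrams(tokens, bigrams, sep="_"):
    """Join adjacent tokens whose pair appears in `bigrams` into one token.

    Hypothetical helper: scans left to right and merges greedily, so in
    a, b, c with both (a, b) and (b, c) listed, only a_b is produced.
    """
    pairs = set(bigrams)  # set lookup instead of looping over every pair
    merged = []
    i = 0
    n = len(tokens)
    while i < n:
        if i + 1 < n and (tokens[i], tokens[i + 1]) in pairs:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Toy data standing in for P1 and X (assumed values, not real output)
demo_tokens = ["the", "greenhouse", "effect", "is", "near", "absolute", "zero"]
demo_pairs = [("greenhouse", "effect"), ("absolute", "zero")]
demo_merged = merge_bigrams(demo_tokens, demo_pairs)
print(demo_merged)
```

With the lists from the question, the call would be P2 = merge_bigrams(P1, X). The work is then roughly proportional to len(P1) plus len(X), whereas the nested loop in the question can compare each token against every pair in X.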
