Replace two consecutive items in a list with one item

Question

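To analyze a text, we convert it into a list of words, P1. We then apply a bigram method and obtain a list X of word pairs (ai, bi) such that ai and bi occur one after the other in P1 many times. How can I, in Python 3, build from P1 a list P2 in which every two items ai and bi that are adjacent in P1, with (ai, bi) in X, are replaced by a single element ai_bi? My end goal is to prepare the text as a list of words for analysis with Word2Vec. I have my own code and it works, but I think it will be slow on large texts.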

import gensim
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
sentences=["Total internal reflection ! is the;phenomenon",
"Abrasive flow machining :is an ? ( interior surface finishing process)",
"Technical Data[of Electrical Discharge wire cutting and] Cutting Machine",
"The greenhouse effect. is the process by which, radiation from a {planet atmosphere warms }the planet surface",
"Absolute zero!is the lowest limit ;of the thermodynamic temperature scale:",
"The term greenhouse effect ?is mentioned (a lot)",
"[An interesting] effect known as total internal reflection.",
"effect on impact energies ,Electrical discharge wire cutting of ADI",
"{Absolute zero represents} the coldest possible temperature",
"total internal reflection at an air water interface",
"What is Electrical Discharge wire cutting Machining and how does it work",
"Colder than Absolute Zero",
"A Mathematical Model for  Electrical Discharge Wire Cutting Machine Parameters"]
P1=[]
for f in sentences:
    f1=gensim.utils.simple_preprocess (f.lower())
    P1.extend(f1)
print("First 100 items from P1")
print(P1[:100])
#  bigram
finder = BigramCollocationFinder.from_words(P1)
# filter only bigrams that appear 2+ times
finder.apply_freq_filter(2)
# return the bigrams ranked by PMI, highest first (at most 10000)
X=finder.nbest(bigram_measures.pmi, 10000)
print()
print("Number of bigrams= ",len(X))
print("10 first bigrams with the highest PMI")
print(X[:10])
# replace ai and bi that are adjacent in P1, with (ai, bi) in X, by ai_bi
P2=[]
n=len(P1)
i=0
while i<n:
    P2.append(P1[i])
    if i<n-1:    # a next item exists, so (P1[i], P1[i+1]) is a valid pair
        for c in X:
            if c[0]==P1[i] and c[1]==P1[i+1]:
                P2[-1]=c[0]+"_"+c[1]
                i+=1    # skip the second item of the pair from X
                break
    i+=1
print()
print( "first 50 items from P2 - results")
print(P2[:50])

Answer 1

I think you are looking for something like this.

P2 = []
prev = P1[0]
for this in P1[1:]:
    P2.append(prev + "_" + this)
    prev = this

This implements a simple sliding window in which the previous token is glued to the current token.
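
Note that a sliding window like this joins every adjacent pair, whether or not the pair is in X. If the goal is to merge only the pairs listed in X, as the question asks, one possible sketch is to turn X into a set so that each lookup takes constant time on average instead of a scan over the whole list. This assumes X is a list of 2-tuples of strings, as returned by finder.nbest; the function name merge_bigrams and the small sample data are made up for illustration.

```python
def merge_bigrams(words, bigrams):
    """Join adjacent words whose pair is in `bigrams`, scanning left to right.

    `merge_bigrams` is a hypothetical helper name, not part of nltk or gensim.
    """
    pairs = set(bigrams)  # average O(1) membership test instead of looping over X
    merged = []
    i = 0
    n = len(words)
    while i < n:
        if i + 1 < n and (words[i], words[i + 1]) in pairs:
            merged.append(words[i] + "_" + words[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(words[i])
            i += 1
    return merged


# Small hand-made example (not the question's data)
P1 = ["the", "greenhouse", "effect", "is", "absolute", "zero"]
X = [("greenhouse", "effect"), ("absolute", "zero")]
P2 = merge_bigrams(P1, X)
print(P2)  # ['the', 'greenhouse_effect', 'is', 'absolute_zero']
```

With overlapping candidates such as (a, b) and (b, c), this greedy left-to-right scan merges the first pair and leaves c on its own, which matches the behaviour of the loop in the question.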

Solution 1
2 2019-04-02 09:04:06
