Forming bigrams of words in a list of sentences with Python

Question

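I have a list of sentences:

text = ['cant railway station','citadel hotel',' police stn']

I need to form bigram pairs and store them in a variable. The problem is that when I do that, I get pairs of sentences instead of pairs of words. Here is what I did: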

text2 = [[word for word in line.split()] for line in text]
bigrams = nltk.bigrams(text2)
print(bigrams)

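This yields

[(['cant', 'railway', 'station'], ['citadel', 'hotel']), (['citadel', 'hotel'], ['police', 'stn'])]

Here 'cant railway station' and 'citadel hotel' form one bigram. What I want is

[([cant],[railway]),([railway],[station]),([citadel],[hotel]), and so on...

The last word of the first sentence should not be paired with the first word of the second sentence. What should I do to make it work?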

Answer 1

Use a list comprehension and zip:

>>> text = ["this is a sentence", "so is this one"]
>>> bigrams = [b for l in text for b in zip(l.split(" ")[:-1], l.split(" ")[1:])]
>>> print(bigrams)
[('this', 'is'), ('is', 'a'), ('a', 'sentence'), ('so', 'is'), ('is', 'this'), ('this', 'one')]
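
One caveat with the snippet above: `split(" ")` keeps empty strings when a sentence has leading or repeated spaces, such as ' police stn' from the question, which would produce a bigram like ('', 'police'). Below is a minimal sketch, assuming plain whitespace tokenization is good enough, that calls `split()` with no argument instead. The function name `sentence_bigrams` is made up for illustration.

```python
def sentence_bigrams(sentences):
    """Return word bigrams for each sentence, never crossing sentence boundaries."""
    pairs = []
    for sentence in sentences:
        words = sentence.split()  # no argument: ignores leading and repeated whitespace
        pairs.extend(zip(words, words[1:]))  # zip stops at the shorter input
    return pairs


text = ['cant railway station', 'citadel hotel', ' police stn']
print(sentence_bigrams(text))
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]
```

A one-word sentence simply contributes no bigrams, because `words[1:]` is empty.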

Answer 2

from nltk import word_tokenize
from nltk.util import ngrams

text = ['cant railway station', 'citadel hotel', 'police stn']
bigrams = []
for line in text:
    tokens = word_tokenize(line)
    # the 2 means bigrams; change it to get n-grams of a different size
    bigrams.extend(ngrams(tokens, 2))

Answer 3

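Rather than turning your text into lists of strings, start with each sentence separately as a string. I've also removed punctuation and stopwords; just remove those portions if they are irrelevant to you: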

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import WordPunctTokenizer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

To use it, do this:

for line in sentence:
    features = get_bigrams(line)
    # train set here

Note that this goes a little further and actually scores the bigrams statistically (which will come in handy when training a model).

Answer 4

Without nltk:

ans = []
text = ['cant railway station','citadel hotel',' police stn']
for line in text:
    arr = line.split()
    for i in range(len(arr)-1):
        ans.append([[arr[i]], [arr[i+1]]])


print(ans) #prints: [[['cant'], ['railway']], [['railway'], ['station']], [['citadel'], ['hotel']], [['police'], ['stn']]]

Answer 5

>>> text = ['cant railway station','citadel hotel',' police stn']
>>> bigrams = [(ele, tex.split()[i+1]) for tex in text  for i,ele in enumerate(tex.split()) if i < len(tex.split())-1]
>>> bigrams
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

This uses the enumerate and split functions.

Answer 6

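Read the dataset

import pandas as pd
import nltk as nk  # assumed aliases; the snippet uses pd and nk without showing imports

df = pd.read_csv('dataset.csv', skiprows = 6, index_col = "No")

Collect all available months

df["Month"] = df["Date(ET)"].apply(lambda x : x.split('/')[0])

Create tokens of all tweets per month

tokens = df.groupby("Month")["Contents"].sum().apply(lambda x : x.split(' '))

Create bigrams per month

bigrams = tokens.apply(lambda x : list(nk.ngrams(x, 2)))

Count bigrams per month

count_bigrams = bigrams.apply(lambda x : list(x.count(item) for item in x))

Wrap up the result in neat dataframes

# .iloc selects by position; the Series is indexed by month label
month1 = pd.DataFrame(data = count_bigrams.iloc[0], index= bigrams.iloc[0], columns= ["Count"])
month2 = pd.DataFrame(data = count_bigrams.iloc[1], index= bigrams.iloc[1], columns= ["Count"])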
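
Calling `x.count(item)` for every item rescans the whole list each time, which gets slow for long token lists. As a rough sketch of the counting step only, `collections.Counter` tallies each distinct bigram in a single pass. The `tokens_by_month` dictionary is made-up sample data standing in for the grouped tweets, not the dataset used in this answer.

```python
from collections import Counter

# Hypothetical sample data: month -> list of tokens for that month.
tokens_by_month = {
    "01": "the cat sat on the mat the cat ran".split(),
    "02": "citadel hotel near citadel hotel".split(),
}

# One Counter per month, keyed by (first_word, second_word) tuples.
bigram_counts = {
    month: Counter(zip(tokens, tokens[1:]))
    for month, tokens in tokens_by_month.items()
}

print(bigram_counts["01"].most_common(1))
# [(('the', 'cat'), 2)]
```

Each `Counter` can be turned into a dataframe afterwards if a tabular view is needed.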

Answer 7

Just fixing Dan's code:

def get_bigrams(myString):
    tokenizer = WordPunctTokenizer()
    tokens = tokenizer.tokenize(myString)
    stemmer = PorterStemmer()
    bigram_finder = BigramCollocationFinder.from_words(tokens)
    bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 500)

    for bigram_tuple in bigrams:
        x = "%s %s" % bigram_tuple
        tokens.append(x)

    result = [' '.join([stemmer.stem(w).lower() for w in x.split()]) for x in tokens if x.lower() not in stopwords.words('english') and len(x) > 8]
    return result

Answer 8

The best way is to use the zip function to generate the n-grams, where the 2 passed to range is the number of grams.

test = [1,2,3,4,5,6,7,8,9]
print(test[0:])
print(test[1:])
print(list(zip(test[0:],test[1:])))
%timeit list(zip(*[test[i:] for i in range(2)]))

Output:

[1, 2, 3, 4, 5, 6, 7, 8, 9]  
[2, 3, 4, 5, 6, 7, 8, 9]  
[(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7), (7, 8), (8, 9)]  
1000000 loops, best of 3: 1.34 µs per loop
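
To connect this back to the question, here is a small sketch that wraps the zip idiom in a function and applies it sentence by sentence, so that no n-gram spans two sentences. The helper name `ngrams_zip` is hypothetical, and whitespace tokenization via `split()` is an assumption.

```python
def ngrams_zip(tokens, n=2):
    """Return the n-grams of one token sequence as a list of tuples."""
    # Zip together the shifted copies tokens[0:], tokens[1:], ..., tokens[n-1:].
    return list(zip(*[tokens[i:] for i in range(n)]))


text = ['cant railway station', 'citadel hotel', ' police stn']
bigrams = [gram for line in text for gram in ngrams_zip(line.split(), 2)]
print(bigrams)
# [('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

print(ngrams_zip('cant railway station'.split(), 3))
# [('cant', 'railway', 'station')]
```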

Answer 9

There are a number of ways to solve it, but I solved it this way:

>>> from nltk import bigrams
>>> text = ['cant railway station','citadel hotel',' police stn']
>>> text2 = [[word for word in line.split()] for line in text]
>>> text2
[['cant', 'railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]
>>> output = []
>>> for i in range(len(text2)):
...     output = output + list(bigrams(text2[i]))
...
>>> # a list comprehension would also work here
>>> output
[('cant', 'railway'), ('railway', 'station'), ('citadel', 'hotel'), ('police', 'stn')]

Answer 10

I think the best and most general way to do it is the following:

n      = 2
ngrams = []

for l in L:
    for i in range(n,len(l)+1):
        ngrams.append(l[i-n:i])

Or in other words:

ngrams = [ l[i-n:i] for l in L for i in range(n,len(l)+1) ]

This should work for any n and any sequence l. If there are no n-grams of length n, it returns an empty list.
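
The snippet leaves `L` undefined. Here is a quick usage sketch, assuming `L` is meant to be the list of tokenized sentences from the question. Note that each n-gram comes out as a list slice rather than a tuple.

```python
text = ['cant railway station', 'citadel hotel', ' police stn']
L = [line.split() for line in text]  # assumption: one token list per sentence

n = 2
ngrams = [l[i - n:i] for l in L for i in range(n, len(l) + 1)]
print(ngrams)
# [['cant', 'railway'], ['railway', 'station'], ['citadel', 'hotel'], ['police', 'stn']]

# With n = 3, the two-word sentences are shorter than n and contribute nothing.
n = 3
trigrams = [l[i - n:i] for l in L for i in range(n, len(l) + 1)]
print(trigrams)
# [['cant', 'railway', 'station']]
```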

10 answers

Answer 1: 49 (accepted), 2014-02-18 05:04:29
Answer 2: 10, 2018-02-19 18:30:32
Answer 3: 8, 2014-02-18 04:55:32
Answer 4: 5, 2014-02-18 05:00:53
Answer 5: 3, 2014-02-18 06:21:45
Answer 6: 2, 2018-05-09 20:00:15
Answer 7: 1, 2016-10-02 20:34:01
Answer 8: 1, 2020-08-27 15:45:32
Answer 9: 0, 2018-11-28 20:42:54
Answer 10: 0, 2019-10-21 10:09:14
