Using Python, how do I add only distinct messages to a list?

Question

Certified Python noob here. Please bear with me.

I have multiple files of a million (or so) lines of text each, and I want to filter them down to the distinct lines. That is, even if those millions of lines contain nothing but 15 distinct lines, the code should return 15 lines.

Read a line from the file, put it in a list if it isn't already there, then output the list to another file. Sounds simple?

There's a small catch, though:

I'm looking for distinct messages, not strings/substrings or what have you. I'll explain below.

The Problem:

Suppose we have the following lines in the file:

Random 2345
Hello World
Your code is 91939
Your code is 54879
Your code is 79865
Pancakes 2451
Your verification code is 123456
Your verification code is 789101

Realistically, if I did a simple "if line doesn't exist in myList, add line to myList", that would still return duplicates. The output should be:

Random 2345
Hello World
Your code is 91939
Pancakes 2451
Your verification code is 123456

What I'm going to try:

Now, the numbers don't matter, so I might be able to get away with simply using a regex or something to find all the numbers in the line, replace them with nothing, and then compare the result against the list (whose entries have also had all numbers erased).

Crude, but it's the simplest thing I could think of.
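
The digit-stripping idea could be sketched roughly as follows. This is only an illustration under assumptions: the function name dedupe_ignoring_digits and the sample list are made up here, and a set of digit-free keys stands in for the list so that each lookup stays fast on millions of lines.

```python
import re

def dedupe_ignoring_digits(lines):
    """Keep the first line of each group that is identical once digits are removed."""
    seen_keys = set()   # digit-free versions of the lines kept so far
    kept = []           # original lines, in first-seen order
    for line in lines:
        # Erase every run of digits, then trim the whitespace left behind.
        key = re.sub(r"\d+", "", line).strip()
        if key not in seen_keys:
            seen_keys.add(key)
            kept.append(line)
    return kept

# Hypothetical sample, copied from the lines listed in the question.
sample = [
    "Random 2345",
    "Hello World",
    "Your code is 91939",
    "Your code is 54879",
    "Your code is 79865",
    "Pancakes 2451",
    "Your verification code is 123456",
    "Your verification code is 789101",
]

for line in dedupe_ignoring_digits(sample):
    print(line)
```

On this sample the sketch keeps the five lines listed as the desired output. It does nothing for names or URLs, which is the harder case.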

Still with me?

More Problems:

Now comes the hard part. In addition to the above list, say we had the following:

Hi, my name is Lance. Pleased to meet you.
Hi, my name is Jenny. Pleased to meet you.
Yo, dawg. Name's John Erik. Don't touch my fries.
Yo, dawg. Name's James. Don't touch my fries.
I like turtles 53669
Stefan commented on your video.
n00bpwn3rz commented on your video.
RJ wants to talk to you.
Jenny liked your photo.
Pi wants to talk to you.
Pi says visit my website at www.google.com
John Erik says visit my website at www.johniscool.com
James made fruity ice cubes.

And the output should be the following:

Hi, my name is Lance. Pleased to meet you.
Yo, dawg. Name's John Erik. Don't touch my fries.
I like turtles 53669
Stefan commented on your video.
RJ wants to talk to you.
Jenny liked your photo.
Pi says visit my website at www.google.com
James made fruity ice cubes.

My brain hurts. Not only do I have to treat names as variables, I have to watch out for websites too.

Now, suppose I dissect the line into chars and loop-compare it to the items in the list, also dissected into chars. If it hits X number of positives (char from line = char from list_item), I don't add it to the list. Is that feasible (as in, accurate)? How do I do that in code? Something like this, perhaps?

line_char[] = line       #My Name is Jayson
list_char[] = list_item  #My Name is Lance

if (list_char[] contains some sequence of line_char[]):
     #My Name is Jayson = My Name is Lance (12 TRUE [My Name is ], 6 FALSE [Lance/Jayson]; 12 > 6)
     line exists in list
else:
     add line to list
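
One hedged way to make the pseudocode above concrete is the standard library's difflib.SequenceMatcher, which scores how many characters two strings share in matching runs. The helper names is_similar and add_if_new and the 0.6 threshold are assumptions chosen for illustration, not values taken from the question.

```python
from difflib import SequenceMatcher

def is_similar(line, list_item, threshold=0.6):
    """Return True when the two strings share enough matching characters."""
    # ratio() is 2 * (matched characters) / (total characters in both strings).
    return SequenceMatcher(None, line, list_item).ratio() >= threshold

def add_if_new(line, kept):
    """Append line to kept unless it is similar to something already there."""
    if not any(is_similar(line, item) for item in kept):
        kept.append(line)

kept = []
for line in ["My Name is Jayson", "My Name is Lance", "Hello World"]:
    add_if_new(line, kept)

print(kept)
```

Every new line is compared against every kept line, so this only stays practical while the number of distinct messages is small compared to the number of input lines.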

Any other ideas? This is probably more of a logic question, but I'd like to do this in Python, so I'll just take its advantages and limitations into account.

The Code so far:

Nothing to see here, folks.

import os
import re  # needed for re.sub below

in_path = "../aggregator/"
out_path = "../aggregator_output/"
# For server: for filename in os.listdir(in_path):
# For local: for filename in list_path:
list_path = os.listdir(in_path)
del list_path[0]  # skip the first entry (the hidden .DS_Store file on a Mac, see the PS)
for filename in list_path:
    in_base, in_ext = os.path.splitext(filename)
    in_file = os.path.join(in_path, filename)
    out_file = os.path.join(out_path, in_base + "_cleaned.csv")
    print "Processing " + in_file
    print "Writing to " + out_file
    dirty_file = open(in_file, "rb").read().split("\n")
    clean_file = open(out_file, "wb")  # the comma between the arguments was missing
    list_unique = []
    for line in dirty_file:
        # replace each '",' with '^', then split the line on the remaining commas
        temp_line = re.sub('",', '^', line)
        delimited = temp_line.split(",")
        message = delimited[2]  # the message is in the 3rd column

So far, all my code does is pick the right field out of each line of the file (the 3rd column).

I'd really appreciate some help on this, as it's a rather interesting problem, though one I can't solve myself.

Thanks.

PS - The commented-out part of the code before the for loop is there to account for that annoying hidden .DS_Store file on a Mac, which breaks the rest of the code. I do testing on a Mac, and do the actual thing on an Ubuntu server.

Answer 1

From what I see, you just want to keep the first line of each group of possible "duplicates", where numbers and names don't count as differences...

Why not:

Look at the first words of the line; if you find a new sequence of words, add it to the list.
Compare two strings representing two lines, and define a threshold beyond which you consider the two lines different.

For example, between these two lines:

Hi, my name is Lance. Pleased to meet you.
Hi, my name is Jenny. Pleased to meet you.

The only difference is Lance vs Jenny.

You can then write a compare function (Python has no built-in for this particular metric) based on the difference between the sums of the ASCII codes of all characters in each line, and say: two lines are similar if their "hashes" are close.

Here is sample code for calculating the hash of a line:

class myString(str):
  def __hash__(self):
    count = 0
    for c in self:
      count += ord(c)
    return count

a = myString('Hi, my name is Lance. Pleased to meet you.')
b = myString('Hi, my name is Jenny. Pleased to meet you.')
c = myString("Yo, dawg. Name's John Erik. Don't touch my fries.")

hash(a)  # 3624
hash(b)  # 3657
hash(c)  # 4148

Hope it helps! Note that this solution can have problems with sentences that contain the same characters in a different order, for example:

hash(myString('abc'))  # 294
hash(myString('bac'))  # 294
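
To show how the "close hashes" rule might be applied to a whole list, here is a minimal sketch. It uses a plain function instead of a str subclass, and the names ascii_sum and dedupe_by_ascii_sum as well as the tolerance of 50 are assumptions picked for illustration; a real threshold would have to be tuned on the actual data.

```python
def ascii_sum(line):
    """Sum of the character codes of the line (the "hash" described above)."""
    return sum(ord(c) for c in line)

def dedupe_by_ascii_sum(lines, tolerance=50):
    """Keep a line only if its sum is farther than tolerance from every kept sum."""
    kept = []
    kept_sums = []
    for line in lines:
        current = ascii_sum(line)
        if all(abs(current - other) > tolerance for other in kept_sums):
            kept.append(line)
            kept_sums.append(current)
    return kept

a = "Hi, my name is Lance. Pleased to meet you."
b = "Hi, my name is Jenny. Pleased to meet you."
c = "Yo, dawg. Name's John Erik. Don't touch my fries."

# a and b differ by 33, so b is dropped; c is far from both and is kept.
print(dedupe_by_ascii_sum([a, b, c]))
```

The same caveat applies here: unrelated lines whose sums happen to land within the tolerance are treated as duplicates, just as 'abc' and 'bac' collide exactly.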

Answer 2

Since you are dealing with English sentences, I was wondering if nltk could be used for this. It provides a part-of-speech (POS) tagger that can be used to find the POS tags in a sentence. Lines with the same sequence of tags are probably "similar" lines (this can be improved further by also comparing the actual tokens).

I tried it out on some example sentences from your question, and it looks like it's worth a try:

import nltk

def pos_tags(text):
    return nltk.pos_tag(nltk.word_tokenize(text))

>>> pos_tags("Hi, my name is Lance. Pleased to meet you.")
[('Hi', 'NNP'),
(',', ','),
('my', 'PRP$'),
('name', 'NN'),
('is', 'VBZ'),
('Lance.', 'NNP'),
('Pleased', 'NNP'),
('to', 'TO'),
('meet', 'VB'),
('you', 'PRP'),
('.', '.')]

>>> pos_tags("Hi, my name is Jenny. Pleased to meet you")
[('Hi', 'NNP'),
(',', ','),
('my', 'PRP$'),
('name', 'NN'),
('is', 'VBZ'),
('Jenny.', 'NNP'),
('Pleased', 'NNP'),
('to', 'TO'),
('meet', 'VB'),
('you', 'PRP'),
('.', '.')]

The POS tags for each line can then be encoded as a string and compared. If they are the same, there's a good chance that the lines are similar and can be grouped together.

>>> '-'.join([t[1] for t in pos_tags("Hi, my name is Jenny. Pleased to meet you")])
'NNP-,-PRP$-NN-VBZ-NNP-NNP-TO-VB-PRP-.'

>>> '-'.join([t[1] for t in pos_tags("Hi, my name is Lance. Pleased to meet you")])
'NNP-,-PRP$-NN-VBZ-NNP-NNP-TO-VB-PRP-.'

However, I am not sure how it will perform on a million lines of text.

Answer 3

This approach turns each line into a set of words and compares it with all the sets kept so far. If roughly half or more of a line's words are found in one of those sets, the line is assumed to be a duplicate and skipped.

For this to work, you have to give a clear definition of when two sentences count as similar.

a = """Hi, my name is Lance. Pleased to meet you.
Hi, my name is Jenny. Pleased to meet you.
Yo, dawg. Name's John Erik. Don't touch my fries.
Yo, dawg. Name's James. Don't touch my fries.
I like turtles 53669
Stefan commented on your video.
n00bpwn3rz commented on your video.
RJ wants to talk to you.
Jenny liked your photo.
Pi wants to talk to you.
Pi says visit my website at www.google.com
John Erik says visit my website at www.johniscool.com
James made fruity ice cubes."""


dirty_list = a.split('\n')
clean_list = [] # list of sets containing 'unique sets'
clean_list_pure = [] # list of the original sentences stored as sets in clean_list eg the output
for line in dirty_list:
    line_set = set(line.strip().split(' '))
    if all(len(line_set.intersection(clean_set)) < len(line_set)/2 for clean_set in clean_list):
        clean_list.append(line_set)
        clean_list_pure.append(line.strip())

for cl in clean_list_pure:
    print cl

As output, we get:

Hi, my name is Lance. Pleased to meet you.
Yo, dawg. Name's John Erik. Don't touch my fries.
I like turtles 53669
Stefan commented on your video.
Jenny liked your photo.
Pi says visit my website at www.google.com
James made fruity ice cubes.
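
For reuse on file contents, the same word-overlap rule could be wrapped in a function. This is a sketch for Python 3, not part of the original answer; the name dedupe_by_word_overlap and the shortened sample are assumptions. The floor division (//) mirrors the integer division that the Python 2 code above performs with /.

```python
def dedupe_by_word_overlap(lines):
    """Keep a line only if it shares fewer than half its words with every kept line."""
    kept_sets = []    # word sets of the lines kept so far
    kept_lines = []   # the kept lines themselves, in first-seen order
    for line in lines:
        words = set(line.strip().split(" "))
        limit = len(words) // 2   # floor division, as in the Python 2 original
        if all(len(words & kept) < limit for kept in kept_sets):
            kept_sets.append(words)
            kept_lines.append(line.strip())
    return kept_lines

# Shortened, hypothetical sample built from the lines in the question.
sample = [
    "Hi, my name is Lance. Pleased to meet you.",
    "Hi, my name is Jenny. Pleased to meet you.",
    "Stefan commented on your video.",
    "n00bpwn3rz commented on your video.",
    "RJ wants to talk to you.",
    "James made fruity ice cubes.",
]

for line in dedupe_by_word_overlap(sample):
    print(line)
```

Note that "RJ wants to talk to you." is dropped because it shares "to" and "you." with the first line, which is why it is also missing from the output listed above. With true division the limit for that five-word line would be 2.5 instead of 2 and it would be kept, so the choice of threshold matters for short lines.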

3 answers

Answer 1
1 2014-07-22 07:48:26

Answer 2
1 2014-07-22 08:21:44

Answer 3
1 Accepted 2014-07-22 08:33:15