Removing partially similar tweets

Question

I am new to Python and to Stack Overflow. I have a CSV file with three columns (ID, Date_Of_creation, Text) and almost 25,000 entries. I have to remove the duplicate tweets (Text column), and the code below works fine for removing exact duplicates:

import csv

csvInputFile = open('inputFile.csv', 'r',encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')

csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()

for row in csvReader:
    #print(row[3])
    if row[3] in cleanData: continue
    cleanData.add(row[3])
    csvWriter.writerow(row)

print(cleanData)
csvOutputFile.close()
csvInputFile.close()

This code removes all exact duplicates along with their corresponding IDs and creation dates. As a second step of the analysis, I noticed that some retweets do not have their original tweets in the data set. I want to keep those retweets. Put simply, I want to remove all duplicates from the Text column, whether the duplicate is a tweet or a retweet. For example:

"It will not be easy for them to handle the situation at this stage:…"

"RT @ReutersLobby: It will not be easy for them to handle the situation at this stage:…"

As the tweet and retweet above show, the retweet carries the extra prefix "RT @ReutersLobby:", so the code above will not remove this retweet from the final set. I want to remove every tweet that is a copy of another tweet, because the focus is on the tweet text and creation time, not on the other fields. I searched for this but could not find anything related on the forum. I hope someone can help me with this problem.

Answer 1

I think it's a pretty quick fix:

import csv
import re

csvInputFile = open('inputFile.csv', 'r',encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')

csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()

for row in csvReader:
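    # Skip exact duplicates and retweets whose original text was already kept.
    # \w+ matches only the handle, so a later ": " inside the tweet is not swallowed.
    if row[3] in cleanData or re.sub(r'^RT @\w+: ', '', row[3]) in cleanData:
        continue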
    cleanData.add(row[3])
    csvWriter.writerow(row)

print(cleanData)
csvOutputFile.close()
csvInputFile.close()

The condition I added checks whether the tweet, once stripped of the retweet prefix, already exists in the cleaned set.
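One thing to watch with the check above: it only catches a retweet when the original tweet was seen earlier in the file. If the retweet comes first, its full text (prefix included) goes into the set, and the original that follows is still written out. A possible variant, sketched below, stores the prefix-stripped text as the key so the order of the rows does not matter and whichever copy appears first is kept. The names `RT_PREFIX`, `dedupe_rows` and `text_index` are made up for illustration, the sample rows are invented, and the column index is passed in because it depends on how the CSV is actually laid out.

```python
import csv
import io
import re

# Hypothetical helper pattern: a leading "RT @handle: " retweet prefix.
RT_PREFIX = re.compile(r'^RT @\w+:\s*')


def dedupe_rows(rows, text_index):
    """Yield only rows whose tweet text, minus any retweet prefix, is new."""
    seen = set()
    for row in rows:
        # Normalize first, so a tweet and its retweet share the same key.
        key = RT_PREFIX.sub('', row[text_index]).strip()
        if key in seen:
            continue
        seen.add(key)
        yield row


# Invented sample data: the retweet appears BEFORE the original tweet.
sample = io.StringIO(
    'ID,Date_Of_creation,Text\n'
    '1,2018-04-01,"RT @ReutersLobby: It will not be easy"\n'
    '2,2018-04-02,"It will not be easy"\n'
    '3,2018-04-03,"Another tweet"\n'
)
reader = csv.reader(sample)
header = next(reader)  # skip the header row
kept = list(dedupe_rows(reader, text_index=2))
```

Here `kept` holds rows 1 and 3: the retweet is retained because it came first, and the later original is dropped as its duplicate. To run this on real files, the same generator can be fed a `csv.reader` over the input file and its output passed to `csvWriter.writerows`.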

Solution 1
0 Accepted 2018-04-23 17:01:35