
Removing tweets with partial similarity

I am new to Python and also to Stack Overflow. I have a CSV file with three columns (ID, Date_Of_creation, Text) and almost 25,000 entries. I have to remove the duplicate tweets (Text column), and the code below works fine for removing exact duplicates:

import csv

csvInputFile = open('inputFile.csv', 'r',encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')

csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()

for row in csvReader:
    #print(row[3])
    if row[3] in cleanData: continue
    cleanData.add(row[3])
    csvWriter.writerow(row)

print(cleanData)
csvOutputFile.close()
csvInputFile.close()

This code removes all exact duplicates along with their corresponding IDs and creation dates. As a second step of the analysis, I noticed that there are some retweets whose original tweets are not in the data set. I want to keep those retweets. Put simply, I want to remove all duplicates from the Text column, whether the duplicate is a tweet or a retweet. For example:

"It will not be easy for them to handle the situation at this stage:…"

"RT @ReutersLobby: It will not be easy for them to handle the situation at this stage:…"

As the tweet and retweet above show, the retweet carries the extra prefix "RT @ReutersLobby:". Because of that prefix, the code above will not remove this retweet from the final set. I want to remove every tweet that is a copy of another tweet, because the focus is on the tweet's text and creation time, not on the other fields. I searched the forum but could not find anything related. I hope someone can help me with this problem.

I think it's a pretty quick fix:

import csv
import re

csvInputFile = open('inputFile.csv', 'r',encoding="utf-8", newline='')
csvOutputFile = open('outputFile.csv', 'w', encoding="utf-8", newline='')

csvReader = csv.reader(csvInputFile)
csvWriter = csv.writer(csvOutputFile)
cleanData = set()

for row in csvReader:
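    # Skip the row if its text, with or without a leading "RT @user: " prefix,
    # has been seen already. \w+ matches only the user name, so a later ": "
    # inside the tweet body is not swallowed the way a greedy .* would.
    if row[3] in cleanData or re.sub(r'^RT @\w+: ', '', row[3]) in cleanData: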
        continue
    cleanData.add(row[3])
    csvWriter.writerow(row)

print(cleanData)
csvOutputFile.close()
csvInputFile.close()

The condition I added checks whether the tweet, once stripped of its retweet prefix, already exists in the cleaned set.
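One caveat: that check only catches a retweet when the original tweet appears earlier in the file, and two retweets of the same missing original are both kept because each is stored with its own prefix. A possible variation, sketched below, stores the prefix-stripped text as the key so that the order of rows does not matter. The names `RT_PREFIX`, `dedupe_rows` and `text_col`, as well as the sample rows, are made up for illustration and are not from the question.

```python
import csv
import io
import re

# Matches a leading retweet marker such as "RT @ReutersLobby: ".
RT_PREFIX = re.compile(r'^RT @\w+:\s*')


def dedupe_rows(rows, text_col):
    """Yield each row whose text, ignoring any "RT @user: " prefix, is new."""
    seen = set()
    for row in rows:
        # Normalise first, so a tweet and its retweets share one key.
        key = RT_PREFIX.sub('', row[text_col])
        if key in seen:
            continue
        seen.add(key)
        yield row


# Small in-memory demonstration; these rows are invented sample data.
sample = io.StringIO(
    '1,2020-01-01,"RT @ReutersLobby: It will not be easy"\n'
    '2,2020-01-02,"It will not be easy"\n'
    '3,2020-01-03,"RT @SomeoneElse: A retweet with no original"\n'
    '4,2020-01-04,"RT @Another: A retweet with no original"\n'
)
kept = list(dedupe_rows(csv.reader(sample), text_col=2))
for row in kept:
    print(row)
```

With this approach the first copy encountered is the one that survives, whether it is the original or a retweet, so sort the rows by creation date beforehand if the earliest copy should win. For the real file, pass the `csv.reader` built from the input file and set `text_col` to the index of the column that holds the tweet text.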

