
Python save file to csv

I have the following code that reads in Twitter tweets, processes the data, and then saves the result to a new file.

This is the code:

#import regex
import re

#start process_tweet
def processTweet(tweet):
    # process the tweets

    #Convert to lower case
    tweet = tweet.lower()
    #Convert www.* or https?://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    #Convert @username to AT_USER
    tweet = re.sub('@[^\s]+','AT_USER',tweet)
    #Remove additional white spaces
    tweet = re.sub('[\s]+', ' ', tweet)
    #Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    #trim
    tweet = tweet.strip('\'"')
    return tweet
#end

#Read the tweets one by one and process it
input = open('withoutEmptylines.csv', 'rb')
output = open('editedTweets.csv','wb')

line = input.readline()

while line:
    processedTweet = processTweet(line)
    print (processedTweet)
    output.write(processedTweet)
    line = input.readline()

input.close()
output.close()

My data in the input file looks like this, with one tweet per line:

She wants to ride my BMW the go for a ride in my BMW lol http://t.co/FeoNg48AQZ
BMW Sees U.S. As Top Market For 2015 i8 http://t.co/kkFyiBDcaP

My function works well, but I am not happy with the output, which looks like this:

she wants to ride my bmw the go for a ride in my bmw lol URL rt AT_USER Ðun bmw es mucho? yo: bmw. -AT_USER veeergaaa!. hahahahahahahahaha nos hiciste la noche caray! 

So it puts everything on one line instead of keeping each tweet on its own line, as in the input file.

Does anyone have an idea how to get each tweet on its own line?

With an example file like this:

tweet number one
tweet number two
tweet number three

This code:

# Iterate over the file line by line; each line keeps its trailing newline
with open('tweets.txt') as file:
    for line in file:
        print(line)

Produces this output:

tweet number one

tweet number two

tweet number three

Python reads the line endings just fine, but your script replaces them via regular expression substitution.
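As a quick check that the line endings really do survive reading, here is a minimal sketch. It uses `io.StringIO` as a hypothetical stand-in for `tweets.txt` so it runs without a file on disk, and `repr()` to make the trailing newline visible.

```python
import io

# Hypothetical in-memory stand-in for tweets.txt (an assumption for illustration)
sample = io.StringIO("tweet number one\ntweet number two\n")

# Iterating a file-like object yields each line with its newline attached
lines = [line for line in sample]

for line in lines:
    # repr() shows the trailing '\n' explicitly
    print(repr(line))
```

The extra blank lines in the printed output above come from the same effect: each line already ends in a newline, and `print` adds another.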

This regex substitution:

tweet = re.sub('[\s]+', ' ', tweet)

is converting every run of whitespace characters (e.g. tabs and newlines) into a single space.
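The effect can be seen on a single string. This is a small sketch with a made-up input value (`raw` is not from the question's data); it shows that the trailing newline is swallowed along with the spaces and the tab.

```python
import re

# Made-up sample line containing repeated spaces, a tab, and a trailing newline
raw = "tweet   number\tone\n"

# \s matches spaces, tabs AND newlines, so the line ending is replaced too
collapsed = re.sub(r'[\s]+', ' ', raw)

print(repr(collapsed))
```

Because the newline is gone by the time the string is written, every tweet ends up on the same line of the output file.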

Either add a newline to the tweet before you output it, or modify your regex so that it does not substitute newlines, like so:

tweet = re.sub('[ ]+', ' ', tweet)

EDIT: I had originally pasted my test substitution command there; the suggestion has been fixed.
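For the first option (adding the newline back before writing), one possible shape is sketched below. It assumes Python 3, text-mode files, and UTF-8 input, none of which the question states. The names `process_tweet`, `process_lines`, and `process_file` are hypothetical, and the extra `.strip()` that drops the leftover trailing space is an addition not present in the original function.

```python
import re

def process_tweet(tweet):
    """Normalise one tweet, following the steps of processTweet in the question."""
    tweet = tweet.lower()
    # Convert www.* or http(s)://* to URL
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    # Convert @username to AT_USER
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    # Collapse whitespace runs (this also swallows the trailing newline)
    tweet = re.sub(r'[\s]+', ' ', tweet)
    # Replace #word with word
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    # Assumption: also strip surrounding whitespace, then quotes
    return tweet.strip().strip('\'"')

def process_lines(lines):
    # Re-append exactly one newline per tweet after processing
    return [process_tweet(line) + '\n' for line in lines]

def process_file(in_path, out_path):
    # Assumption: both files are UTF-8 text
    with open(in_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as dst:
        dst.writelines(process_lines(src))
```

With the file names from the question, the call would be `process_file('withoutEmptylines.csv', 'editedTweets.csv')`. Keeping `[\s]+` and re-adding the newline also collapses tabs, which the `[ ]+` variant does not.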
