Parsing a text file of tweets into csv using the '|' delimiter

Question

I have a .txt file containing geotagged tweets. The fields are delimited by the '|' character. The fields (which can be viewed as columns) are datetime, latitude, longitude and tweet_text.

Date_time|latitude|longitude|tweet_text
Mon Jan 01 09:09:57 +0000 2018|-37.8140362|144.9644232|terima kasih 2017 yang ohsem. semoga 2018 akan lebih baik lagi.-selamat tahun baru rakan-rakanâ€¦ 
Mon Jan 01 09:15:54 +0000 2018|-37.81639|144.9655|we love christmas and new year proposals! happy new year to everyone celebrating love this yearâ€¦ 
Mon Jan 01 09:42:08 +0000 2018|-37.818|144.985|@michaelpaynter entertaining everyone at yarra park nye event #melbourne| #nye #musicâ€¦ 
Mon Jan 01 09:45:16 +0000 2018|-37.818|144.985|@emilyurbandiva and brother @jwilliamsimusik entertaining everyone at yarra park nye eventâ€|¦

Initially, I used

data = pd.read_csv('MelbCBD_scs2018_new.txt', sep="|", header=None)

but it threw a parsing error whenever tweet_text contained '|'.

I tried cleaning tweet_text manually, but that is too much work for large files. So I changed the arguments passed to read_csv.

data = pd.read_csv('MelbCBD_scs2018_new.txt', sep="|", header=None, quoting=csv.QUOTE_NONE, error_bad_lines=False)

But this displays the following warning and skips those lines (that is, whole tweets), which I do not want.

b'Skipping line 340: expected 4 fields, saw 5

Ideally, I would like code that removes any special character after the third '|' in each line of the .txt file (that is, in the tweet_text column) and parses the result into a .csv file, without skipping any line.

Answer 1

So it was just a matter of fixing the number of columns by specifying the column names.

data = pd.read_csv('MelbCBD_scs2018_new.txt', sep="|", names=["Date_time", "latitude", "longitude", "tweet_text"], header=None, quoting=csv.QUOTE_NONE, error_bad_lines=False)

Now, this returns every line without skipping any and stores the result in the dataframe named 'data'.
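
If some lines are still skipped because tweet_text itself contains '|', one alternative is to split each line on only the first three delimiters, so that any later '|' stays inside tweet_text. The sketch below is not part of the original answer and uses only the standard library. The helper names split_tweet_line and convert_to_csv, the output file name, the UTF-8 encoding and the presence of a header line in the input are all assumptions made for illustration.

```python
import csv

COLUMNS = ["Date_time", "latitude", "longitude", "tweet_text"]


def split_tweet_line(line):
    """Split on the first three '|' only; later '|' remain in tweet_text."""
    parts = line.rstrip("\r\n").split("|", 3)
    # Pad short lines so every row has exactly four fields.
    parts += [""] * (len(COLUMNS) - len(parts))
    return parts


def convert_to_csv(src_path, dst_path, has_header=True, encoding="utf-8"):
    """Rewrite a '|'-delimited tweet file as a comma-separated file.

    has_header and encoding are assumptions about the input file.
    """
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding, newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(COLUMNS)
        for i, line in enumerate(src):
            if has_header and i == 0:
                continue  # skip the original 'Date_time|latitude|...' line
            if not line.strip():
                continue  # ignore blank lines
            writer.writerow(split_tweet_line(line))


if __name__ == "__main__":
    # Hypothetical output name; adjust to taste.
    convert_to_csv("MelbCBD_scs2018_new.txt", "MelbCBD_scs2018_new.csv")
```

The resulting file is an ordinary comma-separated file, so it should load with a plain pd.read_csv call. If the extra '|' characters are unwanted, they can be removed from the fourth field, for example with parts[3].replace("|", ""), before the row is written.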

Answer 1 was accepted on 2019-08-12 02:53:49.
