
Move json column in csv to another file with python

I have a test.dat file that contains these 5 columns:

  • ['user_id', 'item_id', 'rating', 'scraping_time', 'tweet_in_json_format']

I want to move these three columns to test2.csv:

  • ['user_id', 'scraping_time', 'tweet_in_json_format']

Here is an example of one row of test.dat:

user_id,item_id,rating,scraping_time,tweet_in_json_format
819099800,0993846,10,1391278544,{"contributors": null, "truncated": false, "text": "", "in_reply_to_status_id": null, "id": 426902385735520256, "favorite_count": 0, "source": "<a href=\"http://itunes.apple.com/us/app/imdb-movies-tv/id342792525?mt=8&uo=4\" rel=\"nofollow\">IMDb Movies & TV on iOS</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [61, 66], "text": "IMDb"}], "urls": [{"url": "http://tweeter.com/xQuwO8KJP1", "indices": [38, 60], "expanded_url": "http://www.imdb.com/title/tt0993846", "display_url": "imdb.com/title/tt0993846"}]}, "in_reply_to_screen_name": null, "id_str": "426902385735520256", "retweet_count": 0, "in_reply_to_user_id": null, "favorited": false, "user": {"follow_request_sent": false, "profile_use_background_image": true, "id": 819099800, "verified": false, "profile_text_color": "333333", "profile_image_url_https": "https://pbs.twimg.com/profile_images/420936276607791104/KVrTuNU9_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "geo_enabled": false, "entities": {"description": {"urls": []}}, "followers_count": 116, "protected": false, "location": "in my dreams ", "default_profile_image": false, "id_str": "819099800", "lang": "ar", "utc_offset": -36000, "statuses_count": 1169, "description": "\u0646\u0628\u0649 \u0627\u0644\u0623\u062c\u0631 .", "friends_count": 86, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/420936276607791104/KVrTuNU9_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/812779345/23ae1c7da01b05a6d5d9b0be28fe14c9.jpeg", "profile_background_color": "C0DEED", "profile_banner_url": "https://pbs.twimg.com/profile_banners/819099800/1390618207", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/812779345/23ae1c7da01b05a6d5d9b0be28fe14c9.jpeg", "name": "vivo per lei ", "is_translation_enabled": false, "profile_background_tile": false, "favourites_count": 155, "screen_name": "Orkida__", "url": null, "created_at": "Wed Sep 12 08:08:06 +0000 2012", "contributors_enabled": false, "time_zone": "Hawaii", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": false, "listed_count": 0}, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Sat Jan 25 02:20:34 +0000 2014", "in_reply_to_status_id_str": null, "place": null}

The problem is that 'tweet_in_json_format' contains JSON, and Pandas can't treat it as a single column.

How can I do this?

Your main problem is that your input isn't actually CSV - if it were, the JSON data in the last column would have to be quoted so that its internal commas are not interpreted as CSV delimiters.
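To see the difference, Python's csv module will quote any field that contains commas (and double any embedded quotes), which is what a properly produced CSV file would have done to the JSON column. This is only an illustrative sketch using a shortened stand-in for one of your rows, not part of the solution itself:

import csv, io

# A shortened stand-in for one of your rows: the last field contains commas and quotes
row = ['819099800', '1391278544', '{"text": "", "id": 426902385735520256}']

buf = io.StringIO()
csv.writer(buf).writerow(row)
print(buf.getvalue())
# 819099800,1391278544,"{""text"": """", ""id"": 426902385735520256}"

# A CSV reader can then recover the three fields, JSON intact
print(next(csv.reader(io.StringIO(buf.getvalue()))))
# ['819099800', '1391278544', '{"text": "", "id": 426902385735520256}']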

If you simply want to perform the transformation you describe, and you can be confident of the input format remaining the same (i.e. user ID, item ID, rating, scraping time and JSON data in that order, separated with commas), then this can be achieved relatively simply without needing Pandas (which is really overkill for the job):

with open('test.dat') as f_in, open('test2.csv', 'w') as f_out:
    for line in f_in:
        # Split on the first four commas only, so the JSON in the last field
        # stays intact even though it contains commas of its own.
        parts = line.rstrip('\n').split(',', 4)
        # Keep user_id, scraping_time and tweet_in_json_format.
        f_out.write('{},{},{}\n'.format(parts[0], parts[3], parts[4]))

In short, this opens the input and output files, then splits each line of the input on at most the first four commas, which separates the line into its five fields without mangling the JSON. It then writes the first, fourth and fifth fields (corresponding to user ID, scraping time and JSON data) to the output file, separated with commas.
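Note that the output produced this way has the same quirk as the input: the JSON field is written unquoted, so test2.csv still isn't strictly valid CSV. If downstream tools need to parse it, a variant of the same idea (a sketch, assuming the same fixed input layout) can let the csv module quote the JSON field for you:

import csv

with open('test.dat') as f_in, open('test2.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)  # quotes any field containing commas or quotes
    for line in f_in:
        parts = line.rstrip('\n').split(',', 4)  # the five expected fields
        writer.writerow([parts[0], parts[3], parts[4]])  # user_id, scraping_time, JSON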

Please note that this is a slightly brittle solution, as it will break if the column order changes.
