
Move json column in csv to another file with python

I have a test.dat file that contains the following 5 columns:

  • ['user_id','item_id','rating','scraping_time','tweet_in_json_format']

I want to move these three columns to test2.csv:

  • ['user_id','scraping_time','tweet_in_json_format']

Here is the header and a sample row from test.dat:

user_id,item_id,rating,scraping_time,tweet_in_json_format
819099800,0993846,10,1391278544,{"contributors": null, "truncated": false, "text": "", "in_reply_to_status_id": null, "id": 426902385735520256, "favorite_count": 0, "source": "<a href=\"http://itunes.apple.com/us/app/imdb-movies-tv/id342792525?mt=8&uo=4\" rel=\"nofollow\">IMDb Movies & TV on iOS</a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [{"indices": [61, 66], "text": "IMDb"}], "urls": [{"url": "http://tweeter.com/xQuwO8KJP1", "indices": [38, 60], "expanded_url": "http://www.imdb.com/title/tt0993846", "display_url": "imdb.com/title/tt0993846"}]}, "in_reply_to_screen_name": null, "id_str": "426902385735520256", "retweet_count": 0, "in_reply_to_user_id": null, "favorited": false, "user": {"follow_request_sent": false, "profile_use_background_image": true, "id": 819099800, "verified": false, "profile_text_color": "333333", "profile_image_url_https": "https://pbs.twimg.com/profile_images/420936276607791104/KVrTuNU9_normal.jpeg", "profile_sidebar_fill_color": "DDEEF6", "is_translator": false, "geo_enabled": false, "entities": {"description": {"urls": []}}, "followers_count": 116, "protected": false, "location": "in my dreams ", "default_profile_image": false, "id_str": "819099800", "lang": "ar", "utc_offset": -36000, "statuses_count": 1169, "description": "\u0646\u0628\u0649 \u0627\u0644\u0623\u062c\u0631 .", "friends_count": 86, "profile_link_color": "0084B4", "profile_image_url": "http://pbs.twimg.com/profile_images/420936276607791104/KVrTuNU9_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/812779345/23ae1c7da01b05a6d5d9b0be28fe14c9.jpeg", "profile_background_color": "C0DEED", "profile_banner_url": "https://pbs.twimg.com/profile_banners/819099800/1390618207", "profile_background_image_url": "http://a0.twimg.com/profile_background_images/812779345/23ae1c7da01b05a6d5d9b0be28fe14c9.jpeg", "name": "vivo per lei ", 
"is_translation_enabled": false, "profile_background_tile": false, "favourites_count": 155, "screen_name": "Orkida__", "url": null, "created_at": "Wed Sep 12 08:08:06 +0000 2012", "contributors_enabled": false, "time_zone": "Hawaii", "profile_sidebar_border_color": "C0DEED", "default_profile": false, "following": false, "listed_count": 0}, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Sat Jan 25 02:20:34 +0000 2014", "in_reply_to_status_id_str": null, "place": null}

The problem is that 'tweet_in_json_format' is JSON, and Pandas cannot treat it as a single column.

How can I do this?

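Your main problem is that your input is not actually CSV. If it were, the JSON data in the final column would have to be quoted so that its internal commas are not interpreted as CSV delimiters.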
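
To illustrate what "quoted" means here, the following is a minimal sketch using the standard library's csv module. The shortened tweet dictionary is made up for the example and is not your real data. It shows that a row written by csv.writer survives a round trip with the JSON intact as a single field.

```python
import csv
import io
import json

# Hypothetical, heavily shortened tweet used only for illustration.
tweet = {"text": "", "retweeted": False,
         "entities": {"hashtags": [{"indices": [61, 66], "text": "IMDb"}]}}
row = ["819099800", "0993846", "10", "1391278544", json.dumps(tweet)]

# csv.writer wraps the JSON field in double quotes and doubles the inner
# quotes, so the commas inside the JSON are no longer field separators.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
encoded = buffer.getvalue()

# Reading the line back yields exactly five fields again.
decoded = next(csv.reader(io.StringIO(encoded)))
print(len(decoded))
print(json.loads(decoded[4]) == tweet)
```

A file in that form is what CSV parsers expect; your test.dat lacks the quoting, which is why the JSON gets split into many columns.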

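If you just want to perform the transformation you describe, and you can be confident that the input format stays the same (i.e. user ID, item ID, rating, scraping time and JSON data, in that order and separated by commas), you can do it fairly simply without Pandas (which is overkill for this job anyway):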

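with open('test.dat') as f_in, open('test2.csv', 'w') as f_out:
    for line in f_in:
        # Drop the trailing newline so the output does not get blank lines.
        # Split on the first four commas only; the JSON stays in parts[4].
        parts = line.rstrip('\n').split(',', 4)
        f_out.write('{},{},{}\n'.format(parts[0], parts[3], parts[4]))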

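In brief, this opens the input and output files and then, for each line of the input file, splits it on commas at most four times. That separates the line into its five fields without having to process the JSON. It then writes the first, fourth and fifth fields (corresponding to user ID, scraping time and JSON data) to the output file, separated by commas.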
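
As a small check of the splitting step on its own, here is a sketch with a shortened, made-up line (the real JSON is much longer). It assumes each record sits on one physical line.

```python
import json

# Hypothetical shortened record; only the shape matches test.dat.
line = '819099800,0993846,10,1391278544,{"contributors": null, "truncated": false}\n'

# maxsplit=4 stops after the fourth comma, so everything after it,
# including the commas inside the JSON, ends up in the last element.
parts = line.rstrip('\n').split(',', 4)

print(len(parts))
print(parts[4])

# The last element is still valid JSON and can be parsed if needed.
tweet = json.loads(parts[4])
```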

Note that this is a rather brittle solution, since it will break if the column order changes.
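
If that worries you, one possible variation is to look the columns up by name in the header line instead of hard-coding positions. The sketch below is not part of the approach above: the function name extract_columns is made up, and it still assumes that the JSON column is the last one and that each record sits on a single line. It also writes the output through csv.writer, so the JSON ends up properly quoted.

```python
import csv

def extract_columns(in_path, out_path, wanted):
    """Copy the named columns to out_path as properly quoted CSV.

    Assumes the unquoted JSON column is the LAST column of the input.
    """
    with open(in_path, newline='') as f_in, \
            open(out_path, 'w', newline='') as f_out:
        # Column names come from the first line of the input file.
        header = f_in.readline().rstrip('\r\n').split(',')
        indices = [header.index(name) for name in wanted]

        writer = csv.writer(f_out)
        writer.writerow(wanted)
        for line in f_in:
            line = line.rstrip('\r\n')
            if not line:
                continue  # skip empty lines
            # Split only as often as needed to isolate the leading columns.
            parts = line.split(',', len(header) - 1)
            writer.writerow([parts[i] for i in indices])
```

It could be called as extract_columns('test.dat', 'test2.csv', ['user_id', 'scraping_time', 'tweet_in_json_format']). Because the JSON field is quoted in the result, a standard CSV reader should then see it as one column.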

