
Split JSON data into multiple files using Python

I have a file that I retrieved from the internet. It contains JSON-formatted data.

I am trying to split this file into smaller parts.

For example:

Original file:

{
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - pagerank",
    "totalResults": "14700000",
    "searchTerms": "pagerank",
    "count": 10,
    "startIndex": 11,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ],
  "request": [
   {
    "title": "Google Custom Search - pagerank",
    "totalResults": "14700000",
    "searchTerms": "pagerank",
    "count": 10,
    "startIndex": 1,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ]
 },
 "context": {
  "title": "CS Curriculum",
  "facets": [
   [
    {
     "label": "lectures",
     "anchor": "Lectures",
     "label_with_op": "more:lectures"
    }
   ],
   [
    {
     "label": "assignments",
     "anchor": "Assignments",
     "label_with_op": "more:assignments"
    }
   ],
   [
    {
     "label": "reference",
     "anchor": "Reference",
     "label_with_op": "more:reference"
    }
   ]
  ]
 },
 "searchInformation": {
  "searchTime": 0.239874,
  "formattedSearchTime": "0.24",
  "totalResults": "14700000",
  "formattedTotalResults": "14,700,000"
 },
 "items": [
  {
   "kind": "customsearch#result",
   "title": "Lecture slides on PageRank",
   "htmlTitle": "Lecture slides on \u003cb\u003ePageRank\u003c/b\u003e",
   "link": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
   "displayLink": "www.cs.utexas.edu",
   "snippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
   "htmlSnippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & \u003cb\u003ePageRank\u003c/b\u003e. \u003cbr\u003e\nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
   "cacheId": "CwgPK6hTEZQJ",
   "mime": "application/vnd.ms-powerpoint",
   "fileFormat": "Microsoft Powerpoint",
   "formattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
   "htmlFormattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-\u003cb\u003epagerank\u003c/b\u003e.ppt",
   "pagemap": {
    "metatags": [
     {
      "author": "jhebert",
      "last saved by": "Google"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "The PageRank Citation Ranking: Bringing Order to the Web January ...",
   "htmlTitle": "The \u003cb\u003ePageRank\u003c/b\u003e Citation Ranking: Bringing Order to the Web January \u003cb\u003e...\u003c/b\u003e",
   "link": "http://www.cis.upenn.edu/~mkearns/teaching/NetworkedLife/pagerank.pdf",
   "displayLink": "www.cis.upenn.edu",
   "snippet": "Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \nThis ranking, called PageRank, helps search engines and.",
   "htmlSnippet": "Jan 29, 1998 \u003cb\u003e...\u003c/b\u003e We compare \u003cb\u003ePageRank\u003c/b\u003e to an idealized random Web surfer. We show how to ... \u003cbr\u003e\nThis ranking, called \u003cb\u003ePageRank\u003c/b\u003e, helps search engines and.",
   "cacheId": "akmuPYNhiKMJ",
   "mime": "application/pdf",
   "fileFormat": "PDF/Adobe Acrobat",
   "formattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../pagerank.pdf",
   "htmlFormattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../\u003cb\u003epagerank\u003c/b\u003e.pdf",
   "pagemap": {
    "cse_image": [
     {
      "src": "x-raw-image:///9a2d934c7c41f83c4c97c3fb9a4cb4cc8fbcb453aaf1002ed6f970005773aa0e"
     }
    ],
    "cse_thumbnail": [
     {
      "width": "262",
      "height": "193",
      "src": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQcCouA-BJlMWA0HZNMSxsXzbqIZzgu6tXXRqiuse2sttpJaNK2b0cNbm4"
     }
    ],
    "metatags": [
     {
      "producer": "AFPL Ghostscript 7.0",
      "creator": "dvipsk 5.58f Copyright 1986, 1994 Radical Eye Software",
      "title": "prpaperdraft.dvi"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...",
   "htmlTitle": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES \u003cb\u003e...\u003c/b\u003e",
   "link": "http://stanford.edu/class/math51/PageRank.pdf",
   "displayLink": "stanford.edu",
   "snippet": "Google's method1 is called the PageRank algorithm and was developed by \nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
   "htmlSnippet": "Google's method1 is called the \u003cb\u003ePageRank\u003c/b\u003e algorithm and was developed by \u003cbr\u003e\nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
   "cacheId": "RKV6ZEmHrjUJ",
   "mime": "application/pdf",
   "fileFormat": "PDF/Adobe Acrobat",
   "formattedUrl": "stanford.edu/class/math51/PageRank.pdf",
   "htmlFormattedUrl": "stanford.edu/class/math51/\u003cb\u003ePageRank\u003c/b\u003e.pdf",
   "pagemap": {
    "metatags": [
     {
      "producer": "pdfTeX-1.40.13",
      "creator": "TeX",
      "creationdate": "D:20130604152429-07'00'",
      "moddate": "D:20130604152429-07'00'",
      "fullbanner": "This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012) kpathsea version 6.1.0"
     }
    ]
   }
  },

File after processing:

{u'snippet': u'Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \\nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.', u'title': u'Lecture slides on PageRank'}
{u'snippet': u'Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \\nThis ranking, called PageRank, helps search engines and.', u'title': u'The PageRank Citation Ranking: Bringing Order to the Web January ...'}
{u'snippet': u"Google's method1 is called the PageRank algorithm and was developed by \\nGoogle founders Sergey Brin and Larry Page while they were graduate students.", u'title': u'MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...'}

I want to split this into three separate files (.txt or .json), each beginning with {u'snippet' ... '}.

I am trying to do this so that I can run a text comparison process.

PS: I have kept only the parts I need, which are the title and snippet. I may therefore have lost the JSON formatting along the way.

Since you seem to be able to find the correct parts of your input, you should be able to write each of them to its own file. I assume that you have some kind of loop in which you find the relevant data:

fileno = 1
while True:  # or whatever you use to loop over your input
    # parse input
    # ...
    # have the 'snippet'-part in a variable
    with open('file_{:02d}.txt'.format(fileno), 'w') as f:
        fileno += 1
        f.write(snippet_var + "\n")

This will give you numbered files, starting from 1, with the number zero-padded to two digits (file_01.txt, file_02.txt, and so on). If my assumptions are wrong, please update your question to show how you currently do things.
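The skeleton above can be made concrete once the snippets are available as a list of strings. The following is only a sketch under that assumption, written for Python 3; the function name `write_numbered_files` and its `out_dir` and `pattern` parameters are made up for illustration and are not part of the question.

```python
import os


def write_numbered_files(snippets, out_dir=".", pattern="file_{:02d}.txt"):
    """Write each string in `snippets` to its own numbered file.

    Hypothetical helper: returns the list of paths that were written.
    """
    paths = []
    # enumerate(..., start=1) supplies the counter, so no manual increment is needed
    for fileno, snippet in enumerate(snippets, start=1):
        path = os.path.join(out_dir, pattern.format(fileno))
        with open(path, "w", encoding="utf-8") as f:
            f.write(snippet + "\n")
        paths.append(path)
    return paths
```

Calling `write_numbered_files(["first", "second"])` would create file_01.txt and file_02.txt in the current directory. Using `enumerate` instead of a `while True` loop means the loop ends when the input does.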

Furthermore, I would advise against your "preprocessing" if all you want to do is extract the "snippet" property of the JSON objects.
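One way to skip the preprocessing is to parse the original response with the standard `json` module and write one file per entry of its "items" list. This is a rough sketch, assuming Python 3 and assuming the input file is complete, valid JSON (the excerpt in the question is cut off). The function name `split_items` and the output name pattern `item_{:02d}.json` are placeholders chosen for this example.

```python
import json
import os


def split_items(json_path, out_dir=".", keys=("title", "snippet")):
    """Write the selected keys of each entry in "items" to its own JSON file.

    Hypothetical helper: returns the list of paths that were written.
    """
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)

    paths = []
    for fileno, item in enumerate(data.get("items", []), start=1):
        # Keep only the requested fields; keys missing from an item are skipped
        record = {key: item[key] for key in keys if key in item}
        path = os.path.join(out_dir, "item_{:02d}.json".format(fileno))
        with open(path, "w", encoding="utf-8") as out:
            json.dump(record, out, ensure_ascii=False, indent=1)
        paths.append(path)
    return paths
```

Because each output is written with `json.dump`, the files stay valid JSON and can be read back with `json.load` for the text comparison, instead of containing Python `repr` output such as `{u'snippet': ...}`.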
