Splitting JSON data into multiple files using Python

Question

I have a file that I retrieved from the internet. It contains JSON-formatted data.

I am trying to split this file into smaller parts.

For example:

Original file:

{
 "kind": "customsearch#search",
 "url": {
  "type": "application/json",
  "template": "https://www.googleapis.com/customsearch/v1?q={searchTerms}&num={count?}&start={startIndex?}&lr={language?}&safe={safe?}&cx={cx?}&cref={cref?}&sort={sort?}&filter={filter?}&gl={gl?}&cr={cr?}&googlehost={googleHost?}&c2coff={disableCnTwTranslation?}&hq={hq?}&hl={hl?}&siteSearch={siteSearch?}&siteSearchFilter={siteSearchFilter?}&exactTerms={exactTerms?}&excludeTerms={excludeTerms?}&linkSite={linkSite?}&orTerms={orTerms?}&relatedSite={relatedSite?}&dateRestrict={dateRestrict?}&lowRange={lowRange?}&highRange={highRange?}&searchType={searchType}&fileType={fileType?}&rights={rights?}&imgSize={imgSize?}&imgType={imgType?}&imgColorType={imgColorType?}&imgDominantColor={imgDominantColor?}&alt=json"
 },
 "queries": {
  "nextPage": [
   {
    "title": "Google Custom Search - pagerank",
    "totalResults": "14700000",
    "searchTerms": "pagerank",
    "count": 10,
    "startIndex": 11,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ],
  "request": [
   {
    "title": "Google Custom Search - pagerank",
    "totalResults": "14700000",
    "searchTerms": "pagerank",
    "count": 10,
    "startIndex": 1,
    "inputEncoding": "utf8",
    "outputEncoding": "utf8",
    "safe": "off",
    "cx": "017576662512468239146:omuauf_lfve"
   }
  ]
 },
 "context": {
  "title": "CS Curriculum",
  "facets": [
   [
    {
     "label": "lectures",
     "anchor": "Lectures",
     "label_with_op": "more:lectures"
    }
   ],
   [
    {
     "label": "assignments",
     "anchor": "Assignments",
     "label_with_op": "more:assignments"
    }
   ],
   [
    {
     "label": "reference",
     "anchor": "Reference",
     "label_with_op": "more:reference"
    }
   ]
  ]
 },
 "searchInformation": {
  "searchTime": 0.239874,
  "formattedSearchTime": "0.24",
  "totalResults": "14700000",
  "formattedTotalResults": "14,700,000"
 },
 "items": [
  {
   "kind": "customsearch#result",
   "title": "Lecture slides on PageRank",
   "htmlTitle": "Lecture slides on \u003cb\u003ePageRank\u003c/b\u003e",
   "link": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
   "displayLink": "www.cs.utexas.edu",
   "snippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.",
   "htmlSnippet": "Distributed Computing Seminar. Lecture 5: Graph Algorithms &amp; \u003cb\u003ePageRank\u003c/b\u003e. \u003cbr\u003e\nChristophe Bisciglia, Aaron Kimball, &amp; Sierra Michels-Slettvet. Summer 2007.",
   "cacheId": "CwgPK6hTEZQJ",
   "mime": "application/vnd.ms-powerpoint",
   "fileFormat": "Microsoft Powerpoint",
   "formattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-pagerank.ppt",
   "htmlFormattedUrl": "https://www.cs.utexas.edu/users/novak/lec5-\u003cb\u003epagerank\u003c/b\u003e.ppt",
   "pagemap": {
    "metatags": [
     {
      "author": "jhebert",
      "last saved by": "Google"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "The PageRank Citation Ranking: Bringing Order to the Web January ...",
   "htmlTitle": "The \u003cb\u003ePageRank\u003c/b\u003e Citation Ranking: Bringing Order to the Web January \u003cb\u003e...\u003c/b\u003e",
   "link": "http://www.cis.upenn.edu/~mkearns/teaching/NetworkedLife/pagerank.pdf",
   "displayLink": "www.cis.upenn.edu",
   "snippet": "Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \nThis ranking, called PageRank, helps search engines and.",
   "htmlSnippet": "Jan 29, 1998 \u003cb\u003e...\u003c/b\u003e We compare \u003cb\u003ePageRank\u003c/b\u003e to an idealized random Web surfer. We show how to ... \u003cbr\u003e\nThis ranking, called \u003cb\u003ePageRank\u003c/b\u003e, helps search engines and.",
   "cacheId": "akmuPYNhiKMJ",
   "mime": "application/pdf",
   "fileFormat": "PDF/Adobe Acrobat",
   "formattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../pagerank.pdf",
   "htmlFormattedUrl": "www.cis.upenn.edu/~mkearns/teaching/.../\u003cb\u003epagerank\u003c/b\u003e.pdf",
   "pagemap": {
    "cse_image": [
     {
      "src": "x-raw-image:///9a2d934c7c41f83c4c97c3fb9a4cb4cc8fbcb453aaf1002ed6f970005773aa0e"
     }
    ],
    "cse_thumbnail": [
     {
      "width": "262",
      "height": "193",
      "src": "https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcQcCouA-BJlMWA0HZNMSxsXzbqIZzgu6tXXRqiuse2sttpJaNK2b0cNbm4"
     }
    ],
    "metatags": [
     {
      "producer": "AFPL Ghostscript 7.0",
      "creator": "dvipsk 5.58f Copyright 1986, 1994 Radical Eye Software",
      "title": "prpaperdraft.dvi"
     }
    ]
   }
  },
  {
   "kind": "customsearch#result",
   "title": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...",
   "htmlTitle": "MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES \u003cb\u003e...\u003c/b\u003e",
   "link": "http://stanford.edu/class/math51/PageRank.pdf",
   "displayLink": "stanford.edu",
   "snippet": "Google's method1 is called the PageRank algorithm and was developed by \nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
   "htmlSnippet": "Google&#39;s method1 is called the \u003cb\u003ePageRank\u003c/b\u003e algorithm and was developed by \u003cbr\u003e\nGoogle founders Sergey Brin and Larry Page while they were graduate students.",
   "cacheId": "RKV6ZEmHrjUJ",
   "mime": "application/pdf",
   "fileFormat": "PDF/Adobe Acrobat",
   "formattedUrl": "stanford.edu/class/math51/PageRank.pdf",
   "htmlFormattedUrl": "stanford.edu/class/math51/\u003cb\u003ePageRank\u003c/b\u003e.pdf",
   "pagemap": {
    "metatags": [
     {
      "producer": "pdfTeX-1.40.13",
      "creator": "TeX",
      "creationdate": "D:20130604152429-07'00'",
      "moddate": "D:20130604152429-07'00'",
      "fullbanner": "This is pdfTeX, Version 3.1415926-2.4-1.40.13 (TeX Live 2012) kpathsea version 6.1.0"
     }
    ]
   }
  },

File after processing:

{u'snippet': u'Distributed Computing Seminar. Lecture 5: Graph Algorithms & PageRank. \\nChristophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet. Summer 2007.', u'title': u'Lecture slides on PageRank'} {u'snippet': u'Jan 29, 1998 ... We compare PageRank to an idealized random Web surfer. We show how to ... \\nThis ranking, called PageRank, helps search engines and.', u'title': u'The PageRank Citation Ranking: Bringing Order to the Web January ...'} {u'snippet': u"Google's method1 is called the PageRank algorithm and was developed by \\nGoogle founders Sergey Brin and Larry Page while they were graduate students.", u'title': u'MATH 51 LECTURE NOTES: HOW GOOGLE RANKS WEB PAGES ...'}

I want to split this into three separate files (.txt or .json), each beginning with {u'snippet' ...}.

I am trying to do this in order to run a text comparison process on the results.

PS: I have trimmed the file down to the only parts that I need, which are the title and snippet. I might therefore have lost the JSON formatting in the process.

Answer 1

Since you seem to be able to find the relevant parts of your input, you should be able to write each of them to a separate file. I assume that you have some kind of loop in which you find the relevant data:

fileno = 1
while True:  # or whatever you use to loop over your input
    # parse the input
    # ...
    # store the 'snippet' part in a variable, here called snippet_var
    with open('file_{:02d}.txt'.format(fileno), 'w') as f:
        f.write(snippet_var + "\n")
    fileno += 1  # advance the counter once the file has been written

This will give you numbered files with zero-padded names, starting from 1: file_01.txt, file_02.txt, and so on. If my assumptions are wrong, please update your question to show how you are currently doing things.
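For reference, here is a self-contained sketch of the same numbering idea. It assumes the snippets have already been collected into a list; the helper name `write_numbered_files`, the sample strings, and the temporary output directory are all made up for illustration and are not from the question.

```python
import os
import tempfile


def write_numbered_files(snippets, directory):
    """Write each snippet to its own file: file_01.txt, file_02.txt, ..."""
    paths = []
    # enumerate(..., start=1) replaces the manual fileno counter.
    for fileno, snippet in enumerate(snippets, start=1):
        path = os.path.join(directory, 'file_{:02d}.txt'.format(fileno))
        with open(path, 'w') as f:
            f.write(snippet + "\n")
        paths.append(path)
    return paths


# Hypothetical sample data standing in for the parsed input.
sample = ["first snippet", "second snippet", "third snippet"]
out_dir = tempfile.mkdtemp()
paths = write_numbered_files(sample, out_dir)
print([os.path.basename(p) for p in paths])
```

Using `enumerate` keeps the file number and the data in step, so the counter cannot be forgotten or incremented twice.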

Furthermore, I would advise against your "preprocessing" step if all you want to do is extract the "snippet" property of the JSON objects.
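As a rough sketch of what skipping the preprocessing could look like: load the original response with the standard `json` module and write the wanted keys of each entry in "items" to its own file. The helper name `split_items`, the `item_NN.json` file names, and the shortened sample response are assumptions made for this example only.

```python
import json
import os
import tempfile


def split_items(json_text, directory, keys=('title', 'snippet')):
    """Write the chosen keys of each entry in "items" to its own JSON file."""
    data = json.loads(json_text)
    paths = []
    for fileno, item in enumerate(data.get('items', []), start=1):
        # Keep only the requested keys; skip any that an item lacks.
        record = {key: item[key] for key in keys if key in item}
        path = os.path.join(directory, 'item_{:02d}.json'.format(fileno))
        with open(path, 'w') as f:
            json.dump(record, f, indent=1)
        paths.append(path)
    return paths


# Shortened stand-in for the search response shown in the question.
sample = json.dumps({
    "kind": "customsearch#search",
    "items": [
        {"title": "Lecture slides on PageRank",
         "snippet": "Distributed Computing Seminar.",
         "cacheId": "CwgPK6hTEZQJ"},
        {"title": "MATH 51 LECTURE NOTES",
         "snippet": "Google's method1 is called the PageRank algorithm"},
    ],
})
paths = split_items(sample, tempfile.mkdtemp())
with open(paths[0]) as f:
    print(json.load(f))
```

Because each output file is written with `json.dump`, it stays valid JSON and can be read back with `json.load` for the comparison step, which the `{u'snippet': ...}` text form cannot.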

Answer 1
0 2015-01-11 21:39:05
