Converting HTML to JSON with pandoc

I'm trying to take HTML and generate JSON that keeps the same structure.

I'm trying to use pandoc, as I've had some success converting documents from format A to format B with it before.

I'm trying to convert this file:

example.html

<p>Hello guys! What's up?</p>

Using the command:

pandoc -f html -t json example.html

What I expect is something like:

[{ "p": "Hello guys! What's up?"}]

What I get is:

[
  { "Para":
    [
      {"t": "Str", "c": "Hello"},
      {"t": "Space"},
      {"t": "Str", "c": "guys!"},
      {"t": "Space"},
      {"t": "Str", "c": "What's"},
      {"t": "Space"},
      {"t": "Str", "c": "up?"}
    ]
  }
]

The problem seems to be that when pandoc reads the text content, it splits it on the space character and produces an array with one element per word and per space, while I expected pandoc to treat the whole string as a single element.

I'm a beginner with pandoc and I haven't been able to find out how to change that behavior.

Do you have an idea of how I can get the desired output? Do you know of another tool that can do this? The tool, and the language it's written in, don't matter.

Thanks.

Edit: You can test this behavior with the pandoc online tool.

Edit 2: Workaround. I couldn't find out how to do the HTML -> JSON conversion I wanted with pandoc. As a workaround, I followed the suggestion made in the comments and implemented a solution using Himalaya, which is a Node package. The result is exactly what I wished for, even though it doesn't use pandoc.

Currently, the pandoc JSON representation is not very human-readable; it is auto-generated from the Haskell pandoc data types (aka the document AST). There is some discussion about changing that eventually.

I guess you're looking for something like https://codebeautify.org/xmltojson? There also seem to be plenty of command-line tools that do that.
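To make the XML-to-JSON idea concrete, here is a minimal sketch using only Python's standard library (`xml.etree.ElementTree`). It is not one of the tools mentioned above; the function names `element_to_obj` and `markup_to_json` are made up for illustration, and it assumes the input is well-formed markup (real-world HTML would need a proper HTML parser).

```python
import json
import xml.etree.ElementTree as ET


def element_to_obj(el):
    """Map an element to {tag: text} or {tag: [child objects]}."""
    children = list(el)
    if not children:
        return {el.tag: (el.text or "").strip()}
    # Simplification: text mixed in between child elements is dropped.
    return {el.tag: [element_to_obj(child) for child in children]}


def markup_to_json(markup):
    """Convert a well-formed markup fragment to a list of nested dicts."""
    # Wrap in a synthetic root so several top-level elements are allowed.
    root = ET.fromstring(f"<root>{markup}</root>")
    return [element_to_obj(child) for child in root]


print(json.dumps(markup_to_json("<p>Hello guys! What's up?</p>")))
# prints: [{"p": "Hello guys! What's up?"}]
```

Attributes and mixed content are ignored here on purpose to keep the sketch short; a dedicated converter would have to decide how to represent them.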

Pandoc is a tool for converting documents. The JSON representation of a document is just another format in which pandoc can read and write its AST (Abstract Syntax Tree):

Original Document --> Pandoc's AST --> Output Document
                   |                |
                pandoc           pandoc

Asking pandoc to output JSON is asking for the AST in its JSON format.
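Since that JSON is just the AST, the word-by-word list from the question can be collapsed back into one string after the fact. The following is a rough sketch, assuming inline nodes shaped like the `{"t": "Str", "c": ...}` and `{"t": "Space"}` objects shown in the question; the helper name `inlines_to_text` is hypothetical, and the wrapper around the inline list differs between pandoc versions, so locating that list is left out.

```python
import json


def inlines_to_text(inlines):
    """Join pandoc inline nodes into a single plain string."""
    parts = []
    for node in inlines:
        kind = node.get("t")
        if kind == "Str":
            parts.append(node["c"])
        elif kind in ("Space", "SoftBreak"):
            parts.append(" ")
        # Other inline types (Emph, Link, ...) are skipped in this sketch.
    return "".join(parts)


# The inline list from the question's output.
para = [
    {"t": "Str", "c": "Hello"},
    {"t": "Space"},
    {"t": "Str", "c": "guys!"},
    {"t": "Space"},
    {"t": "Str", "c": "What's"},
    {"t": "Space"},
    {"t": "Str", "c": "up?"},
]

print(json.dumps([{"p": inlines_to_text(para)}]))
# prints: [{"p": "Hello guys! What's up?"}]
```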

If I understand correctly, you need something more like an XML-to-JSON converter, such as the Python xmljson module or an online converter.

There are plenty of tools for the job as you picture it; just search for "XML to JSON converter".

The JSON representation of the AST is normally used to pipe pandoc's output into another program that can handle JSON, so you can alter the AST and write filters that manipulate the structure of your document.
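As an illustration of that filter idea, here is a minimal sketch of a program that reads the JSON AST, uppercases every `Str` node, and writes the JSON back out. The names `walk`, `shout`, and `run_filter` are made up for this example, and the `pandoc-api-version` / `meta` / `blocks` wrapper in the demo document is an assumption based on recent pandoc versions; the traversal itself only relies on the `{"t": ..., "c": ...}` node shape.

```python
import io
import json


def walk(node, fn):
    """Apply fn to every dict in the tree, rebuilding the structure."""
    if isinstance(node, list):
        return [walk(item, fn) for item in node]
    if isinstance(node, dict):
        node = fn(node)
        return {key: walk(value, fn) for key, value in node.items()}
    return node


def shout(node):
    """Uppercase the text of Str nodes; leave everything else alone."""
    if node.get("t") == "Str":
        return {"t": "Str", "c": node["c"].upper()}
    return node


def run_filter(in_stream, out_stream):
    """Read a JSON AST from in_stream, transform it, write it to out_stream."""
    doc = json.load(in_stream)
    json.dump(walk(doc, shout), out_stream)


# Demo on an in-memory document (wrapper shape is an assumption).
doc = {
    "pandoc-api-version": [1, 22],
    "meta": {},
    "blocks": [{"t": "Para", "c": [
        {"t": "Str", "c": "Hello"}, {"t": "Space"}, {"t": "Str", "c": "guys!"},
    ]}],
}
out = io.StringIO()
run_filter(io.StringIO(json.dumps(doc)), out)
print(out.getvalue())
```

In a real pipeline the function would be called as `run_filter(sys.stdin, sys.stdout)` from a script (say, a hypothetical `upper.py`) placed between two pandoc invocations: `pandoc -f html -t json example.html | python upper.py | pandoc -f json -t html`.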
