解析 HTML 并从标签的字符串中提取数据（JSON 格式）

Question

From a Html file a tag was extracted从 Html 文件中提取了一个标签

from bs4 import BeautifulSoup
html=open("centrimo.html")
parsed_html=BeautifulSoup(html)
script_data=parsed_cen.script

Now from the string contained in the script tag I would like to extract the information in the variables "sequences", "neg_sequences", "seqs" and "nseqs".现在，我想从脚本标签中包含的字符串中提取变量“sequences”、“neg_sequences”、“seqs”和“nseqs”中的信息。

    <script type="text/javascript">
  //@JSON_VAR data
  var data = {
    "version": "5.3.3",
    "revision": "1667d7719daf2af1693cc039ba463bc4d2304d23",
    "release": "Sun Feb 7 15:39:52 2021 -0800",
    "program": "CentriMo",
    "options": {
      "cd": false,
      "neg_sequences": true,
      "noseq": false,
      "mcc": false
    },
    "seqlen": 101,
    "tested": 435,
    "alphabet": {
      "name": "DNA",
      "like": "dna",
      "ncore": 4
    },
    "background": [0.2788, 0.2212, 0.2212, 0.2788],
    "sequences": [
      "AT1G04100.1_CDS", "AT1G05860.1_CDS", "AT1G13910.1_CDS",
      "AT1G21065.1_CDS", "AT1G26190.1_CDS", "AT1G32940.1_CDS",
      "AT1G50575.1_CDS", "AT1G55810.1_CDS", "AT1G66430.1_CDS",
      "AT1G71430.1_CDS", "AT1G77170.1_CDS", "AT1G78610.1_CDS",
      "AT2G02955.1_CDS", "AT2G16280.1_CDS", "AT2G17080.1_CDS",
      "AT2G19620.1_CDS", "AT2G19640.1_CDS", "AT2G30840.1_CDS",
      "AT2G39450.1_CDS", "AT2G41380.1_CDS", "AT2G42580.1_CDS",
      "AT3G01680.1_CDS", "AT3G05680.1_CDS", "AT3G20110.1_CDS",
      "AT3G20260.1_CDS", "AT3G21360.1_CDS", "AT3G23070.1_CDS",
      "AT3G23590.1_CDS", "AT3G46820.1_CDS", "AT3G48250.1_CDS",
      "AT3G61200.1_CDS", "AT4G08510.1_CDS", "AT4G15070.1_CDS",
      "AT4G24670.1_CDS", "AT4G25450.1_CDS", "AT4G28600.1_CDS",
      "AT4G31910.1_CDS", "AT4G34810.1_CDS", "AT4G35030.3_CDS",
      "AT4G37170.1_CDS", "AT4G38630.1_CDS", "AT4G39720.1_CDS",
      "AT5G07340.1_CDS", "AT5G12970.1_CDS", "AT5G13470.1_CDS",
      "AT5G18950.1_CDS", "AT5G22840.1_CDS", "AT5G25590.1_CDS",
      "AT5G27395.1_CDS", "AT5G53370.1_CDS", "AT5G63610.1_CDS",
      "AT5G64830.1_CDS", "AT5G64900.1_CDS", "AT5G67620.1_CDS"
    ],
    "neg_sequences": [
      "AT1G01600.1_CDS", "AT2G32480.1_CDS", "AT2G41740.1_CDS",
      "AT3G19490.1_CDS", "AT3G24030.1_CDS", "AT3G25580.1_CDS",
      "AT3G48330.1_CDS", "AT3G59220.1_CDS", "AT4G13340.1_CDS",
      "AT4G33590.1_CDS", "AT5G03080.1_CDS", "AT5G23700.1_CDS",
      "AT5G41010.1_CDS"
    ],
    "motifs": [
      {
        "db": 2,
        "id": "ath-miR419",
        "alt": "MIMAT0001327",
        "consensus": "CAACATCCTCAGCATTCATAA",
        "len": 21,
        "motif_evalue": "0.0e+000",
        "motif_nsites": 20,
        "n_tested": 40,
        "score_threshold": 5,
        "url": "http://www.mirbase.org/cgi-bin/mature.pl?mature_acc=MIMAT0001327",
        "pwm": [
          [0.164036, 0.479478, 0.163749, 0.192738],  
          [0.479764, 0.163749, 0.192452, 0.164036]
        ],
        "total_sites": 10,
        "sites": [
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        ],
        "neg_total_sites": 2,
        "neg_sites": [
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
          0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
        ],
        "seqs": [0, 1, 13, 15, 16, 23, 27, 36, 44, 48],
        "neg_seqs": [3, 10],
        "peaks": [
          {
            "center": 0,
            "fisher_log_adj_pvalue": 0
          }
        ]
      }
    ]
  };
</script>

I tried to convert the object into a json type object but I got the following error,我试图将对象转换为 json 类型的对象，但出现以下错误，

import json
j_script = json.loads(script_data.string)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.7/json/__init__.py", line 348, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.7/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.7/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 2 column 7 (char 7)

Thanks in advance提前致谢

PS.附注。 an example of a complete html file that I would like to parse can be found ( here )可以找到我想要解析的完整 html 文件的示例（此处）

Edit: In the original post I mentioned that I got an indentation error.编辑：在原帖中，我提到我遇到了缩进错误。 That happened after trying to manually edit the json object by removing all the white spaces, "\\n" characters.这是在尝试通过删除所有空格“\\n”字符来手动编辑 json 对象之后发生的。 Although I don't think it fundamentally changes the question, I apologize for the mistake虽然我认为它不会从根本上改变问题，但我为错误道歉

[UPDATE] I was able to adapt the answer in this post as follows [更新] 我能够调整这篇文章中的答案如下

tmp=script_data.string.partition('=')
j_tmp=tmp[2].replace(";\n    ","")  
j_script=json.loads(j_tmp)

The second line is a bit clumsy (I couldn't adapt the answer in this other post ) but overall it does the trick.第二行是一个有点笨拙（我无法适应这个其他的答案后），但总体而言，它的伎俩。 Now I'm trying to obtain the 'seqs' data which is contained in the "motifs" list.现在我试图获取包含在“motifs”列表中的“seqs”数据。

Help with the second line in the code above will be much appreciated对上面代码中第二行的帮助将不胜感激

Answer 1

Below is the best answer that I could come up with.以下是我能想到的最佳答案。

from bs4 import BeautifulSoup
import json
html=open("<HTML File>")
parsed_html=BeautifulSoup(html)
script_data=parsed_html.script
tmp=script_data.string.partition('=')
 
#Replace the last part of the tupple (tmp)
j_tmp=tmp[2].replace(";\n    ","")  
 
j_script=json.loads(j_tmp)
 
seq_id=j_script["sequences"]
 
#Loop over the dictionaries looking for "seqs"
for idict in j_script["motifs"]:
    if "seqs" in idict:
        dict_tmp=idict
      
seq_n=dict_tmp["seqs"]

The code still needs improvement but I think it can do the job now.代码仍然需要改进，但我认为它现在可以完成这项工作。

解析 HTML 并从标签的字符串中提取数据（JSON 格式）

问题描述

1 个解决方案

解决方案1
0 2021-07-15 15:22:45

解析 HTML 并从标签的字符串中提取数据（JSON 格式）

问题描述

1 个解决方案

解决方案1 0 2021-07-15 15:22:45

解决方案1
0 2021-07-15 15:22:45