
How to parse a JSON var inside a script tag

I am trying to scrape the data shown in the scatter plot at https://www.proteinatlas.org/ENSG00000167286-CD3D/pathology/tissue/renal+cancer

The JavaScript is:

'<script> var plot = $('#scatter6001').scatterPlot({"Alive (n=651)":{"symbol":"circle","data":[{"x":0.407889650408,"y":12.811,"tooltip":"TCGA-KL-8324-01A<br>Female\/ Stage ii \/ Alive<br>FPKM: 0.4<br>Living days: 4676 (12.8 years)","class":"stage_ii sex_f best_low median_low"},{"x":0.587835812523,"y":8.0795,"tooltip":"TCGA-KL-8334-01A<br>Female\/ Stage iii \/ Alive<br>FPKM: 0.6<br>Living days: 2949 (8.1years)","class":"stage_iii sex_f best_low median_low"}'...});

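My question is how to parse the information in each tooltip: the TCGA-XX-XXXX-XXX sample ID, sex, stage, alive or dead status, FPKM and living days. And how can I save this information to a CSV file?

Here is the code I have so far.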

page = urlopen("https://www.proteinatlas.org/ENSG00000167286-CD3D/pathology/tissue/prostate+cancer#imid_3605750")
content = page.read()
soup = BeautifulSoup(content,'lxml')

table = soup.find('div', {'id':'scatter6001'})
print(table)

p = re.search(r"var plot = (.*?);",soup).group(1)
print(p)

The code fails with this error:

Traceback (most recent call last):
  File "scrap2.py", line 24, in <module>
    p = re.search(r"var plot = (.*?);",soup).group(1)
  File "C:\Python34\lib\re.py", line 170, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

How can I fix this and scrape the data I want?

Thanks.
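
The TypeError in the traceback comes from passing the BeautifulSoup object itself to `re.search`, which only accepts a string (for example `str(soup)` or the text of the relevant `<script>` element). Beyond that, the non-greedy pattern `(.*?);` would stop at the first semicolon rather than at the end of the data. The following is a minimal sketch of one alternative, assuming the argument to `scatterPlot(...)` is strict JSON as it appears in the question. `SCRIPT_TEXT` is an abbreviated two-point sample with the elided tail closed by hand, and `extract_plot_data` is a hypothetical helper name, not something from the page or from any library.

```python
import json
import re

# Abbreviated sample of the inline script from the question (two points only;
# the closing brackets are added by hand because the original is truncated).
SCRIPT_TEXT = (
    r"""var plot = $('#scatter6001').scatterPlot({"Alive (n=651)":{"symbol":"circle","data":["""
    r"""{"x":0.407889650408,"y":12.811,"tooltip":"TCGA-KL-8324-01A<br>Female\/ Stage ii \/ Alive"""
    r"""<br>FPKM: 0.4<br>Living days: 4676 (12.8 years)","class":"stage_ii sex_f best_low median_low"},"""
    r"""{"x":0.587835812523,"y":8.0795,"tooltip":"TCGA-KL-8334-01A<br>Female\/ Stage iii \/ Alive"""
    r"""<br>FPKM: 0.6<br>Living days: 2949 (8.1years)","class":"stage_iii sex_f best_low median_low"}"""
    r"""]}});"""
)


def extract_plot_data(script_text):
    """Return the object literal passed to scatterPlot(...) as a Python dict."""
    # re.search needs a str here, which is what the original code got wrong.
    match = re.search(r"scatterPlot\(\s*", script_text)
    if match is None:
        raise ValueError("scatterPlot( call not found")
    # raw_decode parses one JSON value starting at the given index and ignores
    # whatever follows it, so the trailing ");" does not need to be trimmed.
    data, _end = json.JSONDecoder().raw_decode(script_text, match.end())
    return data


plot_data = extract_plot_data(SCRIPT_TEXT)
for group, series in plot_data.items():
    for point in series["data"]:
        print(group, "|", point["tooltip"])
```

Decoding with `json` also turns the `\/` escapes into plain `/`, so no manual `replace` calls are needed for them. If the real page passes something that is not strict JSON (single quotes, trailing commas), this sketch would raise `json.JSONDecodeError` and a different parser would be needed.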

The following is definitely not a good approach to follow, but it will fetch the data you are after.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.proteinatlas.org/ENSG00000167286-CD3D/pathology/tissue/prostate+cancer#imid_3605750")
soup = BeautifulSoup(page.text,'lxml')
item_text = soup.select('#scatter6001 script')[0].text
# Skip the text before the first 'tooltip' key and loop over every remaining piece,
# so the number of data points does not have to be hard-coded.
for piece in item_text.split('tooltip')[1:]:
    item = piece.split("class")[0].replace('"','').replace(',','').replace(':','').replace("<br>"," ").replace("/","").replace("\\","")
    print(item)

Partial output:

TCGA-KK-A7B3-01A Male  Stage not reported  Alive FPKM 5.5 Living days 899 (2.5 years)
TCGA-G9-6347-01A Male  Stage not reported  Alive FPKM 14.2 Living days 2089 (5.7 years)
TCGA-KC-A4BL-01A Male  Stage not reported  Alive FPKM 3.8 Living days 934 (2.6 years)
TCGA-KK-A7AQ-01A Male  Stage not reported  Alive FPKM 2.6 Living days 1610 (4.4 years)
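
The question also asks for the individual fields and a CSV file. The following is a rough sketch of that step, assuming each tooltip has the four `<br>`-separated parts seen in the question (sample ID, "sex / stage / status", FPKM, days) with the `\/` escapes already decoded to `/`. The names `parse_tooltip`, `write_rows`, `FIELDNAMES` and the column names are hypothetical, and the label of the last part for deceased patients is not shown in the source, so the code only takes the first integer after the colon.

```python
import csv
import io
import re

FIELDNAMES = ["sample_id", "sex", "stage", "status", "fpkm", "days"]


def parse_tooltip(tooltip):
    """Split one decoded tooltip string into a flat dict of fields."""
    sample_id, demographics, fpkm_part, days_part = tooltip.split("<br>")
    # "Female/ Stage ii / Alive" -> "Female", "Stage ii", "Alive"
    sex, stage, status = (field.strip() for field in demographics.split("/"))
    fpkm = float(fpkm_part.split(":", 1)[1])
    # First integer after the colon, e.g. 4676 in "Living days: 4676 (12.8 years)".
    days_match = re.search(r":\s*(\d+)", days_part)
    return {
        "sample_id": sample_id.strip(),
        "sex": sex,
        "stage": stage,
        "status": status,
        "fpkm": fpkm,
        "days": int(days_match.group(1)) if days_match else None,
    }


def write_rows(tooltips, fileobj):
    """Write one CSV row per tooltip to an open text file object."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    for tooltip in tooltips:
        writer.writerow(parse_tooltip(tooltip))


# Two tooltips copied from the question, after JSON decoding.
SAMPLE_TOOLTIPS = [
    "TCGA-KL-8324-01A<br>Female/ Stage ii / Alive<br>FPKM: 0.4<br>Living days: 4676 (12.8 years)",
    "TCGA-KL-8334-01A<br>Female/ Stage iii / Alive<br>FPKM: 0.6<br>Living days: 2949 (8.1years)",
]

# An in-memory buffer keeps the sketch free of side effects.
buffer = io.StringIO()
write_rows(SAMPLE_TOOLTIPS, buffer)
print(buffer.getvalue())
```

To write an actual file instead of the in-memory buffer, something like `with open("tooltips.csv", "w", newline="") as f: write_rows(tooltips, f)` should work, where the file name is only an example. A tooltip that does not have exactly four parts would raise a `ValueError` in `parse_tooltip`, which is probably preferable to silently writing misaligned columns.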

