I'm trying to scrape the data shown in the scatter plot in https://www.proteinatlas.org/ENSG00000167286-CD3D/pathology/tissue/renal+cancer
The JavaScript that holds the data is:
'<script> var plot = $('#scatter6001').scatterPlot({"Alive (n=651)":{"symbol":"circle","data":[{"x":0.407889650408,"y":12.811,"tooltip":"TCGA-KL-8324-01A<br>Female\/ Stage ii \/ Alive<br>FPKM: 0.4<br>Living days: 4676 (12.8 years)","class":"stage_ii sex_f best_low median_low"},{"x":0.587835812523,"y":8.0795,"tooltip":"TCGA-KL-8334-01A<br>Female\/ Stage iii \/ Alive<br>FPKM: 0.6<br>Living days: 2949 (8.1years)","class":"stage_iii sex_f best_low median_low"}'...});
My question is how to parse the sample ID (TCGA-XX-XXXX-XXX), gender, stage, alive or dead status, FPKM, and living days out of this script, and how to save that information to a CSV file.
This is the code I have so far:
page = urlopen("https://www.proteinatlas.org/ENSG00000167286-CD3D/pathology/tissue/prostate+cancer#imid_3605750")
content = page.read()
soup = BeautifulSoup(content,'lxml')
table = soup.find('div', {'id':'scatter6001'})
print(table)
p = re.search(r"var plot = (.*?);",soup).group(1)
print(p)
The code raises this error:
Traceback (most recent call last):
  File "scrap2.py", line 24, in <module>
    p = re.search(r"var plot = (.*?);",soup).group(1)
  File "C:\Python34\lib\re.py", line 170, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
How can I solve this problem and scrape the data I want?
Thanks
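The TypeError occurs because re.search expects a string, but soup is a BeautifulSoup object. The approach below is definitely not an elegant one, since it relies on chained string replacements, but it will fetch the data you are after.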
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.proteinatlas.org/ENSG00000167286-CD3D/pathology/tissue/prostate+cancer#imid_3605750")
soup = BeautifulSoup(page.text,'lxml')
item_text = soup.select('#scatter6001 script')[0].text
# Every chunk after the first split begins with one tooltip string,
# so iterating over the chunks avoids guessing an upper bound for range()
for chunk in item_text.split('tooltip')[1:]:
    item = chunk.split("class")[0].replace('"','').replace(',','').replace(':','').replace("<br>"," ").replace("/","").replace("\\","")
    print(item)
Partial output:
TCGA-KK-A7B3-01A Male Stage not reported Alive FPKM 5.5 Living days 899 (2.5 years)
TCGA-G9-6347-01A Male Stage not reported Alive FPKM 14.2 Living days 2089 (5.7 years)
TCGA-KC-A4BL-01A Male Stage not reported Alive FPKM 3.8 Living days 934 (2.6 years)
TCGA-KK-A7AQ-01A Male Stage not reported Alive FPKM 2.6 Living days 1610 (4.4 years)
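To get separate fields and a CSV file rather than one flattened string per sample, one option is to decode the argument of scatterPlot(...) as JSON and split each tooltip. The following is only a sketch under some assumptions: that the first argument of scatterPlot is valid JSON, that every tooltip has the four "<br>"-separated parts seen in the question, and that the helper names (extract_plot_data, parse_tooltip, rows_from_script, write_csv) are made up for illustration. SAMPLE_SCRIPT is the snippet from the question, closed by hand so the example runs without a network request.

```python
import csv
import io
import json
import re

# Snippet from the question, closed by hand for illustration (the real script is longer).
SAMPLE_SCRIPT = r'''var plot = $('#scatter6001').scatterPlot({"Alive (n=651)":{"symbol":"circle","data":[{"x":0.407889650408,"y":12.811,"tooltip":"TCGA-KL-8324-01A<br>Female\/ Stage ii \/ Alive<br>FPKM: 0.4<br>Living days: 4676 (12.8 years)","class":"stage_ii sex_f best_low median_low"},{"x":0.587835812523,"y":8.0795,"tooltip":"TCGA-KL-8334-01A<br>Female\/ Stage iii \/ Alive<br>FPKM: 0.6<br>Living days: 2949 (8.1years)","class":"stage_iii sex_f best_low median_low"}]}});'''

FIELDS = ["sample", "sex", "stage", "status", "fpkm", "days"]


def extract_plot_data(script_text):
    """Decode the first JSON object passed to .scatterPlot( ... )."""
    marker = ".scatterPlot("
    start = script_text.index(marker) + len(marker)
    # raw_decode parses one JSON value starting at `start` and ignores what follows.
    data, _ = json.JSONDecoder().raw_decode(script_text, start)
    return data


def parse_tooltip(tooltip):
    """Split one tooltip into fields (assumes the four-part layout from the question)."""
    sample_id, details, fpkm, days = tooltip.split("<br>")
    sex, stage, status = [part.strip() for part in details.split("/")]
    return {
        "sample": sample_id,
        "sex": sex,
        "stage": stage,
        "status": status,
        "fpkm": fpkm.split(":", 1)[1].strip(),
        # First run of digits in the last part, e.g. 4676 in "Living days: 4676 (...)".
        "days": re.search(r"\d+", days).group(),
    }


def rows_from_script(script_text):
    """Collect one row per data point across all groups in the plot."""
    rows = []
    for group in extract_plot_data(script_text).values():
        for point in group["data"]:
            rows.append(parse_tooltip(point["tooltip"]))
    return rows


def write_csv(rows, handle):
    """Write the rows to any file-like object opened in text mode."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = rows_from_script(SAMPLE_SCRIPT)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

With the live page, the same functions could presumably be fed item_text from the code above instead of SAMPLE_SCRIPT, and write_csv could be given a file opened with open("output.csv", "w", newline=""). If the site uses a different tooltip layout for some samples, parse_tooltip would need adjusting.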