Python Web Scraper IEEE

I am trying to retrieve the keywords of a particular IEEE document. I came across this code:

        ieee_content = requests.get(link, timeout=180)
        soup = BeautifulSoup(ieee_content.text, 'lxml')
        tag = soup.find_all('script')
        #metadata = "".join(re.findall('global.document.metadata=(.*)', tag[9].text)).replace(";", '').replace('global.document.metadata=', '')
        for i in tag[9]:
            metadata_format = re.compile(r'global.document.metadata=.*', re.MULTILINE)
            metadata = re.findall(metadata_format, i)
            if len(metadata) != 0:
               # convert the list 
               convert_to_json = json.dumps(metadata)
               x = json.loads(convert_to_json)
               s = x[0].replace("'", '"').replace(";", '')

The problem is that my metadata variable is always empty. I tried iterating over all the tags rather than using only tag[9], but metadata is still empty in every case. I also tried using 'xml' instead of 'lxml', but the result is the same. I'd appreciate some help with this.

import json
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

ieee_content = requests.get("https://ieeexplore.ieee.org/document/7845555", timeout=180)
soup = BeautifulSoup(ieee_content.content, "html.parser")
scripts = soup.find_all("script")

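# Capture the JSON array that follows "keywords": in an inline script.
pattern = re.compile(r"(?<=\"keywords\":)\[{.*?}\]")
keywords_dict = {}
for script in scripts:
    # script.string is None for script tags with no inline text (e.g. external src scripts);
    # str() keeps findall from raising on those.
    keywords = pattern.findall(str(script.string))
    if len(keywords) == 1:
        # Each entry holds a keyword category ("type") and its list of keywords ("kwd").
        raw_keywords_list = json.loads(keywords[0])
        for keyword_type in raw_keywords_list:
            keywords_dict[keyword_type["type"].strip()] = [kwd.strip() for kwd in keyword_type["kwd"]]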

pprint(keywords_dict)
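
To see what the lookbehind pattern does without hitting the network, here is a minimal offline sketch. The `sample_script` string and the `extract_keywords` helper are made up for illustration; the string only imitates the shape the code above expects (a `"keywords"` array of objects with `"type"` and `"kwd"` fields), and the real page content may differ. The braces are escaped here for readability, which should not change what the pattern matches.

```python
import json
import re

# Hypothetical stand-in for the text of one inline <script> tag.
sample_script = (
    'var metadata={"title":"Example",'
    '"keywords":[{"type":"IEEE Keywords","kwd":["Feature extraction "," Training"]},'
    '{"type":"Author Keywords ","kwd":["deep learning"]}],'
    '"isbn":[{"format":"Print","value":"0"}]};'
)

# The lookbehind anchors the match right after "keywords": so only the
# JSON array itself is captured; the lazy .*? stops at the first "}]".
pattern = re.compile(r'(?<="keywords":)\[\{.*?\}\]')


def extract_keywords(script_text):
    """Return {keyword type: [keywords]} or {} if no single match is found."""
    matches = pattern.findall(script_text)
    if len(matches) != 1:
        return {}
    return {
        group["type"].strip(): [kwd.strip() for kwd in group["kwd"]]
        for group in json.loads(matches[0])
    }


print(extract_keywords(sample_script))
```

If the real script text nests a `}]` inside a keyword entry, the lazy match would cut the array short and `json.loads` would raise, so this approach relies on the keyword objects staying as flat as they are in the sample.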
