Python Beautiful Soup scraping and parsing

Question

There is a JavaScript page I am attempting to scrape with BeautifulSoup:

bb2_addLoadEvent(function() {
    for ( i=0; i < document.forms.length; i++ ) {
        if (document.forms[i].method == 'post') {
            var myElement = document.createElement('input');
            myElement.setAttribute('type', 'hidden');
            myElement.name = 'bb2_screener_';
            myElement.value = '1568090530 122.44.202.205 122.44.202.205';
            document.forms[i].appendChild(myElement);
        }

I would like to obtain the value of "myElement.value", but I am not familiar with how to do so (if it is even possible with BeautifulSoup).

I've tried:

soup = BeautifulSoup(a.text, 'html.parser')
h = soup.find('type')  # also tried 'div', 'input', and even 'var'
print(soup)

and no luck :(

Is there a way of obtaining the value? If so, how?

Answer 1

It would help to know more about how myElement.value varies across different pages. You might get away with a simple character set and lead string, as shown in the regex below. I would like to tighten it up but would need more examples. Perhaps those number lengths are fixed and repeating? Then something like p = re.compile(r"myElement\.value = '(\d{10}(?:(\s\d{3}\.\d{2}\.\d{3}\.\d{3}){2}))';") and then take group 1.

import re

s = '''bb2_addLoadEvent(function() {
    for ( i=0; i < document.forms.length; i++ ) {
        if (document.forms[i].method == 'post') {
            var myElement = document.createElement('input');
            myElement.setAttribute('type', 'hidden');
            myElement.name = 'bb2_screener_';
            myElement.value = '1568090530 122.44.202.205 122.44.202.205';
            document.forms[i].appendChild(myElement);
        }'''

p = re.compile(r"myElement\.value = '([\d\s\.]+)';")
print(p.findall(s)[0])
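
To make the trade-off between the loose and the tightened pattern concrete, here is a minimal sketch. It assumes the value is always a 10-digit number followed by two addresses shaped exactly like the one sample in the question; the second input line (with a differently shaped address) is made up purely to show where the strict pattern stops matching.

```python
import re

# Sample line from the question, plus a hypothetical line whose
# addresses do not have the 3.2.3.3 digit layout.
sample = "myElement.value = '1568090530 122.44.202.205 122.44.202.205';"
other = "myElement.value = '1568090530 10.0.0.1 10.0.0.1';"

# Loose pattern: any run of digits, whitespace and dots.
loose = re.compile(r"myElement\.value = '([\d\s.]+)';")

# Strict pattern: 10 digits, then exactly two " ddd.dd.ddd.ddd" groups.
strict = re.compile(
    r"myElement\.value = '(\d{10}(?:\s\d{3}\.\d{2}\.\d{3}\.\d{3}){2})';"
)

strict_match = strict.search(sample)
value = strict_match.group(1) if strict_match else None

# The value splits cleanly into the leading number and the two addresses.
timestamp, *addresses = value.split()

# The strict pattern rejects the differently shaped line; the loose one does not.
strict_other = strict.search(other)
loose_other = loose.search(other).group(1)

print(value)
print(timestamp, addresses)
print(strict_other, loose_other)
```

The strict version only pays off if every page really follows that layout, which is why more examples would be needed before relying on it.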

@SIM also kindly proposed:

p = re.compile(r"value[^']+'([^']*)'")
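
On the BeautifulSoup part of the question: BeautifulSoup does not execute JavaScript, so the hidden input created by the script never exists in the parsed tree, which is why searching for 'input' finds nothing. One possible way to combine the two tools is to let BeautifulSoup locate the script tags and run the regex on their text. The sketch below assumes the snippet sits inside an inline script element; the surrounding HTML and the helper name extract_value are made up for illustration and stand in for a.text from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page source: the question only shows the script body,
# so the surrounding HTML here is an assumption.
html = """<html><head><script>
bb2_addLoadEvent(function() {
    for ( i=0; i < document.forms.length; i++ ) {
        if (document.forms[i].method == 'post') {
            var myElement = document.createElement('input');
            myElement.setAttribute('type', 'hidden');
            myElement.name = 'bb2_screener_';
            myElement.value = '1568090530 122.44.202.205 122.44.202.205';
            document.forms[i].appendChild(myElement);
        }
    }
});
</script></head><body><form method="post"></form></body></html>"""

pattern = re.compile(r"myElement\.value = '([^']*)'")


def extract_value(page_source):
    """Return the first myElement.value literal found in any inline script."""
    soup = BeautifulSoup(page_source, "html.parser")
    for script in soup.find_all("script"):
        # script.string is None for empty or external scripts.
        match = pattern.search(script.string or "")
        if match:
            return match.group(1)
    return None


value = extract_value(html)
print(value)
```

Restricting the search to script tags mainly guards against the same text appearing elsewhere in the page; running the regex on the whole response text works too.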

Answer 2

If myElement.value = is static, this can be achieved with a simple regular expression:

value = re.compile(r"myElement\.value = '([^']+)'").search(text).group(1)  # text holds the page source

This matches myElement.value = ' , followed by non- ' characters, followed by another ' , where all the non- ' characters are captured in a group. Then group(1) extracts the group from the match.

If the string may contain escaped ' characters as well, e.g.:

myElement.value = 'foo \' bar';

then alternate \\. with [^'] :

myElement\.value = '((?:\\.|[^'])+)'

https://regex101.com/r/Tdarel/1
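
As a quick check of the difference between the two patterns, the following sketch runs both against the escaped-quote example from above. The variable names are illustrative only.

```python
import re

# The escaped-quote example, written as a raw string so the backslash is kept.
js = r"myElement.value = 'foo \' bar';"

simple = re.compile(r"myElement\.value = '([^']+)'")
escaped = re.compile(r"myElement\.value = '((?:\\.|[^'])+)'")

# The simple pattern stops at the first quote, even though it is escaped.
simple_value = simple.search(js).group(1)

# The alternation consumes a backslash together with the character after it,
# so the escaped quote stays inside the captured group.
escaped_value = escaped.search(js).group(1)

print(repr(simple_value))
print(repr(escaped_value))
```

Note that the captured text still contains the backslash; removing the escape is a separate step if the unescaped value is needed.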

Answer 1
2 accepted 2019-09-10 06:28:52

Answer 2
0 2019-09-10 06:28:06
