Python Beautiful Soup Scraping & Parsing
There is a JavaScript page I am attempting to scrape with BeautifulSoup:
bb2_addLoadEvent(function() {
for ( i=0; i < document.forms.length; i++ ) {
if (document.forms[i].method == 'post') {
var myElement = document.createElement('input');
myElement.setAttribute('type', 'hidden');
myElement.name = 'bb2_screener_';
myElement.value = '1568090530 122.44.202.205 122.44.202.205';
document.forms[i].appendChild(myElement);
}
I would like to obtain the value of "myElement.value", but I am not familiar with how to do so (if it is even possible with BeautifulSoup).
I've tried:
soup = BeautifulSoup(a.text, 'html.parser')
h = soup.find('type') ...('div') ... ('input') ... even ('var')
print(soup)
and no luck :(
Is there a way of obtaining the value? If so, how?
It would help to know more about how myElement.value varies across different pages. You might get away with a simple character set and lead string, as shown in the regex further down. I would like to tighten it up but would need more examples. Perhaps those number lengths are fixed and repeating? Then something like
p = re.compile(r"myElement\.value = '(\d{10}(?:(\s\d{3}\.\d{2}\.\d{3}\.\d{3}){2}))';")
and then take group 1.
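To make the trade-off concrete, here is a minimal sketch of the strict pattern applied to the one sample value from the question. The second input (with a differently shaped address) is a made-up example, not taken from the real site; it only illustrates why more samples are needed before fixing the digit counts.

```python
import re

# The single sample line from the question.
sample = "myElement.value = '1568090530 122.44.202.205 122.44.202.205';"

# Strict pattern: 10 digits, then two space-separated groups shaped 3.2.3.3 digits.
strict = re.compile(
    r"myElement\.value = '(\d{10}(?:\s\d{3}\.\d{2}\.\d{3}\.\d{3}){2})';"
)

match = strict.search(sample)
value = match.group(1) if match else None
print(value)

# The value is three whitespace-separated fields.
parts = value.split() if value is not None else []
print(parts)

# Hypothetical input: same layout, but the dotted groups have other lengths.
# The strict pattern rejects it, which is the risk of fixing the lengths too early.
other = "myElement.value = '1568090530 10.0.0.1 10.0.0.1';"
print(strict.search(other))
```

If the dotted groups turn out to vary in length, loosening each `\d{3}` or `\d{2}` to `\d{1,3}` would be one possible adjustment.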
import re
s = '''bb2_addLoadEvent(function() {
for ( i=0; i < document.forms.length; i++ ) {
if (document.forms[i].method == 'post') {
var myElement = document.createElement('input');
myElement.setAttribute('type', 'hidden');
myElement.name = 'bb2_screener_';
myElement.value = '1568090530 122.44.202.205 122.44.202.205';
document.forms[i].appendChild(myElement);
}'''
p = re.compile(r"myElement\.value = '([\d\s\.]+)';")
print(p.findall(s)[0])
@SIM also kindly proposed:
p = re.compile(r"value[^']+'([^']*)'")
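Since the question asks whether BeautifulSoup can be involved at all, the following is a rough sketch of one way to combine the two: BeautifulSoup locates the script tags and the regex pulls the value out of their text. The `html` string is a hypothetical stand-in for the downloaded page (`a.text` in the question), and `find_screener_value` is an illustrative helper name, not part of any library.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page source (a.text in the question).
html = """<html><head><script type="text/javascript">
bb2_addLoadEvent(function() {
    for ( i=0; i < document.forms.length; i++ ) {
        if (document.forms[i].method == 'post') {
            var myElement = document.createElement('input');
            myElement.setAttribute('type', 'hidden');
            myElement.name = 'bb2_screener_';
            myElement.value = '1568090530 122.44.202.205 122.44.202.205';
            document.forms[i].appendChild(myElement);
        }
    }
});
</script></head><body><form method="post"></form></body></html>"""

pattern = re.compile(r"myElement\.value = '([^']*)'")

def find_screener_value(page_source):
    """Return the first myElement.value literal found in any script tag, or None."""
    soup = BeautifulSoup(page_source, "html.parser")
    for script in soup.find_all("script"):
        # script.string is None for empty tags or tags that only carry a src attribute.
        match = pattern.search(script.string or "")
        if match:
            return match.group(1)
    return None

screener_value = find_screener_value(html)
print(screener_value)
```

BeautifulSoup does not execute JavaScript, so soup.find('input') cannot see an element that only exists after the script runs; the value is only available as text inside the script tag.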
If myElement.value = is static, this can be achieved with a simple regular expression (here text holds the page source, e.g. a.text from the question):
value = re.compile(r"myElement\.value = '([^']+)'").search(text).group(1)
This matches myElement.value = ', followed by non-' characters, followed by another ', where all the non-' characters are captured in a group. Then group(1) extracts the group from the match.
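One caveat with the one-liner: search() returns None when nothing matches, so calling group(1) directly would raise an AttributeError. A small sketch of a guarded version follows; `extract_value` is a hypothetical helper name introduced for illustration.

```python
import re

def extract_value(source):
    """Return the quoted value assigned to myElement.value, or None if absent."""
    match = re.search(r"myElement\.value = '([^']+)'", source)
    return match.group(1) if match else None

found = extract_value("myElement.value = '1568090530 122.44.202.205 122.44.202.205';")
missing = extract_value("var other = 'nothing to see here';")
print(found)
print(missing)
```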
If the string may contain escaped 's as well, e.g.:
myElement.value = 'foo \' bar';
then alternate \\. with [^']:
myElement\.value = '((?:\\.|[^'])+)'
https://regex101.com/r/Tdarel/1
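As a quick check of the difference, this sketch runs both patterns against the escaped-quote example. The variable names are illustrative only.

```python
import re

# The example line with an escaped quote inside the literal.
js = r"myElement.value = 'foo \' bar';"

# Simple pattern: stops at the first quote, even an escaped one.
simple = re.compile(r"myElement\.value = '([^']+)'")

# Alternation: a backslash followed by any character is consumed as a pair
# before falling back to "any character except a quote".
escaped_aware = re.compile(r"myElement\.value = '((?:\\.|[^'])+)'")

simple_result = simple.search(js).group(1)
aware_result = escaped_aware.search(js).group(1)
print(repr(simple_result))
print(repr(aware_result))
```

The captured text still contains the backslash; removing the escape is a separate step if the raw value is needed.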