Trouble web scraping data from a tag's inline style attribute

I have several spans with inline styles:

<span style="...;width:8px;..."></span>
<span style="...;width:16px;..."></span>
<span style="...;width:13px;..."></span>
<span style="...;width:20px;..."></span>
<span style="...;width:0px;..."></span> <!-- width=0px -->
<span style="...;width:5px;..."></span>
<span style="...;width:3px;..."></span>
<span style="...;width:90px;..."></span>
<span style="...;width:200px;..."></span>

I want to extract each "px" value and store it in an array. A span with width=0px marks the end of the current array. So the spans above should produce:

array1 = [8, 16, 13, 20]

array2 = [5, 3, 90, 200]

We can use an ArrayList of integer arrays to store the data.

What I have so far is very basic: Elements spanWidths = doc.select("span");

So far this produces: "border:...;width:8px;..."

I believe regex is the way to solve this, but I'm not very familiar with it. Any help?

The regex would be \\bwidth\\s*:\\s*(\\d+)px (written as a Java string literal, so each backslash is doubled). Then take the value from the first capture group, that is, call .group(1) on the resulting match.
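As a rough illustration of the answer above, here is a minimal sketch that applies that pattern to a list of style strings and starts a new group whenever a width of 0px appears. The class name WidthGroups, the helper names widthOf and group, and the sample style strings are all hypothetical and not from the original post. With jsoup, the style strings would presumably come from calling attr("style") on each selected span; plain strings are used here so the example runs with the standard library alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class WidthGroups {
    // The pattern from the answer. Group 1 captures the digits before "px".
    private static final Pattern WIDTH =
            Pattern.compile("\\bwidth\\s*:\\s*(\\d+)px");

    /** Returns the pixel width found in a style string, or -1 if there is none. */
    static int widthOf(String style) {
        Matcher m = WIDTH.matcher(style);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** Splits the widths into groups, treating a width of 0 as a separator. */
    static List<List<Integer>> group(List<String> styles) {
        List<List<Integer>> groups = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (String style : styles) {
            int width = widthOf(style);
            if (width < 0) {
                continue; // no width declaration in this style; skip it
            }
            if (width == 0) {
                // 0px closes the current group (empty groups are dropped)
                if (!current.isEmpty()) {
                    groups.add(current);
                    current = new ArrayList<>();
                }
            } else {
                current.add(width);
            }
        }
        if (!current.isEmpty()) {
            groups.add(current); // last group has no trailing 0px marker
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up style strings standing in for span.attr("style") values.
        List<String> styles = List.of(
                "display:block;width:8px", "display:block;width:16px",
                "display:block;width:13px", "display:block;width:20px",
                "display:block;width:0px",
                "display:block;width:5px", "display:block;width:3px",
                "display:block;width:90px", "display:block;width:200px");
        System.out.println(group(styles)); // [[8, 16, 13, 20], [5, 3, 90, 200]]
    }
}
```

One caveat: because \b matches between "-" and "w", this pattern also matches declarations such as max-width:5px. If the real styles contain those properties, the pattern would need a stricter prefix check.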
