Limiting the region of text that Python searches

Question

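I want to search for a string and count how many times it appears in a web scrape, but only in the part of the scrape that lies between x and y.

In the example scrape below, can anyone tell me the simplest way to count the occurrences of SEA BASS between MAIN FISHERMAN and SECONDARY FISHERMAN?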

<p style="color: #555555;
    font-family: Arial,Helvetica,sans-serif;
    font-size: 12px;
    line-height: 18px;">June 21, 2013  By FISH PPL Admin  </small>

</div>
<!-- Post Body Copy -->

<div class="post-bodycopy clearfix"><p>MAIN FISHERMAN &#8211; </p>
<p><strong>CHAMP</strong> &#8211; Pedro 00777<br />
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br />
MULTI – LOCATION2 &#8211; 7:30 &#8211; COD (3 LBS 13/8)<br />
LURE – LOCATION5 &#8211; 3:20 &#8211; RUDD (2 LBS 6/1)</p>
<p>JOE BLOGGS <a href="url">url</a><br />
BAIT &#8211; LOCATION4 &#8211; 4:45 &#8211; ROACH (5 LBS 3/1)<br />
MULTI – LOCATION2 &#8211; 5:50 &#8211; PERCH (3 LBS 6/1)<br />
LURE – LOCATION1 &#8211; 3:45 &#8211; PIKE (2 LBS 5/1) </p>
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br />
MULTI – LOCATION1 &#8211; 3:45 &#8211; JUST THE JUDGE (3 LBS 3/1)<br />
LURE – LOCATION3 &#8211; 8:25 &#8211; SCHOOL FEES (2 LBS 7/1)</p>
<div class="post-bodycopy clearfix"><p>SECONDARY FISHERMAN &#8211; </p>
<p><strong>SPOON &#8211; <a href="url">url</a></strong><br />
BAIT &#8211; LOCATION1 &#8211; 2:30 &#8211; SEA BASS (3 LBS 11/4)<br />
MULTI – LOCATION2 &#8211; 7:30 &#8211; COD (3 LBS 7/4)<br />
LURE – LOCATION1 &#8211; 4:25 &#8211; TROUT (2 LBS 5/1)</p>

I tried to do this with the following code, but without success.

html = website.read()

pattern_to_exclude_unwanted_data = re.compile('MAIN FISHERMAN(.*)SECONDARY FISHERMAN')

excluding_unwanted_data = re.findall(pattern_to_exclude_unwanted_data, html)

print excluding_unwanted_data("SEA BASS")

Answer 1

Do it in two steps:

1. Extract the substring between MAIN FISHERMAN and SECONDARY FISHERMAN.
2. Count the occurrences of SEA BASS in that substring.

Like this:

relevant = re.search(r"MAIN FISHERMAN(.*)SECONDARY FISHERMAN", html, re.DOTALL).group(1)
found = relevant.count("SEA BASS")
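
The two steps above can also be wrapped in a small helper. This is only a sketch under a few assumptions: the name count_between and the shortened sample text are made up for illustration, the markers are escaped with re.escape in case they contain regex metacharacters, and the pattern is non-greedy so it stops at the first end marker (the one-liner above is greedy and runs to the last one). It returns 0 instead of raising when the markers are not found.

```python
import re

def count_between(text, start, end, needle):
    """Count needle in the text between the first start marker and the next end marker."""
    # re.DOTALL lets '.' match newlines, so the region may span several lines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    match = re.search(pattern, text, re.DOTALL)
    if match is None:
        return 0  # markers not found (or in the wrong order)
    return match.group(1).count(needle)

# Shortened, hypothetical stand-in for the scraped page.
sample = (
    "MAIN FISHERMAN\n"
    "BAIT - LOCATION1 - SEA BASS\n"
    "MULTI - LOCATION2 - COD\n"
    "BAIT - LOCATION1 - SEA BASS\n"
    "SECONDARY FISHERMAN\n"
    "BAIT - LOCATION1 - SEA BASS\n"
)

print(count_between(sample, "MAIN FISHERMAN", "SECONDARY FISHERMAN", "SEA BASS"))
```

On this sample only the two SEA BASS entries before SECONDARY FISHERMAN are counted; the one after it is ignored.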

Answer 2

If you want to use 'MAIN FISHERMAN' and 'SECONDARY FISHERMAN' as markers to find the <div> elements in which to count 'SEA BASS':

import re
from bs4 import BeautifulSoup  # $ pip install beautifulsoup4

soup = BeautifulSoup(html)
inbetween = False
count = 0
for div in soup.find_all('div', ["post-bodycopy", "clearfix"]):
    if not inbetween:
        inbetween = div.find(text=re.compile('MAIN FISHERMAN'))  # check start
    else:  # inbetween
        inbetween = not div.find(text=re.compile('SECONDARY FISHERMAN'))  # end
    if inbetween:
        count += len(div.find_all(text=re.compile('SEA BASS')))

print(count)

Answer 3

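A line-by-line sketch (untested), assuming html holds the page text as in the question:

count = 0
enabled = False
for line in html.splitlines():
    if 'MAIN FISHERMAN' in line:
        enabled = True    # start of the region of interest
    elif enabled and 'SEA BASS' in line:
        count += 1        # counts matching lines, not individual occurrences
    elif 'SECONDARY FISHERMAN' in line:
        enabled = False   # end of the region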
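
The same flag-based idea can be written as a reusable function. This is a sketch rather than the answerer's code: the name count_in_section and the sample lines are hypothetical, the end marker is checked before the search string so a line containing both is not counted, and line.count is used so that several occurrences on one line are all counted.

```python
def count_in_section(lines, start, end, needle):
    """Count needle on the lines that fall between a start-marker line and an end-marker line."""
    count = 0
    enabled = False
    for line in lines:
        if start in line:
            enabled = True        # entering the region of interest
        elif end in line:
            enabled = False       # leaving the region
        elif enabled:
            count += line.count(needle)  # every occurrence on the line
    return count

# Hypothetical lines standing in for the scraped page.
sample_lines = [
    "MAIN FISHERMAN",
    "BAIT - LOCATION1 - SEA BASS",
    "MULTI - LOCATION2 - COD",
    "BAIT - LOCATION1 - SEA BASS, SEA BASS",
    "SECONDARY FISHERMAN",
    "BAIT - LOCATION1 - SEA BASS",
]

print(count_in_section(sample_lines, "MAIN FISHERMAN", "SECONDARY FISHERMAN", "SEA BASS"))
```

With real scraped text, passing html.splitlines() as the first argument would feed the page to the function one line at a time.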
