Web scraping with BeautifulSoup: separating values

Question

I am scraping a web page with BeautifulSoup. The page has the following source:

<a href="/en/Members/">
                            Courtney, John  (Dem)                       </a>,
<a href="/en/Members/">
                            Clinton, Hilary  (Dem)                      </a>,
<a href="/en/Members/">
                            Lee, Kevin  (Rep)                       </a>,

The following code works.

for item in soup.find_all("a"):
    # Print only the link text, without surrounding whitespace
    print(item.text.strip())

However, the code returns the following:

Courtney, John  (Dem)
Clinton, Hilary  (Dem)
Lee, Kevin  (Rep)

Can I collect just the names, and the affiliation information separately? Thanks in advance.

Answer 1

You can use re.split() to split a string on multiple delimiters by building a regular expression pattern. Here I split on ( and ).

import re

for item in soup.find_all("a"):
    # item is a Tag, so split its text rather than the tag itself
    tokens = re.split(r'\(|\)', item.text)
    name = tokens[0].strip()
    affiliation = tokens[1].strip()
    print(name)
    print(affiliation)

Source: https://docs.python.org/2/library/re.html#re.split

re.split() returns a list that looks like this:

>>> re.split(r'\(|\)', item.text.strip())
['Courtney, John  ', 'Dem', '']

Take entry 0 of the list as the name and strip the whitespace from its end. Take entry 1 as the affiliation and do the same.
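To try the split without a live page, here is a minimal sketch that works on plain strings copied from the question. The helper name split_member is hypothetical, and it assumes every string contains a parenthesized affiliation; a string without one would raise an IndexError.

```python
import re


def split_member(text):
    """Split 'Last, First  (Party)' into a (name, affiliation) pair."""
    # Splitting on either parenthesis gives [name, affiliation, ''].
    tokens = re.split(r"[()]", text.strip())
    return tokens[0].strip(), tokens[1].strip()


# Link texts as they appear in the question, including stray whitespace.
samples = [
    "   Courtney, John  (Dem)   ",
    "   Clinton, Hilary  (Dem)   ",
    "   Lee, Kevin  (Rep)   ",
]

for raw in samples:
    name, affiliation = split_member(raw)
    print(name, "|", affiliation)
```

In the scraping loop, the same helper could be called as split_member(item.text).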

Answer 2

You can use:

from bs4 import BeautifulSoup

content = '''
<a href="/en/Members/">Courtney, John  (Dem)</a>
<a href="/en/Members/">Clinton, Hilary  (Dem)</a>,
<a href="/en/Members/">Lee, Kevin  (Rep)</a>
'''

politicians = []
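soup = BeautifulSoup(content, 'html.parser')  # name the parser explicitly
for item in soup.find_all('a'):
    # Split once at the last '(' so the text yields exactly two parts
    name, party = item.text.strip().rsplit('(', 1)
    # [:-1] drops the closing ')'
    politicians.append((name.strip(), party.strip()[:-1]))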

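Because the name and the affiliation together make up the text content of the a tag, they cannot be collected separately. You have to collect them as one string and then split it. I used strip() to remove the unwanted whitespace, rsplit() to split the text at the opening parenthesis, and the [:-1] slice to drop the closing parenthesis.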

Output

print(politicians)
[('Courtney, John', 'Dem'), ('Clinton, Hilary', 'Dem'), ('Lee, Kevin', 'Rep')]
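One caveat with the tuple unpacking above: it raises a ValueError for any link whose text has no opening parenthesis. The following is a sketch of a more tolerant variant, assuming the real page also contains such links (the question does not say so). The helper name parse_member and the "Home" sample are hypothetical.

```python
def parse_member(text):
    """Return (name, affiliation); affiliation is None if there is no '('."""
    # rpartition splits at the last '(' and returns an empty separator
    # when the character is absent.
    head, sep, tail = text.strip().rpartition("(")
    if not sep:
        return text.strip(), None
    return head.strip(), tail.rstrip(")").strip()


print(parse_member("Clinton, Hilary  (Dem)"))
print(parse_member("Home"))
```

Returning None for the affiliation lets the caller skip or keep non-member links, for example by filtering the results on party is not None.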
