How to extract text from outside a tag using BS4

Question

I am trying to scan a bunch of Wikipedia pages for statistics about WWII.

I am using BeautifulSoup to get all of the statistics from the column on the right of each Wikipedia page. The code is listed below. "links.csv" is a file containing a list of link endings such as "Battle_of_Leyte_Gulf". I have tested with the <h2> tag, and the script accesses every page correctly.

import requests
from bs4 import BeautifulSoup
import pandas
df=pandas.read_csv("links.csv")
links=df['links']
for url in links:
  # print("\n"+url+"\n")
  txt="https://en.wikipedia.org/wiki/"+url
  page=requests.get(txt)
  soup=BeautifulSoup(page.content, 'html.parser')
  tags = soup.find_all("br")
  for tag in tags:
    print(tag)

However, I noticed that the text is not inside the actual <br> tag; it sits outside the tags, as shown here:

"Sixth Army: "
<br>
"≈200,000"
<br>
<span class="flagicon">...</span>
"Air and naval forces: ≈120,000"

I want to know how I can get the actual text "Sixth Army: " and "≈200,000".

Link here: https://en.wikipedia.org/wiki/Battle_of_Leyte

Answer 1

You could isolate the td cell and then use next_sibling to step from the span to the text nodes that follow it:

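import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://en.wikipedia.org/wiki/Battle_of_Leyte')
soup = bs(r.content, 'lxml')  # the 'lxml' parser needs the lxml package installed
# First span inside a td of the 12th row under the element with class "vevent"
visible_row = soup.select_one('.vevent tr:nth-of-type(12) td span')
# The node directly after the span
print(visible_row.next_sibling)
# The third sibling after the span, skipping the node in between
print(visible_row.next_sibling.next_sibling.next_sibling)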
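
The same idea can be tried offline. The following is a minimal sketch, not the answerer's code: it parses a hypothetical HTML fragment modelled on the markup shown in the question (the table wrapper and the "flag" placeholder text are assumptions, not copied from Wikipedia) and shows two ways to reach text that sits between tags. The variable names html, cell, first_br and direct_text are illustrative only.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical fragment modelled on the markup in the question;
# the real Wikipedia cell may be structured differently.
html = (
    '<table><tr><td>'
    'Sixth Army: <br>≈200,000<br>'
    '<span class="flagicon">flag</span>'
    'Air and naval forces: ≈120,000'
    '</td></tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td")

# Option 1: start from a tag and step to the text node beside it.
first_br = cell.find("br")
print(repr(first_br.previous_sibling))  # text node just before the first <br>
print(repr(first_br.next_sibling))      # text node just after the first <br>

# Option 2: keep only the cell's direct text nodes, skipping tags
# (and therefore any text nested inside the span).
direct_text = [
    str(node).strip()
    for node in cell.children
    if isinstance(node, NavigableString) and node.strip()
]
print(direct_text)
```

Option 1 mirrors the next_sibling approach above and suits cases where the position of one tag is known. Option 2 may be more convenient when looping over many pages, since it does not depend on how many <br> or <span> tags separate the values.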

Answer 1 (accepted), 2019-04-25 08:25:06
