How to remove the '\r\n\r\n' characters from a list of strings while web scraping with BeautifulSoup in Python?
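I am trying to scrape data from a web page, but unwanted characters (namely '\r\n\r\n') show up in the scraped data. The goal is a DataFrame containing the site's data.
Here is my code: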
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
url = "https://www.hubertiming.com/results/2018MLK"
html = urlopen(url)
soup = BeautifulSoup(html, "lxml")
title = soup.title
print(title)
print(title.text)
links = soup.find_all('a', href = True)
for link in links:
    print(link['href'])
data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text)
    data.append(dataRow)
print(data)
The output I get is:
[[], ['Finishers:', '191'], ['Male:', '78'], ['Female:', '113'], [], ['1', '1191', '\r\n\r\n MAX RANDOLPH\r\n\r\n ', 'M', '29', 'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n 1 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 1 of 33\r\n\r\n ', '0:08', '16:56'], ['2', '1080', '\r\n\r\n NEED NAME KAISER RUNNER\r\n\r\n ', 'M', '25', 'PORTLAND', 'OR', '5:39', '17:31', '\r\n\r\n 2 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 2 of 33\r\n\r\n ', '0:09', '17:40'], ['3', '1275', '\r\n\r\n DAN FRANEK\r\n\r\n ', 'M', '52', 'PORTLAND', 'OR', '5:53', '18:15', '\r\n\r\n 3 of 78\r\n\r\n ', 'M 40-54', '\r\n\r\n 1 of 27\r\n\r\n ', '0:07', '18:22'], ['4', '1223', '\r\n\r\n PAUL TAYLOR\r\n\r\n ', 'M', '54', 'PORTLAND', 'OR', '5:58', '18:31', '\r\n\r\n 4 of 78\r\n\r\n ', 'M 40-54', '\r\n\r\n 2 of 27\r\n\r\n ', '0:07', '18:38'], ['5', '1245', '\r\n\r\n THEO KINMAN\r\n\r\n ', 'M', '22', '', '', '6:17', '19:31', '\r\n\r\n 5 of 78\r\n\r\n ', 'M 21-39', '\r\n\r\n 3 of 33\r\n\r\n ', '0:09', '19:40'], ['6', '1185', '\r\n\r\n MELISSA GIRGIS\r\n\r\n ', 'F', '27', 'PORTLAND', 'OR', '6:20', '19:39', '\r\n\r\n 1 of 113\r\n\r\n ', 'F 21-39', '\r\n\r\n 1 of 53\r\n\r\n ', '0:07', '19:46'],...
df = pd.DataFrame(data)
print(df)
The DataFrame looks like this:
0 1 2 \
0 None None None
1 Finishers: 191 None
2 Male: 78 None
3 Female: 113 None
4 None None None
.. ... ... ...
191 187 1254 \r\n\r\n CYNTHIA HARRIS\r\n...
192 188 1085 \r\n\r\n EBONY LAWRENCE\r\n...
193 189 1170 \r\n\r\n ANTHONY WILLIAMS\r...
194 190 2087 \r\n\r\n LEESHA POSEY\r\n\r...
195 191 1216 \r\n\r\n ZULMA OCHOA\r\n\r\...
3 4 5 6 7 8 \
0 None None None None None None
1 None None None None None None
2 None None None None None None
3 None None None None None None
4 None None None None None None
.. ... ... ... ... ... ...
191 F 64 PORTLAND OR 21:53 1:07:51
192 F 30 PORTLAND OR 22:00 1:08:12
193 M 39 PORTLAND OR 22:19 1:09:11
194 F 43 PORTLAND OR 30:17 1:33:53
195 F 40 GRESHAM OR 33:22 1:43:27
9 10 \
0 None None
1 None None
2 None None
3 None None
4 None None
.. ... ...
191 \r\n\r\n 110 of 113\r\n\r\n... F 55+
192 \r\n\r\n 111 of 113\r\n\r\n... F 21-39
193 \r\n\r\n 78 of 78\r\n\r\n ... M 21-39
194 \r\n\r\n 112 of 113\r\n\r\n... F 40-54
195 \r\n\r\n 113 of 113\r\n\r\n... F 40-54
11 12 13
0 None None None
1 None None None
2 None None None
3 None None None
4 None None None
.. ... ... ...
191 \r\n\r\n 14 of 14\r\n\r\n ... 1:19 1:09:10
192 \r\n\r\n 53 of 53\r\n\r\n ... 0:58 1:09:10
193 \r\n\r\n 33 of 33\r\n\r\n ... 0:08 1:09:19
194 \r\n\r\n 36 of 37\r\n\r\n ... 0:00 1:33:53
195 \r\n\r\n 37 of 37\r\n\r\n ... 0:00 1:43:27
[196 rows x 14 columns]
I can't figure out how to remove the extra characters from my data. Please suggest a way to do this.
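As @SergeyK also mentioned, I would recommend pandas (pd.read_html). It is common practice, works in most cases (it can use bs4 under the hood), and gives you the result in one line:
df = pd.read_html(url)[1]
print(df)
If you prefer to do it your own way, select the rows more specifically and call strip() on each cell's text: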
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
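If you have already collected the raw rows (as in the question), they can also be cleaned after the fact. The snippet below is a minimal sketch of that idea: '\r', '\n' and spaces all count as whitespace, so str.strip() removes them from both ends of each cell. The helper name clean_rows and the shortened sample rows are illustrative assumptions, not part of the original code.

```python
def clean_rows(rows):
    """Strip surrounding whitespace from every cell and drop empty rows."""
    cleaned = []
    for row in rows:
        if not row:  # rows without <td> cells come through as []
            continue
        cleaned.append([cell.strip() for cell in row])
    return cleaned


# Shortened sample in the shape of the question's output (illustrative only).
raw = [
    [],
    ['Finishers:', '191'],
    ['1', '1191', '\r\n\r\n    MAX RANDOLPH\r\n\r\n    ', 'M'],
    ['2', '1080', '\r\n\r\n    NEED NAME KAISER RUNNER\r\n\r\n    ', 'M'],
]

cleaned = clean_rows(raw)
print(cleaned)
```

Note that strip() only touches the ends of a string, so spaces inside a name such as 'MAX RANDOLPH' are kept.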
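A complete example:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

# Name the parser explicitly (same one as in the question) to avoid the bs4 warning
soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'), 'lxml')
data = []
# Only rows of the results table that actually contain <td> cells
for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
# Use the table's <th> cells as column names
pd.DataFrame(data, columns=[h.text for h in soup.select('#individualResults th')])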
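To try the selector and the stripping without hitting the network, here is a small sketch against an inline HTML string. The markup in SAMPLE_HTML is a hypothetical, trimmed-down stand-in for the real results page, and it uses get_text(strip=True), which for a cell holding a single piece of text gives the same result as .text.strip().

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the results table; the real page has more columns.
SAMPLE_HTML = (
    "<table id='individualResults'>"
    "<tr><th>Place</th><th>Name</th></tr>"
    "<tr><td>1</td><td>\r\n\r\n    MAX RANDOLPH\r\n\r\n    </td></tr>"
    "<tr><td>2</td><td>\r\n\r\n    DAN FRANEK\r\n\r\n    </td></tr>"
    "</table>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Header cells become the column names.
headers = [th.get_text(strip=True) for th in soup.select("#individualResults th")]

# 'tr:has(td)' skips the header row, which only contains <th> cells.
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in soup.select("#individualResults tr:has(td)")
]

print(headers)
print(rows)
```

The resulting headers and rows can be passed to pd.DataFrame(rows, columns=headers) in the same way as in the full example.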