
How to remove '\r\n\r\n' characters from a list of strings while web scraping with BeautifulSoup in Python?

I am trying to scrape data from the web, but unusual characters (i.e. '\r\n\r\n') keep showing up in my data. The goal is to get a dataframe containing the site's data.

Here is my code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://www.hubertiming.com/results/2018MLK"  
html = urlopen(url)    

soup = BeautifulSoup(html, "lxml")
title = soup.title
print(title)
print(title.text)

links = soup.find_all('a', href = True)
for link in links:
    print(link['href'])

data = []
allrows = soup.find_all("tr")
for row in allrows:
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text)   # raw cell text, still padded with '\r\n' and spaces
    data.append(dataRow)
    
print(data)

The output I get looks like this:

[[], ['Finishers:', '191'], ['Male:', '78'], ['Female:', '113'], [], ['1', '1191', '\r\n\r\n                    MAX RANDOLPH\r\n\r\n                ', 'M', '29', 'WASHINGTON', 'DC', '5:25', '16:48', '\r\n\r\n                    1 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    1 of 33\r\n\r\n                ', '0:08', '16:56'], ['2', '1080', '\r\n\r\n                    NEED NAME KAISER RUNNER\r\n\r\n                ', 'M', '25', 'PORTLAND', 'OR', '5:39', '17:31', '\r\n\r\n                    2 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    2 of 33\r\n\r\n                ', '0:09', '17:40'], ['3', '1275', '\r\n\r\n                    DAN FRANEK\r\n\r\n                ', 'M', '52', 'PORTLAND', 'OR', '5:53', '18:15', '\r\n\r\n                    3 of 78\r\n\r\n                ', 'M 40-54', '\r\n\r\n                    1 of 27\r\n\r\n                ', '0:07', '18:22'], ['4', '1223', '\r\n\r\n                    PAUL TAYLOR\r\n\r\n                ', 'M', '54', 'PORTLAND', 'OR', '5:58', '18:31', '\r\n\r\n                    4 of 78\r\n\r\n                ', 'M 40-54', '\r\n\r\n                    2 of 27\r\n\r\n                ', '0:07', '18:38'], ['5', '1245', '\r\n\r\n                    THEO KINMAN\r\n\r\n                ', 'M', '22', '', '', '6:17', '19:31', '\r\n\r\n                    5 of 78\r\n\r\n                ', 'M 21-39', '\r\n\r\n                    3 of 33\r\n\r\n                ', '0:09', '19:40'], ['6', '1185', '\r\n\r\n                    MELISSA GIRGIS\r\n\r\n                ', 'F', '27', 'PORTLAND', 'OR', '6:20', '19:39', '\r\n\r\n                    1 of 113\r\n\r\n                ', 'F 21-39', '\r\n\r\n                    1 of 53\r\n\r\n                ', '0:07', '19:46'],...

df = pd.DataFrame(data)
print(df)

The dataframe looks like this:

              0     1                                                  2  \
0          None  None                                               None   
1    Finishers:   191                                               None   
2         Male:    78                                               None   
3       Female:   113                                               None   
4          None  None                                               None   
..          ...   ...                                                ...   
191         187  1254  \r\n\r\n                    CYNTHIA HARRIS\r\n...   
192         188  1085  \r\n\r\n                    EBONY LAWRENCE\r\n...   
193         189  1170  \r\n\r\n                    ANTHONY WILLIAMS\r...   
194         190  2087  \r\n\r\n                    LEESHA POSEY\r\n\r...   
195         191  1216  \r\n\r\n                    ZULMA OCHOA\r\n\r\...   

        3     4         5     6      7        8  \
0    None  None      None  None   None     None   
1    None  None      None  None   None     None   
2    None  None      None  None   None     None   
3    None  None      None  None   None     None   
4    None  None      None  None   None     None   
..    ...   ...       ...   ...    ...      ...   
191     F    64  PORTLAND    OR  21:53  1:07:51   
192     F    30  PORTLAND    OR  22:00  1:08:12   
193     M    39  PORTLAND    OR  22:19  1:09:11   
194     F    43  PORTLAND    OR  30:17  1:33:53   
195     F    40   GRESHAM    OR  33:22  1:43:27   

                                                     9       10  \
0                                                 None     None   
1                                                 None     None   
2                                                 None     None   
3                                                 None     None   
4                                                 None     None   
..                                                 ...      ...   
191  \r\n\r\n                    110 of 113\r\n\r\n...    F 55+   
192  \r\n\r\n                    111 of 113\r\n\r\n...  F 21-39   
193  \r\n\r\n                    78 of 78\r\n\r\n  ...  M 21-39   
194  \r\n\r\n                    112 of 113\r\n\r\n...  F 40-54   
195  \r\n\r\n                    113 of 113\r\n\r\n...  F 40-54   

                                                    11    12       13  
0                                                 None  None     None  
1                                                 None  None     None  
2                                                 None  None     None  
3                                                 None  None     None  
4                                                 None  None     None  
..                                                 ...   ...      ...  
191  \r\n\r\n                    14 of 14\r\n\r\n  ...  1:19  1:09:10  
192  \r\n\r\n                    53 of 53\r\n\r\n  ...  0:58  1:09:10  
193  \r\n\r\n                    33 of 33\r\n\r\n  ...  0:08  1:09:19  
194  \r\n\r\n                    36 of 37\r\n\r\n  ...  0:00  1:33:53  
195  \r\n\r\n                    37 of 37\r\n\r\n  ...  0:00  1:43:27  

[196 rows x 14 columns]

I can't figure out how to remove these extra characters from my data. Please suggest a way to do so.

As @SergeyK also mentioned, I would suggest using pandas, which is common practice and works in most cases (it uses bs4 under the hood), and you get your result in one line:

df = pd.read_html(url)[1]
print(df)
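read_html parses every &lt;table&gt; element on the page and returns a list of DataFrames, so the [1] above is simply the index of the individual-results table on this particular page. If you are unsure which index you need, a quick check (a sketch; it assumes lxml or html5lib is installed, which read_html requires):

import pandas as pd

url = "https://www.hubertiming.com/results/2018MLK"
tables = pd.read_html(url)      # one DataFrame per <table> found on the page
print(len(tables))              # how many tables were found
print(tables[1].head())         # index 1 is assumed to be the individual results table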

If you prefer to stick with your own approach, select the elements more specifically and strip() the text as mentioned:

for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
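
An equivalent option is to let BeautifulSoup do the stripping itself with get_text(strip=True), which trims the whitespace around each text fragment before joining; for single-text cells like these it gives the same result as cell.text.strip(). A minimal sketch:

from urllib.request import urlopen
from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'), 'lxml')

data = []
for row in soup.select('#individualResults tr:has(td)'):
    # get_text(strip=True) removes the surrounding '\r\n' and padding spaces
    data.append([cell.get_text(strip=True) for cell in row.find_all('td')])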
Example
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

soup = BeautifulSoup(urlopen('https://www.hubertiming.com/results/2018MLK'))
data = []

for row in soup.select('#individualResults tr:has(td)'):
    row_list = row.find_all("td")
    dataRow = []
    for cell in row_list:
        dataRow.append(cell.text.strip())
    data.append(dataRow)
    
pd.DataFrame(data, columns=[h.text for h in soup.select('#individualResults th')])
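
If you would rather keep your original loop untouched and clean the DataFrame afterwards instead, a possible sketch (assuming every cell is either a string or None, as in your output) strips the whitespace in one pass; on newer pandas versions DataFrame.map can be used in place of applymap:

import pandas as pd

df = pd.DataFrame(data)   # 'data' built exactly as in the question

# Strip leading/trailing whitespace (including '\r\n') from every string cell;
# non-string values such as None are left as they are.
df = df.applymap(lambda x: x.strip() if isinstance(x, str) else x)
print(df)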

