
Cleaning accented unicode characters with Pandas read_html function

I'm downloading football data with the pandas read_html function, but I'm struggling to clean the player names, which contain accented characters. This is what I have so far:

import pandas as pd 
from unidecode import unidecode

shooting = pd.read_html("https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting")
for idx, table in enumerate(shooting):
    print("***************************")
    print(idx)
    print(table)

    shooting = table

for col in [('Unnamed: 1_level_0', 'Player')]:
    shooting[col] = shooting[col].apply(unidecode)

shooting = table
# print(shooting.droplevel(1))

table.to_csv(r'C:\Users\khabs\OneDrive\Documents\Python Testing\shooting.csv', index=False, header=True)
print(shooting)

I think the issue is that the encoding is already messed up before I even do the cleaning, but I'm really not sure.

Any help would be greatly appreciated!!

Just use the encoding parameter of pandas read_html.

import pandas as pd 

url = "https://widgets.sports-reference.com/wg.fcgi?css=1&site=fb&url=%2Fen%2Fcomps%2F9%2Fshooting%2FPremier-League-Stats&div=div_stats_shooting"
shooting = pd.read_html(url, header=1, encoding='utf8')[0]
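As a rough illustration of why the encoding matters, the sketch below decodes the same UTF-8 bytes with two codecs. Using latin-1 as the "wrong" codec is only an assumption for the demo; the codec that actually gets picked for the widget response may differ.

```python
# One of the player names from the table, as UTF-8 bytes
name = "Nathan Aké"
raw = name.encode("utf-8")

# Decoding with the wrong codec (latin-1 is an assumed example) garbles the accent
garbled = raw.decode("latin-1")

# Decoding with the codec the bytes were written in keeps the accent intact
correct = raw.decode("utf-8")

# In this simple case the garbling is reversible
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)   # Nathan AkÃ©
print(correct)   # Nathan Aké
```

Running unidecode on text that is already garbled this way cannot recover the original letters, which would fit the suspicion in the question that the damage happens before the cleaning step.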

However, I'm assuming that alone will not get you what you want, as there are extra HTML characters in the response returned from that widget.

Just go after the actual HTML instead. The table is inside HTML comments, so remove the comment markers before parsing.
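To see why the comment markers matter, here is a small standard-library sketch. The miniature page and the TableCounter helper are made up for the demo and are not part of fbref or pandas.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Count the <table> start tags that the parser actually sees."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables += 1


def count_tables(html):
    parser = TableCounter()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical miniature page: the table sits inside an HTML comment
page = "<div><!-- <table><tr><td>Nathan Aké</td></tr></table> --></div>"

hidden = count_tables(page)
visible = count_tables(page.replace("<!--", "").replace("-->", ""))
print(hidden, visible)  # 0 1
```

A parser treats everything between the markers as comment text, so the table only shows up once the markers are gone. The code that follows applies the same replacement to the real page.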

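from io import StringIO

import requests
import pandas as pd

url = 'https://fbref.com/en/comps/9/shooting/Premier-League-Stats'
# Remove the comment markers so the commented-out table becomes parseable
html = requests.get(url).text.replace('<!--', '').replace('-->', '')

# Wrap in StringIO: newer pandas versions deprecate passing literal HTML strings
shooting = pd.read_html(StringIO(html), header=1)[-1]
# Drop the header rows that are repeated inside the table body
shooting = shooting[shooting['Rk'].ne('Rk')]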

Output:

print(shooting.head(10))
   Rk                  Player   Nation    Pos  ... npxG/Sh  G-xG np:G-xG  Matches
0   1        Brenden Aaronson   us USA  FW,MF  ...    0.03  -0.1    -0.1  Matches
1   2               Che Adams  sct SCO     FW  ...    0.09  +1.6    +1.6  Matches
2   3             Tyler Adams   us USA     MF  ...    0.01   0.0     0.0  Matches
3   4        Tosin Adarabioyo  eng ENG     DF  ...     NaN   0.0     0.0  Matches
4   5         Rayan Aït Nouri   fr FRA     DF  ...    0.08  -0.1    -0.1  Matches
5   6              Nathan Aké   nl NED     DF  ...    0.05  -0.2    -0.2  Matches
6   7        Thiago Alcántara   es ESP     MF  ...     NaN   0.0     0.0  Matches
7   8  Trent Alexander-Arnold  eng ENG     DF  ...    0.05  -0.2    -0.2  Matches
8   9                 Alisson   br BRA     GK  ...     NaN   0.0     0.0  Matches
9  10               Dele Alli  eng ENG  FW,MF  ...     NaN   0.0     0.0  Matches
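If ASCII-only names are still wanted once the text decodes correctly, one possible approach is sketched below with the standard library's unicodedata module instead of unidecode. The strip_accents helper is a name introduced here for illustration.

```python
import unicodedata


def strip_accents(text):
    """Decompose accented letters, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Names taken from the output above
names = ["Rayan Aït Nouri", "Nathan Aké", "Thiago Alcántara"]
cleaned = [strip_accents(n) for n in names]
print(cleaned)  # ['Rayan Ait Nouri', 'Nathan Ake', 'Thiago Alcantara']
```

On the DataFrame this would be applied as shooting['Player'] = shooting['Player'].apply(strip_accents). Letters that have no decomposition, such as ø, pass through unchanged with this approach, so unidecode may still be the better fit for names like that.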

