
How to save scraped data to CSV using pandas

I want to save my scraped data to a CSV file using pandas, but I keep getting an error.

Here's my code:

import requests
from bs4 import BeautifulSoup
import pandas as pd

link = ("https://sofifa.com/team/1/arsenal/?&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc")
get_text = requests.get(link)
soup = BeautifulSoup(get_text.content, "lxml") 
table = soup.find("table", {"class":"table table-hover persist-area"})
table1 = table.get_text()

table1.to_csv("Arsenal_players.csv")

You should give more detail when asking a question, such as the exact error message you get; that makes it much easier to answer. Anyway, I ran your code and saw the error I expected. The table1 variable now contains only a string, because of

table1 = table.get_text()

so in your situation there is no method on it that will write all the data to a CSV file, but you can find help here. And remember, next time be precise about your problem.
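
For what it's worth, here is a minimal sketch of how the rows could be written without pandas, walking the table with BeautifulSoup and writing it with Python's built-in csv module. It assumes the usual tr/td row layout and is untested against that particular page:

import csv
import requests
from bs4 import BeautifulSoup

link = "https://sofifa.com/team/1/arsenal/?&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc"
soup = BeautifulSoup(requests.get(link).content, "lxml")
table = soup.find("table", {"class": "table table-hover persist-area"})

# Iterate over the table row by row instead of flattening it with get_text(),
# and write each row's cell texts as one CSV record.
with open("Arsenal_players.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for row in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        if cells:
            writer.writerow(cells)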

You need to first read the HTML into a pandas dataframe using read_html, and then use to_csv to write it to a file. Here is an example:

import requests
from bs4 import BeautifulSoup
import pandas as pd

link = ("https://sofifa.com/team/1/arsenal/?&showCol%5B%5D=ae&showCol%5B%5D=hi&showCol%5B%5D=le&showCol%5B%5D=vl&showCol%5B%5D=wg&showCol%5B%5D=rc")
get_text = requests.get(link)
soup = BeautifulSoup(get_text.content, "lxml")
table = soup.find("table", {"class":"table table-hover persist-area"})

# produces a list of dataframes from the html, see docs for more options
dfs = pd.read_html(str(table)) 
dfs[0].to_csv("Arsenal_players.csv")
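
If you want to see what read_html extracted before saving, you could preview the first dataframe and drop the pandas row index when writing; this is just an optional usage note:

print(dfs[0].head())  # preview the first few parsed rows
dfs[0].to_csv("Arsenal_players.csv", index=False)  # index=False omits the pandas row numbers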

The read_html method has quite a few options that can change the behavior. You can also use it to read your link directly instead of first using requests/BeautifulSoup (it can do that under the hood).
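
For instance (a sketch only; the parameter choices here are illustrative, not taken from the answer above), you could pass some of those options when parsing the extracted table:

# header=0 uses the first table row as the column names;
# flavor selects the underlying HTML parser backend.
dfs = pd.read_html(str(table), flavor="lxml", header=0)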

It might look something like this, but this is untested because that link gives a 403 forbidden when I do this (perhaps they are blocking based on user agent):

dfs = pd.read_html(link, attrs={"class":"table table-hover persist-area"})

EDIT: since read_html doesn't allow you to specify a user agent, I believe this will end up being the most concise way for this particular link:

dfs = pd.read_html(
    requests.get(link).text,
    attrs={"class":"table table-hover persist-area"}
)
dfs[0].to_csv("Arsenal_players.csv")
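
If the plain requests call is ever blocked as well, you could try sending a browser-like User-Agent header. This is only a guess at why the site returns 403, and the header value is an arbitrary example:

# Assumption: the 403 is based on the user agent, so spoof a browser-like one.
headers = {"User-Agent": "Mozilla/5.0"}
dfs = pd.read_html(
    requests.get(link, headers=headers).text,
    attrs={"class": "table table-hover persist-area"}
)
dfs[0].to_csv("Arsenal_players.csv")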
