
How to Convert an Online Txt File to a pandas DataFrame

I'm using requests and Beautiful Soup to navigate and download data from the Census webpage. I can get the data into a result object (and a soup object if I want one), but I can't seem to convert it into a DataFrame so that it can be appended with each of the other files. The data is stored online as a .txt file.

from bs4 import BeautifulSoup
import pandas as pd
import csv
import requests 
from json import loads
from bs4.dammit import EncodingDetector 
url = 'https://www2.census.gov/econ/bps/Place/West%20Region/'
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
region_soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
df = DataFrame()
for link in region_soup.find_all('a', href=True):
    links = str(link['href'])
    print(links)
    if links[-4:] == ".txt":
        result = requests.get(url + links).text
        df.append(pd.read_csv(result), ignore_index = True)

How do I convert the requests object into a DataFrame, and define the column names, etc.?

Off the bat, you import pandas as pd, so you need to use that prefix when calling the DataFrame() constructor. Secondly, pandas is not parsing the text into a CSV table; reading that raw text would take a bit more manipulation. Pandas can actually read the CSV straight from a URL, though, so just do that directly (a quick sketch of both options follows).
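For reference, here is a minimal sketch of both options; the file name below is just a placeholder for one of the .txt links on the listing page:

import io

import pandas as pd
import requests

url = 'https://www2.census.gov/econ/bps/Place/West%20Region/'
file_url = url + 'example_place_file.txt'  # placeholder name; use a real link from the page

# Option 1: let pandas fetch and parse the file straight from the URL.
df_direct = pd.read_csv(file_url)

# Option 2: if you already have the text from requests, wrap it in a
# file-like object before handing it to read_csv.
text = requests.get(file_url).text
df_from_text = pd.read_csv(io.StringIO(text))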

Finally, you need to store the appended DataFrame, so change

df.append(pd.read_csv(result), ignore_index = True)

to

df = df.append(pd.read_csv(result), ignore_index = True)

Code:

from bs4 import BeautifulSoup
import pandas as pd
import csv
import requests 
from json import loads
from bs4.dammit import EncodingDetector 


url = 'https://www2.census.gov/econ/bps/Place/West%20Region/'
parser = 'html.parser'  # or 'lxml' (preferred) or 'html5lib', if installed
resp = requests.get(url)
http_encoding = resp.encoding if 'charset' in resp.headers.get('content-type', '').lower() else None
html_encoding = EncodingDetector.find_declared_encoding(resp.content, is_html=True)
encoding = html_encoding or http_encoding
region_soup = BeautifulSoup(resp.content, parser, from_encoding=encoding)
df = pd.DataFrame()  # use the pd prefix from the import
for link in region_soup.find_all('a', href=True):
    links = str(link['href'])
    print(links)
    if links[-4:] == ".txt":
        result = pd.read_csv(url + links)  # let pandas read the csv straight from the url
        df = df.append(result, ignore_index = True)  # store the appended dataframe

Note:

You will get the warning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.

So I'd rather:

df_list = []
for link in region_soup.find_all('a', href=True):
    links = str(link['href'])
    print(links)
    if links[-4:] == ".txt":
        result = pd.read_csv(url + links)
        df_list.append(result)
        
df = pd.concat(df_list, ignore_index=True)
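As for defining the column names (the second part of the question), pd.read_csv accepts header= and names= parameters. A minimal sketch with purely illustrative names; adjust them to the fields that actually appear in the census files:

columns = ['col_a', 'col_b', 'col_c']  # illustrative names only
# header=0 discards the file's own header row and uses `names` instead;
# use header=None if the file has no header row at all.
result = pd.read_csv(url + links, header=0, names=columns)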


 