
Read space separated text file in pandas

I am trying to read the text file at the following URL into a pandas DataFrame: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/recent/TU_Stundenwerte_Beschreibung_Stationen.txt

The spacing between its columns is uneven. I have tried sep='\s+' and delim_whitespace=True, but neither works. Please suggest a way to read this text file into a pandas DataFrame.

The pandas read_fwf function reads a table of fixed-width formatted lines into a DataFrame.

The header lines confuse the automatic width detection, so it is best to skip them and supply the column names explicitly. In this case that means passing skiprows=2 together with names.

import pandas as pd

url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/hourly/air_temperature/recent/TU_Stundenwerte_Beschreibung_Stationen.txt'

# "ansi" is a Windows-only codec alias. On a Western European Windows locale it
# resolves to cp1252, and naming cp1252 explicitly also works on Linux and macOS.
df = pd.read_fwf(url, encoding="cp1252", skiprows=2,
                 names=['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',
                        'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland'])
print(df)

Output:

     Stations_id  von_datum  bis_datum  Stationshoehe  geoBreite  geoLaenge           Stationsname           Bundesland
0              3   19500401   20110331            202    50.7827     6.0941                 Aachen  Nordrhein-Westfalen
1             44   20070401   20220920             44    52.9336     8.2370           Großenkneten        Niedersachsen
2             52   19760101   19880101             46    53.6623    10.1990   Ahrensburg-Wulfsdorf   Schleswig-Holstein
3             71   20091201   20191231            759    48.2156     8.9784        Albstadt-Badkap    Baden-Württemberg
4             73   20070401   20220920            340    48.6159    13.0506   Aldersbach-Kriestorf               Bayern
..           ...        ...        ...            ...        ...        ...                    ...                  ...
663        19171   20200901   20220920             13    54.0038     9.8553     Hasenkrug-Hardebek   Schleswig-Holstein
664        19172   20200901   20220920             48    54.0246     9.3880                 Wacken   Schleswig-Holstein

[665 rows x 8 columns]
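One plausible reason the whitespace-based options fail is that some values contain spaces themselves, so the number of whitespace-separated tokens varies from row to row. The sketch below illustrates this offline on a small made-up sample, not the real DWD file: the layout, the reduced set of four columns, and the station name "Beispiel Stadt" are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Invented fixed-width sample (NOT the real DWD file). "Beispiel Stadt" is a
# hypothetical station name that contains a space.
row = "{:>5} {:>8} {:<20} {}".format
lines = [
    "Stations_id von_datum Stationsname Bundesland",
    "----------- --------- ------------ ----------",
    row(3, 19500401, "Aachen", "Nordrhein-Westfalen"),
    row(44, 20070401, "Beispiel Stadt", "Niedersachsen"),
    row(52, 19760101, "Ahrensburg-Wulfsdorf", "Schleswig-Holstein"),
]
sample = "\n".join(lines) + "\n"

# Splitting on whitespace gives a different number of fields per data row.
token_counts = [len(line.split()) for line in lines[2:]]
print(token_counts)  # [4, 5, 4]

# read_fwf works from character positions instead, so the embedded space
# stays inside the Stationsname column.
df_sample = pd.read_fwf(io.StringIO(sample), skiprows=2,
                        names=["Stations_id", "von_datum",
                               "Stationsname", "Bundesland"])
print(df_sample)
```

Because the two header lines are skipped, only the data rows take part in the width inference here.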

To read a local copy of the file instead, replace the URL with the local file name.

# cp1252 replaces the Windows-only "ansi" alias so the call also runs on Linux and macOS.
df = pd.read_fwf('TU_Stundenwerte_Beschreibung_Stationen.txt', encoding="cp1252", skiprows=2,
                 names=['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',
                        'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland'])
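If the inferred widths ever split a column in the wrong place, read_fwf also accepts explicit column boundaries through its colspecs parameter, given as half-open (start, end) character offsets. The following is only a sketch on an invented sample: the offsets match the made-up layout built in the code and would have to be measured on the real file, and "Beispiel Stadt" is a hypothetical station name.

```python
import io

import pandas as pd

# Invented fixed-width sample (NOT the real DWD file), built with a format
# string so the character offsets below are known exactly.
row = "{:>5} {:>8} {:<20} {}".format
fixed_text = "\n".join([
    "Stations_id von_datum Stationsname Bundesland",
    "----------- --------- ------------ ----------",
    row(3, 19500401, "Aachen", "Nordrhein-Westfalen"),
    row(44, 20070401, "Beispiel Stadt", "Niedersachsen"),
]) + "\n"

# Half-open (start, end) offsets for the layout above: a 5-character id,
# an 8-character date, a 20-character name, and the state.
colspecs = [(0, 5), (6, 14), (15, 35), (36, 55)]

df_explicit = pd.read_fwf(io.StringIO(fixed_text), colspecs=colspecs, skiprows=2,
                          names=["Stations_id", "von_datum",
                                 "Stationsname", "Bundesland"])
print(df_explicit)
```

With explicit offsets the result no longer depends on which rows happen to be sampled for inference, at the cost of having to update the offsets if the file layout changes.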
