
How to read url .txt files using pandas

I have a problem reading files using pandas ( read_csv ). I can do it with the built-in with open(...) , but it is much easier with pandas. I just need to read the data (numbers) between the ---- lines. This is the LINK to one of my data urls; there are more depending on the date that I insert. A sample of the file is:

                   MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020

NAME: Krieza Evias   CITY: Krieza Evias   STATE:  
ELEV:   119 m  LAT:  38° 24' 00" N  LONG:  24° 18' 00" E

                   TEMPERATURE (°C), RAIN  (mm), WIND SPEED (km/hr)

                                      HEAT  COOL        AVG
    MEAN                              DEG   DEG         WIND                 DOM
DAY TEMP  HIGH   TIME   LOW    TIME   DAYS  DAYS  RAIN  SPEED HIGH   TIME    DIR
------------------------------------------------------------------------------------
 1  18.2  22.4   10:20  13.5   23:50   1.0   0.9   0.0   4.5  33.8   12:30     E
 2  17.6  22.3   15:00  10.8    4:10   2.0   1.3   0.0   4.5  30.6   15:20     E
 3  18.1  21.9   12:20  14.1    3:40   1.3   1.1   1.0   4.2  24.1   14:40     E

Keep in mind that I cannot just use skiprows=8 and skipfooter=9 to get the data between the -------- , because not all files of this format have the same number of footer ( skipfooter ) or title ( skiprows ) lines to skip. Some have 2 or 3, and others have 8 or 9 lines of footer or title to skip. But every file has exactly 2 lines of -------- with the data between them.
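
For reference, a minimal sketch (assumed, not the asker's actual code) of the kind of with open(...) approach mentioned above, keeping only the rows between the two dashed lines; the file name and encoding are assumptions:

# Hypothetical local copy of the monthly summary file
rows = []
inside = False
with open("monthly_summary.txt", encoding="windows-1252") as f:
    for line in f:
        if line.lstrip().startswith("---"):
            if inside:          # second dashed line: stop collecting
                break
            inside = True       # first dashed line: start collecting
        elif inside:
            rows.append(line)   # a data row between the two dashed lines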

I think you can't use read_csv directly, but you could do this:

import urllib.request
from io import StringIO

import pandas as pd

count = 0
txt = ""
data = urllib.request.urlopen(LINK)       # LINK is the url of the .txt file
for line in data:
    decoded = line.decode('windows-1252')
    if "---" in decoded:
        count += 1                         # hit one of the two dashed separator lines
    elif count == 1:
        txt += decoded                     # between the two dashed lines: keep the row
    elif count >= 2:
        break                              # past the second dashed line: stop reading

df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None)

header is None because in your link the column names are not on a single row but split across multiple rows. If they are fixed, I suggest you put them in by hand, such as ["DAY", "MEAN TEMP", ...] .
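
For example, a sketch of passing hand-written names directly to read_csv (the 13 names below are assumptions flattened from the multi-row header in the sample above; adjust them to your file):

# Hypothetical flattened column names matching the 13 columns of the sample data rows
columns = ["DAY", "MEAN_TEMP", "HIGH", "HIGH_TIME", "LOW", "LOW_TIME",
           "HEAT_DEG_DAYS", "COOL_DEG_DAYS", "RAIN", "AVG_WIND_SPEED",
           "WIND_HIGH", "WIND_HIGH_TIME", "DOM_DIR"]
df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None, names=columns)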
