How to read .txt files from a URL using pandas
I have a problem reading files with pandas (read_csv). I can do it with the built-in with open(...), but it would be much easier with pandas. I just need to read the data (numbers) between the ---- lines.
This is the LINK to one of my data URLs. There are more, depending on the date that I enter. A sample of the file:
MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020

NAME: Krieza Evias  CITY: Krieza Evias  STATE:
ELEV: 119 m  LAT: 38° 24' 00" N  LONG: 24° 18' 00" E

TEMPERATURE (°C), RAIN (mm), WIND SPEED (km/hr)

                                   HEAT  COOL         AVG
     MEAN                           DEG   DEG        WIND               DOM
DAY  TEMP  HIGH  TIME   LOW  TIME  DAYS  DAYS  RAIN SPEED  HIGH  TIME   DIR
------------------------------------------------------------------------------------
  1  18.2  22.4 10:20  13.5 23:50   1.0   0.9   0.0   4.5  33.8 12:30     E
  2  17.6  22.3 15:00  10.8  4:10   2.0   1.3   0.0   4.5  30.6 15:20     E
  3  18.1  21.9 12:20  14.1  3:40   1.3   1.1   1.0   4.2  24.1 14:40     E
Keep in mind that I cannot just use skiprows=8 and skipfooter=9 to get the data between the -------- lines, because files in this format do not all have the same number of title lines (skiprows) or footer lines (skipfooter) to skip. Some have 2 or 3, others have 8 or 9. But every file has exactly 2 lines of -------- with the data between them.
I don't think you can use read_csv directly on the URL, but you could do this:
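import urllib.request
from io import StringIO

import pandas as pd

count = 0  # number of dashed lines seen so far
txt = ""
# LINK is the URL of the .txt file from the question
with urllib.request.urlopen(LINK) as data:
    for raw in data:
        line = raw.decode('windows-1252')
        if "---" in line:
            count += 1
            if count == 2:
                break  # second dashed line: the data block is finished
        elif count == 1:
            txt += line  # keep only the lines between the two dashed lines
df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None)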
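The same idea can be tried without any network access. The following is a minimal sketch, not a definitive implementation: the function name read_between_rules, its marker parameter, and the shortened sample text (including its footer line) are made up for illustration and only modelled on the file in the question.

```python
from io import StringIO

import pandas as pd


def read_between_rules(lines, marker="---"):
    """Parse the rows found between the first two lines containing `marker`."""
    rules_seen = 0
    kept = []
    for line in lines:
        if marker in line:
            rules_seen += 1
            if rules_seen == 2:
                break  # second dashed line: the data block is finished
        elif rules_seen == 1:
            kept.append(line.strip())  # a data row between the two rules
        # lines before the first rule (the title block) are simply ignored
    return pd.read_csv(StringIO("\n".join(kept)), sep=r"\s+", header=None)


# Hypothetical, shortened sample modelled on the file in the question
sample = """\
MONTHLY CLIMATOLOGICAL SUMMARY for JUN. 2020
DAY TEMP HIGH TIME LOW TIME DAYS DAYS RAIN SPEED HIGH TIME DIR
------------------------------------------------------------
 1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E
 2 17.6 22.3 15:00 10.8 4:10 2.0 1.3 0.0 4.5 30.6 15:20 E
 3 18.1 21.9 12:20 14.1 3:40 1.3 1.1 1.0 4.2 24.1 14:40 E
------------------------------------------------------------
Illustrative footer line
"""

df = read_between_rules(sample.splitlines())
print(df.shape)
```

Because the function only counts dashed lines, it does not matter how many title or footer lines a particular file has. For a real URL, the lines would first need to be decoded from bytes, for example with windows-1252 as above.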
header is None because in your link the column names are not on a single row but spread over several rows. If they are fixed, I suggest you set them by hand, e.g. ["DAY", "MEAN TEMP", ...].
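Setting the names by hand could look like the sketch below. The full list of labels is an assumption: it is read off the three header rows of the sample file, and the exact wording of each label is a guess that you may want to change. The names parameter of read_csv is what attaches them.

```python
from io import StringIO

import pandas as pd

# Labels typed by hand from the three header rows of the sample file.
# Their exact wording is an assumption; rename them as you see fit.
columns = [
    "DAY", "MEAN TEMP", "HIGH", "HIGH TIME", "LOW", "LOW TIME",
    "HEAT DEG DAYS", "COOL DEG DAYS", "RAIN", "AVG WIND SPEED",
    "WIND HIGH", "WIND HIGH TIME", "DOM DIR",
]

# Data rows copied from the sample in the question
txt = """\
1 18.2 22.4 10:20 13.5 23:50 1.0 0.9 0.0 4.5 33.8 12:30 E
2 17.6 22.3 15:00 10.8 4:10 2.0 1.3 0.0 4.5 30.6 15:20 E
3 18.1 21.9 12:20 14.1 3:40 1.3 1.1 1.0 4.2 24.1 14:40 E
"""

df = pd.read_csv(StringIO(txt), sep=r"\s+", header=None, names=columns)
print(df[["DAY", "MEAN TEMP", "RAIN"]])
```

The time columns are read as plain strings here; converting them to proper time values would be a separate step.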