
pandas: read .csv from URL when the starting row has fewer columns

I want to download a .csv file from this website (to directly download the csv, here). The problem I'm facing is that the row where I want to start importing has fewer columns than the rows in the later part, and I cannot figure out how to read it into pandas.

Admittedly, this csv file is rather messy.

[image]

Here is how I want to import the csv in pandas:

  1. Ignore the first row, which contains "Trade Date".

  2. Split the data into a separate data frame per section (using a for loop, splitting wherever there is a blank row).

  3. Store the JPX Code (such as 16509005) and the Instrument (such as FUT_TOPIX_2009) in additional columns.

  4. Set the headers to ['institutions_sell_code', 'institutions_sell', 'institutions_sell_eng', 'amount_sell', 'institutions_buy_code', 'institutions_buy', 'institutions_buy_eng', 'amount_buy', 'JPX_code', 'instrument'].

So the expected outcome will be:

[image]

Here is my attempt. I first tried to read the whole data into pandas:

import io

import pandas as pd
import requests

url = 'https://www.jpx.co.jp/markets/derivatives/participant-volume/nlsgeu000004vd5b-att/20200730_volume_by_participant_whole_day_J-NET.csv'
s = requests.get(url).content
colnames = ['institutions_sell_code', 'institutions_sell', 'institutions_sell_eng', 'amount_sell',
            'institutions_buy_code', 'institutions_buy', 'institutions_buy_eng', 'amount_buy']
df = pd.read_csv(io.StringIO(s.decode('utf-8')), header=1, names=colnames)

ParserError: Error tokenizing data. C error: Expected 2 fields in line 6, saw 8

I assume this is because the row selected by header=1 has just two columns whereas the other rows have eight. In fact, when I set header=2 to exclude the JPX Code and Instrument row, it works. So how can I include the row with the JPX Code and Instrument?


Pandas does not really support multiple documents in one CSV file like you have. What I have done to solve this, which worked fine, takes two steps:

  1. Call read_csv(usecols=[0]) once to read only the leftmost column. Use the result to determine where each table starts and ends.
  2. Open the file using open() just once and, for each table determined in step 1, call read_csv(skiprows=SKIP, nrows=ROWS) with appropriate values to read one table at a time. This is the key: by only letting Pandas read the properly rectangular rows, it will not become angry at the unhygienic nature of your CSV file.

Opening the file just once is an optimization, to avoid scanning it over and over every time you execute step 2. You can actually use the same opened file object for step 1 as well, if you seek(0) to return to the beginning before starting step 2.
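
The two steps can be sketched roughly as follows. This is an illustration under assumptions, not the exact code used for the real file: the sample text is invented, and it assumes each section consists of one two-field row (JPX code, instrument) followed by eight-field data rows, with sections separated by a blank line. The helper name read_sections is hypothetical. For simplicity the sketch rewinds the buffer with seek(0) before every read, so skiprows always counts from the top of the file.

```python
import io

import pandas as pd

COLNAMES = [
    "institutions_sell_code", "institutions_sell", "institutions_sell_eng", "amount_sell",
    "institutions_buy_code", "institutions_buy", "institutions_buy_eng", "amount_buy",
]

# Invented sample that only mimics the layout described in the question.
SAMPLE = (
    "Trade Date,20200730\n"
    "16509005,FUT_TOPIX_2009\n"
    "101,SellA,Sell A,50,201,BuyA,Buy A,40\n"
    "102,SellB,Sell B,30,202,BuyB,Buy B,20\n"
    "\n"
    "16509018,FUT_NK225_2009\n"
    "103,SellC,Sell C,10,203,BuyC,Buy C,5\n"
)


def read_sections(buf):
    """Return one DataFrame per blank-line-separated section of buf."""
    # Step 1: read only the leftmost column, keeping blank lines as NaN rows
    # so that row positions match physical line numbers.
    buf.seek(0)
    first = pd.read_csv(buf, header=None, usecols=[0],
                        skip_blank_lines=False, dtype=str)[0]
    is_blank = first.isna().tolist()

    # Row 0 is the "Trade Date" line, so look for sections from row 1 on.
    sections, start = [], None
    for i in range(1, len(is_blank)):
        if not is_blank[i] and start is None:
            start = i
        elif is_blank[i] and start is not None:
            sections.append((start, i))
            start = None
    if start is not None:
        sections.append((start, len(is_blank)))

    # Step 2: read each rectangular block on its own.
    frames = []
    for start, end in sections:
        # The first row of a section holds the JPX code and the instrument.
        buf.seek(0)
        meta = pd.read_csv(buf, header=None, usecols=[0, 1],
                           skiprows=start, nrows=1, dtype=str)
        # The remaining rows of the section are the eight-column data.
        buf.seek(0)
        data = pd.read_csv(buf, header=None, names=COLNAMES,
                           skiprows=start + 1, nrows=end - start - 1)
        data["JPX_code"] = meta.iloc[0, 0]
        data["instrument"] = meta.iloc[0, 1]
        frames.append(data)
    return frames


frames = read_sections(io.StringIO(SAMPLE))
for frame in frames:
    print(frame)
```

With the downloaded file, the same function could presumably be given io.StringIO(s.decode('utf-8')) from the question's code, after adjusting the section layout assumptions (for example, if each section also carries its own column-header row, skip one more row in step 2).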

Technical posts on this site are licensed under CC BY-SA 4.0; when reposting, cite this site's URL or the original source.