Is there a way in Python to delete several rows in a CSV file?

I'm currently working on downloading the form.idx file from sec.gov for the first quarter of 2016. Since I'm only interested in the 10-Ks, I wanted to download the file as a .csv file and delete the useless rows. I tried to filter by the form type, but that didn't work out.

My code so far is the following:

import requests
import os
import pandas as pd

years = [2016]

quarters = ['QTR1']

base_path = '/Users/xyz/Desktop'

current_dirs = os.listdir(path=base_path)

for yr in years:
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))
    
    current_files = os.listdir('/'.join([base_path, str(yr)]))
    
    for qtr in quarters:
        local_filename =  f'{yr}-{qtr}.csv'
        
    
        local_file_path = '/'.join([base_path, str(yr), local_filename])
        
        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue
        
        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'
        
        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)

r2 = pd.read_csv('/Users/xyz/Desktop/2016-QTR1.csv', sep=";", encoding="utf-8")
r2.head()
filt = (r2 ['Form Type'] == '10-K')
r2_10K = r2.loc[filt]
r2_10K.head()
r2_10K.to_csv('/Users/xyz/Desktop/modified.csv')

The error message I get is:
Traceback (most recent call last):

  File "<ipython-input-5-f84e3f81f3d1>", line 61, in <module>
    filt = (r2 ['Form Type'] == '10-K')

  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/frame.py", line 2906, in __getitem__
    indexer = self.columns.get_loc(key)

  File "/Users/xyz/opt/anaconda3/envs/spyder-4.1.5_1/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err

KeyError: 'Form Type'

Maybe there's a way to just delete the rows I don't need in the file? Otherwise, I'm also thankful for any kind of help with that problem.

Many thanks in advance.

Kind regards, Elena

There are various ways to delete rows from a CSV file. The pandas library in Python has many functions for altering data read from a CSV file. First, import pandas with the following code:

import pandas as pd

Read your CSV file with the following code:

df = pd.read_csv("filename.csv")

For example, if you have a DataFrame named df that contains your CSV data, you can drop rows by index with the following code:

df1 = df.drop([df.index[1], df.index[2]])

There are many other ways to drop rows with pandas, for example by row value, by null values, or by data type.
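As a rough illustration of those options, here is a minimal sketch on a small in-memory CSV. The column names and values are made up for the example and are not taken from the SEC file.

```python
import io

import pandas as pd

# Hypothetical sample data, used only to illustrate the row-dropping options.
csv_text = """form_type,company,cik
10-K,Alpha Corp,1
8-K,Beta Inc,2
10-K,,3
10-Q,Delta LLC,4
"""
df = pd.read_csv(io.StringIO(csv_text))

# Drop rows by position (here the second and third rows).
by_index = df.drop(df.index[[1, 2]])

# Keep only rows whose form_type equals "10-K" (i.e. drop all the others).
by_value = df[df["form_type"] == "10-K"]

# Drop rows where the company field is missing.
no_nulls = df.dropna(subset=["company"])

# to_csv() without a path returns the CSV text; pass a filename to write a file.
print(by_value.to_csv(index=False))
```

None of these calls modify the file on disk; the usual pattern is to read, filter, and write the result back out with to_csv.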

This is the full working code for you. The main issue was the format of the CSV you're getting from the site. Full code: https://rextester.com/QUGF24653
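To see why the format leads to KeyError: 'Form Type', here is a small sketch. It assumes the index file starts with free-text lines and uses runs of spaces rather than semicolons between fields; the sample text below is a made-up miniature, not the real file.

```python
import io

import pandas as pd

# Hypothetical miniature of a space-aligned index file with a text preamble.
sample = (
    "Description:   Sample index (hypothetical)\n"
    "\n"
    "Form Type   Company Name        CIK      Date Filed  File Name\n"
    "-----------------------------------------------------------\n"
    "10-K        1ST SOURCE CORP     34782    2016-02-19  "
    "edgar/data/34782/0000034782-16-000102.txt\n"
)

# With sep=";" nothing is split, so each line becomes a single field and the
# first line of the file is taken as the only column name.
df = pd.read_csv(io.StringIO(sample), sep=";")

print(df.columns.tolist())
print("Form Type" in df.columns)
```

Since no column is called 'Form Type' after such a read, r2['Form Type'] has nothing to look up and raises the KeyError.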

What I did:

  1. Skipped the first 10 rows
  2. Split the columns on a separator of three or more spaces and set the column names
  3. Split the last column into 2 new columns
  4. Filtered Form Type for "10-K"

import requests
import os
import pandas as pd

years = [2016]
quarters = ['QTR1']
base_path = '/Users/xyz/Desktop'

current_dirs = os.listdir(path=base_path)

for yr in years:
    # Create a folder for the year if it does not exist yet
    if str(yr) not in current_dirs:
        os.mkdir('/'.join([base_path, str(yr)]))

    current_files = os.listdir('/'.join([base_path, str(yr)]))

    for qtr in quarters:
        local_filename = f'{yr}-{qtr}.csv'
        local_file_path = '/'.join([base_path, str(yr), local_filename])

        if local_filename in current_files:
            print(f'Skipping file for {yr}, {qtr} because it is already saved.')
            continue

        url = f'https://www.sec.gov/Archives/edgar/full-index/{yr}/{qtr}/form.idx'

        r = requests.get(url, stream=True)
        with open(local_file_path, 'wb') as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)

colnames = ['Form Type', 'Company Name', 'CIK', 'Date Filed', 'File Name']

# Read the file saved by the loop above (it lives in the year subfolder).
# A regex separator needs the Python parser engine.
r2 = pd.read_csv(local_file_path, sep=r'\s{3,}', skiprows=10,
                 encoding="utf-8", names=colnames, header=None,
                 engine='python')

# The date and the file name end up in the same column; split them on whitespace
r2[['Date Filed', 'File Name']] = r2['Date Filed'].str.split(expand=True)

filtered = (r2['Form Type'] == '10-K')
r2_10K = r2.loc[filtered]
print(r2_10K.head())

Output:

   Form Type                            Company Name      CIK  Date Filed                                    File Name
2181      10-K                       1347 Capital Corp  1606163  2016-03-21  edgar/data/1606163/0001144204-16-089184.txt
2182      10-K  1347 Property Insurance Holdings, Inc.  1591890  2016-03-17  edgar/data/1591890/0001387131-16-004603.txt
2183      10-K                1ST CONSTITUTION BANCORP  1141807  2016-03-22  edgar/data/1141807/0001141807-16-000010.txt
2184      10-K                         1ST SOURCE CORP    34782  2016-02-19    edgar/data/34782/0000034782-16-000102.txt
2185      10-K            1st Century Bancshares, Inc.  1420525  2016-03-04  edgar/data/1420525/0001437749-16-026765.txt
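If you would rather not use pandas, the same steps (skip the header lines, split on runs of spaces, keep only the 10-K rows, write a CSV) can be sketched with the standard library alone. This is only a sketch under assumptions: it splits on two or more spaces so the date and file name separate in one pass, it assumes no company name contains a double space, and the helper name filter_index_lines, the shortened sample lines, and the 8-K row are hypothetical.

```python
import csv
import io
import re

COLUMNS = ["Form Type", "Company Name", "CIK", "Date Filed", "File Name"]


def filter_index_lines(lines, form_type="10-K", skip=10):
    """Return the parsed rows whose first field equals form_type."""
    rows = []
    for line in lines[skip:]:
        # Fields are separated by runs of at least two spaces.
        fields = re.split(r"\s{2,}", line.strip())
        # Ignore lines that do not split into exactly five fields.
        if len(fields) == len(COLUMNS) and fields[0] == form_type:
            rows.append(fields)
    return rows


# Shortened sample: a header line, a dashed rule, then three records.
sample_lines = [
    "Form Type   Company Name                CIK      Date Filed  File Name",
    "----------------------------------------------------------------------",
    "10-K        1347 Capital Corp           1606163  2016-03-21  edgar/data/1606163/0001144204-16-089184.txt",
    # Hypothetical row, included only to show a record being filtered out.
    "8-K         Example Holdings Inc        0000000  2016-01-05  edgar/data/0000000/example.txt",
    "10-K        1ST SOURCE CORP             34782    2016-02-19  edgar/data/34782/0000034782-16-000102.txt",
]

rows = filter_index_lines(sample_lines, skip=2)

# Write to an in-memory buffer here; use open(path, "w", newline="") for a file.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(COLUMNS)
writer.writerows(rows)
print(out.getvalue())
```

For the downloaded file you would pass the list of lines read from it and keep skip=10, matching the skiprows=10 used in the pandas version.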
