
Pandas read_csv - Ignore Escape Char in Semicolon-Separated File

I am trying to load a semicolon-separated txt file, and in a few places the data contains HTML escape sequences. These are typically &lt; (the HTML escape for <), each of which adds a semicolon. This obviously messes up my data, and since dtypes are important, it causes read_csv to fail. Is there a way to tell pandas to ignore these when the file is read?

I tried deleting the characters from the file by hand and it works now, but since I want an automated process over millions of rows, this is not sustainable.

import pandas as pd

# column_names, counters and dtypes are defined elsewhere
df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)

ValueError: could not convert string to float:

My first column is a string and the second is a float, but when the first column is split at the semicolon in &lt;, the remainder spills into the second column.
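To illustrate the failure, here is a minimal sketch. The sample row is made up for illustration and is not taken from the real file; it only shows how the semicolon inside &lt; yields an extra field.

```python
# Hypothetical sample row: a string column followed by a float column.
row = "a&lt;b;1.5"

# Splitting on the delimiter also splits inside the escape sequence.
fields = row.split(";")
print(fields)  # ['a&lt', 'b', '1.5']

# The second field is now 'b' rather than '1.5', so float conversion fails.
try:
    float(fields[1])
except ValueError as exc:
    print(exc)  # could not convert string to float: 'b'
```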

Is there a way to tell pandas to ignore these, or to remove them efficiently before loading?

Given the following example CSV file so57732330.csv:

col1;col2
1&lt;2;a
3;

we read it using StringIO after unescaping named and numeric HTML5 character references:

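import html
import io

import pandas as pd

# Read the raw text and replace HTML character references (e.g. &lt;) with
# the characters they stand for, so the stray semicolons disappear.
with open('so57732330.csv') as f:
    s = f.read()

# Wrap the cleaned text in a file-like object and parse it as usual.
buf = io.StringIO(html.unescape(s))
df = pd.read_csv(buf, sep=';')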

Result:

  col1 col2
0  1<2    a
1    3  NaN
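
For files with millions of rows, holding the whole file in one string may be undesirable. A possible variant, sketched here under the assumption that the file is UTF-8 and that writing a cleaned copy to disk is acceptable, unescapes one line at a time. The function name unescape_file and the destination path are hypothetical and not part of pandas or the original answer.

```python
import html


def unescape_file(src_path, dst_path, encoding="utf-8"):
    """Copy src_path to dst_path, unescaping HTML character references per line."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, encoding=encoding, newline="") as src, \
         open(dst_path, "w", encoding=encoding, newline="") as dst:
        for line in src:
            # Only one line is held in memory at a time.
            dst.write(html.unescape(line))
```

The cleaned copy could then be loaded with the same read_csv call as before, for example pd.read_csv(dst_path, sep=';'). One caveat: html.unescape also turns references such as &#59; or &quot; into a literal semicolon or quote, so data containing those may still need separate handling.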
