Pandas read_csv - ignore escape characters in a semicolon-separated file

Question

I am trying to load a semicolon-separated txt file, and in a few places the data contains escaped characters. These are typically &lt; (the HTML entity for <), which adds a semicolon. This obviously messes up my data and, since dtypes are important, causes read_csv problems. Is there a way to tell pandas to ignore these when the file is read?

I tried deleting the characters from the file and it works now, but since I want an automated process over millions of rows, this is not sustainable.

df = pd.read_csv('file_loc.csv',
                 header=None,
                 names=column_names,
                 usecols=counters,
                 dtype=dtypes,
                 delimiter=';',
                 low_memory=False)

ValueError: could not convert string to float:

This happens because my first column is a string and the second is a float: when the first is split at the semicolon in &lt;, the leftover text lands in the second column.
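To make the failure concrete, here is a minimal sketch using a made-up row (the values are an assumption, not taken from the real file). It shows how the semicolon inside the entity produces an extra field, and how decoding the entity first restores the expected two fields:

```python
import html

# Hypothetical row: a string column followed by a float column.
row = "a&lt;b;1.5"

# Splitting the raw text treats the entity's semicolon as a delimiter.
raw_fields = row.split(";")
print(raw_fields)    # ['a&lt', 'b', '1.5']

# Decoding character references first leaves only the real delimiter.
clean_fields = html.unescape(row).split(";")
print(clean_fields)  # ['a<b', '1.5']
```

With the raw split, the second field is 'b', which cannot be converted to a float.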

Is there a way to tell pandas to ignore these, or to remove them efficiently before loading?

Answer 1

Given the following example CSV file so57732330.csv:

col1;col2
1&lt;2;a
3;

we read it using StringIO after unescaping named and numeric HTML5 character references:

import pandas as pd
import io
import html

# Read the raw text, then decode character references such as &lt; before parsing.
with open('so57732330.csv') as f:
    s = f.read()
buf = io.StringIO(html.unescape(s))
df = pd.read_csv(buf, sep=';')

Result:

  col1 col2
0  1<2    a
1    3  NaN
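
The same idea can be applied to a file shaped like the one in the question (no header row, explicit column names and dtypes). The following is only a sketch under assumptions: the helper name read_unescaped_csv, the column names, and the sample rows are all hypothetical.

```python
import html
import io

import pandas as pd


def read_unescaped_csv(text, **read_csv_kwargs):
    """Decode HTML character references, then parse the result with pandas."""
    return pd.read_csv(io.StringIO(html.unescape(text)), **read_csv_kwargs)


# Hypothetical data: a string column and a float column, no header row.
sample = "a&lt;b;1.5\nc;2.0\n"

df = read_unescaped_csv(
    sample,
    sep=";",
    header=None,
    names=["label", "value"],
    dtype={"label": str, "value": float},
)
print(df)
```

Two caveats apply. html.unescape decodes every character reference, so one that decodes to a semicolon (for example &#59; or &semi;) would reintroduce a delimiter. The approach also holds the whole decoded text in memory, which may matter for very large files.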

Answer 1: accepted, 2019-08-30 21:20:42
