
Reading CSV files with python (pandas) when there is an HTML-escaped string in them

I'm trying to read a CSV file with pandas read_csv. The data looks like this (example):

thing;weight;price;colour
apple;1;2;red
m &amp; m's;0;10;several
cherry;0,5;2;dark red

Because of the HTML-escaped ampersand (&amp;), pandas thinks the second row contains 5 fields. How can I make sure that it gets read correctly?

The example here is pretty much what my data looks like: the separator is ";", there are no string quotes, and the encoding is cp1251. The data I receive is pretty big, and reading it must run in one step (meaning no preprocessing outside of python).

I didn't find any reference in the pandas docs (I'm using pandas 0.19 with python 3.5.1). Any suggestions? Thanks in advance.

Unescape the HTML character references:

import html
with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w', encoding='utf-8') as g:
    content = html.unescape(f.read())
    g.write(content)
print(content)
# thing;weight;price;colour
# apple;1;2;red
# m & m's;0;10;several
# cherry;0,5;2;dark red

Then load the csv in the usual way:

import pandas as pd
df = pd.read_csv('data-fixed.csv', sep=';')
print(df)

yields

     thing weight  price    colour
0    apple      1      2       red
1  m & m's      0     10   several
2   cherry    0,5      2  dark red

Although the data file is "pretty big", you appear to have enough memory to read it into a DataFrame. Therefore you should also have enough memory to read the file into a single string with f.read(). Unescaping the HTML with one call to html.unescape is faster than calling html.unescape on many smaller strings. This is why I suggest using

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w', encoding='utf-8') as g:
    content = html.unescape(f.read())
    g.write(content)

instead of something like

with open('data.csv', 'r', encoding='cp1251') as f, open('data-fixed.csv', 'w', encoding='utf-8') as g:
    for line in f:
        g.write(html.unescape(line))
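
The speed difference between the two approaches can be checked with a rough timing sketch. The data here is synthetic (the sample rows repeated many times, which is an assumption and not the asker's real file), and the helper names unescape_whole and unescape_per_line are made up for illustration. Actual timings depend on the machine and on how many lines contain character references, so no numbers are quoted.

```python
import html
import timeit

# Synthetic stand-in for the real file (an assumption): two sample rows repeated.
lines = ["m &amp; m's;0;10;several\n", "apple;1;2;red\n"] * 50000
text = "".join(lines)


def unescape_whole():
    # One call on the full text, as in f.read().
    return html.unescape(text)


def unescape_per_line():
    # One call per line, as in the "for line in f" loop.
    return "".join(html.unescape(line) for line in lines)


# Both approaches must produce the same text; only the cost differs.
assert unescape_whole() == unescape_per_line()

print("whole text:", timeit.timeit(unescape_whole, number=5))
print("per line:  ", timeit.timeit(unescape_per_line, number=5))
```

If memory is tight, the per-line loop is still a valid fallback, since a character reference never spans a line break in data like this.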

If you need to read this data file more than once, then it pays to fix it once and save the result to disk, so you don't need to call html.unescape every time you wish to parse the data. That's why I suggest writing the unescaped contents to data-fixed.csv.

If reading this data is a one-off task and you wish to avoid the performance or resource cost of writing to disk, then you could use a StringIO (an in-memory file-like object):

from io import StringIO
import html
import pandas as pd

with open('data.csv', 'r', encoding='cp1251') as f:
    content = html.unescape(f.read())
df = pd.read_csv(StringIO(content), sep=';')
print(df)

You can use a regex as the separator for pandas.read_csv. In your specific case you can try:

pd.read_csv("test.csv", sep="(?<!&amp);", engine="python")  # regex separators need the python engine
#         thing weight  price    colour
#0        apple      1      2       red
#1  m &amp; m's      0     10   several
#2       cherry    0,5      2  dark red

This splits on every ; that is not preceded by &amp. Note that the values themselves stay escaped (m &amp; m's in the output above). The same idea can be extended to other escaped characters.
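
One way the extension to other escaped characters might look is sketched below. Python's re module only accepts fixed-width lookbehinds, so each entity gets its own (?<!...) group. The entity list, the inline sample data and the extra "a &lt; b" row are assumptions made for illustration. Numeric references such as &#39; vary in length and are not covered by this pattern. The last step unescapes the text values after parsing, which the regex alone does not do.

```python
import html
from io import StringIO

import pandas as pd

# Inline sample data (an assumption); a real file would be passed by path instead.
raw = (
    "thing;weight;price;colour\n"
    "apple;1;2;red\n"
    "m &amp; m's;0;10;several\n"
    "a &lt; b;0,5;2;dark red\n"
)

# Named entities to protect (an assumed list, extend as needed).
entities = ["amp", "lt", "gt", "quot"]

# One fixed-width negative lookbehind per entity, followed by the separator.
sep = "".join("(?<!&{})".format(name) for name in entities) + ";"

df = pd.read_csv(StringIO(raw), sep=sep, engine="python")

# Unescape every string value; numeric values are left untouched.
for col in df.columns:
    df[col] = df[col].map(lambda v: html.unescape(v) if isinstance(v, str) else v)

print(df)
```

A field that genuinely ends in text like "&amp" right before a separator would be misread by this pattern, so unescaping the whole file first remains the more robust route.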
