
Pandas: unable to read a CSV file that contains an extra quote character

I have a CSV file with the following entries:

"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"

The problem is the value 8 inches", which contains an extra quote character: I am unable to read the CSV using read_csv().

pd.read_csv(io.BytesIO(obj['Body'].read()), sep="|",
            quoting=1,
            engine='c', error_bad_lines=False, warn_bad_lines=True,
            encoding="utf-8", converters=pandas_config['converters'],
            skipinitialspace=True, escapechar='\"')

Is there a way to handle the quote character inside the cell?

Start by passing parameters appropriate for this case:

  1. sep='[|,]' - the data uses two separators, a pipe character and a comma, so define them together as a regex.
  2. skipinitialspace=True - your source text contains extra spaces after the separators, so you should drop them.
  3. engine='python' - regex separators are supported only by the Python engine, so naming it explicitly suppresses the warning about "Falling back to the 'python' engine".

The options above are enough for read_csv to run without an error, but the downside (for now) is that the double quotes remain in the values.
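As a rough illustration of this intermediate state, the sketch below applies only the three options listed above. The sample text is copied from the question and assumed to be the whole input; the variable name txt is just a placeholder.

```python
import io
import pandas as pd

# Sample text copied from the question (assumed to be the complete input).
txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

# The regex separator makes pandas split each line on '|' or ','.
df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python')

print(df.shape)       # one data row with five columns
print(df.iloc[0, 3])  # the quote characters are still part of the value
```

The call succeeds, but every value still carries its quote characters, which is what the converter in the next step removes.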

To eliminate them, at least from the data rows, another trick is needed.

Define a converter (lambda) function:

cnv = lambda txt: txt.replace('"', '')

and apply it to all source columns.

In your case you have 5 columns, so to keep the code concise, you can use a dictionary comprehension:

{ i: cnv for i in range(5) }

So the whole code can be:

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
    engine='python', converters={ i: cnv for i in range(5) })

and the result is:

  "column1"  "column2"       "column3"  "column4"  "column5"
0      123    sometext   this somedata   8 inches      hello

But remember that all columns are now of string type, so you should convert the columns that need it to numbers. An alternative is to pass a second converter for the numeric columns, one that returns a number instead of a string.
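A minimal sketch of the second-converter idea, assuming the first column is the only numeric one. The helper names clean_text and clean_int are hypothetical, and the extra .strip() is an added precaution against whitespace left around the separators, not part of the original converter.

```python
import io
import pandas as pd

# Sample text copied from the question (assumed to be the complete input).
txt = '''"column1"| "column2"| "column3"| "column4"| "column5"
"123" | "sometext", "this somedata", "8 inches"", "hello"'''

def clean_text(value):
    # Hypothetical helper: drop quote characters and surrounding whitespace.
    return value.replace('"', '').strip()

def clean_int(value):
    # Hypothetical helper: same cleanup, then convert to an integer.
    return int(clean_text(value))

# Text converter for every column, then override column 0 (assumed numeric).
converters = {i: clean_text for i in range(5)}
converters[0] = clean_int

df = pd.read_csv(io.StringIO(txt), sep='[|,]', skipinitialspace=True,
                 engine='python', converters=converters)

print(df.iloc[0, 0] + 1)  # arithmetic works because the value is a number
```

Converting afterwards with pd.to_numeric on the cleaned column would be an equally reasonable choice; the converter approach simply does both steps during parsing.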

To get proper column names (without double quotes), you can pass additional parameters:

  • skiprows=1 - to omit the initial (header) line,
  • names=["column1", "column2", "column3", "column4", "column5"] - to define the column list on your own.

We can specify a somewhat more complicated separator, read the data, and then strip the leftover quote characters:

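import io
import pandas as pd

# Test data:
text = '''"column1"| "column2"| "column3"| "column4"| "column5" 
        "123" | "sometext", "this somedata", "8 inches"", "hello"'''
ff = io.StringIO(text)

# Split on: closing quote, optional spaces, '|' or ',', optional spaces, opening quote.
df = pd.read_csv(ff, sep=r'"\s*[|,]\s*"', engine="python")
# Make it tidy: strip the quotes left at the outer edges of each row.
df = df.transform(lambda s: s.str.strip('"'))
# The first and last header names still carry a quote, so rename them.
df.columns = ["column1"] + list(df.columns[1:-1]) + ["column5"]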

The result:

  column1   column2        column3   column4 column5
0     123  sometext  this somedata  8 inches   hello
