
Problems reading CSV file with commas in Pandas

This is an extension of the question "Problems reading CSV file with commas and characters in pandas".

The solution provided in the above link works only if exactly one column contains commas in its values and the rest of the columns are well behaved.

What if more than one column has this issue?

Example CSV content with the extra-commas issue:

Name,Age,Address,Phone,Qualification
Suresh,28,Texas,3334567892,B.Tech
Ramesh,24,NewYork, NY,8978974040,9991111234,Ph.D
Mukesh,26,Dallas,4547892345,Ph.D

Required output pandas DataFrame:

Name    Age  Address      Phone                  Qualification
Suresh  28   Texas        3334567892             B.Tech
Ramesh  24   NewYork, NY  8978974040,9991111234  Ph.D
Mukesh  26   Dallas       4547892345             Ph.D

Edited:

Input file with commas as characters in successive columns:

Name,Age,Address,Qualification,Grade
Suresh,28,Texas,B.Tech,Ph.D,A
Ramesh,24,NewYork, NY,B.Tech,A+
Mukesh,26,Dallas,B.Tech,Ph.D,A

Required output pandas DataFrame:

Name    Age  Address      Qualification Grade                  
Suresh  28   Texas        B.Tech,Ph.D   A
Ramesh  24   NewYork, NY  B.Tech        A+
Mukesh  26   Dallas       B.Tech,Ph.D   A

Can I get any suggestions to solve this issue?

Thanks in advance!

One way to do this would be to wrap the fields that contain commas in double quotes (") so the data is clearly delimited:

Name,Age,Address,Phone,Qualification
Suresh,28,Texas,3334567892,B.Tech
Ramesh,24,"NewYork, NY","8978974040,9991111234",Ph.D
Mukesh,26,Dallas,4547892345,Ph.D

Without the quotes, pandas will struggle to read the file correctly.

Copy the above data, run pd.read_clipboard(sep=','), and it will yield:

     Name  Age      Address                  Phone Qualification
0  Suresh   28        Texas             3334567892        B.Tech
1  Ramesh   24  NewYork, NY  8978974040,9991111234          Ph.D
2  Mukesh   26       Dallas             4547892345          Ph.D
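If you control the code that exports the file, the quoting can be produced automatically. The following is only a sketch under that assumption: it uses Python's standard csv module with in-memory text (the `rows` list and the `buffer` name are illustrative, not part of the original data pipeline) and then checks that the quoted text parses back into five fields per row.

```python
import csv
import io

rows = [
    ["Name", "Age", "Address", "Phone", "Qualification"],
    ["Suresh", "28", "Texas", "3334567892", "B.Tech"],
    ["Ramesh", "24", "NewYork, NY", "8978974040,9991111234", "Ph.D"],
    ["Mukesh", "26", "Dallas", "4547892345", "Ph.D"],
]

# csv.writer defaults to QUOTE_MINIMAL: only fields that contain the
# delimiter, the quote character or a line break are wrapped in quotes.
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
quoted_text = buffer.getvalue()
print(quoted_text)

# Reading the quoted text back yields the original five fields per row.
parsed = list(csv.reader(io.StringIO(quoted_text)))
print(parsed[2])
```

A file written this way should be readable by a plain read_csv call, because the embedded commas sit inside quoted fields.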

If modifying the source data as a whole is not within your means:

A practical approach would be to do a usual read_csv with error_bad_lines=False (deprecated since pandas 1.3 and removed in 2.0; use on_bad_lines='warn' there). Once done, look through the logs, note the lines that pandas failed to read, and fix only those lines.

Hope this helps.
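As an alternative to reading the warnings, the offending lines can be listed directly. This is a minimal sketch using only the standard csv module; `find_bad_lines` is a hypothetical helper name, and the sample text stands in for the real file.

```python
import csv
import io

def find_bad_lines(lines):
    """Return (line_number, field_count) for rows wider or narrower than the header."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header is None:
        return []  # empty input, nothing to check
    expected = len(header)
    bad = []
    for row in reader:
        if len(row) != expected:
            # reader.line_num is the 1-based number of the last line read
            bad.append((reader.line_num, len(row)))
    return bad

sample = io.StringIO(
    "Name,Age,Address,Phone,Qualification\n"
    "Suresh,28,Texas,3334567892,B.Tech\n"
    "Ramesh,24,NewYork, NY,8978974040,9991111234,Ph.D\n"
    "Mukesh,26,Dallas,4547892345,Ph.D\n"
)
bad_lines = find_bad_lines(sample)
print(bad_lines)  # [(3, 7)]
```

For a real file, pass an object opened with open('input.csv', newline='') instead of the StringIO sample.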

Your data appears fixed for the first two columns and also the last, so these can be set aside and the remaining values processed using itertools.groupby() to group them into numeric and non-numeric runs. The resulting data can then be loaded into pandas:

import pandas as pd
from itertools import groupby
import csv

data = []

with open('input.csv', newline='') as f_input:
    csv_input = csv.reader(f_input)
    header = next(csv_input)

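    for row in csv_input:
        # Join consecutive digit-only fields and consecutive non-digit fields
        # that sit between Age and Qualification
        addr_phone = [','.join(g) for k, g in groupby(row[2:-1], lambda x: x.isdigit())]
        # Rebuild the row as Name, Age, Address, Phone, Qualification
        data.append(row[:2] + addr_phone + [row[-1]])

df = pd.DataFrame(data, columns=header)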
print(df)

Giving you:

     Name Age      Address                  Phone Qualification
0  Suresh  28        Texas             3334567892        B.Tech
1  Ramesh  24  NewYork, NY  8978974040,9991111234          Ph.D
2  Mukesh  26       Dallas             4547892345          Ph.D
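To see what the grouping step does, here is a small worked example on the middle fields of the "Ramesh" row, written as csv.reader would return them without skipinitialspace (the `middle` list is typed in by hand for illustration).

```python
from itertools import groupby

# Fields between Age and Qualification for the "Ramesh" row.
# Note the leading space in ' NY', kept because skipinitialspace is off.
middle = ['NewYork', ' NY', '8978974040', '9991111234']

# groupby starts a new group each time the key (digit-only or not) changes.
groups = [(is_digit, list(values)) for is_digit, values in groupby(middle, str.isdigit)]
print(groups)
# [(False, ['NewYork', ' NY']), (True, ['8978974040', '9991111234'])]

# Joining each group with ',' restores the two intended columns.
merged = [','.join(values) for _, values in groups]
print(merged)
# ['NewYork, NY', '8978974040,9991111234']
```

One limitation follows from the key function: an address fragment made only of digits (a ZIP code, for instance) would be grouped with the phone numbers, so this approach relies on the address being non-numeric.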

To work with your second example, you would have to decide on a way to split the two columns. I would suggest you create a list of possible qualifications. When a field matches one of them, you can split at that point. For example:

import pandas as pd
import csv

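def find_split(data):
    # Split at the first known qualification: fields before it form the
    # address, that field and everything after it form the qualifications
    for index, v in enumerate(data):
        if v.lower() in ['b.tech', 'ph.d']:
            return [', '.join(data[:index]), ', '.join(data[index:])]
    # No known qualification found: treat every field as part of the address
    return [', '.join(data), '']

data = []

with open('input.csv', newline='') as f_input:
    # skipinitialspace=True drops the space after a comma (' NY' becomes 'NY')
    csv_input = csv.reader(f_input, skipinitialspace=True)
    header = next(csv_input)

    for row in csv_input:
        data.append(row[:2] + find_split(row[2:-1]) + [row[-1]])

df = pd.DataFrame(data, columns=header)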
print(df)

Giving you:

     Name Age      Address Qualification Grade
0  Suresh  28        Texas  B.Tech, Ph.D     A
1  Ramesh  24  NewYork, NY        B.Tech    A+
2  Mukesh  26       Dallas  B.Tech, Ph.D     A

You could create a list of qualifications by first creating a set() based on the lowercased contents of the fields between Age and Grade (row[2:-1]). Print the contents of the set, pick out the entries that are qualifications, add them to the list in the script, and rerun it.
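A sketch of that discovery step, assuming qualifications can land in any field between Age and Grade, so it collects row[2:-1] rather than a single column. `candidate_values` is a hypothetical helper name, and the sample text stands in for the real file.

```python
import csv
import io

def candidate_values(lines):
    """Collect the distinct lowercased values found between Age and Grade."""
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    seen = set()
    for row in reader:
        seen.update(value.lower() for value in row[2:-1])
    return seen

sample = io.StringIO(
    "Name,Age,Address,Qualification,Grade\n"
    "Suresh,28,Texas,B.Tech,Ph.D,A\n"
    "Ramesh,24,NewYork, NY,B.Tech,A+\n"
    "Mukesh,26,Dallas,B.Tech,Ph.D,A\n"
)
candidates = candidate_values(sample)
print(sorted(candidates))
# ['b.tech', 'dallas', 'newyork', 'ny', 'ph.d', 'texas']
```

The set mixes address fragments with qualifications, so the final list still has to be chosen by hand; here that would be 'b.tech' and 'ph.d'.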
