How can I use pandas to read a CSV file whose last field contains the delimiter?

Question

I have raw data in the following format:

JobID,Publish,Expire,TitleAndDetail
7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....
7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....

As you can see, the delimiter is a comma, but the last column is a chunk of text that itself contains commas. I am using pandas' read_csv function to read this CSV file. However, in the resulting DataFrame the text after the fourth comma in each line is lost.

raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       names=['JobID', 'Publish', 'Expire', 'TitleAndDetail'],
                       header=None
                       )

With the str.split() method I can specify a maxsplit parameter, which lets me keep all the content in the last column even if it contains many commas. Is there similar functionality in pandas?

Answer 1

Here's a bit of a hack you could try:

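raw_data = pd.read_csv(r"/ABC/JobDetails.csv",
                       sep="\a"  # a character that should not occur in the data
                       ).squeeze("columns")  # the squeeze=True argument was removed in pandas 2.0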

This should give you a Series with one whole line per element, because the commas are no longer treated as separators.

Then you can do:

df = raw_data.str.split(",", n=3, expand=True)  # n=3: at most 3 splits, i.e. 4 columns
df.columns = ['JobID', 'Publish', 'Expire', 'TitleAndDetail']

That should split each line into 4 columns and rename them.
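
To try this without touching a file, here is a minimal sketch run against the two sample rows from the question. It assumes a recent pandas, where the Series is obtained with .squeeze("columns"), and that the bell character "\a" never appears in the data. SAMPLE and the variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the sample rows from the question.
SAMPLE = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....\n"
)

# "\a" is assumed never to occur in the data, so every line is read as one field.
# The header line becomes the name of that single column.
lines = pd.read_csv(io.StringIO(SAMPLE), sep="\a").squeeze("columns")

# n=3 allows at most three splits, so the remaining commas stay in the fourth column.
df = lines.str.split(",", n=3, expand=True)
df.columns = ["JobID", "Publish", "Expire", "TitleAndDetail"]

print(df["TitleAndDetail"].tolist())
```

Note that every column comes back as a string with this approach, so numeric or date columns would still need converting afterwards.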

Answer 2

You can do it this way:

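import csv
import pandas as pd

with open("file.csv", "r", newline="") as fp:
    reader = csv.reader(fp, delimiter=",")
    # keep the first 3 fields and re-join everything after them into one field
    rows = [x[:3] + [','.join(x[3:])] for x in reader]

df = pd.DataFrame(rows)
df.columns = df.iloc[0]            # the first row holds the header
df = df.reindex(df.index.drop(0))  # drop the header row from the data
print(df)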
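
The same idea can be wrapped in a small helper that takes the column names straight from the header row. This is only a sketch: read_with_merged_tail, n_fixed and SAMPLE are hypothetical names, and it assumes every data row has at least n_fixed fields. Because csv.reader interprets quoting, re-joining with "," reproduces the original text exactly only when the trailing fields contain no quoted values.

```python
import csv
import io
import pandas as pd

# Hypothetical in-memory copy of the sample rows from the question.
SAMPLE = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....\n"
)

def read_with_merged_tail(fp, n_fixed=3):
    """Read CSV rows, merging every field after the first n_fixed into one."""
    reader = csv.reader(fp, delimiter=",")
    header = next(reader)  # first row supplies the column names
    # Skip blank lines; glue the overflow fields back together with commas.
    rows = [row[:n_fixed] + [",".join(row[n_fixed:])] for row in reader if row]
    return pd.DataFrame(rows, columns=header)

df2 = read_with_merged_tail(io.StringIO(SAMPLE))
print(df2)
```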

Answer 3

Read the file manually and then create the DataFrame:

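import pandas as pd

rows = []

with open('somefile.csv') as f:
    # strip the newline so the last key is 'TitleAndDetail', not 'TitleAndDetail\n'
    keys = next(f).rstrip('\n').split(',')
    for line in f:
        rows.append(dict(zip(keys, line.rstrip('\n').split(',', 3))))

df = pd.DataFrame(rows)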

.split takes an optional parameter that limits the number of times it splits on the delimiter. Passing 3 means the commas in your last field are left alone:

>>> s.split(',', 3)
['7428',
 '17/12/2006 2:00:00 PM',
 '28/01/2007 2:00:00 PM',
 'Project Engineer - Mechanical      Looking,.....,....']

Next, we create a dictionary with the keys from the header row and the values from the data rows:

>>> f = 'JobID,Publish,Expire,TitleAndDetail'.split(',')
>>> dict(zip(f, s.split(',', 3)))
{'JobID': '7428',
 'Publish': '17/12/2006 2:00:00 PM',
 'Expire': '28/01/2007 2:00:00 PM',
 'TitleAndDetail': 'Project Engineer - Mechanical      Looking,.....,....'}

Finally, we make a list of these dictionaries (in rows) and pass it as an argument to create our DataFrame object.
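
The 3 in split(',', 3) is simply the number of columns minus one, so it can be derived from the header instead of being hard-coded. A minimal sketch of that variation, using an in-memory string in place of the file; read_last_field_greedy and SAMPLE are hypothetical names, not part of pandas:

```python
import io
import pandas as pd

# Hypothetical in-memory copy of the sample rows from the question.
SAMPLE = (
    "JobID,Publish,Expire,TitleAndDetail\n"
    "7428,17/12/2006 2:00:00 PM,28/01/2007 2:00:00 PM,Project Engineer - Mechanical      Looking,.....,....\n"
    "7429,9/03/2006 2:00:00 PM,27/02/2007 2:00:00 PM,Supply Teacher      The job is,.....,.....\n"
)

def read_last_field_greedy(f, sep=","):
    """Split each line so that the last column keeps any extra separators."""
    keys = next(f).rstrip("\r\n").split(sep)
    maxsplit = len(keys) - 1  # 4 columns -> at most 3 splits
    rows = [
        dict(zip(keys, line.rstrip("\r\n").split(sep, maxsplit)))
        for line in f
        if line.strip()  # ignore blank lines
    ]
    return pd.DataFrame(rows, columns=keys)

df3 = read_last_field_greedy(io.StringIO(SAMPLE))
print(df3)
```

Plain string splitting ignores CSV quoting, so this is only suitable when the leading columns never contain a quoted separator.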

Answer 1
0 2018-09-26 06:43:45

Answer 2
0 2018-09-26 06:53:29

Answer 3
0 Accepted 2018-09-26 06:56:01
