
How to use square brackets as a quote character in Pandas.read_csv

Let's say I have a text file that looks like this:

Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]

What I'd like to be able to do is read that in with pandas.read_csv, but the second row will throw an error. Here is the code I'm currently using:

import pandas as pd
df = pd.read_csv("path/to/file.txt", sep=",", dtype=str)

I've tried to set quotechar to "[", but that obviously just eats up the lines until the next open bracket, and adding a closing bracket results in a "string of length 2 found" error. Any insight would be greatly appreciated. Thanks!
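
Continuing the sketch above, both quotechar attempts behave as described (again, the exact error wording depends on the pandas version):

try:
    pd.read_csv(io.StringIO(text), quotechar="[]")
except TypeError as err:
    print(err)  # quotechar must be a single character

# a lone "[" is accepted, but it acts as both open and close quote, so the
# parser swallows everything up to the next "[" and either mis-parses the
# rows or fails at end of file, depending on the data
try:
    print(pd.read_csv(io.StringIO(text), quotechar="["))
except Exception as err:
    print(err)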

Update

There were three primary solutions offered: 1) give the data frame a long list of column names so that all the data can be read in, then post-process it, 2) find values in square brackets and put quotes around them, or 3) replace the first n commas with semicolons.

Overall, I don't think option 3 is a viable solution in general (albeit just fine for my data) because a) what if I have quoted values in one column that contain commas, and b) what if my column with square brackets is not the last column? That leaves solutions 1 and 2. I think solution 2 is more readable, but solution 1 was more efficient, running in just 1.38 seconds compared to 3.02 seconds for solution 2. The tests were run on a text file containing 18 columns and more than 208,000 rows.
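
Concern a) is easy to demonstrate with a hypothetical row where another column holds a quoted value containing a comma; naive comma replacement corrupts that field:

line = '1,"Toronto, ON",13:41,[45.2344:-78.25453]'
print(line.replace(',', ';', 3))
# 1;"Toronto; ON";13:41,[45.2344:-78.25453]  <- the comma inside the quotes was replaced too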

I think you can replace the first 3 occurrences of , in each line of the file with ; and then use the parameter sep=";" in read_csv:

import pandas as pd
import io

with open('file2.csv', 'r') as f:
    lines = f.readlines()
    fo = io.StringIO()
    fo.writelines(line.replace(',', ';', 3) for line in lines)
    fo.seek(0)

df = pd.read_csv(fo, sep=';')
print(df)
   Item        Date   Time                            Location
0     1  01/01/2016  13:41                 [45.2344:-78.25453]
1     2  01/03/2016  19:11  [43.3423:-79.23423,41.2342:-81242]
2     3  01/10/2016  01:27                 [51.2344:-86.24432]
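
(As a possible follow-up, not part of the original answer: if you want actual coordinate pairs rather than a bracketed string, the column could be post-processed along these lines:)

# strip the brackets, split on commas, then split each pair on the colon
df['Location'] = df['Location'].str.strip('[]').str.split(',').apply(
    lambda pairs: [tuple(map(float, p.split(':'))) for p in pairs])
# e.g. [45.2344:-78.25453] -> [(45.2344, -78.25453)]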

Or you can try this more complicated approach, because the main problem is that the separator , between values in the lists is the same as the separator between the other column values.

So you need post-processing:

import pandas as pd
import io

temp = """Item,Date,Time,Location
1,01/01/2016,13:41,[45.2344:-78.25453]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242,41.2342:-81242]
3,01/10/2016,01:27,[51.2344:-86.24432]"""
# after testing, replace io.StringIO(temp) with the filename
# estimated maximum number of columns
df = pd.read_csv(io.StringIO(temp), names=range(10))
print(df)
      0           1      2                    3               4  \
0  Item        Date   Time             Location             NaN   
1     1  01/01/2016  13:41  [45.2344:-78.25453]             NaN   
2     2  01/03/2016  19:11   [43.3423:-79.23423  41.2342:-81242   
3     3  01/10/2016  01:27  [51.2344:-86.24432]             NaN   

                 5   6   7   8   9  
0              NaN NaN NaN NaN NaN  
1              NaN NaN NaN NaN NaN  
2  41.2342:-81242] NaN NaN NaN NaN  
3              NaN NaN NaN NaN NaN  
# drop columns that are all NaN
df = df.dropna(how='all', axis=1)
# use the first row as the column names
df.columns = df.iloc[0, :]
# drop the first row
df = df[1:]
# clear the columns' name attribute
df.columns.name = None

# get the position of column Location
print(df.columns.get_loc('Location'))
3
# df1 holds the columns from Location onwards
df1 = df.iloc[:, df.columns.get_loc('Location'):]
print(df1)
              Location             NaN              NaN
1  [45.2344:-78.25453]             NaN              NaN
2   [43.3423:-79.23423  41.2342:-81242  41.2342:-81242]
3  [51.2344:-86.24432]             NaN              NaN

# combine the values into one column
df['Location'] = df1.apply(lambda x: ', '.join(e for e in x if isinstance(e, str)), axis=1)

# subset of desired columns
print(df[['Item', 'Date', 'Time', 'Location']])
  Item        Date   Time                                           Location
1    1  01/01/2016  13:41                                [45.2344:-78.25453]
2    2  01/03/2016  19:11  [43.3423:-79.23423, 41.2342:-81242, 41.2342:-8...
3    3  01/10/2016  01:27                                [51.2344:-86.24432]

I can't think of a way to trick the CSV parser into accepting distinct open/close quote characters, but you can get away with a pretty simple preprocessing step:

import pandas as pd
import io
import re

# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]')

with open('path/to/file.txt', 'r') as fi:
    # replace brackets with quotes, pipe into a file-like object
    fo = io.StringIO()
    fo.writelines(re.sub(location_regex, r'"\1"', line) for line in fi)

    # rewind the file to the beginning
    fo.seek(0)

# read the transformed CSV into a data frame
df = pd.read_csv(fo)
print(df)

This gives you a result like:

            Date_Time  Item                             Location
0 2016-01-01 13:41:00     1                  [45.2344:-78.25453]
1 2016-01-03 19:11:00     2  [43.3423:-79.23423, 41.2342:-81242]
2 2016-01-10 01:27:00     3                  [51.2344:-86.24432]
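
(One note: the Date_Time column in this output implies the date columns were merged during parsing, presumably via something like the call below; the snippet above, as written, would print Date and Time separately.)

# merge Date and Time into one Date_Time column while parsing
# (list-of-lists parse_dates was supported in the pandas versions of
# that era; it is deprecated in pandas 2.x)
df = pd.read_csv(fo, parse_dates=[['Date', 'Time']])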

Edit If memory is not an issue, then you are better off preprocessing the data in bulk rather than line by line, as is done in Max's answer.

# regular expression to capture contents of balanced brackets
location_regex = re.compile(r'\[([^\[\]]+)\]', flags=re.M)

with open('path/to/file.csv', 'r') as fi:
    data = re.sub(location_regex, r'"\1"', fi.read())

df = pd.read_csv(io.StringIO(data))

If you know ahead of time that the only brackets in the document are those surrounding the location coordinates, and that they are guaranteed to be balanced, then you can simplify it even further (Max suggests a line-by-line version of this, but I think the iteration is unnecessary):

with open('/path/to/file.csv', 'r') as fi:
    data = fi.read().replace('[', '"').replace(']', '"')

df = pd.read_csv(io.StringIO(data))

Below are the timing results I got with a 200k-row by 3-column dataset; each time is averaged over 10 trials (a sketch of such a harness follows the list).

  • data frame post-processing (jezrael's solution): 2.19s
  • line by line regex: 1.36s
  • bulk regex: 0.39s
  • bulk string replace: 0.14s
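
(For reference, averages like these could come from a harness along the following lines, shown here for the bulk string replace variant with a hypothetical file path:)

import io
import timeit

import pandas as pd

def bulk_replace(path='path/to/file.csv'):
    # the "bulk string replace" variant from above
    with open(path, 'r') as fi:
        data = fi.read().replace('[', '"').replace(']', '"')
    return pd.read_csv(io.StringIO(data))

# mean of 10 trials, matching the figures above
print(timeit.timeit(bulk_replace, number=10) / 10)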

We can use a simple trick: quote the balanced square brackets with double quotes:

import re
import io
import pandas as pd


data = """\
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]"""

print('{0:-^70}'.format('original data'))
print(data)
data = re.sub(r'(\[[^\]]*\])', r'"\1"', data, flags=re.M)
print('{0:-^70}'.format('quoted data'))
print(data)
df = pd.read_csv(io.StringIO(data))
print('{0:-^70}'.format('data frame'))

pd.set_option('display.expand_frame_repr', False)
print(df)

Output:

----------------------------original data-----------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,[45.2344:-78.25453],[aaaa,bbb]
2,01/03/2016,19:11,[43.3423:-79.23423,41.2342:-81242],[0,1,2,3]
3,01/10/2016,01:27,[51.2344:-86.24432],[12,13]
4,01/30/2016,05:55,[51.2344:-86.24432,41.2342:-81242,55.5555:-81242],[45,55,65]
-----------------------------quoted data------------------------------
Item,Date,Time,Location,junk
1,01/01/2016,13:41,"[45.2344:-78.25453]","[aaaa,bbb]"
2,01/03/2016,19:11,"[43.3423:-79.23423,41.2342:-81242]","[0,1,2,3]"
3,01/10/2016,01:27,"[51.2344:-86.24432]","[12,13]"
4,01/30/2016,05:55,"[51.2344:-86.24432,41.2342:-81242,55.5555:-81242]","[45,55,65]"
------------------------------data frame------------------------------
   Item        Date   Time                                           Location        junk
0     1  01/01/2016  13:41                                [45.2344:-78.25453]  [aaaa,bbb]
1     2  01/03/2016  19:11                 [43.3423:-79.23423,41.2342:-81242]   [0,1,2,3]
2     3  01/10/2016  01:27                                [51.2344:-86.24432]     [12,13]
3     4  01/30/2016  05:55  [51.2344:-86.24432,41.2342:-81242,55.5555:-81242]  [45,55,65]

UPDATE: if you are sure that all square brackets are balanced, we don't have to use regexes:

import io
import pandas as pd

with open('35948417.csv', 'r') as f:
    fo = io.StringIO()
    data = f.readlines()
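    # quote each bracketed field, keeping the brackets inside the quotes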
    fo.writelines(line.replace('[', '"[').replace(']', ']"') for line in data)
    fo.seek(0)

df = pd.read_csv(fo)
print(df)
