
How can I use regex as a delimiter when importing a csv file with pandas with extra commas?

The csv file was sent to me, and I cannot re-delimit the columns.

239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).,2011-07-13 00:00:00,2011-07-13 00:00:00

I replaced the letters in the strings to hide sensitive information, but the problem is still apparent.

This is an example "problem row" in my csv. It should be split into 8 columns as follows:

col1: 239845723
col2: 28374
col3: 2384234
col4: AEVNE EFU 5 GN OR WNV
col5: Owinv Vnwo Badvw 5 VIN
col6: Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).
col7: 2011-07-13 00:00:00
col8: 2011-07-13 00:00:00

As you can see, column 6 is where the problem occurs: the string contains commas, which cause pandas to split it and create extra columns incorrectly. How can I solve this problem? I was thinking regex would help, perhaps with the setup below. Any help is appreciated!

    import csv

    csvfile = open(filetrace)
    reader = csv.reader(csvfile)
    new_list = []
    for line in reader:
        for i in line:
            pass  # not sure what to do here

Instead of using regex, read the csv with ',' as the delimiter, then extract the last two dates of each row and store them in a list. Next, replace the date cells with '', join the columns you want, and delete the rest. Example:

If you have a csv file:

239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).,2011-07-13 00:00:00,2011-07-13 00:00:00
239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE.).,2011-07-13 00:00:00,2011-07-13 00:00:00
239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).,2011-07-13 00:00:00,2011-07-13 00:00:00

Then:

import numpy as np
import pandas as pd

df = pd.read_csv('good.txt', delimiter=',', header=None)
# Get the dates from the whole DataFrame
dates = [[item] for i in df.values for item in i if '2011-' in str(item)]
# Merge the two dates of each row
dates = pd.DataFrame([x + y for x, y in zip(dates[0::2], dates[1::2])])
# Remove the dates from their original cells
df = df.replace({'2011-': np.nan}, regex=True).replace(np.nan, '')

# Get the columns you want to merge
cols = df.columns[4:]
# Merge the columns
df[4] = df[cols].astype(str).apply(lambda x: ','.join(x), axis=1)
# Strip the trailing commas left behind by the emptied cells
df[4] = df[4].replace(r',+$', '', regex=True)
# Drop the columns that were merged into column 4
df = df.drop(df.columns[5:], axis=1)
# Append the dates
df = pd.concat([df, dates], axis=1)

Output of print(df):

0      1        2                      3  \
0  239845723  28374  2384234  AEVNE EFU 5 GN OR WNV   
1  239845723  28374  2384234  AEVNE EFU 5 GN OR WNV   
2  239845723  28374  2384234  AEVNE EFU 5 GN OR WNV   

                                                   4                    0  \
0  Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera ...  2011-07-13 00:00:00   
1  Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera ...  2011-07-13 00:00:00   
2  Owinv Vnwo Badvw 5 VIN sebsbe sve(sevsev esvse...  2011-07-13 00:00:00   

                     1  
0  2011-07-13 00:00:00  
1  2011-07-13 00:00:00  
2  2011-07-13 00:00:00

Output of column 4:

['Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee, 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).',

 'Owinv Vnwo Badvw 5 VIN,Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd bdfwd dsf (sdv seves 4-6), sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE.).',

'Owinv Vnwo Badvw 5 VIN sebsbe sve(sevsev esvse 7-10) fsesef fesevsesv PaVvin (1 evesve vEV VEWee 2 for WVEee VEWE. paper tuff as sWEFEWoon as VEWeew.).']

If you want to renumber the column labels:

df.columns = [i for i in range(df.shape[1])]

Hope it helps.
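The code above finds the date cells by looking for the substring '2011-', which ties it to a single year. As a rough sketch, assuming every date in the file has the shape shown in the sample row (YYYY-MM-DD HH:MM:SS), a small helper could test the whole cell instead. The names TIMESTAMP and is_timestamp are hypothetical and are not part of the answer.

```python
import re

# Assumed timestamp shape, taken from the sample row: "2011-07-13 00:00:00".
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def is_timestamp(value):
    """Return True if the whole cell looks like YYYY-MM-DD HH:MM:SS."""
    return TIMESTAMP.fullmatch(str(value).strip()) is not None


# A few cells of the kind pandas produces after splitting on every comma.
cells = [
    "Owinv Vnwo Badvw 5 VIN",
    " sebsbe sve(sevsev esvse 7-10)",
    "2011-07-13 00:00:00",
    "2012-01-02 03:04:05",
    float("nan"),  # padding cell from a row with fewer fields
]
flags = [is_timestamp(cell) for cell in cells]
print(flags)
```

Such a check could stand in for the `'2011-' in str(item)` test when the dates span several years; it only recognises the one format assumed here.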

So, without knowing the specifics of the file or the data, I can offer a regex solution that could work if the data is consistent (and has the period at the end of column 6). We can do it without the csv module, using only the re module.

import re

# make the regex pattern here
pattern = r"([\d\.]*),([\d\.]*),([\d\.]*),([^,]*),([^,]*),(.*\.?),([\d\-\s:]*),([\d\-\s:]*)"

# open the file with 'with' so you don't have to worry about closing it
with open(filetrace) as f:
    for line in f:  # iterate through the lines
        # re.findall returns a list of tuples; take the first (only) match.
        # strip() keeps the trailing newline out of the last column.
        values = re.findall(pattern, line.strip())[0]
        # record the values somewhere

values here is an 8-tuple containing the value of each column in your original csv; use or store them however you want.
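One possible way to record the values, sketched here under assumptions: each 8-tuple is turned into a dict keyed by column name, and lines that do not fit the pattern are set aside instead of raising an IndexError. The names COLUMNS and parse_lines, the shortened sample row, and the deliberately malformed second line are all hypothetical.

```python
import re

# Same pattern as above.
pattern = r"([\d\.]*),([\d\.]*),([\d\.]*),([^,]*),([^,]*),(.*\.?),([\d\-\s:]*),([\d\-\s:]*)"

# Hypothetical column names: col1 .. col8.
COLUMNS = [f"col{i}" for i in range(1, 9)]


def parse_lines(lines):
    """Split each line into 8 named fields; collect lines that do not match."""
    records, rejected = [], []
    for line in lines:
        match = re.fullmatch(pattern, line.strip())
        if match is None:
            rejected.append(line)
        else:
            records.append(dict(zip(COLUMNS, match.groups())))
    return records, rejected


# Shortened stand-in for the real file: one problem row and one bad line.
sample = [
    "239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,"
    "Ginq 2 jnwve (sdv seves 4-6), sebsbe sve (1 evesve, 2 for WVEee.).,"
    "2011-07-13 00:00:00,2011-07-13 00:00:00\n",
    "not,a,valid,row\n",
]
records, rejected = parse_lines(sample)
print(records[0]["col6"])
print(len(rejected))
```

A list of dicts like records can be passed directly to pd.DataFrame if a DataFrame is the end goal.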

Best of luck with it!

Since you know exactly how many columns you need and only one column is problematic, we can split the first few columns off from the left and the last ones from the right. In other words, you don't need regex.
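The idea can be seen on a single line with plain str.split and str.rsplit, before pandas gets involved. This is only an illustration on a shortened version of the problem row; it assumes the five leading fields and the two trailing dates never contain commas.

```python
# Shortened version of the problem row (free text in column 6 has commas).
line = (
    "239845723,28374,2384234,AEVNE EFU 5 GN OR WNV,Owinv Vnwo Badvw 5 VIN,"
    "Ginq 2 jnwve (sdv seves 4-6), sebsbe sve (1 evesve, 2 for WVEee.).,"
    "2011-07-13 00:00:00,2011-07-13 00:00:00"
)

# At most 5 splits from the left: five clean fields plus "everything else".
*head, rest = line.split(",", 5)

# At most 2 splits from the right: the two dates come off the end,
# and whatever remains is the free-text column, commas included.
text, first_date, second_date = rest.rsplit(",", 2)

fields = head + [text, first_date, second_date]
print(len(fields))
print(fields[5])
```

The pandas steps below apply the same two splits to every row at once.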

Read the file into a single string:

csvfile = open(filetrace).read()

Make a pd.Series with one line per element:

import pandas as pd
s = pd.Series(csvfile.splitlines())  # splitlines() avoids an empty last row

Split s, limiting it to 5 splits, which gives 6 columns:

df = s.str.split(',', n=5, expand=True)

Now split the last column from the right, limited to 2 splits:

right = df.iloc[:, -1].str.rsplit(',', n=2, expand=True)  # last column, not last row
df = pd.concat([df.iloc[:, :-1], right], axis=1, ignore_index=True)

Another way, starting from s:

left = s.str.split(',', n=5)
right = left.str[-1].str.rsplit(',', n=2)

df = pd.DataFrame(left.str[:-1].add(right).tolist())

I ran this and took the transpose to make it easier to read on screen:

df.T



                                                   0
0                                          239845723
1                                              28374
2                                            2384234
3                              AEVNE EFU 5 GN OR WNV
4                             Owinv Vnwo Badvw 5 VIN
5  Ginq 2 jnwve wef evera wve 6 vwe as fgsb bfd b...
6                                2011-07-13 00:00:00
7                                2011-07-13 00:00:00
