Pandas Read CSV with string delimiters via regex

I am trying to import a weirdly formatted text file into a pandas DataFrame. Two example lines are below:

LOADED LANE       1   MAT. TYPE=    2    LEFFECT=    1    SPAN=  200.    SPACE=   10.    BETA=   3.474 LOADEFFECT 5075.    LMAX= 3643.    COV=  .13
LOADED LANE       1   MAT. TYPE=    3    LEFFECT=    1    SPAN=  200.    SPACE=   10.    BETA=   3.515 LOADEFFECT10009.    LMAX= 9732.    COV=  .08

First I tried the following:

df = pd.read_csv('beta.txt', header=None, delim_whitespace=True, usecols=[2,5,7,9,11,13,15,17,19])

This seemed to work fine, but it broke on the second example line above, where there is no whitespace after the LOADEFFECT string (you may need to scroll right to see it in the example). I got a result like:

632   1   2   1  200  10  3.474  5075.  3643.  0.13
633   1   3   1  200  10  3.515  LMAX=   COV=   NaN
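The shift in row 633 can be reproduced without pandas. The following is a minimal sketch, assuming the two sample lines above with their spacing condensed (which does not matter for `str.split`); the names `line_ok`, `line_bad`, and `wanted` are only for illustration. Because `LOADEFFECT10009.` is a single token, the second line has one token fewer, so every position after it moves left by one.

```python
# The two sample lines, with runs of spaces condensed for readability.
line_ok = ("LOADED LANE 1 MAT. TYPE= 2 LEFFECT= 1 SPAN= 200. SPACE= 10. "
           "BETA= 3.474 LOADEFFECT 5075. LMAX= 3643. COV= .13")
line_bad = ("LOADED LANE 1 MAT. TYPE= 3 LEFFECT= 1 SPAN= 200. SPACE= 10. "
            "BETA= 3.515 LOADEFFECT10009. LMAX= 9732. COV= .08")

wanted = [2, 5, 7, 9, 11, 13, 15, 17, 19]  # same positions as usecols

for line in (line_ok, line_bad):
    tokens = line.split()  # split on any run of whitespace
    # Pick the requested positions; a missing position becomes None.
    picked = [tokens[i] if i < len(tokens) else None for i in wanted]
    print(len(tokens), picked)
```

For the second line, positions 15 and 17 now hold the labels `LMAX=` and `COV=`, and position 19 does not exist, which matches the broken row.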

Then I decided to use a regular expression to define my delimiters. After much trial and error (I am no expert in regex), I managed to get close with the following line:

df = pd.read_csv('beta.txt', header=None, sep='/s +|LOADED LANE|MAT. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|LOADEFFECT|LMAX=|COV=', engine='python')

This almost works, but for some reason it creates a NaN column at the very beginning:

632 NaN  1  2  1  200  10  3.474   5075  3643  0.13
633 NaN  1  3  1  200  10  3.515  10009  9732  0.08
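A likely explanation for the leading NaN is that every line starts with a delimiter (`LOADED LANE`), so splitting leaves an empty string before the first real value, and an empty field is read as NaN. The sketch below shows this with `re.split`. The pattern is an assumption for illustration, not the exact `sep` from the line above: it wraps the labels in optional whitespace and escapes the dot in `MAT. TYPE=`.

```python
import re

# Illustrative delimiter: any label, with surrounding whitespace absorbed.
delimiter = re.compile(
    r"\s*(?:LOADED LANE|MAT\. TYPE=|LEFFECT=|SPAN=|SPACE=|BETA=|"
    r"LOADEFFECT|LMAX=|COV=)\s*"
)

line = ("LOADED LANE       1   MAT. TYPE=    3    LEFFECT=    1    "
        "SPAN=  200.    SPACE=   10.    BETA=   3.515 LOADEFFECT10009.    "
        "LMAX= 9732.    COV=  .08")

fields = delimiter.split(line.strip())
print(fields)

# The first field is empty because the line begins with a delimiter;
# dropping it leaves the nine values.
values = fields[1:]
print(values)
```

Under this reading, the extra column is a natural consequence of the file layout rather than a bug in the regex, so dropping the first column afterwards is a reasonable fix.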

At this point I think I can just delete that first column and get away with it. However, I wonder what the correct way would be to set up the regex so that it parses this text file in one shot. Any ideas? Beyond that, I am sure there is a smarter way to parse this text file. I would be glad to hear your recommendations.

Thanks!

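import re
import pandas as pd

# Match decimals such as "3.474" or ".13" first, then plain integers.
# A trailing dot, as in "200.", is left out of the match.
number = re.compile(r'\d*\.\d+|\d+')

with open("parsing.txt") as f:  # open the text file
    # Build one list of numeric strings per non-empty line.
    rows = [number.findall(line) for line in f if line.strip()]

# Convert the extracted strings to floats.
table = pd.DataFrame(rows).astype(float)
table  # DataFrame with one column per extracted value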
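If named columns are wanted, one possible variation is to capture each label together with the number that follows it. This is only a sketch under assumptions: the label list is taken from the sample lines in the question, and `pair`, `parse_line`, and `records` are hypothetical names. The resulting list of dicts could then be passed to `pd.DataFrame(records)`.

```python
import re

# A label, an optional "=", optional whitespace, then a number
# such as "200.", "3.515" or ".08".
pair = re.compile(
    r"(LOADED LANE|MAT\. TYPE|LEFFECT|SPAN|SPACE|BETA|LOADEFFECT|LMAX|COV)"
    r"=?\s*([\d.]+)"
)

def parse_line(line):
    """Return a dict mapping each label to its value as a float."""
    return {label: float(value) for label, value in pair.findall(line)}

sample = [
    "LOADED LANE       1   MAT. TYPE=    2    LEFFECT=    1    SPAN=  200.    "
    "SPACE=   10.    BETA=   3.474 LOADEFFECT 5075.    LMAX= 3643.    COV=  .13",
    "LOADED LANE       1   MAT. TYPE=    3    LEFFECT=    1    SPAN=  200.    "
    "SPACE=   10.    BETA=   3.515 LOADEFFECT10009.    LMAX= 9732.    COV=  .08",
]

records = [parse_line(line) for line in sample]
for record in records:
    print(record)
```

Because the pattern does not require whitespace between a label and its value, the `LOADEFFECT10009.` case is handled the same way as the others.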
