How can I fix my re compile statement in Python

I have a text file and I am using re to locate a specific section of text (a list containing water usage in different towns) and put the information into a pandas dataframe. The text list is ordered using letters, e.g. (a), (b), (c) etc. The code works fine and returns all the information I need into the dataframe until the ordering switches to double letters, e.g. (aa), (ab), (ac) etc.

How can I fix my re statement so that it also works for double-lettered indexes in the text list?

Here is the code:

def create_df(finalText):
    pattern = regex.compile(r'\d+ (?=ML\/year)|(?<= in the |the )[\w \/\(\)]+')
    columns = ('Water Usage', 'Town')

    res = [dict(zip(columns, pattern.findall(line))) for line in finalText.splitlines() if pattern.match(line)]
    df = pd.DataFrame(res)

    return df

And here is an example of the text:

(w) 218 ML/year in the Murrumbidgee I Water Source,
(x) 133 ML/year in the Murrumbidgee II Water Source,
(y) 116 ML/year in the Murrumbidgee III Water Source,
(z) 73 ML/year in the Murrumbidgee North Water Source,
(aa) 476 ML/year in the Murrumbidgee Western Water Source,
(ab) 92 ML/year in the Muttama Water Source,
(ac) 150 ML/year in the Numeralla East Water Source,

As I said, it works for all the rows with single-letter indexes but not for the rows with double-letter indexes.

You can use https://regex101.com/ or https://regexr.com/ to troubleshoot your regular expression. Here's one that matches the key components. It captures the amount, the unit and the water source name, and [^)]+ accepts an index label of any length, such as (aa).

^\([^)]+\)\s+(\S+)\s+(.*\/year)\s+in the\s+(.*),
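As a minimal sketch of how a pattern like this might be used from Python, it can be written as a raw string, so that each escape needs only a single backslash, and the capture groups can be read out directly. The names LINE_PATTERN and parse_line are illustrative and not part of the original code.

```python
import re

# Raw-string form of the pattern: group 1 is the amount, group 2 the unit,
# group 3 the water source name. LINE_PATTERN is a hypothetical name.
LINE_PATTERN = re.compile(r'^\([^)]+\)\s+(\S+)\s+(.*\/year)\s+in the\s+(.*),')

def parse_line(line):
    """Return (amount, unit, source) for one list line, or None if it does not match."""
    m = LINE_PATTERN.match(line)
    return m.groups() if m else None

# One single-letter and one double-letter row from the example text.
single = parse_line("(z) 73 ML/year in the Murrumbidgee North Water Source,")
double = parse_line("(aa) 476 ML/year in the Murrumbidgee Western Water Source,")
print(single)
print(double)
```

Because the index label is consumed by \([^)]+\) rather than being assumed to be one character, rows labelled (z) and (aa) are handled the same way.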

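Python's re module doesn't allow variable-width patterns in lookbehind assertions, so (?<= in the |the ), whose two alternatives have different lengths, does not compile with re.
Once that is corrected, using search() instead of match() makes it work: match() only tries to match at the start of each line, whereas search() scans the whole line.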
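The two points above can be checked with a short sketch. It assumes the standard re module and one sample line from the question; the variable names are illustrative.

```python
import re

line = "(aa) 476 ML/year in the Murrumbidgee Western Water Source,"

# The lookbehind from the question has alternatives of different widths,
# which the standard re module rejects at compile time.
try:
    re.compile(r'(?<= in the |the )[\w \/\(\)]+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

amount = re.compile(r'\d+ (?=ML\/year)')
at_start = amount.match(line)    # only tries position 0, where "(" is found
anywhere = amount.search(line)   # scans forward until "476 " is found

print(lookbehind_ok, at_start, anywhere.group())
```

Here match() returns None because the line begins with the index label rather than a digit, while search() finds the amount further along the line.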

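import re
import pandas as pd

def create_df(finalText):
    # Fixed-width lookbehind, which re accepts; the matches are the amount and the source name.
    pattern = re.compile(r'\d+ (?=ML\/year)|(?<= in the)[\w \/\(\)]+')
    columns = ('Water Usage', 'Town')
    # search() scans the whole line; match() would only try the start, where the index label sits.
    res = [dict(zip(columns, pattern.findall(line))) for line in finalText.splitlines() if pattern.search(line)]
    df = pd.DataFrame(res)
    return df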
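To see which rows this approach produces, here is a small sketch that builds the same list of dictionaries without pandas. The sample text and the names sample_text and rows are illustrative, and the strip() call is an addition that is not in the answer's code: findall() returns the amount with a trailing space and the name with a leading space.

```python
import re

# Same pattern as in create_df above.
pattern = re.compile(r'\d+ (?=ML\/year)|(?<= in the)[\w \/\(\)]+')
columns = ('Water Usage', 'Town')

sample_text = """(z) 73 ML/year in the Murrumbidgee North Water Source,
(aa) 476 ML/year in the Murrumbidgee Western Water Source,
(ab) 92 ML/year in the Muttama Water Source,"""

rows = [
    # Each value is stripped because the raw matches carry the space that
    # sits next to "ML/year" or "in the".
    dict(zip(columns, (value.strip() for value in pattern.findall(line))))
    for line in sample_text.splitlines()
    if pattern.search(line)
]

for row in rows:
    print(row)
```

With pandas installed, passing rows to pd.DataFrame would give one row per line, including the double-letter entries.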


 