
How to extract only one string from regex in Python?

I have been trying to build a simple account-manager application for myself in Python. It will read SMS messages from my phone and extract information based on some regex patterns.

I wrote a complex regex pattern and tested it on https://pythex.org/. Example:

Text: 1.00 is debited from ******1234  for food

Pattern: (account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)

Result: from ******1234

However, when I try to do the same in Python using the str.extract() method, I get a DataFrame with one column for each group rather than a single result.

The Python code looks like this:

all_sms=pd.read_csv("all_sms.csv")

pattern = '(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

test = all_sms.extract(pattern, expand = False)

Output of the Python code for the message above:

0           from
1               
2            NaN
3            NaN
4            NaN
5     ******1234
6           1234
7           1234
8               
9               
10              

I am very new to Python and trying to learn through hands-on experience. It would be really helpful if someone could point out where I am going wrong.

Before diving into your regex pattern, you should understand why you are using pandas. Pandas is suitable for data analysis (and thus for your problem), but it seems like overkill here.

If you are a beginner, I advise you to stick with pure Python, not because pandas is complicated but because knowing the Python standard library will help you in the long run. Skipping the basics now may hurt you later.

Assuming you are going to use Python 3 (without pandas), I would proceed as follows:

# Needed imports from standard library.
import csv
import re

# Declare the constants of my tiny program.
# A raw string keeps the regex backslashes literal.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
COMPILED_REGEX = re.compile(PATTERN)

# This list will store the text matched by the regex.
found_regexes = list()

# Do the necessary loading to enable searching for the regex.
with open('mysmspath.csv', newline='') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=' ', quotechar='"')
    # Iterate over rows in your csv file.
    for row in csv_reader:
        # csv.reader yields each row as a list of strings,
        # so join the fields back into one string before searching.
        match = COMPILED_REGEX.search(' '.join(row))
        if match:
            # group(0) is the whole matched text, not the individual groups.
            found_regexes.append(match.group(0))

print(found_regexes)
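As a side note on why the question's output has eleven rows: the pattern contains eleven pairs of capturing parentheses, and group-based extraction reports one value per group. The sketch below uses only the standard `re` module with the pattern and sample text from the question. The names `TEXT`, `groups` and `whole` are illustrative, and the `.strip()` call is my own addition because the pattern also consumes the whitespace after the account number.

```python
import re

# Pattern and sample text copied from the question.
# A raw string keeps the backslashes literal.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'
TEXT = '1.00 is debited from ******1234  for food'

match = re.search(PATTERN, TEXT)

# groups() returns one entry per capture group
# (None where a group did not take part in the match).
groups = match.groups()
print(len(groups))

# group(0) is the whole matched text, independent of the capture groups.
# strip() drops the trailing whitespace that the pattern also consumes.
whole = match.group(0).strip()
print(whole)
```

The single string shown by pythex corresponds to `group(0)`, while the per-group values correspond to `groups()`.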

This will not necessarily solve your problem by copy-paste, but it might give you an idea of a simpler approach.
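If you need the whole match from an API that only reports capture groups, one possible approach is to wrap the entire pattern in an extra pair of parentheses so that the first group is the full match. The sketch below shows this with the standard `re` module. The helper `extract_account` and the second sample message are hypothetical and not from the question. The same wrapped pattern should, as far as I know, make the first column of pandas `str.extract()` hold the full match, but that part is an untested assumption here.

```python
import re

# Pattern copied from the question, as a raw string.
PATTERN = r'(account|a\/c|ac|from|acct|savings|credit in|ac\/|sb\-|acc|a\/c)(\s|\.|\-)*(no|number)*(\.|\s|:)*\s*(ending)*\s*(((n{1,}|x{1,}|[0-9]+|\*{1,}))+)\-*((n{1,}|x{1,}|[0-9]+|\*{1,}|\s))*\-*([0-9]*)'

# Wrapping the whole pattern makes group 1 span the entire match.
WRAPPED = re.compile('(' + PATTERN + ')')


def extract_account(text):
    """Hypothetical helper: return the whole matched text, or None if no match."""
    match = WRAPPED.search(text)
    # strip() removes whitespace the pattern consumes after the number.
    return match.group(1).strip() if match else None


# The first message is from the question; the second is made up to show a non-match.
messages = [
    '1.00 is debited from ******1234  for food',
    'hello there',
]
results = [extract_account(message) for message in messages]
print(results)
```

Returning `None` for messages without a match keeps the result list aligned with the input list, which is convenient if you later want to put both side by side.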
