How to write regular expression pattern.finditer results into a dataframe

Question

I am trying to write a regular expression to select the text I want from a corpus, and then write the extracted text into a dataframe and save it in CSV format.

Here is the code that I used:

import re

import pandas as pd

def main():

    pattern = re.compile(r'(case).(reason)(.+)(})')

    with open('/Users/cleantext.txt', 'r') as f:
        content = f.read()
        matches = pattern.finditer(content)
        for match in matches:
            print(tuple(match.groups()))


    # Create a DF for the expenses
    df = pd.DataFrame(data=[tuple(match.groups())])

    df.to_csv("judgement.csv", index=True)

if __name__ == '__main__':
     main()

However, the CSV only contains one line of output:

,0,1,2,3
0,xxx,yyy,zzz,}

whereas I was expecting multiple lines, since the corpus contains at least 100 judicial judgements.

The original corpus looks something like this:

{mID a9d50454f624         case xxx reason yyy judgement zzz}
{mID a9d5049e34e934bff9b  case xxx reason yyy judgement zzz}
{mID a67c9e34e934bff9b    case xxx reason yyy judgement zzz}

Thank you so much for your help.

Answer 1

You probably need to get the two substrings denoting case and reason from each match. Note also that your code builds the dataframe after the loop, from the last match only, which is why the CSV has a single row. You can use

pattern = re.compile(r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement')
matches = [x.groupdict() for x in pattern.finditer(content)]
df = pd.DataFrame(matches)

Note that the named capturing groups automatically supply the column names: x.groupdict() returns a dictionary mapping each group name to its matched value. The list comprehension [x.groupdict() for x in pattern.finditer(content)] therefore returns a list of dictionaries that can be used to populate the dataframe.
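
To see the whole flow end to end, here is a minimal sketch that runs the pattern over the three sample lines from the question instead of the real file. The inline `content` string and the `index=False` argument are assumptions made for illustration; the real script would read the corpus from disk and pass a file name to `to_csv`.

```python
import re

import pandas as pd

# Sample lines copied from the question; a real run would read the corpus file instead.
content = """{mID a9d50454f624         case xxx reason yyy judgement zzz}
{mID a9d5049e34e934bff9b  case xxx reason yyy judgement zzz}
{mID a67c9e34e934bff9b    case xxx reason yyy judgement zzz}"""

pattern = re.compile(r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement')

# One dict per match; the group names become the column names.
rows = [m.groupdict() for m in pattern.finditer(content)]
df = pd.DataFrame(rows)

# Called without a path, to_csv returns the CSV text instead of writing a file.
# index=False (an assumption here) leaves the row numbers out of the output.
csv_text = df.to_csv(index=False)
print(csv_text)
```

Each record in the sample produces one row, so the dataframe has as many rows as there are matches rather than just the last one.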

You can also use

matches = pattern.findall(content)
df = pd.DataFrame(matches, columns=['Case', 'Reason'])
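
As a rough illustration of the findall variant: when a pattern contains more than one capturing group, `findall` returns a list of tuples with one item per group, so the group names are lost and the column names have to be passed explicitly. The inline `content` string below is an assumption, reusing two sample lines from the question.

```python
import re

import pandas as pd

# Two sample records from the question, used here in place of the real file.
content = (
    "{mID a9d50454f624         case xxx reason yyy judgement zzz}\n"
    "{mID a67c9e34e934bff9b    case xxx reason yyy judgement zzz}"
)

pattern = re.compile(r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement')

# With two groups in the pattern, each match becomes a 2-tuple.
matches = pattern.findall(content)
print(matches)

# Tuples carry no names, so the columns are labelled by hand.
df = pd.DataFrame(matches, columns=['Case', 'Reason'])
print(df)
```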

See the regex demo. Details:

\bcase - a whole word case
\s* - zero or more whitespaces
(?P<Case>.*?) - Group "Case": zero or more chars other than line break chars, as few as possible
\s*reason\s* - reason word enclosed with optional whitespaces
(?P<Reason>.*?) - Group "Reason": zero or more chars other than line break chars, as few as possible
\s*judgement - zero or more whitespaces and then judgement string.
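
The "other than line break chars" detail matters if a record can wrap onto a second line. The sketch below uses a hypothetical wrapped record (not taken from the question's corpus) to show that the pattern as written does not match when the case text itself contains a line break, and that passing `re.DOTALL` lets `.` cross it. Whether the real corpus needs this is an assumption to check against the data.

```python
import re

regex = r'\bcase\s*(?P<Case>.*?)\s*reason\s*(?P<Reason>.*?)\s*judgement'

# Hypothetical record whose case text is wrapped onto a second line.
wrapped = "case a\nb reason c judgement d"

# By default '.' stops at the line break, so the Case group cannot reach "reason".
default_match = re.search(regex, wrapped)

# With re.DOTALL, '.' also matches the line break and the group spans both lines.
dotall_match = re.search(regex, wrapped, flags=re.DOTALL)

print(default_match)
print(repr(dotall_match.group('Case')))
```

For corpora like the sample in the question, where each record sits on one line, the default behaviour is usually what you want, since it keeps a match from running across records.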
