How to edit a txt file using regular expressions (re) in Python

Question

Hi guys,

I'm having trouble editing a txt file in Python.

Here are the first few lines of the txt file:

m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']
m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']

Here is my code:

import re

file = open('datasets/movie_titles_metadata.txt')

def extract_categories(file):

    for line in file:
        line: str = line.rstrip()
        if re.search(" ", line):
            line = re.sub(r"[0-9]", "", line)
            line = re.sub(r"[$ + : . ]", "", line)
            return line
        
      
    
extract_categories(file)

I need to get output that looks like this:

['action', 'comedy', 'crime', 'drama', 'thriller']

Can someone help?

Answer 1

Regex is not the right tool for this. The list you want is always the last field on each line, so use str.rsplit:

from io import StringIO
import ast

content = """m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']
m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']"""

# this is a mock file-handle, use your file instead here
with StringIO(content) as fh:
    genres = []

    for line in fh:
        # the 1 means that only 1 split occurs
        _, lst = line.rsplit('+++$+++', 1)

        # use ast to convert the string representation
        # to a python list
        lst = ast.literal_eval(lst.strip())

        # extend your result list
        genres.extend(lst)

print(genres)
['comedy', 'romance', 'adventure', 'biography', 'drama', 'history']
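The sample output in the question looks like a sorted list without repeats, so it may help to wrap the same rsplit approach in a function that deduplicates. This is only a minimal sketch: extract_genres is a hypothetical helper name, and the sorting and deduplication are an assumption about what the asker wants, not something the question states.

```python
import ast


def extract_genres(lines):
    """Collect the unique genres from an iterable of metadata lines.

    Hypothetical helper: dedup + sort is an assumption about the desired output.
    """
    genres = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split once from the right: the genre list is the last field
        _, raw = line.rsplit("+++$+++", 1)
        # safely parse the "['a', 'b']" text into a real Python list
        genres.update(ast.literal_eval(raw.strip()))
    return sorted(genres)


sample = [
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']",
    "m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']",
]
print(extract_genres(sample))
# ['adventure', 'biography', 'comedy', 'drama', 'history', 'romance']
```

With the real file, the open file handle can be passed directly, e.g. `with open('datasets/movie_titles_metadata.txt') as fh: print(extract_genres(fh))`, since iterating a file handle yields its lines.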

Answer 2

Alternatively, if you want to use regex instead:

import re


def extract_categories(file):
    categories = []

    for line in file:
        _, line = line.rsplit('+++$+++', 1)
        if re.search(r"\['[a-z]+", line):
            res = re.findall(r"'([a-z]+)'", line)
            categories.extend(res)

    return categories
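One thing to watch with the pattern above: `[a-z]+` only matches lowercase letters, so a genre name containing a hyphen, digit, or space would be skipped. The sketch below shows one possible variation that captures anything between single quotes instead. The name extract_categories_any and the 'sci-fi' sample line are made up for illustration; they do not come from the question's file.

```python
import re
from io import StringIO

# matches any run of non-quote characters between single quotes
GENRE_RE = re.compile(r"'([^']+)'")


def extract_categories_any(file):
    """Hypothetical variant of extract_categories with a looser pattern."""
    categories = []
    for line in file:
        # only search the last field, so quotes inside titles are ignored
        _, last = line.rsplit("+++$+++", 1)
        categories.extend(GENRE_RE.findall(last))
    return categories


# mock file handle; the second line is invented to show a hyphenated genre
mock = StringIO(
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']\n"
    "x0 +++$+++ some title +++$+++ 2000 +++$+++ 5.00 +++$+++ 100 +++$+++ ['sci-fi']\n"
)
print(extract_categories_any(mock))
# ['comedy', 'romance', 'sci-fi']
```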

Answer 1
1 2022-11-23 19:39:59

Answer 2
0 2022-11-23 20:04:31
