
How to edit a txt file with regular expressions (re) in Python

Hi guys,

I'm having trouble editing a txt file in Python.

Here are the first few lines of the txt file:

m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']
m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']

Here is my code:

import re

file = open('datasets/movie_titles_metadata.txt')

def extract_categories(file):

    for line in file:
        line: str = line.rstrip()
        if re.search(" ", line):
            line = re.sub(r"[0-9]", "", line)
            line = re.sub(r"[$ + : . ]", "", line)
            return line
        
      
    
extract_categories(file) 

I need to get output that looks like this:

['action', 'comedy', 'crime', 'drama', 'thriller']

Can someone help?

Regex is not the right tool for this. The list you want sits at the end of every line, so use str.rsplit:

from io import StringIO
import ast

content = """m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']
m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']"""

# this is a mock file-handle, use your file instead here
with StringIO(content) as fh:
    genres = []

    for line in fh:
        # the 1 means that only 1 split occurs
        _, lst = line.rsplit('+++$+++', 1)

        # use ast to convert the string representation
        # to a python list
        lst = ast.literal_eval(lst.strip())

        # extend your result list
        genres.extend(lst)

print(genres)
['comedy', 'romance', 'adventure', 'biography', 'drama', 'history']
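
If the goal is one genre list per movie rather than a single flattened list, the same rsplit approach can be wrapped in a generator. This is only a minimal sketch: it assumes every non-blank line ends with a Python-style list after the last '+++$+++' separator, and the names genres_per_line and SEPARATOR are hypothetical, not part of the original answer.

```python
import ast
from io import StringIO

SEPARATOR = "+++$+++"  # field separator seen in the sample lines


def genres_per_line(lines):
    """Yield the genre list found at the end of each non-blank line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # split once from the right: keep only the text after the last separator
        _, raw = line.rsplit(SEPARATOR, 1)
        # safely turn the string "['a', 'b']" into a real list
        yield ast.literal_eval(raw.strip())


# mock file handle built from the two sample lines; use open(...) on the real file
sample = StringIO(
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']\n"
    "m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']\n"
)
result = list(genres_per_line(sample))
print(result)
```

With the two sample lines this prints a list containing two inner lists, one per movie. A line without the separator would raise ValueError at the unpacking step, so real data may need a guard.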

Alternatively, if you want to use regex instead:

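import re

def extract_categories(file):
    categories = []

    for line in file:
        # keep only the text after the last separator
        _, line = line.rsplit('+++$+++', 1)
        # only process lines whose tail looks like a genre list
        if re.search(r"\['[a-z]+", line):
            # capture each quoted lowercase word without its quotes
            res = re.findall(r"'([a-z]+)'", line)
            categories.extend(res)

    return categories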
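
To see what the findall pattern captures, here is a quick check on one of the sample lines. It is a sketch for illustration only; the variable names are hypothetical, and the 'sci-fi' string is a made-up input used to show a limitation of the character class, not a line from the question's data.

```python
import re

sample_line = (
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 "
    "+++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']"
)

# keep only the text after the last separator
_, tail = sample_line.rsplit("+++$+++", 1)

# each quoted run of lowercase letters is captured without its quotes
found = re.findall(r"'([a-z]+)'", tail)
print(found)  # ['comedy', 'romance']

# [a-z]+ does not allow a hyphen, so a hyphenated name is skipped entirely
hyphenated = re.findall(r"'([a-z]+)'", "['sci-fi']")
print(hyphenated)  # []

# widening the character class is one possible way to accept hyphens
widened = re.findall(r"'([a-z-]+)'", "['sci-fi']")
print(widened)  # ['sci-fi']
```

If the genre names can contain anything other than lowercase letters, the ast.literal_eval approach avoids having to tune the pattern.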
