How to edit a txt file with regular expressions (re) in Python
Hi guys,
I'm having trouble editing a txt file in Python.
Here are the first few lines of the txt file:
m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']
m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']
Here is my code:
import re

file = open('datasets/movie_titles_metadata.txt')

def extract_categories(file):
    for line in file:
        line: str = line.rstrip()
        if re.search(" ", line):
            line = re.sub(r"[0-9]", "", line)
            line = re.sub(r"[$ + : . ]", "", line)
    return line

extract_categories(file)
I need to get an output that looks like this:
['action', 'comedy', 'crime', 'drama', 'thriller']
Can someone help?
Regex is not the correct solution for this. Each of your lists is at the end of its line, so use str.rsplit:
from io import StringIO
import ast

content = """m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']
m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']"""

# this is a mock file-handle, use your file instead here
with StringIO(content) as fh:
    genres = []
    for line in fh:
        # the 1 means that only 1 split occurs
        _, lst = line.rsplit('+++$+++', 1)
        # use ast to convert the string representation
        # to a python list
        lst = ast.literal_eval(lst.strip())
        # extend your result list
        genres.extend(lst)

print(genres)
['comedy', 'romance', 'adventure', 'biography', 'drama', 'history']
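The expected output in the question looks like the list from a single line, so it may be that one list per line is wanted rather than one flat list. A minimal sketch of that variation, assuming the same `+++$+++` separator; `parse_genres` and `SEPARATOR` are hypothetical names introduced here, and the sample lines are the two from the question:

```python
import ast
from io import StringIO

SEPARATOR = '+++$+++'  # field separator used in the question's file


def parse_genres(line):
    """Return the genre list stored in the last field of one line (hypothetical helper)."""
    # maxsplit=1 makes rsplit cut only at the right-most separator
    _, last_field = line.rsplit(SEPARATOR, 1)
    # literal_eval safely turns the text "['a', 'b']" into a real list
    return ast.literal_eval(last_field.strip())


# mock file-handle built from the question's sample lines
sample = StringIO(
    "m0 +++$+++ 10 things i hate about you +++$+++ 1999 +++$+++ 6.90 +++$+++ 62847 +++$+++ ['comedy', 'romance']\n"
    "m1 +++$+++ 1492: conquest of paradise +++$+++ 1992 +++$+++ 6.20 +++$+++ 10421 +++$+++ ['adventure', 'biography', 'drama', 'history']\n"
)

# one list per non-blank line
per_line = [parse_genres(line) for line in sample if line.strip()]
# all distinct genres, alphabetically ordered
unique_sorted = sorted({genre for genres in per_line for genre in genres})

print(per_line)
print(unique_sorted)
```

With a real file, the `StringIO` object would be replaced by something like `open('datasets/movie_titles_metadata.txt')` inside a `with` block.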
Alternatively, if you want to use regex instead:
import re

def extract_categories(file):
    categories = []
    for line in file:
        # keep only the part after the last separator
        _, line = line.rsplit('+++$+++', 1)
        if re.search(r"\['[a-z]+", line):
            # capture each quoted lowercase word
            res = re.findall(r"'([a-z]+)'", line)
            categories.extend(res)
    return categories
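One thing to watch with the regex version: `[a-z]+` only matches genres made of plain lowercase letters, so an entry containing a hyphen or a space would be silently skipped. A hedged sketch of a looser variant that captures anything between single quotes; the sample line is made up for illustration and is not taken from the data file:

```python
import re


def extract_categories(lines):
    """Collect every single-quoted value from the last field of each line."""
    categories = []
    for line in lines:
        # keep only the text after the last separator
        _, tail = line.rsplit('+++$+++', 1)
        # capture whatever sits between a pair of single quotes
        categories.extend(re.findall(r"'([^']+)'", tail))
    return categories


# made-up line for illustration only (assumption: a hyphenated genre could occur)
sample = ["m9 +++$+++ some title +++$+++ 2000 +++$+++ 5.00 +++$+++ 100 +++$+++ ['sci-fi', 'drama']"]

print(extract_categories(sample))
```

With the stricter `'([a-z]+)'` pattern the same line would yield only `['drama']`, which is why `ast.literal_eval` is usually the more robust choice for this format.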