How to split a txt file into two lists and then split one list into its captions
I have this text file:
1e.jpg#0 A dog going for a walk .
2e.jpg#1 A boy is going to swim
3e.jpg#2 A girl is chasing the cat .
4e.jpg#3 Three people are going to a hockey game
I need to split it into two separate lists: one for the IDs and one for the sentences. This is where I need help, because I then need to split the sentences list into the following:
[["a", "dog", "going", "for", "a"...], ["a",......]]
This is how far I got:
path = "s.txt"
l1 = []
l2 = []
read_file = open(path, "r")
split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    l2.append(line.split("\t")[1:])
print(l2)
You can use the same principle. Called with no arguments, the split function splits on whitespace by default. I also removed the : from l2.append(line.split("\t")[1:]) so that it returns a string instead of a list with one element:
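path = "s.txt"
l1 = []  # IDs
l2 = []  # sentences, one string per line
# "with" closes the file once all lines have been read
with open(path, "r") as read_file:
    split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    l2.append(line.split("\t")[1])
# split each sentence on whitespace into its words
words_list = []
for s in l2:
    words_list.append(s.split())
print(words_list)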
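To see why dropping the : matters, here is a minimal sketch on a single line. It assumes the ID and the caption are separated by a tab, as the code above does; the sample string is made up to match the first line of the file. The lower() call is an optional extra, added only because the desired output in the question is lowercase.

```python
# Hypothetical sample line: ID and caption separated by a tab (assumption).
line = "1e.jpg#0\tA dog going for a walk ."

fields = line.split("\t")

as_list = fields[1:]  # slicing returns a list, here with one element
as_str = fields[1]    # indexing returns the caption string itself

# Optional: lowercase first, since the question's target output is lowercase.
words = as_str.lower().split()

print(as_list)
print(as_str)
print(words)
```

Only the string form has a split method, which is why the slice version cannot be split into words directly.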
If you don't mind punctuation ending up in your lists, you can split the string directly in your current code (assuming each line contains only one tab character):
path = "s.txt"
l1 = []
l2 = []
with open(path, "r") as read_file:
    split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    # take the sentence and split it into words in one step
    l2.append(line.split("\t")[1].split())
print(l2)
Output:
[['A', 'dog', 'going', 'for', 'a', 'walk', '.'], ['A', 'boy', 'is', 'going', 'to', 'swim'], ['A', 'girl', 'is', 'chasing', 'the', 'cat', '.'], ['Three', 'people', 'are', 'going', 'to', 'a', 'hockey', 'game']]
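If the single-tab assumption might not hold, one possible variation is to cut each line at the first tab only, using str.partition. This is a sketch, not part of the original answer: the function name split_captions and the sample lines are made up for illustration, and the second sample line deliberately contains an extra tab.

```python
def split_captions(lines):
    """Return (ids, word_lists) from an iterable of 'id<TAB>caption' lines."""
    ids, word_lists = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # partition cuts at the first tab only; later tabs stay in the caption
        image_id, _, caption = line.partition("\t")
        ids.append(image_id)
        # split() with no arguments treats any whitespace, tabs included, as a separator
        word_lists.append(caption.split())
    return ids, word_lists


# Hypothetical sample data; the second line has a stray tab inside the caption.
sample = [
    "1e.jpg#0\tA dog going for a walk .\n",
    "2e.jpg#1\tA boy is\tgoing to swim\n",
]
ids, word_lists = split_captions(sample)
print(ids)
print(word_lists)

# With a real file, an open file object can be passed in the same way:
# with open("s.txt") as f:
#     ids, word_lists = split_captions(f)
```

Because the function takes any iterable of lines, it can be tried on a list of strings before pointing it at the file.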
If you want to remove non-word elements, you can use re.split:
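import re
path = "s.txt"
# a space, optionally with one non-word character on either side of it
split_pattern = re.compile(r'\W? \W?')
l1 = []
l2 = []
with open(path, "r") as read_file:
    split = [line.strip() for line in read_file]
for line in split:
    l1.append(line.split("\t")[0])
    # "if x" drops the empty strings re.split leaves behind
    word_list = [x for x in re.split(split_pattern, line.split("\t")[1]) if x]
    l2.append(word_list)
print(l2)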
Output:
[['A', 'dog', 'going', 'for', 'a', 'walk'], ['A', 'boy', 'is', 'going', 'to', 'swim'], ['A', 'girl', 'is', 'chasing', 'the', 'cat'], ['Three', 'people', 'are', 'going', 'to', 'a', 'hockey', 'game']]
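Note that the pattern above only strips punctuation that sits next to a space, so a caption ending in "walk." (no space before the period) would keep the period. A possible alternative, swapped in here as a sketch rather than taken from the answer, is to collect the words with re.findall instead of splitting on the gaps. The names WORD_RE and caption_words are made up for illustration.

```python
import re

# One or more word characters (letters, digits, underscore).
WORD_RE = re.compile(r"\w+")


def caption_words(caption):
    """Return the runs of word characters in caption, dropping punctuation."""
    return WORD_RE.findall(caption)


spaced = caption_words("A girl is chasing the cat .")
attached = caption_words("A dog going for a walk.")

# For comparison: the split pattern keeps punctuation attached to a word.
split_result = re.split(r"\W? \W?", "a walk.")

print(spaced)
print(attached)
print(split_result)
```

One trade-off to be aware of: \w+ also breaks words at apostrophes and hyphens, so "don't" comes back as "don" and "t".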