Write all lines for each set of a range to a new file each time the range changes (Python 3.6)
I'm trying to find a way of making this process work Pythonically, or at all. Basically, I have a really long text file that is split into lines. Every x lines there is one that is mainly uppercase, which should roughly be the title of that particular section. Ideally, I'd want the title and everything after it to go into a text file, using the title as the name of the file. This would have to happen 3039 times in this case, as that is how many titles there are.

My process so far is this: I created a function that reads a piece of text and tells me whether it's mostly uppercase.
import numpy as np

def mostly_uppercase(text):
    threshold = 0.7
    # one flag per character: 1 if uppercase, 0 otherwise
    isupper_bools = [character.isupper() for character in text]
    isupper_ints = [int(val) for val in isupper_bools]
    try:
        upper_percentage = np.mean(isupper_ints)
    except Exception:
        return False
    if upper_percentage >= threshold:
        return True
    else:
        return False
Afterwards, I made a counter so that I could create an index of headline positions, and then I used it to slice the text into articles:
counter = 0
headline_indices = []
for line in page_text:
    if mostly_uppercase(line):
        print(line)
        headline_indices.append(counter)
    counter+=1
headlines_with_articles = []
headline_indices_expanded = [0] + headline_indices + [len(page_text)-1]
for first, second in list(zip(headline_indices_expanded, headline_indices_expanded[1:])):
    article_text = (page_text[first:second])
    headlines_with_articles.append(article_text)
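The index-based split above can also be sketched as a single pass that groups lines directly, which avoids the boundary arithmetic on the index list. This is only a minimal sketch under assumptions, not the code from the post: `split_on_headlines` and its `is_headline` parameter are hypothetical names, `mostly_upper` is a plain-Python stand-in for the uppercase test, and the sample lines are made up for illustration.

```python
def split_on_headlines(lines, is_headline):
    """Group lines into sections, starting a new section at each headline.

    Lines that appear before the first headline form their own leading section.
    """
    sections = []
    for line in lines:
        # open a new section on a headline, or if no section exists yet
        if is_headline(line) or not sections:
            sections.append([])
        sections[-1].append(line)
    return sections


def mostly_upper(text, threshold=0.7):
    # fraction of characters that are uppercase; empty text is never a headline
    if not text:
        return False
    return sum(1 for c in text if c.isupper()) / len(text) >= threshold


# Made-up sample data (an assumption, standing in for page_text).
sample = ["FIRST TITLE", "some body text", "more text", "SECOND TITLE", "last line"]
sections = split_on_headlines(sample, mostly_upper)
```

Each element of `sections` is a list of lines whose first entry is the headline, so the last section keeps its final line without any special handling of the end index.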
All of that seems to be working fine as far as I can tell. But when I try to write the pieces I want to files, all I manage to do is write the entire text into every one of the .txt files.
for i in range(100):
    out_pathname = '/sharedfolder/temp_directory/' + 'new_file_' + str(i) + '.txt'
    with open(out_pathname, 'w') as fo:
        fo.write(articles_filtered[2])
Edit: This got me halfway there. Now I just need a way of naming each file after its first line.
for i,text in enumerate(articles_filtered):
    open('/sharedfolder/temp_directory' + str(i + 1) + '.txt', 'w').write(str(text))
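Naming each file after its first line could look roughly like the sketch below. It assumes each article is a list of lines without trailing newlines and that the first line is usable as a file name as-is. `write_articles` is a hypothetical helper, the sample data is invented, and a temporary directory stands in for the real output folder.

```python
import os
import tempfile


def write_articles(articles, out_dir):
    """Write each article (a list of lines) to out_dir, named after its first line."""
    paths = []
    for article in articles:
        if not article:
            continue  # skip empty slices, which have no first line to use as a name
        title = article[0].strip()
        path = os.path.join(out_dir, title + ".txt")
        # the with statement closes the file even if the write fails
        with open(path, "w") as fo:
            fo.write("\n".join(article) + "\n")
        paths.append(path)
    return paths


# Made-up sample data (an assumption, standing in for headlines_with_articles).
articles = [["FIRST TITLE", "body one"], [], ["SECOND TITLE", "body two", "more"]]
out_dir = tempfile.mkdtemp()
written = write_articles(articles, out_dir)
```

Two articles that share the same first line would overwrite each other here, so a real run over thousands of titles would probably need some way of making the names unique.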
One conventional way of processing a single input file involves using a Python with statement and a for loop, in the following way. I have also adapted a good answer from someone else for counting uppercase characters, to get the fraction you need.
def mostly_upper(text):
    threshold = 0.7
    ## adapted from https://stackoverflow.com/a/18129868/131187
    if not text:
        # an empty line cannot be a title; this also avoids dividing by zero
        return False
    upper_count = sum(1 for c in text if c.isupper())
    return upper_count/len(text) >= threshold

first = True
out_file = None
with open('some_uppers.txt') as some_uppers:
    for line in some_uppers:
        line = line.rstrip()
        if first or mostly_upper(line):
            # start a new output file, closing the previous one if there is one
            first = False
            if out_file: out_file.close()
            out_file = open(line+'.txt', 'w')
        print(line, file=out_file)
# out_file is still None if the input file was empty
if out_file: out_file.close()
In the loop, we read each line, asking whether it's mostly uppercase. If it is, we close the file that was being used for the previous collection of lines and open a new file for the next collection, using the contents of the current line as the title.

I allow for the possibility that the first line might not be a title. In this case the code creates a file with the contents of the first line as its name, and proceeds to write everything it finds to that file until it does find a title line.
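Because the title line is used directly as a file name, a title containing a character such as "/" or ":" could make the open call fail or write somewhere unexpected. A minimal sketch of one way to guard against that follows. `safe_filename` and its `max_length` parameter are hypothetical names introduced here, and the replacement rules are an assumption rather than anything from the original answer.

```python
import re


def safe_filename(title, max_length=100):
    """Turn a title line into a conservative file name (without extension)."""
    # replace anything other than word characters, hyphens, or spaces
    cleaned = re.sub(r"[^\w\- ]", "_", title).strip()
    # collapse runs of whitespace into single underscores
    cleaned = re.sub(r"\s+", "_", cleaned)
    # fall back to a fixed name if nothing usable is left
    return cleaned[:max_length] or "untitled"
```

In the loop above, this would mean opening `safe_filename(line) + '.txt'` instead of `line + '.txt'`. Distinct titles can still collapse to the same name, so this does not by itself prevent collisions.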