
How could I add a \n Newline character after every 1000 words in a text file?

Alright, so here's the question. I have some text files with 14,000+ words in them, but they are all on one line, and if you use an editor that doesn't have an auto-wrapping feature you can't read the text file. So I would like to add returns or newline characters to my file after at least 1000 words, at the next occurrence of a ".". My first thought was to count the lines, add up the words, and insert a \n character when the count reached 1000, but with everything on one line that is a bit more difficult. I haven't been able to find a way to accomplish what I want without going through the text file and adding the newlines myself, which defeats the purpose of just running a Python script to do it for me automatically. Is this possible? Or am I crazy for thinking so? Thanks in advance for any help you can provide! I have provided my various attempts below.

In this attempt the code works as expected, but instead of printing "Word Count is over 1000." about 14 times (the word count for this text file is 14,000 and something), it only prints once, since there is only one line for it to read.

text_file = "textfile.txt"
numLines = 0
numWords = 0
numChars = 0

with open(text_file, 'r') as file:
    for line in file:
        wordsList = line.split()
        numLines +=1
        numWords += len(wordsList)
        numChars += len(line)
        if numWords > 1000:
            print("Word Count is over 1000.")

In this next attempt I did something similar but still got the same result as above. Instead of writing \n\n\n\n to the text file about 14 times, it only did so once, at the end of the file.

def oldWordCounter(input_file):
    word_count = 0

    with open(input_file, 'r') as f:
        for line in f:
            word_count = len(line.split(' '))
            print("Word count = %s \n" % word_count)

    if word_count > 1000: 
        with open(input_file, 'a') as f:
            f.write("\n\n\n\n")

I am sure I am just missing something simple, but I am pretty new to Python. Even though it kills me to ask a question on here, I am at my wits' end and can't seem to get any further than this. So again, thank you so much for any help you can provide on this issue!

Also, below is the way I planned to add the newlines after the next period occurred. Not sure if this will help, but it might show more of what I was trying to accomplish.

def splitOnPeriod(input_file):
    with open(input_file,"r") as f:
        for line in f:
            searchPhrase = "."
            if searchPhrase in line:
                file = open(input_file, "a")
                file.write("\n\n\n\n")
                print("found it\n")

Here is a small portion of the text I am working with...

World headquarters, only business Google without bada bing bada boom, guess who's back inside your room. It is the Thrive time show on your radio. My name is Clay Clark, the former and recovering disc jockey. I am joined today Inside the Box rocks with with a guy. He sees he's on telling you what he's he's back in Tulsa for at least the foreseeable future, maybe maybe for several days several minutes. It'S dr. Robert zoellner, sir welcome back. I am so fired up today. I am in such a great mood and right now I could see Marshall and I could see his reaction as I get to announce why I'm so happy all really. Yes, I glorious thing happen this weekend. You'Re discovering more hair is growing and I like you're, going with that by the way this happened to do with a little support. We Americans love so much call football Hurricane football. Absolutely I mean the world. I have waited a year to get the world right again and in my Oklahoma, Sooners go up to Columbus and whoop. I mean now. Let'S talk about the facts here, cuz there's a lot of people listening. This is a business, show its business school without the BS to keep it relevant to make sure that understand this Oklahoma. If I'm correct was right, number 5 correct and I believe that Ohio state was ranked number 2. Yes, why you leave in the box of rocks? Do is In-N-Out Marshall to the drivers who don't know Marshall, for business coaches in Ohio from Ohio and he's not so he really cares about Ohio. Yes, fifth-ranked Boomer Sooners went up there and beat him was a close. Now. It wasn't even close, really really good, and so then I'm so that was Saturday and then Sunday this last weekend and I've been waiting to have Marshall in the Box, because I can't make this announcement without you really here to sit on that till Wednesday. Clear the clear that kind of thing I didn't seem last couple things on Sunday, the Dallas Cowboys won the double bonus. 
Can I will quick on this and I've loved the Patriots and Jonathan are off as he hates the Patriots, and so whenever his Giants lose, I almost feel better about their loss. I almost feel better about their loss, then actual win for the Patriots and when I saw the Cowboys just turn it on I'm like this is great. I don't care what team it is as long as they're playing the Giants. I am I'm almost. I wouldn't make a prayer chain, but I will be on the verge of making your prayer chain for your team excited to see, but I don't care who it is they beat. The Giants is a great thing for American I'm a Little Lamb lunch Wagers. I am going to whenever he pays off on The Chew very slowly and enjoy every moment of tizers have reserved, but I'll have to I'll. Have I don't normally do it, but since you're paying for it Marshall, I think I will now on Today Show we're breaking down to six books that every entrepreneur should read the six books at every entrepreneur should read, and a book number one was thinking, Grow. Rich book number to you can actually get that book for free. It is start here the book The we put together the documents, our business cyst shamelessly. So if you want to learn how to grow successful company to start here to 550 page book, it's absolutely free to download it Thrive time show. And we just hit the amazon.com best sellers list on that. So if you go to Amazon now and you type in like business Consulting into the search bar, that book actually comes up in the top five books now, and so that's a book that you can get there for free to ebook, it's absolutely free for you. We move on now to book number 3, which is Titan now. Titan is the book that documents, the Life, The Life and Times of John D Rockefeller, who actually grew up like everybody else, use Easy. You start somewhere. 
He grew up poor and at the age of 16 he began working to support his mother because his father was an absent father and actually decided to leave his family and get married to another woman without telling his current wife it's breaking down some notable quotables from That book and I'm going to go ahead and give you the first notable quotable. This is John D. Rockefeller Miss. Is it from the book tighten the author writes he had a great generals, ability to focus on his goals and a brush aside obstacles as Petty distractions. He wants said you can abuse me.

This code starts a new line every 1000 words, and also after any word that contains a ".", resetting the word count each time:

words = s.split()
new_text = ""
word_count = 0
for word in words:
    new_text += word + " "
    word_count += 1
    if word_count == 1000 or "." in word:
        new_text += "\n"
        word_count = 0

Where s is the string read from the file. Simply write new_text to the file afterwards.
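
Note that the loop above starts a new line after every word containing a period, not only once 1000 words have passed. If the goal is the one stated in the question (at least 1000 words, then break at the next "."), a minimal sketch could look like the following. The name break_after_period and the min_words parameter are made up for illustration, and a "sentence end" is assumed to be any word that ends with a period.

```python
def break_after_period(text, min_words=1000):
    """Insert a newline at the first sentence end after at least min_words words.

    Hypothetical helper: a "sentence end" here is simply a word ending in ".".
    """
    lines = []
    current = []
    for word in text.split():
        current.append(word)
        # Break only once enough words have accumulated AND a sentence ends here.
        if len(current) >= min_words and word.endswith("."):
            lines.append(" ".join(current))
            current = []
    if current:  # keep any trailing words that never reached a break
        lines.append(" ".join(current))
    return "\n".join(lines)


sample = "One two three. Four five. Six seven eight nine. Ten."
print(break_after_period(sample, min_words=4))
# One two three. Four five.
# Six seven eight nine.
# Ten.
```

To apply it to a file, read the whole file into a string with f.read(), pass that string in, and write the returned string back out.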

Read all the words into a list, then write a '\n' to the file after every 1000th word or after any word containing a period.

AllWords = []
for line in open("data_words.txt"):
    row = line.split(' ')
    AllWords+=list(row)

line_breaker=1000
i=1
with open("/home/kiran/km/km_hadoop/data/data_wordcount_op.txt", 'a') as file:
    for word in AllWords:
        if("." in word or i==line_breaker):
            file.write(word.strip('\n')+"\n")
            i=0
        else:
            file.write(word.strip('\n')+" ")

        i+=1
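
The same rule can be tried on an in-memory list before touching any files. The sketch below is a rewrite under assumptions, not the code above: iter_lines and its max_words parameter are hypothetical names, and it yields one line whenever a word contains a period or max_words words have accumulated.

```python
def iter_lines(words, max_words=1000):
    """Yield lines, ending one at a word containing "." or after max_words words."""
    current = []
    for word in words:
        current.append(word)
        if "." in word or len(current) == max_words:
            yield " ".join(current)
            current = []
    if current:  # emit whatever is left over at the end
        yield " ".join(current)


demo = "a b c d. e f g h i j".split()
for line in iter_lines(demo, max_words=3):
    print(line)
# a b c
# d.
# e f g
# h i j
```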

To answer your first question, I defined a linewrapper function that takes a file and the number of words you want per line. It uses the modulo operator on each word's position. Since the index i starts at 0, i + 1 is the number of words written so far, and (i + 1) % wrap_length is 0 exactly when that count is a multiple of wrap_length. For instance, if wrap_length is 100 and i is 96, the remainder is 97, so no newline is written; at i = 99 the remainder is 0 and a newline is written. Using i + 1 also avoids a special case for i = 0, since 0 modulo anything is 0. You can read more about how to apply that operator here: https://docs.python.org/3.3/reference/expressions.html#binary-arithmetic-operations

def linewrapper(input_path, wrap_length):
    with open(input_path, 'r') as input_file, open('output.txt', 'w') as output_file:
        for line in input_file:
            words = line.split()
            for i, word in enumerate(words):
                output_file.write('%s ' % word)
                # i is 0-based, so i + 1 is the number of words written so far
                if (i + 1) % wrap_length == 0:
                    output_file.write("\n")

linewrapper('input.txt', 100)
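
An alternative to index arithmetic is to slice the word list in steps of the wrap length, which avoids off-by-one questions about where the index starts. This is only a sketch, assuming the whole text fits in memory; wrap_words is a hypothetical name, not part of the answer above.

```python
def wrap_words(text, wrap_length):
    """Return text with a newline after every wrap_length words."""
    words = text.split()
    # Step through the list wrap_length words at a time and join each slice.
    chunks = [
        " ".join(words[start:start + wrap_length])
        for start in range(0, len(words), wrap_length)
    ]
    return "\n".join(chunks)


print(wrap_words("one two three four five six seven", 3))
# one two three
# four five six
# seven
```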
