
Python writing only 1 line for CSV file

My apologies for reiterating this question; however, it still hasn't been solved.

It's not a very complex problem and I'm certain it's fairly straightforward, but I simply cannot see the issue.

My code opens, reads, and parses the XML file into the format that I want - the print statement in the final for-loop proves this.

As an example it outputs this:

Pivoting support handle D0584129 20090106 US

Hinge D0584130 20090106 US

Deadbolt turnpiece D0584131 20090106 US

And this is exactly how I want my data written into the CSV file. However, when I attempt to actually write these as rows into the CSV itself, it only writes one of the last lines from the XML file, like this:

Flashlight package,D0584138,20090106,US

Here is my entire code, since it might help in understanding the whole process; the area of interest is where the for xml_string in separated_xml loop begins:

from bs4 import BeautifulSoup
import unicodecsv as csv  # unicode-aware drop-in for the stdlib csv module; expects binary-mode files

infile = "C:\\Users\\Grisha\\Documents\\Inventor\\2009_Data\\Jan\\ipg090106.xml"

# The function "separated_xml" allows us to separate and read each XML document so the data of interest can be parsed individually

def separated_xml(infile):  # Defines the data reading function for each xml section - this breaks the file apart from each XML declaration (<?xml...) to the next occurrence of that declaration
    file = open(infile, "r")   # Opens the xml file
    buffer = [file.readline()] # Reads the first line and starts the buffer list with it

# The first for-loop is used to slice every section of the USPTO XML file to be read and parsed individually
# It is necessary because an XML parser expects a single root element per document, but this file concatenates many documents, which causes parsing errors

    for line in file:       # Running for-loop for the opened file and searches for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer)  # 1) "yield" emits one complete document per XML declaration found, and 2) "".join(buffer) concatenates the buffered lines into a single string
            buffer = []     # Starts a fresh buffer for the next document, which begins with this declaration line
        buffer.append(line) # Passes the current line into the buffer
    yield "".join(buffer)   # Emits the final document after the loop ends
    file.close()
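To illustrate how a generator like this splits a concatenated file, here is a self-contained Python 3 sketch. The temp file and the two tiny sample documents are made up for illustration; they are not the asker's USPTO data:

```python
import os
import tempfile

# Same splitting idea as the question's separated_xml(): yield one complete
# XML string per "<?xml " declaration encountered in the file.
def separated_xml(infile):
    with open(infile, "r") as f:
        buffer = [f.readline()]
        for line in f:
            if line.startswith("<?xml "):
                yield "".join(buffer)
                buffer = []
            buffer.append(line)
        yield "".join(buffer)

# Two concatenated XML documents in one file (illustrative data).
data = '<?xml version="1.0"?>\n<a>1</a>\n<?xml version="1.0"?>\n<b>2</b>\n'
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
    tmp.write(data)

docs = list(separated_xml(tmp.name))
print(len(docs))  # 2 separate XML strings, one per declaration
os.unlink(tmp.name)
```

Each yielded string is then a well-formed, single-root document that BeautifulSoup can parse on its own.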

# The second, nested set of for-loops is used to parse the newly reformatted data into a new list

for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml")     # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = []  # Creating empty list to append into


    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect = 'excel')

        for info in pub_ref:  # Looping over all instances of publication

# The final loop finds every instance of invention name, patent number, date, and country to print and append into


                for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                    print(inv_name.text, pat_num.text, date_num.text, country.text)
                    lst.append((inv_name.text, pat_num.text, date_num.text, country.text))                   
                    writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

I've also tried placing the open and writer outside the for-loops to check where the problem arises, but to no avail. I know that the file is writing only one row at a time and overwriting the same line over and over (which is why only one line remains in the CSV file); I just can't see where.

Thanks so much for the help in advance.

I believe (first working theory, anyway) the basis of your problem is that your with open statement falls within your for loop and uses a mode of "wb", which overwrites the file if it already exists. This means each time your for loop runs, it overwrites anything that was there previously and leaves you with only a single line of output once it's done.
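A minimal Python 3 sketch of that failure mode, using made-up rows rather than the asker's XML data (note Python 3's csv wants text mode with newline="", where the question's Python 2 unicodecsv wanted "wb"):

```python
import csv

# Illustrative rows standing in for the parsed patent records.
rows = [["a", 1], ["b", 2], ["c", 3]]

for row in rows:
    # "w" truncates the file on every iteration, so each pass
    # wipes out whatever the previous pass wrote.
    with open("demo.csv", "w", newline="") as f:
        csv.writer(f).writerow(row)

with open("demo.csv", newline="") as f:
    print(f.read())  # only "c,3" survives, just like the asker's single line
```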

There are two ways I could see you handling this. The more correct way would be to move the file open statement outside of the outermost for loop. I know you mention that you've tried this already, but the devil is in the details. This would make your updated code look something like this:

    with open('./output.csv', 'wb') as f:
      writer = csv.writer(f, dialect='excel')

      for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")
        lst = []

        for info in pub_ref:

          for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            print(inv_name.text, pat_num.text, date_num.text, country.text)
            lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
            writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

The hacky, but faster and easier way would be to simply change the mode in your open call to "ab" (append, binary) rather than "wb" (write binary, which overwrites any existing data). This is far less efficient as you're still re-opening the file each time through the for loop, but it would probably work.
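The same sketch with append mode (again with illustrative stand-in rows, in Python 3 where "a" in text mode plays the role of the answer's "ab"):

```python
import csv

rows = [["a", 1], ["b", 2], ["c", 3]]

# Truncate once before the loop so repeated runs start from an empty file;
# append mode on its own never truncates.
open("demo_append.csv", "w").close()

for row in rows:
    # "a" appends instead of truncating, so every row survives even though
    # the file is wastefully reopened on each iteration.
    with open("demo_append.csv", "a", newline="") as f:
        csv.writer(f).writerow(row)

with open("demo_append.csv", newline="") as f:
    print(f.read())  # all three rows are present
```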

I hope this helps!

with open('./output.csv', 'wb') as f:

Simply needed the change 'wb' -> 'ab' so the file is opened for appending instead of being overwritten.

It did not work the first time around, but moving the open call before the last two loops fixed it. Thanks to those who helped.
