
Python writing only 1 line for CSV file

My apologies for reiterating this question; however, it's still yet to be solved.

It's not a very complex problem and I'm certain it's fairly straightforward, but I simply cannot see the issue.

My code for parsing through an XML file opens and reads the data in the format that I want - the print statement in the final for-loop proves this.

As an example it outputs this:

Pivoting support handle D0584129 20090106 US

Hinge D0584130 20090106 US

Deadbolt turnpiece D0584131 20090106 US

And this is exactly how I want my data written into the CSV file. However, when I attempt to actually write these as rows into the CSV itself, it only writes one of the last lines from the XML file, like this:

Flashlight package,D0584138,20090106,US

Here is my entire code, since it might help in understanding the whole process; the area of interest is where the for xml_string in separated_xml loop begins:

from bs4 import BeautifulSoup
import csv
import unicodecsv as csv

infile = "C:\\Users\\Grisha\\Documents\\Inventor\\2009_Data\\Jan\\ipg090106.xml"

# The first line of code defines a function "separated_xml" that will allow us to separate, read, and then finally parse the data of interest with BeautifulSoup

def separated_xml(infile):  # Defining the data reading function for each xml section - This breaks apart the xml from the start (root element <?xml...) to the next iteration of the root element 
    file = open(infile, "r")   # Used to open the xml file
    buffer = [file.readline()] # Reads the first line and stores it in a list

# The first for-loop is used to slice every section of the USPTO XML file to be read and parsed individually
# It is necessary because Python wishes to read only one instance of a root element but this element is found many times in each file which causes reading errors

    for line in file:       # Running for-loop for the opened file and searches for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer)  # 1) Using "yield" allows to generate one instance per run of a root element and 2) .join takes the list (vector) "buffer" and connects an empty string to it
            buffer = []     # Creates a blank list to store the beginning of a new 'set' of data in beginning with the root element
        buffer.append(line) # Passes lines into list
    yield "".join(buffer)   # Outputs
    file.close()

# The second nested set of for-loops are used to parse the newly reformatted data into a new list

for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml")     # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = []  # Creating empty list to append into


    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect = 'excel')

        for info in pub_ref:  # Looping over all instances of publication

# The final loop finds every instance of invention name, patent number, date, and country to print and append into


                for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                    print(inv_name.text, pat_num.text, date_num.text, country.text)
                    lst.append((inv_name.text, pat_num.text, date_num.text, country.text))                   
                    writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

I've also tried placing the open and writer outside the for-loops to check where the problem arises, but to no avail. I know that the file is writing only 1 row at a time and overwriting the same line over and over (which is why only 1 line remains in the CSV file); I just can't see it.

Thanks so much in advance for the help.

I believe (first working theory, anyway) the basis of your problem is the fact that your with open statement falls within your for loop and uses a mode of "wb", which overwrites the file if it already exists. This means each time your for loop runs it overwrites anything that was there previously, and leaves you with only a single line of output once it's done.
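
To make the overwrite behaviour concrete, here is a minimal, self-contained sketch (demo.txt is just a hypothetical file name, unrelated to your data) showing why only the last row survives when the open call sits inside the loop:

    rows = ["first", "second", "third"]

    # Re-opening the file in write mode inside the loop truncates it on every pass,
    # which is exactly what 'wb' does in the code above.
    for row in rows:
        with open("demo.txt", "w") as f:   # "demo.txt" is a hypothetical file name
            f.write(row + "\n")

    with open("demo.txt") as f:
        print(f.read())   # only "third" remains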

There are two ways I could see you handling this. The more correct way would be to move the file open statement outside of the outermost for loop. I know you mention that you've tried this already, but the devil is in the details. This would make your updated code look something like this:

    with open('./output.csv', 'wb') as f:
      writer = csv.writer(f, dialect='excel')

      for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")
        lst = []

        for info in pub_ref:

          for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            print(inv_name.text, pat_num.text, date_num.text, country.text)
            lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
            writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

The hacky, but faster and easier, way would be to simply change the mode in your open call to "ab" (append, binary) rather than "wb" (write binary, which overwrites any existing data). This is far less efficient, as you're still re-opening the file each time through the for loop, but it would probably work.
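
For completeness, the append-mode variant would look roughly like this (a sketch that reuses the question's separated_xml generator, infile path, and unicodecsv writer; note that any existing output.csv should be deleted or truncated once before the run, since 'ab' keeps appending across runs):

    # Sketch of the append-mode workaround; it reuses the imports, infile and
    # separated_xml generator defined in the question above.
    for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")

        with open('./output.csv', 'ab') as f:   # 'ab' appends instead of truncating
            writer = csv.writer(f, dialect='excel')
            for info in pub_ref:
                for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                    writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])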

I hope this helps!

with open('./output.csv', 'wb') as f:

Simply needed to change 'wb' -> 'ab' so it no longer overwrites.

It did not work the first time around, but moving the open call before the last 2 loops fixed this. Thanks to those who helped.
