
Python writing only 1 line for CSV file

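I apologize for reiterating this question, but it is still unresolved.

It is not a very complicated problem and I am sure the answer is simple, but I just cannot see it.

My code for parsing the XML file opens and reads it in the format I want; the print statement in the final for loop proves that.

As an example, it prints the following: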

Rotating support handle D0584129 20090106 US

Hinge D0584130 20090106 US

Latch-bolt turntable D0584131 20090106 US

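This is exactly how I want the data written to the CSV file. However, when I try to write these as rows to the CSV itself, only the last record from the XML file ends up in it, like this:

Flashlight package,D0584138,20090106,US

Here is my full code, since it may help to understand the whole process. The area of interest starts at for xml_string in separated_xml(infile):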

from bs4 import BeautifulSoup
import csv
import unicodecsv as csv

infile = "C:\\Users\\Grisha\\Documents\\Inventor\\2009_Data\\Jan\\ipg090106.xml"

# The first line of code defines a function "separated_xml" that will allow us to separate, read, and then finally parse the data of interest with

def separated_xml(infile):  # Defining the data reading function for each xml section - This breaks apart the xml from the start (root element <?xml...) to the next iteration of the root element 
    file = open(infile, "r")   # Used to open the xml file
    buffer = [file.readline()] # Used to read each line and placing inside vector

# The first for-loop is used to slice every section of the USPTO XML file to be read and parsed individually
# It is necessary because Python wishes to read only one instance of a root element but this element is found many times in each file which causes reading errors

    for line in file:       # Running for-loop for the opened file and searches for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer)  # 1) Using "yield" allows to generate one instance per run of a root element and 2) .join takes the list (vector) "buffer" and connects an empty string to it
            buffer = []     # Creates a blank list to store the beginning of a new 'set' of data in beginning with the root element
        buffer.append(line) # Passes lines into list
    yield "".join(buffer)   # Outputs
    file.close()

# The second nested set of for-loops are used to parse the newly reformatted data into a new list

for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml")     # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = []  # Creating empty list to append into


    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect = 'excel')

        for info in pub_ref:  # Looping over all instances of publication

# The final loop finds every instance of invention name, patent number, date, and country to print and append into


                for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                    print(inv_name.text, pat_num.text, date_num.text, country.text)
                    lst.append((inv_name.text, pat_num.text, date_num.text, country.text))                   
                    writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

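I also tried placing the open and the writer outside the for loops to see where the problem occurs, but to no avail. I understand that the file is written one row at a time and that the same row is overwritten again and again (which is why only 1 row remains in the CSV file); I just cannot see where it happens.

Thank you very much in advance for your help.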

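I think (it is my first working theory, anyway) the root of the problem is that your with open statement sits inside your for loop and uses the 'wb' mode, which overwrites the file if it already exists. This means that every time the for loop runs, it overwrites whatever was there before, leaving only a single line of output once it finishes.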
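The effect is easy to reproduce in isolation. The following is a minimal sketch, not the asker's code: it assumes Python 3 and the standard csv module (the question uses unicodecsv on Python 2, hence its 'wb' mode), and the rows and the temporary file path are made up for illustration.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the parsed patent records.
rows = [
    ("Hinge", "D0584130", "20090106", "US"),
    ("Flashlight package", "D0584138", "20090106", "US"),
]

# Temporary location so the sketch does not touch a real output.csv.
path = os.path.join(tempfile.mkdtemp(), "output.csv")

for row in rows:
    # Opening in "w" mode truncates the file on every iteration,
    # so each pass discards whatever the previous pass wrote.
    with open(path, "w", newline="") as f:
        csv.writer(f, dialect="excel").writerow(row)

# Read the file back to see what survived.
with open(path, newline="") as f:
    surviving = list(csv.reader(f))

print(surviving)  # only the last row is left
```

Both rows are written, but the second open call wipes the first one, which matches the single-line file described in the question.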

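There are two ways I can see to handle this. The more correct way is to move the file open statement outside the outermost for loop. I know you mentioned that you already tried this, but the devil is in the details. That would make the updated code look like this: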

    with open('./output.csv', 'wb') as f:
      writer = csv.writer(f, dialect='excel')

      for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")
        lst = []

        for info in pub_ref:

          for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            print(inv_name.text, pat_num.text, date_num.text, country.text)
            lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
            writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
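For readers on Python 3, the same open-once pattern might look roughly like the sketch below. It rests on several assumptions: it uses the standard csv module in text mode with newline="" rather than unicodecsv with 'wb', and write_rows and chunks are hypothetical names, with chunks standing in for the rows extracted from each XML section.

```python
import csv
import os
import tempfile


def write_rows(path, chunks):
    """Write every row from every chunk to one CSV file that is opened once."""
    # The csv documentation recommends newline="" for files given to csv.writer.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, dialect="excel")
        count = 0
        for chunk in chunks:      # stands in for the loop over separated_xml(infile)
            for row in chunk:     # stands in for the inner zip(...) loop
                writer.writerow(row)
                count += 1
    return count


# Hypothetical data: one list of rows per XML section.
chunks = [
    [("Hinge", "D0584130", "20090106", "US")],
    [("Flashlight package", "D0584138", "20090106", "US")],
]

path = os.path.join(tempfile.mkdtemp(), "output.csv")
written = write_rows(path, chunks)

# Read the file back to confirm that rows from both sections are present.
with open(path, newline="", encoding="utf-8") as f:
    result = list(csv.reader(f))

print(written, result)
```

Because the file is opened before either loop starts, the truncation caused by "w" happens exactly once, and every later writerow call adds to the same file.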

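The hacky, but quicker and easier, way is to simply change the mode in the open call to 'ab' (append, binary) instead of 'wb' (write, binary, which overwrites any existing data). This is much less efficient, because the file is still reopened on every pass through the for loop, but it would probably work.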
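One side effect of append mode is worth seeing before relying on it: rows from earlier runs stay in the file. The sketch below assumes Python 3 and the standard csv module (so the mode is "a" rather than 'ab'); append_row, run_once, the rows and the temporary path are all hypothetical.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for the parsed patent records.
rows = [
    ("Hinge", "D0584130", "20090106", "US"),
    ("Flashlight package", "D0584138", "20090106", "US"),
]


def append_row(path, row):
    # "a" mode adds to the end of the file instead of truncating it.
    with open(path, "a", newline="") as f:
        csv.writer(f, dialect="excel").writerow(row)


def run_once(path):
    # Mimics one execution of the script: the file is reopened for every row.
    for row in rows:
        append_row(path, row)


path = os.path.join(tempfile.mkdtemp(), "output.csv")
run_once(path)
run_once(path)  # a second run appends the same rows again

with open(path, newline="") as f:
    lines = list(csv.reader(f))

print(len(lines))  # 4: two rows per run, nothing was overwritten
```

If the script is meant to produce a fresh file each time, the old output would have to be deleted before each run, which is one more reason to prefer opening the file once in write mode.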

I hope this helps!

with open('./output.csv', 'wb') as f:

Just change 'wb' -> 'ab' and the file will not be overwritten.

That did not solve the problem at first, but moving the open call before the last two loops fixed it. Thanks to those who helped.

