
Python writing only 1 line for CSV file

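I apologize for asking this question again, but it is still unresolved.

It is not a very complicated problem and I am sure the answer is simple, but I just cannot see it.

My code for parsing the XML file opens and reads the file in the format I want; the print statement in the final for loop proves this.

As an example, it outputs the following: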

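Rotating support handle D0584129 20090106 US

Hinge D0584130 20090106 US

Deadbolt turn plate D0584131 20090106 US

This is exactly how I want the data written to the CSV file. However, when I try to write these as rows to the CSV itself, only the last row from the XML file ends up in the file, in this form:

Flashlight package,D0584138,20090106,US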

Here is my full code, since it may help to understand the whole process. The area of interest starts at the line "for xml_string in separated_xml(infile):".

from bs4 import BeautifulSoup
import csv
import unicodecsv as csv

infile = "C:\\Users\\Grisha\\Documents\\Inventor\\2009_Data\\Jan\\ipg090106.xml"

# The first line of code defines a function "separated_xml" that will allow us to separate, read, and then finally parse the data of interest with

def separated_xml(infile):  # Defining the data reading function for each xml section - This breaks apart the xml from the start (root element <?xml...) to the next iteration of the root element 
    file = open(infile, "r")   # Used to open the xml file
    buffer = [file.readline()] # Used to read each line and placing inside vector

# The first for-loop is used to slice every section of the USPTO XML file to be read and parsed individually
# It is necessary because Python wishes to read only one instance of a root element but this element is found many times in each file which causes reading errors

    for line in file:       # Running for-loop for the opened file and searches for root elements
        if line.startswith("<?xml "):
            yield "".join(buffer)  # 1) Using "yield" allows to generate one instance per run of a root element and 2) .join takes the list (vector) "buffer" and connects an empty string to it
            buffer = []     # Creates a blank list to store the beginning of a new 'set' of data in beginning with the root element
        buffer.append(line) # Passes lines into list
    yield "".join(buffer)   # Outputs
    file.close()

# The second nested set of for-loops are used to parse the newly reformatted data into a new list

for xml_string in separated_xml(infile): # Calls the output of the separated and read file to parse the data
    soup = BeautifulSoup(xml_string, "lxml")     # BeautifulSoup parses the data strings where the XML is converted to Unicode
    pub_ref = soup.findAll("publication-reference") # Beginning parsing at every instance of a publication
    lst = []  # Creating empty list to append into


    with open('./output.csv', 'wb') as f:
        writer = csv.writer(f, dialect = 'excel')

        for info in pub_ref:  # Looping over all instances of publication

            # The final loop finds every instance of invention name, patent number, date, and country to print and append into
            for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
                print(inv_name.text, pat_num.text, date_num.text, country.text)
                lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
                writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])

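I also tried moving the open and the writer outside the for loops to check where the problem occurs, but to no avail. I know the file is writing only 1 row at a time and overwriting the same row over and over (which is why only 1 row remains in the CSV file); I just cannot see where.

Thank you very much in advance for your help.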

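I think (as a first working theory, anyway) the root of your problem is that your with open statement sits inside your for loop and uses the "wb" mode, which overwrites the file if it already exists. This means that every time your for loop runs, it overwrites whatever was there before, leaving you with only a single line of output once it finishes.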
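To make the overwrite behaviour concrete, here is a minimal sketch that reproduces it in isolation. It assumes Python 3 and the standard csv module (so text mode "w" stands in for the question's "wb" with unicodecsv); the rows, the temporary path, and the helper names are made up for illustration and are not from the question.

```python
import csv
import os
import tempfile

# Hypothetical sample rows standing in for the parsed patent data.
rows = [
    ("title A", "D0000001", "20090106", "US"),
    ("title B", "D0000002", "20090106", "US"),
    ("title C", "D0000003", "20090106", "US"),
]


def write_reopening_each_time(path, rows):
    """Re-open the file in 'w' mode for every row (the problematic pattern)."""
    for row in rows:
        # Opening with 'w' truncates the file, discarding earlier rows.
        with open(path, "w", newline="") as f:
            csv.writer(f).writerow(row)


def read_rows(path):
    """Read the CSV back as a list of lists."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_reopening_each_time(path, rows)
result_overwritten = read_rows(path)
print(result_overwritten)  # only the last row is left in the file
```

Each pass through the loop truncates the file before writing, so three writes still leave one row behind.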

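I can see two ways to handle this. The more correct way is to move the file open statement outside of the outermost for loop. I know you mentioned that you already tried this, but the devil is in the details. That would make the updated code look like this: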

    with open('./output.csv', 'wb') as f:
      writer = csv.writer(f, dialect='excel')

      for xml_string in separated_xml(infile):
        soup = BeautifulSoup(xml_string, "lxml")
        pub_ref = soup.findAll("publication-reference")
        lst = []

        for info in pub_ref:

          for inv_name, pat_num, date_num, country in zip(soup.findAll("invention-title"), soup.findAll("doc-number"), soup.findAll("date"), soup.findAll("country")):
            print(inv_name.text, pat_num.text, date_num.text, country.text)
            lst.append((inv_name.text, pat_num.text, date_num.text, country.text))
            writer.writerow([inv_name.text, pat_num.text, date_num.text, country.text])
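The code above is written for Python 2 with unicodecsv, which is why it opens the file in binary mode. As a rough sketch of the same "open once, write many" structure on Python 3, the standard csv module expects a text-mode file opened with newline="". The function name write_all_rows, the batches data, and the temporary path below are assumptions for illustration; each batch stands in for the rows produced from one XML section.

```python
import csv
import os
import tempfile


def write_all_rows(path, batches):
    """Open the CSV once, then write every row from every batch."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, dialect="excel")
        for batch in batches:      # stands in for the loop over XML sections
            for row in batch:      # stands in for the inner zip(...) loop
                writer.writerow(row)


# Hypothetical data: one batch of rows per XML section.
batches = [
    [("title A", "D0000001", "20090106", "US")],
    [("title B", "D0000002", "20090106", "US")],
    [("title C", "D0000003", "20090106", "US")],
]

out_path = os.path.join(tempfile.mkdtemp(), "output.csv")
write_all_rows(out_path, batches)

with open(out_path, newline="", encoding="utf-8") as f:
    written = list(csv.reader(f))
print(len(written))  # 3
```

Because the file is opened exactly once, it is truncated exactly once, and every later writerow call adds to what is already there.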

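The hackier, but quicker and easier, way is to simply change the mode in your open call to "ab" (append, binary) rather than "wb" (write binary, which overwrites any existing data). This is far less efficient, since you still re-open the file on every pass through the for loop, but it would probably work.

I hope this helps!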
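One side effect of the append approach is worth seeing in a small sketch: because nothing ever truncates the file, running the same writes a second time adds the rows again. The example assumes Python 3 with the standard csv module (text mode "a" in place of "ab"); the helper names, rows, and path are hypothetical.

```python
import csv
import os
import tempfile


def append_row(path, row):
    """Re-open the file in append mode and add a single row."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(row)


def count_rows(path):
    """Count the rows currently stored in the CSV."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f))


# Hypothetical sample rows.
sample_rows = [
    ("title A", "D0000001", "20090106", "US"),
    ("title B", "D0000002", "20090106", "US"),
    ("title C", "D0000003", "20090106", "US"),
]

csv_path = os.path.join(tempfile.mkdtemp(), "output.csv")

for row in sample_rows:          # first "run" of the script
    append_row(csv_path, row)
first_run = count_rows(csv_path)

for row in sample_rows:          # second "run": rows are added again
    append_row(csv_path, row)
second_run = count_rows(csv_path)

print(first_run, second_run)  # 3 6
```

So with append mode, an output file left over from a previous run would need to be deleted or emptied first if duplicates are not wanted.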

with open('./output.csv', 'wb') as f:

Just change 'wb' -> 'ab' so that it does not overwrite.

The first suggestion did not solve the problem, but moving the open call to before the last two loops did. Thanks to those who helped.

