Writing into csv file after BeautifulSoup
I am using BeautifulSoup to extract some text and then save the entries into a csv file. My code is as follows:
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    saveFile = open("some.csv", "a")
    saveFile.write(str(tdTags_string) + ",")
    saveFile.close()
saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()
It does most of what I want, except that whenever an entry contains a comma (","), the comma is treated as a delimiter and the single entry gets split across two different cells (which is not what I want). So I searched online, found suggestions to use the csv module, and changed the code to:
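The comma problem is exactly what the csv module is for: it quotes any field that contains the delimiter, so the field stays in one cell. A minimal Python 3 sketch (the data values are made up for illustration):

```python
import csv
import io

# Write one row where the first field itself contains a comma.
# csv.writer quotes that field instead of letting it split into two cells.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Smith, John", "California"])
print(buf.getvalue())  # '"Smith, John",California\r\n'
```

A plain `file.write(value + ",")` cannot do this quoting, which is why hand-rolled CSV output breaks on embedded commas.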
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    print tdTags_string
    with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow(tdTags_string)
saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()
This made it even worse: now every letter/digit of a word or number occupies a single cell in the csv file. For example, if the entry is "cat", "c" is in one cell, "a" is in the next cell, "t" is in a third cell, and so on.
Edited version:
import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    with open("SomeSite.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string])
Second version:
placeHolder = []
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    placeHolder.append(tdTags_string)
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
Updated output:
u'stuff1'
u'stuff2'
u'stuff3'
Sample output:
u'record1' u'31 Mar 1901' u'California'
u'record1' u'31 Mar 1901' u'California'
record1 31-Mar-01 California
Another edited version of the code (there is still one problem: it skips a line, as follows):
import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")
placeHolder = []
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    #print repr(tdTags_string)
    placeHolder.append(tdTags_string.rstrip('\n'))
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
with open("some.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow([tdTags_string]) # put in a list

writeFile.writerow will iterate over whatever you pass in, so a string like "foo" becomes f,o,o: three separate values. Wrapping it in a list prevents this, since the writer iterates over the list rather than over the string.
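The iteration behaviour described above can be seen side by side in a short Python 3 sketch (writing to an in-memory buffer instead of a file):

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow("foo")    # the string is iterated: one cell per character
writer.writerow(["foo"])  # the list is iterated: one cell, the whole string
print(buf.getvalue())     # 'f,o,o\r\nfoo\r\n'
```

This is the same per-character splitting the question observed with "cat", and the one-element list is the fix.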
You should also open the file once, rather than on every iteration of the loop:
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.get_text(strip=True)
        writeFile.writerow([tdTags_string])
For the latest line-skipping problem, I found the answer. Instead of
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
use this:
with open("SomeSite.csv", "ab") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
Source: https://docs.python.org/3/library/functions.html#open. The "a" mode is append mode, while "ab" is append mode with the file opened as a binary file; opening in binary solved the problem of the extra skipped line between rows.
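Note that the "ab" fix applies to Python 2 (which this question's code uses); in Python 3 the csv module's documentation instead recommends opening the file in text mode with newline="". A minimal sketch under that assumption (file name and row values are illustrative):

```python
import csv
import os
import tempfile

# newline="" stops the runtime from translating the csv writer's own
# '\r\n' row endings, which on Windows would otherwise appear as an
# extra blank line between rows.
path = os.path.join(tempfile.mkdtemp(), "SomeSite.csv")
with open(path, "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["record1", "31 Mar 1901", "California"])
    writer.writerow(["record2", "01 Apr 1902", "Nevada"])

# Read it back: two consecutive rows, no blank line in between.
with open(path, newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Binary mode ("ab") is not valid for csv.writer in Python 3, since the writer produces str, not bytes, so newline="" is the portable replacement.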