Writing into a csv file after BeautifulSoup

I am using BeautifulSoup to extract some text, and I want to save the entries to a csv file. My code is as follows:

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    saveFile = open("some.csv", "a")
    saveFile.write(str(tdTags_string) + ",")
    saveFile.close()

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()

It does what I want for the most part, except that whenever an entry contains a comma (","), the comma is treated as a separator and the single entry is split across two cells, which is not what I want. I searched around and found suggestions to use the csv module, so I changed my code to:

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    print tdTags_string

    with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow(tdTags_string)

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()

This made it even worse: now each character of a word or number occupies its own cell in the csv file. For example, if the entry is "Cat", the "C" is in one cell, the "a" is in the next cell, and the "t" is in a third cell.

Edited version:

import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)

    with open("SomeSite.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string])

Second edit:

placeHolder = []

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    placeHolder.append(tdTags_string)

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Updated output:

u'stuff1'
u'stuff2'
u'stuff3'

Output example:

u'record1'  u'31 Mar 1901'  u'California'

u'record1'  u'31 Mar 1901'  u'California'

record1     31-Mar-01       California

Another edited version of the code (it still has one issue: it skips one line below):

import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")

placeHolder = []

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    #print repr(tdTags_string)
    placeHolder.append(tdTags_string.rstrip('\n'))

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Pass writerow a list rather than a bare string:

with open("some.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow([tdTags_string])  # put in a list

writeFile.writerow iterates over whatever you pass in, so the string "foo" becomes f,o,o: three separate values. Wrapping the string in a list prevents this, because the writer then iterates over the list, not the string.
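To make the difference concrete, here is a minimal sketch. It assumes Python 3 (the code in this thread is Python 2) and writes to an in-memory buffer instead of a file. The helper row_as_text and the sample values are hypothetical, introduced only for illustration.

```python
import csv
import io


def row_as_text(row):
    """Write one row with csv.writer and return the raw CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerow(row)
    return buf.getvalue()


# A bare string is iterated character by character: three fields.
bare_string = row_as_text("Cat")

# A one-element list is iterated as a list: one field.
wrapped = row_as_text(["Cat"])

# A field that contains a comma is quoted by the writer, so it stays one cell.
with_comma = row_as_text(["Doe, John"])

print(repr(bare_string))  # 'C,a,t\r\n'
print(repr(wrapped))      # 'Cat\r\n'
print(repr(with_comma))   # '"Doe, John"\r\n'
```

The last case is also what fixes the original comma problem: the csv module quotes the field, which hand-built string concatenation does not do.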

You should also open your file once, rather than on every pass through the loop:

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.get_text(strip=True)
        writeFile.writerow([tdTags_string])
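The same pattern can be tried without a live page. The sketch below assumes Python 3, uses a hypothetical list of values in place of the strings returned by get_text(strip=True), and writes to a temporary directory. It opens the file once, writes one single-field row per value, then reads the file back to check that a value containing a comma comes back as one cell.

```python
import csv
import os
import tempfile

# Hypothetical stand-ins for the strings returned by get_text(strip=True).
values = ["record1", "31 Mar 1901", "Sacramento, California"]

# Write into a throwaway directory so the example leaves no stray files behind.
path = os.path.join(tempfile.mkdtemp(), "SomeSite.csv")

# Open the file once, outside the loop. newline="" is what the Python 3
# csv documentation recommends for files passed to csv.writer.
with open(path, "a", newline="") as f:
    writer = csv.writer(f)
    for value in values:
        writer.writerow([value])  # one single-field row per value

# Read the file back to confirm each value is still a single cell.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```

To put all values on one row instead, as in the question's second edit, collect them in a list and call writerow once with that list.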

For the latest problem of the skipped line, I have found an answer. Instead of

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Use this:

with open("SomeSite.csv", "ab") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Source: https://docs.python.org/3/library/functions.html#open. The "a" mode appends in text mode, whereas "ab" appends while opening the file as a binary file, which solves the problem of skipping one extra line. Note that this fix applies to Python 2, which the code above uses; in Python 3, csv.writer needs a text-mode file, and the documented equivalent is to pass newline="" to open.
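For readers on Python 3, a rough equivalent of the "ab" fix is sketched below. It assumes Python 3 and uses sample values and a temporary path that are not from the original post. Opening the file with newline="" stops the text layer from translating the writer's line endings, which is what can otherwise produce a blank line between rows on Windows.

```python
import csv
import os
import tempfile

# Sample row; in the question this list is built from the scraped <td> text.
placeholder = ["record1", "31 Mar 1901", "California"]

path = os.path.join(tempfile.mkdtemp(), "SomeSite.csv")

# Text mode with newline="" instead of the Python 2 "ab" mode.
with open(path, "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(placeholder)
    writer.writerow(placeholder)

# Inspect the raw bytes: each row ends in exactly one \r\n, with no extra \r.
with open(path, "rb") as f:
    raw = f.read()

print(raw)
```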
