
Writing into csv file after BeautifulSoup

I'm using BeautifulSoup to extract some text, and then I want to save the entries into a CSV file. My code is as follows:

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    saveFile = open("some.csv", "a")
    saveFile.write(str(tdTags_string) + ",")
    saveFile.close()

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()

It did what I want for the most part, EXCEPT that whenever there is a comma (",") within an entry, it is treated as a separator and the single entry gets split across two cells (which is not what I want). So I searched around the net, found people suggesting the csv module, and changed my code to:

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    print tdTags_string

    with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow(tdTags_string)

saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()

This made it even worse: now each letter/number of a word occupies its own cell in the CSV file. For example, if the entry is "Cat", the "C" is in one cell, "a" is in the next cell, "t" is in the third cell, and so on.
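(For reference, the csv module does solve the original comma problem: a field containing the delimiter is quoted rather than split. A minimal sketch, using Python 3 syntax and an in-memory buffer rather than a real file:)

```python
import csv
import io

# A field that contains the delimiter is quoted by csv.writer,
# so it stays in a single cell instead of splitting in two.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["Smith, John", "California"])
print(buf.getvalue())  # -> '"Smith, John",California\r\n'
```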

Edited version:

import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)

    with open("SomeSite.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string])

2nd edition:

placeHolder = []

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    placeHolder.append(tdTags_string)

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Updated output:

u'stuff1'
u'stuff2'
u'stuff3'

Output example:

u'record1'  u'31 Mar 1901'  u'California'

u'record1'  u'31 Mar 1901'  u'California'

record1     31-Mar-01       California

Another edited version of the code (still with one issue: it skips a line, see below):

import urllib2
import re
import csv
from bs4 import BeautifulSoup

SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()

# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()

# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")

placeHolder = []

for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    #print repr(tdTags_string)
    placeHolder.append(tdTags_string.rstrip('\n'))

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
with open("some.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow([tdTags_string]) # put the string in a list

writeFile.writerow will iterate over whatever you pass in, so a string "foo" becomes f,o,o: three separate values. Wrapping it in a list prevents this, because the writer iterates over the one-element list rather than over the string.
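A minimal sketch of that difference (Python 3 syntax), writing to an in-memory buffer:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow("Cat")    # a bare string is iterated character by character
writer.writerow(["Cat"])  # wrapping it in a list yields a single cell
print(buf.getvalue())     # -> 'C,a,t\r\nCat\r\n'
```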

You should open your file once as opposed to every time through your loop:

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.get_text(strip=True)
        writeFile.writerow([tdTags_string])

For the remaining problem of the skipped line, I have found an answer. Instead of

with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Use this:

with open("SomeSite.csv", "ab") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)

Source: https://docs.python.org/3/library/functions.html#open . The "a" mode appends in text mode, whereas "ab" appends with the file opened in binary mode. In Python 2 the csv module expects a binary file handle: on Windows, text mode translates the "\r\n" line endings the csv writer emits into "\r\r\n", which shows up as an extra blank line between rows, and binary mode avoids that translation.
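Note that the "ab" fix is Python 2 specific. Under Python 3 the csv docs instead recommend opening the file in text mode with newline="", which avoids the same blank-line symptom; a sketch of the equivalent append:

```python
import csv

# Python 3 equivalent of the "ab" fix: newline="" lets the csv module
# control line endings, so no extra blank row appears between records.
with open("SomeSite.csv", "a", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["record1", "31 Mar 1901", "California"])
```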
