Web scraping with mechanize and BeautifulSoup - cannot write output file

I need to get lots of data specific to river cruising, so I am working with Alteryx, and for the scraping I want to use Python from the command line. I need to write the output to JSON or CSV, but the output file comes out empty. The hash signs in the code are delimiters for processing the output file in Alteryx, since the scraped text already contains commas. Preferably I would love to map the output to JSON. My code is as follows:

from mechanize import Browser
from bs4 import BeautifulSoup
import lxml

mech = Browser()

url = 'http://www.cruiseshipschedule.com/viking-river-cruises/viking-aegir-schedule/'
page = mech.open(url)

html = page.read()
html.replace('charset="ISO-8859-1"','charset=utf-8')
s = BeautifulSoup(html, "lxml")
content = s.findAll('div', id="content")
link = s.findAll("a")
h1 = s.findAll("h1")

table = s.findAll("table", border="1")

for link in s.findAll("a"):
    linktext = link.text
    linkhref = link.get("href")

for h1 in s.findAll("h1"):
    ship = h1.text

h2_1 = s.h2
h2_1.text
h2_2 = h2_1.find_next('h2')
itinerary_1 = h2_2.text
h2_3 = h2_2.find_next('h2')
itinerary_2 = h2_3.text
h2_4 = h2_3.find_next('h2')
itinerary_3 = h2_4.text

for table in content:
    table0 = s.findAll("table", border='0')

    for tr in s.findAll("table", border='1'):
        trs1 = s.findAll("tr")
        table1 = tr.text.replace('\n','|')
        tds1 = s.findAll('td')
        uls1 = s.findAll('ul')
        lis1 = s.findAll('li')

    for tr in s.findAll("table", border='0'):
        trs2 = s.findAll("tr")
        table2 = tr.text.replace('\n','|')
        tds2 = s.findAll('td')
        uls2 = s.findAll('ul')
        lis2 = s.findAll('li')

all_data=ship+"#"+table1+"#"+table2+"#"+itinerary_1+"#"+itinerary_2+"#"+itinerary_3


all_data = open("Z:/txt files/all_data.txt", "w")
print all_data >> "Z:/txt files/all_data.txt"

To get output to your file, try something like this instead of the last two lines of your code. As written, those lines rebind all_data to a freshly opened file object, discarding the string you just built (which is why the file gets created but stays empty), and the print ... >> redirection needs an open file object on its right-hand side, not a path string:

with open('all_data.txt', 'w') as f:
    f.write(all_data.encode('utf8'))
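
Since you said you would prefer JSON, here is a minimal sketch that maps the scraped values to a dict and writes it with the standard json module. The key names are my own assumptions (rename them to whatever your Alteryx workflow expects), and it assumes the variables from your script (ship, table1, table2, itinerary_1, itinerary_2, itinerary_3) are in scope:

import json

# Collect the scraped values into one record; the key names are
# illustrative, not anything your Alteryx flow requires.
record = {
    "ship": ship,
    "itinerary_1": itinerary_1,
    "itinerary_2": itinerary_2,
    "itinerary_3": itinerary_3,
    "table_border1": table1,
    "table_border0": table2,
}

# ensure_ascii=False keeps accented port names readable; encoding to
# UTF-8 bytes explicitly works the same under Python 2 and 3.
with open("Z:/txt files/all_data.json", "wb") as f:
    f.write(json.dumps(record, ensure_ascii=False, indent=2).encode("utf8"))

If you end up wanting CSV instead, the csv module can keep your "#" as the delimiter, so the commas inside the scraped text survive without the manual string concatenation (again a sketch under the same assumptions, using the Python 2 recipe of encoding each field to UTF-8 before writing):

import csv

# One row with the same fields you joined with "#" by hand.
row = [ship, table1, table2, itinerary_1, itinerary_2, itinerary_3]

with open("Z:/txt files/all_data.csv", "wb") as f:
    writer = csv.writer(f, delimiter="#")
    writer.writerow([field.encode("utf8") for field in row])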
