Python web scrape: writing output to a file

Question

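I have a basic Python script that stores its output in a file. That file is hard to parse. Is there another way to write the scraped data to a file so that it can easily be read back into Python for analysis?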

import requests
from bs4 import BeautifulSoup as BS
import json

data = 'C:/test.json'
url = "http://sfbay.craigslist.org/search/sby/sss?sort=rel&query=baby"

r = requests.get(url)
soup = BS(r.content, "html.parser")  # name the parser explicitly
links = soup.find_all("p")
# print(soup.prettify())

for link in links:
    connections = link.text
    # Each listing is appended as its own JSON string, with no separator
    with open(data, 'a') as f:
        f.write(json.dumps(connections, indent=1))

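The output file contains the following: " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "" $7500 Sep 5 George Steck Baby Grand Player Piano $7500 (morgan hill) map musical instruments - by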

Answer 1

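If you want to write it to a file from Python and read it back into Python later, you can use pickle (see the Pickle tutorial).

Pickle files are binary and not human-readable. If that matters to you, you could look at YAML instead. I admit it has a bit of a learning curve, but it produces nicely formatted files.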

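import yaml

# filename and data are whatever path and object you want to save
with open(filename, 'w') as f:
    f.write(yaml.dump(data))

...


with open(filename, 'r') as stream:
    data = yaml.safe_load(stream)  # safe_load avoids constructing arbitrary objects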
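
For comparison with the YAML snippet, a pickle round trip might look like the sketch below. The sample strings and the file name scraped.pkl are placeholders rather than values from the question, and the file is written to a temporary directory so the example cleans up after itself.

```python
import os
import pickle
import tempfile

# Placeholder data standing in for the text of the scraped <p> elements
listings = [
    "$25 Sep 5 Porcelain Baby Deer $25 (sunnyvale)",
    "$50 Sep 5 Baby Carrier - Black/Silver (san jose)",
]


def save_listings(items, path):
    """Serialize the list to a binary pickle file."""
    with open(path, "wb") as f:
        pickle.dump(items, f)


def load_listings(path):
    """Read the list back; only unpickle files you created yourself."""
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scraped.pkl")  # hypothetical file name
    save_listings(listings, path)
    restored = load_listings(path)

print(restored == listings)
```

The file comes back as the same Python list, but it cannot be inspected in a text editor, which is the trade-off mentioned above.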

Answer 2

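It sounds like your question is more about how to parse the scraped data you get from craigslist than about how to deal with files. One approach is to take each <p> element and tokenize its string on spaces. For example, tokenizing the string

" $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "

can be done with split: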

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "
L = s.strip().split(' ') #remove whitespace at ends and break string apart by spaces

L is now a list with the values

['$25', 'Sep', '5', 'Porcelain', 'Baby', 'Deer', '$25', '(sunnyvale)', 'pic', 'household', 'items', '-', 'by', 'owner']

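From here, you can try to work out what each list element means from the order in which it appears. L[0] probably always holds the price, L[1] the month, L[2] the day of the month, and so on. If you are interested in writing these values to a file and parsing them again later, consider reading up on the csv module.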
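
As a rough sketch of that idea, the csv module can store the tokens as columns and read them back. The positional meaning of the fields (price, month, day, then everything else) is only the guess described above, and the column names and file name are made up for illustration.

```python
import csv
import os
import tempfile

s = " $25 Sep 5 Porcelain Baby Deer $25 (sunnyvale) pic household items - by owner "


def to_row(text):
    """Split a listing on whitespace and assign meaning by position."""
    tokens = text.split()
    # Assumed layout: price, month, day, then the remaining words
    return [tokens[0], tokens[1], tokens[2], " ".join(tokens[3:])]


def write_rows(rows, path):
    """Write a header line followed by one CSV row per listing."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["price", "month", "day", "rest"])  # hypothetical header
        writer.writerows(rows)


def read_rows(path):
    """Read the file back as a list of dictionaries keyed by column name."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "listings.csv")  # hypothetical file name
    write_rows([to_row(s)], path)
    records = read_rows(path)

print(records[0]["price"], records[0]["month"], records[0]["day"])
```

The csv module handles quoting, so commas inside the free-text column would not break the file.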

Answer 3

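1. Decide what data you actually need. Price? Description? Listing date?
2. Decide on a good data structure to hold this information. I'd suggest a class containing the relevant fields, or a list.
3. Scrape the data you need using regular expressions or one of the many other methods.
4. Throw away what you don't need.

5A. Write the list contents to a file in a format that you can easily use later (XML, comma-delimited, etc.).

OR

5B. Pickle the objects as suggested by Mike Ounsworth above.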

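If you aren't comfortable with XML parsing yet, you can simply write one line per link and separate the fields you want with a delimiter character that you can split on later. For example: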

import re  # I'm going to use regular expressions here

# Raw string so the backslashes reach the regex engine unchanged;
# the price class is [0-9] so that prices containing a zero (e.g. $50) match.
link_content_matcher = re.compile(r"""\$(?P<price>[0-9]{1,4})\s+(?P<list_date>[A-Z]{1}[a-z]{2}\s+[0-9]{1,2})\s+(?P<description>.*)\((?P<location>.*)\)""")

some_link = "$50    Sep 5 Baby Carrier - Black/Silver (san jose)"

# Grab the matches
matched_fields = link_content_matcher.search(some_link)

# Write what you want to a file using a delimiter that
# probably won't exist in the description. This is risky,
# but will do in a pinch.
with open('results.txt', 'w') as output_file:
    output_file.write("{price}^{date}^{desc}^{location}\n".format(
        price=matched_fields.group('price'),
        date=matched_fields.group('list_date'),
        desc=matched_fields.group('description').strip(),
        location=matched_fields.group('location')))

When you want to revisit that data, read it from the file line by line and parse each line with split.

with open('results.txt', 'r') as input_file:
    for line in input_file:
        # Drop the trailing newline before splitting on the delimiter
        price, date, desc, location = line.rstrip('\n').split('^')
        # Do something with this data or add it to a list
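
The steps listed above can also be sketched end to end with a small class and one JSON object per line. This is only an illustration: the Listing class, its field names, the simplified pattern, and the file name are all assumptions, and JSON lines is just one possible format that is easy to read back later.

```python
import json
import os
import re
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class Listing:
    """Hypothetical container for the fields worth keeping."""
    price: int
    list_date: str
    description: str
    location: str


# Simplified pattern for lines shaped like "$50 Sep 5 Some text (place)"
PATTERN = re.compile(
    r"\$(?P<price>\d+)\s+(?P<list_date>[A-Z][a-z]{2}\s+\d{1,2})\s+"
    r"(?P<description>.*)\((?P<location>.*)\)"
)


def parse_listing(text):
    """Return a Listing, or None when the text does not fit the pattern."""
    m = PATTERN.search(text)
    if m is None:
        return None
    return Listing(int(m.group("price")), m.group("list_date"),
                   m.group("description").strip(), m.group("location"))


def save(listings, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for item in listings:
            f.write(json.dumps(asdict(item)) + "\n")


def load(path):
    """Rebuild Listing objects from the JSON lines file."""
    with open(path) as f:
        return [Listing(**json.loads(line)) for line in f]


raw = ["$50    Sep 5 Baby Carrier - Black/Silver (san jose)", "not a listing"]
# Throw away anything that did not parse
parsed = [x for x in map(parse_listing, raw) if x is not None]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "listings.jsonl")  # hypothetical file name
    save(parsed, path)
    restored = load(path)

print(restored)
```

Because JSON escapes special characters itself, this avoids the risk of the delimiter turning up inside a description.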

Answer 4

import re

import requests
from bs4 import BeautifulSoup as bs

url = "http://sfbay.craigslist.org/baa/"
r = requests.get(url)
soup = bs(r.content, "html.parser")

# Listing titles: anchors whose class contains "hdrlnk"
s = soup.find_all('a', class_=re.compile("hdrlnk"))
for i in s:
    print(i.text)

# Prices: spans whose class contains "price"
s1 = soup.find_all('span', class_=re.compile("price"))

4 answers

Answer 1
1 2014-09-05 18:47:56

Answer 2
0 accepted 2014-09-05 19:47:29

Answer 3
0 2014-09-05 20:29:26

Answer 4
0 2014-09-11 05:01:10
