Save data in XML file in Python

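I am trying to save my data to an XML file. The data comes from a website whose reviews I want to collect. Each page always has five reviews, and I want to save them together in one file in XML format. The problem is that if I print the XML tree with print(ET.tostring(root, encoding='utf8').decode('utf8')), I get all five reviews I want. But if I save them to a file with tree.write("test.xml", encoding='unicode'), I only see one review. Here is my code: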

import requests
from bs4 import BeautifulSoup
import re
import json
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

source = requests.get('https://www.tripadvisor.ch/Hotel_Review-g188113-d228146-Reviews-Coronado_Hotel-Zurich.html#REVIEWS').text

soup = BeautifulSoup(source, 'lxml')
pattern = re.compile(r'window.__WEB_CONTEXT__={pageManifest:(\{.*\})};')
script = soup.find("script", text=pattern)
dictData = pattern.search(script.text).group(1)
jsonData = json.loads(dictData)

def get_countrycitydata():

    countrycity_dict = dict()

    country_data = jsonData['urqlCache']['3960485871']['data']['locations']
    for data in country_data:
        data1 = data['parents']
        countrycity_dict["country_name"] = data1[2]['name']
        countrycity_dict["tripadvisorid_city"] = data1[0]['locationId']
        countrycity_dict["city_name"] = data1[0]['name']

    return countrycity_dict

def get_hoteldata():

    hotel_dict = dict()

    locations = jsonData['urqlCache']['669061039']['data']['locations']
    for data in locations:
        hotel_dict["tripadvisorid_hotel"] = data['locationId']
        hotel_dict["hotel_name"] = data['name']

    return hotel_dict

def get_reviews():  

    all_dictionaries = []

    for locations in jsonData['urqlCache']['669061039']['data']['locations']:
        for reviews in locations['reviewListPage']['reviews']:

            review_dict = {}

            review_dict["reviewid"] = reviews['id']
            review_dict["reviewurl"] =  reviews['absoluteUrl']
            review_dict["reviewlang"] = reviews['language']
            review_dict["reviewtitle"] = reviews['title']
            reviewtext = reviews['text']
            clean_reviewtext = reviewtext.replace('\n', ' ')
            review_dict["reviewtext"] = clean_reviewtext

            all_dictionaries.append(review_dict)

    return all_dictionaries

def xml_tree(new_dict): # should I change something here???

    root = ET.Element("countries")
    country = ET.SubElement(root, "country")

    ET.SubElement(country, "name").text = new_dict["country_name"]
    city = ET.SubElement(country, "city")

    ET.SubElement(city, "tripadvisorid").text = str(new_dict["tripadvisorid_city"])
    ET.SubElement(city, "name").text = new_dict["city_name"]
    hotels = ET.SubElement(city, "hotels")

    hotel = ET.SubElement(hotels, "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(new_dict["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = new_dict["hotel_name"]
    reviews = ET.SubElement(hotel, "reviews")

    review = ET.SubElement(reviews, "review")
    ET.SubElement(review, "reviewid").text = str(new_dict["reviewid"])
    ET.SubElement(review, "reviewurl").text = new_dict["reviewurl"]
    ET.SubElement(review, "reviewlang").text = new_dict["reviewlang"]
    ET.SubElement(review, "reviewtitle").text = new_dict["reviewtitle"]
    ET.SubElement(review, "reviewtext").text = new_dict["reviewtext"]

    tree = ET.ElementTree(root)
    tree.write("test.xml", encoding='unicode')  

    print(ET.tostring(root, encoding='utf8').decode('utf8'))

##########################################################  

def main():

    city_dict = get_countrycitydata()
    hotel_dict = get_hoteldata()
    review_list = get_reviews()

    for index in range(len(review_list)):
        new_dict = {**city_dict, **hotel_dict, **review_list[index]}

        xml_tree(new_dict)

if __name__ == "__main__":
    main()  

How can I change the XML tree so that all five reviews are saved in the file? The XML file should look like this:

<countries>
    <country>
        <name>Schweiz</name>
        <city>
            <tripadvisorid>188113</tripadvisorid>
            <name>Zürich</name>
            <hotels>
                <hotel>
                    <tripadvisorid>228146</tripadvisorid>
                    <name>Hotel Coronado</name>
                    <reviews>
                        <review>
                            <reviewid>672052111</reviewid> 
                            <reviewurl>https://www.tripadvisor.ch/ShowUserReviews-g188113-d228146-r672052111-Coronado Hotel-Zurich.html</reviewurl>
                            <reviewlang>de</reviewlang>
                            <reviewtitle>Optimale Lage und Preis</reviewtitle>
                            <reviewtext>Hervorragendes Hotel.Beste Erfahrun mit Service und Zimme.Die Qalität der Betten ist optimalr. Zimmer sind trotz geringer Größe sehr gut ausgestattet.Der Föhn war in diesem Fall (nicht in früheren)etwas lahm</reviewtext>
                        </review>
                        <review>
                         second review here ...
                        </review>
                        <review>
                         third review here ...
                        </review>
                        ...
                    </reviews>
                </hotel>
            </hotels>
        </city>
    </country>
</countries>

Thanks in advance for all your suggestions!

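Because your xml_tree(new_dict) call sits inside the for loop, tree.write() runs once per review, and each call overwrites the file written by the previous one. Only the last review is left at the end.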
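The overwriting can be reproduced in isolation. The following is a minimal sketch, not the code from the question: the `write_review` helper, the temporary file path, and the review ids 1 to 3 are made up for the demonstration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_review(path, review_id):
    # Build a fresh tree holding a single review and write it to `path`.
    # Given a filename, ElementTree.write() opens it for writing,
    # so whatever the file contained before is replaced.
    root = ET.Element("reviews")
    ET.SubElement(root, "review").text = str(review_id)
    ET.ElementTree(root).write(path, encoding="unicode")


# Hypothetical review ids and a temporary path, used only for this demo.
path = os.path.join(tempfile.mkdtemp(), "test.xml")
for review_id in (1, 2, 3):
    write_review(path, review_id)

with open(path, encoding="utf-8") as fh:
    content = fh.read()

print(content)  # only the review written last is still in the file
```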

Open the file with open() in 'a' (append) mode and pass the file object to tree.write():

with open('test.xml', 'a') as f:
    tree.write(f, encoding='unicode')

See the documentation here.
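Note that in append mode every call still writes a complete <countries> document, so the file ends up with several root elements one after another rather than the single tree shown in the question. A possible alternative, sketched here under assumptions, is to build the country/city/hotel part once, add one <review> element per review in a loop, and write the file a single time. The `build_tree` function and the placeholder dictionaries below are hypothetical. In the real script the data would come from get_countrycitydata(), get_hoteldata() and get_reviews().

```python
import xml.etree.ElementTree as ET


def build_tree(city_dict, hotel_dict, review_list):
    # Create the shared country/city/hotel elements exactly once.
    root = ET.Element("countries")
    country = ET.SubElement(root, "country")
    ET.SubElement(country, "name").text = city_dict["country_name"]

    city = ET.SubElement(country, "city")
    ET.SubElement(city, "tripadvisorid").text = str(city_dict["tripadvisorid_city"])
    ET.SubElement(city, "name").text = city_dict["city_name"]

    hotels = ET.SubElement(city, "hotels")
    hotel = ET.SubElement(hotels, "hotel")
    ET.SubElement(hotel, "tripadvisorid").text = str(hotel_dict["tripadvisorid_hotel"])
    ET.SubElement(hotel, "name").text = hotel_dict["hotel_name"]
    reviews = ET.SubElement(hotel, "reviews")

    # Only this part loops: one <review> element per review dictionary.
    for review_dict in review_list:
        review = ET.SubElement(reviews, "review")
        for key in ("reviewid", "reviewurl", "reviewlang", "reviewtitle", "reviewtext"):
            ET.SubElement(review, key).text = str(review_dict[key])

    return ET.ElementTree(root)


# Placeholder data standing in for the scraped dictionaries (assumption).
city_dict = {"country_name": "Schweiz", "tripadvisorid_city": 188113, "city_name": "Zürich"}
hotel_dict = {"tripadvisorid_hotel": 228146, "hotel_name": "Hotel Coronado"}
review_list = [
    {"reviewid": i, "reviewurl": "https://example.com/review/%d" % i,
     "reviewlang": "de", "reviewtitle": "Title %d" % i, "reviewtext": "Text %d" % i}
    for i in range(1, 6)
]

tree = build_tree(city_dict, hotel_dict, review_list)
# Write the whole document once, after all reviews have been attached.
tree.write("test.xml", encoding="utf-8", xml_declaration=True)
```

With this layout main() would no longer merge the dictionaries per review. It would collect the three results and call the tree-building function once.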

