Parsing multiple xml files in folder then write to central csv

I'm new to python and have written the following code to parse multiple xml files in a directory and write the content to a central CSV file.

I have a folder with approximately 30 xml files. My issue is that it is only grabbing content from the first xml in the folder but not the rest. I think I have an issue with not having a loop? I am using Beautifulsoup and would like to stick to this as I have built understanding of it.

#open beautifulsoup library AND csv function

from bs4 import BeautifulSoup
import csv
import glob

#Open and read files in folder ending with .xml
for filename in glob.glob("*.xml"):
    with open(filename) as open_file:
        content = open_file.read()
        soup = BeautifulSoup(content, 'lxml')

#open and write csv file

csv_file = open('scrape.csv', 'a')
post_line = ['postid', 'subreddit', 'post title', 'author', 'post url', 'post date', 'post time', 'post score', 'submission text']
csv_writer = csv.writer(csv_file)
csv_writer.writerow(post_line)

#grab content from xml from following textblocks
#postid

for postid in soup.find('textblock', tid='7').text:
    pid = postid.split(':')[1]
    print(pid)

#subreddit
for subreddit in soup.find('textblock', tid='15').text:
    subred = subreddit.split(':')[1]
    print(subred)

#post title
for posttitle in soup.find('textblock', tid='12').text:
    ptitle = posttitle.split(':')[1]
    print(ptitle)

#author
for username in soup.find('textblock', tid='0').text:
    author = username.split(':')[1]
    print(author)

#post url
for posturl in soup.find('textblock', tid='13').text:
    url = posturl.split(':')[2]
    purl = f'https:{url}'
    print(purl)

#post date
for postdate in soup.find('textblock', tid='3').text:
    pdate = postdate.split()[1]
    print(pdate)

#post time
for posttime in soup.find('textblock', tid='3').text:
    ptime = posttime.split()[2]
    print(ptime)

#post score
for postscore in soup.find('textblock', tid='10').text:
    pscore = postscore.split(':')[1]
    print(pscore)

#submission text
for submission in soup.find('textblock', tid='20').text:
    print(submission)

#blank space
print()

csv_writer.writerow([pid, subred, ptitle, author, purl, pdate, ptime, pscore, submission])

csv_file.close()

You only use the soup of the last file:

...
for filename in glob.glob("*.xml"):
    with open(filename) as open_file:
        content = open_file.read()
        soup = BeautifulSoup(content, 'lxml')    # soup is overwritten on each pass; only the last file's soup survives the loop
...
# handle soup

You could make handle_soup a function and call it on every soup.

import glob
import csv

from bs4 import BeautifulSoup

csv_file = open('scrape.csv', 'a')
post_line = ['postid', 'subreddit', 'post title', 'author', 'post url', 'post date', 'post time', 'post score', 'submission text']
csv_writer = csv.writer(csv_file)
csv_writer.writerow(post_line)

def handle_soup(soup, csv_writer):
    pid = soup.find('textblock', tid='7').text.split(':')[1]
    print(pid)

    subred = soup.find('textblock', tid='15').text.split(':')[1]
    print(subred)

    # the remaining items, using the same tids and split logic as in the question
    ptitle = soup.find('textblock', tid='12').text.split(':')[1]
    author = soup.find('textblock', tid='0').text.split(':')[1]
    purl = f"https:{soup.find('textblock', tid='13').text.split(':')[2]}"
    datetime_text = soup.find('textblock', tid='3').text
    pdate, ptime = datetime_text.split()[1], datetime_text.split()[2]
    pscore = soup.find('textblock', tid='10').text.split(':')[1]
    submission = soup.find('textblock', tid='20').text

    csv_writer.writerow([pid, subred, ptitle, author, purl, pdate, ptime, pscore, submission])


for filename in glob.glob("*.xml"):
    with open(filename) as open_file:
        content = open_file.read()
        soup = BeautifulSoup(content, 'lxml')
        handle_soup(soup, csv_writer)

csv_file.close()
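Two small things worth noting: since the file is opened in 'a' (append) mode, the header row gets written again on every run, and the csv module docs recommend passing newline='' when opening the output file so you don't get blank lines between rows on Windows. A minimal sketch of the same setup with both adjusted, wrapped in a with block so the file is closed even if an exception occurs:

import csv
import glob

from bs4 import BeautifulSoup

post_line = ['postid', 'subreddit', 'post title', 'author', 'post url',
             'post date', 'post time', 'post score', 'submission text']

# 'w' starts a fresh file with a single header row;
# newline='' stops the csv module from inserting blank lines on Windows
with open('scrape.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(post_line)

    for filename in glob.glob("*.xml"):
        with open(filename) as open_file:
            soup = BeautifulSoup(open_file.read(), 'lxml')
            handle_soup(soup, csv_writer)   # handle_soup as defined above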
