Parsing multiple xml files in folder then write to central csv
I'm new to Python and have written the following code to parse multiple XML files in a directory and write their content to a central CSV file.
I have a folder with approximately 30 XML files. My issue is that it only grabs content from the first XML file in the folder, not the rest. I think the problem is a missing loop? I am using BeautifulSoup and would like to stick with it, as I have built up an understanding of it.
    #open beautifulsoup library and csv module
    from bs4 import BeautifulSoup
    import csv
    import glob

    #Open and read files in folder ending with .xml
    for filename in glob.glob("*.xml"):
        with open(filename) as open_file:
            content = open_file.read()
            soup = BeautifulSoup(content, 'lxml')

    #open and write csv file
    csv_file = open('scrape.csv', 'a')
    post_line = ['postid', 'subreddit', 'post title', 'author', 'post url', 'post date', 'post time', 'post score', 'submission text']
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(post_line)

    #grab content from xml from following textblocks
    #postid
    for postid in soup.find('textblock', tid='7').text:
        pid = postid.split(':')[1]
        print(pid)

    #subreddit
    for subreddit in soup.find('textblock', tid='15').text:
        subred = subreddit.split(':')[1]
        print(subred)

    #post title
    for posttitle in soup.find('textblock', tid='12').text:
        ptitle = posttitle.split(':')[1]
        print(ptitle)

    #author
    for username in soup.find('textblock', tid='0').text:
        author = username.split(':')[1]
        print(author)

    #post url
    for posturl in soup.find('textblock', tid='13').text:
        url = posturl.split(':')[2]
        purl = f'https:{url}'
        print(purl)

    #post date
    for postdate in soup.find('textblock', tid='3').text:
        pdate = postdate.split()[1]
        print(pdate)

    #post time
    for posttime in soup.find('textblock', tid='3').text:
        ptime = posttime.split()[2]
        print(ptime)

    #post score
    for postscore in soup.find('textblock', tid='10').text:
        pscore = postscore.split(':')[1]
        print(pscore)

    #submission text
    for submission in soup.find('textblock', tid='20').text:
        print(submission)

    #blank space
    print()

    csv_writer.writerow([pid, subred, ptitle, author, purl, pdate, ptime, pscore, submission])
    csv_file.close()
You only use the soup of the last file:
    ...
    for filename in glob.glob("*.xml"):
        with open(filename) as open_file:
            content = open_file.read()
            soup = BeautifulSoup(content, 'lxml')  # overwrites soup with the soup of the current file
    ...
    # handle soup
You could make handle_soup a function and call it on every soup:
    import glob
    import csv
    from bs4 import BeautifulSoup

    csv_file = open('scrape.csv', 'a')
    post_line = ['postid', 'subreddit', 'post title', 'author', 'post url', 'post date', 'post time', 'post score', 'submission text']
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(post_line)

    def handle_soup(soup, csv_writer):
        pid = soup.find('textblock', tid='7').text.split(":")[1]
        print(pid)

        subred = soup.find('textblock', tid='15').text.split(':')[1]
        print(subred)

        ...  # replace ... with the other items

        csv_writer.writerow([pid, subred, ...])  # replace ... with the other items

    for filename in glob.glob("*.xml"):
        with open(filename) as open_file:
            content = open_file.read()
            soup = BeautifulSoup(content, 'lxml')
            handle_soup(soup, csv_writer)

    csv_file.close()
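For reference, a filled-in version of handle_soup might look like the sketch below. The tid values and the split(':') indices are copied from the question's code, so they are assumptions about the XML layout rather than verified facts about it; the sample document at the bottom is made up purely to exercise the function.

```python
import csv
import glob
from bs4 import BeautifulSoup

FIELDS = ['postid', 'subreddit', 'post title', 'author', 'post url',
          'post date', 'post time', 'post score', 'submission text']

def handle_soup(soup):
    """Pull one CSV row out of a parsed document.

    The tid values and split indices mirror the question's code and are
    assumptions about the XML layout.
    """
    pid    = soup.find('textblock', tid='7').text.split(':')[1]
    subred = soup.find('textblock', tid='15').text.split(':')[1]
    ptitle = soup.find('textblock', tid='12').text.split(':')[1]
    author = soup.find('textblock', tid='0').text.split(':')[1]
    # the url field contains a second ':' inside 'https://...', hence index 2
    purl   = 'https:' + soup.find('textblock', tid='13').text.split(':')[2]
    datetime_parts = soup.find('textblock', tid='3').text.split()
    pdate, ptime = datetime_parts[1], datetime_parts[2]
    pscore = soup.find('textblock', tid='10').text.split(':')[1]
    submission = soup.find('textblock', tid='20').text
    return [pid, subred, ptitle, author, purl, pdate, ptime, pscore, submission]

# 'w' instead of 'a' avoids appending a duplicate header row on every run;
# 'html.parser' avoids the lxml dependency (swap in 'lxml' if you have it)
with open('scrape.csv', 'w', newline='') as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(FIELDS)
    for filename in glob.glob('*.xml'):
        with open(filename) as open_file:
            soup = BeautifulSoup(open_file.read(), 'html.parser')
            csv_writer.writerow(handle_soup(soup))  # one row per file

# made-up sample showing the layout handle_soup assumes (hypothetical)
sample = """
<textblock tid="7">postid:abc123</textblock>
<textblock tid="15">subreddit:python</textblock>
<textblock tid="12">title:My first post</textblock>
<textblock tid="0">author:some_user</textblock>
<textblock tid="13">url:https://example.com/post/abc123</textblock>
<textblock tid="3">posted 2019-07-01 12:30:00</textblock>
<textblock tid="10">score:42</textblock>
<textblock tid="20">submission body text</textblock>
"""
print(handle_soup(BeautifulSoup(sample, 'html.parser')))
```

Note that handle_soup reads each field once with a plain assignment instead of the character-by-character for loops in the question: `for postid in soup.find(...).text:` iterates over the individual characters of the string, which is almost certainly not what was intended.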