How to avoid running into IndexError: list index out of range if an element is nonexistent while parsing XML with BeautifulSoup in Python

I have the following code, which parses an XML file to produce a pandas DataFrame. The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
<EntrySynopsisDetail_1_0>
<EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>

And my code is below:

from bs4 import BeautifulSoup
import pandas as pd 

fd = open("file_120123.xml",'r')
data = fd.read()

Bs_data = BeautifulSoup(data,'xml')

ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try: 
   Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
   Cat = ''

CatDict = {
    "ENG":"English",
    "MAT" :"Mathematics"
}

dataDf = []
for i in range(0,len(ID)):
      if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
       
      rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
      dataDf.append(rows)
    
df = pd.DataFrame(dataDf, columns =['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')

As you can see, the code reads an XML file called 'file_120123.xml' using the BeautifulSoup library and collects each of the elements present in the file. One of the elements is a key, and I have created a dictionary listing all possible keys. Not all parent elements have that child element. I want to compare the extracted key with the ones in the dictionary and replace it with the value corresponding to that key.

With this code, I get the error IndexError: list index out of range on Cat[i], on the line if (Cat[i] == CatDict):. Any insights on how to resolve this?

If you just want to avoid raising the error, add a conditional break:

for i in range(0,len(ID)):
      if not i < len(Cat): break ## <-- break loop if length of Cat is exceeded

      if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
       
      rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
      dataDf.append(rows)
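
Note that the break avoids the exception but also drops every entry after the point where Cat runs out, and because find_all flattens all CategoryOfEntry tags into one list, Cat[i] is not guaranteed to belong to ID[i]. A minimal sketch of an alternative, assuming the XML shown in the question: loop over each EntrySynopsisDetail_1_0 parent and look up its children inside that parent. It uses the standard library's xml.etree.ElementTree instead of BeautifulSoup purely to stay dependency-free. The helper name parse_entries, the third entry without a category, and joining multiple categories with "; " are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample from the question, plus one hypothetical entry that has no
# CategoryOfEntry child (added here only to exercise the missing case).
XML = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>999</EntryID>
    <EntryTitle>Entry without a category</EntryTitle>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CatDict = {"ENG": "English", "MAT": "Mathematics"}


def parse_entries(xml_text):
    """Return one [id, title, categories] row per parent element."""
    rows = []
    root = ET.fromstring(xml_text)
    for entry in root.iter("EntrySynopsisDetail_1_0"):
        # findtext returns the default when the child is missing,
        # so no index is ever taken on a possibly short list.
        entry_id = entry.findtext("EntryID", default="")
        title = entry.findtext("EntryTitle", default="")
        # An entry may carry zero, one, or several category codes.
        codes = [c.text for c in entry.findall("CategoryOfEntry") if c.text]
        # Translate known codes; keep unknown codes unchanged.
        names = [CatDict.get(code, code) for code in codes]
        rows.append([entry_id, title, "; ".join(names)])
    return rows


rows = parse_entries(XML)
for row in rows:
    print(row)
```

The same shape should carry over to BeautifulSoup: iterate over Bs_data.find_all('EntrySynopsisDetail_1_0'), then call entry.find('EntryID') and entry.find_all('CategoryOfEntry') on each parent, remembering that find returns None when the tag is absent. The resulting rows can be passed to pd.DataFrame as in the question.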

First, as to why lxml is better than BeautifulSoup for XML, the answer is simple: the best way to query XML is with XPath. lxml supports XPath (though only version 1.0; for more complex XML and queries you will need XPath 2.0 to 3.1 and a library like elementpath). BeautifulSoup doesn't support XPath, though it does have excellent support for CSS selectors, which work better with HTML.
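
To illustrate the kind of query that path expressions make easy, here is a small sketch run against the question's sample. As a stand-in it uses the limited XPath subset in the standard library's xml.etree.ElementTree (an assumption made to avoid extra dependencies; it is not full XPath 1.0, which lxml provides). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
</Entries>"""

root = ET.fromstring(xml_text)

# Entries that have a CategoryOfEntry child whose text is exactly "MAT".
mat_entries = root.findall("./EntrySynopsisDetail_1_0[CategoryOfEntry='MAT']")
mat_ids = [entry.findtext("EntryID") for entry in mat_entries]

# Entries that have at least one CategoryOfEntry child.
with_cat = root.findall("./EntrySynopsisDetail_1_0[CategoryOfEntry]")

print(mat_ids, len(with_cat))
```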

Having said all that, in your particular case you probably don't need to use lxml directly either, only pandas and a one-liner. Though you haven't shown your expected output, my guess is that you expect the output below. Note that in your sample XML there is probably an error: the second <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:

entries = """<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
  
<EntrySynopsisDetail_1_0>
        <EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>"""

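from io import StringIO
import pandas as pd

# Wrap the literal XML in StringIO: recent pandas versions deprecate passing raw XML strings.
pd.read_xml(StringIO(entries), xpath="//EntrySynopsisDetail_1_0")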

Output:

   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT
