How to avoid IndexError: list index out of range when an element does not exist while parsing XML with BeautifulSoup in Python

Question

I have the following code to parse an xml file and produce a pandas dataframe. The XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
<EntrySynopsisDetail_1_0>
<EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>

And my code is below:

from bs4 import BeautifulSoup
import pandas as pd 

fd = open("file_120123.xml",'r')
data = fd.read()

Bs_data = BeautifulSoup(data,'xml')

ID = Bs_data.find_all('EntryID')
Title = Bs_data.find_all('EntryTitle')
try: 
   Cat = Bs_data.find_all('CategoryOfEntry')
except IndexError:
   Cat = ''

CatDict = {
    "ENG":"English",
    "MAT" :"Mathematics"
}

dataDf = []
for i in range(0,len(ID)):
      if (Cat[i] == CatDict):
        Cat[i] == CatDict.get(Cat[i])
       
      rows = [ID[i].get_text(), Title[i].get_text(), Cat[i]]
      dataDf.append(rows)
    
df = pd.DataFrame(dataDf, columns =['ID', 'Title', 'Category'], dtype=float)
df.to_csv('120123.csv')

As you can see, the code reads an xml file called 'file_120123.xml' using the BeautifulSoup library and extracts each of the elements present in the file. One of the elements is a key, and I have created a dictionary listing all possible keys. Not all parent elements have that element. I want to compare the extracted key with the ones in the dictionary and replace it with the value corresponding to that key.

With this code, I get the error IndexError: list index out of range on Cat[i], on the line if (Cat[i] == CatDict):. Any insights on how to resolve this?

Answer 1

If you just want to avoid raising the error, add a conditional break:

for i in range(0, len(ID)):
      if not i < len(Cat): break ## <-- break loop if length of Cat is exceeded

      # look up the category code in CatDict; keep the raw code if it is not a known key
      code = Cat[i].get_text()
      category = CatDict.get(code, code)

      rows = [ID[i].get_text(), Title[i].get_text(), category]
      dataDf.append(rows)
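
The break avoids the exception, but find_all returns one flat list per tag name, so the lists stop lining up as soon as an entry has no CategoryOfEntry (or has more than one), and stopping early drops the remaining rows. A minimal sketch of a per-entry approach follows. It is not the asker's code: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and the third sample entry (with no category) and the helper name parse_entries are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data; the third entry (no CategoryOfEntry) is invented for illustration.
XML_TEXT = """<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <EntryTitle>Call for Mobility Program</EntryTitle>
    <CategoryOfEntry>ENG</CategoryOfEntry>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>300000</EntryID>
    <EntryTitle>Entry without a category</EntryTitle>
  </EntrySynopsisDetail_1_0>
</Entries>"""

CAT_DICT = {"ENG": "English", "MAT": "Mathematics"}


def parse_entries(xml_text):
    """Return one [id, title, category] row per entry (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.iter("EntrySynopsisDetail_1_0"):
        # findtext returns the default when the child element is absent
        entry_id = entry.findtext("EntryID", default="")
        title = entry.findtext("EntryTitle", default="")
        # findall returns an empty list when there is no category, so nothing is indexed
        codes = [c.text for c in entry.findall("CategoryOfEntry") if c.text]
        # unknown codes fall back to the raw code
        category = ", ".join(CAT_DICT.get(code, code) for code in codes)
        rows.append([entry_id, title, category])
    return rows


rows = parse_entries(XML_TEXT)
for row in rows:
    print(row)
```

The same shape should carry over to BeautifulSoup: loop over Bs_data.find_all('EntrySynopsisDetail_1_0') and call entry.find_all('CategoryOfEntry') on each entry, which returns an empty list when the element is absent. The resulting rows can be passed to pd.DataFrame(rows, columns=['ID', 'Title', 'Category']).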

Answer 2

First, as to why lxml is better than BeautifulSoup for xml, the answer is simple: the best way to query xml is with xpath. lxml supports xpath (though only version 1.0; for more complex xml and queries you will need xpath 2.0 to 3.1 and a library like elementpath). BS doesn't support xpath, though it does have excellent support for css selectors, which work better with html.
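
As a rough illustration of querying by path, here is a sketch using the standard library's xml.etree.ElementTree, which implements only a small subset of XPath (lxml's xpath() method covers full XPath 1.0). The sample data and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the same shape as the question's file.
root = ET.fromstring("""<Entries>
  <EntrySynopsisDetail_1_0>
    <EntryID>262148</EntryID>
    <CategoryOfEntry>ENG</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>2667654</EntryID>
    <CategoryOfEntry>MAT</CategoryOfEntry>
  </EntrySynopsisDetail_1_0>
  <EntrySynopsisDetail_1_0>
    <EntryID>300000</EntryID>
  </EntrySynopsisDetail_1_0>
</Entries>""")

# Entries that have a CategoryOfEntry child whose text is exactly 'MAT'
mat_ids = [
    e.findtext("EntryID")
    for e in root.findall(".//EntrySynopsisDetail_1_0[CategoryOfEntry='MAT']")
]

# Entries that have at least one CategoryOfEntry child, whatever its text
with_cat_ids = [
    e.findtext("EntryID")
    for e in root.findall(".//EntrySynopsisDetail_1_0[CategoryOfEntry]")
]

print(mat_ids)
print(with_cat_ids)
```

The path expression selects whole entries by their content, so there is no positional indexing into parallel lists and a missing element simply yields no match.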

Having said all that, in your particular case you probably don't need to use lxml directly either, only pandas and a one-liner. Though you haven't shown your expected output, my guess is you expect the output below. Note that your sample xml probably contains an error: the 2nd <EntrySynopsisDetail_1_0> has <CategoryOfEntry> twice, so I removed one:

entries = """<Entries>
 <EntrySynopsisDetail_1_0>
        <EntryID>262148</EntryID>
        <EntryTitle>Establishment of the Graduate Internship Program</EntryTitle>
        <CategoryOfEntry>ENG</CategoryOfEntry>
</EntrySynopsisDetail_1_0>
  
<EntrySynopsisDetail_1_0>
        <EntryID>2667654</EntryID>
        <EntryTitle>Call for Mobility Program</EntryTitle>
        <CategoryOfEntry>MAT</CategoryOfEntry>
 </EntrySynopsisDetail_1_0>
</Entries>"""

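import pandas as pd
from io import StringIO

# wrap the string in StringIO: newer pandas versions deprecate passing literal XML text
pd.read_xml(StringIO(entries), xpath="//EntrySynopsisDetail_1_0")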

Output:

   EntryID                                        EntryTitle CategoryOfEntry
0   262148  Establishment of the Graduate Internship Program             ENG
1  2667654                         Call for Mobility Program             MAT

2 answers

Answer 1
0 2023-01-12 20:44:35

Answer 2
0 2023-01-13 11:39:06
