First, a brief description of the problem: within an unordered list, we have many list items, each of which corresponds to a "flashcard":
<ul>
<li>
<p><span>can you slice columns in a 2d list? </span></p>
<pre><code class='language-python' lang='python'>queryMatrixTranspose[a-1:b][i] = queryMatrix[i][a-1:b] </code></pre>
<ul>
<li>
<span>No: can't do this because python doesn't support multi-axis slicing, only multi-list slicing; see the article </span><a href='http://ilan.schnell-web.net/prog/slicing/' target='_blank' class='url'>http://ilan.schnell-web.net/prog/slicing/</a><span> for more info.</span>
</li>
</ul>
</li>
</ul>
The answer on the flashcard will always be a list item located under the XPath /html/body/ul/li/ul. I'd like to retrieve the answer in the format shown here:
<li>
<span>No: can't do this because python doesn't support multi-axis slicing, only multi-list slicing; see the article </span><a href='http://ilan.schnell-web.net/prog/slicing/' target='_blank' class='url'>http://ilan.schnell-web.net/prog/slicing/</a><span> for more info.</span>
</li>
The flashcard's question is everything that remains under the XPath /html/body/ul/li after the answer has been extracted:
<li>
<p><span>can you slice columns in a 2d list? </span></p>
<pre><code class='language-python' lang='python'>queryMatrixTranspose[a-1:b][i] = queryMatrix[i][a-1:b] </code></pre>
</li>
For each flashcard in an unordered list of flashcards, I'd like to extract the UTF-8 encoded HTML content of the question and answer list items. That is, I'd like to have both the text and the HTML tags.
I tried to solve this problem by iterating through each flashcard and its corresponding answer and removing the child-node answer from the parent-node flashcard:
from lxml import html

flashcard_list = []
htmlTree = html.fromstring(htmlString)
for flashcardTree, answerTree in zip(htmlTree.xpath("/html/body/ul/li"),
                                     htmlTree.xpath("/html/body/ul/li/ul")):
    flashcard = html.tostring(flashcardTree,
                              pretty_print=True).decode("utf-8")
    answer = html.tostring(answerTree,
                           pretty_print=True).decode("utf-8")
    # This is the call that raises the TypeError described below.
    question = html.tostring(flashcardTree.remove(answerTree),
                             pretty_print=True).decode("utf-8")
    flashcard_list.append((question, answer))
However, when I try to remove the answer child node with flashcardTree.remove(answerTree), I encounter the error TypeError: Type 'NoneType' cannot be serialized.
I don't understand why this function would return None; I'm trying to remove a node at /html/body/ul/li/ul, which is a valid child node of /html/body/ul/li.
Whatever suggestions you have would be greatly appreciated. I'm not in any way attached to the code I wrote in my first attempt; I'll accept any answer where the output is a list of (question,answer) tuples, one for each flashcard.
If I understand correctly what you are looking for, this should work:
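# Select one node per question and one per answer so that zip() stays aligned
# when there is more than one flashcard.
for flashcardTree, answerTree in zip(htmlTree.xpath("/html/body/ul/li/p/span"),
                                     htmlTree.xpath("/html/body/ul/li/ul/li")):
    question = flashcardTree.text
    # text_content() already gathers the text of all descendants.
    answer = answerTree.text_content().strip()
    flashcard_list.append((question, answer))

for i in flashcard_list:
    print(i[0], '\n', i[1])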
Output:
can you slice columns in a 2d list?
No: can't do this because python doesn't support multi-axis slicing, only multi-list slicing; see the article http://ilan.schnell-web.net/prog/slicing/ for more info.