How do I use a regular expression to extract the text between HTML tags?

Question

I need to extract the text between the textarea tags.

How can I do this with a regular expression?

<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">
 abc_text
 #include<abc>
 xyz
</textarea>

Answer 1

You can try this:

>>> import re
>>> print([x.strip() for x in re.findall(r'<textarea.*?>(.*?)</textarea>', content, re.DOTALL)])
['abc_text\n #include<abc>\n xyz']
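A minimal sketch of why the capture group is better written lazily (`.*?`) than greedily (`.*`): with more than one textarea on the page, a greedy group runs on to the last closing tag. The helper name `extract_textarea_texts` and the two-textarea sample are assumptions made for illustration, not part of the original question.

```python
import re

# Hypothetical helper; the attribute part [^>]* assumes no '>' inside attribute values.
TEXTAREA_RE = re.compile(r'<textarea\b[^>]*>(.*?)</textarea>', re.DOTALL | re.IGNORECASE)

def extract_textarea_texts(html):
    """Return the stripped text of every <textarea> element found in html."""
    return [m.strip() for m in TEXTAREA_RE.findall(html)]

# Illustrative input with two textareas (not from the original question).
content = (
    '<textarea name="a">\n abc_text\n #include<abc>\n xyz\n</textarea>\n'
    '<p>between</p>\n'
    '<textarea name="b">second</textarea>'
)

# Greedy group: one match that swallows everything up to the last </textarea>.
greedy = re.findall(r'<textarea.*?>(.*)</textarea>', content, re.DOTALL)

# Lazy group: one match per textarea.
lazy = extract_textarea_texts(content)

print(len(greedy))
print(lazy)
```

`re.DOTALL` is what lets `.` cross line breaks; `re.MULTILINE` only changes `^` and `$`, which this pattern does not use.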

Answer 2

By XML rules, this markup is not well-formed: the start and end tags do not match.

#include<abc>

<abc> is parsed as a start tag, not as content.

A strict XML parser will refuse to parse such invalid input.
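The point can be checked with the standard library's strict XML parser. This is only a sketch: the markup below is a simplified version of the question's textarea (attributes dropped), and `is_well_formed_xml` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(markup):
    """Return True if markup parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True

# Simplified sample: <abc> is read as an unclosed start tag inside <textarea>.
raw = '<textarea>\n abc_text\n #include<abc>\n xyz\n</textarea>'

print(is_well_formed_xml(raw))
```

The parser fails when it reaches `</textarea>` while `<abc>` is still open.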

Modify the input:

If you change #include<abc> to #include&lt;abc&gt;, the following will work:

>>> import lxml.html as PARSER
>>> root = PARSER.fromstring(data)
>>> root.xpath("//textarea/text()")
['\n abc_text\n #include<abc>\n xyz\n']
>>>
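If the page is generated by your own code, the escaping can be done with the standard library instead of by hand. The sketch below assumes a simplified textarea without attributes and uses `xml.etree.ElementTree` rather than lxml, only to show that the escaped text round-trips.

```python
import html
import xml.etree.ElementTree as ET

# Text that should appear inside the textarea.
body = '\n abc_text\n #include<abc>\n xyz\n'

# html.escape turns '<' and '>' into &lt; and &gt; (quote=False leaves quotes alone).
escaped = html.escape(body, quote=False)
markup = '<textarea>' + escaped + '</textarea>'

# The escaped markup is well-formed, and the parser decodes the entities again.
root = ET.fromstring(markup)

print(escaped)
print(root.text == body)
```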

Using a regular expression:

>>> data
'<textarea rows="20" cols="70" name="file" id="file" style="width: 100%;"data-input-file="1">\n abc_text\n</textarea>'
>>> import re
>>> re.findall(r'<textarea[^>]*>([^<]*)</textarea>', data)
['\n abc_text\n']
>>>
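One caveat with `[^<]*`: it cannot match text that itself contains a `<`, such as the `#include<abc>` line in the question. A hedged comparison, using a shortened version of the question's textarea (the attribute list is trimmed for readability):

```python
import re

# Shortened form of the question's input; the content contains a literal '<'.
original = '<textarea name="file">\n abc_text\n #include<abc>\n xyz\n</textarea>'

# [^<]* stops at '<abc>', so the closing tag is never reached and nothing matches.
strict = re.findall(r'<textarea[^>]*>([^<]*)</textarea>', original)

# A lazy dot with re.DOTALL accepts any characters up to the first </textarea>.
lenient = re.findall(r'<textarea[^>]*>(.*?)</textarea>', original, re.DOTALL)

print(strict)
print(lenient)
```

The strict pattern is safer when the content is known to be plain text; the lenient one is needed for input like the original.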

2 solutions

Solution 1
2 Accepted 2015-12-21 10:13:23

Solution 2
1 2015-12-21 10:13:35
