
Extracting HTML tag contents using Python

I have a Word document running to 188 pages which mainly uses font sizes to denote structure.

You can see it here: https://github.com/watty62/jazz_birthdays/blob/master/jazz_birthdays.doc

Using Python (my preferred language) I would like to extract the content and save it to a data format such as JSON.

I opened the doc in LibreOffice and saved it as HTML, and also tried exporting it as an alternative XML file.

You can see the XML and HTML files here. Both seem to produce reasonably structured docs, but extracting the meaning from the XML is more difficult:

<para>1 January   </para>
<para>Helmut Brandt, baritone sax, 1931 (July 26, 2001)</para> 

In the HTML version we end up with:

    <P LANG="en-US" STYLE="margin-top: 0.18cm; margin-bottom: 0.18cm; page-break-after: avoid">
<FONT SIZE=4>1 January   </FONT>
</P>
<P LANG="en-US" CLASS="western" STYLE="font-weight: normal">Helmut
Brandt, baritone sax, 1931 (July 26, 2001)</P>

Each date is encased in <FONT SIZE=4> </FONT> tags (although these are occasionally used for other purposes).

A quick count gives 377 uses of <FONT SIZE=4>, so assuming for now that all 366 days of the year are there, there are 11 uses of it which I'll have to ignore.

My approach was going to be to replace the first <FONT SIZE=4> with something to denote the opening of the date field, e.g. <Date>, then replace each subsequent one with a closing of the previous date (after all the musicians with that birthday) and the opening of the next date, thus: </Date><Date>
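Rather than rewriting the tags in place, the same grouping can be done while parsing. The following is only a sketch, using the standard library's html.parser instead of Beautiful Soup. The names DATE_RE, BirthdayParser and group_by_date are made up for illustration, and the sketch assumes that a real date heading is exactly a day number followed by an English month name, which is how it tries to skip the other uses of <FONT SIZE=4>. It does not handle nested <FONT> tags.

```python
from html.parser import HTMLParser
import re

# Assumption: a date heading looks like "1 January" and nothing else.
DATE_RE = re.compile(
    r"^\d{1,2} (January|February|March|April|May|June|July|"
    r"August|September|October|November|December)$"
)


class BirthdayParser(HTMLParser):
    """Group paragraph text under the most recent size-4 date heading."""

    def __init__(self):
        super().__init__()
        self.birthdays = {}       # date string -> list of entry lines
        self.current_date = None  # last date heading seen
        self.in_size4 = False     # currently inside <FONT SIZE=4>?
        self.size4_text = []      # text pieces inside that font tag
        self.buffer = []          # text pieces of the current paragraph

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "p":
            self.buffer = []
        elif tag == "font" and dict(attrs).get("size") == "4":
            self.in_size4 = True
            self.size4_text = []

    def handle_data(self, data):
        if self.in_size4:
            self.size4_text.append(data)
        else:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "font" and self.in_size4:
            self.in_size4 = False
            text = " ".join("".join(self.size4_text).split())
            if DATE_RE.match(text):
                self.current_date = text
                self.birthdays.setdefault(text, [])
            else:
                # Size-4 text that is not a date stays ordinary paragraph text.
                self.buffer.append(text)
        elif tag == "p":
            text = " ".join("".join(self.buffer).split())
            if text and self.current_date is not None:
                self.birthdays[self.current_date].append(text)
            self.buffer = []


def group_by_date(html):
    """Return a dict mapping each date heading to its list of entry lines."""
    parser = BirthdayParser()
    parser.feed(html)
    parser.close()
    return parser.birthdays
```

The resulting dictionary can be passed straight to json.dumps to get the JSON output mentioned earlier.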

After that I thought that I'd simplify each line, although these will get complicated: a name (possibly containing a nickname), instruments played separated by commas, year of birth, and date of death (in brackets and starting "d."). So there is lots more to get stuck into later.
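As a rough starting point for splitting one entry line into fields, a single regular expression may be enough. This is a sketch under assumptions, not a tested solution for the whole document: ENTRY_RE and parse_entry are hypothetical names, the name is assumed to contain no comma, the birth year is assumed to be four digits, and the "d." prefix on the death date is treated as optional because the sample line shown above does not have it.

```python
import re

# Assumed layout: "Name, instrument[, instrument...], birth year (death date)"
ENTRY_RE = re.compile(
    r"^(?P<name>[^,]+),\s*"                          # name: everything up to the first comma
    r"(?P<instruments>.+?),\s*"                      # instruments: comma-separated, matched lazily
    r"(?P<born>\d{4})"                               # four-digit year of birth
    r"(?:\s*\((?:d\.\s*)?(?P<died>[^)]+)\))?\s*$"    # optional "(d. date)" or "(date)"
)


def parse_entry(line):
    """Split one entry line into a dict, or return None if it does not fit."""
    match = ENTRY_RE.match(line)
    if match is None:
        return None
    return {
        "name": match.group("name").strip(),
        "instruments": [i.strip() for i in match.group("instruments").split(",")],
        "born": int(match.group("born")),
        "died": match.group("died"),  # None when no bracketed date is present
    }
```

Lines that return None are the ones worth inspecting by hand first, since they show which variations (nicknames with commas, missing years and so on) the pattern still needs to cover.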

An initial attempt to parse the file with Beautiful Soup threw up some encoding errors in the original file.
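The exact errors are not shown here, so the following is only a guess at a workaround. It assumes the problem is a decode error caused by the exported file not being UTF-8, and that Windows-1252 is a plausible fallback for a file that started life as a Word document. The function names decode_html and read_html are made up for this sketch.

```python
def decode_html(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes by trying each candidate encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


def read_html(path, encodings=("utf-8", "cp1252")):
    """Read a file as bytes and decode it with decode_html."""
    with open(path, "rb") as handle:
        return decode_html(handle.read(), encodings)
```

The decoded string can then be handed to whichever parser is being used, so the parser no longer has to guess the encoding itself.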

I'm not looking for a solution (as it is a real biggie) but would appreciate any pointers on approach, libraries etc. to get me started, please.

Thanks

Ian

I hope that this is what you are looking for (if it isn't, please let me know so that I can remove my answer):

import re
s="""<P LANG="en-US" STYLE="margin-top: 0.18cm; margin-bottom: 0.18cm; page-break-after: avoid">
<FONT SIZE=4>1 January   </FONT>
</P>
<P LANG="en-US" CLASS="western" STYLE="font-weight: normal">Helmut
Brandt, baritone sax, 1931 (July 26, 2001)</P>"""
# Find every "day month" pair, e.g. "1 January" (print() form works in Python 3)
print(re.findall(r"\d{1,2} \w+", s))

This outputs:

['1 January']

As a quick explanation, the re module is Python's regular-expression search mechanism. Its findall() method takes a pattern to search for and a string to search in, and returns every non-overlapping match. I fed it the pattern r"\d{1,2} \w+". The r before the string makes it a raw string, so Python leaves the backslashes alone and re can use them for its own purposes. \d means a digit. {1,2} means one or two times. The space just means a space. \w means a word character. And + means one or more of the preceding item.
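The same pattern can be written with re.VERBOSE so that each piece carries its own comment. This is just an annotated restatement of the pattern above; the name DAY_MONTH is made up for the example, and the space is written as [ ] because verbose mode ignores bare whitespace in the pattern.

```python
import re

DAY_MONTH = re.compile(
    r"""
    \d{1,2}   # one or two digits: the day of the month
    [ ]       # a literal space, written as a class since VERBOSE skips bare spaces
    \w+       # one or more word characters: the month name
    """,
    re.VERBOSE,
)

sample = "<FONT SIZE=4>1 January   </FONT>"
matches = DAY_MONTH.findall(sample)
print(matches)  # ['1 January']
```

Note that the pattern matches any one or two digits followed by a space and a word, so on the full document it may also pick up text that is not a date heading; restricting \w+ to the twelve month names would tighten it.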
