
How can I extract specific content from an HTML file to TXT format?

I have extracted a large number of forum posts into separate txt files, which are now on my hard drive. Each file contains information I would like to extract, some of which I have already figured out how to get. The information I still need to extract takes the following form:

Within the same "HTML block":

1: (x) messages in this thread
2: Message is in reply to (some html code) A HREF="link" (some html code)

In task 1 I simply need to extract x.
In task 2 I need to extract the links to the messages that this message is in reply to.

I have looked into the tm and XML packages but have not been able to work out which one to use. Any advice is appreciated.

This is what one of the txt files looks like:

`<HTML>
<HEAD>
<TITLE>Dear LEGO : 5668 </TITLE>
<META NAME="ROBOTS" CONTENT="ALL, INDEX, FOLLOW">
<META NAME="KEYWORDS" CONTENT="lego, legos, legoland, toy, construction, community, education, technic, mindstorms, toolo, duplo, primo, dacta">
<META NAME="DESCRIPTION" CONTENT="Dear LEGO : 5668 - LUGNET: The international fan-created LEGOÆ Users Group Network. A place for LEGOÆ fans of all ages to find information, meet one another, and share ideas. As an independent site by fans, for fans, it is neither sponsored nor endorsed by the LEGO Company.">
<SCRIPT LANGUAGE="JavaScript" SRC="http://www.lugnet.com/js/common.js"></SCRIPT>
</HEAD>

<BODY
 LEFTMARGIN=0 TOPMARGIN=0 MARGINWIDTH=0 MARGINHEIGHT=0
 BGCOLOR="#FFFFFF" TEXT="#000000" xLINK="#0000FF" xVLINK="#501080" xALINK="#B0C8EC">  <TABLE BORDER=0 CELLPADDING=9 CELLSPACING=0 WIDTH="100%" BGCOLOR="#B0C8EC">
  <TR ALIGN=CENTER VALIGN=BOTTOM>

    <TD ALIGN=LEFT><NOBR><A TARGET="_top" HREF="http://www.lugnet.com/"><IMG BORDER=0 WIDTH=28 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-home.gif" ALT="To LUGNET Homepage"></A><A TARGET="_top" HREF="http://news.lugnet.com/"><IMG BORDER=0 WIDTH=27 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-news.gif" ALT="To LUGNET News Homepage"></A><A TARGET="_top" HREF="http://guide.lugnet.com/"><IMG BORDER=0 WIDTH=37 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-guide.gif" ALT="To LUGNET Guide Homepage"></A></NOBR><BR></TD>       <FORM NAME="search" ACTION="http://www.lugnet.com/search.cgi" METHOD=POST
       onSubmit="return(MetaSearch(document.search))">  <TD>
        <INPUT TYPE=HIDDEN NAME="category" VALUE="/dear-lego/">
        <NOBR><SELECT NAME="scope">
          <OPTION VALUE="SetGuide">Set Reference
          <OPTION VALUE="QuickSet">Set Reference (Popup)
          <OPTION VALUE="PartsRef">Parts Reference  <OPTION VALUE="News">News
          <OPTION VALUE="NewsRel" SELECTED>News (Dear LEGO)         </SELECT>&nbsp;<A HREF="http://www.lugnet.com/help/search/"><IMG BORDER=0 WIDTH=16 HEIGHT=16 HSPACE=0 VSPACE=0 SRC="http://www.lugnet.com/help/help.gif" ALT="Help on Searching"></A></NOBR><BR>  <NOBR><INPUT TYPE=TEXT NAME="query" VALUE="" SIZE=16 MAXLENGTH=200><SMALL>&nbsp;<INPUT TYPE=SUBMIT NAME="SUBMIT" VALUE="Search"></SMALL></NOBR><BR>
      </TD>
      </FORM> 

    <TD ALIGN=RIGHT><NOBR><A HREF="/news/post/?lugnet.dear-lego"><IMG BORDER=0 WIDTH=22 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-post.gif" ALT="Post new message to lugnet.dear-lego"></A><A HREF="news://lugnet.com/lugnet.dear-lego"><IMG BORDER=0 WIDTH=30 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-nntp.gif" ALT="Open lugnet.dear-lego in your NNTP Newsreader"></A><A HREF="http://news.lugnet.com/news/traffic/"><IMG BORDER=0 WIDTH=32 HEIGHT=44 HSPACE=10 VSPACE=0 SRC="/news/icon-traffic.gif" ALT="To LUGNET News Traffic Page"></A><IMG BORDER=0 WIDTH=3 HEIGHT=44 HSPACE=6 VSPACE=0 SRC="/news/icon-sep.gif"><A HREF="http://www.lugnet.com/people/members/sign-in/"><IMG BORDER=0 WIDTH=37 HEIGHT=44 HSPACE=0 VSPACE=0 SRC="/news/icon-signin-key.gif" ALT="Sign In (Members)"></A></NOBR><BR></TD>

  </TR> 
</TABLE>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#8899BB"><TR><TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE>  <TABLE BORDER=0 CELLPADDING=7 CELLSPACING=0 WIDTH="100%" BGCOLOR="#E8F0FF"> <TR ALIGN=CENTER VALIGN=CENTER>
        <TD COLSPAN=2 ALIGN=CENTER VALIGN=CENTER>
<script type="text/javascript"><!--
google_ad_client = "pub-0089902038208374";
//LUGNET 728x15, Erstellt 13.12.07
google_ad_slot = "6645292597";
google_ad_width = 728;
google_ad_height = 15;
//--></script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
        </TD>
      </TR> <TR ALIGN=LEFT VALIGN=CENTER>  <TD>  <BIG><FONT FACE="Geneva,Arial,Helvetica">
        &nbsp;<A HREF="/dear-lego/">Dear&nbsp;LEGO</A>&nbsp;<FONT COLOR="#8899BB">/</FONT>  5668  <BR></FONT></BIG>  </TD>  <TD ALIGN=RIGHT><SMALL><FONT FACE="Geneva,Arial,Helvetica">
        <A HREF="/dear-lego/?n=5667">5667</A>&nbsp;&nbsp;|&nbsp;&nbsp;<A HREF="/dear-lego/?n=5669">5669</A>
      <BR></SMALL></FONT></TD>  </TR>

</TABLE>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%" BGCOLOR="#8899BB"><TR><TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE>  <!-- google_ad_section_start --> <CENTER>  <TABLE BORDER=0 CELLPADDING=16 CELLSPACING=0 WIDTH="100%"><TR><TD ALIGN=LEFT>    <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0><TR ALIGN=LEFT VALIGN=TOP><TD>  <TABLE BORDER=0 CELLPADDING=8 CELLSPACING=0>

      <TR BGCOLOR="#E0E0E0"><TD ALIGN=LEFT> <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"><TR ALIGN=CENTER VALIGN=TOP>  <TD ALIGN=LEFT VALIGN=TOP>

    <TABLE BORDER=0 CELLPADDING=2 CELLSPACING=0>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Subject:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><BIG><BIG><B>Online PAB and Design-by-me needs more parts for Lego Train</B></BIG></BIG><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Author:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><B>Benjamin Medinets</B><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Newsgroups:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><A HREF="/dear-lego/">lugnet.dear-lego</A>, <A HREF="/trains/">lugnet.trains</A><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Followup-To:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><A HREF="/trains/">lugnet.trains</A><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Date:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1">Thu, 6 Oct 2011 03:44:44 GMT<BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">From:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><FONT COLOR="#7070A0">Benjamin Medinets &lt;bmedinets@excite.com+stopspammers+&gt;</FONT><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Highlighted:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1"><FONT COLOR="#D57F7F"><B>!</B></FONT> 

<A HREF="/news/ahh.cgi?lugnet.dear-lego,5668">(details)</A><BR></FONT></TD>

          </TR>  <TR VALIGN=MIDDLE>

            <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#7070A0" SIZE="-1">Viewed:&nbsp;<BR></FONT></TD>

            <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" COLOR="#333366" SIZE="-1">3013 times<BR></FONT></TD>

          </TR>  </TABLE>

    </TD>  <TD WIDTH=20>&nbsp;&nbsp;</TD>

      <TD ALIGN=CENTER VALIGN=TOP>

      <FONT FACE="Geneva,Arial,Helvetica" SIZE="-2"><A HREF="/news/raw.cgi?lugnet.dear-lego,5668">View Raw<BR>Message</A><BR><BR></FONT>  <A HREF="/news/post/?lugnet.dear-lego,5668"><IMG BORDER=0 WIDTH=30 HEIGHT=44 HSPACE=10 VSPACE=10 SRC="/news/icon-reply.gif" TITLE="Post a public reply to this message"></A><BR>  </TD>  </TR></TABLE> </TD></TR>

      <TR BGCOLOR="#F0F0F0"><TD ALIGN=LEFT NOWRAP><TT>I was using Lego Digital Designer and am disappointed the downhill availabilty<BR>
of certain important parts to build &quot;buyable&quot; models.<BR>
<BR>
I would like to see a return of &quot;warehouse&quot; sliding doors to make<BR>
box cars.<BR>
<BR>
Train-style doors would also be nice as well as train windows (both in<BR>
2x3 and 4x3)... please.<BR>
<BR>
I looked at the instructions to build a mail car from the 7722, and<BR>
found that I really only need 2 red sliding rail doors, the pair of<BR>
&quot;decorated train doors&quot; and a set of two 2x3 thin yellow train<BR>
windows.<BR>
<BR>
Yes, there was a bit of minor substitution but it is mostly distiguishable<BR>
as the model.<BR>
<BR>
Here is what it looks like:<BR>
<BR>
<A HREF="http://www.lugnet.com/jump.cgi?http://www.brickshelf.com/gallery/medib/lego-fun/7722mailvan.jpg">http://www.brickshelf.com/gallery/medib/lego-fun/7722mailvan.jpg</A><BR>
<BR>
Yeah I know... where are the f-in doors???<BR>
<BR>
<BR>
Ben<BR>
</TT>
</TD></TR>

      <TR BGCOLOR="#E0E0E0"><TD ALIGN=LEFT></TD></TR>

    </TABLE> <BR> <BR>  <FONT FACE="Verdana,Geneva,Helvetica" SIZE="-1" COLOR="#990000">



      <B>1 Message in This Thread:</B><BR> <NOBR><IMG WIDTH=9 HEIGHT=11 VSPACE=2 SRC="/news/here.gif" TITLE="You are here"></NOBR><BR><NOBR></NOBR>
 <DL>

      <DT>Entire Thread on One Page:

      <SMALL><FONT COLOR="#000000">

        <DD><B>Nested:&nbsp;</B>

        <A HREF="/dear-lego/?n=5668&t=i&v=a">All</A> | <A HREF="/dear-lego/?n=5668&t=i&v=b">Brief</A> | <A HREF="/dear-lego/?n=5668&t=i&v=c">Compact</A> | <A HREF="/dear-lego/?n=5668&t=i&v=d">Dots</A>

        <BR><B>Linear:&nbsp;</B>

        <A HREF="/dear-lego/?n=5668&t=f&v=a">All</A> | <A HREF="/dear-lego/?n=5668&t=f&v=b">Brief</A> | <A HREF="/dear-lego/?n=5668&t=f&v=c">Compact</A>

      </FONT></SMALL>  </DL>



      </FONT>  </TD>

    <TD WIDTH=20>&nbsp;&nbsp;&nbsp;&nbsp;<BR></TD>

    <TD><FONT FACE="Verdana,Geneva,Arial,Helvetica" SIZE="-1">  
<script type="text/javascript"><!--
google_ad_client = "pub-0089902038208374";
//LUGNET 160x600, Erstellt 14.12.07
google_ad_slot = "5985678701";
google_ad_width = 160;
google_ad_height = 600;
//--></script>
<script type="text/javascript"
src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
<BR>
<style type="text/css"> @import url(http://www.google.com/cse/api/branding.css);
</style>
<div class="cse-branding-bottom" style="background-color:#FFFFFF;color:#000000">
  <div class="cse-branding-form">
    <form action="http://www.google.com/cse" id="cse-search-box">
      <div>
        <input type="hidden" name="cx" value="partner-pub-0089902038208374:9n7bh3k27mb" />
        <input type="hidden" name="ie" value="ISO-8859-1" />
        <input type="text" name="q" size="31" />
        <input type="submit" name="sa" value="Search" />
      </div>
    </form>
  </div>
  <div class="cse-branding-logo">
    <img src="http://www.google.com/images/poweredby_transparent/poweredby_FFFFFF.gif" alt="Google" />
  </div>
  <div class="cse-branding-text">
    Custom Search
  </div>
</div>  </FONT></TD>

    </TR></TABLE>  <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%">
<TR VALIGN=TOP>  </TR></TABLE>  </TD></TR></TABLE>
  </CENTER>
<!-- google_ad_section_end -->  <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 BGCOLOR="#8899BB" WIDTH="100%"><TR>
<TD><SPACER TYPE=BLOCK WIDTH=1 HEIGHT=1></TD></TR></TABLE>

<TABLE BORDER=0 CELLPADDING=4 CELLSPACING=0 BGCOLOR="#E8F0FF" WIDTH="100%">
  <TR VALIGN=TOP>
    <TD ALIGN=LEFT><FONT FACE="Geneva,Arial,Helvetica" SIZE="-2" COLOR="#000033">  <A HREF="/sitemap.cgi">Newsgroup Tree</A> &nbsp;|&nbsp; <A HREF="http://www.lugnet.com/admin/terms/agreement">Terms of Use</A> &nbsp;|&nbsp; <A HREF="http://www.lugnet.com/admin/feedback/">Feedback</A><BR>
    </FONT></TD>
    <TD ALIGN=RIGHT><FONT FACE="Geneva,Arial,Helvetica" SIZE="-2" COLOR="#000033"> &copy;2005 LUGNET. All rights reserved. - hosted by <a href="http://www.steinbruch.info/" target="_blank">steinbruch.info GbR</a><BR>
    </FONT></TD> 
  </TR>
</TABLE>

<script type="text/javascript">
var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");
document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-3258989-12");
pageTracker._initData();
pageTracker._trackPageview();
</script>
</BODY>
</HTML>  `

If that is your string, then you can get the material bounded by the string 'A HREF="' using strsplit:

txt <- '</TABLE> <BR> <BR>  <FONT FACE="Verdana,Geneva,Helvetica" SIZE="-1" COLOR="#990000"><B>

    Message has 2 Replies: </B></FONT><BR>   <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH="100%"> <TR VALIGN=TOP BGCOLOR="#E0E0E0"><TD ALIGN=LEFT><A HREF="/dear-lego/?n=14"><IMG BORDER=5 HEIGHT=3 WIDTH=3 SRC="/news/x.gif"></A></TD><TD><FONT SIZE="-2">&nbsp;&nbsp;</FONT></TD><TD ALIGN=LEFT><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2"><A HREF="/dear-lego/?n=14">Re: Plate Paks</A><BR></FONT></TD><TD ALIGN=RIGHT><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2">&nbsp;Tom Stangl<BR></FONT></TD></TR><TR BGCOLOR="#F8F8F8"><TD COLSPAN=4 ALIGN=LEFT VALIGN=TOP><FONT FACE="Verdana,Geneva,Helvetica" SIZE="-2" '

This is the second fragment:

> strsplit(txt, split='A HREF="')[[1]][2]
[1] "/dear-lego/?n=14\"><IMG BORDER=5 HEIGHT=3 WIDTH=3 SRC=\"/news/x.gif\"></A></TD><TD><FONT SIZE=\"-2\">&nbsp;&nbsp;</FONT></TD><TD ALIGN=LEFT><FONT FACE=\"Verdana,Geneva,Helvetica\" SIZE=\"-2\"><"
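If only the link itself is wanted, rather than everything up to the next anchor, a regular expression can isolate the quoted value. This is a minimal base-R sketch, assuming the links always appear as `A HREF="..."` with double quotes; the helper name `extract_hrefs` and the sample string are made up for illustration.

```r
# Hypothetical helper: return every value of A HREF="..." found in one string.
extract_hrefs <- function(html) {
  # Find 'A HREF="' followed by everything up to the next double quote.
  hits <- regmatches(html, gregexpr('A HREF="[^"]*"', html, ignore.case = TRUE))[[1]]
  # Keep only the part between the quotes.
  sub('^A HREF="([^"]*)"$', "\\1", hits, ignore.case = TRUE)
}

# Illustrative input, shaped like the reply table in the fragment above.
sample_html <- '<TD><A HREF="/dear-lego/?n=14"><IMG SRC="/news/x.gif"></A></TD><A HREF="/dear-lego/?n=15">Re: Plate Paks</A>'
links <- extract_hrefs(sample_html)
print(links)
```

Because the same link can appear more than once in a row of the reply table, wrapping the result in `unique()` may be useful. The pattern matches every anchor in the string, so the input should be limited to the "in reply to" part of the page if only those links are wanted.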

There are probably proper XML and HTML processing approaches, but they generally require an example with all the headers, and you have removed all of those.
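For the first task (the number of messages in the thread), the same regular-expression approach might work. The sketch below assumes the count always appears in the form `<B>1 Message in This Thread:</B>`, as in the sample page; the helper names `read_page` and `extract_thread_count` are hypothetical, and a temporary file stands in for one of the saved posts.

```r
# Hypothetical helper: read a saved post into a single string.
read_page <- function(path) {
  paste(readLines(path, warn = FALSE), collapse = "\n")
}

# Hypothetical helper: pull x out of "x Message(s) in This Thread".
extract_thread_count <- function(html) {
  m <- regmatches(html, regexpr("[0-9]+ +Messages? in This Thread", html,
                                ignore.case = TRUE))
  # No match: the page has no thread header in the assumed form.
  if (length(m) == 0) return(NA_integer_)
  # Drop everything after the leading number.
  as.integer(sub(" .*$", "", m))
}

# Demonstration on a temporary file standing in for one saved post.
tmp <- tempfile(fileext = ".txt")
writeLines("<B>12 Messages in This Thread:</B><BR>", tmp)
count <- extract_thread_count(read_page(tmp))
print(count)
```

To process the whole collection, the two helpers could be applied to each path returned by `list.files()` with `sapply()`. Files that are not in the session's native encoding may need the `encoding` argument of `readLines()`.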

You may want to look at the following question:

Is there a simple way in R to extract only the text elements of an HTML page?

I think it best matches your question.
