Extract text content from a web page using an ASP.NET Web Form
I'm trying to load a page into my ASP.NET Web Form, extract only the text from it, and display the extracted text in a textarea.
Like this:
And my code is:
<%@ Page Language="C#" AutoEventWireup="true" CodeFile="Default.aspx.cs" Inherits="_Default" %>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title></title>
<style type="text/css">
#form1 {
height: 500px;
width: 1199px;
}
.auto-style1 {}
#TextArea1 {
height: 288px;
width: 1157px;
}
</style>
</head>
<body>
<form id="form1" runat="server">
<asp:Button ID="Button1" runat="server" Text="Click me"
OnClick="Button1_Click" OnClientClick="aspnetForm.target ='_blank';"
Width="160px" CssClass="auto-style1" Height="32px" />
<br />
<br />
<asp:RadioButtonList ID="RadioButtonList1" runat="server">
<asp:ListItem>CNN</asp:ListItem>
<asp:ListItem>BBC</asp:ListItem>
<asp:ListItem>FOX</asp:ListItem>
</asp:RadioButtonList>
<br />
<br />
<textarea id="TextArea1" name="S1" runat="server" ></textarea></form>
</body>
</html>
and the code-behind:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.IO;
using System.Drawing;
using System.Threading;
using System.Windows.Forms;
public partial class _Default : System.Web.UI.Page
{
Uri url = null;
WebBrowser wb = new WebBrowser();
protected void Button1_Click(object sender, EventArgs e)
{
wb.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(DisplayText);
if (RadioButtonList1.Text == "CNN")
{
url = new Uri("http://www.edition.cnn.com/");
wb.Url = url;
//Response.Redirect(url);
}
else if (RadioButtonList1.Text == "BBC")
{
url = new Uri("http://www.bbc.com/");
wb.Url = url;
}
else
{
url = new Uri("http://www.foxnews.com/");
wb.Url = url;
}
}
private void DisplayText(object sender, WebBrowserDocumentCompletedEventArgs e)
{
WebBrowser wb = (WebBrowser)sender;
wb.Document.ExecCommand("SelectAll", false, null);
wb.Document.ExecCommand("Copy", false, null);
TextArea1.Value = Clipboard.GetText();
}
protected void Page_Load(object sender, EventArgs e)
{
}
}
But I get this error on the line:
WebBrowser wb = new WebBrowser();
ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.
So what am I doing wrong? Please help, and many thanks in advance.
I have never attempted to use WebBrowser as an object reference, but this being a Web Form means you will be receiving postbacks, and if you re-instantiate the browser reference on each one, it isn't going to behave like the Page object. I would just use the Page object: you can reach any controls and methods you need, while also using the Request and Response objects. I would also match on the RadioButtonList control, as in the code below:
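protected void Page_Load(object sender, EventArgs e)
{
    if (Page.IsPostBack)
    {
        string url;
        // Illustration only: a freshly built list has no selection, so
        // SelectedItem is null here. Use the list declared in the markup.
        RadioButtonList rdl = new RadioButtonList();
        url = rdl.SelectedItem.Text;
    }
}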
Of course you'd just grab the .SelectedItem.Text from your markup-based RadioButtonList, instead of building one.
I checked, and it also seems like the WebBrowser object is under System.Windows.Forms. From my experience, you never want to use that library in Web Forms (bad experiences with MsgBox).
I'd refactor using the sample above and just call
Response.Redirect(url);
Hope that helps!
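As a rough sketch of the matching step described above, the selected item's text could be mapped to a URL before redirecting. The NewsSites class and ResolveUrl method are hypothetical names introduced only for illustration; the site names and URLs are the ones from the question's markup and code-behind.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of the original answer): maps the text of the
// selected radio button to the URL used in the question's code.
public static class NewsSites
{
    private static readonly Dictionary<string, string> Urls =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "CNN", "http://www.edition.cnn.com/" },
            { "BBC", "http://www.bbc.com/" },
            { "FOX", "http://www.foxnews.com/" }
        };

    // Returns the URL for the selection, or null when nothing valid is selected.
    public static string ResolveUrl(string selectedText)
    {
        if (string.IsNullOrEmpty(selectedText))
        {
            return null;
        }

        string url;
        return Urls.TryGetValue(selectedText, out url) ? url : null;
    }
}
```

In the code-behind this might be used as string url = NewsSites.ResolveUrl(RadioButtonList1.SelectedValue); followed by Response.Redirect(url) only when url is not null, since SelectedValue is an empty string when no item is selected.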
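You may want to consider an approach based on a different automation tool, such as WatiN (see "Using Windows Forms WebBrowser to access C# ASP.NET") or HTML Agility Pack (see "Best method for website automation?").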
You can use HTML Agility Pack. Here is some sample code, taken from here:
// doc is an HtmlAgilityPack.HtmlDocument that has already been loaded.
var root = doc.DocumentNode;
var sb = new StringBuilder();
foreach (var node in root.DescendantNodesAndSelf())
{
    // Only leaf nodes carry text of their own.
    if (!node.HasChildNodes)
    {
        string text = node.InnerText;
        if (!string.IsNullOrEmpty(text))
            sb.AppendLine(text.Trim());
    }
}
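Here is sample code showing how to load a document and work with its nodes (taken from here); note that it reads a local file rather than downloading a page:
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
// SelectNodes returns null when no node matches the XPath expression.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        // FixLink is the sample's own helper; it is not shown in the source.
        att.Value = FixLink(att);
    }
}
doc.Save("file.htm");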
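To tie this back to the question, here is a minimal sketch of fetching a page and collecting its visible text with HTML Agility Pack. It assumes the HtmlAgilityPack NuGet package is referenced; the PageText class and its two method names are hypothetical, and skipping script and style content is my own addition rather than something from the samples above.

```csharp
using System;
using System.Text;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper for illustration only.
public static class PageText
{
    // Collects the text nodes of an HTML string, one trimmed entry per line.
    public static string ExtractText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (HtmlNode node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Keep text nodes only; elements and comments are skipped.
            if (node.NodeType != HtmlNodeType.Text)
            {
                continue;
            }

            // Assumption: script and style content is not wanted in the output.
            string parent = node.ParentNode != null ? node.ParentNode.Name : "";
            if (parent == "script" || parent == "style")
            {
                continue;
            }

            // Decode entities such as &amp; before trimming.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                sb.AppendLine(text);
            }
        }
        return sb.ToString();
    }

    // Downloads the page with HtmlWeb and extracts its text.
    public static string ExtractTextFromUrl(string url)
    {
        HtmlDocument doc = new HtmlWeb().Load(url);
        return ExtractText(doc.DocumentNode.OuterHtml);
    }
}
```

In the question's page, the button handler could then assign TextArea1.Value = PageText.ExtractTextFromUrl(url); with no WebBrowser control or clipboard involved, so the single-threaded apartment requirement does not come up.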