Reading Regular Expression in ASP.NET C#
I am able to read and download a list of .jpg files on a page using this regular expression:
MatchCollection match = Regex.Matches(htmlText,@"http://.*?\b.jpg\b", RegexOptions.RightToLeft);
Sample output: http://somefiles.jpg, taken from this line in the HTML: <img src="http://somefiles.jpg"/>
Question: how can I read files in this format?
<a href="download/datavoila-setup.exe" id="button_download" title="Download your copy of DataVoila!" onclick="pageTracker._trackPageview('/download/datavoila-setup.exe')"></a>
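I only want to extract the files ending in .exe on the page. So in the example above, I just want to get the datavoila-setup.exe file. Sorry, I'm a bit of a noob and confused about how to do it T_T. Thanks in advance to anyone who can help me. :)
Here is my updated code, but I get a "No source available" error on the HtmlDocument doc = new HtmlDocument(); part, and the list comes back null :(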
protected void Button2_Click(object sender, EventArgs e)
{
    //Get the url given by the user
    string urls;
    urls = txtSiteAddress.Text;
    StringBuilder result = new StringBuilder();
    //Give request to the url given
    HttpWebRequest requesters = (HttpWebRequest)HttpWebRequest.Create(urls);
    requesters.UserAgent = "";
    //Check for the web response
    WebResponse response = requesters.GetResponse();
    Stream streams = response.GetResponseStream();
    //reads the url as html codes
    StreamReader readers = new StreamReader(streams);
    string htmlTexts = readers.ReadToEnd();
    HtmlDocument doc = new HtmlDocument();
    doc.Load(streams);
    var list = doc.DocumentNode.SelectNodes("//a[@href]")
        .Select(p => p.Attributes["href"].Value)
        .Where(x => x.EndsWith("exe"))
        .ToList();
    doc.Save("list");
}
This is Flipbed's answer. It runs, but nothing is being captured :( I think the part that splits the HTML into text needs some editing.
protected void Button2_Click(object sender, EventArgs e)
{
    //Get the url given by the user
    string urls;
    urls = txtSiteAddress.Text;
    StringBuilder result = new StringBuilder();
    //Give request to the url given
    HttpWebRequest requesters = (HttpWebRequest)HttpWebRequest.Create(urls);
    requesters.UserAgent = "";
    //Check for the web response
    WebResponse response = requesters.GetResponse();
    Stream streams = response.GetResponseStream();
    //reads the url as html codes
    StreamReader readers = new StreamReader(streams);
    string htmlTexts = readers.ReadToEnd();
    WebClient webclient = new WebClient();
    string checkurl = webclient.DownloadString(urls);
    List<string> list = new List<string>();//!3
    //Splits the html on " into text parts
    string[] parts = htmlTexts.Split(new string[] { "\"" },//!3
        StringSplitOptions.RemoveEmptyEntries);//!3
    //Compares the split text with valid file extension
    foreach (string part in parts)//!3
    {
        if (part.EndsWith(".exe"))//!3
        {
            list.Add(part);//!3
            //Download the data into a Byte array
            byte[] fileData = webclient.DownloadData(this.txtSiteAddress.Text + '/' + part);//!6
            //Create FileStream that will write the byte array to
            FileStream file =//!6
                File.Create(this.txtDownloadPath.Text + "\\" + list);//!6
            //Write the full byte array to the file
            file.Write(fileData, 0, fileData.Length);//!6
            //Download message complete
            lblMessage.Text = "Download Complete!";
            //Clears the textfields content
            txtSiteAddress.Text = "";
            txtDownloadPath.Text = "";
            //Close the file so other processes can access it
            file.Close();
            break;
        }
    }
}
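Regular expressions are not a good choice for parsing HTML files.
HTML is neither strict nor regular in its format.
You can use this code to retrieve all the exe files with HtmlAgilityPack.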
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://yourWebSite.com");
var itemList = doc.DocumentNode.SelectNodes("//a[@href]")//get all hrefs
    .Select(p => p.Attributes["href"].Value)
    .Where(x => x.EndsWith("exe"))
    .ToList();
itemList now contains all the exe files.
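This is not an answer, but it is too long for a comment. (I will delete it later.)
To settle the "it works" / "it doesn't work" back-and-forth, here is the complete code for anyone who wants to check: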
string html = @"<a href=""download/datavoila-setup.exe"" id=""button_download"" title=""Download your copy of DataVoila!"" onclick=""pageTracker._trackPageview('/download/datavoila-setup.exe')""></a>";
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
//Anirudh's Solution
var itemList = doc.DocumentNode.SelectNodes("//a//@href")//get all hrefs
    .Select(p => p.InnerText)
    .Where(x => x.EndsWith("exe"))
    .ToList();
//returns empty list
//correct one
var itemList2 = doc.DocumentNode.SelectNodes("//a[@href]")
    .Select(p => p.Attributes["href"].Value)
    .Where(x => x.EndsWith("exe"))
    .ToList();
//returns download/datavoila-setup.exe
I would use FizzlerEx, which adds jQuery-like selector syntax to HtmlAgilityPack. Use the ends-with selector to test the href attribute:
using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;
var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;
foreach (var item in page.QuerySelectorAll("a[href$='exe']"))
{
    var file = item.Attributes["href"].Value;
}
And an explanation of why parsing HTML with RegEx is a bad idea: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
Instead of using a regular expression, you could also just use ordinary code.
List<string> files = new List<string>();
string[] parts = htmlText.Split(new string[]{"\""},
    StringSplitOptions.RemoveEmptyEntries);
foreach (string part in parts)
{
    if (part.EndsWith(".exe"))
        files.Add(part);
}
In this case, the files list will contain every file that was found.
You could also do it like this:
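To see what the quote-splitting approach picks up, here is a minimal sketch that wraps the same loop in a method so it can be tried on the anchor tag from the question. The class and method names (QuoteSplitSketch, FindExeParts) are made up for illustration, and the case-insensitive comparison is an addition that the code above does not have.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteSplitSketch
{
    // Hypothetical helper: returns every fragment between double quotes
    // that ends with ".exe".
    public static List<string> FindExeParts(string htmlText)
    {
        List<string> files = new List<string>();
        // Cut the markup at every double quote, so attribute values
        // become separate array entries.
        string[] parts = htmlText.Split(new string[] { "\"" },
            StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            // Assumption: ".EXE" should count too, hence OrdinalIgnoreCase.
            if (part.EndsWith(".exe", StringComparison.OrdinalIgnoreCase))
                files.Add(part);
        }
        return files;
    }
}
```

On the tag from the question this should pick up only the href value: the onclick attribute also mentions the file, but its value ends with ') rather than .exe, so it is skipped. Note that any quoted string ending in .exe is collected, whether or not it is a link.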
List<string> files = new List<string>();
string[] hrefs = htmlText.Split(new string[]{"href=\""},
    StringSplitOptions.RemoveEmptyEntries);
foreach (string href in hrefs)
{
    string[] possibleFile = href.Split(new string[]{"\""},
        StringSplitOptions.RemoveEmptyEntries);
    if (possibleFile.Length > 0 && possibleFile[0].EndsWith(".exe"))
        files.Add(possibleFile[0]);
}
This also checks that the exe file is inside an href.
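The href values collected this way are often relative (download/datavoila-setup.exe in the question), so they usually have to be resolved against the page address before downloading. A possible sketch using only System.Uri and System.IO.Path follows; the class and method names are hypothetical, and it assumes the page URL is absolute.

```csharp
using System;
using System.IO;

public static class DownloadTargetSketch
{
    // Resolves an href (relative or absolute) against the address of the
    // page it was found on. Assumes pageUrl is an absolute URL.
    public static Uri ResolveLink(string pageUrl, string href)
    {
        return new Uri(new Uri(pageUrl), href);
    }

    // Returns the last path segment, usable as the local file name.
    public static string LocalFileName(Uri fileUri)
    {
        return Path.GetFileName(fileUri.LocalPath);
    }
}
```

In the question's handler, something like ResolveLink(txtSiteAddress.Text, part) could supply the address for webclient.DownloadData, and Path.Combine(txtDownloadPath.Text, LocalFileName(uri)) the target for File.Create, instead of concatenating the strings by hand.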