How to format a web.response stream before saving it to a text file
I am collecting information with the web.response method as part of a web crawler. I collect the response into a string and save it to a text file, then search that text file using a regular expression.
The problem is that the search does not work properly, because there are many random newlines in the text file.
My question is: is there a way to format the XML (HTML) document I get from the web.response method properly before saving it to a text file, so that there are no random spaces and newlines in the text? I cannot even post the unformatted HTML here, otherwise I would have done it.
The Internet may hate you for doing this, but if you have predefined conditions, you can transform the string, for example:
var formattedHtml = html.Replace(Environment.NewLine, "");
This could solve your problem, but from a performance point of view it is a bad solution.
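One caveat with the Replace call above: `Environment.NewLine` is `\r\n` on Windows and `\n` on Unix-like systems, so it can miss line breaks that the server sent in the other style. Below is a minimal sketch that removes both characters regardless of platform. The class and method names are hypothetical, chosen only for illustration.

```csharp
using System;

public static class NewlineStripper
{
    // Hypothetical helper: removes every carriage return and line feed,
    // whichever line-ending style the server happened to use.
    public static string StripNewlines(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));
        return html.Replace("\r", "").Replace("\n", "");
    }

    public static void Main()
    {
        string html = "<p>Hello\r\n world</p>\n<p>again</p>";
        Console.WriteLine(StripNewlines(html));
    }
}
```

Note that deleting line breaks outright can glue two words together when the break was the only separator; replacing them with a single space instead may be safer for text that will be searched afterwards.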
Perform the following on the response: take the content between the > and < symbols and perform a Trim white space operation on it.
Another, better solution would be to use a better RegEx for searching the string.
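Both suggestions can be sketched roughly as follows. This is only an illustration under assumptions: the helper names are hypothetical, the `<title>` search is just an example pattern, and regular expressions are not a general HTML parser (text inside `<pre>` or `<script>` would be altered too). The first helper trims the text found between a `>` and the next `<`; the second searches with `\s*` and `RegexOptions.Singleline` so that stray line breaks in the markup no longer prevent a match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextHelpers
{
    // Hypothetical helper: for each run of text between '>' and the next '<',
    // collapse whitespace runs to one space and trim both ends.
    public static string TrimBetweenTags(string html)
    {
        return Regex.Replace(html, @"(?<=>)[^<]+(?=<)",
            m => Regex.Replace(m.Value, @"\s+", " ").Trim());
    }

    // Hypothetical helper: finds the <title> text even when the element
    // is split across several lines. Returns null when there is no match.
    public static string FindTitle(string html)
    {
        Match m = Regex.Match(html, @"<title>\s*(.*?)\s*</title>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html = "<html><title>\r\n  Example page\r\n</title>\r\n"
                    + "<p>\r\n  Some text\r\n</p></html>";
        Console.WriteLine(TrimBetweenTags(html));
        Console.WriteLine(FindTitle(html));
    }
}
```

Of the two, the whitespace-tolerant search leaves the saved file untouched, which may be preferable if the original markup is needed later.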