
How to format a web.response stream before saving it to a text file

I am collecting information with a web crawler that uses the web.response method. I collect the response into a string, save it to a text file, and then search that file with a regular expression. The problem is that the search does not work properly, because the text file contains many random newlines.

My question: is there a way to format the XML (HTML) document I get from the web.response method before saving it to a text file, so that the text contains no random spaces or newlines? I cannot even post the unformatted HTML here, otherwise I would have.

The Internet may hate you for doing this, but if you have predefined conditions, you can transform the string, for example:

var formattedHtml = html.Replace(Environment.NewLine, "");

This could solve your problem, but from a performance point of view it is a bad solution.
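One caveat with the one-liner above: Environment.NewLine is "\r\n" on Windows and "\n" on Unix-like systems, while a web server may send either convention, so the call can leave line breaks behind. A minimal sketch of a variant that removes all three common forms follows; the class and method names are made up for illustration.

```csharp
using System;

public static class LineBreakStripper
{
    // Hypothetical helper: removes "\r\n", "\n", and "\r" so the result
    // does not depend on the line-ending convention the server used.
    public static string RemoveLineBreaks(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        // Handle the two-character sequence first, then any stray single characters.
        return html.Replace("\r\n", "")
                   .Replace("\n", "")
                   .Replace("\r", "");
    }
}
```

Note that deleting line breaks outright joins the words on either side of them; replacing them with a single space may be safer if the breaks fall inside running text.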

Perform the following actions on the response:

  1. Extract the content between the > and < symbols and trim its white space.
  2. Remove any remaining newlines.
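The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name, method name, and the regular expressions are not from the original answer, and the approach treats the HTML as plain text rather than parsing it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlWhitespaceNormalizer
{
    // Matches the text between the end of one tag ('>') and the start of the next ('<').
    private static readonly Regex BetweenTags = new Regex(@">([^<]*)<");

    // Matches Windows, old Mac, and Unix line breaks.
    private static readonly Regex LineBreaks = new Regex(@"\r\n|\r|\n");

    public static string Normalize(string html)
    {
        if (html == null) throw new ArgumentNullException(nameof(html));

        // Step 1: trim leading and trailing white space of the text between tags.
        string trimmed = BetweenTags.Replace(
            html, m => ">" + m.Groups[1].Value.Trim() + "<");

        // Step 2: remove any line breaks that are still present.
        return LineBreaks.Replace(trimmed, "");
    }
}
```

For example, `HtmlWhitespaceNormalizer.Normalize("<div>\n  <p>  Hello world  </p>\n</div>\n")` yields `<div><p>Hello world</p></div>`. Because the sketch works on raw text, it also alters the contents of elements such as pre or script, where white space can be significant.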

A better solution would be to use a more robust regular expression for searching the string.
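As an illustration of what a more tolerant pattern might look like, the sketch below searches for the text of the title element. The target element is purely hypothetical, since the question does not say what is being searched for. The pattern allows arbitrary white space (including newlines) around the text via \s*, and RegexOptions.Singleline lets `.` match newlines inside it.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TolerantSearch
{
    // Hypothetical target: the text inside <title>...</title>.
    // \s* absorbs white space and line breaks around the text;
    // Singleline makes '.' match '\n' so the text may span lines.
    private static readonly Regex TitlePattern = new Regex(
        @"<title\s*>\s*(.*?)\s*</title\s*>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    public static bool TryFindTitle(string html, out string title)
    {
        Match m = TitlePattern.Match(html ?? string.Empty);
        title = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }
}
```

With a pattern like this, the saved file does not need to be reformatted first: the random newlines are absorbed by the pattern itself. Line breaks inside the captured text are kept as they are.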
