Node.js vs. Python for parsing HTML

I'm sorry if this is a duplicate. I saw a similar question earlier, but then I couldn't find it again.

To be more concrete than other entries, say I have:

<h1>Hey, there.</h1>

h1 {
  color: green
}

I want to change it to "blue". Can I do that more efficiently, with less code, more accurately, or in less time using Python or Node.js?

Say I have:

<h1>Hey, there1.</h1>

Is it easier to change "Hey, there1." to "Hey, there2." with Python or Node.js?

Also, say I have:

<h1>Hey, there.</h1> and want to add <a> tags so that it is:

<h1><a href="">Hey, there.</a></h1>

Those are the three cases I can think of. I am trying to turn a block of HTML tags and content into HTML with different organization, content, and CSS.

I am not very familiar with either Python or JavaScript at the moment, and I don't want to go down the wrong path from the beginning.

I am also thinking of doing a ton of find/replace statements, but that's not very elegant.

Thank you for any help/insight.

Luckily, both Python and Node have nifty tools for that.

However, since you're just starting out, I agree that the question you should be asking is: "What do I want to learn from this?"

There is no point in choosing one over the other if your only goal is to make a scraper/parser. Think about your long-term goals and how you ultimately want to shape your skill set.

That lecture aside, I'll point you in the Node.js direction, since I personally think it's a much easier route. As others have said, JavaScript is the language of the web, so if you plan to do web development, scraping, or anything similar in the future, it will be very helpful.

Step one: Node.js

Install Node, as I'm sure you already have (https://nodejs.org/en/), make a nice little folder for this project, cd into the folder, and run this:

npm init

This creates a package.json, which marks this folder as your project. Any packages you install by running npm inside this directory will be installed here.

Step two: the packages

You'll want to install an HTML parser with some nice features. I recommend cheerio, since you'll pick up a bit of jQuery along the way (https://github.com/cheeriojs/cheerio):

npm install --save cheerio

Great, we have the start of a fun project!

Step three: main.js

Name it whatever you want, but let's make a .js file to get going with:

// cheerio is the only module this example needs
const cheerio = require('cheerio');

//
// adapted from the cheerio docs
//
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Change the text, then add a class to the same element
$('h2.title').text('Hello there!');
$('h2').addClass('welcome');

// Print just the <h2> to see the change from "Hello world" to "Hello there!"
console.log($.html('h2'));
//=> <h2 class="title welcome">Hello there!</h2>

There's no right answer.

Pick whatever you think suits your goals.


 