
Extracting multiple text data from a single attribute

I'm trying to extract a few fields of data from a single attribute of a single selector. What I mean is that all of the information I'm trying to scrape is in parts of the site I can get this way:

response.css('td::attr(onclick)').get()

When I run that, I receive:

handler(this, "HANDLE", {"asdf":"5777","zxcv":"754401863","hjkl":"666","tyui":"277371661","name":"lolol","something":"someth1ng","type":"animal","genre":"javasux"});return false;'

Let's say the Scrapy Items I'm trying to create have fields a, b and c, where I would like a to be the value of "hjkl" in the above excerpt (666), b to be the value of "name" (lolol), and c to be the value of "asdf" (5777).

Where in the scraper/project should I include the logic that would do this? Sadly, I don't think I can "get" the values of fields like asdf using selectors, so I would have to use item loaders/item processors, is that correct? And I assume the actual selection would have to be done using a regexp? I'm asking because, while scraping a single site in this particular project will be relatively simple, I have many of those sites to go through, and regular expressions aren't too fast from what I understand.

Yes, I think regex would be the simpler solution. In the end that's just a long string, so you can also clean it down to only the information you need, or just extract the part that looks like a dictionary and parse it as JSON.
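As a rough sketch of the "grab the dictionary-looking part and parse it as JSON" idea, using only the standard library: the names `parse_onclick` and `OBJECT_RE` are made up for illustration, and the pattern assumes the object is flat (no nested braces, and no `}` inside a value), as in the sample from the question.

```python
import json
import re

# Compiled once and reused. Matches from the first "{" to the first "}",
# which is only correct for a flat object like the one in the question.
OBJECT_RE = re.compile(r"\{.*?\}")


def parse_onclick(onclick):
    """Return the dict embedded in an onclick string, or None if there is none."""
    match = OBJECT_RE.search(onclick)
    if match is None:
        return None
    return json.loads(match.group(0))


# Sample value, copied from the question (what .get() returned there).
onclick = (
    'handler(this, "HANDLE", {"asdf":"5777","zxcv":"754401863","hjkl":"666",'
    '"tyui":"277371661","name":"lolol","something":"someth1ng",'
    '"type":"animal","genre":"javasux"});return false;'
)

data = parse_onclick(onclick)
# Map the keys onto the fields a, b and c described in the question.
item = {"a": data["hjkl"], "b": data["name"], "c": data["asdf"]}
print(item)  # {'a': '666', 'b': 'lolol', 'c': '5777'}
```

In a spider, something like this could probably be called right in the callback on the result of `response.css('td::attr(onclick)').get()`, or wrapped as an input processor if you prefer item loaders. Note that the values come back as strings, so convert them if the item fields should be numbers.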

Another way would be to use a JavaScript parser, because JavaScript is what you have inside that string. You can use js2xml for that.


 