Concatenate strings found by re.findall in Python

I'm scraping data from a website with a search bar.

I run the search with Python and then filter the results for "Words Like These":

tabOne = re.findall(r"[A-Z][a-z]*", str(initialFilter))

The problem is that the data I'm trying to get is occasionally multiple words, such as 'Item Number One', but re.findall returns that as 'Item', 'Number', 'One'.

I want to retain the original form of the data as one phrase, but I'm not sure how to tell Python to group the words together.

The phrases of [A-Z][a-z]* words are always isolated from each other on the page, so I was wondering whether it is possible to check if the text next to each word also matches [A-Z][a-z]* and, if so, group them together.

Any suggestions?

Two different ways:

  1. Change your regex to search for multiple words
  2. Join the regex results back into a string

For (1), you can try something like:

tabOne = re.findall(r"((?:[A-Z][a-z]*\s?)+)", str(initialFilter))
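One caveat with a pattern in which every word may be followed by an optional `\s?` is that a match can pick up trailing whitespace. The following is a minimal sketch, using a hypothetical `sample` string in place of `str(initialFilter)`, that shows this effect next to a variant that only allows whitespace between words. The variant is an assumption for illustration, not part of the original answer.

```python
import re

# Hypothetical sample text standing in for str(initialFilter).
sample = "Item Number One and then Another Item"

# Pattern from the answer: each word may be followed by one whitespace character,
# so a phrase followed by a space keeps that space in the match.
with_trailing = re.findall(r"((?:[A-Z][a-z]*\s?)+)", sample)

# Variant (an assumption, not from the answer): whitespace is only allowed
# between two capitalized words, never after the last one.
between_only = re.findall(r"[A-Z][a-z]*(?:\s[A-Z][a-z]*)*", sample)

print(with_trailing)  # ['Item Number One ', 'Another Item']
print(between_only)   # ['Item Number One', 'Another Item']
```

If you prefer to keep the first pattern, calling `.strip()` on each match should give the same result.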

For (2), you can do something like:

tabOne = re.findall(r"[A-Z][a-z]*", str(initialFilter))
results = ' '.join(tabOne)
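Note that joining every match produces a single string, so separate phrases on the page end up merged. If the phrases need to stay distinct, one possible sketch is to group matches by adjacency, as the question suggests. The `sample` text and the `group_adjacent` helper are hypothetical names used only for illustration.

```python
import re

# Hypothetical sample text standing in for str(initialFilter).
sample = "Item Number One and then Another Item"

# Approach (2) as written: every capitalized word ends up in one string.
words = re.findall(r"[A-Z][a-z]*", sample)
joined = " ".join(words)
print(joined)  # Item Number One Another Item

def group_adjacent(text):
    """Group capitalized words that are separated by exactly one whitespace character."""
    phrases = []
    prev_end = None  # end offset of the previous match, if any
    for m in re.finditer(r"[A-Z][a-z]*", text):
        gap = text[prev_end:m.start()] if prev_end is not None else None
        if gap is not None and len(gap) == 1 and gap.isspace():
            # Directly adjacent to the previous word: extend the current phrase.
            phrases[-1] += gap + m.group()
        else:
            # First word, or something else sits in between: start a new phrase.
            phrases.append(m.group())
        prev_end = m.end()
    return phrases

print(group_adjacent(sample))  # ['Item Number One', 'Another Item']
```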
