Python: Find all words ending in text (re.findall)

Load macOS.txt into a variable named text. Then do the following: find all the occurrences of macOS, Mac OS, and OS X in the text. Put the results in one list. Print the list of those words, then print the following: There are {length of list} words mentioning macOS, Mac OS, or OS X in the text.

I think I should use a regular expression, such as re.findall or re.finditer. Can anyone correct my code below?

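import re

# Read the whole file into a string
with open("macOS.txt", "r") as f:
    text = f.read()

# Matches every run of letters, digits, and hyphens, i.e. every word
pattern = r'[A-Za-z0-9-]+'
ls = re.findall(pattern, text)
print(ls)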

But how do I find all the occurrences of macOS, Mac OS, and OS X in the text?

Or this?

import re
with open('macOS.txt', 'r') as f:
  content = f.read()
# Words ending in OS; the negative lookahead skips exactly 'OS' and 'macOS'
temp = re.findall(r'\b(?!(?:mac)?OS\b)\w*OS\b', content)
print(f'There are {len(temp)} words ending in OS (other than OS and macOS) in the text.')

You can use the fuzzywuzzy library. Take a few letters before and after each 'OS' you find, and use fuzzywuzzy to compare them against the names you are looking for. https://www.geeksforgeeks.org/fuzzywuzzy-python-library/

Alternatively, if your output is limited to one word before and after 'OS', then you can just do this:

  1. Check whether the word contains OS (macOS).
  2. Find the word before OS; if it is 'Mac', concatenate them.
  3. Find the word after OS; if it is 'X', concatenate them.
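The three steps above can be sketched without regular expressions, as follows. This is only an illustration: the function name find_mentions_by_words and the punctuation-stripping rule are assumptions, and step 1 is narrowed to an exact match on 'macOS' so that unrelated words containing OS are not counted.

```python
def find_mentions_by_words(text):
    """Collect 'macOS', 'Mac OS' and 'OS X' by looking at neighbouring words."""
    # Assumption: stripping common punctuation is enough to isolate words,
    # so that 'macOS.' is compared as 'macOS' and 'X,' as 'X'.
    words = [w.strip('.,;:!?()"\'') for w in text.split()]
    found = []
    for i, word in enumerate(words):
        if word == "macOS":
            found.append(word)              # step 1: the word itself is macOS
        elif word == "OS":
            if i > 0 and words[i - 1] == "Mac":
                found.append("Mac OS")      # step 2: previous word is 'Mac'
            elif i + 1 < len(words) and words[i + 1] == "X":
                found.append("OS X")        # step 3: next word is 'X'
    return found


# Hypothetical sample standing in for the contents of macOS.txt
sample = "I upgraded from Mac OS to OS X, and later to macOS."
print(find_mentions_by_words(sample))
```

Because step 2 is checked before step 3, the phrase 'Mac OS X' is reported once, as 'Mac OS'.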

Use

re.findall(r'\b(?:(?:Mac |mac)OS|OS X)\b', s)

See proof.

EXPLANATION
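To show how that call fits the original task, here is a minimal sketch that applies the pattern to a string and prints the list and the count. The names PATTERN, find_mentions and sample are made up for illustration; with the real data, the sample string would be replaced by the contents of macOS.txt (for example, read with open('macOS.txt') and f.read()).

```python
import re

# Same pattern as in the re.findall call above, compiled once for reuse
PATTERN = re.compile(r'\b(?:(?:Mac |mac)OS|OS X)\b')


def find_mentions(text):
    """Return every 'macOS', 'Mac OS' and 'OS X' found in text, in order."""
    return PATTERN.findall(text)


# Hypothetical sample standing in for the contents of macOS.txt
sample = "I upgraded from Mac OS to OS X, and later to macOS."
matches = find_mentions(sample)
print(matches)
print(f"There are {len(matches)} words mentioning macOS, Mac OS, or OS X in the text.")
```

Note that re.findall returns non-overlapping matches, so the phrase 'Mac OS X' is counted once, as 'Mac OS'.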

--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  (?:                      group, but do not capture:
--------------------------------------------------------------------------------
    (?:                      group, but do not capture:
--------------------------------------------------------------------------------
      Mac                      'Mac '
--------------------------------------------------------------------------------
     |                        OR
--------------------------------------------------------------------------------
      mac                      'mac'
--------------------------------------------------------------------------------
    )                        end of grouping
--------------------------------------------------------------------------------
    OS                       'OS'
--------------------------------------------------------------------------------
   |                        OR
--------------------------------------------------------------------------------
    OS X                     'OS X'
--------------------------------------------------------------------------------
  )                        end of grouping
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
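The same breakdown can be kept next to the pattern itself by using re.VERBOSE, which lets a pattern span several lines with comments. This is a sketch of an equivalent pattern, not part of the original answer; the name VERBOSE_PATTERN is an assumption. In verbose mode plain spaces are ignored, so the literal spaces in 'Mac ' and 'OS X' are written as [ ].

```python
import re

VERBOSE_PATTERN = re.compile(r"""
    \b                  # word boundary
    (?:                 # group, but do not capture:
        (?:Mac[ ]|mac)  #   'Mac ' (with a space) or 'mac'
        OS              #   followed by 'OS'
      |                 #  OR
        OS[ ]X          #   'OS X'
    )                   # end of grouping
    \b                  # word boundary
""", re.VERBOSE)

# Hypothetical sample text, used only to compare the two spellings of the pattern
sample = "I upgraded from Mac OS to OS X, and later to macOS."
compact = re.findall(r'\b(?:(?:Mac |mac)OS|OS X)\b', sample)
verbose = VERBOSE_PATTERN.findall(sample)
print(verbose == compact)
```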
