
How to extract text around a word in Excel or Python?

I have a text of several thousand lines that looks like this:

ksjd 234first special 34-37xy kjsbn
sde 89second special 22-23xh ewio
647red special 55fg dsk
uuire another special 98
another special 107r
green special 55-59 ewk
blue special 31-39jkl

I need to extract the word before "special" and the number (or number range) to its right. In other words, I want:

[image]

converted into a table: 转换成表格:

[image]

A fast way to do this in Python is to use regular expressions:

In [1]: import re

In [2]: text = '''234first special 34-37xy
   ...: 89second special 22-23xh
   ...: 647red special 55fg
   ...: another special 98
   ...: another special 107r
   ...: green special 55-59
   ...: blue special 31-39jkl'''

In [3]: [re.findall(r'\d*\s*(\S+)\s+(special)\s+(\d+(?:-\d+)?)', line)[0] for line in text.splitlines()]
Out[3]: 
[('first', 'special', '34-37'),
 ('second', 'special', '22-23'),
 ('red', 'special', '55'),
 ('another', 'special', '98'),
 ('another', 'special', '107'),
 ('green', 'special', '55-59'),
 ('blue', 'special', '31-39')]
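The session above runs on a trimmed copy of the data. Here is a minimal sketch for the lines as originally posted (with leading tokens such as `ksjd`), assuming the word before "special" consists of letters only. The names `PATTERN` and `extract` are illustrative and not part of the answer:

```python
import re

# Assumption: the word before "special" is letters only, so leading digits
# such as "234" in "234first" are skipped by the pattern itself.
PATTERN = re.compile(r'([A-Za-z]+)\s+special\s+(\d+(?:-\d+)?)')

def extract(line):
    """Return (word, number_or_range) for a line, or None if it does not match."""
    m = PATTERN.search(line)
    return m.groups() if m else None

lines = [
    'ksjd 234first special 34-37xy kjsbn',
    'sde 89second special 22-23xh ewio',
    '647red special 55fg dsk',
    'uuire another special 98',
    'another special 107r',
    'green special 55-59 ewk',
    'blue special 31-39jkl',
]

rows = [extract(line) for line in lines]
for row in rows:
    print(row)
```

Using `search` with a `None` check, rather than indexing `findall(...)[0]`, avoids an `IndexError` on lines that do not contain the keyword.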

In Excel, you can use a formula to extract the text between two words as follows:

  1. Select a blank cell, type the formula =MID(A1,SEARCH("KTE",A1)+3,SEARCH("feature",A1)-SEARCH("KTE",A1)-4) into it, then press Enter.

  2. Drag the fill handle over the range you want to apply this formula to. Only the text between "KTE" and "feature" is extracted.

Notes:

  1. In this formula, A1 is the cell you want to extract text from.

  2. KTE and feature are the two words between which you want to extract text.

  3. The number 3 is the character length of KTE, and the number 4 is the character length of KTE plus one.
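To make the arithmetic behind the 3 and the 4 concrete, here is a rough Python re-creation of the formula. The helpers `excel_search`, `excel_mid` and `text_between` are hypothetical stand-ins written for this sketch; they do not reproduce Excel's wildcard support or its error values:

```python
def excel_search(find_text, within_text):
    """Rough stand-in for Excel SEARCH: 1-based, case-insensitive position.

    Assumption: wildcards are not supported, and a missing word raises
    ValueError instead of producing an Excel error value.
    """
    pos = within_text.lower().find(find_text.lower())
    if pos < 0:
        raise ValueError(f'{find_text!r} not found')
    return pos + 1

def excel_mid(text, start_num, num_chars):
    """Rough stand-in for Excel MID: start_num is 1-based."""
    return text[start_num - 1:start_num - 1 + num_chars]

def text_between(cell, left, right):
    """Same arithmetic as the formula, with the lengths computed from `left`."""
    start = excel_search(left, cell) + len(left)        # the "+3" for KTE
    length = (excel_search(right, cell)
              - excel_search(left, cell)
              - (len(left) + 1))                        # the "-4" for KTE
    return excel_mid(cell, start, length)

result = text_between('KTE abc feature', 'KTE', 'feature')
print(repr(result))
```

For this input the result is `' abc'`: the extraction starts immediately after KTE, so the leading space is kept, while the extra "plus one" drops the single character just before "feature".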

In addition to what @RolandSmith wrote, here is a way of using regular expressions in Excel with VBA:


Option Explicit
' Returns submatch number Index of the first match in S:
' 1 = word before "special", 2 = "special", 3 = the number or number range.
' Returns an empty string when S does not match the pattern.
Function ExtractSpecial(S As String, Index As Long) As String
    Dim RE As Object, MC As Object
    Const sPat As String = "([a-z]+)\s+(special)\s+([^a-z]+)"

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .IgnoreCase = True
        .MultiLine = False
        .Pattern = sPat
        If .Test(S) = True Then
            Set MC = .Execute(S)
            ExtractSpecial = MC(0).SubMatches(Index - 1)
        End If
    End With

End Function

The Index argument in this UDF selects the 1st, 2nd or 3rd submatch of the first match, so you can easily split the original string into your three desired components.

[image]
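For readers who want to try the same pattern outside Excel, here is a small Python sketch of the UDF's logic. The names `S_PAT` and `extract_special` are made up for this example, and `re.IGNORECASE` is assumed to play the role of `.IgnoreCase = True`:

```python
import re

# Same pattern text as the VBA constant sPat.
S_PAT = re.compile(r'([a-z]+)\s+(special)\s+([^a-z]+)', re.IGNORECASE)

def extract_special(s, index):
    """Return submatch `index` (1, 2 or 3) of the first match, or '' if none."""
    m = S_PAT.search(s)
    return m.group(index) if m else ''

line = 'green special 55-59 ewk'
parts = [extract_special(line, i) for i in (1, 2, 3)]
print(parts)
```

In Python this prints `['green', 'special', '55-59 ']`. The third group `[^a-z]+` matches any run of non-letters, so it also picks up the space before `ewk`; trimming the result is one way to deal with that.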

Since you write that you have "thousands of lines", you may prefer to run a macro. The macro will process the data much more quickly, but is not dynamic. The macro below assumes your original data is in column A on Sheet2, and will put the results in columns C:E on the same worksheet. You can easily change these parameters:


Sub ExtractSpec()
    Dim RE As Object, MC As Object
    Dim wsSrc As Worksheet, wsRes As Worksheet, rRes As Range
    Dim vSrc As Variant, vRes As Variant
    Dim I As Long

    ' Source sheet, result sheet, and top-left cell of the results (C1)
    Set wsSrc = Worksheets("sheet2")
    Set wsRes = Worksheets("sheet2")
    Set rRes = wsRes.Cells(1, 3)

    ' Read column A into a variant array in one step
    With wsSrc
        vSrc = .Range(.Cells(1, 1), .Cells(.Rows.Count, 1).End(xlUp))
    End With

    Set RE = CreateObject("vbscript.regexp")
    With RE
        .Global = True
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "([a-z]+)\s+(special)\s+([^a-z]+)"

        ' One result row per source row; rows without a match stay empty
        ReDim vRes(1 To UBound(vSrc), 1 To 3)
        For I = 1 To UBound(vSrc)
            If .Test(vSrc(I, 1)) = True Then
                Set MC = .Execute(vSrc(I, 1))
                vRes(I, 1) = MC(0).SubMatches(0)
                vRes(I, 2) = MC(0).SubMatches(1)
                vRes(I, 3) = MC(0).SubMatches(2)
            End If
        Next I
    End With

    ' Write all results back to the sheet in one operation
    Set rRes = rRes.Resize(UBound(vRes, 1), UBound(vRes, 2))
    With rRes
        .EntireColumn.Clear
        .Value = vRes
        .EntireColumn.AutoFit
    End With

End Sub
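The same batch idea (read everything, match each row, write the results in one go) can be sketched in Python with the standard `csv` module. This is an illustration under assumptions: `to_rows` and the sample `source` list are invented for the example, the third group uses a digits-only pattern instead of `[^a-z]+`, and an in-memory buffer stands in for a real output file:

```python
import csv
import io
import re

PATTERN = re.compile(r'([a-z]+)\s+(special)\s+(\d+(?:-\d+)?)', re.IGNORECASE)

def to_rows(lines):
    """One output row per input line; non-matching lines give three empty cells."""
    rows = []
    for line in lines:
        m = PATTERN.search(line)
        rows.append(list(m.groups()) if m else ['', '', ''])
    return rows

source = [
    'ksjd 234first special 34-37xy kjsbn',
    'no keyword here',
    'blue special 31-39jkl',
]
rows = to_rows(source)

buffer = io.StringIO()              # stand-in for an opened output file
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Keeping an empty row for each non-matching line mirrors the macro, which leaves the corresponding cells blank so that results stay aligned with the source rows.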

