Capturing text between commas

Question

I have a line that contains commas, and I want to capture the data between the commas.

line = '"",,,,,,,,,ce: appears to assume ,that\n'

I am capturing it with a regular expression, using pattern = (""),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)\\n

The output is:

Output 1
1.  ""
2.  ,
3.  Empty
4.  Empty
5.  Empty
6.  Empty
7.  Empty
8.  Empty
9.  ce: appears to assume
10. that

I want the output to be:

Output 2
1.  ""
2.  Empty
3.  Empty
4.  Empty
5.  Empty
6.  Empty
7.  Empty
8.  Empty
9.  Empty
10. ce: appears to assume, that

Basically, I am looking for some generic greedy approach that ignores the commas inside the text.

Answer 1

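A regular expression seems like the wrong solution here. If you know how many matches you want (you specified 10), then you know how many commas to expect. Use str.split and limit the number of splits to one fewer than the number of fields, so that any further commas stay in the last field: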

>>> line.split(',', 9)
['""', '', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
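The same idea can be wrapped in a small function. This is only a sketch: the name split_fields and its n_fields parameter are made up for illustration, and stripping the trailing newline is an assumption about what the asker wants.

```python
def split_fields(line, n_fields):
    """Split `line` into exactly `n_fields` comma-separated fields.

    Only the first n_fields - 1 commas act as delimiters, so any
    further commas stay inside the last field.
    """
    # Assumption: the trailing newline is not wanted in the last field.
    fields = line.rstrip('\n').split(',', n_fields - 1)
    if len(fields) != n_fields:
        raise ValueError('expected %d fields, got %d' % (n_fields, len(fields)))
    return fields


line = '"",,,,,,,,,ce: appears to assume ,that\n'
fields = split_fields(line, 10)

# Print in the same numbered style as the question.
for number, field in enumerate(fields, start=1):
    print('{:>2}.  {}'.format(number, field or 'Empty'))
```

The length check makes a line with too few commas fail loudly instead of silently producing a short list.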

Answer 2

You can use itertools.groupby here to group the pieces by whether their length is greater than zero:

import itertools

someline = '"",,,,,,,,ce: appears to assume ,that\n'

# Group by length greater than 0
res = [(i, ','.join(x)) for i,x in itertools.groupby(someline.split(','), key=lambda x: len(x)>0)]

# [(True, '""'), (False, ',,,,,,'), (True, 'ce: appears to assume ,that\n')]

# Then you can just gather your results
results = []
for i, x in res:
    if i is True:
        results.append(x)
    else:
        results.extend(x.split(','))

results
# ['""', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']

This saves you from having to check for a specific number of commas, which helps if that number is not fixed for every line.
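For reuse, the two steps above can be folded into one function. This is a sketch of the same grouping idea, not a different algorithm; the function name is made up. It also shows a caveat that follows from the grouping rule: any run of adjacent non-empty pieces is treated as a single field, so two genuinely separate non-empty fields that sit next to each other get merged.

```python
import itertools


def split_keeping_text_commas(line):
    """Split on commas, then re-join runs of adjacent non-empty pieces."""
    results = []
    pieces = line.split(',')
    # key=bool is True for non-empty strings, False for empty ones.
    for non_empty, group in itertools.groupby(pieces, key=bool):
        group = list(group)
        if non_empty:
            # A run of non-empty pieces is treated as one field whose
            # commas belonged to the text.
            results.append(','.join(group))
        else:
            # Each empty piece is its own empty field.
            results.extend(group)
    return results


# The number of empty fields does not need to be known in advance.
short_line = split_keeping_text_commas('"",,,ce: appears to assume ,that\n')
print(short_line)

# Caveat: 'a' and 'b' are adjacent non-empty fields, so they are merged.
merged = split_keeping_text_commas('a,b,,c')
print(merged)
```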

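Different format

However, I think the real problem is that the comma is not only the delimiter but also an element of the data, which makes the problem somewhat ambiguous. According to the documentation, it looks like you can specify a different output format, such as .tsv, which is delimited by \t and avoids the problem entirely: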

tabula.convert_into("test.pdf", "output.tsv", output_format="tsv", pages='all')

Your line would then look like this:

someline = '""\t\t\t\t\t\t\t\tce: appears to assume ,that\n'

# Much easier to handle
someline.split('\t')

# ['""', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
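If the whole .tsv output needs to be read rather than a single line, the standard csv module is an alternative to calling split on each line. The sketch below swaps in csv.reader for str.split and assumes the data looks like the sample line above; whether tabula really writes the empty first cell as a literal "" is not confirmed here.

```python
import csv
import io

# Assumption: the TSV content looks like the sample line in this answer
# (a literal "" followed by 8 tabs and the text field).
tsv_text = '""' + '\t' * 8 + 'ce: appears to assume ,that\n'

# QUOTE_NONE keeps the literal "" in the first field instead of
# interpreting it as an empty quoted string.
reader = csv.reader(io.StringIO(tsv_text, newline=''),
                    delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0])
```

Unlike split, the reader also drops the line terminator from the last field.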

Answer 3

I don't know whether you need all the blanks. Maybe this is what you want:

line = '"",,,,,,,,,ce: appears to assume ,that\n'
separados = line.split(',,')

for i, campo in enumerate(separados):
    # you can add more custom filters here
    if campo.startswith(','):  # drop a comma left over at the start
        campo = campo[1:]
    if campo.endswith(','):    # drop a comma left over at the end
        campo = campo[:-1]
    separados[i] = campo

This is what you get:

'""'
''
''
''
'ce: appears to assume ,that\n'

Answer 4

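The problem is that .* matches too many characters, including commas. You should create groups that match any character except a comma, for example: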

^(""),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*)$

The last group is allowed to match commas, so that it can capture the comma inside ce: appears to assume ,that

#!/usr/bin/env python

import re

reg = re.compile('^(""),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*)$')

match = reg.match('"",,,,,,,,,ce: appears to assume ,that\n')

for i in range(1,11):
    print('{:>2s}.  {}'.format(str(i),"Empty" if len(match.group(i))==0 else match.group(i)))

This gives the desired output:

 1.  ""
 2.  Empty
 3.  Empty
 4.  Empty
 5.  Empty
 6.  Empty
 7.  Empty
 8.  Empty
 9.  Empty
10.  ce: appears to assume ,that
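To avoid typing the groups by hand, the pattern can be generated for any number of fields. The sketch below is illustrative: build_pattern is a made-up helper, and it uses [^,]* for the first group as well instead of the literal (""). It also runs the question's all-greedy pattern (with $ in place of its original ending) on the same line to show where the extra comma ends up.

```python
import re


def build_pattern(n_fields):
    """Build a regex with n_fields groups; only the last may contain commas."""
    leading = ['([^,]*)'] * (n_fields - 1)
    return re.compile('^' + ','.join(leading + ['(.*)']) + '$')


line = '"",,,,,,,,,ce: appears to assume ,that\n'

# Pattern from the question: ("") followed by nine greedy (.*) groups.
greedy = re.compile('^(""),' + ','.join(['(.*)'] * 9) + '$')
greedy_groups = greedy.match(line).groups()

# Generated pattern: nine comma-free groups, then one catch-all group.
strict_groups = build_pattern(10).match(line).groups()

print(greedy_groups)  # group 2 swallows a comma, the text is split in two
print(strict_groups)  # empty groups stay empty, the text stays whole
```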

4 answers

Answer 1
2 2019-04-26 15:28:36

Answer 2
2 2019-04-26 15:36:57

Answer 3
0 2019-04-26 16:03:14

Answer 4
0 2019-04-26 21:19:17
