Capture text between ','
I have a line that contains commas, and I want to capture the data between the commas.
line = "",,,,,,,,,ce: appears to assume ,that\n
I am capturing it with a regular expression, pattern = (""),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*),(.*)\\n
The output is:
Output 1
1. ""
2. ,
3. Empty
4. Empty
5. Empty
6. Empty
7. Empty
8. Empty
9. ce: appears to assume
10. that
I want the output to be:
Output 2
1. ""
2. Empty
3. Empty
4. Empty
5. Empty
6. Empty
7. Empty
8. Empty
9. Empty
10. ce: appears to assume, that
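Basically, I am looking for some generic greedy approach that ignores a comma ',' that appears inside the text.
A regular expression seems like the wrong solution here. If you know how many matches you need (you specified 10), then you also know how many commas to expect. Use str.split: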
>>> line.split(',', 9)
['""', '', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
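The same idea can be wrapped in a small helper. This is only a sketch, assuming the number of fields (10 here) is known in advance; the name split_fields and the stripping of the trailing newline are illustrative choices, not part of the answer above.

```python
def split_fields(line, n_fields=10):
    """Split on the first n_fields - 1 commas; any later comma stays in the last field."""
    # Hypothetical helper: drop the trailing newline, then cap the number of splits.
    return line.rstrip('\n').split(',', n_fields - 1)


line = '"",,,,,,,,,ce: appears to assume ,that\n'
fields = split_fields(line)

# Print the fields in the numbered form used in the question.
for number, field in enumerate(fields, start=1):
    print('{}. {}'.format(number, field if field else 'Empty'))
```

Because the split count is capped, the comma inside the last field is never treated as a delimiter.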
You can use itertools.groupby here to group the fields by length:
import itertools
someline = '"",,,,,,,,ce: appears to assume ,that\n'
# Group by length greater than 0
res = [(i, ','.join(x)) for i, x in itertools.groupby(someline.split(','), key=lambda x: len(x) > 0)]
# [(True, '""'), (False, ',,,,,,'), (True, 'ce: appears to assume ,that\n')]
# Then you can just gather your results
results = []
for i, x in res:
    if i:
        # Non-empty run: keep it as one field (embedded commas were re-joined above)
        results.append(x)
    else:
        # Run of empty fields: split it back into individual empty strings
        results.extend(x.split(','))
results
# ['""', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
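This saves you from having to check for a specific number of commas, in case that number is not fixed for every line.
However, I think the real problem is that the comma is not only the delimiter but also part of the data, which makes the problem somewhat ambiguous. According to the documentation, it looks like you can specify a different output format, such as .tsv, which is delimited by \t and avoids the problem entirely: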
tabula.convert_into("test.pdf", "output.tsv", output_format="tsv", pages='all')
Then your line will look like this:
someline = '""\t\t\t\t\t\t\t\tce: appears to assume ,that\n'
# Much easier to handle
someline.split('\t')
# ['""', '', '', '', '', '', '', '', 'ce: appears to assume ,that\n']
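If the whole .tsv file has to be read, the standard csv module can do the splitting as well. The sketch below assumes the file contains lines shaped like someline above; the in-memory string stands in for the real file, and quoting=csv.QUOTE_NONE is an assumption made so that the literal "" is kept rather than parsed as an empty quoted field.

```python
import csv
import io

# Stand-in for the contents of output.tsv (assumed to be tab-delimited).
tsv_text = '""\t\t\t\t\t\t\t\tce: appears to assume ,that\n'

# QUOTE_NONE turns off quote handling, so the "" field is kept verbatim.
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t', quoting=csv.QUOTE_NONE)
rows = list(reader)
print(rows[0])
```

Unlike a plain split('\t'), the reader also drops the line terminator from the last field.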
I am not sure whether you need all of the empty fields. Maybe this is what you want:
separados = line.split(',,')
for i, campo in enumerate(separados):
    # you can add more custom filters here
    if campo.startswith(','):
        campo = campo[1:]   # drop a leftover leading comma
    if campo.endswith(','):
        campo = campo[:-1]  # drop a leftover trailing comma
    separados[i] = campo
This is what you get:
'""'
''
''
''
'ce: appears to assume ,that\n'
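The problem is that .* matches too many characters, including commas. You should build groups that match every character except a comma, for example: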
^(""),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*)$
The last group is allowed to match commas, so that it can match the comma inside ce: appears to assume ,that
#!/usr/bin/env python
import re
reg = re.compile('^(""),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),(.*)$')
match = reg.match('"",,,,,,,,,ce: appears to assume ,that\n')
for i in range(1, 11):
    print('{:>2s}. {}'.format(str(i), "Empty" if len(match.group(i)) == 0 else match.group(i)))
which gives the desired output:
1. ""
2. Empty
3. Empty
4. Empty
5. Empty
6. Empty
7. Empty
8. Empty
9. Empty
10. ce: appears to assume ,that
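To see why the two patterns behave differently, they can be run side by side. This is a sketch under the assumption that every line has a literal "" first field and a known field count; build_pattern is a hypothetical helper, not something from the answer above.

```python
import re

line = '"",,,,,,,,,ce: appears to assume ,that\n'


def build_pattern(n_fields, inner):
    """Hypothetical helper: a literal "" field, n_fields - 2 inner groups, then a final (.*)."""
    groups = ['("")'] + ['({})'.format(inner)] * (n_fields - 2) + ['(.*)']
    return re.compile('^' + ','.join(groups) + '$')


# Every group is .*: the first one greedily swallows a comma and shifts the rest.
greedy = build_pattern(10, '.*').match(line).groups()
# Inner groups are [^,]*: they cannot cross a comma, so only the last group can hold one.
strict = build_pattern(10, '[^,]*').match(line).groups()

print(greedy)
print(strict)
```

With .* everywhere, the second group absorbs one comma and the embedded comma in the text ends up acting as a separator, which is the Output 1 from the question. With [^,]* the embedded comma can only land in the last group.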