Splitting data using regex in Python
I have a file with many lines, in the format below:
//many lines of normal text
00.0000125 1319280 9.2 The Shawshank Redemption (1994)
//lines of text
0000011111 59 6.8 "$#*! My Dad Says" (2010) {You Can't Handle the Truce (#1.10)}
1...101002 17 6.6 "$1,000,000 Chance of a Lifetime" (1986)
I want to split the data into columns, like this:
1...101002,17,6.6,"$1,000,000 Chance of a Lifetime" (1986)
The program I tried is:
import re

f = open("E:/file.list")
reg = re.compile('[+ ].{10,}[+ ][+0-9].{3,}[+ ]')
for each in f:
    if reg.match(each):
        print(each)
        print(reg.split(each))
It does not give the correct answer. Which regex should I use?
It is easier to match than to split in this case.
^\s*(\S+)\s+(\S+)\s+(\S+)\s+(.*)$
Try this. See the demo:
http://regex101.com/r/oE6jJ1/47
import re
# A plain r'' prefix is used because the ur'' prefix is not valid in Python 3.
p = re.compile(r'^\s*(\S+)\s+(\S+)\s+(\S+)\s+(.*)$', re.MULTILINE)
test_str = u"00.0000125 1319280 9.2 The Shawshank Redemption (1994)\n\n 0000011111 59 6.8 \"$#*! My Dad Says\" (2010) {You Can't Handle the Truce (#1.10)}\n 1...101002 17 6.6 \"$1,000,000 Chance of a Lifetime\" (1986)"
re.findall(p, test_str)
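One caveat: the pattern above also matches most ordinary text lines, since any line with a few whitespace-separated words fits it. The following is a minimal sketch of a stricter variant, assuming the first three columns only ever contain digits and dots. The names ROW and parse_line are made up for illustration and do not come from the answer.

```python
import re

# Assumption: column 1 is digits/dots, column 2 is an integer,
# column 3 is a decimal number, and the rest of the line is the title.
ROW = re.compile(r'^\s*([0-9.]+)\s+(\d+)\s+(\d+\.\d+)\s+(.*)$')


def parse_line(line):
    """Return the four columns as a tuple, or None if the line is not a data row."""
    m = ROW.match(line)
    return m.groups() if m else None


sample = [
    "//many lines of normal text",
    "00.0000125 1319280 9.2 The Shawshank Redemption (1994)",
    '1...101002 17 6.6 "$1,000,000 Chance of a Lifetime" (1986)',
]

# Keep only the lines that look like data rows.
rows = [r for r in map(parse_line, sample) if r is not None]
for row in rows:
    print(",".join(row))
```

When reading from a file, the same function can be applied to each line in the loop, skipping the lines for which it returns None.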
>>> import re
>>> text="""0000011111 59 6.8 "$#*! My Dad Says" (2010) {You Can't Handle the Truce (#1.10)}
... 1...101002 17 6.6 "$1,000,000 Chance of a Lifetime" (1986)"""
>>> re.findall(r'([0-9\.]+)\s*([0-9]+)\s*([0-9\.]+)\s*(".*")',text)
[('0000011111', '59', '6.8', '"$#*! My Dad Says"'), ('1...101002', '17', '6.6', '"$1,000,000 Chance of a Lifetime"')]
I changed the regex pattern.
import re

f = open("file.txt")
reg = re.compile(r" (.{10}) *(\d*) *(\d*\.\d*) (.*)")
for each in f:
    if reg.match(each):
        print(each)
        print(reg.split(each))
What about something like:
>>> import re
>>> line='1...101002 17 6.6 "$1,000,000 Chance of a Lifetime" (1986)'
>>> re.findall(r'^([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+(.*)', line)
[('1...101002', '17', '6.6', '"$1,000,000 Chance of a Lifetime" (1986)')]
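Because the title itself can contain commas (as in "$1,000,000"), joining the columns with a plain comma gives output that cannot be split back reliably. A possible sketch, assuming the goal is a CSV file, is to let the standard csv module do the quoting. The io.StringIO buffer here just stands in for an output file.

```python
import csv
import io
import re

pattern = re.compile(r'^([^ ]+)\s+([^ ]+)\s+([^ ]+)\s+(.*)')
line = '1...101002 17 6.6 "$1,000,000 Chance of a Lifetime" (1986)'

buf = io.StringIO()          # stand-in for an open output file
writer = csv.writer(buf)     # quotes fields containing commas or quotes

m = pattern.match(line)
if m:
    writer.writerow(m.groups())

print(buf.getvalue())

# Reading the row back recovers the original four columns.
parsed = list(csv.reader(io.StringIO(buf.getvalue())))
```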
First split the line with split(), then slice the resulting list (using itertools.islice()) from the start of the list up to the element that is a number in parentheses (if re.match(r'\(\d+\)', j)):
>>> import re
>>> from itertools import islice
>>> s="""0000011111 59 6.8 "$#*! My Dad Says" (2010) {You Can't Handle the Truce (#1.10)}"""
>>> s.split()
['0000011111', '59', '6.8', '"$#*!', 'My', 'Dad', 'Says"', '(2010)', '{You', "Can't", 'Handle', 'the', 'Truce', '(#1.10)}']
>>> l=s.split()
>>> [list(islice(l,0,i+1)) for i,j in enumerate(l) if re.match(r'\(\d+\)',j)]
[['0000011111', '59', '6.8', '"$#*!', 'My', 'Dad', 'Says"', '(2010)']]
If you have your lines in a list (read the file with readlines()):
>>> lines = ["""00.0000125 1319280 9.2 The Shawshank Redemption (1994)""","""0000011111 59 6.8 "$#*! My Dad Says" (2010) {You Can't Handle the Truce (#1.10)}""", """1...101002 17 6.6 "$1,000,000 Chance of a Lifetime" (1986)"""]
>>> [list(islice(line.split(),0,i+1)) for line in lines for i,j in enumerate(line.split()) if re.match(r'\(\d+\)',j)]
[['00.0000125', '1319280', '9.2', 'The', 'Shawshank', 'Redemption', '(1994)'], ['0000011111', '59', '6.8', '"$#*!', 'My', 'Dad', 'Says"', '(2010)'], ['1...101002', '17', '6.6', '"$1,000,000', 'Chance', 'of', 'a', 'Lifetime"', '(1986)']]
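The lists above still have the title broken into separate words. As a rough sketch of how the same split-and-slice idea could yield four columns, the helper below rejoins everything after the third token. The name columns and the choice to return None for non-data lines are assumptions made for illustration.

```python
import re
from itertools import islice

YEAR = re.compile(r'\(\d+\)')  # a token that starts like "(1994)"


def columns(line):
    """Split a data line into three numeric columns plus the title up to the year."""
    tokens = line.split()
    for i, tok in enumerate(tokens):
        if YEAR.match(tok):
            kept = list(islice(tokens, 0, i + 1))
            # The first three tokens are the numeric columns; the rest is the title.
            return kept[:3] + [" ".join(kept[3:])]
    return None  # no "(digits)" token found, so not treated as a data line


s = """0000011111 59 6.8 "$#*! My Dad Says" (2010) {You Can't Handle the Truce (#1.10)}"""
print(columns(s))
```

Note that rejoining with a single space collapses any runs of whitespace inside the title, which is harmless for the sample lines shown here.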