How do you extract the content between a comma and a parenthesis (if present) in a CSV row, in Python?
The content of the CSV is as follows:
"Washington-Arlington-Al, DC-VA-MD-WV (MSAD)" 47894 1976
"Grand-Forks, ND-MN" 24220 2006
"Abilene, TX" 10180 1977
The required output: read through the CSV, find the content between the quotes in column 1, fetch only DC-VA-MD-WV, ND-MN, and TX, and put this content in a new column (for normalization).
So far I have tried a lot of regex patterns in Python, but could not find the right one.
import csv

sample=""" "Washington-Arlington-Al, DC-VA-MD-WV (MSAD)",47894,1976
"Grand-Forks, ND-MN",24220,2006
"Abilene, TX",10180,1977 """
open('sample.csv','w').write(sample)
with open('sample.csv') as sample, open('output.csv','w') as output:
    reader = csv.reader(sample)
    writer = csv.writer(output)
    for row in reader:
        for comsplit in row[0].split(','):
            writer.writerow([ comsplit, row[1]])
print(open('output.csv').read())
The expected output is:
DC-VA-MD-WV
ND-MN
TX
each on a new row.
I'd do it like this:
import csv

with open('csv_file.csv', 'r', newline='') as f_in, open('output.csv', 'w', newline='') as f_out:
    csv_reader = csv.reader(f_in, quotechar='"', delimiter=',',
                            quoting=csv.QUOTE_ALL, skipinitialspace=True)
    csv_writer = csv.writer(f_out)
    new_csv_list = []
    for row in csv_reader:
        first_entry = row[0].strip('"')
        # Take the text after the first comma, then its first whitespace-separated token.
        # split() with no argument skips the space that follows the comma.
        relevant_info = first_entry.split(',')[1].split()[0]
        row += [relevant_info]
        new_csv_list += [row]
    for row in new_csv_list:
        csv_writer.writerow(row)
Let me know if you have any questions.
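There is no need to use a regex here, provided a couple of things hold, as the code below assumes: the letter sequence you want follows a comma and a single space, and there is also a space after that letter sequence before anything like (MSAD). This code gives your expected output against the sample input: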
import csv

# Python 3: open both files in text mode with newline='' for the csv module.
with open('sample.csv', 'r', newline='') as infile, open('expected_output.csv', 'w', newline='') as outfile:
    reader = csv.reader(infile)
    expected_output = []
    for row in reader:
        split_by_comma = row[0].split(',')[1]          # e.g. ' DC-VA-MD-WV (MSAD)'
        split_by_space = split_by_comma.split(' ')[1]  # e.g. 'DC-VA-MD-WV'
        print(split_by_space)
        expected_output.append([split_by_space])
    writer = csv.writer(outfile)
    writer.writerows(expected_output)
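If the spacing after the comma is not guaranteed to be exactly one space, a slightly more tolerant version of the same split idea might look like the sketch below. The helper name extract_region is hypothetical, and the in-memory io.StringIO input is only a stand-in for sample.csv so the example runs on its own.

```python
import csv
import io

def extract_region(label):
    """Return the first whitespace-delimited token after the first comma, or ''."""
    _, sep, tail = label.partition(',')
    tokens = tail.split()  # split() with no argument ignores repeated spaces
    return tokens[0] if sep and tokens else ''

# In-memory stand-in for sample.csv (same rows as the question's sample).
sample = io.StringIO(
    '"Washington-Arlington-Al, DC-VA-MD-WV (MSAD)",47894,1976\n'
    '"Grand-Forks, ND-MN",24220,2006\n'
    '"Abilene, TX",10180,1977\n'
)

# Append the extracted token to each row as a new last column.
rows = [row + [extract_region(row[0])] for row in csv.reader(sample)]
for row in rows:
    print(row)
```

A label with no comma at all yields an empty string here rather than raising IndexError, which may or may not be what you want for normalization.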
I believe you could use this regex pattern, which will extract any alphanumeric expression (hyphenated or not) between a comma and a parenthesis:
import re
BETWEEN_COMMA_PAR = re.compile(r',\s+([\w-]+)\s+\(')
test_str = 'Washington-Arlington-Al, DC-VA-MD-WV (MSAD)'
result = BETWEEN_COMMA_PAR.search(test_str)
if result is not None:
    print(result.group(1))
This prints DC-VA-MD-WV, as expected.
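Note that this pattern requires an opening parenthesis, so it finds nothing in rows such as "Grand-Forks, ND-MN". Since the question says the parenthesis is only sometimes present, one possible variant (an assumption on my part, not tested beyond the three sample rows) lets the captured text be followed either by a parenthesis or by the end of the string. The name AFTER_COMMA is made up for this sketch.

```python
import re

# Hypothetical variant: the capture may be followed by "(" or by the end of the string.
AFTER_COMMA = re.compile(r',\s*([\w-]+)\s*(?:\(|$)')

labels = [
    'Washington-Arlington-Al, DC-VA-MD-WV (MSAD)',
    'Grand-Forks, ND-MN',
    'Abilene, TX',
]

codes = []
for label in labels:
    match = AFTER_COMMA.search(label)
    # Keep None for labels where nothing matched, so failures stay visible.
    codes.append(match.group(1) if match else None)

print(codes)
```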
It seems that you are having trouble finding the right regex to use for finding the expected values.
I have created a small pythex sample that satisfies your requirement.
Basically, when you check the content of every value in the first column, you could use a regex like /(TX|ND-MN|DC-VA-MD-WV)/
I hope this was useful! Let me know if you need further explanation.
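In Python, that alternation could be applied roughly as in the sketch below. It assumes you already know every code that can occur; KNOWN_CODES and find_code are names invented for this example. The codes are sorted longest first so that a longer code is tried before a shorter one that is a prefix of it.

```python
import re

# Assumed list of known codes; in practice this would come from your own reference data.
KNOWN_CODES = ['TX', 'ND-MN', 'DC-VA-MD-WV']

# Longest first, so a longer code wins over a shorter one that is a prefix of it.
alternatives = sorted(KNOWN_CODES, key=len, reverse=True)
CODE_PATTERN = re.compile('(' + '|'.join(re.escape(c) for c in alternatives) + ')')

def find_code(label):
    """Return the first known code found in label, or None."""
    match = CODE_PATTERN.search(label)
    return match.group(1) if match else None

print(find_code('Washington-Arlington-Al, DC-VA-MD-WV (MSAD)'))
print(find_code('Grand-Forks, ND-MN'))
print(find_code('Abilene, TX'))
```

One caveat with this approach: the search takes the leftmost match, so a code that also happens to appear inside a city name would be picked up there first.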