
Regex issue with grep

I am trying to write a regex that will find a bunch of phone numbers in a CSV (comma-separated values) file.

The catch is that I am interested only in phone numbers in a particular column (that is, only after a particular number of commas). Below is a regex that does that, and it works fine under the JavaScript regex flavor.

(?:^([^^]*\,){3}[^^]*)\d{3}-\d{3}-\d{4}

I am actually working in Bash, using sed and grep, but I cannot even find out which regex standard grep and sed use.

Here is some sample text.

(Note that right now I am using '^' instead of ',' to keep the values separated, because users included commas in the values.)

THIS IS NOT THE ACTUAL DATA; IT IS SCRAMBLED TO PRESERVE PEOPLE'S PRIVACY.

28434658^17 Three^2013-09-19T19:57:23Z^80 W 54th St, Penthouse & 4th Fl, NY, 10018s212-409-1641^^Mary Szyb 347-340-1918^2 x week Thur 2.5hrs  & Sat 4 hrs
28937693^356 West 36th street^2013-09-19T18:17:57Z^356 West 36th street, suite 706sNew York New York 10018^null^null^on call: 
29219313^333 rector pl^2013-10-07T17:11:36Z^333 Rector Place 248-469-5859^^Jose Hernandez^2 x week Wed & Fri
28854346^50 Can^2013-09-23T13:10:54Z^152 East 28th Street, 7th Floor, NY, 10018s917-932-3962s646-710-4170^155 W 24rd St 3rd FL^null^Swlvia Smith347-933-6630sIrena Brown 347-991-1346s5 x week Mon-Fri
28434698^4Eleven^2013-09-19T19:57:23Z^112 West 28th Street, 3th Fl,sNY, 10018s917-922-3862s646-710-4170^^null^null

Let me also clarify one thing. The correct output would be:

212-409-1641
248-469-5859
917-932-3962
646-710-4170
917-922-3862
646-710-4170

because these are the only phone numbers in column 4.

The following should work for you.

grep -Po '(\d{3}-){2}\d{4}' file.csv
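For reference, `-P` selects Perl-compatible regular expressions, which is what makes `\d` available. The sketch below is an assumed equivalent in POSIX extended syntax (`-E`), with `\d` spelled as `[0-9]`. The function name `extract_phones` is hypothetical, and `-o` is a GNU/BSD grep extension rather than part of POSIX. Like the command above, it prints matches from every column, not only column 4.

```shell
# Hypothetical helper: print every NNN-NNN-NNNN match on stdin, one per line.
# Uses POSIX ERE (-E) instead of PCRE (-P), so \d is written as [0-9].
extract_phones() {
  grep -oE '([0-9]{3}-){2}[0-9]{4}'
}

# One row taken from the question's scrambled '^'-delimited sample.
sample='29219313^333 rector pl^2013-10-07T17:11:36Z^333 Rector Place 248-469-5859^^Jose Hernandez^2 x week Wed & Fri'

phones=$(printf '%s\n' "$sample" | extract_phones)
printf '%s\n' "$phones"
```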

UPDATE:

After replacing ^ with commas, as they are in your actual data:

28434658,17 Three,2013-09-19T19:57:23Z,80 W 54th St, Penthouse & 4th Fl, NY, 10018s212-409-1641,Mary Szyb 347-340-1918,2 x week Thur 2.5hrs  & Sat 4 hrs
28937693,356 West 36th street,2013-09-19T18:17:57Z,356 West 36th street, suite 706sNew York New York 10018,null,null,on call: 
29219313,333 rector pl,2013-10-07T17:11:36Z,333 Rector Place 248-469-5859,Jose Hernandez,2 x week Wed & Fri
28854346,50 Can,2013-09-23T13:10:54Z,152 East 28th Street, 7th Floor, NY, 10018s917-932-3962s646-710-4170,155 W24rd St 3rd FL,null,Swlvia Smith347-933-6630sIrena Brown 347-991-1346s5 x week Mon-Fri
28434698,4Eleven,2013-09-19T19:57:23Z,112 West 28th Street, 3th Fl,sNY, 10018s917-922-3862s646-710-4170,null,null

You could try the following.

perl -nle '@F = split(/,(?!s| )/, $_); print $1 while ($F[3] =~ /((\d{3}-){2}\d{4})/g)' file.csv

Output:

212-409-1641
248-469-5859
917-932-3962
646-710-4170
917-922-3862
646-710-4170

grep can use Perl-compatible or POSIX extended regular expressions with -P or -E, respectively. See man grep for details. For something like this, I normally use cut to separate the fields first, assuming that none of the fields will ever contain the column delimiter.

echo "a,b,c,123-555-1212,d,e,f" | cut -f 4 -d','

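Or, from a file:

while IFS= read -r line; do
   # Quote "$line" so whitespace and glob characters in the row survive intact
   c4=$(printf '%s\n' "$line" | cut -d',' -f4)
   printf '%s\n' "$c4"
done < /tmp/file.csv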

If any of the columns can contain commas, then you're probably better off switching to a CSV library in Ruby, Python, etc.

UPDATE: Using -d'^' to separate the columns, you can pretty easily select the column you're interested in, as above. The tricky part with sed is extracting the phone numbers:

f="80 W 54th St, Penthouse & 4th Fl, NY, 10018s212-409-1641"
echo "$f" | sed -r 's/.*([0-9]{3}-[0-9]{3}-[0-9]{4})$/\1/'
212-409-1641

Note that you have to use sed's extended regex command-line argument (-r), and that sed does not seem to accept regex shorthand such as \d{3}. The documentation for sed is found in its info page, but it's usually easier to search the web. This is a pretty good tutorial: http://www.thegeekstuff.com/2009/10/unix-sed-tutorial-advanced-sed-substitution-examples/
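The sed substitution above only returns a number sitting at the very end of the field, so a field holding two numbers would yield just the last one. A minimal sketch of one way around that, assuming GNU or BSD grep with `-o`, is to let cut pick the column and grep print every match. The function name `column4_phones` is hypothetical; the rows come from the question's scrambled sample.

```shell
# Hypothetical helper: read '^'-delimited rows on stdin and print every
# phone number found in the 4th field, one per line.
column4_phones() {
  cut -d'^' -f4 | grep -oE '[0-9]{3}-[0-9]{3}-[0-9]{4}'
}

# Two rows from the question's sample; the second has two numbers in column 4
# and more numbers in column 7 that must not be printed.
rows='28434658^17 Three^2013-09-19T19:57:23Z^80 W 54th St, Penthouse & 4th Fl, NY, 10018s212-409-1641^^Mary Szyb 347-340-1918^2 x week Thur 2.5hrs  & Sat 4 hrs
28854346^50 Can^2013-09-23T13:10:54Z^152 East 28th Street, 7th Floor, NY, 10018s917-932-3962s646-710-4170^155 W 24rd St 3rd FL^null^Swlvia Smith347-933-6630sIrena Brown 347-991-1346s5 x week Mon-Fri'

result=$(printf '%s\n' "$rows" | column4_phones)
printf '%s\n' "$result"
```

This only works while no field contains the '^' delimiter itself, which is the same assumption the cut approach already makes.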

An answer using awk:

awk -F'^' '{ 
  start = 0;
  str = substr($4, start);
  while (match(str, /([0-9]{3})-[0-9]{3}-[0-9]{4}/)) {
    print substr(str, RSTART, RLENGTH);
    start = RSTART + RLENGTH;
    str = substr(str, start);
  }
}' datafile

This takes the 4th column, repeatedly matches the phone pattern, and prints each match on its own line.

I am posting the regex that ended up doing the job:

([0-9]{3}-[0-9]{3}-[0-9]{4})(?=[^^]*(\^[^^]*){3}$)

Thank you, everyone, for the helpful input.

I guess my lesson from this problem is that if one solution does not work, try approaching it from a different angle; in this case, by counting the columns from the back.
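As a rough illustration of counting from the back, the regex above can be run with grep, assuming a grep built with PCRE support (`-P`), since the lookahead `(?=...)` is not available in POSIX syntax. The function name `phones_three_from_end` is hypothetical. The lookahead only accepts a number that is followed by exactly three more '^' separators before the end of the line, which in the seven-column sample means column 4.

```shell
# Hypothetical helper: print phone numbers followed by exactly three more
# '^' separators before end of line (column 4 of the 7-column sample).
# Requires grep with PCRE support (-P), e.g. most GNU grep builds.
phones_three_from_end() {
  grep -oP '[0-9]{3}-[0-9]{3}-[0-9]{4}(?=[^^]*(\^[^^]*){3}$)'
}

# Two rows from the question's sample; the number in column 6 of the first
# row has only one '^' after it, so the lookahead rejects it.
rows='28434658^17 Three^2013-09-19T19:57:23Z^80 W 54th St, Penthouse & 4th Fl, NY, 10018s212-409-1641^^Mary Szyb 347-340-1918^2 x week Thur 2.5hrs  & Sat 4 hrs
28434698^4Eleven^2013-09-19T19:57:23Z^112 West 28th Street, 3th Fl,sNY, 10018s917-922-3862s646-710-4170^^null^null'

from_back=$(printf '%s\n' "$rows" | phones_three_from_end)
printf '%s\n' "$from_back"
```

The approach depends on every row having the same number of columns; a row with a missing or extra field would shift which column the lookahead selects.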
