
Regex: parsing GitHub usernames (JavaScript)

I'm trying to parse GitHub usernames (that start with @) from a paragraph of text in order to link them to their associated profiles.

The GitHub username constraints are:

  • Alphanumeric with single hyphens (no consecutive hyphens)
  • Cannot begin or end with a hyphen (if it ends with a hyphen, just match everything up to that point)
  • Max length of 39 characters.

For example, the following text:

Example @valid hello @valid-username: @another-valid-username, @-invalid @in--valid @ignore-last-dash- an@email.com @another-valid?

The script...

Should match:

  • @valid
  • @valid-username
  • @another-valid-username
  • @in
  • @ignore-last-dash
  • @another-valid

Should ignore:

  • @-invalid
  • an@email.com

I'm getting reasonably close with JavaScript by using:

/\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig

But this isn't matching usernames with a single character (such as @a).

Here are my tests so far: https://regex101.com/r/rZ5eW1/2

Is the current regex efficient? And how can I match a single non-hyphen character?
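The single-character failure described in the question can be reproduced in a few lines. This is only a sketch: the pattern is copied from the question unchanged, and findMentions is a hypothetical helper name introduced for illustration. The pattern requires one alphanumeric, then zero to 38 characters, then another alphanumeric, so it needs at least two characters after the @.

```javascript
// The pattern from the question, copied unchanged.
const questionPattern = /\B@((?!.*(-){2,}.*)[a-z0-9][a-z0-9-]{0,38}[a-z0-9])/ig;

// Hypothetical helper: every full match in `text`, or an empty array if none.
function findMentions(text) {
  return text.match(questionPattern) || [];
}

// Two alphanumerics satisfy both the leading and the trailing [a-z0-9].
const twoChars = findMentions("@ab");

// A single alphanumeric cannot satisfy both character classes, so nothing matches.
const oneChar = findMentions("@a");

console.log(twoChars, oneChar);
```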

/\B@([a-z0-9](?:-?[a-z0-9]){0,38})/gi

Note: When this regex runs into a character or set of characters that can't be in a username (e.g. . or --), it matches from @ up to that stopping point. OP says that's fine, so I'm rolling with it. So, in each of the following, the matched area (NOT the captured area) is @abc:

@abc.123
@abc--123
@abc-
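The difference between the matched area and the captured area can be checked directly. In this sketch the pattern is the one from this answer; the variable names are made up for illustration. Index 0 of each match holds the full match (including the @), index 1 holds the capture group (the bare username).

```javascript
// Pattern from this answer, copied unchanged.
const pattern = /\B@([a-z0-9](?:-?[a-z0-9]){0,38})/gi;

const inputs = ["@abc.123", "@abc--123", "@abc-"];

// For each input, take the first match and separate match from capture.
const results = inputs.map((input) => {
  const [first] = input.matchAll(pattern);
  return { input, matched: first[0], captured: first[1] };
});

for (const r of results) {
  console.log(`${r.input}: matched "${r.matched}", captured "${r.captured}"`);
}
```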

This works by using lots of nested groups. Regex101 has a fantastic breakdown, but here's mine anyway:

  1. \B : This is a built-in that means 'not a word boundary', which seems to do the trick, though it may be problematic if something like someones.@email.com is a valid email address. At that point, though, it's indistinguishable from the text of someone who doesn't put spaces after punctuation [1] when they start a sentence with an @reference.

    Thanks to Honore Doktorr for pointing out that negative lookbehinds don't exist in JS (true when this was written; lookbehind assertions were only added in ES2018).

  2. @ : Just the literal @ symbol. One of the few places where a character means what it is.

  3. (...) : The capturing group. The way it's placed means that it won't capture the @ symbol, it'll just match it, so it's easier to get the username, with no need to take a substring.
  4. [a-z0-9] : A character class to match any letter or number. Because of the i flag, this also matches capital letters. It has no quantifier, so exactly one such character must appear right after the @.
  5. (?:...) : This is a noncapturing group. It wraps a block of regex in a group without capturing it as a result.
  6. -?[a-z0-9] : The second part is a character class, like before. The first part says that an optional hyphen may come before it. This is what makes consecutive hyphens invalid: if there is a -, it has to be followed by something that matches [a-z0-9].
  7. {0,38} : Match the noncapturing group between 0 and 38 times, inclusive. Combined with #4, this gives us 39 letters and digits maximum (the optional hyphens are not counted toward that limit). Anything beyond that will be ignored.
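Putting the pieces above together, here is a minimal sketch that runs this answer's pattern over the sample text from the question and then links each mention to a profile. The names mentionPattern and linkMentions, and the bare anchor markup, are assumptions made for illustration; the https://github.com/NAME profile URL shape is the only external detail relied on.

```javascript
// Pattern from this answer, copied unchanged.
const mentionPattern = /\B@([a-z0-9](?:-?[a-z0-9]){0,38})/gi;

// Sample text from the question.
const sample =
  "Example @valid hello @valid-username: @another-valid-username, " +
  "@-invalid @in--valid @ignore-last-dash- an@email.com @another-valid?";

// Collect only the capture group (the username without the @).
const usernames = Array.from(sample.matchAll(mentionPattern), (m) => m[1]);
console.log(usernames);

// Hypothetical helper: wrap each mention in a link to the matching profile.
// Only the mention itself is rewritten; the surrounding text is left untouched
// (and is not HTML-escaped here).
function linkMentions(text) {
  return text.replace(
    mentionPattern,
    (match, name) => `<a href="https://github.com/${name}">${match}</a>`
  );
}
console.log(linkMentions("hi @in--valid"));

// A 45-letter name is cut off after 39 letters (plus the @ in the full match).
const capped = `@${"a".repeat(45)}`.match(mentionPattern)[0];
console.log(capped.length);
```

Because the callback receives both the full match and the capture group, the link text can keep the @ while the URL uses only the username.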

This expression will also match your single-character usernames.

/\B@(?!.*(-){2,}.*)[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?\b/ig

Explanation:

  1. (?!.*(-){2,}.*) : your negative lookahead asserts that the text after the @ (up to the end of the line) doesn't contain two or more adjacent dashes.
  2. [a-z0-9] : there must be one alphanumeric character immediately after @ .
  3. (?:[a-z0-9-]{0,37}[a-z0-9])? : there may be anywhere from 0 to 37 alphanumeric characters or dashes, followed by one alphanumeric character, immediately after #2's pattern; or there may be none, to cover single-character usernames. (?:…) is for non-capturing grouping.
  4. \b : the whole pattern must end at a word boundary (which includes the position just before a - ).
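A short sketch of how this pattern behaves on a few inputs. The pattern is copied from this answer; the helper name find and the test strings are made up for illustration. Note the third case: because the lookahead scans the rest of the line, a double dash anywhere later on the same line suppresses the match.

```javascript
// Pattern from this answer, copied unchanged.
const pattern = /\B@(?!.*(-){2,}.*)[a-z0-9](?:[a-z0-9-]{0,37}[a-z0-9])?\b/ig;

// Hypothetical helper: every full match in `text`, or an empty array if none.
const find = (text) => text.match(pattern) || [];

// Single-character usernames are covered by the optional group.
const single = find("@a");

// A trailing dash is dropped, since \b needs a word character before it.
const trailingDash = find("hi @ignore-last-dash- there");

// The lookahead sees "--" later on the same line and rejects the mention.
const dashesLater = find("@valid then a--b");

// "." does not cross a newline, so a "--" on the next line is not seen.
const dashesNextLine = find("@valid\na--b");

console.log(single, trailingDash, dashesLater, dashesNextLine);
```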

I am using this simple RegEx I created to grab GitHub usernames from a Google Form, and it works pretty decently (with one very rare caveat):

^@\w(-\w|\w\w|\w){0,19}$

Where:

  • ^ : start of the line
  • @ and - : the at sign and the dash themselves.
  • \w : [A-Za-z0-9_], that is, numbers, letters (both cases) and underscores
  • $ : end of the line
  • {0,19} : repeat the parenthesized group before it from zero to nineteen times

To summarize:

  • The matched text must be an entire line (from ^ to $ )
  • It will start with an @ followed by a letter (either case), number or underscore ( @A , @1 or @_ )
  • Then it will follow one of the three options in the repetition pattern (...){0,19} :

    • a dash and a \w (1st option)
    • two \w (2nd option)
    • a single \w (3rd option)

    This will repeat and give the following patterns:

  • Zero times: a single-letter username

  • One time: it can be a two-letter username, or three letters, or three characters with a dash in the middle ( @w-w )
  • More times: the dash can never be at the beginning or the end, and can never be doubled, but it may appear anywhere else.
  • 19 times: if using only the 1st and 2nd options, it gives a maximum of 19*2=38 characters, which plus the one at the beginning makes 39 characters in total. Every use of the third option makes the maximum total smaller.
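The summary above can be checked with a small sketch. The pattern is the one from this answer; isValidLine and the sample inputs are hypothetical names and values chosen for illustration. The pattern has no g flag, so test() keeps no state between calls.

```javascript
// Pattern from this answer, copied unchanged (whole-line match, no flags).
const linePattern = /^@\w(-\w|\w\w|\w){0,19}$/;

// Hypothetical helper: true when the whole line is an accepted @username.
const isValidLine = (line) => linePattern.test(line);

const checks = {
  single: isValidLine("@a"),               // zero repetitions
  dashMiddle: isValidLine("@a-b"),         // one repetition, 1st option
  underscore: isValidLine("@_"),           // \w also accepts underscores
  leadingDash: isValidLine("@-a"),         // rejected: must start with \w
  trailingDash: isValidLine("@a-"),        // rejected: a dash needs a \w after it
  doubleDash: isValidLine("@a--b"),        // rejected: no option starts with "--"
  max39: isValidLine("@" + "a".repeat(39)),     // 1 + 19*2 characters
  tooLong40: isValidLine("@" + "a".repeat(40)), // one character too many
  notWholeLine: isValidLine("hello @a"),   // rejected: ^ anchors to line start
};

console.log(checks);
```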

Caveat:

  • It does not recognize usernames of the form @ww-w...w (a dash as the third character) that are 39 characters long.
  • It does, however, recognize the form @ww-w...w if the length is less than 39 characters.

The problem is that to achieve ww-w the pattern is broken down as the first w standing alone, followed by a single w as the third option in the repeated expression (which leaves only 18 repetitions to go), followed by another repetition as -w (the first option, leaving only 17 to go), and then, with those 17 left, we can only get 17*2=34 characters. That means the maximum would be 38 ( 34+2+1+1 ) characters, not 39.
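The arithmetic above can be confirmed with a sketch like the following. The pattern is the one from this answer; the sample usernames are made up, built only to have the right length and dash position.

```javascript
// Pattern from this answer, copied unchanged.
const linePattern = /^@\w(-\w|\w\w|\w){0,19}$/;

// Dash as the third character, 38 characters after the @:
// 1 (first \w) + 1 (single \w) + 2 (dash + \w) + 17*2 = 38, using all 19 repetitions.
const dashThird38 = "@ab-" + "c".repeat(35);

// Same shape with 39 characters: it would need a 20th repetition.
const dashThird39 = "@ab-" + "c".repeat(36);

// Dash as the second character, 39 characters: 1 + 2 + 18*2 = 39, exactly 19 repetitions.
const dashSecond39 = "@a-" + "c".repeat(37);

const caveat = {
  dashThird38: linePattern.test(dashThird38),
  dashThird39: linePattern.test(dashThird39),
  dashSecond39: linePattern.test(dashSecond39),
};

console.log(caveat);
```

Only the 39-character name with the dash in third position is rejected, which matches the count of 38 worked out above.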

But that is really OK for my purposes, so if you need simplicity, here is a RegEx that can give you pretty good answers. I hope it helps you understand it when translating to JavaScript.
