
POSIX.2 Regular Expressions Explained (the only way to be certain that a regular expression works is to test it)


The syntax of regular expressions on Linux/Bash (POSIX.2 regular expressions) is documented in man 7 regex.

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions says:
An expression is a string of characters. Those characters having an interpretation above and beyond their literal meaning are called metacharacters.

A regular expression is a special kind of string. Some of its characters carry a meaning beyond their literal one; such characters are called metacharacters.

 

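How fully a given command supports regular expressions usually depends on its implementation, hence the saying: "the only way to be certain that a particular RE works is to test it."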

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions says:
The only way to be certain that a particular RE works is to test it.

 

The rest of this post walks through POSIX.2 regular expressions as described in man 7 regex.

 

POSIX.2 regular expressions (REs for short) come in two forms:

modern REs, also called extended REs, roughly those supported by the egrep command;

obsolete REs, also called basic REs, roughly those supported by the ed command.

man 7 regex says:
Regular expressions (‘‘RE’’s), as defined in POSIX.2, come in two forms: modern REs (roughly those of egrep;
1003.2 calls these ‘‘extended’’ REs) and obsolete REs (roughly those of ed(1); 1003.2 ‘‘basic’’ REs). Obsolete
REs mostly exist for backward compatibility in some old programs; they will be discussed at the end. 1003.2
leaves some aspects of RE syntax and semantics open; ‘(!)’ marks decisions on these aspects that may not be
fully portable to other 1003.2 implementations.
  

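What follows describes modern REs. A modern RE consists of one or more branches separated by a vertical bar (|). A string matches the RE if it matches any one of the branches.

For example, abc|def matches either abc or def.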

man 7 regex says:
A (modern) RE is one(!) or more non-empty(!) branches, separated by ‘|’. It matches anything that matches one
of the branches.
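
The alternation rule can be checked from a shell with a minimal sketch like the one below. It assumes a grep that supports -E (extended REs), -x (whole-line match) and -q (quiet); the helper name ere_full_match is made up for this illustration and is not part of any standard tool.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the whole of $2 matches the ERE $1.
# -E selects extended ("modern") REs, -x requires the match to cover the
# whole line, -q suppresses output so only the exit status is used.
ere_full_match() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# abc and def each match one branch; abd matches neither.
for word in abc def abd; do
    if ere_full_match 'abc|def' "$word"; then
        echo "$word: match"
    else
        echo "$word: no match"
    fi
done
```

Because of -x, a string such as abcdef is rejected: it contains a match for each branch, but neither branch covers the whole string.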

 

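A branch is one or more pieces concatenated together; a piece is an atom, optionally followed by a modifier. The pieces are matched one after another, in order.

A modifier specifies how many times the preceding atom may occur:

*  the preceding atom occurs 0 or more times;

+  the preceding atom occurs 1 or more times;

?  the preceding atom occurs 0 or 1 times.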

man 7 regex says:
A branch is one(!) or more pieces, concatenated. It matches a match for the first, followed by a match for the
second, etc.
A piece is an atom possibly followed by a single(!) ‘*’, ‘+’, ‘?’, or bound. An atom followed by ‘*’ matches a
sequence of 0 or more matches of the atom. An atom followed by ‘+’ matches a sequence of 1 or more matches of
the atom. An atom followed by ‘?’ matches a sequence of 0 or 1 matches of the atom.
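
As a rough illustration of the three modifiers, the sketch below tries each one against strings containing zero, one and two occurrences of b. It assumes grep with -E, -x and -q; ere_full_match is a hypothetical helper name introduced only for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the whole of $2 matches the ERE $1.
ere_full_match() {
    printf '%s\n' "$2" | grep -Eqx -- "$1"
}

# ac has zero b's, abc has one, abbc has two.
for re in 'ab*c' 'ab+c' 'ab?c'; do
    for s in ac abc abbc; do
        if ere_full_match "$re" "$s"; then
            echo "$re matches $s"
        else
            echo "$re does not match $s"
        fi
    done
done
```

The only rejections are ab+c against ac (at least one b is required) and ab?c against abbc (at most one b is allowed).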
 

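A modifier can also be a bound ({}), which gives the number of occurrences of the preceding atom explicitly:

{n}  the preceding atom occurs exactly n times; n must lie between 0 and RE_DUP_MAX inclusive, where RE_DUP_MAX is 255 according to the man page;

{n,}  the preceding atom occurs n or more times;

{n,m}  the preceding atom occurs between n and m times inclusive; n <= m is required.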

man 7 regex says:
A bound is ‘{’ followed by an unsigned decimal integer, possibly followed by ‘,’ possibly followed by another
unsigned decimal integer, always followed by ‘}’. The integers must lie between 0 and RE_DUP_MAX (255(!))
inclusive, and if there are two of them, the first may not exceed the second. An atom followed by a bound
containing one integer i and no comma matches a sequence of exactly i matches of the atom. An atom followed by a
bound containing one integer i and a comma matches a sequence of i or more matches of the atom. An atom
followed by a bound containing two integers i and j matches a sequence of i through j (inclusive) matches of the
atom.
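
The three forms of bound described above can be compared with a small sketch. It assumes grep with -E and -x; the function name list_matches and the sample strings are invented for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: of the strings a, aa, aaa, aaaa and aaaaa,
# print those that match the ERE $1 in full (-x).
list_matches() {
    printf '%s\n' a aa aaa aaaa aaaaa | grep -Ex -- "$1"
}

list_matches 'a{3}'     # exactly three: aaa
list_matches 'a{3,}'    # three or more: aaa, aaaa, aaaaa
list_matches 'a{2,4}'   # two through four: aa, aaa, aaaa
```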
 

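Next, what counts as an atom. An atom is one of the following:

(RE)               a parenthesized regular expression (a subexpression); matches whatever the RE matches

()                 matches the empty string

[CHAR-SET]         a bracket expression; matches any single character in the given set

.                  matches any single character

^                  matches the empty string at the beginning of a line

$                  matches the empty string at the end of a line

\ followed by one of ^.[$()|*+?{\    an escape; the metacharacter loses its special meaning and matches itself

\ followed by any other character    matches that character itself, as if the \ were absent

any other single character           matches itself

{ followed by a non-digit            the { is an ordinary character, not the start of a bound

an RE ending in \                    illegal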

man 7 regex says:
An atom is a regular expression enclosed in ‘()’ (matching a match for the regular expression), an empty set of
‘()’ (matching the null string)(!), a bracket expression (see below), ‘.’ (matching any single character), ‘^’
(matching the null string at the beginning of a line), ‘$’ (matching the null string at the end of a line), a
‘\’ followed by one of the characters ‘^.[$()|*+?{\’ (matching that character taken as an ordinary character),
a ‘\’ followed by any other character(!) (matching that character taken as an ordinary character, as if the
‘\’ had not been present(!)), or a single character with no other significance (matching that character). A
‘{’ followed by a character other than a digit is an ordinary character, not the beginning of a bound(!). It
is illegal to end an RE with ‘\’.
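
A few of these atoms can be tried out with the sketch below. It assumes grep with -E; the sample() function and its four input lines are made up for the illustration. The atoms marked (!) in the man page are left out on purpose, since their behaviour may differ between implementations.

```bash
#!/usr/bin/env bash
# Hypothetical sample input: one candidate string per line.
sample() {
    printf '%s\n' 'abc' 'a.c' '1+1=2' '11=2'
}

# '.' is a metacharacter: it matches any single character.
sample | grep -E 'a.c'      # prints abc and a.c
# '\.' is the escaped form: it matches only a literal dot.
sample | grep -E 'a\.c'     # prints a.c
# '\+' matches a literal plus sign instead of acting as a modifier.
sample | grep -E '1\+1'     # prints 1+1=2
# '^' and '$' anchor the match to the start and the end of the line.
sample | grep -E '^1.*2$'   # prints 1+1=2 and 11=2
```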
 

 

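Square brackets [ ] enclose a set of characters (a bracket expression); the set may not be empty.

If the set begins with ^, the expression matches any single character that is not in the rest of the set.

A form such as a-z specifies a range of characters, but a form such as a-c-e, where two ranges share an endpoint, is illegal.

For example, [0-9] matches any digit, and [^0-9] matches any single character that is not a digit.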

man 7 regex says:
A bracket expression is a list of characters enclosed in ‘[]’. It normally matches any single character from
the list (but see below). If the list begins with ‘^’, it matches any single character (but see below) not
from the rest of the list. If two characters in the list are separated by ‘-’, this is shorthand for the full
range of characters between those two (inclusive) in the collating sequence, e.g. ‘[0-9]’ in ASCII matches any
decimal digit. It is illegal(!) for two ranges to share an endpoint, e.g. ‘a-c-e’. Ranges are very collating-
sequence-dependent, and portable programs should avoid relying on them.
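
A short sketch of ranges and negated sets follows. It assumes a grep that supports -o (print only the matched text), which is a common extension rather than something the man page above describes; the helper name pieces is hypothetical. LC_ALL=C pins the collating sequence, because ranges depend on it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every non-overlapping match of the ERE $1
# within the string $2, one match per line.
pieces() {
    printf '%s\n' "$2" | LC_ALL=C grep -Eo -- "$1"
}

pieces '[0-9]+'  'room 42b'   # prints 42
pieces '[^0-9]+' 'room 42b'   # prints "room " (with its trailing space) and b
pieces '[a-c]+'  'abcdef'     # prints abc
pieces '[^a-c]+' 'abcdef'     # prints def
```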
 

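Inside [ ]:

to include ] in the set, make it the first character, as in []]; [^]] matches any character other than ];

to include - in the set, make it the first or last character (or the second endpoint of a range); [-] matches -, and [^-] matches any character other than -.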

man 7 regex says:
To include a literal ‘]’ in the list, make it the first character (following a possible ‘^’). To include a
literal ‘-’, make it the first or last character, or the second endpoint of a range. To use a literal ‘-’ as
the first endpoint of a range, enclose it in ‘[.’ and ‘.]’ to make it a collating element (see below). With
the exception of these and some combinations using ‘[’ (see next paragraphs), all other special characters,
including ‘\’, lose their special significance within a bracket expression.
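
The placement rules for ] and - can be tried with the sketch below. It assumes a grep that supports the -o extension; the helper name pieces is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every non-overlapping match of the ERE $1
# within the string $2, one match per line.
pieces() {
    printf '%s\n' "$2" | LC_ALL=C grep -Eo -- "$1"
}

pieces '[]]'   'a]b-c'   # ] placed first is literal: prints ]
pieces '[^]]+' 'a]b-c'   # anything but ]: prints a and b-c
pieces '[-]'   'a]b-c'   # a lone - is literal: prints -
pieces '[]-]'  'a]b-c'   # ] first and - last: prints ] and -
pieces '[\]'   'a\b'     # \ is not an escape inside brackets: prints \
```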
 

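A multi-character collating element is written [.chars.]. For example, [[.ch.]] matches ch, provided the locale's collating sequence defines ch as a collating element.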

man 7 regex says:
Within a bracket expression, a collating element (a character, a multi-character sequence that collates as if
it were a single character, or a collating-sequence name for either) enclosed in ‘[.’ and ‘.]’ stands for the
sequence of characters of that collating element. The sequence is a single element of the bracket expression’s
list. A bracket expression containing a multi-character collating element can thus match more than one
character, e.g. if the collating sequence includes a ‘ch’ collating element, then the RE ‘[[.ch.]]*c’ matches the
first five characters of ‘chchcc’.
 

An equivalence class is written [=c=]. I have not yet worked out how equivalence classes are used in practice.

man 7 regex says:
Within a bracket expression, a collating element enclosed in ‘[=’ and ‘=]’ is an equivalence class, standing
for the sequences of characters of all collating elements equivalent to that one, including itself. (If there
are no other equivalent collating elements, the treatment is as if the enclosing delimiters were ‘[.’ and
‘.]’.) For example, if o and ^ are the members of an equivalence class, then ‘[[=o=]]’, ‘[[=^=]]’, and ‘[o^]’
are all synonymous. An equivalence class may not(!) be an endpoint of a range.
 

 

 

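Inside [ ], a character class is written [:class:]. For example, [[:digit:]] matches a digit and [[:alpha:]] matches a letter. The commonly used standard classes are:

alnum    letters and digits

alpha    letters

blank    blank characters: space and tab

digit    digits

lower    lowercase letters

space    whitespace: space, tab, vertical tab, newline, carriage return and form feed; note the difference from blank

upper    uppercase letters

xdigit   hexadecimal digits

Membership in these classes is decided the same way as by the C character-classification functions: isalpha(c) tests for a letter, and so on for the other classes.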

man 7 regex says:
       Within a bracket expression, the name of a character class enclosed in ‘[:’ and ‘:]’ stands for the list of all
       characters belonging to that class.  Standard character class names are:

              alnum       digit       punct
              alpha       graph       space
              blank       lower       upper
              cntrl       print       xdigit

       These  stand  for  the character classes defined in wctype(3).  A locale may provide others.  A character class
       may not be used as an endpoint of a range.
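
The character classes can be exercised with a sketch like the following. It assumes grep with -E, -x and -q; ere_full_match is a hypothetical helper, and LC_ALL=C is set so that the classes cover only ASCII characters. Note the double brackets: the class name goes inside a bracket expression.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the whole of $2 matches the ERE $1.
ere_full_match() {
    printf '%s\n' "$2" | LC_ALL=C grep -Eqx -- "$1"
}

ere_full_match '[[:digit:]]+' '2024'                   && echo 'digits: ok'
ere_full_match '[[:xdigit:]]+' 'c0ffee'                && echo 'hex digits: ok'
ere_full_match '[[:alpha:]_][[:alnum:]_]*' 'my_var1'   && echo 'identifier: ok'
ere_full_match '[[:upper:]][[:lower:]]+' 'Linux'       && echo 'capitalized word: ok'
ere_full_match '[[:blank:]]' "$(printf '\t')"          && echo 'tab is a blank: ok'
# The letter o is not a hexadecimal digit, so this one is rejected.
ere_full_match '[[:xdigit:]]+' 'coffee'                || echo 'coffee is not hex: ok'
```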
 

The C character-classification functions are described as follows.

man 3 isalpha says:
       isalnum()
              checks for an alphanumeric character; it is equivalent to (isalpha(c) || isdigit(c)).
       isalpha()
              checks  for  an  alphabetic  character;  in  the standard "C" locale, it is equivalent to (isupper(c) ||
              islower(c)).  In some locales, there may be additional characters for which  isalpha()  is  true—letters
              which are neither upper case nor lower case.
       isascii()
              checks whether c is a 7-bit unsigned char value that fits into the ASCII character set.
       isblank()
              checks for a blank character; that is, a space or a tab.
       iscntrl()
              checks for a control character.
       isdigit()
              checks for a digit (0 through 9).
       isgraph()
              checks for any printable character except space.
       islower()
              checks for a lower-case character.
       isprint()
              checks for any printable character including space.
       ispunct()
              checks for any printable character which is not a space or an alphanumeric character.
       isspace()
              checks  for white-space characters.  In the "C" and "POSIX" locales, these are: space, form-feed ('\f'),
              newline ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').
       isupper()
              checks for an uppercase letter.
       isxdigit()
              checks for a hexadecimal digits, i.e. one of
              0 1 2 3 4 5 6 7 8 9 a b c d e f A B C D E F.
 

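Note that POSIX.2 regular expressions do not support the shorthand character classes found in Java, where \d matches a digit and \w matches a letter, digit or underscore.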

 

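Matching starts at the earliest position in the string where a match is possible and, from there, extends to the longest possible match: the longer the match the better, i.e. matching is greedy.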

man 7 regex says:
In the event that an RE could match more than one substring of a given string, the RE matches the one starting
earliest in the string. If the RE could match more than one substring starting at that point, it matches the
longest. Subexpressions also match the longest possible substrings, subject to the constraint that the whole
match be as long as possible, with subexpressions starting earlier in the RE taking priority over ones starting
later. Note that higher-level subexpressions thus take priority over their lower-level component
subexpressions.
 

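Match length is measured in characters. Even a match of the empty string counts as longer than no match at all. For example:

bb* matches the middle three characters of abbbc;

(wee|week)(knights|nights) matches the whole of weeknights;

(.*).* matches abc, with (.*) matching abc and the trailing .* matching the empty string;

(a*)* matched against bc: both (a*)* and (a*) match only the empty string.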

man 7 regex says:
Match lengths are measured in characters, not collating elements. A null string is considered longer than no
match at all. For example, ‘bb*’ matches the three middle characters of ‘abbbc’, ‘(wee|week)(knights|nights)’
matches all ten characters of ‘weeknights’, when ‘(.*).*’ is matched against ‘abc’ the parenthesized
subexpression matches all three characters, and when ‘(a*)*’ is matched against ‘bc’ both the whole RE and the
parenthesized subexpression match the null string.
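
The earliest-then-longest rule can be observed with a sketch like this one. It assumes a grep that supports the -o extension and follows the POSIX matching rule; first_match is a hypothetical helper. Empty matches are not shown, because grep -o does not print them.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first match of the ERE $1 within the string $2.
first_match() {
    printf '%s\n' "$2" | grep -Eo -- "$1" | head -n 1
}

first_match 'bb*' 'abbbc'                               # prints bbb
first_match '(wee|week)(knights|nights)' 'weeknights'   # prints weeknights
# Both branches can start at the same position; the longer one wins.
first_match 'wee|week' 'weeknights'                     # prints week
# The branch b could match, but a match starting earlier (at a) takes priority.
first_match 'b|ab|abc' 'xabcx'                          # prints abc
```

The third call is where POSIX matching differs from engines that try alternatives strictly left to right: those would report wee.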
 

On case-independent matching: x matches both x and X, as if it were [xX], and [^x] behaves like [^xX].

man 7 regex says:
If case-independent matching is specified, the effect is much as if all case distinctions had vanished from the
alphabet. When an alphabetic that exists in multiple cases appears as an ordinary character outside a bracket
expression, it is effectively transformed into a bracket expression containing both cases, e.g. ‘x’ becomes
‘[xX]’. When it appears inside a bracket expression, all case counterparts of it are added to the bracket
expression, so that (e.g.) ‘[x]’ becomes ‘[xX]’ and ‘[^x]’ becomes ‘[^xX]’.
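
With grep, case-independent matching is requested with -i. The sketch below counts matching lines with and without it; count_matches and its three input lines are invented for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count how many of the lines x, X and y match.
# Arguments are passed straight to grep -E -c (options first, then the RE).
count_matches() {
    printf '%s\n' x X y | grep -Ec "$@"
}

count_matches 'x'             # 1: only x
count_matches -i 'x'          # 2: x and X, as if the RE were [xX]
count_matches '^[^x]$'        # 2: X and y
count_matches -i '^[^x]$'     # 1: only y, as if the RE were ^[^xX]$
```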
 

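On the length of a regular expression: no particular limit is imposed, but portable programs should keep REs to 256 bytes or fewer, because a POSIX-compliant implementation may reject longer ones.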

man 7 regex says:
No particular limit is imposed on the length of REs(!). Programs intended to be portable should not employ REs
longer than 256 bytes, as an implementation can refuse to accept such REs and remain POSIX-compliant.
 

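Finally, how obsolete REs (basic REs) differ from the modern REs (extended REs) described above:

in a basic RE, the vertical bar (|), plus sign (+) and question mark (?) are ordinary characters;

bounds are written \{ \}, while { } by themselves are ordinary characters;

subexpressions are written \( \), while ( ) by themselves are ordinary characters;

^ is an ordinary character except at the beginning of the RE or of a parenthesized subexpression, $ is an ordinary character except at the end of the RE or of a parenthesized subexpression, and * is an ordinary character at the beginning of the RE or of a parenthesized subexpression;

\ followed by a non-zero digit is a back reference to the text matched by the corresponding subexpression; for example, \([bc]\)\1 matches bb or cc but not bc.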

man 7 regex says:
Obsolete (‘‘basic’’) regular expressions differ in several respects. ‘|’, ‘+’, and ‘?’ are ordinary characters
and there is no equivalent for their functionality. The delimiters for bounds are ‘\{’ and ‘\}’, with ‘{’ and
‘}’ by themselves ordinary characters. The parentheses for nested subexpressions are ‘\(’ and ‘\)’, with ‘(’
and ‘)’ by themselves ordinary characters. ‘^’ is an ordinary character except at the beginning of the RE
or(!) the beginning of a parenthesized subexpression, ‘$’ is an ordinary character except at the end of the RE
or(!) the end of a parenthesized subexpression, and ‘*’ is an ordinary character if it appears at the beginning
of the RE or the beginning of a parenthesized subexpression (after a possible leading ‘^’). Finally, there is
one new type of atom, a back reference: ‘\’ followed by a non-zero decimal digit d matches the same sequence of
characters matched by the dth parenthesized subexpression (numbering subexpressions by the positions of their
opening parentheses, left to right), so that (e.g.) ‘\([bc]\)\1’ matches ‘bb’ or ‘cc’ but not ‘bc’.
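
The differences can be seen side by side with a sketch like the one below. It assumes that plain grep interprets its pattern as a basic RE and grep -E as an extended RE; bre_lines and ere_lines are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Hypothetical helpers: print the lines of stdin that match a basic RE
# (plain grep) or an extended RE (grep -E).
bre_lines() { grep -- "$1"; }
ere_lines() { grep -E -- "$1"; }

# Back reference: \1 repeats whatever the first \( \) group matched.
printf '%s\n' bb cc bc | bre_lines '\([bc]\)\1'          # prints bb and cc

# '+' is an ordinary character in a basic RE but a modifier in an extended RE.
printf '%s\n' 'a+b' 'aab' | bre_lines 'a+b'              # prints a+b
printf '%s\n' 'a+b' 'aab' | ere_lines 'a+b'              # prints aab

# Groups and bounds need backslashes in a basic RE; without them the
# parentheses and braces are ordinary characters.
printf '%s\n' 'abab' 'ab' | bre_lines '^\(ab\)\{2\}$'    # prints abab
printf '%s\n' 'abab' '(ab){2}' | bre_lines '(ab){2}'     # prints (ab){2}
printf '%s\n' 'abab' '(ab){2}' | ere_lines '^(ab){2}$'   # prints abab
```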
 

To repeat what was said earlier: "the only way to be certain that a particular RE works is to test it."

Advanced Bash-Scripting Guide: 18.1. A Brief Introduction to Regular Expressions says:
The only way to be certain that a particular RE works is to test it.

 

Permalink: http://codingstandards.iteye.com/blog/1195592

 

 
