A regular expression (regex) is a string pattern written in a compact syntax that lets us quickly check whether a given string matches or contains a given pattern.
It is widely used in natural language processing, in web applications that validate string input (such as email addresses), and in most data science projects that involve text mining.
A regex pattern is written in a special language that represents generic text, numbers, or symbols, so it can be used to extract text that conforms to that pattern.
Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression.
The re module must be imported to use regex functionality in Python.
import re
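The concatenation rule described above can be sketched as follows. The patterns `A` and `B` are hypothetical examples chosen for illustration, not taken from the original text.

```python
import re

# Two small patterns (hypothetical examples):
# A matches one or more lowercase letters, B matches one or more digits.
A = r"[a-z]+"
B = r"\d+"

# Concatenating the pattern strings gives a new pattern AB, which matches
# text matched by A immediately followed by text matched by B.
AB = A + B

print(bool(re.fullmatch(A, "abc")))      # True
print(bool(re.fullmatch(B, "123")))      # True
print(bool(re.fullmatch(AB, "abc123")))  # True
print(bool(re.fullmatch(AB, "123abc")))  # False, because order matters
```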
The re module provides several functions for working with regular expressions, listed below.
SN | Function | Description |
---|---|---|
1 | match | Tries to match the regex pattern at the beginning of the string, with optional flags. It returns a match object if the pattern matches there; otherwise it returns None. |
2 | search | Scans the whole string and returns a match object for the first location where the pattern matches, or None if there is no match. |
3 | findall | Returns a list that contains all non-overlapping matches of the pattern in the string. |
4 | split | Returns a list in which the string has been split at each match. |
5 | sub | Replaces one or more matches in the string with a replacement string. |
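As a quick illustration of the five functions in the table, here is a minimal sketch. The sample string and the patterns are assumptions chosen for the example.

```python
import re

text = "Learning Python at Aimtocode"

# match: succeeds only if the pattern matches at the start of the string.
m = re.match(r"Learning", text)
print(m.group() if m else None)    # Learning
print(re.match(r"Python", text))   # None, "Python" is not at the start

# search: scans the whole string and returns the first match object.
s = re.search(r"Python", text)
print(s.span() if s else None)     # (9, 15)

# findall: returns every non-overlapping match as a list of strings.
words = re.findall(r"\w+", text)
print(words)                       # ['Learning', 'Python', 'at', 'Aimtocode']

# split: splits the string wherever the pattern matches.
parts = re.split(r"\s+", text)
print(parts)                       # ['Learning', 'Python', 'at', 'Aimtocode']

# sub: replaces each match with the replacement string.
replaced = re.sub(r"\s+", "-", text)
print(replaced)                    # Learning-Python-at-Aimtocode
```

Note that `match` and `search` return a match object or `None`, so their result is usually tested with `if` before calling methods such as `group()` or `span()`.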
Regular expressions are built from a mix of metacharacters, special sequences, and sets.
As the name suggests, metacharacters are characters with a special meaning, similar to * in a wildcard.
The metacharacters are listed below:
Metacharacter | Description | Example |
---|---|---|
[] | It represents a set of characters. | "[a-z]" |
\ | It signals a special sequence, and can also escape special characters. | "\r" |
. | It matches any single character (except a newline) at that position. | "pyt.hon." |
^ | It matches the pattern at the beginning of the string. | "^aimtocode" |
$ | It matches the pattern at the end of the string. | "tutorial$" |
* | It represents zero or more occurrences of the preceding pattern. | "pyt*" |
+ | It represents one or more occurrences of the preceding pattern. | "python+" |
{} | It represents exactly the specified number of occurrences of the preceding pattern. | "aim{2}" |
| | It matches either the expression before it or the expression after it. | "to|code" |
() | It captures and groups part of a pattern. | "(aim)tocode" |
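The metacharacters in the table can be tried out with a short sketch like the one below. The test strings are assumptions made up for the example; the patterns follow the table where possible.

```python
import re

# [] : a set of characters
in_set = re.findall(r"[a-c]", "abcdef")
print(in_set)                                        # ['a', 'b', 'c']

# .  : any single character except a newline
print(bool(re.search(r"pyt.on", "python")))          # True

# ^ and $ : anchor the pattern at the start or end of the string
print(bool(re.search(r"^aimtocode", "aimtocode tutorial")))  # True
print(bool(re.search(r"tutorial$", "aimtocode tutorial")))   # True

# *  : zero or more of the preceding character (here, "t")
star = re.findall(r"pyt*", "py pyt pytt")
print(star)                                          # ['py', 'pyt', 'pytt']

# +  : one or more of the preceding character (here, "n")
plus = re.findall(r"python+", "pytho python pythonnn")
print(plus)                                          # ['python', 'pythonnn']

# {} : exactly the given number of the preceding character (here, "m")
exact = re.findall(r"aim{2}", "aim aimm aimmm")
print(exact)                                         # ['aimm', 'aimm']

# |  : either the left or the right alternative
either = re.findall(r"to|code", "aimtocode")
print(either)                                        # ['to', 'code']

# () : capture parts of the match as groups
groups = re.search(r"(aim)to(code)", "aimtocode").groups()
print(groups)                                        # ('aim', 'code')
```

Quantifiers such as `*`, `+`, and `{}` apply only to the item immediately before them, which is why `python+` repeats the final `n` rather than the whole word.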
Here are the most commonly used operators that help build an expression to represent the required characters in a string or file.
They are commonly used in web scraping and text mining to extract the required information.
Operators | Description |
---|---|
. | Matches any single character except newline '\n'. |
? | Matches 0 or 1 occurrence of the pattern to its left. |
+ | Matches 1 or more occurrences of the pattern to its left. |
* | Matches 0 or more occurrences of the pattern to its left. |
\w | Matches an alphanumeric character (letters, digits, and underscore), whereas \W (upper case W) matches a non-alphanumeric character. |
\d | Matches a digit [0-9], and \D (upper case D) matches a non-digit. |
\s | Matches a single whitespace character (space, newline, carriage return, tab, form feed), and \S (upper case S) matches any non-whitespace character. |
\b | Matches the boundary between a word and a non-word character; \B is the opposite of \b. |
[..] | Matches any single character inside the square brackets; [^..] matches any single character not inside the square brackets. |
\ | Escapes characters that have a special meaning, e.g. \. to match a period or \+ to match a plus sign. |
^ and $ | ^ matches the start and $ matches the end of the string, respectively. |
{n,m} | Matches at least n and at most m occurrences of the preceding expression. Written as {,m}, it matches from 0 up to m occurrences of the preceding expression. |
a|b | Matches either a or b. |
( ) | Groups regular expressions and returns the matched text. |
\t, \n, \r | Matches tab, newline, and carriage return, respectively. |
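A few of these operators are combined in the sketch below. The sample text is hypothetical and only serves to show how each operator behaves.

```python
import re

# Hypothetical sample text for illustration.
sample = "Order 66 shipped on 2021-03-04, total: $19.99"

# \d+ matches runs of digits.
digits = re.findall(r"\d+", sample)
print(digits)        # ['66', '2021', '03', '04', '19', '99']

# {n} repeats the preceding expression exactly n times.
dates = re.findall(r"\d{4}-\d{2}-\d{2}", sample)
print(dates)         # ['2021-03-04']

# A backslash escapes special characters such as $ and .
prices = re.findall(r"\$\d+\.\d+", sample)
print(prices)        # ['$19.99']

# ? makes the preceding character optional.
spellings = re.findall(r"colou?r", "color colour")
print(spellings)     # ['color', 'colour']

# \b restricts the match to whole words.
whole_words = re.findall(r"\bcat\b", "cat concat cats")
print(whole_words)   # ['cat']

# [..] matches one character from the set, [^..] one character outside it.
vowels = re.findall(r"[aeiou]", "regex")
non_vowels = re.findall(r"[^aeiou]", "regex")
print(vowels, non_vowels)   # ['e', 'e'] ['r', 'g', 'x']

# {,m} allows from 0 up to m repetitions.
print(bool(re.fullmatch(r"a{,2}", "aa")))    # True
print(bool(re.fullmatch(r"a{,2}", "aaa")))   # False
```

Writing patterns as raw strings (the `r"..."` prefix) keeps Python from interpreting backslashes before the regex engine sees them.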
The match function tries to match the regex pattern at the beginning of a string, with optional flags.
In this example, the expressions "\w+" and "\W" are used to match words that start with the letter 'P'; any word that does not start with 'P' is reported as not found.
Aimtocode NOT FOUND
Python FOUND
Java NOT FOUND
C NOT FOUND
C++ NOT FOUND
Php FOUND
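Output of this shape can be produced by a minimal sketch like the following. The word list and the pattern `P\w+` are assumptions inferred from the output above, not the original code.

```python
import re

# Assumed word list, taken from the words that appear in the output.
words = ["Aimtocode", "Python", "Java", "C", "C++", "Php"]

# Assumed pattern: the letter 'P' followed by one or more word characters.
pattern = re.compile(r"P\w+")

results = []
for word in words:
    # match returns a match object (truthy) or None (falsy).
    status = "FOUND" if pattern.match(word) else "NOT FOUND"
    results.append(f"{word} {status}")
    print(word, status)
```

Because `match` only looks at the beginning of each word, a word such as "C++" is not found even though the pattern itself is valid.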
<class 're.Match'>
<re.Match object; span=(17, 26), match='Aimtocode'>
Looking for "Aimtocode" in "Learning Python at Aimtocode" -> found a match!
Looking for "Python" in "Learning Python at Aimtocode" -> found a match!
no match
[email protected]
[email protected]
[email protected]
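For comparison with `match`, here is a minimal sketch of `search` and `findall`. The email addresses and the deliberately simple email pattern are hypothetical; a production validator would need a stricter pattern.

```python
import re

text = "Learning Python at Aimtocode"

# search returns a match object for the first occurrence, or None.
m = re.search(r"Aimtocode", text)
print(type(m))     # <class 're.Match'>
print(m.span())    # (19, 28)
print(m.group())   # Aimtocode

# A pattern that does not occur in the text gives None.
missing = re.search(r"Java", text)
print(missing)     # None

# findall with a simple (hypothetical) email pattern:
# one or more word characters, dots, or hyphens on each side of "@".
contacts = "Write to alice@example.com or bob@example.org"
emails = re.findall(r"[\w.-]+@[\w.-]+", contacts)
print(emails)      # ['alice@example.com', 'bob@example.org']
```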