Regex Tutorial
Inside The Engine
Thursday, Jan 11 2007, 3:10 pm
INTRODUCTION
Regular expressions (regex) are very useful for programmers. With this tool you can describe every string that shows a certain regularity in its structure.
We don’t want to talk about formal languages or formal grammars; instead we are going to give you some examples that show how regexes work.
Think about a web page with a form that has the following fields:
- Name
- Surname
- Email
- Phone number
Once the form has been filled in and the data sent to the script, it’s very important to check that the data are correct.
You need to define the rules for each field:
- Name: one word only, made of alphabetical letters. For us this is not a required field.
- Surname: one or more words made only of alphabetical letters.
- Email: made of 3 parts. The first part contains alphanumeric characters, underscores (_) and periods (.), followed by an @. The second part contains alphanumeric characters and dashes, followed by a period. The third part is always 2 to 4 alphabetical letters. This field is required.
- Phone number: made of 2 parts of digits separated by a dash.
Each of these fields has a specific regularity, and there is a specific expression that defines it. These are the expressions:
- name: [a-zA-Z]*
- surname: [a-zA-Z’ ]+
- email: [a-zA-Z0-9_\.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}
- phone number: [0-9]+\-[0-9]+
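As a minimal sketch, the four expressions above could be used to validate form data like this in Python. The dictionary and the function name `is_valid` are only illustrative, the surname pattern is typed with a plain ASCII apostrophe, and the email suffix is written as {2,4} to follow the "2 to 4 letters" rule stated above.

```python
import re

# Patterns taken from the list above. Assumptions: ASCII apostrophe in the
# surname class, and {2,4} for the email suffix ("2 to 4 letters").
FIELD_PATTERNS = {
    "name": r"[a-zA-Z]*",
    "surname": r"[a-zA-Z' ]+",
    "email": r"[a-zA-Z0-9_\.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}",
    "phone": r"[0-9]+\-[0-9]+",
}


def is_valid(field, value):
    """Return True when the whole value matches the pattern of the field."""
    return re.fullmatch(FIELD_PATTERNS[field], value) is not None


if __name__ == "__main__":
    print(is_valid("name", ""))                       # the name may be empty
    print(is_valid("email", "john.doe@example.com"))
    print(is_valid("phone", "0004578907"))            # no dash, so not valid
```

Using `re.fullmatch` means the whole field has to fit the pattern, which is what a form check usually needs.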
- - -
CLASSES
The class operator [] is made of two square brackets. Between the brackets you can put several literal characters, either typed directly or written as constants. The operator matches a single occurrence of any one of the characters inside it; the set of characters defined this way is called a class. For example, the class [a] represents a single occurrence of the character a and lets you check whether it appears in a string and, if so, operate on it. The class [abcd] represents a single occurrence of any one of the four characters inside it and lets you check whether they appear in a string and operate on them.
- - -
RANGE OPERATOR
The dash - is an operator that identifies a range inside a class, for example:
- a-z for all the lower case letters
- A-Z for all the upper case letters
- 0-9 for all the digits.
Apart from these classic ranges, you can always define your own. For example, a-f contains all the lower case letters from a to f, which is very useful for describing hexadecimal numbers.
The class [a-fA-F0-9] matches every digit and every letter from a to f (lower case and upper case), that is, all the characters that can appear in a hexadecimal number.
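A short Python sketch may help to see classes and ranges at work. The sample strings are made up for the illustration; each match is a single character because a class without a repetition operator matches one character at a time.

```python
import re

# A class matches ONE character out of the set between the brackets.
ABCD = re.compile(r"[abcd]")
HEX_DIGIT = re.compile(r"[a-fA-F0-9]")

# Every single character of the sample that belongs to the class [abcd].
abcd_hits = ABCD.findall("a bad cab")

# Every single character of the sample that is a hexadecimal digit.
hex_hits = HEX_DIGIT.findall("#1aF-xyz")

if __name__ == "__main__":
    print(abcd_hits)
    print(hex_hits)
```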
- - -
CLASS REPETITION
Now we are going to describe the class repetition operators.
The first one is the star *. It checks whether a class is repeated zero or more times inside a string and selects all the consecutive occurrences.
For example, the regular expression [a-z]* selects every run of consecutive lower case letters in a string. In the sentence below it selects the words have, got, telephone, number, but, this, is, my, cell and phone, while the capital I, the digits, the spaces and the punctuation are left out:
I have got 7 telephone number, but this is my cell-phone: 0004578907
This operator accepts an empty match as a valid result, so it is used to check the NAME field: the field may be empty, but if it isn’t, it must be made of one word only. The regular expression for it is: [a-zA-Z]*
That expression covers all the alphabetic letters, a-z and A-Z.
Very similar to the star is the plus + operator. It works the same way, but it checks that a class is repeated one or more times. We use it for the SURNAME field, which can contain one or more words separated by spaces. This is the regular expression for it: [a-zA-Z’ ]+
Another operator is made of 2 braces {}, which contain either a number {3} or a numerical range {12,58}. The first form matches exactly 3 repetitions of characters that belong to the class. The second matches from 12 to 58 repetitions of characters that belong to the class.
For example, [0-9]{3,4}\-[0-9]{7} matches every telephone number made of an area code of 3 or 4 digits and a suffix of 7 digits.
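The repetition operators described in this section can be tried out with a small Python sketch. It uses the sample sentence from the text; the variable names are only illustrative.

```python
import re

text = "I have got 7 telephone number, but this is my cell-phone: 0004578907"

# + keeps only the non-empty runs of lower case letters
# (with * the engine would also report empty matches between them).
words = re.findall(r"[a-z]+", text)

# * accepts the empty string, + does not.
star_accepts_empty = re.fullmatch(r"[a-zA-Z]*", "") is not None
plus_accepts_empty = re.fullmatch(r"[a-zA-Z]+", "") is not None

# Braces: an area code of 3 or 4 digits, a dash, then exactly 7 digits.
PHONE = re.compile(r"[0-9]{3,4}\-[0-9]{7}")

if __name__ == "__main__":
    print(words)
    print(star_accepts_empty, plus_accepts_empty)
    print(PHONE.fullmatch("012-4567890") is not None)
```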
- - -
BACKSLASH
In the last example we also met another operator, the backslash \. Put before an operator, it makes the engine treat that operator as an ordinary character; put before certain letters, it forms a constant. The dash is used to indicate a range, so if we want to use it as a character we write it this way: \-
Now you can understand the regex that we used to check the email:
- [a-zA-Z0-9_\.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}
And the one for the telephone number:
- [0-9]+\-[0-9]+
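Here is a small Python sketch of what the backslash changes. The patterns a.c and a\.c and the sample strings are invented for the illustration.

```python
import re

# Without the backslash the period is an operator and matches any character.
unescaped = re.fullmatch(r"a.c", "abc") is not None

# With the backslash the period is an ordinary character.
escaped_on_abc = re.fullmatch(r"a\.c", "abc") is not None
escaped_on_dot = re.fullmatch(r"a\.c", "a.c") is not None

# An escaped dash inside a class is a plain dash, not a range.
digits_and_dashes = re.findall(r"[0-9\-]+", "tel 000-4578907")

if __name__ == "__main__":
    print(unescaped, escaped_on_abc, escaped_on_dot)
    print(digits_and_dashes)
```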
- - -
REPETITION OPERATOR’S SPECIFIC CHARACTERISTIC
One characteristic of the repetition operators is that they select as much text as the expression allows. This can sometimes be counterproductive. If we want to remove all the tags from an HTML page, we might try the following regular expression:
- <.+>
This regex selects a consecutive series of characters inside a string: a < followed by one or more characters of any kind, followed by a >. Applied to the following string, it selects the whole line:
- <HTML><HEAD> <TITLE> REGULAR EXPRESSIONS </TITLE> </HEAD> <BODY> </BODY> </HTML>
Within a line we take everything between the first < and the last >.
If this result doesn’t meet our needs, we can use one of the following methods:
- 1. <.+?>
- 2. <[^<>]+>
The first one makes the repetition operator less greedy, so it stops at the first closing character.
The second one matches a series of characters that starts with <, followed by any characters other than < and >, followed by a >.
Applied to the former string, the regexes we have just described select each tag separately:
- <HTML><HEAD> <TITLE> REGULAR EXPRESSIONS </TITLE> </HEAD> <BODY> </BODY> </HTML>
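The difference between the three expressions can be checked with a short Python sketch on the same HTML line. The negated class is written without a space after the opening bracket.

```python
import re

html = ("<HTML><HEAD> <TITLE> REGULAR EXPRESSIONS </TITLE> </HEAD> "
        "<BODY> </BODY> </HTML>")

# Greedy: from the first < to the last > of the line.
greedy = re.findall(r"<.+>", html)

# Lazy: the ? makes .+ stop at the first > it can reach.
lazy = re.findall(r"<.+?>", html)

# Negated class: no < or > allowed between the brackets of a tag.
negated = re.findall(r"<[^<>]+>", html)

# Removing the tags leaves only the text of the page.
text_only = re.sub(r"<[^<>]+>", "", html).strip()

if __name__ == "__main__":
    print(len(greedy), len(lazy), len(negated))
    print(text_only)
```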
- - -
CLASS NEGATION
Let’s focus on a different problem. Suppose we have a story and we need to identify all the sentences in it. If the period is used in the story only at the end of sentences, we can negate a class to identify a sentence more easily.
- [^\.]+
If the ^ sign is put immediately after the opening bracket of a class, it negates the class. In our case the regex matches a consecutive run of characters that are not the period. In other words, it identifies a sentence.
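As a minimal sketch in Python, the negated class splits a short story into sentences. The story is invented for the example, and it assumes the period appears only at the end of sentences.

```python
import re

# Invented sample text; periods are used only to close sentences.
story = "The cat sat. The dog barked. Everyone slept."

# Each match is a run of characters that are not a period;
# strip() removes the space left at the start of each later sentence.
sentences = [s.strip() for s in re.findall(r"[^\.]+", story)]

if __name__ == "__main__":
    for sentence in sentences:
        print(sentence)
```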
- - -
THE PERIOD
The period is a constant; in a regex it is equivalent to a class that contains every character except the new line.
Here is an example to better understand what the period does:
- c.s.
This regex matches every 4-character sequence that starts with c, followed by any character, followed by an s, followed by any character. It matches combinations such as:
- case
- casa
- cosa
- cose
- c%s9
- c£sl
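A quick Python check of the period follows. Some of the candidates come from the list above, while "cs", "cases" and the one containing a new line are added here to show what does not match.

```python
import re

candidates = ["case", "casa", "cosa", "cose", "c%s9", "cs", "c\ns1", "cases"]

# fullmatch requires the whole candidate to be exactly c, any character,
# s, any character. The period does not match the new line by default.
matches = [word for word in candidates if re.fullmatch(r"c.s.", word)]

if __name__ == "__main__":
    print(matches)
```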
- - -
ALTERNATION OPERATOR
Another very useful operator is the pipe |, which works like a logical OR. For example, the regex george|stuart matches the word george or the word stuart inside a string:
- Both george and stuart are two famous seo, but george has a forum, stuart has a web agency.
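In Python the same alternation can be tried like this, on the sentence from the example:

```python
import re

sentence = ("Both george and stuart are two famous seo, "
            "but george has a forum, stuart has a web agency.")

# The pipe tries george first and stuart second at every position.
names = re.findall(r"george|stuart", sentence)

if __name__ == "__main__":
    print(names)
```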
- - -
ANCHORS
A different problem arises when you need to modify one or more elements inside a CSV (comma-separated values) database, a text database in which fields are separated by commas and records are separated by new lines. The following database is an example that shows the daily AdSense earnings of three friends.
- 12€,50€,70€
- 30€,46€,68€
- 15€,52€,73€
- 16€,30€,85€
If one day one of the friends were banned from AdSense, his data would no longer be useful and might need to be removed. In this example there is very little data, so a manual change is easy. With thousands of records, a regex would be the fastest solution. If the banned friend’s data is in the third column, the fastest way to remove it is to delete every match of the following regex:
- ,[0-9]*€$
The $ character doesn’t match any character; it matches a position, the end of a line. So the regex above finds every series of consecutive characters that starts with a comma, followed by some digits, followed by the €, followed by the end of a line.
You can likewise identify the beginning of a line with the ^ character. Use it carefully, because the same character negates a class when placed right after the opening bracket. So always remember to use it outside a class. The $ operator must also be used outside a class; inside a class it is treated as an ordinary character.
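A hedged Python sketch of the column removal is shown below. The rows are written without spaces after the commas, which the pattern relies on, and the `re.MULTILINE` flag is needed in Python so that $ matches at the end of every line and not only at the end of the whole text.

```python
import re

# The four records of the example, with no space after the commas
# (an assumption: the pattern expects the digits right after the comma).
data = "12€,50€,70€\n30€,46€,68€\n15€,52€,73€\n16€,30€,85€"

# Delete the last field of every line: comma, digits, €, end of line.
cleaned = re.sub(r",[0-9]*€$", "", data, flags=re.MULTILINE)

if __name__ == "__main__":
    print(cleaned)
```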
- - -
GROUPS
We can treat a series of characters as a single group and apply to it some of the regex operators. Suppose we have to find inside a text a code of unknown length, made of 5 digits followed by a letter, followed by 5 digits followed by a letter, and so on, until it ends at the end of the line. To find this code we need a group. In this example the group is made of a digits-only class repeated five times, followed by a letters-only class. This group has to be repeated at least once and must end at the end of the line. It can be written as:
- ([0-9]{5}[a-zA-Z])+$
The regex has this effect: it selects the code at the end of the first two lines, while the third line has no match because the code is not at the end of the line:
- My secret code is 12345T45345R12343F34567j
- Phil’s secret code is 34526g54638j92725K63723H72829D12345I
- 12345T45345R12343F34567j is not phil’s code.
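The three example lines can be checked with a small Python sketch. The helper name `find_code` is only illustrative.

```python
import re

CODE = re.compile(r"([0-9]{5}[a-zA-Z])+$")


def find_code(line):
    """Return the code found at the end of the line, or None."""
    match = CODE.search(line)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(find_code("My secret code is 12345T45345R12343F34567j"))
    print(find_code("12345T45345R12343F34567j is not phil's code."))
```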
- - -
BACKREFERENCES
We may need to change the positions of different parts of text inside a string. For example, suppose we have a CSV database of 5 columns and 10000 rows with an error: the second column and the fourth column have swapped positions. Changing the positions manually would take hours and hours, but with a regex we can solve the problem in less than 5 seconds.
One property of groups is that they store the text they select in a variable, so it can be used in a substitution phase. For example, we need to create 5 groups that select the fields inside a row of our CSV. Let’s assume that the database is structured as follows:
- 1,45,589,phil,bob
- 2,56,79,mary,bob
- 3,57,89,phil,frank
- ..,..,..,..,..
We can use the following regex to select each single field inside a row:
- ([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$
With this regex each field is stored in a variable: the first variable holds the first field, the second holds the second one, and so on. We only need to replace the selected text with a new structure (1,4,3,2,5) to get the result we want.
But there’s a little problem: the way to retrieve the variables differs from tool to tool.
Htaccess, Dreamweaver and Perl retrieve the variables using the $ character. Example: $1 retrieves the first one, $2 retrieves the second one. Furthermore, $0 retrieves the match of the whole regex in some of these tools (in Perl itself the whole match is in $&). In the former example we would have used this replacement:
- $1,$4,$3,$2,$5
EditPad Pro and PowerGREP retrieve the variables using the \ character. Example: \1 retrieves the first one, \2 retrieves the second one. Furthermore, \0 retrieves the match of the whole regex. In the former example we would have replaced the match with the following expression:
- \1,\4,\3,\2,\5
.NET, JavaScript, PHP, etc. each retrieve the variables in their own way, and we advise you to read their guides.
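As one concrete case, here is a minimal Python sketch of the column swap. Python's `re.sub` uses the \1 style in the replacement string. The pattern is written without a comma before the $, on the assumption that the rows do not end with a comma, and the rows are processed one at a time.

```python
import re

# Five fields separated by commas, anchored to the end of the row.
ROW = re.compile(r"([^,]+),([^,]+),([^,]+),([^,]+),([^,]+)$")

table = "1,45,589,phil,bob\n2,56,79,mary,bob\n3,57,89,phil,frank"

# Swap the second and the fourth field of every row.
fixed_rows = [ROW.sub(r"\1,\4,\3,\2,\5", row) for row in table.splitlines()]

if __name__ == "__main__":
    print("\n".join(fixed_rows))
```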
WARNING: if you use repetition on a whole group, the variable refers to a single repetition of the group and not to the whole repeated text. If you use the regex ([0-9]{5}[a-zA-Z])+$ to select the code in this text
My secret code is 12345T45345R12343F34567j
the variable 1 will hold only the last repetition (34567j) and not the whole code. This happens because the repetition is outside the backreference. To solve the problem, turn the repeated group into a group without a backreference (so that it is not saved) and put the backreference around the whole repetition:
- ((?:[0-9]{5}[a-zA-Z])+)$
In this regex you can notice the structure (?: ), which contains the two classes. This structure is a group without a backreference; we can apply a repetition to it and store the result in the outer group.
Now the variable 1 holds the whole code, 12345T45345R12343F34567j:
My secret code is 12345T45345R12343F34567j
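The behaviour described in the warning can be verified in Python with a short sketch. The non-capturing group is written as (?:[0-9]{5}[a-zA-Z]), with no question mark before the closing parenthesis.

```python
import re

text = "My secret code is 12345T45345R12343F34567j"

# Repetition outside the capturing group: group 1 keeps the last repetition.
repeated = re.search(r"([0-9]{5}[a-zA-Z])+$", text)

# Non-capturing inner group, capturing group around the whole repetition.
wrapped = re.search(r"((?:[0-9]{5}[a-zA-Z])+)$", text)

if __name__ == "__main__":
    print(repeated.group(1))
    print(wrapped.group(1))
```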
- - -
QUESTION MARK
In groups the question mark can be used to avoid storing the match. We have also seen that the question mark can be used to restrain the repetitions. Now we will see that this simple character has several other functions.
The first function makes a group optional, as you can see in the following example:
- michael (owen)?
In this regex the group (owen) is optional, so it selects both the word michael alone (followed by a space) and the pair of words michael owen.
The second function is to act as an anchor. As we have already seen, operators such as ^ and $ are anchors: they identify a position inside a string. The question mark can also be used in a group to turn it into an anchor, so that the group identifies a position inside the text. Example:
- michael(?= owen)
This regex selects the word michael only if it is followed by the group ( owen), which is not itself selected. In the examples below, michael is selected in the first and last lines only:
- michael owen
- michael
- today owen scored a goal
- yesterday michael owen didn’t score
You can also use the question mark to require that something is absent at a position. For example, the following regex selects the word michael only if it’s not followed by the group ( owen):
- michael(?! owen)
Example (only the michael in the second line is selected):
- michael owen
- michael
- today owen scored a goal
- yesterday michael owen didn’t score
The two structures we have just described work only when the anchor follows the text (or group or class) to be selected. If the anchor is placed before the selected text, we have to use two other structures, the first to check for the presence of the anchor, the second to check for its absence:
- (?<=owen) michael
- (?<!owen) michael
Basically, the character < is inserted after the question mark.
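The four anchor structures can be tried in Python with the sketch below. The lookahead groups include the space before owen, because in the sample lines a space separates the two words; the strings "owen michael" and "hello michael" are added here only to exercise the two structures that look backwards.

```python
import re

samples = [
    "michael owen",
    "michael",
    "today owen scored a goal",
    "yesterday michael owen didn't score",
]

# michael only when " owen" follows; " owen" is not part of the match.
followed = [bool(re.search(r"michael(?= owen)", s)) for s in samples]

# michael only when " owen" does NOT follow.
not_followed = [bool(re.search(r"michael(?! owen)", s)) for s in samples]

# Anchors placed before the selected text.
after_owen = re.search(r"(?<=owen) michael", "owen michael")
not_after_owen = re.search(r"(?<!owen) michael", "owen michael")
after_other = re.search(r"(?<!owen) michael", "hello michael")

if __name__ == "__main__":
    print(followed)
    print(not_followed)
    print(re.search(r"michael(?= owen)", "michael owen").group(0))
```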





January 27th, 2007 at 7:00 am
What should be entered in [Find what:] if the email addresses mentioned in Microsoft Word documents are to be replaced?
January 27th, 2007 at 11:14 am
Can you give an example? I don’t use Word.
February 9th, 2007 at 12:33 pm
Hello,
I would like to translate your excellent post into French and post it on my blog. Would that be possible if I keep the structure of your post and cite the source?
Best regards,
February 9th, 2007 at 12:47 pm
NICE!!!
Thanx for the info
February 9th, 2007 at 7:24 pm
Here is a nice tool for testing regex in php
http://www.not-a-blog.com/phpMyDesktop/regextester.php
PS, not written by me, just found it
February 9th, 2007 at 10:44 pm
FYI, your example “michael(?=owen)” has a missing “owen” in the last line. It was supposed to match “michael(?!owen)” but it doesn’t.
February 9th, 2007 at 10:47 pm
@typo:
very useful
tnx
February 11th, 2007 at 4:06 am
In your first example:
“I have got 7 telephone number, but this is my cell-phone: 0004578907″
The ‘I’ should not be selected, because you are only using [a-z], so only lowercase letters should have been selected. For the sake of readability, it might be better if you use a background color to show the ’selected’ letters/text instead of making them bold. It took me a little while to figure out what you meant.
February 11th, 2007 at 10:23 am
@ Eric:
tnx
February 12th, 2007 at 7:54 am
I have the same problem as Raj Babbar i.e. I’m using Microsoft Word XP and Word 2003, both of which support RegEx in the Find and Replace dialog. However, your E-Mail RegEx doesn’t seem to work in Word. Can you tell me how to modify the expression so it can work in MS Word?
February 12th, 2007 at 10:55 am
@Andrew:
I did this in five minutes. It’s not the best way to match email addresses, but it works:
<[a-zA-Z0-9_\.]*\@[a-zA-Z0-9-]*\.[a-zA-Z]*>
I suggest you read this.
May 10th, 2007 at 3:08 pm
i have name validation that requires:
- if it contains only alphanumeric characters is valid.
- if it contains only special characters(!@#$%^&*()_+.{}[]\|:;’”,.?/) is not valid
- if it contains mix alphanumeric and special characters is valid.
- if it contains a double quote is invalid.
What’s the reqex for this?
May 14th, 2007 at 10:43 pm
@Zeron
use this regex:
^[\!\@\#\$\%\^\&\*\(\)\_\+\.\{\}\[\]\\\|\:\;\’\,\.\?\/a-zA-Z0-9]*[a-zA-Z0-9]+[\!\@\#\$\%\^\&\*\(\)\_\+\.\{\}\[\]\\\|\:\;\’\,\.\?\/a-zA-Z0-9]*$
July 20th, 2007 at 5:29 pm
This should be simple, but for some reason I can’t get it to work. I want to make the string “valid” only if it contains at least 1 number AND 1 letter. What would the regular expression be for this?
July 20th, 2007 at 7:02 pm
@josillor:
([a-zA-Z][0-9]|[0-9][a-zA-Z])
February 5th, 2008 at 10:50 pm
I couldn’t read this because of the numerous spelling mistakes and grammatical errors. How hard is it to run something trough (oh sorry, THROUGH) a spell check?