Markov Chains [Spam that Search Engines like - Pt 1] Thursday, Nov 2 2006
Grey Hat Blogs 3:21 pm
With this post I am starting a series of articles dealing about topics related to spam, as seen from a Grey Hat SEO’s point of view.
Let this be quite clear: I am not going to provide you with codes to develop spam engines, but I am going to use what I’ve learnt through these engines in order to provide you with theories and/or information that can be implemented both in white hat and black hat sites.
It’s up to you how to make best use of this knowledge and it’s also up to you to prove me wrong whenever you want to.
Enough of digressions, let’s start…
(please read slowly or you’ll have to learn about Markov Chains on Wikipedia)
Let’s suppose we are in a town of 100,000 inhabitants where everyone meets, during his/her life, only a certain number of people from that town, each of them at a different frequency. And let’s suppose that I am an inhabitant of this strange place and that one day I will meet by chance a good friend of mine called Giorgio.
It’s no neews, on the contrary, I have to say that I do meet this friend of mine very often, almost once a day to be precise. But today something has happened that does not happen very often: we met in front of a bar at 8am and there was no-one there. So we decided to go to Stefano’s place, a friend of ours, who lives not far away from this bar. I do not see him very often, but he does know him very well, he has known him for a couple of years and he says they meet up at least five times a week. Now that I’m thinking about it I have never happened to meet Stefano alone, the few times I’ve seen him Giorgio was almost always with us.
Anyway, we get to his place in a couple of minutes and ring the intercom, we are told to wait a little, Stefano is getting ready; also, he says we have to go and see a certain Stuart Delta, a friend of his who has just called a few minutes before, telling him he’s moving and asking whether he could help him. Stefano says hello, he says he will be ready right away and I ask Giorgio who this Stuart is. He replies that he has just got to know him, adding that he is a very good friend of Stefano’s, and they are almost always together and that the three of them go out every saturday night: he’s the person Giorgio sees the most when he goes out with Stefano.
But here comes Stefano, he says he’s sorry for taking so much time. We are told that Stuart lives just a bit further on and that it would be nice of us to help him moving.
We get to his place and a few minutes later Enrico, a very good friend of Stefano’s and Stuart’s, joins us: he has come to help. To my great surprise I find out that he is an old friend of mine and I tell him I was not expecting to see him there. He says “Hi” and asks me how come I am at Stuart’s and then, without waiting for a reply, turns around and shakes hands with Giorgio, introducing himself.
Markov chains base on one or more starting parameters (Giorgio and Me) to obtain one or more final parameters according to the probability the starting parameters are connected to the final parameter (Stefano): I do not know Stefano very well, Giorgio often sees him, I almost always meet him when Giorgio is with us: Me -/ Giorgio -/ Stefano.
Giorgio does not know Stuart very well, Stefano often sees him, that’s why he and Giorgio are meeting Stuart today, because, as I said, he is the person they see the most when they are together, so: Giorgio -/ Stefano -/ Stuart.
And this is why I have met Stuart today, because: Me -/ Giorgio -/ Stefano -/ Stuart.
And, finally, the fact that I know Enrico is not decisive for him coming to Stuart’s place. He actually did not even take too much notice of me. Giorgio had never met him, anyway.
And yet I might meet Jacopo tomorrow: a friend of mine, of Giorgio’s and Stefano’s. We might meet him just because we are together (Giorgio, Stefano and Me). Or maybe one day we decide to meet him, only Giorgio and Me. Or maybe I will meet him alone.
With a Markov chain, it is not necessary to have a certain number of starting parameters to obtain a final parameter. The final parameter, though, is likely to be generated by the starting paramenters, and the probability depends on these.
Just don’t worry if it’s not all clear, I will give more examples and as we go further into the issue you wil find out why it is interesting both for the Black Hat SEOs and for the White Hat SEOs.
Let’s now suppose we have a start set (it was the town in the above example) where the two following cases are established: an ordered relationship among the elements (in the above example it was: Giorgio and Me; Stefano and Me, Enrico and Me; Stuart and Me; Giorgio and Me; Giorgio and Stefano; Giorgio and Stuart; Giorgio and Enrico; ..), and an ordered relationship between an ordered subset of elements and one element (In the town example it was: [Me and Giorgio] -/ Stefano is different from [Giorgio and Me] -/ Stefano).
Also, let’s suppose that every kind of relationship is coupled with a frequency ranging from 0 to 1.
Example (forget about Me, Stefano, Giorgio and so on.. let’s now use letters only and let’s define a new type of relationship):
- I -/ G = 1/2;
- G -/ I = 1/3;
- I -/ S = 1/7;
- S -/ G = 1/4;
- G -/ S = 1/6;
- [G -/ I] -/ S = 1/20;
- [I -/ G] -/ S = 1/10
As you can see now, unlike in the town example, frequency is based on the elements order, too. But don’t worry, let’s get rid of letters and give an example with words.
Let’s suppose we have a text document dealing with dogs. This document is our set, whereas the words composing it are the elements.
Each word in the document will be followed by another with a certain frequency (number of cases of the ordered couple divided by the number of cases of the first word). For example, the word the is followed in the text by the word dog with a frequency of 1/20, as there are 200 cases of the word the and 10 of the ordered couple the dog. I’ll make clear what I mean with ordered couple for those who haven’t quite understood it: the dog is a couple formed by the and by dog, but it is different from the couple dog the: in fact, the couple the dog can have a different frequency compared to the couple dog the.
Also, in our document every ordered couple of words is followed by another couple with a certain frequency, for example in our document only these words follow the ordered couple the dog:
- is
- white
- Snoopy
The word is follows the couple the dog with a frequency of 1/2; in fact, there are 5 cases of the ordered triple the dog is in the document, out of a total of 10 cases of the couple the dog. The word white follows the couple the dog with a frequency of 2/5 and the word Snoopy follows the couple the dog with a frequency of 1/10.
Let’s suppose we know the frequency linking every couple of the document to each word of the document; we are talking about an ordered type of link. We will use this information in an array and will define a start couple, randomly chosen among all couples.
If we have a big array available we can get a fairly long and correct text, even if it’s nonsense to us humans.
Let’s suppose our start couple is the dog.
If we look for it in the array we will find out that it can be followed by 3 words (is; white; Snoopy), each of them with a certain frequency. One of these word is chosen at random (is) and we continue.
We now have the couple dog is, we look for it in the array and find out that it can be followed by 2 words (white, black). One of these two words is chosen randomly, based on the frequency (the word white is chosen) and we keep on with looking for the couple is white in the array …
Clearly, the text produced will be getting to a dead end in the following cases only:
- look in the array for a couple not associated with any other word (if it exists, there is only one for each array)
- look in the array for a couple A associated with a single word only and the second word in the couple forms a couple associated with one word only, and so on until you either get to a couple not associated with any word, or to a couple among those that necessarily follow couple A. In this second case, we are basically creating a loop.
Ok.. I have now explained to you what a Markov chain is, or at least I have explained them to you from the point of view of random text production. At this point spammers can stop reading. This is what my friends White SEOs might be interested in:
The question to be asked is: why do spammers use these texts? Because they are lazy? Or just because they want to reach many rankings right away and for many keywords without having to write the contents?
Sure! 99% do it for that reason, but the rest 1% is wondering: why do search engines “like” these texts?
My idea is that these texts, if they are produced correctly, can interpret the trend of a SERP both from the point of view of the topic and of the language. This way, they climb up the very SERP, because search engines consider them appropriate.
This kind of pages may be considered as a good feeler to roughly comprehend what is the trend of a SERP and to understand how to create clean pages based on the results of the Markov pages.
Ranking processes most likely rely on the mathematical/geometric idea of similitude. Therefore a search engine will consider appropriate a page that is similar to all the other pages that have been considered interesting inside a certain SERP and a page that is, at the same time, different from each one of these pages, as much as the copyproof filters are overcome.
It can be deduced that Search Engines Trend can actually exist, at least as an ideal model a web page tends to, determining its ranking in a certain SERP.





November 8th, 2006 at 12:31 pm
I did not get the idea about Markov Chains, also I think supposing engines use this procedure to rank websites is pure fiction with no bases - No offence
November 8th, 2006 at 3:05 pm
@Jeff Roy:
You didn’t get the idea about M.C. and at the same time you think that what I wrote is wrong
Markov Chains are a statistical method that analyze the relationship between adjacent words in a text.
For example, with M.C. , Google can deduct that a page which contains only noons, probably is a spam page.
If I write a spam page using M.C. , google can’t ban my pages through an Algo. Only a Quality Rater could do it.
November 14th, 2006 at 4:42 pm
I was searching for Markov Chain when i fell into this page… had to do some markov problem as an assignment… am wondering if your idea can be used… but its interesting for sure
March 15th, 2007 at 1:02 am
This is a great post. You should add more content to your website, it sounds like you have some interesting ideas.
March 22nd, 2007 at 2:07 pm
Where is part 2? I’m looking forward to more posts by you
March 22nd, 2007 at 4:40 pm
I never wrote the part 2 of this post. This english blog is the translation of my italian one. Now i’m writing about other stuff like Black Hat techniques for retriving information from user cache.
Maybe someday i will write the part 2 of this post in which i will talk also about google phraserank algos
If u want to talk with me about something u can contact me at kerouac3001 [AAATTT] hotmail.com
But my english is not good
April 25th, 2007 at 8:45 pm
Very interesting post, I just Stumbled you
June 17th, 2007 at 8:15 pm
I find this quite “interesting”, let’s hope that, on the third reading, that some of the mist clears!
June 21st, 2007 at 10:51 am
>>I did not get the idea about Markov Chains, also I think supposing engines use this procedure to rank websites is pure fiction with no bases - No offence
I didn’t agree with you!
October 10th, 2007 at 5:13 pm
[…] Posted by smoothspan on October 10th, 2007 I read an interesting article in Grey Hat SEOblog about some of the techniques SEO spammers use. Before I go further, let me be absolutely crystal clear: I. Hate. Spam! […]
January 29th, 2008 at 8:03 pm
Landed here from Wikipedia. Found your writeup useful - tx