
Search engines

No surprise: Google is number one in many respects.

Looking at the numbers for September 2016, I can see 3,828,299 pages scanned by the Google engine. To that you can add 909,422 more for mobile rendering, for a total of just under 5 million pages scanned by Google on my website. I'm impressed.

The second crawler found on my website is… MJ12, the Majestic bot. Not a useful one: it is just software that checks for backlinks. Basically, there is no reason to accept it.

Then you find Bing, the Microsoft search engine, with 45,874 scanned pages! Too bad… Don't be surprised to get such poor results when you query Bing: they do not refresh their information. I suspect they mostly scan the “premium” sites. I am not competing with the likes of The Washington Post, of course; I am just an obscure website. But compared to Google, I find Bing not very fair. And, as you might expect, I receive no visits from Bing.

Of course, more than three search engines visit me, but the others are insignificant. Qwant, the French search engine, has scanned about 1,000 pages in 5 days, roughly 6,000 in a month. That is not ridiculous for such a young engine, and I suspect it buys results from other, more famous search engines. But 6,000 against 46,000 for Bing is remarkable: the newcomer could compete with a giant.

Note that crawling the web means two things: having space to store the results (for text, that is quite simple) and having good bandwidth, because scanning a page means downloading about 10 kilobytes. Scanning my website means Google downloads roughly 500 MB a day. You can imagine the bandwidth needed to do this across the whole web…

 


Sitemap

Sitemaps are a very good way to tell search engines which pages make up your website. In many cases, you do not have to bother with one, mainly because the robots are smart enough to find all your pages as long as they can reach them.

You may need a sitemap file if some of your files are hard to reach (PDF files, for example) or if, like my website, you have a huge number of pages with many changes every day. I have about 300,000 pages (because I list events from everywhere in the world), so it is important to tell the robots about them.

Basically, you can list every page of your website. If you have more than 50,000, you have to split the sitemap into several parts, which is what I have done. Note that each entry has a property named “priority”, which you can set from 0.0 to 1.0 (it is a floating-point value).
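As an illustration, here is roughly what that looks like: a sitemap index pointing to the parts, and each part listing at most 50,000 URLs with their priority. The structure follows the sitemaps.org format; the part file names and the sample URL below are made up for the example, not taken from my real sitemap.

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap.xml: the index referencing the split parts -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>http://www.koikonfait.com/sitemap-part1.xml</loc></sitemap>
  <sitemap><loc>http://www.koikonfait.com/sitemap-part2.xml</loc></sitemap>
</sitemapindex>

<?xml version="1.0" encoding="UTF-8"?>
<!-- sitemap-part1.xml: one of the parts, at most 50,000 <url> entries -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.koikonfait.com/some-event-page</loc>
    <lastmod>2016-09-01</lastmod>
    <priority>0.8</priority>
  </url>
</urlset>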

This can be important. Not for Google, because they already crawl your website very well, but for Bing. Basically, Bing is the search engine used by “noobs”: those who use the Internet Explorer browser, which queries Bing by default.

There is a Bing Webmaster Tools (as there is for Google). I am not sure of anything, because everything is magic and secret, but I noticed that using the priority can help Microsoft's search engine: since the newest pages have a high priority, the number of indexed pages has increased. I therefore expect to see more visits over the next weeks.

Because I have the website in both HTTP and HTTPS, I decided to provide the same sitemap for both.

 

Ubuntu LTS

I've written before about the new Ubuntu LTS upgrade. I can now share a few tricks I've learned.

First of all, moving to PHP 7 can be frustrating. To begin with, PHP FastCGI (PHP-FPM) is installed by default during the upgrade, mainly to run PHP 7 with the nginx server. When you come from PHP 5 and a working Apache 2.4 setup, you do not want this stuff. So I stopped the PHP FastCGI service, which is basically a PHP server of its own, and configured Apache 2.4 to run PHP correctly.
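For the record, the commands involved look roughly like this on Ubuntu 16.04; this is a sketch of the approach rather than a transcript of my exact session, using the standard package and module names.

# stop and disable the FastCGI service pulled in by the upgrade
sudo systemctl stop php7.0-fpm
sudo systemctl disable php7.0-fpm

# install and enable the classic Apache PHP module instead
sudo apt-get install libapache2-mod-php7.0
sudo a2enmod php7.0
sudo systemctl restart apache2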

During this reconfiguration, PHP was out of service for a while but the PHP pages were still being served, so I exposed the source code of my pages to the world. Be careful… On my side, there is nothing to regret: all the sensitive files containing the configuration are stored outside the public area of the site, so a few users simply saw some very surprising pages. Note that Google also indexed those pages. Keeping sensitive data in the public area is therefore very dangerous. Note that WordPress users can store their wp-config.php file one folder above the public area; that is very good advice.

After that, I needed to reinstall all the extensions, including the new release of the MongoDB driver. The new version of the driver is easy to install if you are prepared; it is better to test it first against a working PHP 5 installation.
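If it helps, the new driver (the mongodb extension, which replaces the old mongo one) installs along these lines; a generic sketch, and the ini path is the usual Ubuntu 16.04 one, so adapt it to your setup.

# build tools plus the extension itself
sudo apt-get install php-pear php7.0-dev build-essential
sudo pecl install mongodb

# declare the extension and enable it the Ubuntu way
echo "extension=mongodb.so" | sudo tee /etc/php/7.0/mods-available/mongodb.ini
sudo phpenmod mongodb
sudo systemctl restart apache2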

The last (but very annoying) thing about the update is the log rotation for Apache. Edit the /etc/logrotate.d/apache2 file: I was surprised to find, some time later, that only one week of logs was being kept. Not enough for me, so I changed it manually.
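The relevant directives in that stanza look like this; the values below are an example of keeping more history, not the packaged defaults, and the other directives shipped by the package should stay as they are.

/var/log/apache2/*.log {
        # rotate every day and keep about a month of logs instead of one week
        daily
        rotate 30
        # keep the remaining directives from the packaged file (compress, postrotate, ...)
}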

The very bad news concerns MongoDB. First of all, the distribution upgrade brings MongoDB to version 2.6, which is quite ridiculous: for a long-term release, version 3.0 was expected. Nevertheless, you can update your database to a recent (and decent) version from the official MongoDB repository. Note that this version 2.6 does NOT start automatically when the server boots; I do not know why. As I restart my server twice a year, I do not really mind, and I will upgrade to version 3 instead of looking for a solution.
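If you do want it to come back by itself after a reboot, the usual systemd commands should be enough; a sketch, assuming the service is called mongod as in the official mongodb-org packages (the Ubuntu package names it mongodb, and if no unit file exists at all that is another story).

# ask systemd to start MongoDB at boot, then check it
sudo systemctl enable mongod
sudo systemctl start mongod
sudo systemctl status mongod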

Note that the upgrade ran for about one hour, but the server can keep serving your web pages through Apache 2 during that time, because there are not many restarts and things are designed to go smoothly.

Have a good upgrade (if you have not done it yet).

 

Robots.txt

Many people know the robots.txt file. Placed in the root directory of the web server, it tells search engines which directories to avoid.

Of course, this file is only advisory, and only serious search engines take it into account. It is fairly easy to create a robots.txt file.

In the case of my website, it contains the following information:

User-agent: *
Disallow: /cgi-bin/
SITEMAP: http://www.koikonfait.com/sitemap.xml

Nothing more. To understand it better:

The “User-agent” part defines which user agents are allowed in and which are banned (read: persona non grata). A priori, it is counter-productive to block access for certain search engines. If you enjoy scorched-earth policies, you can forbid Google from indexing your site. Why not? Google follows your instructions to the letter.
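For example, banning a single crawler, say the Majestic bot mentioned earlier (user agent MJ12bot), only takes two extra lines in robots.txt, something like the following; as explained below, I do not actually do it.

User-agent: MJ12bot
Disallow: /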

The same goes for the other search engines. You still need access to your server logs to know which robots actually visit your site. If your site is on shared hosting, you probably do not have access to that information. If you host the site yourself, it is usually found under /var/log/apache2.

Some robots seem stupid or even annoying (Majestic, for example). Others are surprising (like Baidu, the Chinese search engine). Others are more legitimate, like BingBot (Microsoft's crawler). Personally, I do not discriminate: everyone gets in.

On the other hand, I have forbidden access to the /cgi-bin/ directory, which is supposed to hold executable programs. It is an old habit that makes no sense here, because that directory is empty on my server!

There is also a very interesting directive that I personally do not use:

Crawl-delay: 10

It tells search engines to wait 10 seconds between requests, which limits the load on the server. Personally, I have not bothered to set this parameter. I am not sure it is respected, but the main reason is that my server is built to take the load. Are search engines savages, then?

Yes for Bing, Microsoft's search engine: it is capable of sending 5 requests within the same second. Not very fair play. Google, on the other hand, is remarkable: its requests are evenly spaced, generally by at least 5 seconds (with exceptions). When it hits a redirect (a 301 or 302), it does not force itself to read the target page immediately; it puts it in its “to scan” list. I cannot imagine the algorithm behind this, but it has big advantages: fair play towards the website and economical for Google's bandwidth. A win-win strategy.

 

Ubuntu 16.04 LTS

For those who use an Ubuntu Server 14.04 LTS, you will be able to move forward… to 16.04 LTS. Quite easily.

You will be able to move from the old 14.04 to the new 16.04 in the next few days (once version 16.04.1 is out, which is planned for July 21).

Why upgrade the system? To keep your machine on recent versions. From my point of view, the main differences are PHP (from 5.5 to 7: not a big deal, but much faster) and MongoDB (expected to go from 2.4 to 3.2). For my server, this is the main expectation: having MongoDB 3.2 with WiredTiger would be a very good deal. Note that the distribution upgrade will NOT convert the MongoDB data files: you have to do it manually, by first backing up your data, then changing the data directory and the storage engine, and finally restoring the data.
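The manual migration I have in mind looks roughly like this; a sketch only, using the paths and service name of the official mongodb-org packages (the Ubuntu package calls the service mongodb), so adapt it to your installation and keep a real backup around.

# 1. dump the existing databases while the old server is still running
mongodump --out /backup/mongo-before-3.2

# 2. stop MongoDB and move the old data files out of the way
sudo systemctl stop mongod
sudo mv /var/lib/mongodb /var/lib/mongodb.mmapv1

# 3. in /etc/mongod.conf, set the storage engine:
#      storage:
#        engine: wiredTiger
#        dbPath: /var/lib/mongodb
sudo mkdir /var/lib/mongodb
sudo chown mongodb:mongodb /var/lib/mongodb

# 4. start the new server and restore the dump
sudo systemctl start mongod
mongorestore /backup/mongo-before-3.2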

As for the release itself, I will just describe how to do it (if you have a server and perform the upgrade over SSH). NOTE: YOU FIRST HAVE TO BACK UP YOUR SYSTEM. My provider lets me back up the disks without fear, so I consider the backup question settled.

First step, install the update manager. It is very easy:

sudo apt-get install update-manager-core

Starting from July 21, check whether the brand new version is available:

do-release-upgrade -c

If yes, do the backup. Then run the upgrade:

sudo do-release-upgrade

Wait and see… In my next post, I will give you my impressions and the time needed to upgrade, and whether I failed… The upgrade should happen next week (July 26th or 27th, perhaps later). Currently, I serve more than 150 pages per day plus a mobile application. Stopping the server for the upgrade should take less than one hour… We will see.

IMPORTANT NOTE: there is a change in the SSH configuration (I believe root logins with a password are refused by default). If you upgrade over the network, be very careful.
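Before rebooting into the new release, it is worth checking what the merged sshd_config ended up with; a quick check along these lines (this is my understanding of the new default, so verify it for yourself, and make sure you have an SSH key or a sudo user before closing your session):

# check how root logins are handled after the upgrade
grep -i '^PermitRootLogin' /etc/ssh/sshd_config
# "prohibit-password" means root can only log in with an SSH key, not a password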

There is a page about the upgrade that gives more or less the same advice: How to upgrade to Ubuntu 16.04 LTS.

Good luck…

NOTE 27/07: it seems the upgrade path for servers is not ready yet. See bug #1605474 for more information. The upgrade should be available by the end of July. Using the “-d” option is not recommended. If you use an LTS version, I guess you are patient.

 

From concept to product: done!

The concept was simple: find an activity near you within the next 2 hours.

There you go. Simple.

The product: a website (June 2015). Simple: one month of coding (in PHP and MongoDB, because they let you code fast). And it turned out reasonably well (as a Java architect and developer, I probably should not say that).

July 2015: cinemas and quefaire.paris.fr available.

Then the iPhone (and iPad) application: a map showing the activities for the next two hours.

Then the Android application. Same thing: a map.

September 2015: an administration site to give me some statistics.

October 2015: flea markets.

January 2016: BilletRéduc (an excellent idea) and EventBrite (the ticketing site).

February 2016: 10,000 events available, mainly around Paris and Geneva. Improved versions of the applications.

February 2016: FNAC arrives. We pass the 20,000-event mark.

May 2016: duplicates removed… because our providers sometimes send us the same information (concerts, theatre…). We reach 40,000 events in the database with MeetUp.

June 2016: 50,000 events. France, Belgium and Switzerland. New home page.

July 2016: 75,000 events. With Facebook, the poor machine and its 1 GB of memory start to saturate and response times drop, but a few programming tricks are already on the horizon…

In one year, a concept has become a war machine. koikonfait.com went from concept to reality. No moral to draw, no conclusion, just an observation: with open-source tools, a VPS server and a budget of 30 euros a month (hosting, the domain name, the Apple Store subscription for the iPhone application…), I reached my goal. I can open my app and decide on my evening: it will probably be a concert somewhere in Geneva…

[Screenshot: distribution of activities across the French-speaking area]

 

 

 

Now Unix works in Windows 10

I'm very impressed by the demo: you can get a Linux box inside a Windows 10 environment. I have just seen the demo and I am very excited. It is not a joke; it is something like a Linux emulator (basically, the Linux kernel calls have been mapped onto the Windows NT layer).

It is not complete yet, but I feel like when we ran an MS-DOS emulator under Solaris back in the 90s. Full of hope. This is the end of the war between Linux and Microsoft, at last.

Just type “bash” in an ordinary command-line window…! Well, you have to wait a little for the rollout to Windows 10, or sign up for the Developer Preview.
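Once the “Windows Subsystem for Linux (Beta)” feature is enabled, a session looks something like this (a sketch from memory of the demo, with made-up user and machine names):

C:\Users\me> bash
me@machine:/mnt/c/Users/me$ uname -a
me@machine:/mnt/c/Users/me$ sudo apt-get update
me@machine:/mnt/c/Users/me$ sudo apt-get install git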