Relevance and Performance
Yet another geocoder ?
There are already other geocoders available, so why Gisgraphy? First, things have to be clear: You should be aware that my goal, here, is not to denigrate, or to tell you that Gisgraphy is the best geocoder - not at all!. I strongly respect the developers and the community that work hard and for free to develop geocoders or geoloc stuff. But it simply doesn't fit all my needs, and that's the reason why I have developed Gisgraphy. For me, the relevance / usability or price of the existing geocoder is not satisfactory (it is my opinion). let's take the most popular : Google geocoder, Nominatim and Geo::Coder::US :
A different philosophy : An Address is not a string !
- Google Maps is not free and not open source and really expensive ($25,000USD for one year). But I must admit that the relevance is really good...
- Geo::Coder::US is not worldwide. That's a real limitation...
- On the wiki page of Nominatim it is writen that : "Search terms are processed left to right. This Search will work: pilkington avenue, birmingham, This search will fail: birmingham, pilkington avenue ". Thats not very usable.
After more than 7 years of doing geocoding, that's the conclusion - An address is not a string ! It is the cardinality of all of the possible names of the different parts :
'3355 S Las Vegas Blvd, Las Vegas ' and 'the Venitian, The Strip NV 89109 USA' is the same address but there is no common word between the two addresses
Gisgraphy has its own importer and mixes the best open source gazeteers / databases : OpenStreetMap
of course, but also GeoNames
, and it transform them into an address and POI database. It links each house to its street, each street to its city / ZIPcode, and each city to its administrative divisions. All of them are associated to their alternate names.
Gisgraphy has a different philosophy than the other geocoders. It use a fulltext engine but it uses an address parser
to separate the several parts of the address. It is then simpler to do geocoding and remove ambiguity : streets that have city names, unit informations that can disturb the fulltext search , house numbers that looks like ZIPcodes,... to have better relevance, it use the shapes when possible, it avoids putting a street in a city that its center is closer than the center of the city to which it belongs.
Because of the parsing, Gisgraphy can do roof top geocoding, that means that it can do geocoding up to house numbers
Gisgraphy is more than a geocoder, it provides also a full stack that allows to do common things as reverse geocoding, search around for place or POI by name, (different than search for address), restrict search around a point for a given radius, and so on. It also comes with autocompletion and spell checking-that is very useful and users are accustomed to it.
Technically, Gisgraphy comes with some facilities :
- Gisgraphy support more languages (XML, JSON, PHP, Ruby, Python, Atom, RSS / GeoRSS )
- An import wizard helps you choose the country you need, and do all the stuff : No need to to be a geocoding guru :) Gisgraphy will download, extract and import the files for you
- You have full control on the data, you can add / delete / modify data via a web interface.
- Autocompletion / autosuggestion
- Partial search
- All words mandatory or not
Gisgraphy has also its own limitations, the main one is that it does not yet manage street numbers. but it will in a future version :) (no date)
In the next section I will explain how you can set up Gisgraphy to get the best relevance and performance.
Relevance and performance are the two most important things for a geocoder. A common question is "Does Gisgraphy has a good relevance for my country" and "How many requests can it handle".
The relevance is strongly dependent of the datasets - it depends of the number of entries in OpenStreetMap and in GeoNames. To see how many entries there are for a specific country, I have computed some statistics on streets
, and house numbers
. There is also a good way to see if a particular region is well covered: simply look at the OpenStreetMap maps
. It is the best way to see if there is a lot of streets, or POIs, or if a lot of house numbers are in the dataset.
The other thing that improves relevance is the address parser. As it says on the home page, the address parser divides a single address (as a string) into its individual component parts : house number, street type (boulevard, street, ..), street name, unit (apt, building, ...), ZIPcode, state, country, city... this is an important part when geocoding because we knows the meaning of each word. To do so, the parser must try to detect the address patterns of the Universal Postal Union (UPU)
. The address parser is not implemented for all countries yet (see already implemented countries
). If a country is not implemented, we geocode the address as a string, with fulltext search and the relevance can be decreased (If you choose premium services, we can implement your country prior to the other if needed). If you don't geocode postal addresses or if you think that the parsing is not pertinent, you can disable it by setting the useAddressParserWhenGeocoding option to false or specify the 'postal' parameter to false at query time (finer grain).
Geocoding is the process of find GPS coordinates for a given place, but if you only need to search for address and don't care about common places (e.g : Eiffel Tower, hotel XXX), you can set the searchForExactMatchWhenGeocoding option to false. it will also increase the performance.
The relevance needs to be tuned again and again. But it is very important to avoid regression, and we should know what the impact of the changes are along the development process. That's why, as for performance
, I do some relevance tests. The number of tests grow day after day, and for each feedback I get on relevance I try to create a test.
There is a dedicated page
to make feedback on geocoding and address parsing. Feedback on relevance is very important because I don't know every countries specific details. Be asssured that every feedback is taken into account and the necessary changes will be done if there is a bug. Thanks in advance.
A single server has to process a lot of requests, since the beginning of Gisgraphy development, it has been a priority to get good performance internally and to be scalable. Gisgraphy is designed internally to have good performance. It
- uses indexed data
- Uses some preprocessed data (that's one of the reason why the import took so long!), street length, middle point, administrative division in tree, link streets to their city,...
- uses caching - that takes a little bit of memory if you doing worldwide geocoding, but it is worthwhile
Improvements and tuning
Apart from Gisgraphy, there are some common consideration that can be done on PostgreS and SolR. For SolR you can read this article
. For PostgreS, I suggest you read the PostgreS wiki page on performance
and the one on tuning
Now let's talk about Gisgraphy :)
Import only the country you need : the PostgreS query planner does not use the plan when there is a lot of entries. and the more you import, the more you need memory.
For the street service, restrict the radius in 'deep zone' (where there is a lot of streets) or use distance=false. because in 10 km around a point in New York there is a lot of streets and if you want to sort by distance, we have to calculate the distance for each street found to sort the results.
Check to see if the database indexes are created (connect to the database '\d+ openstreetmap' or '\d+ city' and verify that GIST indexs are on location or shape. If not, run the createGISTindex.sql file (in the sql directory of the distrib). Run a vacuum analyse on the database.)
optimise the fulltext search engine, simply call the following URL http://localhost:8080/solr/update?optimize=true&waitFlush=false
For information, a single Gisgraphy server (core I7, 4Gb of memory) can handle :
|Webservice||Number of requests per second|
|Street/reverse geocoding||44 req/s|
|Find nearby||74 req/s|
|Address parser||196 req/s|
All the JVMs do not perform equally. After several tests, the best configuration is the Sun/Oracle one in server mode
. On the last Ubuntu version, the JVM is OpenJDK and is not really very fast.
Configure the JVM memory: The needed memory depends on the amount of data in the fulltext engine. You will need 2 Gb if you import all countries (you also have to leave some memory for the operating system and the PostgreS server).
In general, one server is enough, but sometimes one instance is not sufficient and you will need to run several instances. The next paragraphs are for those who want to scale with more than one server.
To process as many requests as possible, Gisgraphy has been designed to be scalable : you can have as many servers as you want.
Once the import is done, the data is in a read-only mode and the webservices simply (hum ;) ) run queries on the database or on the fulltext engine. That's the key of Gisgraphy scalability.
The data is stored in two datastores :
- In the PostgreS database
- In the SolR fulltext search engine
Both offers distributed/replicated/cluster capabilities.
For PostgreS you have the choice to use one clustered PostgreS server
or use several PostgreS servers independently. Same thing for SolR : put it in a cluster
or to use several servers independently, there is a SolrCloud
It can be a little bit tricky to set up all of this but a simple architecture can be : n * (1 Gisgraphy server + 1 PostgreS +1 SolR) and then put a load balancer to share the load across the several servers, and that's done :)
Gisgraphy tries to focus on relevance and performance. If you have specific needs and don't know how to optimize for them, feel free to post a message on the forum