Relevance and Performance
Yet another geocoder ?
There are already other geocoders available, so why Gisgraphy? First, to be clear, our goal here is not to denigrate other projects or to claim that Gisgraphy is the best geocoder - not at all! We strongly respect the developers and the communities that work hard, and for free, to develop geocoders and geolocation applications, tools and utilities. But other solutions have yet to fit all our needs, and that is why we developed Gisgraphy. In our opinion, the relevance, usability or price of the existing geocoders is not satisfactory. Let's take the most popular ones : the Google geocoder, Nominatim and Geo::Coder::US :
- Google Maps is not free, not open source, and really expensive ($25,000 USD for one year). But we must admit that its relevance is really good...
- Geo::Coder::US is not worldwide. That's a real limitation...
- On the wiki page of Nominatim it is written : "Search terms are processed left to right. This Search will work: pilkington avenue, birmingham, This search will fail: birmingham, pilkington avenue ". That's not very versatile.
A different philosophy : An Address is not a string !
After more than 11 years of geocoding, this is our conclusion : an address is not a string ! It is the combination of all the possible names of its different parts :
'3355 S Las Vegas Blvd, Las Vegas' and 'the Venetian, The Strip NV 89109 USA' are the same address, but the two strings have no word in common.
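To make the idea concrete, here is a minimal Python sketch (not Gisgraphy's actual code). The record layout, the alternate names and the helper names `normalize` and `matched_parts` are all assumptions made for illustration. It shows two queries that share no word but still resolve to parts of the same address record.

```python
# Hypothetical record: each part of the address carries a set of alternate names.
ADDRESS = {
    "place": {"3355", "the venetian"},
    "street": {"s las vegas blvd", "the strip"},
    "city": {"las vegas"},
    "state": {"nv", "nevada"},
    "zip": {"89109"},
    "country": {"usa", "united states"},
}


def normalize(text):
    """Lowercase, drop commas and collapse whitespace."""
    return " ".join(text.lower().replace(",", " ").split())


def matched_parts(query, record):
    """Return the parts for which one alternate name occurs in the query as whole words."""
    padded = f" {normalize(query)} "
    return {
        part
        for part, names in record.items()
        if any(f" {name} " in padded for name in names)
    }


q1 = "3355 S Las Vegas Blvd, Las Vegas"
q2 = "the Venetian, The Strip NV 89109 USA"

# Words the two raw strings have in common (none for this pair).
shared_words = set(normalize(q1).split()) & set(normalize(q2).split())
# Address parts that both queries hit through some alternate name.
shared_parts = matched_parts(q1, ADDRESS) & matched_parts(q2, ADDRESS)

print(shared_words)
print(sorted(shared_parts))
```

A plain string comparison sees nothing in common between the two queries, while the per-part alternate names link both of them to the same place and street.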
Gisgraphy has its own importer and mixes the best open source gazetteers / databases such as OpenStreetMap of course, but also Openaddresses, GeoNames and Quattroshapes, and it transforms them into an address and POI database. It links each house to its street, each street to its city / ZIP code, and each city to its administrative divisions. All of these are associated with alternate names.
It uses a full-text engine with an address parser to separate the different parts of the address. This makes geocoding simpler and removes ambiguities such as streets named after cities, unit information that can complicate the full-text search, house numbers that look like ZIP codes, etc. To improve relevance, it also uses shapes where possible. Gisgraphy also avoids assigning a street to a city just because that city's center is closer than the center of the city the street actually belongs to.
Gisgraphy is more than a geocoder : it also provides a full stack for common tasks such as reverse geocoding, searching for a place or POI by name (which is different from searching for an address), restricting a search to a given radius around a point, and so on. It also comes with autocompletion and spell-checking, which are very useful and which users have come to expect.
Technically, Gisgraphy comes with some facilities :
- Gisgraphy supports many output formats and languages (XML, JSON, PHP, Ruby, Python, Atom, RSS / GeoRSS)
- An import wizard helps you choose the countries you need and does all the work : no need to be a geocoding guru :) Gisgraphy will download, extract and import the files for you
- You have full control over the data : you can add / delete / modify data via a web interface.
- Auto-completion / auto-suggestion
- Partial search
- All words mandatory or not
In the next section, we will explain how you can set up Gisgraphy to get the best relevance and performance.
Relevance and performance are the two most important things for a geocoder. Two common questions are "Does Gisgraphy have good relevance for my country?" and "How many requests can it handle?".
The relevance strongly depends on the datasets : it depends on the number of entries in OpenStreetMap and in GeoNames. To see how many entries there are for a specific country, we have computed some statistics on streets. There is also a good way to see if a particular region is well covered : simply look at the OpenStreetMap maps. It is the best way to see if there are a lot of streets or POIs, or if a lot of house numbers are in the dataset.
The other thing that improves relevance is the address parser. As the home page says, the address parser divides a single address (given as a string) into its individual components : house number, street type (boulevard, street, ...), street name, unit (apt, building, ...), ZIP code, state, country, city... This is an important step when geocoding because it tells us the meaning of each word. To do so, the parser tries to detect the address patterns of the Universal Postal Union (UPU). The address parser is not implemented for all countries yet (see already implemented countries). If a country is not implemented, we geocode the address as a string with full-text search, and the relevance can be lower (if you choose premium services, we can implement your country ahead of the others if needed). If you don't geocode postal addresses, or if you think that the parsing is not relevant, you can disable it by setting the useAddressParserWhenGeocoding option to false, or by setting the 'postal' parameter to false at query time (finer grained).
Geocoding is the process of finding GPS coordinates for a given place, but if you only need to search for addresses and don't care about common places (e.g. Eiffel Tower, hotel XXX), you can set the searchForExactMatchWhenGeocoding option to false. It will also increase performance.
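As an illustration of the query-time switch, the following Python sketch only builds a request URL; it does not call any server. The 'postal' parameter comes from the text above, but the `/geocoding/geocode` path, the `address`, `country` and `format` parameter names and the `geocode_url` helper are assumptions that should be checked against the web service documentation of your Gisgraphy version.

```python
from urllib.parse import urlencode


def geocode_url(base_url, address, country, use_address_parser=True):
    """Build a geocoding request URL (endpoint path and parameter names are assumed)."""
    params = {"address": address, "country": country, "format": "json"}
    if not use_address_parser:
        # Ask for plain full-text geocoding for this request only.
        params["postal"] = "false"
    return f"{base_url}/geocoding/geocode?{urlencode(params)}"


# A well-known place rather than a postal address: skip the parser for this query.
url = geocode_url("http://localhost:8080", "Eiffel Tower", "FR", use_address_parser=False)
print(url)
```

Setting the parameter per request keeps the global option untouched, which helps when the same server receives both postal addresses and free-text place names.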
The relevance needs to be tuned again and again. But it is very important to avoid regressions, and we need to know the impact of each change throughout the development process. That's why, as for performance and functionality, we run relevance tests. The number of tests grows day after day, and for each piece of feedback we get on relevance, we try to create a test.
There is a dedicated page to give feedback on geocoding and address parsing. Feedback on relevance is very important because we don't know the specific details of every country. Be assured that all feedback is taken into account and that the necessary changes will be made if there is a bug. Thanks in advance.
A single server has to process a lot of requests. Since the beginning of Gisgraphy development, it has been a priority to get good performance internally and to be scalable. To that end, Gisgraphy :
- uses indexed data
- uses some preprocessed data (that's one of the reasons why the import takes so long!) : street length, middle point, administrative divisions in a tree, links from streets to their city, ...
- uses caching, which takes a little memory if you do worldwide geocoding, but it is worthwhile
Improvements and tuning
Apart from Gisgraphy itself, some common tuning can be done on PostgreSQL and SolR. For SolR, you can read this article. For PostgreSQL, we suggest you read the PostgreSQL wiki page on performance and the one on tuning.
Now let's talk about Gisgraphy :)
Import only the countries you need : the PostgreSQL query planner does not use the same plan when there are a lot of entries, and the more you import, the more memory you need.
For the street service, restrict the radius in 'deep zones' (where there are a lot of streets) or use distance=false : within 10 km of a point in New York there are a lot of streets, and if you want to sort by distance, we have to calculate the distance for each street found in order to sort the results.
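The cost described above can be pictured with a small Python sketch. It is not how Gisgraphy computes distances internally (that happens in the database); the haversine formula, the `order_streets` helper and the sample coordinates are assumptions chosen only to show that sorting by distance needs one distance computation per street found.

```python
import math


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in kilometres between two WGS84 points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def order_streets(point, candidates, with_distance=True):
    """Return the candidate streets, sorted by distance only when requested."""
    if not with_distance:
        # Comparable to distance=false: no per-street distance work at all.
        return list(candidates)
    lat, lng = point
    scored = [(haversine_km(lat, lng, s["lat"], s["lng"]), s) for s in candidates]
    scored.sort(key=lambda pair: pair[0])
    return [street for _, street in scored]


# Hypothetical streets around a point in New York.
point = (40.7128, -74.0060)
streets = [
    {"name": "A", "lat": 40.7306, "lng": -73.9866},
    {"name": "C", "lat": 40.7580, "lng": -73.9855},
    {"name": "B", "lat": 40.7130, "lng": -74.0055},
]
print([s["name"] for s in order_streets(point, streets)])
print([s["name"] for s in order_streets(point, streets, with_distance=False)])
```

With three streets the difference is invisible, but with thousands of candidates inside a large radius the per-street computation and the sort add up on every request.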
Check that the database indexes have been created : connect to the database, run '\d+ openstreetmap' or '\d+ city', and verify that there are GIST indexes on location or shape. If not, run the createGISTindex.sql file (in the sql directory of the distribution). Then run a 'vacuum analyse' on the database.
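If you prefer to script this check, the sketch below shows one possible way. It relies on the standard PostgreSQL `pg_indexes` view, whose `indexdef` column contains the index definition (including `USING gist` for GIST indexes). The `tables_missing_gist` helper and the sample rows with their index names are hypothetical; the rows would normally come from running `GIST_CHECK_SQL` with your own database client.

```python
# Query to run against the Gisgraphy database with the client of your choice.
GIST_CHECK_SQL = """
SELECT tablename, indexname, indexdef
FROM pg_indexes
WHERE tablename IN ('openstreetmap', 'city')
"""


def tables_missing_gist(rows, tables=("openstreetmap", "city")):
    """Return the tables that have no GIST index on a location or shape column."""
    covered = set()
    for tablename, _indexname, indexdef in rows:
        definition = indexdef.lower()
        if "using gist" in definition and ("location" in definition or "shape" in definition):
            covered.add(tablename)
    return [table for table in tables if table not in covered]


# Hypothetical result rows: 'city' only has a btree index here.
sample_rows = [
    ("openstreetmap", "osm_shape_idx", "CREATE INDEX osm_shape_idx ON public.openstreetmap USING gist (shape)"),
    ("city", "city_pkey", "CREATE UNIQUE INDEX city_pkey ON public.city USING btree (id)"),
]
print(tables_missing_gist(sample_rows))
```

Any table reported by the helper is a candidate for running createGISTindex.sql as described above.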
Optimise the full-text search engine : simply call the following URL : http://localhost:8080/solr/update?optimize=true&waitFlush=false
For information, a single Gisgraphy server (Core i7, 16 GB of memory) can handle :
|Web service||Number of requests per second|
|Street/reverse geocoding||44 req/s|
|Find nearby||74 req/s|
|Address parser||196 req/s|
Not all JVMs perform equally. After several tests, the best configuration is the Sun/Oracle one in server mode. On the latest Ubuntu version, the JVM is OpenJDK, which is not very fast.
Configure the JVM memory : the memory needed depends on the amount of data in the full-text engine. You will need 2 GB if you import all countries (you also have to leave some memory for the operating system and the PostgreSQL server).
In general, one server is enough, but sometimes one instance is not sufficient and you will need to run several instances. The next paragraphs are for those who want to scale with more than one server.
To process as many requests as possible, Gisgraphy has been designed to be scalable : you can have as many servers as you want. Once the import is done, the data is read-only and the web services simply (hum ;) ) run queries on the database or on the full-text engine. That's the key to Gisgraphy's scalability.
The data is stored in two datastores :
- In the PostgreSQL database
- In the SolR full-text search engine
Both offer distributed / replicated / cluster capabilities.
For PostgreSQL, you can either use one clustered PostgreSQL server or several independent PostgreSQL servers. The same goes for SolR : put it in a cluster (there is SolrCloud) or use several servers independently. It can be a little tricky to set all of this up, but a simple architecture can be : n * (1 Gisgraphy server + 1 PostgreSQL + 1 SolR), with a load balancer in front to share the load across the servers, and you're done :)
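The load-sharing idea can be sketched in a few lines of Python. In production you would normally use a dedicated load balancer rather than application code; the `RoundRobinBalancer` class and the backend URLs are hypothetical and only illustrate how requests are spread across n identical, read-only stacks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn; each backend is one full Gisgraphy stack."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = cycle(backends)

    def next_backend(self):
        """Return the backend that should serve the next request."""
        return next(self._backends)


# Hypothetical servers, each running Gisgraphy + PostgreSQL + SolR.
balancer = RoundRobinBalancer([
    "http://gis1.example.org:8080",
    "http://gis2.example.org:8080",
    "http://gis3.example.org:8080",
])
picked = [balancer.next_backend() for _ in range(4)]
print(picked)
```

Because the data is read-only after the import, any server can answer any request, so a simple rotation like this needs no shared state between the stacks.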
Gisgraphy tries to focus on relevance and performance. If you have specific needs and don't know how to optimize for them, feel free to contact us.