Fast nearest neighbour in PostGIS

A common task in GIS is solving nearest neighbour queries, for example finding the closest lake or the closest school for each row in a table of households. If the table to search is large, with thousands or even millions of records, it is very heavy and costly to do the search by computing and comparing the distance to every feature. A common solution is to narrow the search by first applying an index-based search with ST_DWithin. Different methods to do this are discussed at

PostGIS online

Maybe you have seen it. It has been up and running for a few weeks now, and it is starting to take the shape I want. What?…
http://www.postgisonline.org, a site for testing, training and showing PostGIS.

It is a web site intended to let you discover the power of handling spatial data in a relational database environment. You can write any SELECT statement that the latest PostgreSQL/PostGIS releases support against the database and get the answer back as a table, or as a map if it includes spatial data in a field named “the_geom”.

It is also possible for anyone to write tutorials and run them through the site. A tutorial can present a text, SQL code, a background picture and a background map for each page. This concept has to be developed further, but I think it is a good start and I hope it will be used to show the capabilities of PostGIS. I will use it to continue writing about “How to use the new distance functions”. Instructions for writing tutorials will show up in the documentation area of the site. So far I have written two tutorials, which can be found here

The spatial functionality is done with MapServer, which shows the tables created by the SQL query the user writes.

So, this is a first version. I hope that it will be used, and maybe it can become a project with more people involved. There is a lot of functionality that could be added, like PL/R with some way to show diagrams, pgRouting, and some way to show WKT Raster.

Oh, yes the logo…. It is a creation made by my wife 🙂

Welcome !

Strange behavior by design of the spatial function Filter in SQL Server 2008

In SQL Server 2008 there is a spatial function called Filter, documented here:
http://msdn.microsoft.com/en-us/library/cc645883.aspx
This function makes a fast index-based scan for geometry intersection. It guarantees to return all intersecting cases, but might return non-intersecting cases as well. So this is a first filtering and, as I understand it, an internal part of STIntersects. This functionality has nothing to do with the bounding box comparison PostGIS does as a first filtering before the real ST_Intersects calculation. In SQL Server, Filter returns a more accurate answer to the intersection question. This discussion explains a lot about how it works:
social.msdn.microsoft.com/Forums/en/sqlspatial/thread/6e1d7af4-ecc2-4d82-b069-f2517c3276c2

The problem with this function is that, as the documentation says:

“In cases where an index is not available, or is not used, the method will return the same values as STIntersects() when called with the same parameters.”

Why is this a problem? Well, it means that the function gives different answers with and without an index for exactly the same query and dataset. I think this is a little problematic in itself for a function that is not totally internal. But it can become nearly absurd in some cases. Look at this query:

Select a.id, b.id from table1 a, table2 b where a.geom.Filter(b.geom)=1;

If the geometries in table1 and table2 are indexed, we will get a fast answer, which might contain more geometry combinations than those actually intersecting. That is no problem; that is the whole idea. The problem shows up if we want to see the result of the Filter function in the select part of the query, like this:

Select a.geom.Filter(b.geom) as Filtered, a.id, b.id from table1 a, table2 b where a.geom.Filter(b.geom)=1;

Then the interesting thing happens: since the index won’t be used in the select part (using an index makes no sense there), the Filter function gives the same result there as STIntersects, just as the documentation says. So even though we “filter away” all cases where Filter returns anything other than 1, we get rows where the column “Filtered” is 0.
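To make the effect easier to see, here is a toy model in Python. It is only a sketch under loud assumptions: the geometries are axis-aligned rectangles, the "index" is a made-up uniform grid with cell size `CELL`, and `filter_` is a hypothetical stand-in that says yes whenever two rectangles touch the same grid cell. It is not how SQL Server implements Filter internally; it only mimics the documented behaviour of being approximate with an index and exact without one.

```python
import math

CELL = 10.0  # hypothetical grid cell size, standing in for the spatial index


def cells(rect):
    """Grid cells touched by a rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return {
        (cx, cy)
        for cx in range(math.floor(xmin / CELL), math.floor(xmax / CELL) + 1)
        for cy in range(math.floor(ymin / CELL), math.floor(ymax / CELL) + 1)
    }


def st_intersects(a, b):
    """Exact intersection test for two rectangles."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def filter_(a, b, index_used):
    """Toy Filter: approximate when an index is used, exact otherwise."""
    if index_used:
        return bool(cells(a) & cells(b))  # may give false positives
    return st_intersects(a, b)


table1 = {1: (1, 1, 2, 2)}
table2 = {7: (7, 7, 8, 8)}  # same grid cell as row 1, but not intersecting

# "where" part uses the index, "select" part does not
rows = [
    (int(filter_(ga, gb, index_used=False)), ida, idb)
    for ida, ga in table1.items()
    for idb, gb in table2.items()
    if filter_(ga, gb, index_used=True)
]
print(rows)  # [(0, 1, 7)]
```

The row passes the where part because the two rectangles share a grid cell, but the same call in the select part reports 0 because there the exact test is used.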

Here is a picture of my practical example from SQL Server Management console. As you can see, I get 144 rows back. I use exactly the same function in the where part and in the select part, but apparently get different answers. It is fully logical given how the function is designed, but I don’t like it.

I guess the reason for this design is that it is difficult or impossible to get the same result without the index.

But to me it looks quite ugly.

Good news for Windows users: PostGIS 1.5 now in Stack Builder

Today PostGIS 1.5.0 reached Stack Builder. That means that if you are using Windows, it’s now very simple to try the new functionality. Thanks, Regina and Leo.

A fully functional installation with PostgreSQL 8.4.2 and PostGIS 1.5.0 is now just a few clicks away.

Download the one-click installer for PostgreSQL from
http://www.enterprisedb.com/products/pgdownload.do#windows

Then, at the end of the installation, choose yes to start Stack Builder and mark PostGIS 1.5 under spatial extensions.

Ready to go 🙂

PostGIS 1.5.0 released

I hardly believed my eyes the other day when I saw the headlines roll. PostGIS 1.5.0 is released and totally without any suffix like beta or rc 🙂

With 1.5.0 the knowledge has reached PostGIS that the planet Tellus is not flat but a globe. You can now calculate the distance from Stockholm, Sweden to Oshakati in northern Namibia and get the answer you would get from measuring in practice, without the need to dig a tunnel. If you use the new geography type you will get the distance around the globe from point 1 to point 2.
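The difference between the tunnel and the trip around the globe can be sketched in a few lines of Python. This is only an illustration, not what PostGIS does internally: it assumes a perfectly spherical earth with a mean radius of 6371 km, and the coordinates for Stockholm and Oshakati are my own rough approximations. The function name `surface_and_tunnel_km` is made up for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: a sphere with the mean earth radius


def surface_and_tunnel_km(lat1, lon1, lat2, lon2):
    """Return (distance along the surface, straight tunnel distance) in km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    # haversine term; clamp to 1.0 to guard against rounding
    h = min(1.0, math.sin(dphi / 2) ** 2
            + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    half_chord = math.sqrt(h)
    surface = 2 * EARTH_RADIUS_KM * math.asin(half_chord)  # great circle
    tunnel = 2 * EARTH_RADIUS_KM * half_chord               # chord through the earth
    return surface, tunnel


# Approximate coordinates (assumptions): Stockholm and Oshakati
surface, tunnel = surface_and_tunnel_km(59.33, 18.07, -17.78, 15.70)
print(f"around the globe: {surface:.0f} km, through a tunnel: {tunnel:.0f} km")
```

The tunnel is always the shorter of the two, and the gap grows with the distance between the points.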

But most of the PostGIS functionality is still only applicable to the planar way of looking at the earth, one part at a time.

The full release document can be found here:

http://postgis.org/news/20100204

A few words about this blog

I haven’t tried blogging before, and I never thought I would. But I have found myself getting interesting discussions and information from other blogs, so I thought I should give it a try.

About the name jordogskog: it is Norwegian. It is three words, jord og skog. The direct translation to English would be soil and forest, but it is used to mean agriculture and forestry. My education is in forestry and my daily work is related to forestry. I have a great interest in how we use our land areas and, in the larger view, our earth. To me, the step from that to GIS and open source is very small, so that is what I use a lot of my spare time for.

ST_Distance, the faster edition or Birgers Boost

When I was working on the new functions described in the previous post, I found that the distance calculation in general is very heavy and slow. The distance function gets two geometries and has to find the shortest distance between them. The approach has been to calculate the distance for all possible combinations of vertex-vertex and vertex-edge between the two geometries. That means that two geometries with 1000 vertices each cause one million iterations, and even if computers are fast, that takes some time.
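As a rough illustration of that old approach, here is a minimal Python sketch. It is not the PostGIS C code: it assumes two linestrings given as lists of (x, y) tuples that do not cross each other, and the function names are made up for the example. It also counts the comparisons so the growth is easy to see.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate segment, treat it as a point
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # position of the projection of p along the segment, clamped to its ends
    r = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    r = max(0.0, min(1.0, r))
    return math.hypot(p[0] - (a[0] + r * dx), p[1] - (a[1] + r * dy))


def brute_force_distance(line_a, line_b):
    """Compare every vertex of one line with every edge of the other."""
    best = math.inf
    comparisons = 0
    for verts, other in ((line_a, line_b), (line_b, line_a)):
        for p in verts:
            for start, end in zip(other, other[1:]):
                comparisons += 1
                best = min(best, point_segment_distance(p, start, end))
    return best, comparisons


road = [(0, 0), (10, 0)]
river = [(5, 3), (5, 8)]
print(brute_force_distance(road, river))  # (3.0, 4)
```

With n and m vertices the loop runs n·(m−1) + m·(n−1) times, which is where the roughly one million comparisons for two 1000-vertex geometries come from.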

The ideas for how to make it faster came to me around the time my son was born. I guess you get some extra boost from something like that. I was home from work for 10 days to help my wife and son, and I did, I promise 🙂 But I also had time to try some ideas for making distance calculations faster. Because of this I call it Birgers Boost, after my son Birger.

The idea was to find a way to avoid doing the distance calculation between each and every vertex. I thought that at least the ones behind the middle of the geometry must be possible to skip. I imagined something like a wall that I projected against the geometries, so that I could sort the vertices in the order they appear on the other side of the wall as I move it through the geometry. Maybe that doesn’t make sense, but I thought it was a little fun to describe how the idea appeared.

The resulting algorithm uses a line from the middle of the first geometry to the middle of the second geometry. It orders the vertices along that line and calculates the distances in order of how close they are along that line. The big difference from the old function is that the preparation, giving each vertex a value along this line, only happens once per vertex. So in the example of 1000 vertices per geometry it takes only 2000 calculations to get those values. Then, when the vertices are ordered, we can do the distance calculations in the right order. And when the distance between those abstract walls that I imagined is bigger than the smallest distance found so far, we know that the shortest distance has been found. How many distances we have to calculate before we know this will vary depending on how the geometries are related to each other.

From the testing we have done, it seems to give quite a good increase in speed in general. For larger geometries it is between 10 and 100 times faster than the old algorithm. In some special cases it is not that fast, and in some cases it is even faster.

This way of doing it will not work if the geometries overlap. The easiest way to be sure they don’t overlap is to check for overlapping bounding boxes. So, if the bounding boxes overlap, the calculation is sent to the old hard way of doing it. The same goes if one of the geometries is a point, because then there is nothing to gain, so it is done the same way as before.
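The decision described above could look something like this in Python. It is only a sketch of the rule as I have described it, with geometries as plain lists of (x, y) vertices and made-up function names; the real code in PostGIS is C and handles more cases.

```python
def bbox(verts):
    """Bounding box of a vertex list as (xmin, ymin, xmax, ymax)."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    return min(xs), min(ys), max(xs), max(ys)


def bboxes_overlap(a, b):
    """True if two bounding boxes share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def choose_distance_method(verts_a, verts_b):
    """Pick the old or the new way of calculating the distance."""
    if len(verts_a) == 1 or len(verts_b) == 1:
        return "old"  # a point: nothing to gain from ordering
    if bboxes_overlap(bbox(verts_a), bbox(verts_b)):
        return "old"  # the geometries might overlap
    return "new"      # boxes are apart, so the geometries cannot overlap


print(choose_distance_method([(0, 0), (1, 1)], [(5, 5), (6, 6)]))  # new
print(choose_distance_method([(0, 0), (4, 4)], [(3, 3), (6, 6)]))  # old
```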

This is a problem, but hopefully it will be solved. Paul Ramsey has come up with ideas that might make my way of doing it short-lived; see his blog:
http://blog.cleverelephant.ca/2009/11/is-good-enough-good-enough.html
He is mostly discussing his new geography functions, but it will probably be a good way of doing it for geometry too. So in PostGIS 2.0 the development will continue 🙂

These distance calculation enhancements might be quite important because they make it possible to calculate directly with the geometries in nearest neighbour calculations and things like that, instead of using the centroids. Using points will still be faster, but sometimes it may be useful to be able to run on the whole geometry, and before it was often more or less impossible because the calculations were too heavy.

This will be in PostGIS 1.5. A beta release will hopefully be out soon. For Windows there are experimental builds already available here:
http://postgis.org/download/windows/experimental.php
And of course the source code is available to compile for other platforms.

I have written some lines in the wiki too, to describe this:
http://trac.osgeo.org/postgis/wiki/NewDistCalcGeom2Geom

Shortest line and other new functionality in PostGIS 1.5

One and a half years ago I found PostGIS, and I quickly became a fan. Handling spatial data with SQL is a wonderful way of doing it. PostGIS also has a great amount of functionality, and if something is missing, no one will stop you from creating it. When I realized that, I understood that I could no longer complain about functionality I had missed in other GIS systems. I have done some Avenue scripting in ArcView 3.x and solved a lot of tasks that way. But I have missed an easy way to find out between which points the distance function measures that minimum distance.

Let’s say you are working with linestrings of rivers and you want to know how close a linestring that represents a road is to a river. OK, the distance function tells you that the minimum distance is 20 meters. Great, but the next question will be: where? Where is the road only 20 meters away from the river? A couple of times I have wanted that information, and I have always imagined that it has to be somewhere in there, in the function. To find the minimum distance you first have to identify where to measure, was my thought. That was partly right, I found.

That’s the great thing about open source: if you are wondering how something is done, the code is there to read. Since I had never studied C before, I didn’t have very high expectations of understanding anything. But thanks to quite good commenting and a clean structure, I managed to put this together:
http://www.jordogskog.no/distance.html
The minimum distance between two geometries has to be between two vertices or between one vertex and one edge. The distance calculation iterates through the vertices and edges defining the input geometries, comparing them one by one. The distance between two vertices is found with the Pythagorean theorem. Calculating the distance between a vertex and an edge is a little worse. Search for “How do I find the distance from a point to a line?” in this link:
http://www.faqs.org/faqs/graphics/algorithms-faq/
There is a description of how to get the distance from the line to the point. That is the way it was done before. But there is also a description of how to identify the point on the edge (line) from which the shortest distance is found. Time for copy and paste. When the overall shortest distance is found, the points defining that distance are returned to the user as a line. I found a line to be the best way of returning the information, because then the user can get both the first and the last point from it, and the distance from the length of the line. The use of this functionality will probably be, as described in the beginning, to identify where the minimum distance is found. Let’s say you are sitting on a big island with your laptop, asking yourself from where you should swim to get the shortest way to shore. Now that problem is solved. For convenience, the first point of ST_Shortestline can also be found with the function ST_Closestpoint.
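To show the point-on-edge trick from the FAQ, here is a small Python sketch. It is an illustration and not the PostGIS source: geometries are plain lists of (x, y) tuples, the lines are assumed not to cross, and `closest_point_on_segment` and `shortest_line` are names I made up. For a point P and an edge A-B, the value r = (AP · AB) / |AB|² tells how far along the edge the closest point lies, and clamping r to the range 0..1 keeps it on the edge.

```python
import math


def closest_point_on_segment(p, a, b):
    """Point on the edge a-b that is closest to p."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:  # degenerate edge
        return (float(a[0]), float(a[1]))
    r = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    r = max(0.0, min(1.0, r))  # stay on the edge
    return (a[0] + r * dx, a[1] + r * dy)


def shortest_line(line_a, line_b):
    """Return (point on line_a, point on line_b) with the shortest distance."""
    best, best_len = None, math.inf
    # vertices of A against edges of B
    for p in line_a:
        for start, end in zip(line_b, line_b[1:]):
            q = closest_point_on_segment(p, start, end)
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            if d < best_len:
                best, best_len = (p, q), d
    # vertices of B against edges of A, keeping the A point first
    for p in line_b:
        for start, end in zip(line_a, line_a[1:]):
            q = closest_point_on_segment(p, start, end)
            d = math.hypot(p[0] - q[0], p[1] - q[1])
            if d < best_len:
                best, best_len = (q, p), d
    return best


road = [(0, 0), (10, 0)]
river = [(5, 3), (5, 8)]
start, end = shortest_line(road, river)
print(start, end)  # (5.0, 0.0) (5, 3)
print(math.hypot(end[0] - start[0], end[1] - start[1]))  # 3.0
```

The first point of the returned pair plays the role of the closest point, and the length of the line is the minimum distance.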

From this rewriting I also managed to get a maximum distance calculation working, ST_Maxdistance. Then it was natural to also add a longest line function, which relates to ST_Maxdistance as ST_Shortestline relates to ST_Distance.
To make the symmetry complete I also added ST_DFullywithin. That function returns true if the maximum distance between two geometries is smaller than or equal to the last parameter. It is just like ST_DWithin, but with maximum distance instead of minimum distance.
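A small Python sketch may help to see how the maximum distance functions relate to each other. It is not the PostGIS code, and the function names are my own. It relies on the assumption that the largest distance between two geometries is always found between two vertices, so comparing all vertex pairs is enough here.

```python
import math
from itertools import product


def longest_line(verts_a, verts_b):
    """Pair of vertices (one from each geometry) that are furthest apart."""
    return max(
        product(verts_a, verts_b),
        key=lambda pair: math.hypot(pair[0][0] - pair[1][0], pair[0][1] - pair[1][1]),
    )


def max_distance(verts_a, verts_b):
    """Length of the longest line."""
    a, b = longest_line(verts_a, verts_b)
    return math.hypot(a[0] - b[0], a[1] - b[1])


def d_fully_within(verts_a, verts_b, limit):
    """True if even the furthest-apart points are within the limit."""
    return max_distance(verts_a, verts_b) <= limit


line_a = [(0, 0), (3, 0)]
line_b = [(0, 4), (3, 4)]
print(longest_line(line_a, line_b))         # ((0, 0), (3, 4))
print(max_distance(line_a, line_b))         # 5.0
print(d_fully_within(line_a, line_b, 4.5))  # False
```

The two lines are 4 units apart at the closest, so a minimum distance test with a limit of 4.5 would say yes, while the fully-within test says no because the far corners are 5 units apart.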

So, as a summary:
the old functions working with minimum distance, ST_Distance and ST_DWithin, have now got a new friend, ST_Shortestline, and there are also the corresponding functions for max distance: ST_Maxdistance, ST_Longestline and ST_DFullywithin.

I will get back soon and tell how I found what may be the fastest distance calculation, included in 1.5.