PostgreSQL: Comparing sets of (similar) street names

Have you ever had the need to join tables by a street name string? Maybe to assign a street code from one table to another. A common problem here is that names could be written in different ways when they actually refer to the same street, for example Via Giacomo Leopardi should match Via Leopardi G.Queen Victoria Street should match Victoria St. and so on...

This join can hardly be performed with EQUAL, LIKE or REGEXP operators. How can we overcome this unequality problem for a proper join?

I made up a short query that did the trick for me (*).

First, you have to import the Trigram module into your database. The Trigram module offers string similarity operators and functions. Login as db superuser on the local host and insert the module like this:

  • PostgreSQL 9.1+
    you might locate and execute pg_env script before proceeding

    $ psql -U postgres -h localhost mydb
    mydb=# CREATE EXTENSION pg_trgm;
    mydb=# \q

  • PostgreSQL 8+
    you might locate and execute pg_env script before proceeding

    $ pg_config --sharedir
    $ cd /usr/share/postgresql/8.4/contrib
    $ psql -U postgres -h localhost mydb <pg_trgm.sql

Now, consider the following example as an explanation.

Table 1: list of street names to be coded
select * from table1;

Table 2: official street name list and street codes
select * from table2;

To get the right comparison I simply executed the next query. The underlying reasoning is this: for each street name in table 1, scan all the street name items in table 2 and suggest the best fitting in terms of similarity, meaning the first ranked in the trigram similarity function. 

select * from
   (select *, rank() over (partition by st_name_table1 order by similarity desc) from
      (select *, similarity(st_name_table1, st_name_table2) from
          table1, table2
      ) q0
   ) q1
where rank = 1
order by similarity desc

And this is the result (actually I removed from the real output the exactly identical pairs).
You get a table with all the entries of table 1 (non coded) each one related to the best fitting in table 2 (coded streets), just like a left join.
  • green: successfully matched (astonishing!)
  • orange: matched, but you might be careful
  • red: table 1 street names don't get a valid companion, should remain uncoded

Caution! This is suitable for quite small data tables (approximately <1000 rows). Otherwise, you have to set up column indexes and try other ways to get faster performance.

(*) these notes are intended for a PostgreSQL EnterpriseDB regular install or a linux distribution install

No comments:

Post a Comment