Prep addresses for geocoding

I needed to geocode several thousand addresses from an old database. The addresses were exported into a pipe-delimited CSV file which contained extra info I didn’t need. After extracting and prepping the addresses into a separate text file I geocoded the addresses via a simple bash script.

Using Sed & Awk to extract and prep the addresses

# the exported CSV file using pipe | as delimiter

"SomeID"|"BusinessName"|"StreetAddress1"|"StreetAddress2"|"City"|"State"|"Zip"|"Phone"|"Fax"
"34"|"Super Store"|"12345 Main St"|""|"Amarillo"|"TX"|"79109"|"123-456-7890"|"234-567-8901"
"85"|"OK Store"|"234 Wescott"|"Suite 100"|"Oklahoma City"|"OK"|"73104"|"345-678-9012"|"456-789-0123"

# piped commands

$ sed '1,1 d' exported.csv | sed s/"//g | sed s/ /+/g  
| awk -f addresses.awk | sed '/^ *$/d' > prepped.txt

Let’s take a look at each piped command

# remove 1st line (header info we don’t need)

$ sed '1,1 d' exported.csv

# results are piped to remove double quotes

| sed s/"//g

# and those results are piped to replace spaces with +
# (Google’s geocoding service recommends separating words with pluses + rather than spaces)

| sed s/ /+/g

# results are piped to an awk script (see below) to get only the address fields we want to use
# awk -f specifies the script to use

 | awk -f addresses.awk

# those results are piped to remove blank lines
# and save results to the text file the bash script will use

| sed '/^ *$/d' > prepped.txt

A closer look at the straightforward awk script

# addresses.awk - - print addresses

# set the pipe as delimiter
# (our CSV file used pipe delimiter rather than commas)
BEGIN { FS = "|" }

{  
 # the business name is in field 2
 # business = $2

 # concatenate typical address fields
 # street in 4, city in 6, state in 7, zip in 8
 # we earlier replaced spaces with pluses within address fields
 # hence "Oklahoma City" became "Oklahoma+City"
 # now we insert pluses between fields
 address = $4"+"$6"+"$7"+"$8

 # the data we're looking for
 # will later be processed by the bash script
 print address
}

# print the number of lines that were processed
# use this as a quick validation of the number of addresses
END { print NR }

# and the prepped.txt file

12345+Main+St+Amarillo+TX+79109
234+Wescott+Oklahoma+City+OK+73104

With a text file of addresses to geocode, the bash script is up next.

Until this project, I’d never worked much with Sed & Awk, but using these powerful tools has really crystalized for me the Unix Philosophy.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s