I needed to geocode several thousand addresses from an old database. The addresses were exported into a pipe-delimited CSV file which contained extra info I didn’t need. After extracting and prepping the addresses into a separate text file I geocoded the addresses via a simple bash script.
Using Sed & Awk to extract and prep the addresses
# the exported CSV file using pipe | as delimiter
"SomeID"|"BusinessName"|"StreetAddress1"|"StreetAddress2"|"City"|"State"|"Zip"|"Phone"|"Fax" "34"|"Super Store"|"12345 Main St"|""|"Amarillo"|"TX"|"79109"|"123-456-7890"|"234-567-8901" "85"|"OK Store"|"234 Wescott"|"Suite 100"|"Oklahoma City"|"OK"|"73104"|"345-678-9012"|"456-789-0123"
# piped commands
$ sed '1,1 d' exported.csv | sed s/\"//g | sed s/\ /\+/g \ | awk -f addresses.awk | sed '/^ *$/d' > prepped.txt
Let’s take a look at each piped command
# remove 1st line (header info we don’t need)
$ sed '1,1 d' exported.csv
# results are piped to remove double quotes
| sed s/\"//g
# and those results are piped to replace spaces with +
# (Google’s geocoding service recommends separating words with pluses + rather than spaces)
| sed s/\ /\+/g
# results are piped to an awk script (see below) to get only the address fields we want to use
# awk -f
\ | awk -f addresses.awk
# those results are piped to remove blank lines
# and save results to the text file the bash script will use
| sed '/^ *$/d' > prepped.txt
A closer look at the straightforward awk script
# addresses.awk - - print addresses
# set the pipe as delimiter
# (our CSV file used pipe delimiter rather than commas)
BEGIN { FS = "|" }
{
# the business name is in field 2
# business = $2
# concatenate typical address fields
# street in 4, city in 6, state in 7, zip in 8
# we earlier replaced spaces with pluses within address fields
# hence "Oklahoma City" became "Oklahoma+City"
# now we insert pluses between fields
address = $4"+"$6"+"$7"+"$8
# the data we're looking for
# will later be processed by the bash script
print address
}
# print the number of lines that were processed
# use this as a quick validation of the number of addresses
END { print NR }
# and the prepped.txt file
12345+Main+St+Amarillo+TX+79109 234+Wescott+Oklahoma+City+OK+73104
With a text file of addresses to geocode, the bash script is up next.
Until this project, I’d never worked much with Sed & Awk, but using these powerful tools has really crystalized for me the Unix Philosophy.