Skip to content
 

CSV parsing with awk

This is another thing people do all the time with awk. Although the CSV format should be more or less standardized, it seems there are still a number of subtle variations floating around. Let's look at some of them.

Simple CSV files (with fields separated by commas, and commas cannot appear anywhere else) are easily parsed by setting FS to ",", so we won't go into further detail here, as there's not much to add to it.

If there can be spaces around fields, and we don’t want them, like eg

    field1  ,   field2   , field3   , field4  

Exploiting the fact that FS can be a regex, we could try using something like

FS="^ *| *, *| *$"
# or, on the command line
awk -F '^ *| *, *| *$' ...

However, this can be problematic because the actual data fields might end up corresponding either to awk fields 1 … NF or 2 … NF (or NF-1), depending on whether the line had leading/trailing spaces or not. So in this case, it’s probably better to just parse using "," for FS as usual, and remove leading and trailing spaces from each field:

# FS=","
for(i=1;i<=NF;i++){
  gsub(/^ *| *$/,"",$i)
  print "Field " i " is " $i
}

If commas can appear in fields provided that they are escaped, like this

field1,field2\,with\,commas,field3,field4\,

a common trick is to replace all the \, sequences beforehand with some character that is not (or should not be!) present in the file, parse the records as usual, and then replace back the sequences that were previously removed. While some implementations of awk allow using any character via the \xnn syntax, so you could use some ASCII control character, usually the special variable SUBSEP (whose default value is 0x1c hex, or 034 octal) is good enough for the job:

# FS="," or whatever appropriate
gsub(/\\,/,SUBSEP)
for (i=1;i<=NF;i++) {
  f=$i
  gsub(SUBSEP,"\\,",f)
  print "Field " i " is " f
}

Another common CSV format is

"field1","field2","field3,with,commas","field4"

Assuming double quotes cannot occur in fields, but here fields can contain unescaped commas. This is easily parsed using

FS="^\"|\",\"|\"$"
# or, if you like
FS="\",\"|\""
# or on the command line (perhaps clearer)
awk -F '^"|","|"$' ...
awk -F '","|"' ...

but keeping in mind that the actual fields will be in position 2, 3 … NF-1. We can extend that to allow for spaces around fields, like eg

   "field1"  , "field2",   "field3,with,commas" , "field4"    

by using

FS="^ *\"|\" *, *\"|\" *$"
# or, on the command line
awk -F '^ *"|" *, *"|" *$' ...

usable fields will still be in positions 2 … NF-1, so there will be no ambiguities. You can NOT use the good old "," for FS here, as the commas within fields will be used as separators.

Another CSV format is as follows (the only quoted fields are those that contain commas):

  field1, "field2,with,commas"  ,  field3  ,  "field4,foo"   

We have a mixture of quoted and unquoted fields here, which cannot parsed directly by any value of FS (that I know of, at least). However, we can still get the fields using match() in a loop (and cheating a bit):

c=0
$0=$0","                                   # yes, cheating
while($0) {
  match($0,/ *"[^"]*" *,|[^,]*,/)
  f=substr($0,RSTART,RLENGTH)             # save what matched in f
  gsub(/^ *"?|"? *,$/,"",f)               # remove extra stuff
  print "Field " ++c " is " f
  $0=substr($0,RLENGTH+1)                 # "consume" what matched
}

As the complexity of the format increases (for example when quotes are allowed in fields, escaped in some way), awk solutions become more fragile. Although I should not say this here, for anything more complex than the last example, I suggest using other tools (eg, Perl just to name one). Btw, it looks like there is an awk CSV parsing library here (I have not tried it, but it looks quite good).

Be Sociable, Share!

2 Comments

  1. IF_Rock says:

    Huh? How are you using apostrophes as string delimiters in awk/gawk?

    • waldner says:

      Well yes, rereading it I see that you are right. The fact is that they were meant to be used in the shell as in -F '^ *"|" *, *"|" *$' 'awk code here', but I ended up writing them in that hybrid and incorrect way.
      I've fixed it now (and included both ways), thanks.

Leave a Reply