CSV import using CLI (header, NULL)

CSV import using CLI (header, NULL)

Christof Böckler
Hi,

I want to share some thoughts and make some suggestions about the SQLite
3 command line interface (CLI) tool, especially its behaviour when
importing CSV files.

CSV files are probably even more common for data exchange than SQLite
database files. I consider it to be good practice to include a header
line with column names in every CSV file. Metadata should go with the
data. This prevents me from mixing up two columns that contain numbers
with similar value distributions.

Let’s look at an example. A file named data.csv contains three lines:

id,val1,val2
A,27,8
B,3,12

Now …
sqlite3
sqlite> .import data.csv tab

works and looks good at first, but there is a problem with numerical data.

sqlite> SELECT * FROM tab ORDER BY val2;
B,3,12
A,27,8

This is because on import all three columns were created with affinity
(not to say data type) TEXT (see .schema). As a consequence, all numbers
were imported as strings.
'12' < '8' holds lexicographically, but not numerically.
With the CSV file in mind, I clearly expect to see 8 on the first line
of the above result.
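
For illustration, typeof() confirms the problem, and a CAST in the ORDER BY
is one way (an untested sketch) to get numeric ordering without touching the
schema:

sqlite> SELECT typeof(val2) FROM tab LIMIT 1;
text
sqlite> SELECT * FROM tab ORDER BY CAST(val2 AS INTEGER);
A,27,8
B,3,12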

How to work around this? Just define the table in advance with
appropriate data types (affinity INTEGER):

sqlite> CREATE TABLE tab (id TEXT, val1 INT, val2 INT);

But now the above .import command will not work as expected, because it
will result in three rows in our table tab. The first row contains the
header line.

Two different workarounds come to my mind:
a) sqlite> DELETE FROM tab WHERE rowid = 1;  -- Dirty hack!
b) sqlite> .import data.csv temp_tab
    sqlite> INSERT INTO tab SELECT * FROM temp_tab;
    sqlite> DROP TABLE temp_tab;  -- Two tedious extra lines

Neither approach is very appealing to me. To make CSV files with a
header line first-class citizens, I suggest this instead:
sqlite> .import -h data.csv tab
should ignore the first line of data.csv. The import should fail if
table tab does not already exist.
This is both shorter and more elegant than either workaround.


Now on to a second issue. Let’s assume you have successfully imported a
file containing these four lines:
id,val1,val2
A,27,8
B,3,12
C,,1
into the table tab mentioned above, resulting in three rows. Notice the
missing value in column val1 on the last line. This missing value is
imported as an empty string '' regardless of the affinity of column val1.

That leads to (mathematically) unexpected results from aggregate functions:
sqlite> SELECT SUM(val1) FROM tab;
30    -- Well, treating '' like 0 is OK in this case
sqlite> SELECT COUNT(val1) FROM tab;
3     -- but not so here; only two values/numbers were given in data.csv
sqlite> SELECT AVG(val1) FROM tab;
10    -- the average of 3 and 27 is 15
sqlite> SELECT MAX(val1) FROM tab;
''    -- not to be expected when looking at data.csv

OK, I hear you say, what about this workaround:
sqlite> UPDATE tab SET val1 = NULL WHERE val1 = '';
This makes the surprises above go away, but it is tedious to repeat for
every column that should contain only numbers.
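
To take some of the tedium out of it, the UPDATE statements could at least
be generated instead of typed by hand. A rough sketch, assuming your build
has the pragma_table_info() table-valued function:

sqlite> SELECT 'UPDATE tab SET ' || name || ' = NULL WHERE ' || name || ' = '''';'
   ...>   FROM pragma_table_info('tab')
   ...>  WHERE type IN ('INT', 'INTEGER', 'REAL', 'NUMERIC');
UPDATE tab SET val1 = NULL WHERE val1 = '';
UPDATE tab SET val2 = NULL WHERE val2 = '';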

My suggestion is: If someone goes the extra mile and defines a table in
advance in order to have an appropriate numerical affinity (INTEGER, REAL
or NUMERIC) for a column, then it is OK to encode a missing value as NULL
instead of ''. It seems right though to keep the current behaviour for
columns with affinity TEXT (the default) or BLOB.
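
To illustrate what that would buy us: once the missing value really is NULL,
the aggregates from above behave as expected (standard NULL handling, shown
here after applying the UPDATE workaround):

sqlite> SELECT SUM(val1), COUNT(val1), AVG(val1), MAX(val1) FROM tab;
30,2,15.0,27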

To sum things up:
1. There should be no penalty for using header lines in CSV files. Thus a
new flag -h for .import would be much appreciated.
2. Missing values in columns with numerical affinity should show up as
NULL values.

Thanks for reading; I look forward to your opinions on these issues.

Greetings
Christof

Re: CSV import using CLI (header, NULL)

Shawn Wagner
To import a csv file with headers into an existing table, you can use

.import '| tail -n +2 yourfile.csv' yourtable

to skip the header line.

Re: CSV import using CLI (header, NULL)

Dominique Devienne
On Sat, Feb 29, 2020 at 1:42 PM Shawn Wagner <[hidden email]>
wrote:

> To import a csv file with headers into an existing table, you can use
> .import '| tail -n +2 yourfile.csv' yourtable
> to skip the header line.


On Unix, and by shelling out to native tools, so it is not portable.
The CLI ought to have something built in, if it doesn’t already.

Re: CSV import using CLI (header, NULL)

Warren Young
On Feb 27, 2020, at 11:51 PM, Christof Böckler <[hidden email]> wrote:
>
> 1. There should be no penalty for using header lines in CSV files. Thus a new flag -h for .import would be much appreciated.

More than that, SQLite should be able to *actively use* the header when present.

For instance, given:

foo,bar,qux
1,2,3

then there should be a way to import only the first and third columns, perhaps with a command like so:

.import data.csv tablename foo,qux

I’ve needed such a thing multiple times.

One case that comes to mind is that tablename has only two columns, and you’re purposefully subsetting the CSV data in SQLite.  This is common when importing DBs from outside sources, where the other DB has more columns than your local SQLite DB.  It’s annoying to have to preprocess the data to strip the “extra” columns out, and even more annoying to subset it in SQLite by mirroring the outside DB’s structure and then doing the usual workaround for the lack of ALTER TABLE DROP COLUMN.
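
A rough sketch of that workaround, using a hypothetical staging table temp_all, just to show how clumsy it is next to a built-in column-subset option:

sqlite> .import data.csv temp_all
sqlite> INSERT INTO tablename (foo, qux) SELECT foo, qux FROM temp_all;
sqlite> DROP TABLE temp_all;

(Since temp_all doesn’t exist, .import takes its column names from the header line; everything lands there as TEXT, but the target table’s affinities apply when the rows are copied over.)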

Another case I’ve run into before is that the table layout of the CSV is the same as in SQLite, but you want one or more of the columns to get their default value for some reason, not take them from the CSV.  You can see this with time-varying data when the CSV is output from SQLite at time t0 and reimported at t1 after some of the data has gone stale, so that letting the stale columns revert to their defaults is better than importing obsolete data.

To make the second example concrete, imagine an HVAC monitoring system’s DB: the names of the stations and the wear data are still useful, but the last known temperature shouldn’t be imported because the old data could be entirely inaccurate, and fresh data should be coming along soon after the DB table is restored from the backup.  In the meantime, the system can simply report “I don’t know” for the temperature measurements.

> 2. Missing values in columns with numerical affinity should show up as NULL values.

I don’t think one hard-and-fast rule works for everyone here, if for no other reason than that there must be people depending on the current behavior.

Instead, let it be a mode:

.blank default
.blank blank
.blank zero
.blank null

This proposed setting tells the CSV importer how to handle blanks: use the DB column’s default value, leave them be, make them zeroes, or make them SQL NULLs.
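
To make the semantics concrete, the last two modes amount to roughly this post-import SQL for a single column (a sketch reusing the earlier tab/val1 example; “default” would additionally need the column’s declared default, e.g. from pragma_table_info):

sqlite> UPDATE tab SET val1 = NULLIF(val1, '');                          -- .blank null
sqlite> UPDATE tab SET val1 = CASE WHEN val1 = '' THEN 0 ELSE val1 END;  -- .blank zero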

The default value for this setting would be “blank” for backwards compatibility.

Parallel .zero and .null settings could be added to take care of similar cases.  The defaults would obviously be “zero” and “null”, respectively.