Is there a way to perform a muti-level sort and extract of large data sets?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Is there a way to perform a muti-level sort and extract of large data sets?

Ron Barnes
Hello All,

I have approximately 600 million records that need to be sorted and then extracted to a flat file.  I am unable to code a solution using visual Basic .NET.  It was suggested to me that a DB engine could perform my task for me.

Is there a way to accomplish this using the multi-level sort example below?

There are 18 table entries.  I need to extract all 18 entries to create individual records but in a certain order.  I also need to deduplicate the records based upon the sorted output file.  I can take care of the deduplication (I think).  I just need to get the records in the right order first.

I need to sort them as follows...

Sort Field 1 Ascending
Sort Field 2 Ascending WITHIN field 1
Sort Field 3 Ascending WITHIN field 2 WITHIN field 1
Sort Field 4 Descending WITHIN field 3 WITHIN field 2 WITHIN field 1 <== This is a Date field and the most current (Highest) Date to float up
Sort Field 5 Ascending WITHIN field 4 WITHIN field 3 WITHIN field 2 WITHIN field 1

Thank you in advance!

-Ron

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Simon Slavin-3


On 26 Sep 2017, at 6:24pm, Ron Barnes <[hidden email]> wrote:

> I need to sort them as follows...
>
> Sort Field 1 Ascending
> Sort Field 2 Ascending WITHIN field 1
> Sort Field 3 Ascending WITHIN field 2 WITHIN field 1
> Sort Field 4 Descending WITHIN field 3 WITHIN field 2 WITHIN field 1 <== This is a Date field and the most current (Highest) Date to float up
> Sort Field 5 Ascending WITHIN field 4 WITHIN field 3 WITHIN field 2 WITHIN field 1

Trivial in any SQL engine including SQLite.  It looks something like

SELECT * FROM MyTable ORDER BY field1,field2,field3,field4 DESC,field5

SQL users do that sort of thing all the time.

However, if you use just the above command, SQLite will have to perform this sorting of 600M records each time you execute the command, which could take quite a long time — minutes or hours depending on your hardware.  So for any flavour of SQL you would probably tell it to create an index …

CREATE INDEX m_12345 ON MyTable (field1,field2,field3,field4 DESC,field5)

This tells SQL to perform the sorting and save the resulting order on disk.  Then every time you perform the above SELECT command SQL notices it already has the sort-order saved and just uses that one.  This can change the amount of time taken to a few seconds.

My one concern in reading your post is how your dates are formatted.  When putting your date fields into your SQL table you will have to ensure that dates are saved as a day number, or as text which naturally sorts into date order, e.g. YYYY/DD/MM.  You should not expect SQL to sort text such as "19 October 16" correctly.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Warren Young
In reply to this post by Ron Barnes
On Sep 26, 2017, at 11:24 AM, Ron Barnes <[hidden email]> wrote:
>
> I have approximately 600 million records that need to be sorted

Where is the data now?

> There are 18 table entries.

You mean 18 columns per row, right?

> I also need to deduplicate the records based upon the sorted output file.

You speak of VB.NET, which means you don’t have a uniq tool as on POSIX systems:

   https://linux.die.net/man/1/uniq

If you can install Cygwin or WSL on these Windows boxes, then you’d have uniq, as well as a cross-platform solution.  SQLite is available for both Cygwin and WSL.

> I can take care of the deduplication (I think).

The basic functionality of uniq is indeed pretty simple: given sorted input, write as output only lines that don’t repeat the content of the previous input line.

The primary reason to mess with Cygwin or WSL on Windows is simply because using pre-built tools, you don’t have to debug and maintain it.  There’s value in “just run it through uniq.”  Even if you can write it in VB.net in half an hour, you’re vastly over-budget compared to the half second it takes me to type “ | uniq”.

> Sort Field 1 Ascending
> Sort Field 2 Ascending WITHIN field 1

I’m not sure what you mean by “WITHIN”.  Are you simply saying that you want the data sorted first by field 2 and then by field 1, so that when two records have the same field 1 content, that the output has that pair of records ordered by field 2?  E.g.

    Field 1    Field 2
    ---------- ------------
    A          B
    A          C

As opposed to:

    Field 1    Field 2
    ---------- ------------
    A          C
    A          B

If so, that’s trivial SQL, well-covered in Simon’s reply.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Keith Medcalf
In reply to this post by Simon Slavin-3

SELECT DISTINCT * FROM table ORDER BY field1,field2,field3,field4 DESC,field5;

to do the whole sorting and de-duplication in one step ... assuming you want to report duplicate entire rows only once ...


---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Simon Slavin
>Sent: Tuesday, 26 September, 2017 11:37
>To: SQLite mailing list
>Subject: Re: [sqlite] Is there a way to perform a muti-level sort and
>extract of large data sets?
>
>
>
>On 26 Sep 2017, at 6:24pm, Ron Barnes <[hidden email]> wrote:
>
>> I need to sort them as follows...
>>
>> Sort Field 1 Ascending
>> Sort Field 2 Ascending WITHIN field 1
>> Sort Field 3 Ascending WITHIN field 2 WITHIN field 1
>> Sort Field 4 Descending WITHIN field 3 WITHIN field 2 WITHIN field
>1 <== This is a Date field and the most current (Highest) Date to
>float up
>> Sort Field 5 Ascending WITHIN field 4 WITHIN field 3 WITHIN field 2
>WITHIN field 1
>
>Trivial in any SQL engine including SQLite.  It looks something like
>
>SELECT * FROM MyTable ORDER BY field1,field2,field3,field4
>DESC,field5
>
>SQL users do that sort of thing all the time.
>
>However, if you use just the above command, SQLite will have to
>perform this sorting of 600M records each time you execute the
>command, which could take quite a long time — minutes or hours
>depending on your hardware.  So for any flavour of SQL you would
>probably tell it to create an index …
>
>CREATE INDEX m_12345 ON MyTable (field1,field2,field3,field4
>DESC,field5)
>
>This tells SQL to perform the sorting and save the resulting order on
>disk.  Then every time you perform the above SELECT command SQL
>notices it already has the sort-order saved and just uses that one.
>This can change the amount of time taken to a few seconds.
>
>My one concern in reading your post is how your dates are formatted.
>When putting your date fields into your SQL table you will have to
>ensure that dates are saved as a day number, or as text which
>naturally sorts into date order, e.g. YYYY/DD/MM.  You should not
>expect SQL to sort text such as "19 October 16" correctly.
>
>Simon.
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Ron Barnes
In reply to this post by Simon Slavin-3
Thank you so much - I will test this as soon as I get home!

-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Simon Slavin
Sent: Tuesday, September 26, 2017 1:37 PM
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Is there a way to perform a muti-level sort and extract of large data sets?



On 26 Sep 2017, at 6:24pm, Ron Barnes <[hidden email]> wrote:

> I need to sort them as follows...
>
> Sort Field 1 Ascending
> Sort Field 2 Ascending WITHIN field 1
> Sort Field 3 Ascending WITHIN field 2 WITHIN field 1 Sort Field 4
> Descending WITHIN field 3 WITHIN field 2 WITHIN field 1 <== This is a
> Date field and the most current (Highest) Date to float up Sort Field
> 5 Ascending WITHIN field 4 WITHIN field 3 WITHIN field 2 WITHIN field
> 1

Trivial in any SQL engine including SQLite.  It looks something like

SELECT * FROM MyTable ORDER BY field1,field2,field3,field4 DESC,field5

SQL users do that sort of thing all the time.

However, if you use just the above command, SQLite will have to perform this sorting of 600M records each time you execute the command, which could take quite a long time — minutes or hours depending on your hardware.  So for any flavour of SQL you would probably tell it to create an index …

CREATE INDEX m_12345 ON MyTable (field1,field2,field3,field4 DESC,field5)

This tells SQL to perform the sorting and save the resulting order on disk.  Then every time you perform the above SELECT command SQL notices it already has the sort-order saved and just uses that one.  This can change the amount of time taken to a few seconds.

My one concern in reading your post is how your dates are formatted.  When putting your date fields into your SQL table you will have to ensure that dates are saved as a day number, or as text which naturally sorts into date order, e.g. YYYY/DD/MM.  You should not expect SQL to sort text such as "19 October 16" correctly.

Simon.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Ron Barnes
In reply to this post by Warren Young
Thank you very much!

I will research the two suggestions below.

As for your sort assumption, you are correct.

A  A  1  2
A  A  2  1
A  B  1  2
A  B  1  3
A  C  1   1

-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Warren Young
Sent: Tuesday, September 26, 2017 2:14 PM
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Is there a way to perform a muti-level sort and extract of large data sets?

On Sep 26, 2017, at 11:24 AM, Ron Barnes <[hidden email]> wrote:
>
> I have approximately 600 million records that need to be sorted

Where is the data now?

> There are 18 table entries.

You mean 18 columns per row, right?

> I also need to deduplicate the records based upon the sorted output file.

You speak of VB.NET, which means you don’t have a uniq tool as on POSIX systems:

   https://linux.die.net/man/1/uniq

If you can install Cygwin or WSL on these Windows boxes, then you’d have uniq, as well as a cross-platform solution.  SQLite is available for both Cygwin and WSL.

> I can take care of the deduplication (I think).

The basic functionality of uniq is indeed pretty simple: given sorted input, write as output only lines that don’t repeat the content of the previous input line.

The primary reason to mess with Cygwin or WSL on Windows is simply because using pre-built tools, you don’t have to debug and maintain it.  There’s value in “just run it through uniq.”  Even if you can write it in VB.net in half an hour, you’re vastly over-budget compared to the half second it takes me to type “ | uniq”.

> Sort Field 1 Ascending
> Sort Field 2 Ascending WITHIN field 1

I’m not sure what you mean by “WITHIN”.  Are you simply saying that you want the data sorted first by field 2 and then by field 1, so that when two records have the same field 1 content, that the output has that pair of records ordered by field 2?  E.g.

    Field 1    Field 2
    ---------- ------------
    A          B
    A          C

As opposed to:

    Field 1    Field 2
    ---------- ------------
    A          C
    A          B

If so, that’s trivial SQL, well-covered in Simon’s reply.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Ron Barnes
In reply to this post by Keith Medcalf
That is exactly what I want to do!!!!

-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Keith Medcalf
Sent: Tuesday, September 26, 2017 3:04 PM
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Is there a way to perform a muti-level sort and extract of large data sets?


SELECT DISTINCT * FROM table ORDER BY field1,field2,field3,field4 DESC,field5;

to do the whole sorting and de-duplication in one step ... assuming you want to report duplicate entire rows only once ...


---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.


>-----Original Message-----
>From: sqlite-users [mailto:sqlite-users-
>[hidden email]] On Behalf Of Simon Slavin
>Sent: Tuesday, 26 September, 2017 11:37
>To: SQLite mailing list
>Subject: Re: [sqlite] Is there a way to perform a muti-level sort and
>extract of large data sets?
>
>
>
>On 26 Sep 2017, at 6:24pm, Ron Barnes <[hidden email]> wrote:
>
>> I need to sort them as follows...
>>
>> Sort Field 1 Ascending
>> Sort Field 2 Ascending WITHIN field 1 Sort Field 3 Ascending WITHIN
>> field 2 WITHIN field 1 Sort Field 4 Descending WITHIN field 3 WITHIN
>> field 2 WITHIN field
>1 <== This is a Date field and the most current (Highest) Date to float
>up
>> Sort Field 5 Ascending WITHIN field 4 WITHIN field 3 WITHIN field 2
>WITHIN field 1
>
>Trivial in any SQL engine including SQLite.  It looks something like
>
>SELECT * FROM MyTable ORDER BY field1,field2,field3,field4
>DESC,field5
>
>SQL users do that sort of thing all the time.
>
>However, if you use just the above command, SQLite will have to perform
>this sorting of 600M records each time you execute the command, which
>could take quite a long time — minutes or hours depending on your
>hardware.  So for any flavour of SQL you would probably tell it to
>create an index …
>
>CREATE INDEX m_12345 ON MyTable (field1,field2,field3,field4
>DESC,field5)
>
>This tells SQL to perform the sorting and save the resulting order on
>disk.  Then every time you perform the above SELECT command SQL notices
>it already has the sort-order saved and just uses that one.
>This can change the amount of time taken to a few seconds.
>
>My one concern in reading your post is how your dates are formatted.
>When putting your date fields into your SQL table you will have to
>ensure that dates are saved as a day number, or as text which naturally
>sorts into date order, e.g. YYYY/DD/MM.  You should not expect SQL to
>sort text such as "19 October 16" correctly.
>
>Simon.
>_______________________________________________
>sqlite-users mailing list
>[hidden email]
>http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Stephen Chrzanowski
In reply to this post by Simon Slavin-3
On Tue, Sep 26, 2017 at 1:36 PM, Simon Slavin <[hidden email]> wrote:

>
>
> My one concern in reading your post is how your dates are formatted.  When
> putting your date fields into your SQL table you will have to ensure that
> dates are saved as a day number, or as text which naturally sorts into date
> order, e.g. YYYY/DD/MM.  You should not expect SQL to sort text such as "19
> October 16" correctly.
>
> Simon.
>

@OP, Simon is dead on, however, the only correction and clarification to
that statement is you'll want (if required) to sort by YYYY/MM/DD, not
YYYY/DD/MM.

Also, for deduplication, if you're executing one instruction, you can add a
UNIQUE clause after the SELECT, unless you've got other requirements that
make defines what a duplicate entry is, then you'd have to rely on your
software for those decisions.
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Stephen Chrzanowski
uh.. UNIQUE... DISTINCT... ,

On Tue, Sep 26, 2017 at 4:30 PM, Stephen Chrzanowski <[hidden email]>
wrote:

>
>
> On Tue, Sep 26, 2017 at 1:36 PM, Simon Slavin <[hidden email]>
> wrote:
>
>>
>>
>> My one concern in reading your post is how your dates are formatted.
>> When putting your date fields into your SQL table you will have to ensure
>> that dates are saved as a day number, or as text which naturally sorts into
>> date order, e.g. YYYY/DD/MM.  You should not expect SQL to sort text such
>> as "19 October 16" correctly.
>>
>> Simon.
>>
>
> @OP, Simon is dead on, however, the only correction and clarification to
> that statement is you'll want (if required) to sort by YYYY/MM/DD, not
> YYYY/DD/MM.
>
> Also, for deduplication, if you're executing one instruction, you can add
> a UNIQUE clause after the SELECT, unless you've got other requirements that
> make defines what a duplicate entry is, then you'd have to rely on your
> software for those decisions.
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Ron Barnes
😊

Than you!

-----Original Message-----
From: sqlite-users [mailto:[hidden email]] On Behalf Of Stephen Chrzanowski
Sent: Tuesday, September 26, 2017 4:31 PM
To: SQLite mailing list <[hidden email]>
Subject: Re: [sqlite] Is there a way to perform a muti-level sort and extract of large data sets?

uh.. UNIQUE... DISTINCT... ,

On Tue, Sep 26, 2017 at 4:30 PM, Stephen Chrzanowski <[hidden email]>
wrote:

>
>
> On Tue, Sep 26, 2017 at 1:36 PM, Simon Slavin <[hidden email]>
> wrote:
>
>>
>>
>> My one concern in reading your post is how your dates are formatted.
>> When putting your date fields into your SQL table you will have to
>> ensure that dates are saved as a day number, or as text which
>> naturally sorts into date order, e.g. YYYY/DD/MM.  You should not
>> expect SQL to sort text such as "19 October 16" correctly.
>>
>> Simon.
>>
>
> @OP, Simon is dead on, however, the only correction and clarification
> to that statement is you'll want (if required) to sort by YYYY/MM/DD,
> not YYYY/DD/MM.
>
> Also, for deduplication, if you're executing one instruction, you can
> add a UNIQUE clause after the SELECT, unless you've got other
> requirements that make defines what a duplicate entry is, then you'd
> have to rely on your software for those decisions.
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Jungle Boogie
In reply to this post by Ron Barnes
Hi there,

Just because I'm interested, I'm wondering if you can identify your
hardware, and how long it takes your system to do your desired operation on
such a large number of records.

Do let us know which option you performed the query with as well.

P.S. For best results, I'd recommend using the latest release of SQLite.

Thanks!
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Is there a way to perform a muti-level sort and extract of large data sets?

Donald Griggs
Hi Ron,

It may be that increasing the sqlite cache size will substantially reduce
the time for either the CREATE INDEX or the SELECT ... ORDER BY....
instructions (depending upon which method you choose).

https://sqlite.org/pragma.html#pragma_cache_size



On Wed, Sep 27, 2017 at 1:58 AM, jungle Boogie <[hidden email]>
wrote:

> Hi there,
>
> Just because I'm interested, I'm wondering if you can identify your
> hardware, and how long it takes your system to do your desired operation on
> such a large number of records.
>
> Do let us know which option you performed the query with as well.
>
> P.S. For best results, I'd recommend using the latest release of SQLite.
>
> Thanks!
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users