Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Yannick Duchêne
Hi people,

I though automatic index was controlled by the pragma `automatic_index`, which seems to do nothing for me, unless I'm doing it the wrong way, while `WITHOUT ROWID` on table creations, seems to effectively prevent automatic indexes, the latter with or without the `PRAGMA automatic_index=0`.

Is this expected or is this me doing something wrong?

I'm concerned about it mainly for file size, more than for speed, while it matters too, as the program is far more slow than expected. For a program I'm running I got this:

  * With ROWID: execution time is 22 seconds and DB file size is 1.9MB ;
  * Without ROWID, execution time is 19 seconds and DB file size is 788KB (2.4 times smaller) ;
  * With insertion disabled, execution time is 6 seconds, so most of the execution time is spent in rows insertion.

The program indexes HTML files (*), the test runs on only 52 average sized HTML files, so the difference will be even more important with bigger files set.

I'm feeling lost about this: the pragma seems to do nothing and it's `WITHOUT ROWID` on table creations, which seems to do it instead. The pragma is executed right after the cursor creation. I'm using APSW (**). Pragma seems to works, as I tested with `PRAGMA synchronous=0` (just for a test, I really don't want to use this).


I don't know if it's OK to ask for this kind of question here. If it's not, just tell me, I won't bother.


(*) There are bout 103 000 insertions queries ending into 15 600 rows, using this:

    CREATE TABLE element_attribute_values
      (element TEXT NOT NULL,
       attribute TEXT NOT NULL,
       value TEXT NOT NULL
         CHECK ((attribute IS NOT '') OR (value IS '')),
       count INTEGER NOT NULL DEFAULT 1,
       PRIMARY KEY (element, attribute, value))

    INSERT OR REPLACE INTO element_attribute_values
      (element, attribute, value, count)
      VALUES (:element, :attribute, :value,
        (SELECT count + 1 FROM element_attribute_values
          WHERE (element IS :element)
            AND (attribute IS :attribute)
            AND (value IS :value)))

(**) Standard Python SQLite3 binding seems perceptibly slow. With APSW, the program is 17 times faster than with standard Python SQLite3 binding … execution time is around 370 seconds in the latter case, whatever how I manage transactions.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Dominique Devienne
On Thu, Jan 28, 2016 at 7:17 PM, Yannick Duchêne <[hidden email]>
wrote:

> I though automatic index was controlled by the pragma `automatic_index`,
> which seems to do nothing for me, unless I'm doing it the wrong way, while
> `WITHOUT ROWID` on table creations, seems to effectively prevent automatic
> indexes, the latter with or without the `PRAGMA automatic_index=0`.
>
> Is this expected or is this me doing something wrong?
>

Not sure to understand what you expect this pragma to do, but inserts
typically don't involve automatic indexes, which are used only in queries
(selects).

And automatic indexes are used only if there isn't an suitable index
already, and SQLite query planner things one will help.

Your insert does have a query, but it fits the PK, so no automatic index is
needed. (and you can likely use AUTOINCREMENT to avoid that query in the
first place).

The 2.4x size difference is kinda expected, since your PK includes your
biggest column I suspect (i.e. value), so both the table, and its PK index
most store it, doubling at least the memory, and with page waste, clearly
can account for the difference.

And the time difference can be in-part at least attributed to the increased
IO.

It seems unusual to use the HTML text in the PK. Are you sure you need it?
--DD
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Yannick Duchêne
On Thu, 28 Jan 2016 19:59:02 +0100
Dominique Devienne <[hidden email]> wrote:

> Not sure to understand what you expect this pragma to do, but inserts
> typically don't involve automatic indexes, which are used only in queries
> (selects).

I though it was responsible for the `sqlite_autoindex_"table_name"_N` which gets generated and augments the DB size. That's the name and the documentation which makes me believe this. I can see these index in sqlitebrowser and at the file size.
 
> Your insert does have a query, but it fits the PK, so no automatic index is
> needed. (and you can likely use AUTOINCREMENT to avoid that query in the
> first place).

"count" is not a primary key, and AUTOINCREMENT may be used only on primary key.

It indeed does not need a primary key, a unique constraint is enough. Whether or not I define a PK or a unique constraint, it ends into the same speed and DB size. I just wanted to test if it differs when a PK is defined (it does not).

 
> The 2.4x size difference is kinda expected, since your PK includes your
> biggest column I suspect (i.e. value), so both the table, and its PK index
> most store it, doubling at least the memory, and with page waste, clearly
> can account for the difference.
>
> And the time difference can be in-part at least attributed to the increased
> IO.

That was my conclusion too, and that's why I would like to prevent the creation of this extraneous index. I can achieve it using `WITHOUT ROWID`, and not with the pragma, and that's what I don't understand. So, I can achieve it, just that it's in a way I don't understand, which give me the feeling I'm using a trick and I don't like it. Unless ROWID is indeed what's responsible for the creation of these extraneous indexes …

> It seems unusual to use the HTML text in the PK. Are you sure you need it?

It's short texts (one to ten characters for an average), just element name, attribute name and attribute value, unique triplets with a count of their occurences. There are only a few attribute values which may be a bit long, the href of <a> elements. I must say I simplified my explanations, it's not really the index, it's a prior step which makes some statistics which are later used to decide what elements and attributes will be focused on to index the contents. The text the final index stores is reasonably short title-like texts.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

R Smith
In reply to this post by Yannick Duchêne


On 2016/01/28 8:17 PM, Yannick Duchêne wrote:

> Hi people,
>
> I though automatic index was controlled by the pragma `automatic_index`, which seems to do nothing for me, unless I'm doing it the wrong way, while `WITHOUT ROWID` on table creations, seems to effectively prevent automatic indexes, the latter with or without the `PRAGMA automatic_index=0`.
>
> Is this expected or is this me doing something wrong?
>
> I'm concerned about it mainly for file size, more than for speed, while it matters too, as the program is far more slow than expected. For a program I'm running I got this:
>
>    * With ROWID: execution time is 22 seconds and DB file size is 1.9MB ;
>    * Without ROWID, execution time is 19 seconds and DB file size is 788KB (2.4 times smaller) ;
>    * With insertion disabled, execution time is 6 seconds, so most of the execution time is spent in rows insertion.
>
> The program indexes HTML files (*), the test runs on only 52 average sized HTML files, so the difference will be even more important with bigger files set.
>
> I'm feeling lost about this: the pragma seems to do nothing and it's `WITHOUT ROWID` on table creations, which seems to do it instead. The pragma is executed right after the cursor creation. I'm using APSW (**). Pragma seems to works, as I tested with `PRAGMA synchronous=0` (just for a test, I really don't want to use this).

I think you are misunderstanding the Pragma and the idea of automatic
indices. An automatic Index might be created on a table that doesn't
have an adequately assigned primary key. It might also be created during
a query (mostly SELECT or sub-SELECT) for which there is no useful Index
created by the table designer and the Query planner figures it will be
quicker to make an Index than to do table scans through the query. This
habit of creating indices during select queries can be forced to not
happen by setting the "PRAGMA automatic_index=0;", but this needs to
happen when you open the DB connection, or at a minimum, before you try
any query - not after the cursor is created,  by that time the index
might already be made. (This is why you are not seeing any speed
improvement).

I only mention all the above so you understand what the automatic
indexing is about, but it has almost certainly nothing to do with your
query slowness, and even if you switch it off at a more opportune time,
I would be surprised if it changes the query speed.

>
>
> I don't know if it's OK to ask for this kind of question here. If it's not, just tell me, I won't bother.

It's always Ok to ask here. Welcome to SQLite-World. :)


>
>
> (*) There are bout 103 000 insertions queries ending into 15 600 rows, using this:

103k insertions to produce 15k rows... that is ~15% efficiency - the
opposite of good design. Perhaps we can help you find better SQL to
solve your problem. Let's see...

>
>      CREATE TABLE element_attribute_values
>        (element TEXT NOT NULL,
>         attribute TEXT NOT NULL,
>         value TEXT NOT NULL
>           CHECK ((attribute IS NOT '') OR (value IS '')),
>         count INTEGER NOT NULL DEFAULT 1,
>         PRIMARY KEY (element, attribute, value))
>
>      INSERT OR REPLACE INTO element_attribute_values
>        (element, attribute, value, count)
>        VALUES (:element, :attribute, :value,
>          (SELECT count + 1 FROM element_attribute_values
>            WHERE (element IS :element)
>              AND (attribute IS :attribute)
>              AND (value IS :value)))
>
> (**) Standard Python SQLite3 binding seems perceptibly slow. With APSW, the program is 17 times faster than with standard Python SQLite3 binding … execution time is around 370 seconds in the latter case, whatever how I manage transactions.
>
>

APSW is far better and maintained by Roger Binns (whom you might see
post here sometimes). Using that is your best bet, but it also cannot
help heavy-set queries. The best strategy is to make better queries.

I do not know the shape of your data (it matters), but I'm guessing
"element" represents standard HTML tags with "attribute" and "value"
giving basic expansion of the attributes list. it'd probably be safer to
use non case-sensitive values and use standard equation tests in Selects.

Could you try these in your Query loop perhaps:


-- Table: Renamed field count to cnt because "count" is an SQL reserved
word (though it will work, just a better habit)
CREATE TABLE element_attribute_values (
   element TEXT COLLATE NOCASE NOT NULL,
   attribute TEXT COLLATE NOCASE NOT NULL,
   value TEXT COLLATE NOCASE NOT NULL CHECK ((attribute <> '') OR (value
= '')),
   cnt INTEGER NOT NULL DEFAULT 0,
   PRIMARY KEY (element, attribute, value)
);

-- Insert Loop start:
-- This will simply fail if the PK already exists, else start the line
with 0 count.
INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
:attribute, :value, 0);

-- This might be faster since it uses only one lookup loop, but it might
also not be.
-- If you share a list of example data to be inserted, we can find a
faster way. Try it and let us know...
WITH EAV(id,icnt) AS (
   SELECT rowid,cnt+1
     FROM element_attribute_values
    WHERE (element = :element) AND (attribute = :attribute) AND (value =
:value)
    LIMIT 1
)
UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
  WHERE rowid = (SELECT id FROM EAV)
;
-- Insert Loop end.



_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

R Smith


On 2016/01/28 10:08 PM, R Smith wrote:

>
>
> -- Insert Loop start:
> -- This will simply fail if the PK already exists, else start the line
> with 0 count.
> INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
> :attribute, :value, 0);
>
> -- This might be faster since it uses only one lookup loop, but it
> might also not be.
> -- If you share a list of example data to be inserted, we can find a
> faster way. Try it and let us know...
> WITH EAV(id,icnt) AS (
>   SELECT rowid,cnt+1
>     FROM element_attribute_values
>    WHERE (element = :element) AND (attribute = :attribute) AND (value
> = :value)
>    LIMIT 1
> )
> UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
>  WHERE rowid = (SELECT id FROM EAV)
> ;
> -- Insert Loop end.
>

Upon re-reading I realised my descriptions were confusing. To be clear,
this is the entire loop SQL that should form part of your program:

    -- Insert Loop start:
    INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
    :attribute, :value, 0);

    WITH EAV(id,icnt) AS (
       SELECT rowid,cnt+1
         FROM element_attribute_values
        WHERE (element = :element) AND (attribute = :attribute) AND
    (value = :value)
        LIMIT 1
    )
    UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
      WHERE rowid = (SELECT id FROM EAV)
    ;
    -- Insert Loop end.


_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Yannick Duchêne
In reply to this post by R Smith
On Thu, 28 Jan 2016 22:08:02 +0200
R Smith <[hidden email]> wrote:

> I think you are misunderstanding the Pragma and the idea of automatic
> indices. An automatic Index might be created on a table that doesn't
> have an adequately assigned primary key. It might also be created during
> a query (mostly SELECT or sub-SELECT) for which there is no useful Index
> created by the table designer and the Query planner figures it will be
> quicker to make an Index than to do table scans through the query. This
> habit of creating indices during select queries can be forced to not
> happen by setting the "PRAGMA automatic_index=0;", but this needs to
> happen when you open the DB connection, or at a minimum, before you try
> any query - not after the cursor is created,  by that time the index
> might already be made. (This is why you are not seeing any speed
> improvement).
>
> I only mention all the above so you understand what the automatic
> indexing is about, but it has almost certainly nothing to do with your
> query slowness, and even if you switch it off at a more opportune time,
> I would be surprised if it changes the query speed.

Indeed, I was misunderstanding. Are these indexes visible? Is there any relation with the ones I see from sqlitebrowser?
 
> 103k insertions to produce 15k rows... that is ~15% efficiency - the
> opposite of good design. Perhaps we can help you find better SQL to
> solve your problem. Let's see...

(red-face)

>
> I do not know the shape of your data (it matters), but I'm guessing
> "element" represents standard HTML tags with "attribute" and "value"
> giving basic expansion of the attributes list. it'd probably be safer to
> use non case-sensitive values and use standard equation tests in Selects.
>

The shape will change, it's far from final. The tables at that step depends on the next steps in the overall procedure, which is not entirely fixed for now.

> Could you try these in your Query loop perhaps:
>
>
> -- Table: Renamed field count to cnt because "count" is an SQL reserved
> word (though it will work, just a better habit)
> CREATE TABLE element_attribute_values (
>    element TEXT COLLATE NOCASE NOT NULL,
>    attribute TEXT COLLATE NOCASE NOT NULL,
>    value TEXT COLLATE NOCASE NOT NULL CHECK ((attribute <> '') OR (value
> = '')),
>    cnt INTEGER NOT NULL DEFAULT 0,
>    PRIMARY KEY (element, attribute, value)
> );
>
> -- Insert Loop start:
> -- This will simply fail if the PK already exists, else start the line
> with 0 count.
> INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
> :attribute, :value, 0);
>
> -- This might be faster since it uses only one lookup loop, but it might
> also not be.
> -- If you share a list of example data to be inserted, we can find a
> faster way. Try it and let us know...
> WITH EAV(id,icnt) AS (
>    SELECT rowid,cnt+1
>      FROM element_attribute_values
>     WHERE (element = :element) AND (attribute = :attribute) AND (value =
> :value)
>     LIMIT 1
> )
> UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
>   WHERE rowid = (SELECT id FROM EAV)
> ;
> -- Insert Loop end.
>

The default may even be omitted, and may be the `LIMIT 1` too, as each triplet is unique.

I tried it, with the variant (no `LIMIT 1`) and it's about the same, even a bit more slow (not much, between one and two seconds more long). While speed is not the only concern, or perhaps I should not care that much about the DB file size (the other matters).

It's close to what I had at the beginning, which gave similar timings (just changed to match your recommendation about "cnt"'s name):

    INSERT OR IGNORE INTO element_attribute_values
      VALUES (:element, :attribute, :value, 0);
    UPDATE element_attribute_values
      SET cnt = cnt + 1
      WHERE (element = :element)
        AND (attribute = :attribute)
        AND (value = :value);

> -- If you share a list of example data to be inserted, we can find a
> faster way. Try it and let us know...

I don't mind, I can upload an archive somewhere. I guess you mean table's content?

I'm aware this use case may be a bit pathological, as I could use Python's dictionary. However, I decided to not to, for three reasons:

 * I wanted to see what it's like to use an SQLite DB as an application data container (file or memory) ;
 * Using a persistent DB is better for incremental process (may stop and resume later) ;
 * Persistent data is welcome for human review (I think about defining views in sqlitebrowser to dissect the results) ;

For persistence, I first tried CSV files, but this shows to be a inadequate. An SQL DB and a DB browser, looks better than CSV for this use‑case.

--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

R Smith


On 2016/01/28 11:44 PM, Yannick Duchêne wrote:
> On Thu, 28 Jan 2016 22:08:02 +0200
> R Smith <[hidden email]> wrote:
>
> Indeed, I was misunderstanding. Are these indexes visible? Is there
> any relation with the ones I see from sqlitebrowser?

If you mean you are looking at the DB schema and seeing automatic
indices there, then those are mostly the created ones where the table
doesn't contain a suitable index or perhaps the rowid. If you mean the
automatic indices that would be created in a select query for which the
query planner decides to add an index to facilitate faster querying -
well those can't be seen like that, but you can inspect the output from
"EXPLAIN QUERY PLAN" or just "EXPLAIN" (added to the front of your
query). These will show the query planner's steps and scan plans
respectively (showing any such auto indices it might use).

>
> The shape will change, it's far from final. The tables at that step depends on the next steps in the overall procedure, which is not entirely fixed for now.

Good to know.

>> Could you try these in your Query loop perhaps:
>>
>>
>> -- Table: Renamed field count to cnt because "count" is an SQL reserved
>> word (though it will work, just a better habit)
>> CREATE TABLE element_attribute_values (
>>     element TEXT COLLATE NOCASE NOT NULL,
>>     attribute TEXT COLLATE NOCASE NOT NULL,
>>     value TEXT COLLATE NOCASE NOT NULL CHECK ((attribute <> '') OR (value
>> = '')),
>>     cnt INTEGER NOT NULL DEFAULT 0,
>>     PRIMARY KEY (element, attribute, value)
>> );
>>
>> -- Insert Loop start:
>> -- This will simply fail if the PK already exists, else start the line
>> with 0 count.
>> INSERT OR IGNORE INTO element_attribute_values VALUES (:element,
>> :attribute, :value, 0);
>>
>> -- This might be faster since it uses only one lookup loop, but it might
>> also not be.
>> -- If you share a list of example data to be inserted, we can find a
>> faster way. Try it and let us know...
>> WITH EAV(id,icnt) AS (
>>     SELECT rowid,cnt+1
>>       FROM element_attribute_values
>>      WHERE (element = :element) AND (attribute = :attribute) AND (value =
>> :value)
>>      LIMIT 1
>> )
>> UPDATE element_attribute_values SET cnt = (SELECT icnt FROM EAV)
>>    WHERE rowid = (SELECT id FROM EAV)
>> ;
>> -- Insert Loop end.
>>
> The default may even be omitted, and may be the `LIMIT 1` too, as each triplet is unique.

Yeah, the LIMIT 1 is just to force single-row-results which are required
for the sub-query - it is however not possible to obtain multiple row
results here, so the limit is superfluous in this case, but I like
keeping it in for legibility.

> I tried it, with the variant (no `LIMIT 1`) and it's about the same, even a bit more slow (not much, between one and two seconds more long). While speed is not the only concern, or perhaps I should not care that much about the DB file size (the other matters).
>
> It's close to what I had at the beginning, which gave similar timings (just changed to match your recommendation about "cnt"'s name):

Yeah, I did not have a test-bed but I thought it might not improve much.
Post a table somewhere with typical values (the 103k rows) that would be
needing insertion, we might find a real fast way.


>
> I'm aware this use case may be a bit pathological, as I could use Python's dictionary. However, I decided to not to, for three reasons:
>
>   * I wanted to see what it's like to use an SQLite DB as an application data container (file or memory) ;
>   * Using a persistent DB is better for incremental process (may stop and resume later) ;
>   * Persistent data is welcome for human review (I think about defining views in sqlitebrowser to dissect the results) ;
>
> For persistence, I first tried CSV files, but this shows to be a inadequate. An SQL DB and a DB browser, looks better than CSV for this use‑case.
>

I think the dictionary would be faster for this use-case (mostly cause
it runs in-memory and we do not require all the DB data guarantees for
the initial sort). However, the fastest way in SQL would be to use a
temporary table, do you have any problems with that? It would mean you
would do the inserts in an insert loop and then simply calculate the
incidence values and insert to the real table - so it would require a 2
- step process.

In fact, let me write the SQL quick here. The table create is just like
before, and then:

BEGIN TRANSACTION;

   -- Make Temporary table
CREATE TEMPORARY TABLE eav (
   e TEXT COLLATE NOCASE,
   a TEXT COLLATE NOCASE,
   v TEXT COLLATE NOCASE
);

   -- The insert loop:
INSERT INTO eav VALUES (:element, :attribute, :value);
   -- end of Insert loop


CREATE INDEX Idx_eav ON eav(e,a,v);

INSERT OR IGNORE INTO element_attribute_values (element,attribute,value,cnt)
   SELECT e,a,v,COUNT(*)
     FROM EAV
    WHERE 1
    GROUP BY e,a,v;

   -- Cleanup
DROP TABLE eav;

COMMIT;


If that is not a LOT faster then I will be very surprised...

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Keith Medcalf
In reply to this post by Yannick Duchêne

> > Not sure to understand what you expect this pragma to do, but inserts
> > typically don't involve automatic indexes, which are used only in
> > queries
> > (selects).
>
> I though it was responsible for the `sqlite_autoindex_"table_name"_N`
> which gets generated and augments the DB size. That's the name and the
> documentation which makes me believe this. I can see these index in
> sqlitebrowser and at the file size.

When you create a table thusly:

create table x (x primary key, y, z);

you are creating a rowid table with columns x, y, an z.  You are also saying that you want x to be the primary key.  Therefore, you will get a table called x containing (ROWID, X, Y, Z).  This is a btree where the index is the rowid.  In order to implement you PRIMARY KEY, an index must be constructed containing the column X and ROWID in the table where the value of X is found.  This index will be called sqlite_autoindex_x_1 because it is the first index created automatically on table x.

You could achieve the same result as follows:

create table x (x, y, z);
create unique index pk_x_x on x(x);

except that now you have given the index on x an explicit name and one does not have to be manufactured for you.

Similarly, if you used:

create table x (x primary key, y unique, z);

you would get a table containing (rowid, x, y, z) and an sqlite_autoindex_x_1 implementing the unique key on x, and an sqlite_autoindex_x_2 implementing the unique constraint (index) on y.  If you wanted to explicitly specify the names of the indexes then you could say:

create table x (x, y, z);
create unique index pk_x_x on x(x);
create unique index uk_x_y on x(y);

The end result is the same.  These are indexes that you have DECLARED that you want and are entirely different from automatic indexes used to increase the performance of queries.

when you create a table thusly

create table x (x primary key, y, z) without_rowid;

you are creating a table without a rowid.  The table x consists entirely of an index where the indexed value is x, and the index "result" is y, z.  There is no actual table.

The problem with your original query is the use of INSERT or REPLACE.  Do not do that.  REPLACE is implemented as a DELETE (which means that all the indexes need to be updated and the tree's rebalanced) followed by an INSERT (which requires all the indexes to be updated and the trees to be rebalanced).  Instead you should determine what you want to do and do it.  For example:

cr.execute('BEGIN IMMEDIATE')
try:
  rowid = cr.execute('select rowid from element_attribute_values where element=? and attribute=? and value=?', (element, attribute, value)).fetchone()[0]
  cr.execute('update element_attribute_values set count=count+1 where rowid=?', (rowid,))
except:
  cr.execute('insert into element_attribute_values values (?, ?, ?, 1)', (element, attribute, value))
cr.execute('COMMIT')


You can speed it up even more by moving your transactions out to cover multiple operations.  
In this way you never need to delete then re-insert the same row.

Remember, the only way to optimize I/O is not to do it.






_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Keith Medcalf
cr.execute('BEGIN IMMEDIATE')
try:
  rowid = cr.execute('select rowid from element_attribute_values where element=? and attribute=? and value=?', (element, attribute, value)).fetchone()[0]
  cr.execute('update element_attribute_values set count=count+1 where rowid=?', (rowid,))
except:
  cr.execute('insert into element_attribute_values values (?, ?, ?, 1)', (element, attribute, value))
cr.execute('COMMIT')


could be replaced by (the below will probably be faster also):

cr.execute('BEGIN IMMEDIATE')
cr.execute('insert or ignore into element_attribute_values values (?, ?, ?, 0)', (element, attribute, value))
cr.execute('update element_attribute_values set count=count+1 where element=? and attribute=? and value=?', (element, attribute, value))
cr.execute('COMMIT')


same caveat that if you want more speed you will have to move your transaction out to cover more updates.





_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Yannick Duchêne
In reply to this post by R Smith
On Fri, 29 Jan 2016 01:34:01 +0200
R Smith <[hidden email]> wrote:

>
> I think the dictionary would be faster for this use-case (mostly cause
> it runs in-memory and we do not require all the DB data guarantees for
> the initial sort). However, the fastest way in SQL would be to use a
> temporary table, do you have any problems with that? It would mean you
> would do the inserts in an insert loop and then simply calculate the
> incidence values and insert to the real table - so it would require a 2
> - step process.
>
> In fact, let me write the SQL quick here. The table create is just like
> before, and then:
>
> BEGIN TRANSACTION;
>
>    -- Make Temporary table
> CREATE TEMPORARY TABLE eav (
>    e TEXT COLLATE NOCASE,
>    a TEXT COLLATE NOCASE,
>    v TEXT COLLATE NOCASE
> );
>
>    -- The insert loop:
> INSERT INTO eav VALUES (:element, :attribute, :value);
>    -- end of Insert loop
>
>
> CREATE INDEX Idx_eav ON eav(e,a,v);
>
> INSERT OR IGNORE INTO element_attribute_values (element,attribute,value,cnt)
>    SELECT e,a,v,COUNT(*)
>      FROM EAV
>     WHERE 1
>     GROUP BY e,a,v;
>
>    -- Cleanup
> DROP TABLE eav;
>
> COMMIT;
>
>
> If that is not a LOT faster then I will be very surprised...

I tried this, using a single transaction for the whole and I measured 15 seconds instead of 22. Given that the program takes 6 seconds without insertions, that's good.

I finally use another way: I use a Python counter dictionary (the `Counter` class in the `collections` module) to compute the counts on a per file basis, then there is an insert‑or‑ignore‑then‑update request for each triplet, which add to "cnt" instead of incrementing it by just one. In fewer words, there is only one insert/update per triplet for a given file. Instead of 103K requests to give 15K rows, there is now 23K requests to give the same 15K rows. Using this with a transaction per file (as the program always did), I measured 17 seconds. So I devised to group transactions transparently, a transaction is really committed only each N times (it takes care of any reminder), and using 25 for N, I measure 11 seconds. I believe that's good enough (I guess it may be good to avoid big transactions).


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Yannick Duchêne
In reply to this post by Keith Medcalf
On Thu, 28 Jan 2016 19:18:06 -0500
"Keith Medcalf" <[hidden email]> wrote:

>
> When you create a table thusly:
>
> create table x (x primary key, y, z);
>
> you are creating a rowid table with columns x, y, an z.  You are also saying that you want x to be the primary key.  Therefore, you will get a table called x containing (ROWID, X, Y, Z).  This is a btree where the index is the rowid.  In order to implement you PRIMARY KEY, an index must be constructed containing the column X and ROWID in the table where the value of X is found.  This index will be called sqlite_autoindex_x_1 because it is the first index created automatically on table x.

Close to what I was suspecting. Thanks for these points.
 
> You could achieve the same result as follows:
>
> create table x (x, y, z);
> create unique index pk_x_x on x(x);
>
> except that now you have given the index on x an explicit name and one does not have to be manufactured for you.
>

I checked these indexes gets created even if I manually define similar indexes. But I don't mind. After some experiments, this seems more related to ROWID, and you seem to confirm a relation, above.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Keith Medcalf
> > You could achieve the same result as follows:
> >
> > create table x (x, y, z);
> > create unique index pk_x_x on x(x);
> >
> > except that now you have given the index on x an explicit name and one
> does not have to be manufactured for you.
> >
>
> I checked these indexes gets created even if I manually define similar
> indexes. But I don't mind. After some experiments, this seems more related
> to ROWID, and you seem to confirm a relation, above.

Well, yes, of course.  If you ask for two indexes to be created, two indexes will be created.

create table x (x primary key, y, z);
create unique index pk_x_x on x(x);

requests the creation of two indexes.  One by the "primary key" contained in the table definition, and one by the create index.  If you only want one, then only say to create one.

https://www.sqlite.org/optoverview.html#autoindex

see especially the last paragraphs of that section, and also

https://www.sqlite.org/fileformat2.html#intschema


Except for without_rowid tables, specifying "primary key" and "unique" within the table definition is just syntactic sugar for creating the indexes manually using create index; much in the same way ... JOIN ... ON ... is merely syntactic sugar for FROM ..., ... WHERE ... (except in the case of outer joins and the non-standard CROSS JOIN which cannot be expressed without the sugar).





_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: Pragma automatic_index and ROWID: which one is really responsible for automatic indexes?

Yannick Duchêne
In reply to this post by R Smith
On Thu, 28 Jan 2016 22:08:02 +0200
R Smith <[hidden email]> wrote:

> I think you are misunderstanding the Pragma and the idea of automatic
> indices. An automatic Index might be created on a table that doesn't
> have an adequately assigned primary key. It might also be created during
> a query (mostly SELECT or sub-SELECT) for which there is no useful Index
> created by the table designer and the Query planner figures it will be
> quicker to make an Index than to do table scans through the query. This
> habit of creating indices during select queries can be forced to not
> happen by setting the "PRAGMA automatic_index=0;", but this needs to
> happen when you open the DB connection, or at a minimum, before you try
> any query - not after the cursor is created,  by that time the index
> might already be made. (This is why you are not seeing any speed
> improvement).

Here is, from the documentation: https://www.sqlite.org/optoverview.html
At the very bottom, is said:

> Do not confuse automatic indexes with the internal indexes (having names like
> "sqlite_autoindex_table_N") that are sometimes created to implement a PRIMARY
> KEY constraint or UNIQUE constraint. The automatic indexes described here
> exist only for the duration of a single query, are never persisted to disk,
> and are only visible to a single database connection. Internal indexes are
> part of the implementation of PRIMARY KEY and UNIQUE constraints, are
> long-lasting and persisted to disk, and are visible to all database
> connections. The term "autoindex" appears in the names of internal indexes
> for legacy reasons and does not indicate that internal indexes and automatic
> indexes are related.

That's this legacy naming which causes confusion.


--
Yannick Duchêne
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users