extracting domain names from website addresses efficiently

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

extracting domain names from website addresses efficiently

Klaus Maas
I hope someone can help me to improve my SQL.

The solution I came up with becomes unreadable after 3 dots in a website
address (at least for me).

This is my approach:

The domain/subdomain name of a website address is enclosed between '://'
and the following '/'.
So I trim any text before and including '://' and after '/' in line 17 -
AS long

I calculate the number of dots in long - line 16 - AS dots

Because I could not find a way to figure out the position of the 2nd to
last dot directly (pseudo: substr(string, '2nd to last dot')  or
substr(string, nth('.')) ), I work with nested string manipulation.
The problem is that the nested string manipulation becomes unreadable
very quickly.

The case statement (lines 6 to 14) then strips all unwanted text
depending on the number of dots.
For 1 dot:  no stripping required
For 2 dots: a subdomain and the additional dot must be stripped
For 3 dots: the sub-subdomain and the additional w dots must be stripped
etc.
(CASE WHEN dots = 1  isn't really required, because the ELSE clause will
cover this case as well.)

Sample data:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE links(link);
INSERT INTO "links"
VALUES('https://stackoverflow.com/questions/7826876/sqlite-reverse-string-function');
INSERT INTO "links"
VALUES('https://www.pcwelt.de/ratgeber/Virtuelle-Netzwerke-mit-Open-VPN-aufbauen-9889432.html?utm_source=best-of-pc-welt-manuell&utm_medium=email&utm_campaign=newsletter&ext_id=3291863&pm_cat%5B0%5D=Netzwerktechnologie&pm_cat%5B1%5D=Netzwerke+allgemein&pm_cat%5B2%5D=Personal+Computer&pm_cat%5B3%5D=Netzwerk+Hardware&pm_cat%5B4%5D=Apps&pm_cat%5B5%5D=Client+Hardware&r=3677484228301571&lid=784281&pm_ln=3');
INSERT INTO "links" VALUES('http://dot3.dot2.memyself.eu/blabl.html');
INSERT INTO "links"
VALUES('http://dot4.dot3.dot2.memyself.eu/miamia/blabl.html');
INSERT INTO "links"
VALUES('http://dot5.dot4.dot3.dot2.memyself.eu/miamia/blabl.html');
INSERT INTO "links"
VALUES('http://dot6.dot5.dot4.dot3.dot2.memyself.eu/miamia/blabl.html');
INSERT INTO "links"
VALUES('http://dot7.dot6.dot5.dot4.dot3.dot2.memyself.eu/miamia/blabl.html');
INSERT INTO "links"
VALUES('http://dot8.dot7.dot6.dot5.dot4.dot3.dot2.memyself.eu/miamia/blabl.html');
INSERT INTO "links"
VALUES('http://dot9.dot8.dot7.dot6.dot5.dot4.dot3.dot2.memyself.eu/miamia/blabl.html');
COMMIT;

The example below is limited to 5 dots (domain plus 4 subdomain levels).
I did it for 9 dots as well and then the query blows up to 3 screen pages.

links:        name of source table:
link:         field name of website address
long:        field name of trimmed string
dots:        field name of calculated dots
domain:   field name of resulting domain.tld-string (stripped of any
subdomain)

.mode column
.header on
.width 51 -4 51
-- up to 5 dots (= 4 subdomain levels)
SELECT long, dots,
(CASE
WHEN dots = 1 THEN long
WHEN dots = 2 THEN substr(long, instr(long, '.')+1)
WHEN dots = 3 THEN substr(substr(long, instr(long, '.')+1),
instr(substr(long, instr(long, '.')+1), '.')+1)
WHEN dots = 4 THEN substr(substr(substr(long, instr(long, '.')+1),
instr(substr(long, instr(long, '.')+1), '.')+1),
instr(substr(substr(long, instr(long, '.')+1), instr(substr(long,
instr(long, '.')+1), '.')+1), '.')+1)
WHEN dots = 5 THEN substr(substr(substr(substr(long, instr(long,
'.')+1), instr(substr(long, instr(long, '.')+1), '.')+1),
instr(substr(substr(long, instr(long, '.')+1), instr(substr(long,
instr(long, '.')+1), '.')+1), '.')+1), instr(substr(substr(substr(long,
instr(long, '.')+1), instr(substr(long, instr(long, '.')+1), '.')+1),
instr(substr(substr(long, instr(long, '.')+1), instr(substr(long,
instr(long, '.')+1), '.')+1), '.')+1), '.')+1)
ELSE long
END) AS domain
FROM
(SELECT long, length(long) - length(replace(long, '.', '')) AS dots
FROM (SELECT substr(link, instr(link, '://')+3, instr(substr(link,
instr(link, '://')+3), '/')-1) AS long FROM links)
);

Is there a way to do this better/smarter?

Klaus
------------------------------------------------------------------------
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Igor Tandetnik-2
Something like this (could likely be simplified further, but this should give the idea):

with recursive subdomains as (
   SELECT substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) AS long,
          substr(link, instr(link, '://')+3, instr(substr(link, instr(link, '://')+3), '/')-1) as subdomain
   FROM links
union all
   select long, substr(subdomain, instr(subdomain, '.') + 1)
   from subdomains
   where instr(subdomain, '.') > 0)
select * from subdomains
where (instr(subdomain, '.') = 0 and subdomain = long) OR
       (instr(subdomain, '.') > 0 and instr(substr(subdomain, instr(subdomain, '.') + 1), '.')=0);

The main point is to recursively build a table of all suffixes, then select just the suffixes you want.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Peter da Silva
In reply to this post by Klaus Maas
This seems like a job for regular expressions.
 

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Klaus Maas
In reply to this post by Igor Tandetnik-2
Thank you, Igor, for your solution.
I expected that I would be pointed to a recursive approach.
Your solution is quite elegant in my opinion.
My problem, however, is that I have not yet understood recursive CTE's.
Could you give me some pointers for good resources to learn about them?

Well, your code example is helping a lot in understanding them, although
I am still a long way off producing something similar myself without
assistance.

There is 1 item in the last where condition of which I do not understand
the purpose:
(instr(subdomain, '.') = 0 and subdomain = long)
This means that anything in the form of 'ftp://test/' would output the
string between the two delimiters (:// and /), in this case 'test'.
But that is not a domain name in the format domain.tld.
(I am working under the assumption that table links is cleaned up
contains valid links only with the protocol and :// prepended.)
Or am I missing something?.

Klaus

------------------------------------------------------------------------
On 2017-12-11 14:59, Igor Tandetnik wrote:

> Something like this (could likely be simplified further, but this
> should give the idea):
>
> with recursive subdomains as (
>   SELECT substr(link, instr(link, '://')+3, instr(substr(link,
> instr(link, '://')+3), '/')-1) AS long,
>          substr(link, instr(link, '://')+3, instr(substr(link,
> instr(link, '://')+3), '/')-1) as subdomain
>   FROM links
> union all
>   select long, substr(subdomain, instr(subdomain, '.') + 1)
>   from subdomains
>   where instr(subdomain, '.') > 0)
> select * from subdomains
> where (instr(subdomain, '.') = 0 and subdomain = long) OR
>       (instr(subdomain, '.') > 0 and instr(substr(subdomain,
> instr(subdomain, '.') + 1), '.')=0);
>
> The main point is to recursively build a table of all suffixes, then
> select just the suffixes you want.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Igor Tandetnik-2
On 12/11/2017 12:22 PM, Klaus Maas wrote:
> There is 1 item in the last where condition of which I do not understand the purpose:
> (instr(subdomain, '.') = 0 and subdomain = long)

That's a zero-dot case, when the original domain is simply 'com', say. Added for completeness.

> This means that anything in the form of 'ftp://test/' would output the string between the two delimiters (:// and /), in this case 'test'.
> But that is not a domain name in the format domain.tld.

Feel free to adjust to taste.
--
Igor Tandetnik

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

petern
In reply to this post by Klaus Maas
Klaus.  The CTE manual with good examples is at
https://www.sqlite.org/lang_with.html
IgorT posted some good stuff about your problem using CTE.

FYI. TRIGGER is also recursive.  Could be more efficient if you have to
store them anyway:

CREATE TABLE domain(d TEXT);
CREATE TRIGGER domain_after_ins AFTER INSERT ON domain WHEN instr(NEW.d,'.')
  BEGIN INSERT INTO domain VALUES(substr(NEW.d,instr(NEW.d,'.')+1)); END;

PRAGMA recursive_triggers=1;
INSERT INTO domain VALUES ('a.b.c.d');
INSERT INTO domain VALUES('e.f.g.1.2');
SELECT * FROM domain;
d
----------
a.b.c.d
b.c.d
c.d
d
e.f.g.1.2
f.g.1.2
g.1.2
1.2
2

Peter







On Mon, Dec 11, 2017 at 9:22 AM, Klaus Maas <[hidden email]> wrote:

> Thank you, Igor, for your solution.
> I expected that I would be pointed to a recursive approach.
> Your solution is quite elegant in my opinion.
> My problem, however, is that I have not yet understood recursive CTE's.
> Could you give me some pointers for good resources to learn about them?
>
> Well, your code example is helping a lot in understanding them, although I
> am still a long way off producing something similar myself without
> assistance.
>
> There is 1 item in the last where condition of which I do not understand
> the purpose:
> (instr(subdomain, '.') = 0 and subdomain = long)
> This means that anything in the form of 'ftp://test/' would output the
> string between the two delimiters (:// and /), in this case 'test'.
> But that is not a domain name in the format domain.tld.
> (I am working under the assumption that table links is cleaned up contains
> valid links only with the protocol and :// prepended.)
> Or am I missing something?.
>
> Klaus
>
> ------------------------------------------------------------------------
> On 2017-12-11 14:59, Igor Tandetnik wrote:
>
>> Something like this (could likely be simplified further, but this should
>> give the idea):
>>
>> with recursive subdomains as (
>>   SELECT substr(link, instr(link, '://')+3, instr(substr(link,
>> instr(link, '://')+3), '/')-1) AS long,
>>          substr(link, instr(link, '://')+3, instr(substr(link,
>> instr(link, '://')+3), '/')-1) as subdomain
>>   FROM links
>> union all
>>   select long, substr(subdomain, instr(subdomain, '.') + 1)
>>   from subdomains
>>   where instr(subdomain, '.') > 0)
>> select * from subdomains
>> where (instr(subdomain, '.') = 0 and subdomain = long) OR
>>       (instr(subdomain, '.') > 0 and instr(substr(subdomain,
>> instr(subdomain, '.') + 1), '.')=0);
>>
>> The main point is to recursively build a table of all suffixes, then
>> select just the suffixes you want.
>>
>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>
_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Klaus Maas
In reply to this post by Igor Tandetnik-2
Thank you for your help!


------------------------------------------------------------------------
On 2017-12-11 22:43, Igor Tandetnik wrote:

> On 12/11/2017 12:22 PM, Klaus Maas wrote:
>> There is 1 item in the last where condition of which I do not
>> understand the purpose:
>> (instr(subdomain, '.') = 0 and subdomain = long)
>
> That's a zero-dot case, when the original domain is simply 'com', say.
> Added for completeness.
>
>> This means that anything in the form of 'ftp://test/' would output
>> the string between the two delimiters (:// and /), in this case 'test'.
>> But that is not a domain name in the format domain.tld.
>
> Feel free to adjust to taste.

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Klaus Maas
In reply to this post by petern
Very interesting.

I must be very slow, because at first I did no get what this was about,
although you fed it me on a spoon.

This is actually very straightforward. I find this easier to understand
than recursive CTEs.

Thank you, Peter and IgorT, for helping me on my way to understand this
powerful feature.

Still some way to go, but that is  part of the fun.


email signature Klaus Maas
------------------------------------------------------------------------
On 2017-12-12 00:22, petern wrote:

> Klaus.  The CTE manual with good examples is at
> https://www.sqlite.org/lang_with.html
> IgorT posted some good stuff about your problem using CTE.
>
> FYI. TRIGGER is also recursive.  Could be more efficient if you have to
> store them anyway:
>
> CREATE TABLE domain(d TEXT);
> CREATE TRIGGER domain_after_ins AFTER INSERT ON domain WHEN instr(NEW.d,'.')
>    BEGIN INSERT INTO domain VALUES(substr(NEW.d,instr(NEW.d,'.')+1)); END;
>
> PRAGMA recursive_triggers=1;
> INSERT INTO domain VALUES ('a.b.c.d');
> INSERT INTO domain VALUES('e.f.g.1.2');
> SELECT * FROM domain;
> d
> ----------
> a.b.c.d
> b.c.d
> c.d
> d
> e.f.g.1.2
> f.g.1.2
> g.1.2
> 1.2
> 2
>
> Peter
>
>
>
>
>
>
>
> On Mon, Dec 11, 2017 at 9:22 AM, Klaus Maas <[hidden email]> wrote:
>
>> Thank you, Igor, for your solution.
>> I expected that I would be pointed to a recursive approach.
>> Your solution is quite elegant in my opinion.
>> My problem, however, is that I have not yet understood recursive CTE's.
>> Could you give me some pointers for good resources to learn about them?
>>
>> Well, your code example is helping a lot in understanding them, although I
>> am still a long way off producing something similar myself without
>> assistance.
>>
>> There is 1 item in the last where condition of which I do not understand
>> the purpose:
>> (instr(subdomain, '.') = 0 and subdomain = long)
>> This means that anything in the form of 'ftp://test/' would output the
>> string between the two delimiters (:// and /), in this case 'test'.
>> But that is not a domain name in the format domain.tld.
>> (I am working under the assumption that table links is cleaned up contains
>> valid links only with the protocol and :// prepended.)
>> Or am I missing something?.
>>
>> Klaus
>>
>> ------------------------------------------------------------------------
>> On 2017-12-11 14:59, Igor Tandetnik wrote:
>>
>>> Something like this (could likely be simplified further, but this should
>>> give the idea):
>>>
>>> with recursive subdomains as (
>>>    SELECT substr(link, instr(link, '://')+3, instr(substr(link,
>>> instr(link, '://')+3), '/')-1) AS long,
>>>           substr(link, instr(link, '://')+3, instr(substr(link,
>>> instr(link, '://')+3), '/')-1) as subdomain
>>>    FROM links
>>> union all
>>>    select long, substr(subdomain, instr(subdomain, '.') + 1)
>>>    from subdomains
>>>    where instr(subdomain, '.') > 0)
>>> select * from subdomains
>>> where (instr(subdomain, '.') = 0 and subdomain = long) OR
>>>        (instr(subdomain, '.') > 0 and instr(substr(subdomain,
>>> instr(subdomain, '.') + 1), '.')=0);
>>>
>>> The main point is to recursively build a table of all suffixes, then
>>> select just the suffixes you want.
>>>
>> _______________________________________________
>> sqlite-users mailing list
>> [hidden email]
>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>
> _______________________________________________
> sqlite-users mailing list
> [hidden email]
> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
Reply | Threaded
Open this post in threaded view
|

Re: extracting domain names from website addresses efficiently

Teg-3
In reply to this post by Peter da Silva
Hello Peter,

Monday, December 11, 2017, 9:16:27 AM, you wrote:

PDS> This seems like a job for regular expressions.
PDS>  

PDS> _______________________________________________
PDS> sqlite-users mailing list
PDS> [hidden email]
PDS> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

I'm  with  you. This seems like using the wrong tool for the job. Sure
it can work but it's probably not the most efficient way.

--
 Teg                            mailto:[hidden email]

_______________________________________________
sqlite-users mailing list
[hidden email]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users