WordPress: Find a list of similar (dupe) products

13 110
Avatar for Gigamegs
3 years ago

Imagine you have a large collection of products with multiple attributes, for example for your skincare routine or cooking recipes and you want to find a list of similar (dupe) products.

One way to solve it is to create a post for each product and group it with a selected number of relationships (https://wordpress.org/support/article/taxonomies/).

On the right side in the image you can see such an example with the product and the ingredients (inside top red box).

To order the find list each matched result is assigned a rank (percentage) with these rules:
1. If the matched result has the same number of ingredients the rank is 100% ( 100% match).
2. If the matched result has more ingredients than the rank is the base number of ingredients / matched number of ingredients.
3. If the matched result has less ingredients than the rank is the matched number of ingredients / base number of ingredients.

You can see an example of the final find list in the the above image on the right side (inside the red box).

Unfortunately the query is not trivial and complex and so after many trial and error and tweaking the database and my MySQL query I am finally proud to present and share with you a fast and pure MySQL query to solve this problem specifically for WordPress:

SELECT wp_posts.ID, wp_posts.post_title, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid, count(tt.term_id) as num, c.num as hits, (if (count(tt.term_id)=c.num,100,if (count(tt.term_id)>c.num,c.num/count(tt.term_id)*100,count(tt.term_id)/c.num*100))) as rank FROM wp_posts FORCE INDEX (idx2)
INNER JOIN (
SELECT object_id as ID, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid, count(tt.term_id) as num FROM wp_term_relationships as tr FORCE INDEX (idx3) INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id) WHERE tt.taxonomy = 'post_tag' AND tt.term_id IN (548, 669)
GROUP BY ID ) AS c ON wp_posts.ID = c.ID
INNER JOIN wp_term_relationships tr FORCE INDEX (PRIMARY) ON (wp_posts.ID = tr.object_id) INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE post_type = 'post' AND post_status = 'publish' and tt.taxonomy = 'post_tag'
GROUP BY wp_posts.ID
ORDER BY rank DESC, hits desc
LIMIT 30;

This query finds a list of similar or dupe products with the ingredients 548, 669 in the entire WordPress post database, skipping drafts, compute the rank and sort the result and is also amazing fast.

The database contains total 33144 products:
mysql> select count(*) from wp_posts;
+----------+
| count(*) |
+----------+
| 33144 |
+----------+
and total 932 terms:
mysql> select count(*) from wp_terms;
+----------+
| count(*) |
+----------+
| 932 |
+----------+
and each product can have up to 30-40 ingredients (e.g. terms):
mysql> select count(*) from wp_term_relationships;
+----------+
| count(*) |
+----------+
| 128344 |
+----------+

Because the query is complex it needs a lot of time, so I have come up with this solution to solve it

show create table wp_posts;
show index from wp_posts;
create index idx2 on wp_posts(post_type, post_status, post_date, ID);
And
show create table wp_term_relationships;
show index from wp_term_relationships;
create index idx3 on wp_term_relationships(term_taxonomy_id,object_id,term_order);

Both are "covering" indexes and the order is very important. I have experimented with different indexes, for example:

show create table wp_posts;
DROP INDEX idx1 ON wp_posts;
create index idx1.1 on wp_posts(post_type, ID, post_status, post_date);
create index idx1.2 on wp_posts(post_type, post_status, post_date);
create index idx1.3 on wp_posts(post_type, ID, post_status, post_date);
create index idx2 on wp_posts(post_type, post_status, post_date, ID);
create index idx3 on wp_posts(post_type, post_status, ID);
create index idx4 on wp_posts(ID, post_type, post_status);
create index idx5 on wp_posts(post_type, ID, post_status);
create index idx6 on wp_posts(post_type, post_status, post_date, post_date_gmt, ID);
And
show create table wp_term_relationships;
show index from wp_term_relationships;
create index idx1 on wp_term_relationships(term_taxonomy_id,object_id);
create index idx2 on wp_term_relationships(object_id,term_taxonomy_id);
create index idx3 on wp_term_relationships(term_taxonomy_id,object_id,term_order);
And
show create table wp_term_taxonomy;
show index from wp_term_taxonomy;
create index idx1 on wp_term_taxonomy(term_taxonomy_id,term_id);
create index idx2 on wp_term_taxonomy(taxonomy,term_taxonomy_id,term_id);
create index idx3 on wp_term_taxonomy(term_id,taxonomy,term_taxonomy_id);
create index idx4 on wp_term_taxonomy(term_id,term_taxonomy_id,taxonomy);
create index idx5 on wp_term_taxonomy(term_id,term_taxonomy_id);
CREATE UNIQUE INDEX idx_term_taxonomy_id_taxonomy ON term_taxonomy( term_taxonomy_id, taxonomy);
CREATE UNIQUE INDEX idx7 ON wp_term_taxonomy(term_taxonomy_id, taxonomy);
CREATE UNIQUE INDEX idx8 ON wp_term_taxonomy(taxonomy,term_taxonomy_id);
create index idx9 on wp_term_taxonomy(term_taxonomy_id,taxonomy);
create index idx10 on wp_term_taxonomy(taxonomy);

These are only the indexes but it wasn't the fastest indexes.

You can try MySQL explain command:
EXPLAIN SELECT wp_posts.ID, wp_posts.post_title, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid, count(tt.term_id) as num,
c.num as hits, (if (count(tt.term_id)=c.num,100,if (count(tt.term_id)>c.num,c.num/count(tt.term_id)*100,count(tt.term_id)/c.num*100))) as rank
FROM wp_posts FORCE INDEX (idx2)
INNER JOIN (
SELECT object_id as ID, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid, count(tt.term_id) as num
FROM wp_term_relationships as tr FORCE INDEX (idx3)
INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE tt.taxonomy = 'post_tag' AND tt.term_id IN (548, 669)
GROUP BY ID
) AS c
ON wp_posts.ID = c.ID
INNER JOIN wp_term_relationships tr FORCE INDEX (PRIMARY) ON (wp_posts.ID = tr.object_id)
INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE post_type = 'post' AND post_status = 'publish' and tt.taxonomy = 'post_tag'
GROUP BY wp_posts.ID
ORDER BY rank DESC, hits desc
LIMIT 30;

You can find an interesting read here:https://stackoverflow.com/questions/32937517/improving-mysql-select-statement-in-wordpress-theme/35051788#comment96844969_35051788

If you want to try my MySQL query yourself you can replace 548, 669 with any number of term id's to match your WordPress database.

To simply get a list with all the products and the ingredients:

Faster:
SELECT wp_posts.ID, wp_posts.post_title, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid FROM wp_posts FORCE INDEX (idx2)
INNER JOIN (
SELECT DISTINCT object_id as ID
FROM wp_term_relationships as tr FORCE INDEX (idx3)
INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE tt.taxonomy = 'post_tag' AND tt.term_id IN (548, 669)
) AS c
ON wp_posts.ID = c.ID
INNER JOIN wp_term_relationships tr FORCE INDEX (PRIMARY) ON (wp_posts.ID = tr.object_id)
INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE post_type = 'post' AND post_status = 'publish'
GROUP BY wp_posts.ID;

Medium:
SELECT wp_posts.ID, wp_posts.post_name, wp_posts.post_title, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid
FROM wp_posts FORCE INDEX (idx2)
INNER JOIN (
SELECT p.ID as ID
FROM wp_posts AS p FORCE INDEX (idx2)
WHERE p.post_type = 'post'
AND p.post_status = 'publish'
AND EXISTS
(
SELECT 1
FROM wp_term_relationships AS tr FORCE INDEX (idx2)
WHERE p.ID = tr.object_id
AND EXISTS
(
SELECT 1
from wp_term_taxonomy AS tt FORCE INDEX (PRIMARY)
WHERE tr.term_taxonomy_id = tt.term_taxonomy_id
AND tt.taxonomy = 'post_tag'
AND tt.term_id IN (548, 669) )
)
) AS c
ON wp_posts.ID = c.ID
INNER JOIN wp_term_relationships tr FORCE INDEX (PRIMARY) ON (wp_posts.ID = tr.object_id)
INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE post_type = 'post' AND post_status = 'publish' AND tt.taxonomy = 'post_tag'
GROUP BY wp_posts.ID;

Slower:
SELECT wp_posts.ID, wp_posts.post_name, wp_posts.post_title, GROUP_CONCAT(tt.term_id SEPARATOR ',') as termid
FROM wp_posts FORCE INDEX (idx2)
INNER JOIN (
SELECT tr.object_id as ID
FROM wp_term_relationships AS tr FORCE INDEX (idx3)
WHERE EXISTS
(
SELECT 1
from wp_term_taxonomy AS tt FORCE INDEX (idx7)
WHERE tr.term_taxonomy_id = tt.term_taxonomy_id
AND tt.taxonomy = 'post_tag'
AND tt.term_id IN (548, 669) )

) AS c
ON wp_posts.ID = c.ID
INNER JOIN wp_term_relationships tr FORCE INDEX (idx2) ON (wp_posts.ID = tr.object_id)
INNER JOIN wp_term_taxonomy tt FORCE INDEX (PRIMARY) ON (tr.term_taxonomy_id = tt.term_taxonomy_id)
WHERE post_type = 'post' AND post_status = 'publish'
GROUP BY wp_posts.ID;

I hope you have enjoyed my little post and I hope it also helps a bit! Thank you for your time and effort!

7
$ 0.00
Avatar for Gigamegs
3 years ago

Comments

Hi can I ask you something? Is wordpress is free? Coz I think its a website that had payment,please tell me if there is a free to access, thanks.

$ 0.00
3 years ago

WordPress is free.

$ 0.00
3 years ago

Ah okay, I know its usefull website for being freelancer, but I don't know yet how to use it. Thank you for the information. Maybe I will start searching about it in youtube. It's really helpful your article.

$ 0.00
3 years ago

Please be more careful next time. When i was approving articles in our community, at first i tought you will try to sell us fake face creams. Luckily i pressed page down, where i have noticed the SQL queries....

$ 0.00
3 years ago

What? The title says it all. Are you talking with me? Didn't you read the title? What part is looking like an ad? Care to elaborate? Thanks!

$ 0.00
3 years ago

first i only read dupe products

$ 0.00
3 years ago

ok. thanks

$ 0.00
3 years ago

if i can give you an advice, is to think carefully on the first few lines, and on the title of the articles. from the aspect of psychology, its extremely important.

$ 0.00
3 years ago

IDK how you can associate "dupe products" with fake face creams!? The title also contains other words. IMO it's a very technical and common term. What would be a better title? Any suggestions?

$ 0.00
3 years ago

i dont know, you should come up with an idea about how to grip the attention of your target audience

$ 0.00
3 years ago

OK. I updated the title. Better?

$ 0.00
3 years ago

indeed!

$ 0.00
3 years ago

thanks

$ 0.00
3 years ago