Below is the data in TestingTable1 sorted by date in descending order always
BUYER_ID | ITEM_ID | CREATED_TIME
----------+-----------------+----------------------
1345653 110909316904 2012-07-09 21:29:06
1345653 151851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48
1345653 400307563710 2012-07-09 18:57:33
And if this is the below data in TestingTable2 sorted by date in descending order always
USER_ID | PRODUCT_ID | LAST_TIME
---------+----------------+-----------------------
1345653 110909316904 2012-07-09 22:29:06
1345653 150851771618 2012-07-09 19:57:33
Each row in TestingTable1 should match with TestingTable2, if doesn't match or data is not there in TestingTable2, then I need to show them in the output as in TestingTable1 I have this data but corresponding to TestingTable2 I have this data(which will be wrong data), so that I can see what is mismatch and what data is missing.
I need to compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. I need to see, if BUYER_ID and USER_ID gets matched then I need to compare ITEM_ID with PRODUCT_ID and CREATED_TIME with LAST_TIME and if there is a mismatch in TestingTable2 after comparing with TestingTable1 in either one of them or both of them, then I need to show the result.
So if you look at the above example- I have three scenarios basically
- Firstly- In
TestingTable1, in the First rowITEM_IDis matching withPRODUCT_IDin the First row ofTestingTable2butCREATED_TIMEis not matching withLAST_TIMEfor the first row in both the tables - Secondly- In
TestingTable1, in the Second rowCREATED_TIMEis matching withLAST_TIMEin the second row ofTestingTable2butITEM_IDis not matching withPRODUCT_IDfor the second row in both the tables - Thirdly- In
TestingTable1, last two lines(rows) are not there at all inTestingTable2. This scenario is not covered in my query that I wrote. I want this scenario also in my query.
So these are the three cases that I need to cover while comparing TestingTable2 with TestingTable1 always. And TestingTable1 is the MAIN table through which comparisons need to be made always, so it means data in TestingTable1 is always accurate.
So I need to show the result like below considering the above example if not matching either one of them or data is not there in TestingTable2 at all- TestingTable1 data then next to it same TestingTable2 data, so that I can see what value was there in TestingTable1 as compared to TestingTable2
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | LAST_TIME
-----------+-----------------+---------------------------+----------------+--------------------+-----------------------
1345653 110909316904 2012-07-09 21:29:06 1345653 110909316904 2012-07-09 22:29:06
1345653 151851771618 2012-07-09 19:57:33 1345653 150851771618 2012-07-09 19:57:33
1345653 221065796761 2012-07-09 19:31:48 NULL NULL NULL
1345653 400307563710 2012-07-09 18:57:33 NULL NULL NULL
Below is the query I wrote that covers only my two scenarios that I mentioned above and it works fine and I will get the output like above leaving last two rows from the Output. But I need to add third scenario also in this(below) query so that it gives output like above.
SELECT *
FROM (SELECT T2.buyer_id,
T2.item_id,
T2.created_time AS created_time,
subq.user_id,
subq.product_id,
subq.LAST_TIME
FROM TestingTable2 subq
JOIN TestingTable1 T2
ON T2.buyer_id = subq.user_id
AND subq.LAST_TIME = ( T2.created_time )
WHERE ( subq.product_id <> T2.item_id )
UNION ALL
SELECT T2.buyer_id,
T2.item_id AS item_id,
T2.created_time,
subq.user_id,
subq.product_id AS product_id,
subq.LAST_TIME
FROM TestingTable2 subq
JOIN TestingTable1 T2
ON T2.buyer_id = subq.user_id
AND subq.product_id = T2.item_id
WHERE ( subq.LAST_TIME <> ( T2.created_time ) )) finalResult
ORDER BY finalResult.BUYER_ID;
Any suggestion will be appreciated.
P.S. I have asked few questions related to JOIN in past few days, but that covers only my two scenarios, not the third scenario that I need in this query.
Update:-
I cannot use NOT IN or NOT EXISTS syntax for SQL, as I am working with Hive and Hive doesn't support NOT IN or NOT EXISTS that so I need some other way to deal with this.
I need to use my query only to modify it to work for Third Scenario, as that way Hive will support SQL syntax.
Below is my SQL Fiddle, that fulfills my two scenario above but not the third scenario. Can anyone help me to modify my query to work for third scenario as well?
http://sqlfiddle.com/#!3/102dd/1/0.
Data in Table1 should be there in Table2, if it is not there, then I need to show the mismatch between data after comparing from Table1 and also it is possible, data from Table1 will not be there in Table2, and I want to show that too.
Updated Output for the fiddle http://sqlfiddle.com/#!3/102dd/3/0
BUYER_ID | ITEM_ID | CREATED_TIME | USER_ID | PRODUCT_ID | LAST_TIME
-----------+-----------------+---------------------------+----------------+--------------------+-----------------------
1345653 151851771618 July, 09 2012 19:57:33 1345653 150851771618 July, 09 2012 19:57:33
1345653 221065796761 July, 09 2012 19:31:48 1345653 221165796761 July, 09 2012 19:31:48
1345653 110909316904 July, 09 2012 21:29:06 1345653 110909316904 July, 09 2012 22:29:06
1345653 400307563710 July, 09 2012 18:57:33 NULL NULL NULL
1345653 310411560125 July, 09 2012 16:09:49 NULL NULL NULL
UPDATED SQL QUERY THAT IS GIVING ME ERROR
I replaced TestingTable1 with this query-
(SELECT BUYER_ID, ITEM_ID, rank(BUYER_ID), CREATED_TIME
FROM (
SELECT BUYER_ID, ITEM_ID, CREATED_TIME
FROM testingtable1
where to_date(from_unixtime(cast(UNIX_TIMESTAMP(CREATED_TIME) as int))) = '2012-07-09'
DISTRIBUTE BY BUYER_ID
SORT BY BUYER_ID, CREATED_TIME desc
) T1
WHERE rank(BUYER_ID) < 5)
And TestingTable2 with this query-
(SELECT USER_ID, PROD_AND_TS.PRODUCT_ID as PRODUCT_ID, PROD_AND_TS.TIMESTAMPS as TIMESTAMPS FROM testingtable2 lateral view explode(PURCHASED_ITEM) exploded_table as PROD_AND_TS where to_date(from_unixtime(cast(PROD_AND_TS.TIMESTAMPS as BIGINT))) = '2012-07-09')
