Need help building query
|
|
Thread rating:  |
Adrienne Boswell - 05 Nov 2007 18:19 GMT Using MS SQL, ASP Classic. Here's the situation. We have a very large database, with indexed views, separated by states. There are a lot of "duplicate" records, eg:
Vendor table 1 Acme Widget 123 This Street SomeTown CA 91206 2 Acme Widget Co 123 This Street SomeTown CA 91206 3 Acme Widget Corp 123 This Street SomeTown CA 91206 4 Acme Widget 234 Some Street SomeTown CA 90212 5 Acme Widget Co 567 Other Street SomeTown CA 90028 6 Acme Widget Co 234 Some Street SomeTown CA 90212
By browsing this, I know that 2 is a dup and 6 are dupes. 3 looks like it could be a dupe, but is not. The city information is actually a zip_id which is tied to the city, state and zipcode tables.
So, I need to run a query to select the duplicates, and then delete them. I know I am going to have to drop and create views, indexes, etc.
Any help appreciated....
-- Adrienne Boswell at work Administrator nextBlock.com http://atlas.nextblock.com/files/ Please respond to the group so others can share
Evertjan. - 05 Nov 2007 18:41 GMT Adrienne Boswell wrote on 05 nov 2007 in microsoft.public.inetserver.asp.db:
> Using MS SQL, ASP Classic. Here's the situation. We have a very > large database, with indexed views, separated by states. There are a [quoted text clipped - 15 lines] > them. I know I am going to have to drop and create views, indexes, > etc. Adrienne,
Either you have to come up with a sound description of what of the near- duplicates you accept as duplicates, or you will have to do it by hand.
It is just as all programming [with the exception of neural networks]: you have to abstract the handwork before you can start programming.
If these firms were in a Arabian soucq, where al carpenters are brothers, share the same name and work next to each other in the same street, there would probably be no duplicates.
So only your local knowledge can guide you.
 Signature Evertjan. The Netherlands. (Please change the x'es to dots in my emailaddress)
Adrienne Boswell - 05 Nov 2007 19:36 GMT > Adrienne Boswell wrote on 05 nov 2007 in > microsoft.public.inetserver.asp.db: [quoted text clipped - 32 lines] > > So only your local knowledge can guide you. Thank you, Evertjan, I was afraid that was the answer... you are always such a big help. Thank you again.
I can always tell my boss she has to pay for a new pair of glasses for me... I'm going to need them going through 100,000+ records. Wish me luck.
-- Adrienne Boswell at work Administrator nextBlock.com http://atlas.nextblock.com/files/ Please respond to the group so others can share
Evertjan. - 05 Nov 2007 20:07 GMT Adrienne Boswell wrote on 05 nov 2007 in microsoft.public.inetserver.asp.db:
>> Adrienne, >> [quoted text clipped - 18 lines] > me... I'm going to need them going through 100,000+ records. Wish me > luck. But:
You could build a programme that selects probable duplicates for you to see.
count = 0 (start where I was yesterday or set counting = true) do dup = 0 if (first 3 zip-chars are same) then dup = dup + .5 if (first 4 zip-chars are same) then dup = dup + .5 if (first 5 zip-chars are same) then dup = dup + 1 if (house number is same then dup = dup + 1 if (first word of name is same) then dup = dup + 1 if (town name is same) then dup = dup + 1 if (telephonenumber is same) then dup = dup + 2 .... etc. if dup>3 then // probability linit if counting then count = count+1 else (show me for possible deletion) if (done for today) then (safe where I was) end if end if loop until eof or (done for today)
First do a counting run to see how many you catch. If it is more than a 1000 make the probability more specific, and les than a 100 lesss specific.
If you get a reasonable number, buy your new spectacles and start deleting by hand.
 Signature Evertjan. The Netherlands. (Please change the x'es to dots in my emailaddress)
Bob Barrows [MVP] - 05 Nov 2007 20:18 GMT >> Adrienne Boswell wrote on 05 nov 2007 in >> microsoft.public.inetserver.asp.db: [quoted text clipped - 42 lines] > me... I'm going to need them going through 100,000+ records. Wish me > luck. You might be able to reduce the number of records you have to peer at by running a query like:
select vendorid,vendorname from vendors where left(vendorname,charindex(' ', vendorname)-1) in ( select left(vendorname,charindex(' ', vendorname)-1) firstword from vendors group by left(vendorname,charindex(' ', vendorname)-1) having count(*) > 1)
 Signature Microsoft MVP -- ASP/ASP.NET Please reply to the newsgroup. The email account listed in my From header is my spam trap, so I don't check it very often. You will get a quicker response by posting to the newsgroup.
Evertjan. - 05 Nov 2007 23:45 GMT Bob Barrows [MVP] wrote on 05 nov 2007 in microsoft.public.inetserver.asp.db:
> You might be able to reduce the number of records you have to peer at by > running a query like: [quoted text clipped - 5 lines] > group by left(vendorname,charindex(' ', vendorname)-1) > having count(*) > 1) You are right, Bob,
but testing for couple of records with slight mismatches between fields in this huge amount of records won't be that easy.
Perhaps only testing in groups with the same zip-code will do for starters?
 Signature Evertjan. The Netherlands. (Please change the x'es to dots in my emailaddress)
Bob Barrows [MVP] - 06 Nov 2007 00:15 GMT > Bob Barrows [MVP] wrote on 05 nov 2007 in > microsoft.public.inetserver.asp.db: [quoted text clipped - 16 lines] > Perhaps only testing in groups with the same zip-code will do for > starters? That's a good starting point. It's going to take a massive effort involving more than one person, I expect.
 Signature Microsoft MVP - ASP/ASP.NET Please reply to the newsgroup. This email account is my spam trap so I don't check it very often. If you must reply off-line, then remove the "NO SPAM"
Adrienne Boswell - 06 Nov 2007 15:43 GMT >> Bob Barrows [MVP] wrote on 05 nov 2007 in >> microsoft.public.inetserver.asp.db: [quoted text clipped - 19 lines] > That's a good starting point. It's going to take a massive effort > involving more than one person, I expect. Actually, I have found that phone numbers are a good place to start. If one phone number has more than one vendor, it's probably a dupe.
Yes, I have to go through them manually, and started yesterday. I got through about 5000 and found 10 duplicates. One of the reasons we have this duplicate problem is that when records were imported several years ago, the creteria was shady (I thought). For example, I would not have imported a record that had only a phone number, and no address, simply because if you are checking for VENDOR+ADDRESS+CITY+ZIPCODE one vendor with no address, and one with will not look like a dupe (even though it is).
Here's the fun part - I have a little under 3,000,000 records to comb through.
 Signature Adrienne Boswell at Home Arbpen Web Site Design Services http://www.cavalcade-of-coding.info Please respond to the group so others can share
|
|
|