Home | Contact Us | FAQ | Search & Site Map | Link to Us
Sign In | Join | Other 45 Sites in Network
Home
Discussion GroupsGeneralPHPASPPerlColdFusionFlashHTML, CSS, ScriptsBrowsers

Webmaster Forum / ASP / Database Access / November 2007



Tip: Looking for answers? Try searching our database.

Need help building query

Thread view: 
Enable EMail Alerts  Start New Thread
Thread rating: 
Adrienne Boswell - 05 Nov 2007 18:19 GMT
Using MS SQL, ASP Classic.  Here's the situation.  We have a very
large database, with indexed views, separated by states.  There are a
lot of "duplicate" records, eg:

Vendor table
1 Acme Widget        123 This Street  SomeTown CA 91206
2 Acme Widget Co   123 This Street  SomeTown CA 91206
3 Acme Widget Corp 123 This Street SomeTown CA 91206
4 Acme Widget        234 Some Street SomeTown CA 90212
5 Acme Widget Co   567 Other Street SomeTown CA 90028
6 Acme Widget Co   234 Some Street SomeTown CA 90212

By browsing this, I know that 2 is a dup and 6 are dupes.  3 looks
like it could be a dupe, but is not.  The city information is actually
a zip_id which is tied to the city, state and zipcode tables.

So, I need to run a query to select the duplicates, and then delete
them.  I know I am going to have to drop and create views, indexes,
etc.

Any help appreciated....

--
Adrienne Boswell at work
Administrator nextBlock.com
http://atlas.nextblock.com/files/
Please respond to the group so others can share
Evertjan. - 05 Nov 2007 18:41 GMT
Adrienne Boswell wrote on 05 nov 2007 in
microsoft.public.inetserver.asp.db:

> Using MS SQL, ASP Classic.  Here's the situation.  We have a very
> large database, with indexed views, separated by states.  There are a
[quoted text clipped - 15 lines]
> them.  I know I am going to have to drop and create views, indexes,
> etc.

Adrienne,

Either you have to come up with a sound description of what of the near-
duplicates you accept as duplicates, or you will have to do it by hand.

It is just as all programming [with the exception of neural networks]:
you have to abstract the handwork before you can start programming.

If these firms were in a Arabian soucq, where al carpenters are brothers,
share the same name and work next to each other in the same street, there
would probably be no duplicates.

So only your local knowledge can guide you.

Signature

Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Adrienne Boswell - 05 Nov 2007 19:36 GMT
> Adrienne Boswell wrote on 05 nov 2007 in
> microsoft.public.inetserver.asp.db:
[quoted text clipped - 32 lines]
>
> So only your local knowledge can guide you.

Thank you, Evertjan, I was afraid that was the answer... you are
always such a big help.  Thank you again.

I can always tell my boss she has to pay for a new pair of glasses for
me... I'm going to need them going through 100,000+ records.  Wish me
luck.

--
Adrienne Boswell at work
Administrator nextBlock.com
http://atlas.nextblock.com/files/
Please respond to the group so others can share
Evertjan. - 05 Nov 2007 20:07 GMT
Adrienne Boswell wrote on 05 nov 2007 in
microsoft.public.inetserver.asp.db:

>> Adrienne,
>>
[quoted text clipped - 18 lines]
> me... I'm going to need them going through 100,000+ records.  Wish me
> luck.

But:

You could build a programme
that selects probable duplicates for you to see.

count = 0
(start where I was yesterday or set counting = true)
do
 dup = 0
 if (first 3 zip-chars are same) then dup = dup + .5
 if (first 4 zip-chars are same) then dup = dup + .5
 if (first 5 zip-chars are same) then dup = dup + 1
 if (house number is same then dup = dup + 1
 if (first word of name is same) then dup = dup + 1
 if (town name is same) then dup = dup + 1
 if (telephonenumber is same) then dup = dup + 2
 .... etc.
 if dup>3 then // probability linit
   if counting then
     count = count+1
   else
     (show me for possible deletion)
     if (done for today) then (safe where I was)
   end if
 end if
loop until eof or (done for today)

First do a counting run to see how many you catch.
If it is more than a 1000 make the probability more specific,
and les than a 100 lesss specific.

If you get a reasonable number,
buy your new spectacles and start deleting by hand.

Signature

Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Bob Barrows [MVP] - 05 Nov 2007 20:18 GMT
>> Adrienne Boswell wrote on 05 nov 2007 in
>> microsoft.public.inetserver.asp.db:
[quoted text clipped - 42 lines]
> me... I'm going to need them going through 100,000+ records.  Wish me
> luck.

You might be able to reduce the number of records you have to peer at by
running a query like:

select vendorid,vendorname from vendors where
left(vendorname,charindex(' ', vendorname)-1) in (
select left(vendorname,charindex(' ', vendorname)-1) firstword
from vendors
group by left(vendorname,charindex(' ', vendorname)-1)
having count(*) > 1)

Signature

Microsoft MVP -- ASP/ASP.NET
Please reply to the newsgroup. The email account listed in my From
header is my spam trap, so I don't check it very often. You will get a
quicker response by posting to the newsgroup.

Evertjan. - 05 Nov 2007 23:45 GMT
Bob Barrows [MVP] wrote on 05 nov 2007 in
microsoft.public.inetserver.asp.db:

> You might be able to reduce the number of records you have to peer at by
> running a query like:
[quoted text clipped - 5 lines]
> group by left(vendorname,charindex(' ', vendorname)-1)
> having count(*) > 1)

You are right, Bob,

but testing for couple of records with slight mismatches between fields in
this huge amount of records won't be that easy.

Perhaps only testing in groups with the same zip-code will do for starters?

Signature

Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Bob Barrows [MVP] - 06 Nov 2007 00:15 GMT
> Bob Barrows [MVP] wrote on 05 nov 2007 in
> microsoft.public.inetserver.asp.db:
[quoted text clipped - 16 lines]
> Perhaps only testing in groups with the same zip-code will do for
> starters?

That's a good starting point. It's going to take a massive effort involving
more than one person, I expect.

Signature

Microsoft MVP - ASP/ASP.NET
Please reply to the newsgroup. This email account is my spam trap so I
don't check it very often. If you must reply off-line, then remove the
"NO SPAM"

Adrienne Boswell - 06 Nov 2007 15:43 GMT
>> Bob Barrows [MVP] wrote on 05 nov 2007 in
>> microsoft.public.inetserver.asp.db:
[quoted text clipped - 19 lines]
> That's a good starting point. It's going to take a massive effort
> involving more than one person, I expect.

Actually, I have found that phone numbers are a good place to start.  If
one phone number has more than one vendor, it's probably a dupe.

Yes, I have to go through them manually, and started yesterday.  I got
through about 5000 and found 10 duplicates. One of the reasons we have
this duplicate problem is that when records were imported several years
ago, the creteria was shady (I thought).  For example, I would not have
imported a record that had only a phone number, and no address, simply
because if you are checking for VENDOR+ADDRESS+CITY+ZIPCODE one vendor
with no address, and one with will not look like a dupe (even though it
is).

Here's the fun part - I have a little under 3,000,000 records to comb
through.  

Signature

Adrienne Boswell at Home
Arbpen Web Site Design Services
http://www.cavalcade-of-coding.info
Please respond to the group so others can share

 
Sign In
Join
My Latest Posts
My Monitored Threads
My Blog
My Photo Gallery
My Profile
My Homepage

Start New Thread
Enable EMail Alerts
Rate this Thread



©2009 Advenet LLC   Privacy Policy - Terms of Use
This website includes both content owned or controlled by Advenet as well as content owned or controlled by third parties.