## Truth Discovery with Multiple Conflicting Information Providers on the Web


**Truth Discovery with Multiple Conflicting Information Providers on the Web**
KDD 07

**Motivation**
• Example: authors of books
• We tried to find out who wrote the book “Rapid Contextual Design”.
• Different online bookstores list many different sets of authors: some accurate, some incomplete.

**Motivation**
• Is the World Wide Web always trustworthy?
• Unfortunately, the answer is “no”.
• There is no guarantee that information on the web is correct.

**Motivation**
• Different web sites often provide conflicting information on a subject.
• 54% of Internet users trust news web sites
• 26% for web sites that sell products
• 12% for blogs

**Veracity**
• Veracity, i.e., conformity to truth
• How can we find true facts in a large amount of conflicting information on many subjects provided by various web sites?
• This paper proposes an algorithm called TRUTHFINDER.
• A web site is trustworthy if it provides many pieces of true information, and a piece of information is likely to be true if it is provided by many trustworthy web sites.

**Problem Definitions**
• Facts: properties of the objects
• [Figure: diagram with the labels Authors, Emotions, Bookstore, Readers, Books, News]

**Problem Definitions**
• Definition 1 (confidence of facts): the confidence of a fact f, denoted s(f), is the probability of f being correct, according to the best of our knowledge.
• Definition 2 (trustworthiness of web sites): the trustworthiness of a web site w, denoted t(w), is the expected confidence of the facts provided by w.

**Problem Definitions**
• Implication between facts
• Imp(f1 → f2): how much f2’s confidence should be increased or decreased according to f1’s confidence.
• Imp(f1 → f2) is a value between −1 and 1.
• A positive value indicates that if f1 is correct, f2 is likely to be correct.
• A negative value means that if f1 is correct, f2 is likely to be wrong.
• Imp(f1 → f2) = sim(f1, f2) − base_sim, where sim(f1, f2) is the similarity between f1 and f2, and base_sim is a threshold for similarity.

**Computational Model**
• Heuristic 1: Usually there is only one true fact for a property of an object.
• Heuristic 2: This true fact appears to be the same or similar on different web sites.
• Heuristic 3: The false facts on different web sites are less likely to be the same or similar.
• Heuristic 4: In a certain domain, a web site that provides mostly true facts for many objects will likely provide true facts for other objects.

**Computational Model**
• If a fact is provided by many trustworthy web sites, it is likely to be true.
• If a fact conflicts with the facts provided by many trustworthy web sites, it is unlikely to be true.
• A web site is trustworthy if it provides facts with high confidence.
• Web site trustworthiness and fact confidence can each be determined from the other.
• True facts are more consistent than false facts (Heuristic 3).

**Computational Model - Basic Inference**
• Basic inference
• t(w1) = 0.9 and t(w2) = 0.99
• t(w2) = 1.1 × t(w1)
• τ(w2) = 2 × τ(w1), with the trustworthiness score τ(w) = −ln(1 − t(w))

**Computational Model - Basic Inference**
• f1 is provided by w1 and w2; if f1 is wrong, then both w1 and w2 are wrong.
• The probability that both of them are wrong is (1 − t(w1)) · (1 − t(w2)).
• The probability that f1 is not wrong is 1 − (1 − t(w1)) · (1 − t(w2)).
• s(f) can be computed as: s(f) = 1 − ∏ (1 − t(w)), where the product runs over all web sites w that provide f.

**Computational Model - Inferences between Facts**
• Adjusted confidence score
• New score of the fact

**Computational Model - Handling Additional Subtlety**
• Different web sites are not independent of each other.
• Dampening factor
• The confidence of a fact f can easily be negative.

**Computational Model - Iterative Computation**
• New score of the fact

**Experiments**
• Baseline
• Voting: this method chooses the fact that is provided by the most web sites.
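The difference between the voting baseline and the basic inference rule from the Computational Model slides can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, site names, fact labels, and trustworthiness values below are all assumptions made for illustration, and trustworthiness is taken as given instead of being computed iteratively.

```python
from collections import Counter
from math import prod


def vote(claims):
    """Voting baseline: return the fact provided by the most web sites.

    `claims` maps a (hypothetical) site name to the fact it provides.
    """
    counts = Counter(claims.values())
    return counts.most_common(1)[0][0]


def basic_confidence(fact, claims, trust):
    """Basic inference: s(f) = 1 - product of (1 - t(w)) over providers w of f."""
    providers = [w for w, f in claims.items() if f == fact]
    return 1.0 - prod(1.0 - trust[w] for w in providers)


# Assumed toy data: one highly trusted site against two weakly trusted ones.
claims = {"site_a": "authors_A", "site_b": "authors_B", "site_c": "authors_B"}
trust = {"site_a": 0.99, "site_b": 0.5, "site_c": 0.5}

by_vote = vote(claims)
by_confidence = max(
    set(claims.values()),
    key=lambda f: basic_confidence(f, claims, trust),
)
print("voting picks:", by_vote)
print("basic inference picks:", by_confidence)
```

In this toy case voting follows the two weakly trusted sites, while the confidence-based choice follows the single highly trusted one (roughly 0.99 against 0.75).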
• Parameters: ρ = 0.5, γ = 0.3
• Book Authors Dataset:
• 1,265 computer science books
• Using ISBN search on www.abebooks.com
• 894 bookstores and 34,031 listings
• 5.4 different sets of authors per book

**Experiments**
• Randomly select 100 books
• Manually find out their authors
• Accuracy:
• Partially correct facts:
• Last name, first name, and middle name are weighted 3:2:1
• For example, “Graeme Simsion” scores 5/6 (middle name omitted)
• If f1 has x authors and f2 has y authors, and there are z shared ones, then imp(f1 → f2) = z/x − base_sim
• base_sim = 0.5

**Experiments - Book Authors Dataset**
• One book may make multiple errors
• Missing authors: only a subset of all authors is provided

**Experiments - Query with Google**
• Google is good at finding authoritative web sites.
• But do these web sites provide accurate information?
• Compare the online bookstores with the highest Google ranks against those with the highest trustworthiness found by TruthFinder.
• Query Google with “bookstore”.
• Find all bookstores from the top 300 Google results that also exist in the dataset.

**Conclusion**
• Veracity problem
• TRUTHFINDER algorithm
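The author-set implication and the 3:2:1 name weighting from the Experiments slides can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, authors are compared by exact string match, and `name_score` assumes the true name has a last, first, and middle part.

```python
def author_implication(f1, f2, base_sim=0.5):
    """imp(f1 -> f2) = z / x - base_sim.

    x is the number of authors in f1 and z is the number of authors
    shared by f1 and f2 (exact string match is an assumption here).
    """
    f1, f2 = set(f1), set(f2)
    if not f1:
        raise ValueError("f1 must list at least one author")
    shared = len(f1 & f2)
    return shared / len(f1) - base_sim


def name_score(has_last, has_first, has_middle):
    """Partial credit for one author name, weighting last:first:middle as 3:2:1."""
    return (3 * has_last + 2 * has_first + 1 * has_middle) / 6


# Assumed toy facts: f_short lists two authors, f_full lists those two plus one more.
f_short = {"A", "B"}
f_full = {"A", "B", "C"}

# The measure is asymmetric: every author in f_short also appears in f_full,
# but only two of the three authors in f_full appear in f_short.
print(author_implication(f_short, f_full))
print(author_implication(f_full, f_short))

# Last and first name present, middle name omitted: 5/6 credit.
print(name_score(True, True, False))
```

With base_sim = 0.5 the result always lies between −0.5 and 0.5, and a fact that lists only a subset of the authors gives strong support to the fuller listing but receives weaker support in return.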