Identifying and characterizing highly similar notes in big clinical note datasets

Journal of Biomedical Informatics 82 (2018) 63–69
Journal homepage: www.elsevier.com/locate/yjbin

Rodney A. Gabriel (a, b), Tsung-Ting Kuo (a), Julian McAuley (c), Chun-Nan Hsu (a)

(a) UCSD Health Department of Biomedical Informatics, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA
(b) Department of Anesthesiology, University of California, San Diego, 200 West Arbor Dr, San Diego, CA 92103, USA
(c) Department of Computer Science and Engineering, University of California, San Diego, 9500 Gilman Dr, La Jolla, CA 92093, USA

Keywords: Electronic medical record; De-deduplication; Natural language processing

Abstract

Background: Big clinical note datasets found in electronic health records (EHR) present substantial opportunities to train accurate statistical models that identify patterns in patient diagnosis and outcomes. However, near-to-exact duplication in note texts is a common issue in many clinical note datasets. We aimed to use a scalable algorithm to de-duplicate notes and further characterize the sources of duplication.

Methods: We use an approximation algorithm to minimize pairwise comparisons consisting of three phases: (1) Minhashing with Locality Sensitive Hashing; (2) a
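Phase (1) of the Methods, Minhashing with Locality Sensitive Hashing, can be illustrated with a minimal sketch. This is not the authors' implementation: the word-shingle size, the number of hash functions, the number of bands, and every function name below are assumptions chosen for illustration, and the later phases are not shown because the preview text is cut off before describing them. The idea is that each note is reduced to a short MinHash signature, and only notes whose signatures agree on at least one band are kept as candidate pairs, which avoids comparing every note against every other note.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of k-word shingles of a note (k is an assumed setting)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, shingle):
    """Deterministic 64-bit hash of a shingle; each seed acts as one hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=100):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=20):
    """Split each signature into bands; notes sharing any band become candidates."""
    buckets = defaultdict(set)
    for note_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(note_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify a candidate pair."""
    return len(a & b) / len(a | b)
```

In use, one would build a dictionary mapping note identifiers to signatures, call `lsh_candidate_pairs` on it, and then compute the exact `jaccard` similarity only for the returned pairs. More bands with fewer rows per band makes the filter more permissive; fewer bands with more rows makes it stricter.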

Publisher: Elsevier
Copyright: © 2018 Elsevier Ltd
ISSN: 1532-0464
eISSN: 1532-0480
DOI: 10.1016/j.jbi.2018.04.009


Journal: Journal of Biomedical Informatics (Elsevier)

Published: Jun 1, 2018
