![]() |
Removing Near-Duplicate Rows, Leaving Those w/Most Data in Specific Columns
I have a spreadsheet with about 20,000 rows of data and I am trying to
eliminate unnecessary rows. Columns B&C contain numbers, and columns D through AR contain text. Many rows contain identical information in columns B, C, D, E, & F, but the information in the remaining columns may differ. I would like to eliminate rows with duplicate data in columns B through F, retaining the rows with the greatest number of entries in columns G through AR. An example of what I would like to get to is shown below. BEFORE B C D E F G H I J K L Row 1 12 34 DD EE FF GG HH II JJ KK LL Row 2 12 34 DD EE FF GG HH II JJ Row 3 12 34 DD EE FF GG HH II AFTER B C D E F G H I J K L Row 1 12 34 DD EE FF GG HH II JJ KK LL I am not familiar with Visual Basic, so I would prefer to handle this with a formula, if possible. Can anyone help? Thanks! Sandi |
How about a little work:
First, do this against a copy--just in case! Insert two new columns A:C in that worksheet. In A1, put this formula: =row() In B1: =d1&CHAR(1)&e1&CHAR(1)&f1&CHAR(1)&g1&CHAR(10)&h1 (it concatenates the stuff in C:G (formerly A:F) Then in C1, put this: =COUNTA(J1:AU1) (formerly G:AR) And drag down 20000 rows. select A:C edit|copy edit|paste special|Values (things will work faster if you have values, not formulas) Add a new row 1 (for headers). Now select the whole range and sort by column A in ascending order and a secondary key of column B (descending order). The top one of each group will be the one that had the most entries. Now select column B. And do Data|Filter|Advanced filter filter in place and Unique records only. Now the rows you can see are the ones you want. Select those visible rows edit|copy paste to a new worksheet. Sort this data by column A to put it back in the original order. delete columns A:C foofoo wrote: I have a spreadsheet with about 20,000 rows of data and I am trying to eliminate unnecessary rows. Columns B&C contain numbers, and columns D through AR contain text. Many rows contain identical information in columns B, C, D, E, & F, but the information in the remaining columns may differ. I would like to eliminate rows with duplicate data in columns B through F, retaining the rows with the greatest number of entries in columns G through AR. An example of what I would like to get to is shown below. BEFORE B C D E F G H I J K L Row 1 12 34 DD EE FF GG HH II JJ KK LL Row 2 12 34 DD EE FF GG HH II JJ Row 3 12 34 DD EE FF GG HH II AFTER B C D E F G H I J K L Row 1 12 34 DD EE FF GG HH II JJ KK LL I am not familiar with Visual Basic, so I would prefer to handle this with a formula, if possible. Can anyone help? Thanks! Sandi -- Dave Peterson |
All times are GMT +1. The time now is 06:35 PM. |
Powered by vBulletin® Copyright ©2000 - 2025, Jelsoft Enterprises Ltd.
ExcelBanter.com