ExcelBanter - Removing Near-Duplicate Rows, Leaving Those w/Most Data in Specific Columns

ExcelBanter (https://www.excelbanter.com/)

- Excel Discussion (Misc queries) (https://www.excelbanter.com/excel-discussion-misc-queries/)

- - Removing Near-Duplicate Rows, Leaving Those w/Most Data in Specific Columns (https://www.excelbanter.com/excel-discussion-misc-queries/20226-removing-near-duplicate-rows-leaving-those-w-most-data-specific-columns.html)

Removing Near-Duplicate Rows, Leaving Those w/Most Data in Specific Columns

I have a spreadsheet with about 20,000 rows of data and I am trying to
eliminate unnecessary rows. Columns B&C contain numbers, and columns D
through AR contain text. Many rows contain identical information in
columns B, C, D, E, & F, but the information in the remaining columns
may differ. I would like to eliminate rows with duplicate data in
columns B through F, retaining the rows with the greatest number of
entries in columns G through AR.

An example of what I would like to get to is shown below.

BEFORE
B C D E F G H I J K L
Row 1 12 34 DD EE FF GG HH II JJ KK LL
Row 2 12 34 DD EE FF GG HH II JJ
Row 3 12 34 DD EE FF GG HH II

AFTER
B C D E F G H I J K L
Row 1 12 34 DD EE FF GG HH II JJ KK LL

I am not familiar with Visual Basic, so I would prefer to handle this
with a formula, if possible.

Can anyone help?

Thanks!

Sandi

How about a little work:

First, do this against a copy--just in case!

Insert two new columns A:C in that worksheet.

In A1, put this formula:
=row()

In B1:
=d1&CHAR(1)&e1&CHAR(1)&f1&CHAR(1)&g1&CHAR(10)&h1
(it concatenates the stuff in C:G (formerly A:F)

Then in C1, put this:
=COUNTA(J1:AU1)
(formerly G:AR)

And drag down 20000 rows.
select A:C
edit|copy
edit|paste special|Values
(things will work faster if you have values, not formulas)

Add a new row 1 (for headers).

Now select the whole range and sort by column A in ascending order and a
secondary key of column B (descending order).

The top one of each group will be the one that had the most entries.

Now select column B.
And do Data|Filter|Advanced filter
filter in place and Unique records only.

Now the rows you can see are the ones you want.
Select those visible rows
edit|copy
paste to a new worksheet.

Sort this data by column A to put it back in the original order.
delete columns A:C

foofoo wrote:

I have a spreadsheet with about 20,000 rows of data and I am trying to
eliminate unnecessary rows. Columns B&C contain numbers, and columns D
through AR contain text. Many rows contain identical information in
columns B, C, D, E, & F, but the information in the remaining columns
may differ. I would like to eliminate rows with duplicate data in
columns B through F, retaining the rows with the greatest number of
entries in columns G through AR.

An example of what I would like to get to is shown below.

BEFORE
B C D E F G H I J K L
Row 1 12 34 DD EE FF GG HH II JJ KK LL
Row 2 12 34 DD EE FF GG HH II JJ
Row 3 12 34 DD EE FF GG HH II

AFTER
B C D E F G H I J K L
Row 1 12 34 DD EE FF GG HH II JJ KK LL

I am not familiar with Visual Basic, so I would prefer to handle this
with a formula, if possible.

Can anyone help?

Thanks!

Sandi

--

Dave Peterson