How can I split the string elements into disjoint groups in java?

The lines are as follows


How to find a set of unique strings and break it into non-intersecting groups by the following criterion: if two lines have coincidences of non-empty values in one or more columns, they belong to the same group. For example, lines


belong to one group. Initially I thought to make through a three of HashSet (for each column) to quickly see if the string is included in the list of unique values, then adding either to the list of already grouped rows or to the list of unique rows. But the algorithm in this case has a performance bottleneck: if you want to merge groups, you must go through each group in the list. Algorithm on a large amount of data (> 1 million records), with a large number of mergers, works slowly. If the mergers are small (about thousands), it works quickly. I caught the stuck in this place and do not know how to optimize this bottleneck or whether it is necessary to use other data structures and algorithms. Can someone tell me which direction to dig. I will be grateful for any thoughts on this matter.


I'd suggest the following approach:

  • Create a Set<String> ungroupedLines, initially containing all the lines. You'll remove the lines as you assign them to groups.
  • Build three Map<String, Collection<String>>-s, as you've suggested, one per column.
  • Initialize an empty Collection<Collection<String>> result.
  • While ungroupedLines is not empty:
    • Create a new Collection<String> group.
    • Remove an element, add it to group.
    • Perform "depth-first search" from that element, using your three maps.
      • Ignore (skip) any elements that have already been removed from your ungroupedLines.
      • For the rest, remove them from ungroupedLines and add them to group before recursing on them.
      • Alternatively, you can use breadth-first search.
    • Add group to result.


 ? Mirror half of a N x N adjacency matrix in Javascript
 ? How to explore the structure and dimensions of an object in Octave?
 ? Why does, in python, Stack's pop method and Queue's dequeue method behave differently when both have same code?
 ? Performant way to convert adjacency list to links for undirected graph
 ? Objects Unable to be read twice with ObjectInputStream
 ? Linked List: How to Sort Doubly Linked List?
 ? Why is the output of this program 312213444 and not 3122134?
 ? Equals method of a linked list not working
 ? How do I build a combination tree in Python?
 ? sort graph edges in increasing order