"Byte" is a colloquial term that became popularised and ultimately a ratified standard. It's just a play on "bit", and then from byte we get "nybble" (four bits, half a byte).
In early computer development there was no standardisation on eight bits as a convenient 'packet' of data. The internal organisation of computers varied in both data width and address width, and whatever the data width was, it was called a "word" - so it was necessary to be clear what a word meant for any particular computer architecture. I worked on computers with 12-bit and 24-bit words.
A multiple of four bits was settled on quite early, because that was efficient for representing and manipulating BCD (Binary Coded Decimal, where the decimal digits 0-9 are each represented in four binary bits). Some computers worked directly in BCD because it made input and output of human-readable numbers efficient, whereas modern computers can perform the conversions so fast that it is better for them to do their internal operations in pure binary.
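To make BCD concrete, here is a small illustrative C sketch (my own, not drawn from any particular machine) that packs the decimal number 1234 into BCD, one digit per nybble. Notice that the hexadecimal printout reads the same as the decimal number - which is exactly why BCD made human-readable input and output cheap.

    /* Illustrative sketch only: pack decimal 1234 as BCD, one digit per nybble. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        unsigned value = 1234;
        uint16_t bcd = 0;

        /* Peel off decimal digits and place each one in its own nybble. */
        for (int shift = 0; value != 0; shift += 4) {
            bcd |= (uint16_t)(value % 10) << shift;
            value /= 10;
        }

        printf("BCD of 1234 = 0x%04X\n", (unsigned)bcd);  /* prints 0x1234 */
        return 0;
    }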
Power-of-two multiples of 8 bits (8, 16, 32 etc.) became standard when chip manufacturers decided that an 8-bit data bus in and out of memory chips was the best compromise. Why powers of two (dropping the 24-bit word)? Because logical operations frequently need to specify a single data bit within the data word, so a field within the instruction code (again a binary word - people rarely see the nuts and bolts of computer operation these days) has to pinpoint the relevant bit, and that is most efficient when the number of bits to choose from is a power of 2.
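The arithmetic is easy to see with a sketch. Assume a hypothetical "test bit" instruction whose operand field selects one bit of an 8-bit word: a 3-bit field covers all eight positions exactly (2^3 = 8), whereas a 24-bit word would need a 5-bit field with eight of its 32 possible values wasted. In C terms:

    /* Illustrative sketch: 'n' plays the role of the 3-bit field in a
       hypothetical "test bit" instruction - values 0..7 select any one of
       the eight bits of the word with no unused encodings.               */
    #include <stdio.h>
    #include <stdint.h>

    static int test_bit(uint8_t word, unsigned n)
    {
        return (word >> n) & 1u;
    }

    int main(void)
    {
        uint8_t w = 0x50;                  /* binary 0101 0000 */
        for (unsigned n = 0; n < 8; n++)   /* every value of the 3-bit field */
            printf("bit %u of 0x%02X is %d\n", n, (unsigned)w, test_bit(w, n));
        return 0;
    }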
Only then did the word "byte" (and its definition as a group of eight binary bits) become commonplace, and gradually "word" came to mean two bytes (16 bits) and "long word" four bytes (32 bits) - for when the data quantities to be represented/manipulated exceed the range available in one byte (256 values).
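Expressed with C's fixed-width types, the hierarchy and its ranges look like this (the names "byte", "word" and "long word" here follow the usage above, not any one manufacturer's convention):

    /* Illustrative sketch: ranges of byte / word / long word as used above. */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        printf("byte      (8 bits):  0 .. %lu\n", (unsigned long)UINT8_MAX);   /* 255 */
        printf("word      (16 bits): 0 .. %lu\n", (unsigned long)UINT16_MAX);  /* 65,535 */
        printf("long word (32 bits): 0 .. %lu\n", (unsigned long)UINT32_MAX);  /* 4,294,967,295 */
        return 0;
    }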
In summary: "byte" is a relatively recently coined term, popularised when computing became available to the enthusiast/hobbyist, and originally would have been indeterminate in what data width it meant. Before that, "octet" was in circulation within the computer engineering community (although principally applied to serial data transmission rather than parallel data widths) - and it was clear what it meant by its prefix "oct".
The complication does not end there. Any form of data transmission is error-prone - ie there is a chance of corruption over the transmission link, and what the receiver gets is not necessarily what the transmitter sent. That is the case even over a short piece of wire, although over a short piece of wire (unless you are pushing the upper limits of data rate) the chance is so slight that one does not normally have to worry about it.

To overcome the error characteristics of any particular type of transmission link, data is encoded in a way that improves the likelihood of perfect reception. One such scheme encodes five bits of actual data into every eight bits on the medium (a form of group coding used on some disk drives). Thus one octet does not necessarily a byte make - in this example it would take eight octets to carry five bytes.
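The arithmetic of that last point, sketched in C and assuming a generic group code carrying five data bits in every eight-bit code group (the exact scheme varies from drive to drive):

    /* Illustrative sketch: how many 8-bit octets must be stored/transmitted
       to carry a given number of data bytes, assuming a generic code that
       packs 5 payload bits into each 8-bit code group.                     */
    #include <stdio.h>

    int main(void)
    {
        const unsigned data_bits_per_octet = 5;   /* 5 payload bits per 8-bit code group */

        for (unsigned data_bytes = 1; data_bytes <= 5; data_bytes++) {
            unsigned data_bits = data_bytes * 8;
            /* Round up: a partly filled code group still occupies a whole octet. */
            unsigned octets = (data_bits + data_bits_per_octet - 1) / data_bits_per_octet;
            printf("%u data byte(s) -> %u octet(s) on the medium\n", data_bytes, octets);
        }
        return 0;
    }

The last line of output reproduces the figure above: five bytes of real data occupy eight octets on the medium.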